Job
- Level: Experienced
- Job field: Software, Data
- Employment: Full-time
- Contract type: Permanent employment
- Location: Munich
- Work model: On-site
Job Summary
In this position, you develop the infrastructure for the distributed training, deployment, and experimentation environment, using technologies such as Kubernetes and PyTorch to move ML models into production efficiently.
Your Role in the Team
- The AI Research Division of Agile Robots is looking for an ML Platform Engineer to build and operate the distributed training, deployment, and experimentation infrastructure that research, data, and robotics teams depend on to move models from prototype to production.
- Design and scale distributed training workflows for large models using tools such as PyTorch Distributed, DeepSpeed, and cluster schedulers like SLURM or Kubernetes.
- Build and maintain containerised ML environments that support reproducible experimentation and benchmarking.
- Develop and maintain CI/CD pipelines for machine learning systems to enable reliable testing, training, and deployment of models.
- Implement experiment tracking, model versioning, and reproducibility workflows using tools such as ClearML or Weights & Biases.
- Set up monitoring systems such as Prometheus and Grafana to track model performance and system health and detect drift in production.
- Work with research, data, and robotics teams to connect new models to robust production systems.
What We Expect from You
Education
- Degree in Computer Science, Software Engineering, or a related field, with professional experience building and operating ML or software infrastructure in production.
Qualifications
- Familiarity with infrastructure-as-code tools such as Terraform.
- Exposure to high-performance or distributed compute environments.
Experience
- Experience designing and operating distributed training systems on Kubernetes and Docker, using PyTorch Distributed, DeepSpeed, and schedulers such as SLURM.
- Experience building CI/CD pipelines that support reliable model testing, training, and deployment.
- Experience operating ML workloads on cloud infrastructure, preferably AWS.
- Hands-on experience with experiment tracking and model versioning using tools such as MLflow or Weights & Biases.
- Experience with monitoring and drift detection using tools such as Prometheus and Grafana.
- Python and system design skills, with experience building and operating ML systems beyond the prototype stage.
- Experience with large-scale or multimodal ML systems such as vision-language-action models.
- Experience with ML pipeline and orchestration tools.
What We Offer
- A dynamic high-tech company with sound finances and world-class investors.
- An interdisciplinary, international team of 60+ nationalities in a collaborative work environment.
- Many development opportunities as the company continues to grow.
- Challenging tasks and impactful projects alongside experts, enabling professional and personal growth.
- A corporate benefits program covering health, mobility, and learning, with 100 € net per month.
- Modern office facilities with a rooftop terrace overlooking Munich, free drinks and fruit, and regular company events.
Benefits
- Health, Fitness & Fun
- Work-Life Integration
About Your Employer
Agile Robots Ag
Agile Robots SE, founded by leading robotics researchers, focuses on developing AI-driven robots and has established itself as a pioneer in automation.
Description
- Company type: Established company
- Work model: On-site
- Industry: Electronics, Automation