Machine Learning Systems & Infrastructure Engineer

Job

  • Level
    Experienced
  • Job field
    Software, Data
  • Employment
    Full-time
  • Contract type
    Permanent employment
  • Location
    Munich
  • Work model
    On-site
  • Job summary

    In this role, you will build robust ML systems and infrastructure that turn real-world data into trained 3D world models and automated production endpoints, while working closely with the research team.

    Your role in the team

    • SpAItial is pioneering the next generation of World Models, pushing the boundaries of generative AI, computer vision, and simulation.
    • We are moving beyond 2D pixels to build models that natively understand the physics and geometry of our world.
    • Our mission is to redefine how industries, from robotics and AR/VR to gaming and cinema, generate and interact with physically-grounded 3D environments.
    • We're looking for bold, innovative individuals driven by a passion for tackling hard problems in generative 3D AI.
    • You should thrive in an environment where creativity meets technical challenge, take pride in craft, and collaborate closely with a small team building frontier systems.
    • We are seeking a Machine Learning Systems & Infrastructure Engineer to build and own the systems that turn raw real-world data into trained world models and reliable production endpoints.
    • You will design, implement, and operate scalable training stacks, data ingestion pipelines, experiment orchestration, and model serving for large diffusion-based generative models.
    • The role is hands-on and code-heavy - you will work inside the same monorepo as the research team, mostly in Python, and should be as comfortable refactoring a trainer class or a dataset loader as you are writing Terraform.
    • Own and evolve the ML systems that enable training, evaluation, and serving of large foundation models - trainer, dataset loaders, checkpointing, and experiment orchestration code.
    • Distributed training enablement: Improve high-throughput training stacks (e.g., PyTorch DDP/FSDP, NCCL) for performance, stability, and reproducibility, including preemption-safe and sharded checkpointing.
    • Data systems and pipelines: Build end-to-end Python pipelines that turn third-party capture sources into clean, versioned training datasets - including scraping (e.g., Playwright) and preprocessing - and optimize the underlying storage at petabyte scale (object storage, FUSE mounts, caching layers, shared filesystems, and relational / analytical / embedded metadata stores).
    • ML workflow orchestration and serving: Operate the systems researchers use to launch experiments, data jobs, and production endpoints - workflow engines (e.g., Kubeflow Pipelines, Airflow), GPU schedulers (e.g., Volcano, Slurm), experiment trackers (e.g., MLflow, Weights & Biases), and managed-inference platforms (e.g., Modal, Triton) - and maintain a launcher SDK for one-command runs.
    • Containerization and packaging: Ship workloads with Docker and Kubernetes; maintain IaC (Terraform) for the surfaces you own and CI/CD pipelines, including self-hosted GPU runners.
    • Observability and reliability: Monitoring, logging, and alerting for job performance, data-pipeline health, and cost (e.g., Prometheus/Grafana, OpenTelemetry); define SLOs and incident response for the systems you own.
    • Security and access: Manage secrets, IAM, and network boundaries (e.g., Tailscale, cloud VPC) for the systems you own.
    • Collaboration: Partner with ML researchers, engineers, and the platform team to unblock training and data work and improve developer experience.

    What we expect from you

    Qualifications

    • Hands-on with modern ML training stacks (PyTorch; DDP/FSDP or comparable); have personally debugged distributed jobs across many GPUs and nodes.
    • Have shipped non-trivial end-to-end data pipelines at scale - ingestion, transformation, validation, versioning, republish - ideally including real-world sources with rate limits, auth, or undocumented APIs.
    • Hands-on GPU compute and performance debugging (CUDA/NCCL, GPU utilization, networking bottlenecks, profiling).
    • Working knowledge of cloud environments (AWS, GCP, or Azure), including object storage, IAM, and cost awareness.
    • Proficient with containers (Docker, Kubernetes) and comfortable reading and writing IaC (Terraform) for the surfaces you ship.
    • Strong working knowledge of how to store and query large datasets at scale: SQL fundamentals; relational (e.g., Postgres), analytical (e.g., BigQuery, Snowflake), and embedded (e.g., SQLite) stores; and object storage with caching layers.
    • Familiarity with ML workflow orchestration and experiment tracking (e.g., Kubeflow Pipelines, MLflow).

    Experience

    • 3+ years writing production-quality Python in a large, multi-author codebase, with strong SWE fundamentals (ML systems experience strongly preferred).
    • Experience with monitoring and observability tooling (e.g., Prometheus/Grafana, OpenTelemetry) and CI/CD for infra and ML workflows (e.g., GitHub Actions).

    What we offer

    • SpAItial is committed to creating a diverse and inclusive workplace.
    • We welcome applications from people of all backgrounds, experiences, and perspectives.
    • We are an equal opportunity employer and ensure all candidates are treated fairly throughout the recruitment process.

    Job locations

    • Location: Munich

      Bavaria

      Germany

    About your employer

    SpAItial

    SpAItial is an innovative AI startup from Munich focused on developing Spatial Foundation Models. These models generate physically accurate 3D environments from a variety of inputs and are used in areas such as gaming, robotics, and AR/VR. Founded by Prof. Matthias Niessner and his team, the company brings together research from Munich and London.

  • Company type
    Startup
  • Work model
    On-site
  • Industry
    Internet, IT, Telecom
  • Diversity
    Suitable for all persons (m/f/d)
  • Language
    Only English required