Job Description
Role Overview
We are looking for a highly skilled Senior Platform & Infrastructure Engineer to design, build, and maintain our foundational infrastructure platform for high-performance AI/ML workloads and LLMOps. In this role, you will be the driving force behind our bare-metal-to-cloud ecosystem, owning everything from physical server provisioning to advanced container orchestration and GPU management.
If you have a deep understanding of the Canonical stack (MAAS, Juju) and software-defined storage (Ceph), and you are an expert at managing Kubernetes for scalable AI infrastructure, we want you on our team. You will work collaboratively to deliver scalable, reliable storage and compute infrastructure while directly enabling cutting-edge AI/ML workload development.
Key Responsibilities
- AI/ML & LLMOps Infrastructure: Provision, optimize, and manage GPU-accelerated Kubernetes nodes. Support the deployment and scaling of AI/ML pipelines, LLM serving infrastructure (e.g., vLLM, Ollama, Ray), and model orchestration tools.
- Kubernetes Orchestration: Deploy, manage, and scale highly available Kubernetes (K8s) clusters. Ensure reliability and security, specifically tuning for heavy compute and storage-intensive AI workloads using CSI drivers and StatefulSets.
- Storage Architecture: Design and implement software-defined storage solutions using Ceph. Optimize IOPS, latency, and throughput for diverse AI requirements, including block, object, and file storage modes.
- Bare-Metal & IaC Automation: Utilize Canonical MAAS to automate the lifecycle of physical servers. Architect and maintain infrastructure using Juju and Terraform to deploy and integrate complex application components at scale.
- Application Deployment: Design, author, and maintain custom Helm charts to streamline application packaging and deployment across production environments.
- Performance Optimization: Monitor and troubleshoot infrastructure issues using Prometheus and Grafana; implement improvements based on deep-dive performance analysis of GPU and storage utilization.
Minimum Qualifications
- Experience: 3+ years of experience in infrastructure engineering, with a focus on data center technologies, storage, and container orchestration.
- Expert-Level Kubernetes: Proven experience managing production K8s clusters, including networking, hardware accelerators (NVIDIA device plugins), and storage integration (Persistent Volumes).
- Software-Defined Storage: Solid practical experience deploying and managing Ceph storage clusters optimized for large-scale AI datasets and parallel access patterns.
- Canonical Ecosystem & IaC: Hands-on experience with MAAS for bare-metal provisioning and Juju (Charms) or Terraform for Infrastructure as Code.
- Linux Systems Mastery: Strong foundation in Linux system administration (Ubuntu preferred), hardware management, and storage protocols (iSCSI, NFS, RBD, S3).
- Helm Proficiency: Strong ability to construct, manage, and troubleshoot complex Helm charts from scratch.
Nice-to-Haves
- AI/ML Lifecycle Tools: Experience with specific frameworks such as Kubeflow, MLflow, or Ray.
- Advanced Automation: Proficiency in Go for Kubernetes plugin development, or in Python scripting for automation frameworks such as Ansible.
- Certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or professional Linux/storage certifications.
- Observability: Experience building custom Grafana dashboards tailored for tracking GPU saturation and AI model inference performance.
- Networking: Understanding of software-defined networking (SDN) and container networking interfaces (CNI) in multi-cluster environments.