Job Description
Role Overview
We are looking for a highly skilled Senior Platform & Infrastructure Engineer to design, build, and maintain our foundational infrastructure platform for high-performance AI/ML workloads and LLMOps. In this role, you will be the driving force behind our bare-metal-to-cloud ecosystem, owning everything from physical server provisioning to advanced container orchestration and GPU management.
If you have a deep understanding of the Canonical stack (MAAS, Juju) and software-defined storage (Ceph), and you are an expert at managing Kubernetes for scalable AI infrastructure, we want you on our team. You will work collaboratively to deliver scalable, reliable storage and compute infrastructure while directly enabling cutting-edge AI/ML workload development.
Key Responsibilities
- AI/ML & LLMOps Infrastructure: Provision, optimize, and manage GPU-accelerated Kubernetes nodes. Support the deployment and scaling of AI/ML pipelines, LLM serving infrastructure (e.g., vLLM, Ollama, Ray), and model orchestration tools.
- Kubernetes Orchestration: Deploy, manage, and scale highly available Kubernetes (K8s) clusters. Ensure reliability and security, specifically tuning for heavy compute and storage-intensive AI workloads using CSI drivers and StatefulSets.
- Storage Architecture: Design and implement software-defined storage solutions using Ceph. Optimize IOPS, latency, and throughput for diverse AI requirements, including block, object, and file storage modes.
- Bare-Metal & IaC Automation: Utilize Canonical MAAS to automate the lifecycle of physical servers. Architect and maintain infrastructure using Juju and Terraform to deploy and integrate complex application components at scale.
- Application Deployment: Design, author, and maintain custom Helm charts to streamline application packaging and deployment across production environments.
- Performance Optimization: Monitor and troubleshoot infrastructure issues using Prometheus and Grafana; implement improvements based on deep-dive performance analysis of GPU and storage utilization.
Minimum Qualifications
- Experience: 3+ years of experience in infrastructure engineering, with a focus on data center technologies, storage, and container orchestration.
- Expert-Level Kubernetes: Proven experience managing production K8s clusters, including networking, hardware accelerators (NVIDIA device plugins), and storage integration (Persistent Volumes).
- Software-Defined Storage: Solid practical experience deploying and managing Ceph storage clusters optimized for large-scale AI datasets and parallel access patterns.
- Canonical Ecosystem & IaC: Hands-on experience with MAAS for bare-metal provisioning and Juju (Charms) or Terraform for Infrastructure as Code.
- Linux Systems Mastery: Strong foundation in Linux system administration (Ubuntu preferred), hardware management, and storage protocols (iSCSI, NFS, RBD, S3).
- Helm Proficiency: Strong ability to construct, manage, and troubleshoot complex Helm charts from scratch.
Nice-to-Haves
- AI/ML Lifecycle Tools: Experience with specific frameworks such as Kubeflow, MLflow, or Ray.
- Advanced Automation: Proficiency in Go for Kubernetes plugin development, or in Python scripting for automation frameworks such as Ansible.
- Certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or professional Linux/storage certifications.
- Observability: Experience building custom Grafana dashboards tailored for tracking GPU saturation and AI model inference performance.
- Networking: Understanding of software-defined networking (SDN) and container networking interfaces (CNI) in multi-cluster environments.