Site Reliability Engineer

Salary: Undisclosed

Kuala Lumpur, Malaysia

Job Description

Why This Role Matters

Finexus operates mission-critical platforms that support Malaysia’s financial ecosystem. Our systems power high-volume transaction processing, cloud services, data centre operations, and regulatory-compliant environments.

As a Site Reliability Engineer (SRE), you ensure that these platforms remain highly available, resilient, secure, and scalable. Your work directly impacts system uptime, client experience, regulatory compliance, incident response effectiveness, and Finexus’ ability to deliver reliable financial technology at scale.

About the Role

The Site Reliability Engineer blends software engineering, system administration, automation, and operations to build and operate reliable, observable, and resilient production systems.

You will manage hybrid infrastructure across on-premise data centres and cloud platforms, implement automation and Infrastructure-as-Code, strengthen observability, and lead incident response practices. The role partners closely with Application, Cloud, Network, Security, Database, and Compliance teams to ensure operational excellence in regulated environments.

Mission / Expected Outcomes

  • Ensure high availability, reliability, and performance of mission-critical systems and platforms.
  • Reduce operational risk through automation, observability, and proactive resilience engineering.
  • Strengthen incident response and operational discipline, and shorten recovery time.
  • Support regulatory compliance and audit readiness across infrastructure and operations.
  • Drive continuous improvement in reliability, scalability, and operational efficiency.

Key Responsibilities

System Reliability & Operations

  • Ensure high availability and reliability of IT systems, applications, and Uptime Institute Tier III-certified data centres supporting internal and client-facing platforms.
  • Perform system administration for Linux and Windows servers including installation, configuration, patching, monitoring, performance tuning, and hardening.
  • Manage data storage, backup, and disaster recovery to ensure data integrity, resilience, and compliance with regulatory standards.
  • Perform capacity planning and lifecycle management of infrastructure resources to ensure scalability and optimal utilisation.
  • Define, implement, and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve system reliability.
  • Conduct chaos testing and fault-injection exercises to proactively identify weaknesses and improve resilience.
  • Design and optimise observability and alerting platforms to provide actionable insights and minimise alert fatigue.

Cloud, Containerization & Automation

  • Support and optimise hybrid cloud environments (AWS, Azure, GCP) in line with Finexus’ cloud strategy and its performance and cost-efficiency goals.
  • Deploy, configure, and maintain Kubernetes clusters and container platforms (e.g. SUSE Rancher Prime) for scalable and resilient workloads.
  • Build and maintain CI/CD pipelines to enable automated deployment, testing, and operational efficiency.
  • Automate configuration, patching, and system management using tools such as Ansible, Puppet, or equivalent.
  • Implement Infrastructure-as-Code using Terraform or equivalent to ensure consistent, auditable, and repeatable provisioning.
  • Develop auto-healing and self-recovery mechanisms to reduce manual intervention and improve mean time to recovery (MTTR).
  • Implement performance and cost optimisation for cloud and container workloads.

Security, Compliance & Risk Management

  • Implement and maintain system and network security controls including firewall management, VPN, identity and access management, and endpoint security.
  • Ensure compliance with regulatory and security frameworks including: 
    • PCI DSS and PCI 3DS
    • ISO/IEC 27001
    • BNM RMiT
    • DCRA and TVRA
    • SOC 2
  • Support internal and external audits by preparing evidence, system reports, and compliance documentation.
  • Manage system logs and integrate with SIEM platforms to strengthen monitoring, detection, and incident response.
  • Coordinate vulnerability management activities with Security Operations teams, ensuring timely patching and remediation.
  • Participate in risk assessments and security architecture reviews to align SRE practices with regulatory requirements.

Networking & Core Infrastructure Services

  • Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services.
  • Support virtualization platforms and physical server infrastructure within Finexus data centres.
  • Integrate network observability tools to provide real-time visibility into latency, bandwidth, and routing anomalies.
  • Collaborate on zero-trust segmentation, service mesh integration, and secure connectivity models.

Monitoring, Incident Response & Operations Excellence

  • Provide on-call production support on a rotational basis, ensuring rapid incident resolution and minimal downtime.
  • Lead and participate in major incident management, ensuring structured response and communication.
  • Conduct post-incident reviews (PIRs) and blameless retrospectives to identify root causes and preventive actions.
  • Develop and maintain runbooks, SOPs, and operational documentation to standardise response and improve knowledge transfer.
  • Leverage AIOps and event-correlation tools to improve proactive detection and reduce false positives.

Collaboration & Continuous Improvement

  • Work closely with Application, Database, Network, Cloud, and Security teams to deliver reliable, compliant, and high-performance platforms.
  • Promote SRE best practices including automation, observability, reliability engineering, and operational discipline.
  • Continuously evaluate tools, technologies, and practices to improve system reliability and operational maturity.

Job Requirements

Key Requirements

Academic Qualification

  • Bachelor’s or Master’s Degree in Computer Science, Information Technology, Engineering, or related disciplines.

Experience

  • Minimum 5 years of experience in Site Reliability Engineering, System Administration, or IT Infrastructure within enterprise or production environments.
  • Proven experience operating mission-critical and regulated systems.
  • Experience supporting audits, compliance, and financial services environments is highly advantageous.
  • Experience in database management, operations, and performance tuning is highly advantageous.

Technical Skills

Strong proficiency in:

  • Linux and Windows system administration
  • Monitoring, alerting, and observability platforms
  • Incident management and operational troubleshooting

Hands-on experience with:

  • Cloud platforms: AWS, Azure, GCP
  • Container orchestration: Kubernetes, Rancher
  • CI/CD pipelines and automation workflows
  • Infrastructure-as-Code: Terraform or equivalent
  • Configuration management: Ansible, Puppet, or equivalent

Strong knowledge of:

  • Networking fundamentals, firewalls, DNS, DHCP, VPN
  • Virtualization platforms and data centre infrastructure
  • Database operations including backup, tuning, and recovery

Experience with SRE toolchains such as:

  • Prometheus, Grafana, ELK
  • Jenkins, GitLab CI/CD
  • AIOps and event-correlation platforms

Compliance & Security Knowledge:

  • Good understanding of: 
    • PCI DSS and PCI 3DS
    • ISO/IEC 27001
    • BNM RMiT
    • DCRA and TVRA
    • SOC 2
  • Experience supporting audits, vulnerability management, and regulated operations is strongly preferred.

Competencies & Personal Attributes

  • Strong problem-solving and analytical mindset.
  • Ability to operate effectively in high-pressure, mission-critical environments.
  • Clear communication skills in English and Malay, written and spoken.
  • Strong ownership, accountability, and operational discipline.
  • Ability to work independently and collaboratively in cross-functional teams.
  • Commitment to continuous learning and reliability engineering best practices.

Certifications (Advantageous)

  • AWS Certified SysOps Administrator
  • Red Hat Certified Engineer (RHCE)
  • Certified Kubernetes Administrator (CKA)
  • ISO 27001 Implementer or equivalent

Right to Work Requirements

  • Candidates with an existing right to work in the country are preferred
    • Local citizens of this country
    • Permanent residents (PR) of this country
    • Candidates who already have a work permit for this country

Working Arrangement

  • On Site

Skills

Containerization
Automation
Compliance Risk
IT Infrastructure
Incident Response
Continuous Improvement Process

Company Benefits

Medical & Insurance Coverage

Medical and Insurance coverage for all employees

Personal Development

Fast learning curve, work hard and get rewarded

Easily Accessible

Next to Titiwangsa Sentral LRT, Monorail and Bus Station.

Employee Provident Fund

Higher Employer EPF Contribution

Gym

Indoor Gym

Bonus

Year-end Performance Bonus


Additional Info

Career Level

Senior Executive

Company Profile

Finexus Sdn Bhd

FINEXUS is a dynamic and fast-growing FinTech company specialising in payment services. Today, our services cover the MyXaaS Innovation Platform with building blocks to build FinTech services, Capital Lending Services, SSO Tech for Payment Processing and Business Processing, and RegTech. Our customers range from start-ups and medium-growth companies to e-commerce businesses, banks, and financial institutions.