Job Description
Site Reliability Engineer
Why This Role Matters
Finexus operates mission-critical platforms that support Malaysia’s financial ecosystem. Our systems power high-volume transaction processing, cloud services, data centre operations, and regulatory-compliant environments.
As a Site Reliability Engineer (SRE), you ensure that these platforms remain highly available, resilient, secure, and scalable. Your work directly impacts system uptime, client experience, regulatory compliance, incident response effectiveness, and Finexus’ ability to deliver reliable financial technology at scale.
About the Role
The Site Reliability Engineer blends software engineering, system administration, automation, and operations to build and operate reliable, observable, and resilient production systems.
You will manage hybrid infrastructure across on-premises data centres and cloud platforms, implement automation and Infrastructure-as-Code, strengthen observability, and lead incident response practices. The role partners closely with Application, Cloud, Network, Security, Database, and Compliance teams to ensure operational excellence in regulated environments.
Mission / Expected Outcomes
- Ensure high availability, reliability, and performance of mission-critical systems and platforms.
- Reduce operational risk through automation, observability, and proactive resilience engineering.
- Strengthen incident response, shorten recovery times, and reinforce operational discipline.
- Support regulatory compliance and audit readiness across infrastructure and operations.
- Drive continuous improvement in reliability, scalability, and operational efficiency.
Key Responsibilities
System Reliability & Operations
- Ensure high availability and reliability of IT systems, applications, and Uptime Institute Tier III certified data centres supporting internal and client-facing platforms.
- Perform system administration for Linux and Windows servers including installation, configuration, patching, monitoring, performance tuning, and hardening.
- Manage data storage, backup, and disaster recovery to ensure data integrity, resilience, and compliance with regulatory standards.
- Perform capacity planning and lifecycle management of infrastructure resources to ensure scalability and optimal utilisation.
- Define, implement, and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve system reliability.
- Conduct chaos testing and fault-injection exercises to proactively identify weaknesses and improve resilience.
- Design and optimise observability and alerting platforms to provide actionable insights and minimise alert fatigue.
Cloud, Containerisation & Automation
- Support and optimise hybrid cloud environments (AWS, Azure, GCP) in line with Finexus’ cloud strategy and its performance and cost-efficiency goals.
- Deploy, configure, and maintain Kubernetes clusters and container platforms (e.g. SUSE Rancher Prime) for scalable and resilient workloads.
- Build and maintain CI/CD pipelines to enable automated deployment, testing, and operational efficiency.
- Automate configuration, patching, and system management using tools such as Ansible, Puppet, or equivalent.
- Implement Infrastructure-as-Code using Terraform or equivalent to ensure consistent, auditable, and repeatable provisioning.
- Develop auto-healing and self-recovery mechanisms to reduce manual intervention and shorten mean time to recovery (MTTR).
- Implement performance and cost optimisation for cloud and container workloads.
Security, Compliance & Risk Management
- Implement and maintain system and network security controls including firewall management, VPN, identity and access management, and endpoint security.
- Ensure compliance with regulatory and security frameworks, including:
  - PCI DSS and PCI 3DS
  - ISO/IEC 27001
  - BNM RMiT
  - DCRA and TVRA
  - SOC 2
- Support internal and external audits by preparing evidence, system reports, and compliance documentation.
- Manage system logs and integrate with SIEM platforms to strengthen monitoring, detection, and incident response.
- Coordinate vulnerability management activities with Security Operations teams, ensuring timely patching and remediation.
- Participate in risk assessments and security architecture reviews to align SRE practices with regulatory requirements.
Networking & Core Infrastructure Services
- Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services.
- Support virtualisation platforms and physical server infrastructure within Finexus data centres.
- Integrate network observability tools to provide real-time visibility into latency, bandwidth, and routing anomalies.
- Collaborate on zero-trust segmentation, service mesh integration, and secure connectivity models.
Monitoring, Incident Response & Operations Excellence
- Provide on-call production support on a rotational basis, ensuring rapid incident resolution and minimal downtime.
- Lead and participate in major incident management, ensuring structured response and communication.
- Conduct post-incident reviews (PIRs) and blameless retrospectives to identify root causes and preventive actions.
- Develop and maintain runbooks, SOPs, and operational documentation to standardise response and improve knowledge transfer.
- Leverage AIOps and event-correlation tools to improve proactive detection and reduce false positives.
Collaboration & Continuous Improvement
- Work closely with Application, Database, Network, Cloud, and Security teams to deliver reliable, compliant, and high-performance platforms.
- Promote SRE best practices including automation, observability, reliability engineering, and operational discipline.
- Continuously evaluate tools, technologies, and practices to improve system reliability and operational maturity.