Job Description
How You’ll Make a Difference
Incident & Alert Handling
- Serve as the first point of contact in cloud support for incidents escalated by DC Operations / CSM Ops.
- Acknowledge, validate, classify, and log incidents based on severity and business impact.
- Execute predefined Standard Operating Procedures (SOPs) for common alerts and known issues.
- Escalate SEV1/SEV2 incidents to L2 Cloud Engineers and CSM Ops as per the escalation matrix.
- Implement minor configuration changes and patches. Analyze performance metrics to identify potential bottlenecks.
- Collaborate with L3/Application Support/CSM teams on post-incident reviews and problem management.
Advanced Incident Management
- Act as the primary technical responder for escalated incidents from L1.
- Perform deep-dive root cause analysis of performance issues affecting Huawei Cloud and AWS services, networks, and platforms.
- Escalate application-related database issues to the application team.
Initial Troubleshooting & Recovery
- Perform first-level diagnostics using runbooks and system dashboards/logs.
- Collect required logs/evidence to support faster L2 investigation.
- Perform basic recovery actions (e.g., restart services, basic network checks, VPN status validation) as permitted under SOP.
- Troubleshoot and resolve issues with critical components: VISA VCX on AWS, HSM as a Service, and the core card processing platform on Huawei Cloud.
On-Call Duty (as required)
- Participate in a 24x7 on-call rotation for P1/P2 critical incidents. Lead the technical bridge during major incidents until resolution.
Operational Support, Governance, & Administration
- Fulfil routine BAU operational tasks, including:
  - User access fulfilment based on the approved RBAC matrix
  - Environment start/stop commands
  - Log retrieval and housekeeping tasks
- Provide initial support for foundational cloud services (e.g., ECS/EC2, RDS for MSSQL & PostgreSQL, Storage Services), and escalate application-related database issues to the Application Support team.
- Create and refine complex troubleshooting runbooks for L1 use. Document solutions and contribute to the team's knowledge base.
- Maintain accurate incident and task records in the ITSM platform (e.g., CSMS).
- Update runbooks/SOPs when gaps, improvements, or new scenarios are identified.