Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) who will take ownership of the reliability, performance, and scalability of our production systems. You will design, automate, and operate mission-critical environments that include Kubernetes clusters, database disaster recovery, workflow orchestration, and multi-region networking.
This role suits engineers who think deeply about systems and combine infrastructure, automation, and diagnostic reasoning to drive operational excellence.
Primary Responsibilities
Reliability, Availability & Infrastructure
- Maintain and evolve multi-region cloud infrastructure using Terraform-based Infrastructure as Code (IaC).
- Operate and optimize Kubernetes (OKE) clusters running microservices, data pipelines, and workflow orchestration.
- Manage SQL Server backup/restore pipelines, DR testing, and performance optimization.
- Ensure high availability for .NET and Python applications hosted behind load balancers and a web application firewall (WAF).
- Design and maintain cross-network connectivity (DRGs, LPGs, VCNs, subnets, and NSGs).
Observability & Automation
- Build and maintain a centralized orchestration platform integrated with alerting and notification systems.
- Develop self-healing, monitoring, and auto-remediation scripts for infrastructure and databases.
- Implement logging, metrics, and tracing pipelines.
- Automate recurring operational tasks using Python, Bash, and PowerShell to reduce manual effort and improve reliability.
DevOps, CI/CD & Security
- Manage GitHub Actions and Octopus Deploy pipelines for backend and data services.
- Apply strong security principles: least privilege, network segmentation, secure credential management, and encrypted communications.
- Promote GitOps and Infrastructure-as-Code practices to ensure repeatable and traceable deployments.
- Collaborate with developers to embed reliability and resilience into every release.
Collaboration & Incident Management
- Lead incident response, run blameless post-mortems, and turn findings into lasting improvements.
- Partner closely with engineering teams to drive design and code-level reliability improvements.
- Conduct capacity planning, cost optimization, and system tuning for performance and scalability.
- Mentor engineers in automation, observability, and root-cause analysis best practices.
Troubleshooting Mindset & Diagnostic Thinking
We value engineers who:
- Approach issues systematically and validate assumptions with data.
- Treat incidents as opportunities to improve design and automation.
- Rely on metrics, logs, and tracing rather than guesswork.
- Communicate findings clearly and document learnings for future reference.
- Continuously refine how problems are detected, escalated, and resolved.