Software Development

Sr. Site Reliability Engineer (SRE) (Remote)

Remote
Work Type: Full Time
About SiFi: SiFi is a rapidly growing B2B fintech company transforming expense management for businesses in Saudi Arabia. As an Electronic Money Institution (EMI) licensed by the Saudi Central Bank, we give companies innovative tools that simplify finance management.

Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) who will take ownership of the reliability, performance, and scalability of our production systems. You will design, automate, and operate mission-critical environments that include Kubernetes clusters, database disaster recovery, workflow orchestration, and multi-region networking.
This role suits engineers who think deeply about systems — combining infrastructure, automation, and diagnostic reasoning to drive operational excellence.


Primary Responsibilities

Reliability, Availability & Infrastructure
  • Maintain and evolve multi-region cloud infrastructure using Terraform-based Infrastructure as Code (IaC).
  • Operate and optimize Kubernetes (OKE) clusters running microservices, data pipelines, and workflow orchestration.
  • Manage SQL Server backup/restore pipelines, DR testing, and performance optimization.
  • Ensure high availability for .NET and Python applications hosted behind load balancers and WAF.
  • Design and maintain cross-network connectivity (DRGs, LPGs, VCNs, subnets, and NSGs).
Observability & Automation
  • Build and maintain a centralized orchestration platform integrated with alerting and notification systems.
  • Develop self-healing, monitoring, and auto-remediation scripts for infrastructure and databases.
  • Implement logging, metrics, and tracing pipelines.
  • Automate recurring operational tasks using Python, Bash, and PowerShell to reduce manual effort and improve reliability.
DevOps, CI/CD & Security
  • Manage GitHub Actions and Octopus Deploy pipelines for backend and data services.
  • Apply strong security principles — least privilege, network segmentation, secure credentials, and encrypted communications.
  • Promote GitOps and Infrastructure-as-Code practices to ensure repeatable and traceable deployments.
  • Collaborate with developers to embed reliability and resilience into every release.
Collaboration & Incident Management
  • Lead incident response, run blameless post-mortems, and turn findings into lasting improvements.
  • Partner closely with engineering teams to drive design and code-level reliability improvements.
  • Conduct capacity planning, cost optimization, and system tuning for performance and scalability.
  • Mentor engineers in automation, observability, and root-cause analysis best practices.
Troubleshooting Mindset & Diagnostic Thinking
We value engineers who:
  1. Approach issues systematically and validate assumptions with data.
  2. Treat incidents as opportunities to improve design and automation.
  3. Rely on metrics, logs, and tracing rather than guesswork.
  4. Communicate findings clearly and document learnings for future reference.
  5. Continuously refine how problems are detected, escalated, and resolved.
Qualifications
  • 5+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering.
  • Solid understanding of networking, load balancing, and DNS.
  • Proven ability to analyze incidents and automate resolution.
  • Experience integrating alerting and monitoring systems with communication tools (e.g., Microsoft Teams or Slack).
    Deep experience with:
  • Oracle Cloud Infrastructure (OCI): compute, networking, storage, monitoring
  • Kubernetes (OKE): deployments, ingress controllers, autoscaling
  • Microsoft SQL Server: backup/restore automation, DR planning, performance tuning
  • Terraform: multi-region and cross-tenant infrastructure automation
  • Python & PowerShell: automation and system scripting
Experience Level:
Mid-Senior Level
 
Sub Department:
Technology
 
