About SiFi: SiFi is a rapidly growing B2B FinTech company transforming expense management for businesses in Saudi Arabia. As an Electronic Money Institution (EMI) licensed by the Saudi Central Bank, we empower companies with innovative tools to simplify finance management.

Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) who will take ownership of the reliability, performance, and scalability of our production systems. You will design, automate, and operate mission-critical environments that include Kubernetes clusters, database disaster recovery, workflow orchestration, and multi-region networking.
This role suits engineers who think deeply about systems and combine infrastructure, automation, and diagnostic reasoning to drive operational excellence.

Primary Responsibilities

Reliability, Availability & Infrastructure

Maintain and evolve multi-region cloud infrastructure using Terraform-based Infrastructure as Code (IaC).
Operate and optimize Kubernetes (OKE) clusters running microservices, data pipelines, and workflow orchestration.
Manage SQL Server backup/restore pipelines, DR testing, and performance optimization.
Ensure high availability for .NET and Python applications hosted behind load balancers and WAF.
Design and maintain cross-network connectivity (DRGs, LPGs, VCNs, subnets, and NSGs).

Observability & Automation

Build and maintain a centralized orchestration platform integrated with alerting and notification systems.
Develop self-healing, monitoring, and auto-remediation scripts for infrastructure and databases.
Implement logging, metrics, and tracing pipelines.
Automate recurring operational tasks using Python, Bash, and PowerShell to reduce manual effort and improve reliability.

DevOps, CI/CD & Security

Manage GitHub Actions and Octopus Deploy pipelines for backend and data services.
Apply strong security principles: least privilege, network segmentation, secure credentials, and encrypted communications.
Promote GitOps and Infrastructure-as-Code practices to ensure repeatable and traceable deployments.
Collaborate with developers to embed reliability and resilience into every release.

Collaboration & Incident Management

Lead incident response, run blameless post-mortems, and turn findings into lasting improvements.
Partner closely with engineering teams to drive design and code-level reliability improvements.
Conduct capacity planning, cost optimization, and system tuning for performance and scalability.
Mentor engineers in automation, observability, and root-cause analysis best practices.

Troubleshooting Mindset & Diagnostic Thinking
We value engineers who:

Approach issues systematically and validate assumptions with data.
Treat incidents as opportunities to improve design and automation.
Rely on metrics, logs, and tracing rather than guesswork.
Communicate findings clearly and document learnings for future reference.
Continuously refine how problems are detected, escalated, and resolved.

Qualifications

5+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering.
Experience working in regulated environments (FinTech, banking, compliance-driven systems) is a MUST.
Experience managing Active Directory / Domain Controllers (authentication including SSO/MFA, policies, integrations, troubleshooting) is a MUST.
Strong hands-on experience with Oracle Cloud Infrastructure (OCI) is a MUST.
Strong Linux system administration skills (performance tuning, storage, networking, systems, troubleshooting).
Solid networking fundamentals (TCP/IP, DNS, routing, load balancing).
Hands-on experience with databases, specifically Microsoft SQL Server and PostgreSQL.
Experience administering and troubleshooting Kubernetes workloads (pods, jobs, networking, secrets).
Strong experience with Infrastructure as Code, preferably Terraform.
Experience building and maintaining CI/CD pipelines and automation workflows.
Comfortable supporting production systems, including incident response and troubleshooting under pressure.

Experience Level:

Mid-Senior Level

Sub Department:

Technology

Sr. Site Reliability Engineer (SRE) (Remote)
