Senior Site Reliability Engineer

ovo

The Role

Overview

Ensuring reliable, scalable, cost‑efficient GCP services through monitoring, automation & incident management.

Key Responsibilities

  • design review
  • cost optimisation
  • capacity planning
  • incident management
  • SLO tracking
  • monitoring automation

Tasks

- Design Review Input: Take part in design reviews, offering guidance on how to improve reliability, scalability and day-to-day operability within system architecture.
- Community of Practice: Actively contribute to your Community of Practice - leading discussions, sharing experiences, mentoring others and helping shape content and capability growth within your area of expertise.
- Collaboration and Knowledge Sharing: Work closely with engineering, QA, security and product teams to embed reliability practices, document key processes and mentor peers to support collective learning and growth.
- Cost Optimisation Implementation: Assess infrastructure usage and apply approved strategies to optimise cloud costs - balancing resource efficiency with performance and reliability.
- Capacity Planning, Performance Tuning & Resilience: Using monitoring and load testing data, you'll support capacity planning, recommend performance improvements and help implement resilience best practices across systems.
- Incident Management and Post-Mortem Analysis: Play a key role in resolving complex production incidents, leading or supporting technical response efforts. Following incidents, you'll conduct blameless post-mortems to uncover root causes and drive lasting improvements.
- Managing SLOs/SLIs and Improving Incident Response: Define and track SLOs and SLIs for key services, contributing to better reliability insights. You'll also help refine incident response processes, support on-call operations, and improve tooling and communication during incidents.
- Developing, Refining, and Automating Monitoring Systems: Design, manage and enhance monitoring, alerting and observability systems - such as Datadog, Prometheus and Grafana - ensuring they deliver meaningful insights and effective alerting. You'll also automate repetitive monitoring tasks to improve efficiency.

Requirements

  • SRE
  • Datadog
  • incident management
  • Linux
  • Python
  • automation

What You Bring

- Have proven expertise in site reliability practices, including defining and managing SLOs and SLIs to support service performance and availability.
- Possess a strong background in advanced monitoring, alert automation, and the use of tools such as Datadog to drive observability and operational insights.
- Are skilled in incident management and post-mortem analysis, with a focus on continuous improvement and proactive issue resolution.
- Are a clear and effective communicator, capable of translating complex technical issues to a range of audiences.
- Have a strong analytical mindset with a commitment to continuous learning, ownership, and resilience under pressure.
- Have hands-on experience with automation for capacity planning, performance tuning, and cost optimisation in cloud infrastructure.
- Are proficient in Linux system administration and scripting languages such as Bash and Python.

The Company

Visa Sponsorship

- Sponsorship is not available for this role.