Senior Site Reliability Engineer
ovo
The Role
Overview
Ensuring reliable, scalable, cost‑efficient GCP services through monitoring, automation & incident management.
Key Responsibilities
- design review
- cost optimisation
- capacity planning
- incident management
- SLO tracking
- monitoring automation
Tasks
- Design Review Input: Take part in design reviews, offering guidance on how to improve reliability, scalability and day-to-day operability within system architecture.
- Community of Practice: Actively contribute to your Community of Practice by leading discussions, sharing experiences, mentoring others and helping shape content and capability growth within your area of expertise.
- Collaboration and Knowledge Sharing: Work closely with engineering, QA, security and product teams to embed reliability practices, document key processes and mentor peers to support collective learning and growth.
- Cost Optimisation Implementation: Assess infrastructure usage and apply approved strategies to optimise cloud costs, balancing resource efficiency with performance and reliability.
- Capacity Planning, Performance Tuning & Resilience: Using monitoring and load testing data, you'll support capacity planning, recommend performance improvements and help implement resilience best practices across systems.
- Incident Management and Post-Mortem Analysis: Play a key role in resolving complex production incidents, leading or supporting technical response efforts. Following incidents, you'll conduct blameless post-mortems to uncover root causes and drive lasting improvements.
- Managing SLOs/SLIs and Improving Incident Response: Define and track SLOs and SLIs for key services, contributing to better reliability insights. You'll also help refine incident response processes, support on-call operations, and improve tooling and communication during incidents.
- Developing, Refining, and Automating Monitoring Systems: Design, manage and enhance monitoring, alerting and observability systems, such as Datadog, Prometheus and Grafana, ensuring they deliver meaningful insights and effective alerting. You'll also automate repetitive monitoring tasks to improve efficiency.
Requirements
- SRE
- Datadog
- incident management
- Linux
- Python
- automation
What You Bring
- Have proven expertise in site reliability practices, including defining and managing SLOs and SLIs to support service performance and availability.
- Possess a strong background in advanced monitoring, alert automation, and the use of tools such as Datadog to drive observability and operational insights.
- Are skilled in incident management and post-mortem analysis, with a focus on continuous improvement and proactive issue resolution.
- Are a clear and effective communicator, capable of translating complex technical issues to a range of audiences.
- Have a strong analytical mindset with a commitment to continuous learning, ownership, and resilience under pressure.
- Have hands-on experience with automation for capacity planning, performance tuning, and cost optimisation in cloud infrastructure.
- Are proficient in Linux system administration and scripting languages such as Bash and Python.
Visa Sponsorship
- Sponsorship is not available for this role.
