Job Description: The Site Reliability Engineer will work on-site in London for three days a week, focusing on managing cloud infrastructure and ensuring the reliability of production systems. This role involves operational responsibilities, automation, and collaboration across teams to build scalable systems. The position requires participation in on-call rotations and proactive monitoring of systems. The contract is for an initial duration of six months and is classified as inside IR35.
Key Responsibilities:
- Deploy, configure, and monitor AWS services ensuring high availability, scalability, and security.
- Respond to and resolve infrastructure and service incidents with root cause analysis and preventive measures.
- Handle change requests, track recurring issues, and work on long-term fixes to improve system stability.
- Implement and maintain observability solutions using Prometheus, Grafana, and Splunk.
- Write PromQL queries for custom monitoring dashboards, alerting, and diagnostics.
- Manage and optimize CI/CD pipelines for automated testing, deployment, and rollback strategies.
- Develop and maintain automation scripts in Python, Bash, Go, or SQL for routine infrastructure tasks.
- Utilize Git-based workflows for infrastructure changes, version control, and automated deployments.
- Operate, troubleshoot, and optimize Kubernetes clusters and containerized workloads.
- Participate in a rotating on-call schedule to ensure 24/7 availability of production systems.
Skills Required:
- Working knowledge and prior hands-on experience using AWS services at the DevOps Engineer level.
- Incident, change & problem management experience.
- Strong background in setup & operation of enterprise observability tooling, specifically Prometheus, Grafana, and Splunk, including usage of PromQL.
- Proficient in one or more of Python, Go, Bash, or SQL.
- Familiar with GitHub, GitOps, container orchestration, and Kubernetes operations.
- Working experience of configuration and deployment management with CI/CD.
- Hands-on experience with Terraform or CloudFormation for infrastructure provisioning and automation (desirable).
- Strong knowledge of Splunk for log analysis and troubleshooting (desirable).
- Strong problem-solving skills and analytical thinking (desirable).
Salary (Rate): undetermined
City: City of London
Country: UK
Working Arrangements: on-site, three days per week
IR35 Status: inside IR35
Seniority Level: undetermined
Industry: IT
Site Reliability Engineer
Whitehall Resources require a Site Reliability Engineer to work with a key client on an initial 6-month contract.
* This role will involve on-site work in London 3 days per week.
* Inside IR35.
* This role will require some on-call work.
The Role
As a Site Reliability/DevOps Engineer, you will play a critical role in managing cloud infrastructure, ensuring the reliability of production systems, and improving end-to-end deployment pipelines. This role combines deep operational responsibilities with a strong focus on automation, observability, and continuous improvement. You will be responsible for maintaining high system availability, enabling rapid delivery through CI/CD, and supporting development teams with robust infrastructure and tooling. A key part of the role includes proactive monitoring using Prometheus, Grafana, and Splunk, as well as participating in on-call rotations to respond to live incidents. Collaboration across engineering, security, and product teams is essential to build scalable and resilient systems.
Your responsibilities:
1. Deploy, configure, and monitor AWS services ensuring high availability, scalability, and security.
2. Respond to and resolve infrastructure and service incidents with root cause analysis and preventive measures.
3. Handle change requests, track recurring issues, and work on long-term fixes to improve system stability.
4. Implement and maintain observability solutions using Prometheus, Grafana, and Splunk.
5. Write PromQL queries for custom monitoring dashboards, alerting, and diagnostics.
6. Manage and optimize CI/CD pipelines for automated testing, deployment, and rollback strategies.
7. Develop and maintain automation scripts in Python, Bash, Go, or SQL for routine infrastructure tasks.
8. Utilize Git-based workflows for infrastructure changes, version control, and automated deployments.
9. Operate, troubleshoot, and optimize Kubernetes clusters and containerized workloads.
10. Participate in a rotating on-call schedule to ensure 24/7 availability of production systems.
Your Profile
Essential skills/knowledge/experience:
1. Working knowledge and prior hands-on experience using AWS services at the DevOps Engineer level.
2. Incident, change & problem management experience. This role is heavily operations-oriented and includes on-call requirements.
3. Strong background in setup & operation of enterprise observability tooling, specifically Prometheus, Grafana, and Splunk, including usage of PromQL.
4. Proficient in one or more of Python, Go, Bash, or SQL.
5. Familiar with GitHub, GitOps, container orchestration, and Kubernetes operations.
6. Working experience of configuration and deployment management with CI/CD.
Desirable skills/knowledge/experience:
1. Hands-on experience with Terraform or CloudFormation for infrastructure provisioning and automation.
2. Strong knowledge of Splunk for log analysis and troubleshooting.
3. Strong problem-solving skills and analytical thinking.
All of our opportunities require that applicants are eligible to work in the specified country/location, unless otherwise stated in the job description.
Whitehall Resources are an equal opportunities employer who value a diverse and inclusive working environment. All qualified applicants will receive consideration for employment without regard to race, religion, gender identity or expression, sexual orientation, national origin, pregnancy, disability, age, veteran status, or other characteristics.
Negotiable
City of London, UK
Inside
Onsite
IT
Not Specified
