Job Summary
The Site Reliability Engineer is responsible for ensuring the reliability, availability, performance, and scalability of infrastructure and applications.
The role emphasizes automation, monitoring, incident management, and continuous improvement, and involves working closely with development and operations teams.
Key Responsibilities
- Ensure high availability, reliability, and performance of production systems
- Design, implement, and maintain monitoring, alerting, and observability solutions
- Automate infrastructure provisioning, deployments, and operational tasks
- Lead incident response, troubleshooting, and root cause analysis (RCA)
- Optimize system performance, scalability, and capacity planning
- Collaborate with development teams to improve application reliability and operability
- Define, track, and improve SLAs
- Reduce operational toil through automation and process improvement
- Ensure security, compliance, and adherence to operational best practices
- Participate in on-call rotations and provide production support
Required Skills / Must-Have
Technical Skills
- Linux/Unix system administration
- Kubernetes / OpenShift administration and troubleshooting
- Cloud platforms: AWS / Azure
- Monitoring & observability: Prometheus, Grafana, ELK, Datadog
- Scripting: Shell, Python, or Go
- Infrastructure as Code: Terraform, Ansible, Helm
- CI/CD pipelines and DevOps practices
Experience
- Experience in SRE / DevOps / Platform Engineering / Production Support
- Experience managing production-grade distributed systems
Nice-to-Have / Preferred Skills
- Service mesh experience (Istio, Linkerd)
- Messaging systems: Kafka, ActiveMQ, RabbitMQ
- Performance testing and load testing tools
- Security and compliance experience in regulated environments
- Exposure to Google SRE principles and practices
Education & Qualifications
Primary / Preferred Education
- Bachelor’s degree in Computer Science, Information Technology, or a related field (preferred)
Certifications / Licenses
Preferred (Not Mandatory)
- Red Hat OpenShift certification
Skills Grouping & Synonyms (for AI Matching)
Operations & Reliability
- Site Reliability Engineering / Production Support / Platform Engineering
- Incident management / Major incident / RCA / Postmortem
Cloud & Containers
- Kubernetes / OpenShift / Container orchestration
- Cloud infrastructure / IaaS / PaaS
Automation & DevOps
- Infrastructure as Code / IaC / Terraform / Ansible
- CI/CD / Continuous delivery / Automation
Monitoring & Observability
- Monitoring / Alerting / Metrics / Logging / Tracing
- Prometheus / Grafana / ELK / APM
Location & Work Mode
- Location: Gurugram, Haryana
- Work Mode: Onsite