Department
Digital & Technology Office
Employee Type
Probationary
The Senior Site Reliability Engineer will serve as the first line of defense for our 24/7 operations. You will act as the guardian of our production environment, utilizing Dynatrace to maintain a holistic view of both Infrastructure and Application health.
You will not just monitor uptime; you will actively test system resilience, manage major incidents, and facilitate stability reporting. You will be the primary notification point for all P1/P2 incidents, responsible for deep-dive triage, quick remediation, and coordinating Major Incident Management (MIM).
Key Responsibilities
24/7 Incident Command & Alerting
- 24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the "eyes on glass" for the organization.
- Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
- Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.
Observability Strategy (Dynatrace Focus)
- Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
  - Configure Management Zones, Alerting Profiles, and Dashboards to provide a "Single Pane of Glass."
  - Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
  - Leverage Davis AI to automatically detect anomalies and reduce alert noise.
- Comprehensive Monitoring Scope:
  - Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
  - Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
  - Security: Monitor for DDoS attack patterns and WAF spikes.
Resilience & Chaos Engineering
- Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
- Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
- First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute "Quick Fix" runbooks to mitigate impact before escalating to platform engineering.
Application Triage & Analysis
- Deep-Dive Triage: Go beyond a basic "system check" and perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
- Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
- Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.
Governance & Reporting (Stability Cadence)
- Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
- Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
- Cross-Tower Visibility: Ensure that dashboards and reports provide value to all teams (Network, App, Cloud), leaving no siloed "blind spots" in production.
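As an illustration of the reporting arithmetic above, the following is a minimal Python sketch of how MTTR and remaining error budget might be computed from incident timestamps. The function names, the 99.9% availability target, and the 30-day window are assumptions made for the example, not figures from this role description.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean Time To Recovery: average of (resolved - detected) per incident.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def error_budget_remaining(slo, window, downtime):
    """Fraction of the error budget still unspent (negative if overspent).

    slo      -- availability target as a fraction, e.g. 0.999 (assumed value)
    window   -- length of the reporting window as a timedelta
    downtime -- total downtime observed in that window as a timedelta
    """
    budget = window * (1 - slo)  # allowed downtime for the window
    return 1 - downtime / budget


# Hypothetical data: two incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2026, 3, 1, 10, 0), datetime(2026, 3, 1, 10, 30)),
    (datetime(2026, 3, 5, 22, 0), datetime(2026, 3, 5, 23, 30)),
]
print(mttr(incidents))  # average recovery time across both incidents
print(error_budget_remaining(0.999, timedelta(days=30), timedelta(minutes=21.6)))
```

With a 99.9% target, a 30-day window allows about 43.2 minutes of downtime, so 21.6 minutes of downtime leaves half of the budget.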
Automation & Toil Reduction
- Remediation Scripting: Develop scripts (Python/Bash) to "Auto-Heal" common issues (e.g., clearing logs when disk is full, restarting stuck services).
- Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.
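To make the "Auto-Heal" idea concrete, here is a small Python sketch of a disk-full remediation: it checks disk usage and, above a threshold, deletes old log files. The function names, the 90% threshold, the 7-day retention, and the `*.log` pattern are all hypothetical choices for illustration; a production script would follow the team's own runbooks and retention policy.

```python
import os
import shutil
import time
from pathlib import Path


def disk_usage_percent(path):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def prune_old_logs(log_dir, max_age_days, now=None):
    """Delete *.log files in `log_dir` older than `max_age_days`.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for entry in sorted(Path(log_dir).glob("*.log")):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed


def auto_heal(log_dir, threshold_percent=90.0, max_age_days=7):
    """Prune old logs only when disk usage is at or above the threshold."""
    if disk_usage_percent(log_dir) < threshold_percent:
        return []  # healthy: nothing to do
    return prune_old_logs(log_dir, max_age_days)
```

Keeping the usage check separate from the pruning step means the same cleanup can be run manually from a runbook or triggered automatically by an alert.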
Required Qualifications
- Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
- Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
- Troubleshooting Expertise:
  - Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
  - Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
- Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
- Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
- Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.
Experience Range (Years)
4 - 8 years
Job posted on
2026-03-12