Director - Site Reliability Engineering - IT

18 - 20 Years

Noida / Bangalore / Pune

Posted 4 months ago

#IT Operations #IT Change Management #IT Project Management #IT Program Management #IT Incident Management #ITIL #CTO

We are hiring a Director - SRE.

Experience : 18+ years

Notice Period : Immediate to 30 days joiners only

Locations : Noida, Bangalore, Pune

JD :

Leadership & Strategy :

- Provide technical and people leadership to SRE, DevOps, Monitoring, and Database Operations teams.

- Collaborate with leadership on budgeting, planning, hiring, and managing third-party contracts.

- Oversee project status, assemble project teams, and define assignments with schedules and milestones.

Platform Reliability & Performance :

- Drive continuous improvement of reliability, stability, and performance of digital platforms.

- Oversee implementation of automated telemetry, observability, and applied intelligence systems.

- Lead efforts to develop automated alerting, self-healing mechanisms, and intelligent response systems.

Incident & Escalation Management :

- Ensure 24/7 uptime of sites and services, with minimal unplanned downtime.

- Serve as Escalation Manager/Critical Incident Manager during major incidents, leading teams in rapid service restoration.

- Provide on-call escalation support on a 24/7/365 schedule.

- Communicate timely updates and incident reports to senior leadership.

Collaboration & Integration :

- Partner with administrators, platform engineers, and other stakeholders to achieve highly reliable infrastructure, systems, and integrations.

- Collaborate with product, application development, QA, and technology teams to enhance service reliability and performance.

Incident Management & Automation :

- Provide advanced Incident and Problem Management support to effectively diagnose, remediate, and resolve platform issues.

- Automate critical workflows across the platform to minimize manual errors and reduce human intervention.

- Implement ITIL processes such as Incident, Problem, and Change Management.

Monitoring & Scalability :

- Design and implement effective monitoring systems with proper alerting and escalation mechanisms for critical events.

- Ensure timely capacity planning and infrastructure upgrades for optimal reliability.

- Develop and refine processes to minimize Mean Time to Recover (MTTR) and extend Mean Time to Failure (MTTF).

Documentation & Compliance :

- Create and maintain detailed documentation, including runbooks, incident response guides, post-mortem reports, RCAs, and mitigation plans.

- Ensure all changes adhere to established procedures and documentation standards.

Business Alignment :

- Understand business workflows and map technology solutions to address problems effectively.

- Lead conversations and provide technical support to both internal and external customers.