Director - Site Reliability Engineering

10 - 15 Years

Others


#ITIL #IT Operations #IT Product Development

Role - SRE Director.

Location - Hyderabad.

Qualifications:

- Bachelor's or Master's degree in Computer Science, Data Engineering, AI/ML, or a related field.

- 10+ years of experience in software release management, with at least 3-5 years in SRE or DevOps environments, preferably in AI or data-driven applications.

- Proven experience building and managing both release management and SRE teams in complex, multi-product environments.

- Strong knowledge of AI/ML operations (MLOps), data pipeline management, and cloud-based AI product deployments.

- Expertise in release management tools (Jenkins, GitLab, Git, Jira) and SRE tools such as Prometheus, Grafana, Datadog, or similar monitoring systems.

- Experience with cloud platforms (AWS, GCP, Azure), containerization (Kubernetes, Docker), and infrastructure automation tools (Terraform, Ansible).

- Excellent problem-solving, organizational, and leadership skills, with a strong track record of driving continuous improvement in both release and operational reliability processes.

Preferred Qualifications:

- Experience deploying and maintaining large-scale AI/ML models in production environments, including monitoring, retraining, and operationalization.

- Familiarity with ITIL, MLOps, or DevOps frameworks and best practices.

- Knowledge of cloud-based services and tools specifically designed for AI/ML (AWS SageMaker, TensorFlow, PyTorch).

- Demonstrated ability to manage incident response and root cause analysis in complex software ecosystems.

Key Responsibilities:

Strategic Leadership & Vision:

- Lead and manage the Software Release Management function for all Data and AI products.

- Establish a centralized release management framework for AI and data products that scales with the growing product portfolio.

- Form and lead a high-performing Site Reliability Engineering (SRE) team to ensure the operational stability and performance of all AI and data-driven applications post-release.

- Collaborate with Product, Engineering, and Operations teams to align release and SRE strategies with business objectives.

Release Planning & Coordination:

- Oversee the full lifecycle of software and AI model releases, from planning and coordination to post-release evaluation.

- Develop and maintain a detailed release calendar that aligns with the timelines and priorities of various product teams.

- Coordinate release activities with multiple cross-functional teams, ensuring transparent communication of dependencies, risks, and milestones.

- Ensure that all releases are integrated seamlessly into production, minimizing downtime and disruptions to end users.

Site Reliability Engineering (SRE) Team Formation:

- Hire, build, and lead the SRE team responsible for maintaining the reliability, scalability, and performance of all Data and AI products in production.

- Define the roles and responsibilities of the SRE team, ensuring clear alignment with the goals of product engineering and release management.

- Develop and implement SRE best practices, including incident response, root cause analysis, and proactive performance monitoring.

- Establish SLAs, SLOs, and SLIs (Service Level Agreements/Objectives/Indicators) to track and measure the reliability and performance of all services post-release.

- Collaborate with DevOps to ensure that automated CI/CD pipelines integrate seamlessly with SRE processes and monitoring systems.

Process Optimization & Automation:

- Lead the automation of software release processes, with an emphasis on CI/CD pipelines for AI models, data pipelines, and cloud-based AI products.

- Develop infrastructure-as-code practices to improve the scalability and reliability of AI and data systems across production environments.

- Introduce tools for version control, model governance, and monitoring for MLOps and AI model management in production.

- Continuously improve operational procedures to reduce the number of incidents and shorten recovery time.

Risk & Quality Management:

- Implement comprehensive quality assurance and validation processes to ensure that all AI models, data products, and software releases meet security, performance, and compliance requirements.

- Proactively identify and mitigate risks related to releases, AI model performance, and operational stability in production.