Posted in: IT & Systems
Job Code: 1537373
Role - SRE Director
Location - Hyderabad
Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Engineering, AI/ML, or a related field.
- 10+ years of experience in software release management, with at least 3-5 years in SRE or DevOps environments, preferably in AI or data-driven applications.
- Proven experience building and managing both release management and SRE teams in complex, multi-product environments.
- Strong knowledge of AI/ML operations (MLOps), data pipeline management, and cloud-based AI product deployments.
- Expertise in release management tools (Jenkins, GitLab, Git, Jira) and SRE tools such as Prometheus, Grafana, Datadog, or similar monitoring systems.
- Experience with cloud platforms (AWS, GCP, Azure), containerization (Kubernetes, Docker), and infrastructure automation tools (Terraform, Ansible).
- Excellent problem-solving, organizational, and leadership skills, with a strong track record of driving continuous improvement in both release and operational reliability processes.
Preferred Qualifications:
- Experience deploying and maintaining large-scale AI/ML models in production environments, including monitoring, retraining, and operationalization.
- Familiarity with ITIL, MLOps, or DevOps frameworks and best practices.
- Knowledge of cloud-based services and tools specifically designed for AI/ML (AWS SageMaker, TensorFlow, PyTorch).
- Demonstrated ability to manage incident response and root cause analysis in complex software ecosystems.
Key Responsibilities:
Strategic Leadership & Vision:
- Lead and manage the Software Release Management function for all Data and AI products.
- Establish a centralized release management framework for AI and data products that scales with the growing product portfolio.
- Form and lead a high-performing Site Reliability Engineering (SRE) team to ensure the operational stability and performance of all AI and data-driven applications post-release.
- Collaborate with Product, Engineering, and Operations teams to align release and SRE strategies with business objectives.
Release Planning & Coordination:
- Oversee the full lifecycle of software and AI model releases, from planning and coordination to post-release evaluation.
- Develop and maintain a detailed release calendar that aligns with the timelines and priorities of various product teams.
- Coordinate release activities with multiple cross-functional teams, ensuring transparent communication of dependencies, risks, and milestones.
- Ensure that all releases are integrated seamlessly into production, minimizing downtime and disruptions to end users.
Site Reliability Engineering (SRE) Team Formation:
- Hire, build, and lead the SRE team responsible for maintaining the reliability, scalability, and performance of all Data and AI products in production.
- Define the roles and responsibilities of the SRE team, ensuring clear alignment with the goals of product engineering and release management.
- Develop and implement SRE best practices, including incident response, root cause analysis, and proactive performance monitoring.
- Establish SLAs, SLOs, and SLIs (Service Level Agreements/Objectives/Indicators) to track and measure the reliability and performance of all services post-release.
- Collaborate with DevOps to ensure that automated CI/CD pipelines integrate seamlessly with SRE processes and monitoring systems.
Process Optimization & Automation:
- Lead the automation of software release processes, with an emphasis on CI/CD pipelines for AI models, data pipelines, and cloud-based AI products.
- Develop infrastructure-as-code practices to improve the scalability and reliability of AI and data systems across production environments.
- Introduce tools for version control, model governance, and monitoring for MLOps and AI model management in production.
- Continuously improve operational procedures to reduce the number of incidents and optimize recovery time.
Risk & Quality Management:
- Implement comprehensive quality assurance and validation processes to ensure that all AI models, data products, and software releases meet security, performance, and compliance requirements.
- Proactively identify and mitigate risks related to releases, AI model performance, and operational stability in production.