Monitoring and Observability (AIOps) Lead Engineer
Capgemini
Choosing Capgemini means choosing a company where you will be empowered to shape your career in the way you’d like, where you’ll be supported and inspired by a collaborative community of colleagues around the world, and where you’ll be able to reimagine what’s possible. Join us and help the world’s leading organizations unlock the value of technology and build a more sustainable, more inclusive world.
Job Description
As an Observability & AIOps Lead Engineer, you lead the strategy and implementation of monitoring platforms, define standards for logging and metrics, and ensure system visibility and reliability. You collaborate across teams to improve performance, manage vendor tools, drive automation, and support security and compliance efforts.
Key Responsibilities:-
- Lead the strategy, design and implementation of the monitoring and observability platform, in collaboration with the design authority and solution architects
- Define and maintain standards for logging, metrics, distributed tracing, and dashboards
- Collaborate with engineering and operations teams to instrument applications and infrastructure
- Ensure end-to-end visibility into system health, performance and reliability
- Set up actionable alerting systems to reduce alert fatigue and improve incident response
- Analyse observability data to drive continuous improvement in system performance and operational processes
- Own relationships with third-party observability vendors and tools
- Mentor and guide teams in adopting observability best practices
- Drive automation and self-service observability capabilities across teams
- Partner with security and compliance teams to ensure auditability and data integrity
Required Skills & Experience
Required Qualifications:-
- 7+ years of experience in IT, DevOps, SRE or infrastructure roles, including at least 2 years in a leadership role focused on observability
- Strong hands-on experience with modern observability tools (e.g. Splunk, New Relic, Datadog, Dynatrace, Elastic, AppDynamics)
- Proficiency in cloud platforms (AWS, Azure or GCP)
- Solid understanding of microservices, containers, Kubernetes and distributed systems
- Experience defining SLIs, SLOs, and SLAs
- Strong scripting and infrastructure-as-code skills (e.g. Python, Bash and Terraform)
- Excellent communication and stakeholder management skills
- Passion for system reliability, performance and user experience
- Familiarity with incident management frameworks (e.g. ITIL) is preferred
- Exposure to AI/ML-based anomaly detection in observability is preferred
- Experience driving cultural change to raise observability maturity across teams is preferred
What You’ll Love About Working Here
- You will be a part of a diverse collective of free-thinkers, entrepreneurs and industry experts. You will love the exposure to the scale of transformation, the depth of expertise, and the opportunities for growth.
- We aim to build an environment where employees can enjoy a positive work-life balance. We embed hybrid working in all that we do and make flexible working arrangements the day-to-day reality for our people.
- At the heart of our mission is your career growth. You will have countless learning and development opportunities, from think tanks to hackathons, and access to 250,000 courses with numerous external certifications crafted to support you in exploring a world of opportunities.
- We realise a Total Reward package should be more than just compensation. We offer a range of core and flexible benefits and have a Peer Recognition Portal called ‘Celebrate’.
Capgemini is a global business and technology transformation partner, helping organizations to accelerate their dual transition to a digital and sustainable world, while creating tangible impact for enterprises and society. It is a responsible and diverse group of 340,000 team members in more than 50 countries. With a strong heritage of over 55 years, Capgemini is trusted by its clients to unlock the value of technology to address the entire breadth of their business needs. It delivers end-to-end services and solutions leveraging strengths from strategy and design to engineering, all fueled by its market-leading capabilities in AI, generative AI, cloud and data, combined with its deep industry expertise and partner ecosystem.