Service Engineer II
Microsoft
Responsibilities
- Lead and manage high-severity incidents across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
- Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
- Contribute to the design of V.Next architecture for cloud infrastructure services, based on customer and first-party engagements.
- Engage in major production triage efforts, working with other teams as required to identify the root cause of highly impactful or complex issues. Identify product gaps and work with Product teams to close them.
- Partner closely with software developers, Product Managers, architects, and Infrastructure teams to deliver sustainable, reusable design patterns, ensuring that non-functional production support requirements are adopted early in migration and deployment.
- Promote a customer-first culture by prioritizing availability, reliability, and platform trust in every response.
- Participate in the on-call rotation.
- Analyze customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
- Drive continuous improvement of the Azure platform by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability.
- Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies.
- Identify and advocate for customer self-service capabilities, improved documentation, and scalable solutions that empower customers to resolve common issues independently.
- Design and drive adoption of incident response playbooks, mitigation levers, and operational frameworks aligned to real-world support scenarios and strategic customer needs.
- Contribute to the design of next-generation architecture for cloud infrastructure services with a focus on reliability and strategic customer support outcomes.
- Build and maintain cross-functional partnerships, ensuring alignment across engineering, business, and support organizations.
- Be data-driven and results-focused, using metrics to evaluate incident response effectiveness and platform health.
- Bring an engineering mindset to operational challenges, balancing agility, scalability, and technical excellence.
- Exhibit strong cross-team collaboration, an engineering mindset, and results-oriented execution under pressure.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Data Science, Cybersecurity, or a related field AND 2+ years of technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls;
- OR equivalent hands-on experience.
- Proven experience in cloud operations, incident and crisis management, or large-scale systems engineering, ideally within platforms such as Azure, AWS, or GCP.
- Demonstrated experience in 24×7×365 enterprise environments, managing mission-critical services.
- Demonstrated experience implementing AI-driven solutions and automation, with proficiency in one or more programming/automation languages (e.g., C, C++, C#, Java, JavaScript, Python) or equivalent expertise.
- ITIL, SRE, or other industry-recognized technical and operational certification.
- Master's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls
- OR Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 5+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls
- OR equivalent experience.
- 1+ year(s) technical experience working with large-scale cloud or distributed systems.
- 3+ years of demonstrated experience in incident management or crisis management for critical, high-severity incidents in high-availability, distributed environments.
- Experience with Service Engineering principles and practices, with exceptional command-and-control communication skills: able to drive clarity and direction with customers, internal Microsoft stakeholders, and third-party vendors during ambiguity and chaos.
- Demonstrated ability to make quick, strategic decisions in high-pressure situations, backed by strong analytical skills, team leadership, and collaboration with peer teams and internal engineering partners.
- Strong knowledge of Windows or Linux platforms and developer tools is desired, along with the ability to diagnose cloud computing platform issues, identify patterns, and implement AI-driven approaches for overall platform stability and reliability.
- Deep understanding of cloud architecture patterns, high availability, disaster recovery, business continuity, and performance tuning for platform services.
- Familiarity with monitoring and observability tools (e.g., Azure Monitor, Watch Dog, Grafana, Prometheus, Datadog, Splunk, New Relic).
- Exposure to chaos engineering, fault injection, or high availability architecture.
- AI/ML Experience: [Beginner to Intermediate]
- Familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes.
- Experience using AI-powered tools for incident analysis, log correlation, or predictive alerting.
- An understanding of the challenges and risks associated with AI/ML systems in a production environment.
- Certifications:
- Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, GCP Professional Cloud Architect).
- Certifications in ITIL, SRE, or other relevant frameworks.
Service Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
https://careers.microsoft.com/us/en/us-corporate-pay
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.