Principal Service Reliability Engineer

Microsoft

Redmond, WA, USA

USD 142,800-274,800 / year

Posted on May 30, 2026

Overview

Microsoft Digital (MSD) builds and manages the critical products and services that Microsoft runs on. We boldly pursue big ideas that power transformational advances at Microsoft and for our customers, while helping Microsoft teams work smarter, faster, and more securely every day. Microsoft Digital employees have deep technical and business expertise, customer insights, and a clear point of view that comes from first-hand, large-scale experience with Microsoft and industry solutions. We are engineers, technology leaders and experts, digital transformation change agents, and customer advocates.

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliability strategy for mission-critical, large-scale distributed systems. This role operates at a system and organizational level, driving reliability engineering practices across services, influencing architecture decisions, and establishing scalable frameworks for availability, performance, and operational excellence.

The Principal SRE defines reliability standards (SLOs, SLIs, and error budgets) and partners with engineering, product, and platform teams to design, build, and operate resilient systems at enterprise scale. This role is accountable for reducing systemic risk, eliminating operational toil, and advancing toward autonomous, self-healing platforms.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

#MSD

#MSDJOBS

Responsibilities

Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities.
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability.
Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes.
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale.
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization.
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention.
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements.
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability.
Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries.

Qualifications

Required Qualifications:

- 8+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration

Preferred Qualifications:

Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations.
Experience leading reliability efforts for enterprise-scale or globally distributed systems.
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers.
Demonstrated ability to mentor senior engineers and influence engineering culture at scale.
Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks).
Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred).
Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards.
Deep experience in observability, incident management, and production operations at scale.
Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles.
Experience leveraging data platforms (Kusto, Power BI, telemetry pipelines) to drive operational insights and decision-making.

What Success Looks Like:

Systemic reduction in incidents and customer-impacting outages, with measurable improvements in MTTR and service stability.
Organization-wide adoption of consistent reliability standards (SLOs/SLIs) and operational excellence practices.
Highly automated, low-toil environments with clear ownership and scalable operational processes.
Services designed with resilience, fault isolation, and recovery as first-class principles.
A strong reliability culture, with engineering teams proactively investing in long-term system health and scalability.

Site Reliability Engineering IC5 - The typical base pay range for this role across the U.S. is USD $142,800.00 - $274,800.00 per year. A different range applies to specific work locations within the San Francisco Bay area and New York City metropolitan area; the base pay range for this role in those locations is USD $188,000.00 - $304,200.00 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
