Connecting people I'd hire with companies I'd work at

Matt Wallaert

Site Reliability Engineer

Microsoft

Software Engineering
Redmond, WA, USA
USD 84,200-165,200 / year
Posted on Feb 14, 2026
Overview

The Site Reliability Engineer (SRE) for Azure xDPU Storage Team – Hardware Enablement is responsible for ensuring the reliability, availability, and performance of Fungible DPU-based Azure Storage devices as they integrate next-generation networking and compute offload hardware. This role focuses on safe bring-up, validation, and scaled production operation of DPU-enabled platforms, bridging hardware, firmware, and software reliability and maintenance.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.



Responsibilities
  • Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments.
  • Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases.
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments.
  • Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments.
  • Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments.
  • Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management.
  • Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments.
  • Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals.


Qualifications

Required Qualifications:

  • Associate's Degree in Computer Science, Information Technology, or related field
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field
    • OR equivalent experience.

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
    • OR equivalent experience.
  • Experience operating large-scale, distributed systems in a lab/validation environment.
  • Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines.
  • Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code.
  • Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations.
  • Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems.
  • Direct experience with Fungible DPU technology or similar SmartNIC/DPU platforms.
  • Existing hands-on experience working in Microsoft MLS (Microsoft Lab Services) or equivalent internal lab environments, including lab-based hardware validation, performance testing, and bring-up workflows.
  • Experience enabling new hardware platforms or accelerators in a Windows/mixed OS environment.
  • Familiarity with firmware lifecycles, hardware validation, and silicon bring-up processes.
  • Experience with infrastructure-as-code and CI/CD pipelines (ARM/Bicep, Terraform, Azure DevOps).

#azurecorejobs



Site Reliability Engineering IC2 - The typical base pay range for this role across the U.S. is USD $84,200 - $165,200 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area, where the base pay range for this role is USD $109,000 - $180,400 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
https://careers.microsoft.com/us/en/us-corporate-pay


This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.




Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.