Director, Reliability Engineering

Microsoft

Posted on Dec 20, 2024

Redmond, Washington, United States

Date posted: Dec 19, 2024
Job number: 1795913
Work site: Up to 50% work from home
Travel: 0-25%
Role type: People Manager
Profession: Hardware Engineering
Discipline: Reliability Engineering
Employment type: Full-Time

Overview

Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 Microsoft online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally through our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.

As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability, and in improving the planning process, manufacturing, quality, delivery at scale, serviceability, and sustainability. We are looking for a System Reliability Engineering Leader with a strong passion for customer-focused solutions and the insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.

We are looking for an experienced System Reliability Director who will be responsible for driving reliability performance across architecture, design, component and material selection, manufacturing, and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation, and operational aspects, along with the telemetry, diagnostics, and SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

Qualifications

Required Qualifications

  • Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience
    • OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience
    • OR Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 8+ years technical engineering experience.
  • 5+ years of management experience, including resource planning, career development, and performance management.

Other Requirements

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications

  • MBA in engineering management or operations.
  • Experience with cloud fleet management, telemetry, diagnostics, and troubleshooting of IT systems.
  • Experience and knowledge in the server industry product development process.
  • Experience in leading system engineering teams in both NPI and Sustaining lifecycles, and managing suppliers.
  • Experience and background developing design specifications and/or product requirement documents.
  • Experience with system reliability, manufacturing processes, and datacenter operations, leading continuous improvements through automation.
  • Experience with liquid cooling infrastructure for IT racks.

Reliability Engineering M5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. A different range applies to specific work locations within the San Francisco Bay Area and the New York City metropolitan area; the base pay range for this role in those locations is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until January 18, 2025.

#azurehwjobs #HIFE #Azure #Cloud #Hardware

Responsibilities

As a Director, Reliability Engineering, you will be responsible for the following:

  • Leading the Cloud System and Components Reliability Engineering organization with an ability to operate in a fast-paced environment, transforming ambiguity into clarity.
  • Leading strategic innovations and developing processes which integrate industry practices to ensure scalability and efficiency to achieve high reliability and quality performance.
  • Leading by example and coaching to inspire team members to grow and develop in the field of System and Components Reliability Engineering.
  • Leading retrospectives and deep dives to drive root-cause analysis and corrective actions that prevent future escapes.
  • Combining technical and process expertise with an in-depth understanding of cloud operations to optimize reliability solutions for future server and storage products.
  • Defining, facilitating, and managing the integration of architecture, design, manufacturing, operation, troubleshooting, and diagnostic methods to optimize cloud infrastructure reliability.
  • Participating in, and approving, mechanical, thermal, electrical, telemetry, and diagnostic design reviews to ensure system reliability requirements are properly implemented.
  • Driving System Reliability Readiness of new cloud platforms landing in Microsoft Datacenters.
  • Supporting Hardware Systems Group development, deployment, and sustaining teams from system concept to decommission, and working with cross-functional strategic teams on process optimizations and inter-related strategic initiatives.
  • Developing key metrics to evaluate the system reliability program’s performance and building implementation plans to confirm our performance and compliance against program metrics and internal company requirements.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.