Senior Software Engineer
Microsoft
Multiple Locations, United States
Overview
The High Performance Computing and Artificial Intelligence (HPC/AI) team is on a mission to build the next-generation distributed AI supercomputer, enabling breakthroughs in artificial intelligence by delivering unmatched computational power, scalability and reliability. We design and develop cutting-edge infrastructure that supports high-performance AI model training at scale, laying the foundation for innovations that redefine what AI can achieve.
We are looking for passionate and innovative software engineers to design and develop the tooling and infrastructure that powers the next generation of large-scale AI and HPC networking systems. In this role, you will build network automation tools, observability frameworks, and performance optimization systems that are critical for achieving ultra-low latency, high throughput, and petabyte-scale efficiency in distributed AI workloads.
As a Senior Software Engineer on the HPC & AI Infrastructure team, you’ll work at the intersection of AI supercomputing and large-scale networking, shaping how advanced AI models are trained and deployed in the cloud. Your contributions will directly impact the reliability and performance of massive distributed clusters, leveraging high-speed fabrics (e.g., InfiniBand, RoCE) and accelerated compute platforms (e.g., NVIDIA, AMD GPUs).
This is a unique opportunity to build core software infrastructure—from telemetry and diagnostics tools to orchestration and network configuration systems—that ensures observability, debuggability, and operational excellence at exascale levels. You’ll collaborate across hardware, infrastructure, and ML platform teams to deliver systems that push the boundaries of what's possible in AI training and inference. If you're excited about distributed systems, low-level performance engineering, and software for next-generation AI infrastructure, come help us build the backbone of the AI supercomputers of tomorrow.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required/Minimum Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- 3+ years of experience developing tools or software systems for distributed computing environments (e.g., HPC, AI/ML clusters, or cloud-scale platforms).
- 1+ years of familiarity with network performance tuning, telemetry, and observability tools in high-throughput, low-latency environments.
- 1+ years of exposure to network virtualization, software-defined networking (SDN), or fabric orchestration solutions.
Other Requirements:
- The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Additional or Preferred Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- Hands-on experience with networking technologies in AI-specific hardware (e.g., InfiniBand, RoCE, NVLink).
- Familiarity with AI accelerators such as GPUs (NVIDIA, AMD) or TPUs, and how they interact with networking infrastructure.
- Background in building scalable and fault-tolerant systems in large, distributed environments and proficiency in Linux operating systems, including kernel-level networking, performance tuning, and debugging.
Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications and process offers for these roles on an ongoing basis.
Responsibilities
- Design and build software tools and frameworks to support high-performance, low-latency, and low-jitter networking for large-scale distributed AI and HPC systems.
- Develop automation and observability tooling that optimizes networking infrastructure for petabyte-scale data movement and real-time AI model training.
- Implement scalable, maintainable networking services and APIs that integrate seamlessly with fabric controllers, telemetry systems, and distributed runtimes.
- Analyze performance metrics and system behavior to identify bottlenecks, improve throughput, and enhance the reliability and resilience of communication stacks in GPU/accelerator-heavy environments.
- Debug and resolve complex networking and system-level issues across large clusters, collaborating closely with infrastructure, storage, and AI framework teams.
- Own design and documentation of new software systems, identify cross-service dependencies, and lead architectural reviews for networking tools and infrastructure components.
- Write, test, refactor, and optimize production-quality code to support core networking operations, configuration management, telemetry pipelines, and system introspection.
- Serve as a Designated Responsible Individual (DRI) for networking services and tooling—monitoring production systems, responding to incidents, coordinating root-cause analysis, and driving long-term improvements in observability and operational readiness.
- Stay ahead of emerging trends in AI infrastructure, HPC fabrics (InfiniBand, NVLink, etc.), and software-defined networking, integrating best practices into development workflows and driving innovation in system scalability, reliability, and performance.