Principal Software Engineer
Microsoft
Multiple Locations, United States
Overview
We are looking for a Principal Software Engineer to help build the software systems, Artificial Intelligence (AI) agents, and automation platforms that power and maintain Azure’s global optical backbone, the foundation of Microsoft’s cloud and AI infrastructure.
This role is for engineers who think across layers, from the software running on optical devices that collect and instrument billions of data points, to the distributed high-availability systems that autonomously operate and repair the network. You will design and implement services that act as the sensory, cognitive, and motor systems of our AI-driven operations, safely and securely running one of the most advanced photonic networks in the world.
Our team has pioneered several industry-first AI agents and autonomous platforms that define the future of hyperscale network operations. We are looking for someone who thrives at the intersection of systems engineering, large-scale automation, and AI-native infrastructure, helping us evolve from reactive management to a self-sustaining intelligent network.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- Software & Systems experience: 6+ years building production software for network automation and operations; 4+ years designing and running distributed, highly available services at scale; fundamentals in concurrency, reliability, and performance.
- Programming & Automation experience: 1+ years building closed-loop automation, including telemetry collection, streaming/state evaluation, policy orchestration, and safe actuation on network or optical devices, and 1+ years of experience with Go or Python.
- Integration & Observability experience: 1+ years with device/controller interfaces, including Network Configuration Protocol/Yet Another Next Generation (NETCONF/YANG), gRPC Network Management Interface/gRPC Network Operations Interface (gNMI/gNOI), Simple Network Management Protocol (SNMP), vendor Software Development Kits (SDKs), and Remote Procedure Call (RPC) frameworks such as gRPC; observability practices across metrics, logs, and traces, including Service Level Objective (SLO) design, error budgets, and on-call ownership.
- Security & Leadership experience: 3+ years with a secure-by-design mindset (authentication, authorization, key/secret management, auditability) and a proven ability to lead cross-functional engineering efforts from design to production with measurable outcomes.
Other Qualifications:
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Preferred Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- AI Context Engineering: Building context and knowledge services for AI agents, including embeddings and retrieval, vector/time-series stores, feature pipelines, and contract-first Application Programming Interfaces (APIs) for tool exposure.
- AI Agent Development & Evaluation: Designing and assessing AI agents for operational automation with offline/online evaluations, golden sets, canary/A/B testing, safety guardrails, and audit trails.
- Control & Workflow Expertise: Familiarity with MCP or eServices-style control/context planes, tool interface design, and agent workflow engines such as Temporal or equivalent.
- Networking & Platform Skills: Exposure to optical networking including Dense Wavelength Division Multiplexing (DWDM) link budgeting, Optical Signal to Noise Ratio/Bit Error Rate (OSNR/BER) monitoring, transponder control, metro/long-haul design plus experience with Kubernetes and Continuous Integration/Continuous Delivery (CI/CD).
Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year. Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until November 12, 2025.
#azurecorejobs
Responsibilities
- Build and Scale Autonomous Network Systems: Design and implement highly available, distributed software systems that power and maintain Azure’s optical network at hyperscale. This includes everything from device-level telemetry, monitoring, and control software to globally distributed automation services that remediate and repair the network autonomously.
- Full-Stack Systems Engineering: Work across the full stack—from the embedded systems running on optical devices that collect and instrument data, to the cloud-scale services that analyze, decide, and act. Design for safety, resilience, observability, and rapid iteration across millions of data points per second.
- Agents and Automation Platforms: Develop the next generation of AI-driven agents and orchestration platforms that enable autonomous network operations. Build contextual, sensory, and motor systems that allow agents to perceive, reason about, and act safely and securely on the network.
- Context and Control Services, including Model Context Protocol/Electronic Services (MCP/eServices): Create and evolve micro-control planes and context services that give AI systems deep awareness of network state, enabling safe decision-making and intelligent automation across the optical domain.
- Cross-Domain System Integration: Collaborate closely with optical, switching, and AI infrastructure teams to deliver end-to-end, self-healing systems that tie together photonic, packet, and compute control planes.
- Operational Excellence and Reliability Engineering: Drive engineering rigor through metrics, observability, chaos testing, and continuous validation. Ensure the reliability and security of systems that operate some of the most mission-critical infrastructure in the world.
- Innovation and Industry Leadership: Contribute to pioneering efforts in autonomous infrastructure management, continuing our track record of delivering industry-first AI agents and platforms that redefine how hyperscale networks are built and operated.