Senior Software Engineer
Microsoft
We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team. In this role, you will architect and optimize the core inference engine that powers our large-scale AI models. You will be responsible for pushing the boundaries of hardware performance, reducing latency, and maximizing throughput for Generative AI and Deep Learning workloads.
You will work at the intersection of Deep Learning algorithms and low-level hardware, designing custom operators and building a highly efficient training/inference execution engine from the ground up.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Starting January 26, 2026, Microsoft AI (MAI) employees who live within a 50-mile commute of a designated Microsoft office in the U.S., or a 25-mile commute of a non-U.S. country-specific location, are expected to work from the office at least four days per week. This expectation is subject to local law and may vary by jurisdiction.
Responsibilities
- Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries (an illustrative kernel sketch follows this list).
- Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization).
- Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads.
- Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching.
- Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism).
- Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy).
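To make the Custom Operator Development item above concrete, here is a minimal, illustrative CUDA sketch of a row-wise LayerNorm kernel of the kind this role designs and tunes. The kernel name, launch configuration, and simplifying assumptions (FP32 only, one thread block per row) are ours for illustration and are not Microsoft production code.

```cuda
// Illustrative row-wise LayerNorm kernel: one thread block per row,
// warp-shuffle partial reductions, then a block-level pass in shared memory.
#include <cuda_runtime.h>
#include <math.h>

__global__ void layernorm_rows(const float* __restrict__ in,
                               const float* __restrict__ gamma,
                               const float* __restrict__ beta,
                               float* __restrict__ out,
                               int hidden_size, float eps) {
    extern __shared__ float smem[];            // two floats per warp for partial sums
    const float* row_in  = in  + (size_t)blockIdx.x * hidden_size;
    float*       row_out = out + (size_t)blockIdx.x * hidden_size;

    // Each thread accumulates a strided slice of the row.
    float sum = 0.f, sq = 0.f;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        float v = row_in[i];
        sum += v;
        sq  += v * v;
    }

    // Warp-level reduction with shuffle intrinsics (assumes full 32-thread warps).
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
        sq  += __shfl_down_sync(0xffffffff, sq,  offset);
    }
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) { smem[2 * warp] = sum; smem[2 * warp + 1] = sq; }
    __syncthreads();

    // Thread 0 folds the per-warp partials into mean and reciprocal stddev.
    if (threadIdx.x == 0) {
        float s = 0.f, q = 0.f;
        int nwarps = (blockDim.x + 31) / 32;
        for (int w = 0; w < nwarps; ++w) { s += smem[2 * w]; q += smem[2 * w + 1]; }
        float mean = s / hidden_size;
        float var  = q / hidden_size - mean * mean;
        smem[0] = mean;
        smem[1] = rsqrtf(var + eps);
    }
    __syncthreads();

    // Normalize, scale, and shift.
    float mean = smem[0], rstd = smem[1];
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x)
        row_out[i] = (row_in[i] - mean) * rstd * gamma[i] + beta[i];
}

// Illustrative launch: one block of 256 threads per row.
// layernorm_rows<<<num_rows, 256, 2 * ((256 + 31) / 32) * sizeof(float)>>>(
//     d_in, d_gamma, d_beta, d_out, hidden_size, 1e-5f);
```

In practice this role extends such kernels with mixed precision, vectorized loads, fusion into adjacent operators, and architecture-specific features such as asynchronous copy on Hopper.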
Qualifications
Required Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- Professional Depth: 4+ years of experience in systems programming, HPC, or GPU software development, including substantial hands-on CUDA/C++ kernel development.
- Architectural Mastery: Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper). Deep understanding of the memory hierarchy (shared memory, L2 cache, registers), warp-level primitives, occupancy optimization, and bank conflict resolution.
- Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy.
- Programming & Systems: Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel).
- Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads.
- Performance Engineering: Mastery of NVIDIA Nsight Systems/Compute. Ability to reason quantitatively about performance using the Roofline Model, memory bandwidth utilization, and compute throughput (a worked example follows this list).
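For the Roofline-style reasoning referenced in the Performance Engineering item, the bound and a worked arithmetic-intensity example are sketched below; the numbers are generic illustrations, not figures for a specific Microsoft workload or GPU.

```latex
% Roofline model: attainable throughput is limited by the lesser of peak
% compute and arithmetic intensity times peak memory bandwidth.
\[
P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right),
\qquad I = \frac{\text{FLOPs}}{\text{bytes moved}}
\]
% Worked example: an FP16 GEMM with M = N = K = 4096 performs
% 2MNK \approx 1.4 \times 10^{11} FLOPs while moving roughly
% 3 \cdot 4096^2 \cdot 2 \approx 1.0 \times 10^8 bytes, so I \approx 1400 FLOP/byte
% and the kernel sits on the compute roof; an elementwise kernel with
% I < 1 FLOP/byte is bandwidth-bound.
```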
Preferred Qualifications:
- Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- Professional Depth: 5+ years of experience in systems programming, HPC, or GPU software development, including at least 5 years of hands-on CUDA/C++ kernel development.
- Engine & Framework Expertise: Working knowledge of state-of-the-art inference/training stacks: sglang, vLLM, TensorRT-LLM, DeepSpeed, or Megatron-LM. Deep understanding of optimization patterns: PagedAttention, RadixAttention (prefix caching), continuous batching, and speculative decoding (see the KV-cache paging sketch after this list).
- Operator & GEMM Optimization: Practical experience with CUTLASS, CuTe, or OpenAI Triton. Expertise in high-performance linear algebra (GEMM) optimization, including tiling strategies, data layouts, and mixed-precision accumulation.
- Distributed Systems: Proficiency in multi-GPU/multi-node scaling using NCCL and parallelism strategies (Tensor, Pipeline, and Sequence parallelism).
- Vibe Coding & AI-Native Velocity: An AI-native mindset and expert use of vibe coding tools to bypass boilerplate and accelerate the development lifecycle. The technical intuition to architect systems rapidly, moving from "vibe" to highly optimized production code with extreme velocity.
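To make the PagedAttention item above concrete, the following is a minimal CUDA sketch of the block-table indirection that paged KV caches rely on. The page size, memory layout, and kernel name are illustrative assumptions, not the vLLM or SGLang implementation.

```cuda
// Illustrative block-table indirection for a paged KV cache: logical token t
// of a sequence lives in physical page block_table[t / PAGE], slot t % PAGE.
#include <cuda_runtime.h>

constexpr int PAGE = 16;        // tokens per KV page (illustrative)
constexpr int HEAD_DIM = 128;   // elements per K vector (illustrative)

__global__ void gather_k(const float* __restrict__ k_cache,     // [num_pages, PAGE, HEAD_DIM]
                         const int*   __restrict__ block_table, // pages owned by this sequence
                         float*       __restrict__ k_out,       // [seq_len, HEAD_DIM] contiguous
                         int seq_len) {
    int token = blockIdx.x;                  // one thread block per logical token
    if (token >= seq_len) return;
    int page = block_table[token / PAGE];    // physical page index
    int slot = token % PAGE;                 // slot within that page
    const float* src = k_cache + ((size_t)page * PAGE + slot) * HEAD_DIM;
    float*       dst = k_out   + (size_t)token * HEAD_DIM;
    for (int i = threadIdx.x; i < HEAD_DIM; i += blockDim.x)
        dst[i] = src[i];
}

// Illustrative launch:
// gather_k<<<seq_len, 128>>>(d_k_cache, d_block_table, d_k_out, seq_len);
```

A production engine fuses this gather directly into the attention kernel and manages block tables under continuous batching rather than materializing a contiguous copy.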
#MicrosoftAI
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.