Principal Software Engineer-distributed training system
Microsoft
Beijing, China
Overview
The MAI Ads team in Microsoft APRD is responsible for providing the advertising industry with a state-of-the-art online advertising platform and services. Our team is at the core of this effort, working on research and development in the following areas: Selection (recall), Relevance, User Response Prediction (Click Prediction and Conversion Prediction), Autobidding, Large Language Models, and Large-Scale Machine Learning & Serving Systems. We are a world-class R&D team of passionate and talented scientists and engineers who aspire to solve challenging problems and turn innovative ideas into high-quality products and services that help hundreds of millions of users and advertisers and directly impact our business.
Qualifications
• Bachelor's, Master's, or PhD degree in CS/EE or a related area is required.
• 6+ years of industry experience in software engineering.
• Solid experience shipping high-performance code in C++, CUDA, Python, C#, or an equivalent language.
• Experience with machine learning and TensorFlow/PyTorch distributed training is preferred.
• Domain knowledge of ads, search or content services is a plus.
• Quick learner with solid problem-solving and debugging skills.
• Good communication skills; fluent in English (both oral and written).
Responsibilities
• Design and implement a distributed training system for trillion-parameter machine learning models.
• Drive the team's efforts on GPU utilization and on optimizing training and inference on GPUs.
• Design and implement streaming training and publishing of trillion-parameter machine learning models.
• Analyze metrics and identify opportunities based on offline and online testing; develop and deliver robust, scalable solutions.
• Collaborate with cross-functional teams to deliver high-quality solutions.