Software Engineering IC5

Microsoft
$142,800.00 - $274,800.00 / yr
United States, Washington, Redmond
May 28, 2026
Overview

The CoreAI Infrastructure team builds the foundational accelerated compute platforms that power large-scale AI training and inference across Azure. Our mission is to deliver secure, reliable, and highly efficient GPU and CPU infrastructure that enables multitenant AI systems at global scale while maximizing utilization, performance, and developer productivity.

This role sits at the intersection of cloud infrastructure, systems software, virtualization, and container platforms, working closely with CoreAI, Azure Infrastructure, OS, Networking, and Hardware teams to deliver end-to-end platform capabilities.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

As the Principal engineer on the team, your responsibilities include:

  • Design and build GPU and CPU accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments, with a focus on key observability metrics at scale.
  • Develop end-to-end observability and operational excellence systems for GPU/CPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multitenant usage).
  • Build and operate advanced orchestration, resource governance, and management scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources.
  • Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios.
  • Optimize performance, reliability, and utilization across large GPU/CPU fleets, including scale-up and scale-out configurations.
  • Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand-class networking) for distributed workloads.
  • Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence.
  • Influence platform architecture and technical direction across teams through design reviews and technical leadership.


Qualifications

Required Qualifications:

  • Bachelor's Degree in Computer Science or related technical field and 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python or equivalent experience.

Other Requirements:

  • Proven ability to design and operate large-scale production infrastructure with high reliability and performance requirements using Azure Kubernetes Service (AKS).
  • Strong problem-solving skills and the ability to debug complex, cross-layer systems issues.
  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment.
  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes).
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries.
  • Expertise with distributed observability technologies (e.g., Prometheus, OpenTelemetry, Grafana) and experience designing or scaling telemetry pipelines for high-throughput production systems.
  • Advanced, hands-on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools.


Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $142,800 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
