Location: Mountain View, California | San Francisco, California
Lead the Future of Large-Scale AI Training Infrastructure
At Databricks, we empower organizations to solve some of the world’s most complex challenges through data and artificial intelligence, from advancing medical research to enabling next-generation transportation systems. We do this by building and operating the industry’s leading Data Intelligence Platform.
We are looking for a Senior Engineering Manager, AI Runtime (AIR) to lead one of the most strategic engineering organizations within Databricks. This role offers the opportunity to define the future of managed GPU training infrastructure and AI model development at scale while leading a world-class engineering team.
About AI Runtime (AIR)
The AI Runtime (AIR) team powers enterprise-scale training and fine-tuning of deep learning models and Large Language Models (LLMs) through on-demand GPU infrastructure.
Organizations rely on AIR to train cutting-edge models across a variety of use cases, including:
- Foundation Models
- Large Language Models (LLMs)
- Transformer-Based Architectures
- Drug Discovery Models
- Custom Deep Learning Systems
- Enterprise AI Applications
AIR provides customers with the infrastructure needed to build and operate frontier AI models efficiently and reliably.
Your Leadership Opportunity
As a Senior Engineering Manager, you will oversee both the customer-facing product experience and the foundational infrastructure behind AI Runtime.
You will guide strategic investments across managed GPU training, distributed systems, scalability, reliability, and platform innovation while partnering closely with product, research, infrastructure, and platform teams.
This role combines technical leadership, product strategy, organizational development, and customer impact at one of the most exciting intersections of AI and cloud infrastructure.
Key Responsibilities
Lead and Scale a High-Performing Engineering Team
- Lead, mentor, and grow a high-performing engineering team responsible for the Custom Training product and its foundational infrastructure.
- Oversee distributed training orchestration, cluster lifecycle management, fault tolerance, and training efficiency initiatives.
- Foster a culture of technical excellence, innovation, and customer obsession.
Define the AI Runtime Vision
- Define and own the product and technical roadmap for AIR.
- Balance customer experience, platform functionality, scalability, and foundational infrastructure investments.
- Drive long-term strategic direction for managed GPU training capabilities.
Deliver End-to-End Product Innovation
- Collaborate closely with customers and with product, research, platform, and infrastructure teams.
- Drive projects from ideation and prioritization through launch and ongoing operations.
- Ensure successful execution of complex cross-functional initiatives.
Drive Architecture for GPU Training at Scale
- Lead architectural decisions for large-scale managed GPU training systems.
- Design solutions that support growing customer workloads and emerging AI technologies.
- Ensure extensibility, performance, and operational excellence across the platform.
Champion Customer Success
- Engage directly with customers to understand challenges and opportunities.
- Advocate for customer needs within engineering decision-making processes.
- Translate technical investments into measurable product outcomes.
Strengthen Reliability & Observability
- Build observability frameworks for long-running distributed training jobs.
- Define checkpointing strategies, operational runbooks, and failure recovery mechanisms.
- Improve resilience for multi-node training environments.
Build Exceptional Teams
- Partner closely with the recruiting team.
- Attract, hire, and develop top engineering talent.
- Create an environment that supports growth, innovation, and leadership development.
Required Qualifications
Professional Experience
- 8+ years of software engineering experience.
- 3+ years of engineering management experience.
- Proven track record of building and operating managed GPU training infrastructure at scale (hundreds to thousands of GPUs).
Distributed Training Expertise
Deep familiarity with:
- PyTorch
- DeepSpeed
- Composer
- Megatron-LM
Experience with parallelism strategies including:
- Fully Sharded Data Parallel (FSDP)
- Tensor Parallelism
- Pipeline Parallelism
Training Reliability & Resilience
Experience implementing:
- Checkpointing systems
- Elastic training architectures
- Automated failure recovery for long-running AI training jobs
GPU Performance Optimization
Strong understanding of:
- NCCL
- GPU interconnect topologies
- Memory optimization techniques
- Large-scale distributed GPU environments
Platform & Product Leadership
- Experience building platform products with clearly defined Service Level Agreements (SLAs).
- Proven ownership of the end-to-end customer experience, not only backend infrastructure.
- Ability to align technical execution with business and customer outcomes.
Cross-Functional Leadership
- Strong leadership across platform, product, and research organizations.
- Demonstrated success delivering complex initiatives in ambiguous environments.
- Ability to influence stakeholders across multiple teams and functions.
Communication & Collaboration
- Excellent collaboration, communication, and stakeholder management skills.
- Ability to effectively partner with engineering, product, infrastructure, and research teams.
Education
- BS/MS in Computer Science, Electrical Engineering, or a related technical field.
Compensation & Pay Transparency
Databricks is committed to fair and equitable compensation practices.
Local Pay Range
$228,600 to $314,250 USD
Actual compensation packages are determined based on factors including:
- Relevant experience
- Technical expertise
- Certifications and training
- Job-related skills
- Geographic location
In addition to base compensation, eligible employees may receive:
- Annual performance bonuses
- Equity awards
- Comprehensive employee benefits
Databricks anticipates utilizing the full salary range based on candidate qualifications and experience.
Why Join Databricks?
At Databricks, we are building state-of-the-art AI solutions that redefine how users interact with data and our products.
As part of the AI Runtime organization, you’ll:
- Lead mission-critical AI infrastructure initiatives.
- Work on cutting-edge GPU training systems powering frontier AI models.
- Influence the future of enterprise AI and Large Language Model development.
- Collaborate with world-class researchers, engineers, and product leaders.
- Solve some of the most challenging distributed systems problems in the industry.
If you’re passionate about scaling AI infrastructure and leading exceptional engineering teams, we’d love to hear from you.
About Databricks
Databricks is the Data and AI company trusted by more than 10,000 organizations worldwide.
Leading organizations, including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500, rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics, and AI.
Databricks was founded by the original creators of:
- Apache Spark™
- Delta Lake
- MLflow
- Lakehouse Architecture
Headquartered in San Francisco, Databricks continues to drive innovation across data, analytics, and artificial intelligence.
Benefits
Databricks offers comprehensive employee benefits and perks designed to support health, wellness, financial security, and professional growth. Benefits may vary by region and location.
Diversity, Equity & Inclusion
Databricks is committed to building an inclusive workplace where every employee can thrive.
Employment decisions are made without regard to age, race, ethnicity, disability, gender identity, sexual orientation, religion, family status, veteran status, socio-economic background, political affiliation, or any other protected characteristic.
Compliance
If access to export-controlled technology or source code is required for performance of job duties, it is within Employer’s discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.
Join a team building the infrastructure behind the world’s most advanced AI models and help shape the future of enterprise-scale machine learning and GPU training platforms.
