Databricks

Full Time

Mountain View, California; San Francisco, California

Posted 1 week ago

Lead the Future of Large-Scale AI Training Infrastructure

At Databricks, we empower organizations to solve some of the world’s most complex challenges through data and artificial intelligence. From advancing medical research to enabling next-generation transportation systems, our mission is driven by building and operating the industry’s leading Data Intelligence Platform.

We are looking for a Senior Engineering Manager, AI Runtime (AIR) to lead one of the most strategic engineering organizations within Databricks. This role offers the opportunity to define the future of managed GPU training infrastructure and AI model development at scale while leading a world-class engineering team.

About AI Runtime (AIR)

The AI Runtime (AIR) team powers enterprise-scale training and fine-tuning of deep learning models and Large Language Models (LLMs) through on-demand GPU infrastructure.

Organizations rely on AIR to train cutting-edge models across a variety of use cases, including:

Foundation Models
Large Language Models (LLMs)
Transformer-Based Architectures
Drug Discovery Models
Custom Deep Learning Systems
Enterprise AI Applications

AIR provides customers with the infrastructure needed to build and operate state-of-the-art frontier AI models efficiently and reliably.

Your Leadership Opportunity

As a Senior Engineering Manager, you will oversee both the customer-facing product experience and the foundational infrastructure behind AI Runtime.

You will guide strategic investments across managed GPU training, distributed systems, scalability, reliability, and platform innovation while partnering closely with product, research, infrastructure, and platform teams.

This role combines technical leadership, product strategy, organizational development, and customer impact at one of the most exciting intersections of AI and cloud infrastructure.

Key Responsibilities

Lead and Scale a High-Performing Engineering Team

Lead, mentor, and grow a high-performing engineering team responsible for the Custom Training product and its foundational infrastructure.
Oversee distributed training orchestration, cluster lifecycle management, fault tolerance, and training efficiency initiatives.
Foster a culture of technical excellence, innovation, and customer obsession.

Define the AI Runtime Vision

Define and own the product and technical roadmap for AIR.
Balance customer experience, platform functionality, scalability, and foundational infrastructure investments.
Drive long-term strategic direction for managed GPU training capabilities.

Deliver End-to-End Product Innovation

Collaborate closely with customers and with product, research, platform, and infrastructure teams.
Drive projects from ideation and prioritization through launch and ongoing operations.
Ensure successful execution of complex cross-functional initiatives.

Drive Architecture for GPU Training at Scale

Lead architectural decisions for large-scale managed GPU training systems.
Design solutions that support growing customer workloads and emerging AI technologies.
Ensure extensibility, performance, and operational excellence across the platform.

Champion Customer Success

Engage directly with customers to understand challenges and opportunities.
Advocate for customer needs within engineering decision-making processes.
Translate technical investments into measurable product outcomes.

Strengthen Reliability & Observability

Build observability frameworks for long-running distributed training jobs.
Define checkpointing strategies, operational runbooks, and failure recovery mechanisms.
Improve resilience for multi-node training environments.

Build Exceptional Teams

Partner closely with the recruiting team.
Attract, hire, and develop top engineering talent.
Create an environment that supports growth, innovation, and leadership development.

Required Qualifications

Professional Experience

8+ years of software engineering experience.
3+ years of engineering management experience.
Proven track record of building and operating managed GPU training infrastructure at scale (hundreds to thousands of GPUs).

Distributed Training Expertise

Deep familiarity with:

PyTorch
DeepSpeed
Composer
Megatron-LM

Experience with parallelism strategies including:

Fully Sharded Data Parallel (FSDP)
Tensor Parallelism
Pipeline Parallelism

Training Reliability & Resilience

Experience implementing:

Checkpointing systems
Elastic training architectures
Automated failure recovery for long-running AI training jobs

GPU Performance Optimization

Strong understanding of:

NCCL
GPU interconnect topologies
Memory optimization techniques
Large-scale distributed GPU environments

Platform & Product Leadership

Experience building platform products with clearly defined Service Level Agreements (SLAs).
Proven ownership of customer experience beyond backend infrastructure responsibilities.
Ability to align technical execution with business and customer outcomes.

Cross-Functional Leadership

Strong leadership across platform, product, and research organizations.
Demonstrated success delivering complex initiatives in ambiguous environments.
Ability to influence stakeholders across multiple teams and functions.

Communication & Collaboration

Excellent collaboration, communication, and stakeholder management skills.
Ability to effectively partner with engineering, product, infrastructure, and research teams.

Education

BS/MS in Computer Science, Electrical Engineering, or a related technical field.

Compensation & Pay Transparency

Databricks is committed to fair and equitable compensation practices.

Local Pay Range

$228,600 to $314,250 USD

Actual compensation packages are determined based on factors including:

Relevant experience
Technical expertise
Certifications and training
Job-related skills
Geographic location

In addition to base compensation, eligible employees may receive:

Annual performance bonuses
Equity awards
Comprehensive employee benefits

Databricks anticipates utilizing the full salary range based on candidate qualifications and experience.

Why Join Databricks?

At Databricks, we are building state-of-the-art AI solutions that redefine how users interact with data and our products.

As part of the AI Runtime organization, you’ll:

Lead mission-critical AI infrastructure initiatives.
Work on cutting-edge GPU training systems powering frontier AI models.
Influence the future of enterprise AI and Large Language Model development.
Collaborate with world-class researchers, engineers, and product leaders.
Solve some of the most challenging distributed systems problems in the industry.

If you’re passionate about scaling AI infrastructure and leading exceptional engineering teams, we’d love to hear from you.

About Databricks

Databricks is the Data and AI company trusted by more than 10,000 organizations worldwide.

Leading organizations, including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500, rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics, and AI.

Databricks was founded by the original creators of:

Apache Spark™
Delta Lake
MLflow
Lakehouse Architecture

Headquartered in San Francisco, Databricks continues to drive innovation across data, analytics, and artificial intelligence.

Benefits

Databricks offers comprehensive employee benefits and perks designed to support health, wellness, financial security, and professional growth. Benefits may vary by region and location.

Diversity, Equity & Inclusion

Databricks is committed to building an inclusive workplace where every employee can thrive.

Employment decisions are made without regard to age, race, ethnicity, disability, gender identity, sexual orientation, religion, family status, veteran status, socio-economic background, political affiliation, or any other protected characteristic.

Compliance

If access to export-controlled technology or source code is required for performance of job duties, it is within Employer’s discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.

Apply for Senior Engineering Manager, AI Runtime Jobs at Databricks

Join a team building the infrastructure behind the world’s most advanced AI models and help shape the future of enterprise-scale machine learning and GPU training platforms.
