Site Reliability Engineer, HPC / AI Infrastructure — Tesla Supercomputing/AI Infrastructure Team | India

Full Time
Bengaluru, India

Category: AI & Robotics | Site Reliability Engineering | High-Performance Computing

Employment Type: Full-Time

Location: India

Department: AI Infrastructure / Supercomputing


About the Role

Tesla’s Supercomputing/AI Infrastructure team works directly with the high-performance computing and machine learning infrastructure that powers Tesla’s most demanding ML workloads — from virtual simulations to Autopilot hardware and silicon design. As the scale and complexity of cluster builds continue to grow, the team’s focus on automation, monitoring, self-healing, and alerting becomes mission-critical to engineering success across Tesla.

As the scope and impact of Optimus, Full Self-Driving (FSD), and Robotaxi continue to expand, so does the importance of this team’s work. As a Site Reliability Engineer, you will be responsible for maintaining and improving the platform that powers Tesla’s FSD and Optimus engineering teams — ensuring they have the tools, resources, and infrastructure reliability needed to be productive. Your work will directly enable large-scale neural network training and streamline FSD development.


What You Will Be Doing

  • AI/ML Cluster Infrastructure Support: You will support GPU-based AI/ML cluster infrastructure, with a focus on systems automation, configuration management, and deployment at scale.
  • Monitoring & Self-Healing Pipelines: You will improve monitoring and self-healing pipelines, as well as strengthen the overall security posture of Tesla’s AI infrastructure.
  • Performance Optimisation: You will optimise server, storage, and network performance across Tesla’s large-scale compute clusters.
  • Tooling Development: You will develop new internal tools using Python, Golang, or Bash/Shell to streamline operations and reduce manual overhead.
  • Infrastructure as Code: You will apply Infrastructure as Code (IaC) best practices to ensure repeatable, scalable, and reliable infrastructure deployments.
  • On-Call Participation: You will participate in a 24×7 on-call rotation, ensuring continuous reliability of critical AI training infrastructure.


Required Qualifications

  • Education: Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, Physics, or proof of exceptional skills in a related field
  • 3+ years of additional equivalent experience, or evidence of exceptional ability related to the position
  • Proficiency with Linux fundamentals and performance optimisation
  • Experience with Slurm, LSF, and storage management of parallel file systems
  • Proficiency in Python, Golang, and/or Bash
  • Experience with configuration management software (e.g., Ansible) and systems monitoring & alerting tools (e.g., Prometheus, Grafana, Telegraf, Splunk)
  • Experience with containerisation and container orchestration technologies such as Kubernetes
  • Experience with high-throughput, low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus

About Tesla AI Infrastructure

This team sits at the heart of Tesla’s AI ambitions — building and maintaining the supercomputing infrastructure that trains the neural networks behind Full Self-Driving, Optimus, and Robotaxi. As Tesla’s compute needs continue to scale rapidly, this team’s automation and reliability work becomes increasingly central to the company’s AI roadmap.
