Category: AI & Robotics | Site Reliability Engineering | High-Performance Computing
Employment Type: Full-Time
Location: India
Department: AI Infrastructure / Supercomputing
About the Role
Tesla’s Supercomputing/AI Infrastructure team works directly with the high-performance computing and machine learning infrastructure that powers Tesla’s most demanding ML workloads — from virtual simulations to Autopilot hardware and silicon design. As the scale and complexity of cluster builds continue to grow, the team’s focus on automation, monitoring, self-healing, and alerting becomes mission-critical to engineering success across Tesla.
As the scope and impact of Optimus, Full Self-Driving (FSD), and Robotaxi continue to expand, so does the importance of this team’s work. As a Site Reliability Engineer, you will be responsible for maintaining and improving the platform that powers Tesla’s FSD and Optimus engineering teams — ensuring they have the tools, resources, and infrastructure reliability needed to be productive. Your work will directly enable large-scale neural network training and streamline FSD development.
What You Will Be Doing
- AI/ML Cluster Infrastructure Support: You will support GPU-based AI/ML cluster infrastructure, with a focus on systems automation, configuration management, and deployment at scale.
- Monitoring & Self-Healing Pipelines: You will improve monitoring and self-healing pipelines, as well as strengthen the overall security posture of Tesla’s AI infrastructure.
- Performance Optimisation: You will optimise server, storage, and network performance across Tesla’s large-scale compute clusters.
- Tooling Development: You will develop new internal tools using Python, Golang, or Bash/Shell to streamline operations and reduce manual overhead.
- Infrastructure as Code: You will apply Infrastructure as Code (IaC) best practices to ensure repeatable, scalable, and reliable infrastructure deployments.
- On-Call Participation: You will participate in a 24×7 on-call rotation, ensuring continuous reliability of critical AI training infrastructure.
Required Qualifications
- Education: Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, Physics, or proof of exceptional skills in a related field
- 3+ years of additional equivalent experience, or evidence of exceptional ability related to the position
- Proficiency with Linux fundamentals and performance optimisation
- Experience with workload schedulers such as Slurm and LSF, and with storage management of parallel file systems
- Proficiency in Python, Golang, and/or Bash
- Experience with configuration management software (e.g., Ansible) and systems monitoring & alerting tools (e.g., Prometheus, Grafana, Telegraf, Splunk)
- Experience with container orchestration technologies such as Kubernetes
- Experience with high-throughput, low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus
About Tesla AI Infrastructure
This team sits at the heart of Tesla’s AI ambitions — building and maintaining the supercomputing infrastructure that trains the neural networks behind Full Self-Driving, Optimus, and Robotaxi. As Tesla’s compute needs continue to scale rapidly, this team’s automation and reliability work becomes increasingly central to the company’s AI roadmap.
