Apple

Full Time

Austin, Texas, USA; New York City, USA; San Diego, California, USA

Posted 1 day ago

Category: Machine Learning | Generative AI | Model Evaluation | AI Quality Engineering

Employment Type: Full-Time

Weekly Hours: 40

Location: Austin, Texas, United States (3 Work Locations Available)

Posted: June 11, 2026

Role Number: 200667292-0157

About the Role

Apple is hiring a Machine Learning Engineer specialising in ML and GenAI Evaluation to define the quality bar for AI models powering Apple Wallet, Payments, and Commerce. This is not a supporting role — this is the role that decides whether a model ships.

You will own the full evaluation lifecycle for production ML systems, establishing the evaluation criteria, metrics frameworks, adversarial test strategies, and fairness standards that determine when models are truly ready to reach hundreds of millions of users globally. Your technical judgement directly shapes model development priorities and product decisions at Apple’s largest scale.

If you believe that how you measure a model is just as important as how you train it — and you hold quality standards that others find uncomfortably high — this role is for you.

What You Will Be Doing

Evaluation Criteria & Quality Metrics Definition: You will define the evaluation criteria and quality metrics for ML models powering Wallet features, going far beyond accuracy and F1 to capture precision-recall trade-offs, calibration, fairness dimensions, and task-specific quality standards that genuinely reflect real-world user trust.

Structured Test Set Design: You will design and maintain comprehensive test sets covering the full diversity of real-world scenarios (varied document formats, distributions, languages, edge cases, and adversarial inputs), ensuring models are battle-tested before they reach any user.

Robustness & Distribution Shift Testing: You will develop evaluation methodologies for robustness testing, covering distribution shift, out-of-distribution generalisation, temporal drift, and aggressor scenarios that expose how models behave under pressure.

Fairness Evaluation Ownership: You will own fairness evaluation end-to-end: defining fairness metrics tailored to each Wallet feature, building bias test suites across protected attributes and user populations, measuring disparate performance across subgroups, and enforcing fairness as a hard launch gate with the same rigour as any conventional quality metric.

User Persona–Stratified Benchmarking: You will build benchmarks stratified by user persona, reflecting the full breadth of Wallet’s global user base across spending patterns, locales, and document types, so that no population is underserved by a shipped model.

GenAI & Agentic Model Evaluation: You will evaluate generative and agentic model outputs, assessing hallucination rates, faithfulness, and groundedness using LLM-as-a-judge frameworks, human evaluation protocols, and prompt regression testing.

Model Quality Sign-Off: You will own the final model quality sign-off process, establishing launch criteria, running final evaluations, and making the definitive call on model readiness before any Wallet feature ships.

Insight Synthesis & Cross-Functional Partnership: You will synthesise evaluation results into clear, actionable insights that guide model development priorities and product roadmap decisions. Working closely with ML and Quality Engineering teams, you will identify failure modes early and close the loop between evaluation findings and model improvements.

Evaluation Best Practice Evangelism: You will establish and champion evaluation best practices across the Wallet ML team, raising the bar for how models are tested, monitored, and maintained post-launch.

Minimum Qualifications

Education: MS in Machine Learning, Computer Science, Statistics, Applied Mathematics, or a related technical field (strongly preferred) — OR a Bachelor’s degree with 7+ years of hands-on experience in ML evaluation, model quality, or applied research
5+ years of hands-on ML experience with deep expertise in model evaluation, offline metrics design, and behavioural testing
Strong track record designing evaluation frameworks for production ML systems — spanning precision-recall trade-offs, calibration, fairness, and task-specific quality dimensions
Creative ability to translate standard ML metrics (F1, AUC, etc.) into utility and user trust measures
Proven experience testing for distribution shift, out-of-distribution generalisation, and temporal drift in real-world deployed models
Demonstrated ability to construct adversarial test suites, aggressor scenarios, and edge-case corpora that surface model failure modes before production
Experience with structured/semi-structured document understanding, OCR pipelines, or financial data extraction is a strong plus
Strong Python programming skills with fluency in evaluation tooling, data pipelines, and experiment tracking (MLflow, Weights & Biases, or equivalent)
Excellent communication skills — able to translate metric results into product-quality narratives for engineering and executive audiences
Experience owning model quality sign-off in a cross-functional launch process

Preferred Qualifications

Candidates with the following will stand out significantly:

PhD in Computer Science, Data Science, Statistics, AI/ML, or a related field
Experience with Bayesian or causal graph-based approaches to data generation
Experience with causal approaches to fairness evaluation — including counterfactual fairness, causal Shapley values, or structural causal model–based bias auditing
Experience evaluating models under privacy constraints or on-device inference settings
Familiarity with confidence calibration techniques and uncertainty quantification
Background in financial services, fintech, or consumer payment products

About Apple

On the Wallet ML team, rigorous evaluation is not an afterthought — it is the foundation of every model that ships. Apple is an equal opportunity employer committed to inclusion and diversity, and does not discriminate on the basis of race, colour, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or any other legally protected characteristic.

Salary information was not disclosed for this role. Compensation will be competitive and consistent with Apple’s total rewards package, including equity and comprehensive benefits.
