Category: Machine Learning | Generative AI | Model Evaluation | AI Quality Engineering
Employment Type: Full-Time
Weekly Hours: 40
Location: Austin, Texas, United States (3 Work Locations Available)
Posted: June 11, 2026
Role Number: 200667292-0157
About the Role
Apple is hiring a Machine Learning Engineer specialising in ML and GenAI Evaluation to define the quality bar for AI models powering Apple Wallet, Payments, and Commerce. This is not a supporting role — this is the role that decides whether a model ships.
You will own the full evaluation lifecycle for production ML systems, establishing the evaluation criteria, metrics frameworks, adversarial test strategies, and fairness standards that determine when models are truly ready to reach hundreds of millions of users globally. Your technical judgement directly shapes model development priorities and product decisions at Apple’s largest scale.
If you believe that how you measure a model is just as important as how you train it — and you hold quality standards that others find uncomfortably high — this role is for you.
What You Will Be Doing
Evaluation Criteria & Quality Metrics Definition
You will define the evaluation criteria and quality metrics for ML models powering Wallet features, going far beyond accuracy and F1 to capture precision-recall trade-offs, calibration, fairness dimensions, and task-specific quality standards that genuinely reflect real-world user trust.
Structured Test Set Design
You will design and maintain comprehensive test sets covering the full diversity of real-world scenarios (varied document formats, distributions, languages, edge cases, and adversarial inputs), ensuring models are battle-tested before they reach any user.
Robustness & Distribution Shift Testing
You will develop evaluation methodologies for robustness testing, covering distribution shift, out-of-distribution generalisation, temporal drift, and aggressor scenarios that expose how models behave under pressure.
Fairness Evaluation Ownership
You will own fairness evaluation end-to-end: defining fairness metrics tailored to each Wallet feature, building bias test suites across protected attributes and user populations, measuring disparate performance across subgroups, and enforcing fairness as a hard launch gate with the same rigour as any conventional quality metric.
User Persona–Stratified Benchmarking
You will build benchmarks stratified by user persona, reflecting the full breadth of Wallet’s global user base across spending patterns, locales, and document types, to ensure no population is underserved by a shipped model.
GenAI & Agentic Model Evaluation
You will evaluate generative and agentic model outputs, assessing hallucination rates, faithfulness, and groundedness using LLM-as-a-judge frameworks, human evaluation protocols, and prompt regression testing.
Model Quality Sign-Off
You will own the final model quality sign-off process: establishing launch criteria, running final evaluations, and making the definitive call on model readiness before any Wallet feature ships.
Insight Synthesis & Cross-Functional Partnership
You will synthesise evaluation results into clear, actionable insights that guide model development priorities and product roadmap decisions. Working closely with ML and Quality Engineering teams, you will identify failure modes early and close the loop between evaluation findings and model improvements.
Evaluation Best Practice Evangelism
You will establish and champion evaluation best practices across the Wallet ML team, raising the bar for how models are tested, monitored, and maintained post-launch.
Minimum Qualifications
- Education: MS in Machine Learning, Computer Science, Statistics, Applied Mathematics, or a related technical field (strongly preferred) — OR a Bachelor’s degree with 7+ years of hands-on experience in ML evaluation, model quality, or applied research
- 5+ years of hands-on ML experience with deep expertise in model evaluation, offline metrics design, and behavioural testing
- Strong track record designing evaluation frameworks for production ML systems — spanning precision-recall trade-offs, calibration, fairness, and task-specific quality dimensions
- Creative ability to translate standard ML metrics (F1, AUC, etc.) into utility and user trust measures
- Proven experience testing for distribution shift, out-of-distribution generalisation, and temporal drift in real-world deployed models
- Demonstrated ability to construct adversarial test suites, aggressor scenarios, and edge-case corpora that surface model failure modes before production
- Experience with structured/semi-structured document understanding, OCR pipelines, or financial data extraction is a strong plus
- Strong Python programming skills with fluency in evaluation tooling, data pipelines, and experiment tracking (MLflow, Weights & Biases, or equivalent)
- Excellent communication skills — able to translate metric results into product-quality narratives for engineering and executive audiences
- Experience owning model quality sign-off in a cross-functional launch process
Preferred Qualifications
Candidates with the following will stand out significantly:
- PhD in Computer Science, Data Science, Statistics, AI/ML, or a related field
- Experience with Bayesian or causal graph-based approaches to data generation
- Experience with causal approaches to fairness evaluation — including counterfactual fairness, causal Shapley values, or structural causal model–based bias auditing
- Experience evaluating models under privacy constraints or on-device inference settings
- Familiarity with confidence calibration techniques and uncertainty quantification
- Background in financial services, fintech, or consumer payment products
About Apple
On the Wallet ML team, rigorous evaluation is not an afterthought — it is the foundation of every model that ships. Apple is an equal opportunity employer committed to inclusion and diversity, and does not discriminate on the basis of race, colour, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or any other legally protected characteristic.
Salary information was not disclosed for this role. Compensation will be competitive and consistent with Apple’s total rewards package, including equity and comprehensive benefits.
