AI QA engineer

Europe

Remote

Who We Are

Role Description

Project info:

Participate in AI incubator projects to scout, incubate, and validate client and PwC-internal ideas on a 3–5‑year horizon; develop technology roadmaps and prototypes that deliver advanced client solutions; champion internally generated concepts; and continually explore, test, and demonstrate cutting‑edge AI to create new products, services, and capabilities.

Role Overview

The AI QA role is the quality, safety, and reliability backbone of IE delivery. You ensure that everything built across IE pods — agentic workflows, retrieval systems, models, data pipelines, APIs, UX surfaces, and multi-agent orchestration layers — behaves reliably, safely, ethically, and consistently under real-world conditions. You are the final line of defense between experimental frontier AI and the client environment.

You design, own, and continuously refine the end-to-end test and evaluation strategy across multiple IE pods. This includes model and agent behavior evaluations, scenario stress tests, red-team probes, data quality and lineage checks, integration and regression tests, observability metrics, drift monitoring, bias detection, and failure-mode validation. You anticipate how things can break — and ensure they break in testing, not in production.

IE pods operate at the edge of what AI, data, and multi-agent systems can do today. Unlike traditional QA, where testing is a phase, AI QA is a continuous discipline woven through every sprint, because frontier agentic systems are nondeterministic, dynamic, and deeply sensitive to context, data quality, and model updates. You build the evaluation harnesses and monitoring capabilities that let the team trust its own intelligence systems.

Your role extents beyond testing — you help define what quality means for intelligent systems, how safety is enforced, how drift is detected early, how agents behave under pressure, and how to measure reliability for emerging capabilities that have no established playbook.

Responsibilities

1. Own the Test Strategy (App + Data + Model)

· Define the unified QA strategy across UI, backend, retrieval, and AI components.

· Design test plans that match the pod’s frontier experiment, NFRs, and acceptance criteria.

· Ensure coverage across functional, non-functional, and behavioral dimensions.

2. Build Evaluation Suites

· Develop evaluation frameworks for LLMs, retrieval, and agent workflows.

· Implement scenario-based evals, confusion tests, regression tests, and behavioral probes.

· Create both automated and manual eval paths for different model behaviors.

3. Test Automation

· Build automated test harnesses for Python services, agents, APIs, pipelines, and integrations.

· Integrate test automation into CI/CD pipelines.

· Maintain automated testing scripts that catch regressions early.

4. Data Quality & ML Evaluation

· Validate data correctness, consistency, and completeness for all RAG and agent pipelines.

· Test embedding quality, retrieval accuracy, and ranking performance.

· Identify hallucination patterns, reasoning failures, and model drift.

5. Red-Team & Edge-Case Scenario Design

· Simulate high-risk or adversarial scenarios to uncover weaknesses.

· Create structured red-team tests for safety, compliance, and robustness.

· Validate handling of ambiguous inputs, missing data, or malformed requests.

6. Observability, Monitoring & Drift Detection

· Define metrics and logs to monitor agent behavior, latency, cost, and error modes.

· Work with AI Ops to implement dashboards and alerts for reliability tracking.

· Detect and escalate drift, bias, or degradation trends quickly.

7. Defect Management & Triage

· Runs the defect triage workflow, partnering with the Tech Lead and engineers.

· Diagnose root causes and categorize failures across UI, API, data, or model layers.

· Ensure clear, crisp documentation with reproduction steps.

Required Skills & Experience

Technical Skills

· Strong Python scripting for test automation and scenario evaluation.

· Experience with ML evaluation tools, LLM/RAG testing, or model benchmarking suites.

· Familiarity with vector DBs, retrieval systems, and agent workflows.

· Understanding of CI/CD pipelines, DevOps tooling, and observability platforms.

· Ability to query data, validate embeddings, and test ranking/precision metrics.

QA & Risk Expertise

· 5–6+ years in QA, SDET, testing, or evaluation-focused ML engineering.

· Strong instincts for edge cases, risk modes, and adversarial failures.

· Experience designing tests for systems with nondeterministic or probabilistic behavior (preferred).

Mindset

· Curious, skeptical, and systematic.

· Thrives on breaking things to make them better.

· Strong communicator — crisp defect reporting is non-negotiable.

· High ownership and discipline; loves clear structure and tight loops.

Success Criteria (12-Week Pod)

· Robust evaluation suite delivered early (W2–W4) and expanded through W12.

· Zero untracked regressions in agent behavior after integration cycles.

· Data quality validated weekly; drift identified early.

· Clear, actionable defect reporting that accelerates engineering velocity.

· Reliable, safe, stable AI behavior for all pod deliverables

Start date: ASAP

HackerRank Challenge: Yes

Remote vs Onsite: Fully remote, with possible occasional in person team sessions / workshops / gatherings (i.e. 1x quarter) likely to take place in Prague

US Hours overlap needed: Minimum 2-6pm CET, preferred 2-7pm CET

‍

We Expect You to Have:

Apply for this position

Our team will review your application within the next 5 days.

Upload Resume

Uploading...

fileuploaded.jpg

Upload failed. Max size for files is 10 MB.

Send

Thank you!
We will be in touch shortly

kid giving a thumbs-up while sitting at a desktop table

Done

Oops! Something went wrong while submitting the form.

Apply for this position

Thank you!We will be in touch shortly

Thank you!
We will be in touch shortly