
Adversarial AI Security & ML Red Teaming


The security of an AI system is not determined by whether it produces correct outputs on test data. It is determined by whether it continues to produce correct and safe outputs when an adversary is actively working to make it fail — by manipulating the inputs it receives, the data it was trained on, or the model itself. These are not theoretical academic concerns. Model extraction attacks that reconstruct proprietary models from API queries are documented at scale. Data poisoning attacks that alter model behaviour by corrupting training datasets have been demonstrated against production systems in financial fraud detection, malware classification, and medical imaging. Adversarial examples that cause confident misclassification with imperceptible input modifications have been reproduced across every major model architecture.

The gap between ML engineering and security engineering is where most production AI vulnerabilities live. ML teams are trained to maximise accuracy on held-out test sets — a discipline that is entirely orthogonal to adversarial robustness. Security teams are trained to find weaknesses in conventional software systems — a discipline that does not extend naturally to the probabilistic, high-dimensional failure modes of trained models. Adversarial AI security assessment requires both, applied together, to the specific deployment context and threat model of the system under assessment.

This service conducts structured adversarial assessment of production and near-production AI and ML systems: attack simulation across all relevant threat categories, measurement of the system’s resistance to each, identification of the vulnerabilities that would be exploited by a motivated adversary in the specific deployment context, and implementation of the technical controls that reduce residual risk to an acceptable level. It also satisfies the EU AI Act Article 9 red-teaming requirement for high-risk AI systems — not by producing documentation that describes a red team assessment, but by conducting one.

Price Range
£18,000 – £240,000+
Red team assessment, vulnerability findings, and defensive specification. Implementation of defences by your ML and engineering teams is additional.
Duration
4 – 20 weeks
Assessment phase. Defence implementation timelines depend on which vulnerabilities are found and the complexity of the required mitigations.
Distinct from
Prompt injection testing (see Prompt Engineering) — that is the application layer. This service operates at the model and training pipeline layer, and covers the full attack surface of deployed ML systems.
EU AI Act
Article 9 requires a risk management system, including testing, for high-risk AI systems. Article 15 requires robustness, accuracy, and cybersecurity measures. This engagement produces the technical evidence for both.
Contract
Fixed-price. 50% on signing, 50% on delivery acceptance.
Standard penetration testing does not assess adversarial ML risk
Conventional penetration testing finds vulnerabilities in the software infrastructure surrounding an AI system — the API, the authentication, the data storage. It does not find model extraction vulnerabilities, adversarial example susceptibility, data poisoning backdoors, or membership inference exposure. These require attack methodologies specific to machine learning systems that are not part of standard penetration testing methodology. An AI system that passes a conventional penetration test has not been assessed for adversarial ML risk.

Six categories of adversarial ML attack. Each operates at a different stage of the ML lifecycle. Each requires a different defensive approach. None of them are found by conventional security testing.

Adversarial ML is a research discipline that has been active for over a decade. The academic attacks developed in that time have been translated into practical exploits against production systems by adversaries with access to the same published literature. The attacks below are not theoretical — each has documented production system impact or has been demonstrated to be practical in settings directly analogous to production deployment. The assessment methodology for each is distinct; they cannot be collapsed into a single unified approach.

Attack Stage: Inference Time
Evasion Attacks — Adversarial Examples
An adversary makes imperceptible modifications to an input that cause a trained model to misclassify it with high confidence. The modification is optimised against the model’s decision boundary using gradient information (white-box) or using the model’s output probabilities or decisions (black-box). The attack crosses every domain: images modified to evade computer vision classifiers; network traffic modified to evade intrusion detection systems; financial transaction features perturbed to evade fraud detection; text modified to evade content classifiers. The adversarial example looks identical to the original to a human reviewer. The model classifies it as something entirely different, with high confidence.
Documented production impact
A financial institution deployed an ML-based fraud detection classifier. Researchers demonstrated that transaction features could be perturbed, each within its legitimate variation range, so that a fraudulent transaction was consistently rated as low risk by the classifier. The perturbations were small enough to be indistinguishable from legitimate transaction variation. The attack required approximately 40 queries to the classifier’s risk score output to develop an effective adversarial example.
Assessment methodology
White-box assessment if model access is available: gradient-based adversarial example generation (PGD, C&W, AutoAttack) with measurement of the perturbation magnitude required for successful evasion across the deployment domain. Black-box assessment for API-accessible models: boundary-based and decision-based attacks that require only query access to the model’s output. Measurement of the adversarial robustness curve — the relationship between perturbation budget and attack success rate — as the primary output metric.
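The black-box branch of this methodology can be sketched in a few lines. Everything below is an illustrative stand-in, not assessment tooling: the "risk scorer" is a fabricated linear model, the transaction features are invented, and the attack is a deliberately crude greedy search rather than a boundary-based attack. It does, however, show the primary output metric in miniature: attack success as a function of perturbation budget.

```python
def risk_score(x):
    # Stand-in black-box "fraud classifier": higher score = riskier.
    # Weights are fabricated for illustration.
    w = [0.8, -0.5, 1.2]
    return sum(wi * xi for wi, xi in zip(w, x))

def greedy_evasion(x, budget, threshold=0.5):
    """Perturb x within an L-infinity budget to push the score below
    the decision threshold, using only score queries (black-box)."""
    x = list(x)
    queries = 0
    for i in range(len(x)):
        # Probe both extremes of the budget for this feature; keep
        # whichever move lowers the score.
        for candidate in (x[i] - budget, x[i] + budget):
            trial = x[:]
            trial[i] = candidate
            queries += 1
            if risk_score(trial) < risk_score(x):
                x = trial
        if risk_score(x) < threshold:
            break
    return x, risk_score(x) < threshold, queries

original = [1.0, 0.2, 0.6]   # fabricated transaction features, flagged as fraud
# A miniature adversarial robustness curve: perturbation budget -> success.
curve = {eps: greedy_evasion(original, eps)[1] for eps in (0.1, 0.3, 0.6)}
```

The shape of `curve` is the point: at small budgets the evasion fails, and past some budget it succeeds. A real assessment reports that curve against the perturbation magnitudes plausible in the deployment domain, using PGD, C&W, or AutoAttack rather than this toy search.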
Attack Stage: Training Time
Data Poisoning — Corrupting the Training Distribution
An adversary with access to the training data pipeline — which includes any organisation that collects training data from external sources, scrapes the web, accepts user-submitted data, or uses third-party data providers — can inject carefully crafted examples that degrade overall model performance or introduce specific misclassification behaviour. Clean-label poisoning attacks corrupt model behaviour using only correctly-labelled examples, making them invisible to standard data quality review. Gradient-based poisoning attacks compute the training examples that, when added to the training set, maximally shift the model’s decision boundary towards a specific target. Organisations that retrain their models on production data are continuously exposed to production-data poisoning.
Documented production impact
A security vendor’s malware classification model was trained on samples submitted by security researchers and enterprise customers. An adversarial actor submitted carefully-crafted benign-appearing samples designed to shift the classifier’s decision boundary for a specific malware family towards the benign class. The attack was effective because clean-label poisoning examples pass standard malware analysis review — they are genuinely benign files, and the poisoning effect only emerges when the model is retrained on the batch containing them.
Assessment methodology
Training data provenance audit: mapping every data source used in training to its trust level and the adversary access it provides. Data integrity assessment: statistical testing for anomalous examples that may represent poisoning attempts using outlier detection and spectral methods. Clean-label poisoning susceptibility: measurement of the fraction of training examples that would need to be poisoned to produce a specified behavioural change in the model. Continuous retraining risk assessment for models that incorporate production data feedback.
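One small component of the data integrity step can be sketched directly: a statistical outlier screen over a training feature. The feature values below are fabricated and the z-score test is the crudest possible variant; it catches blatant poisoning, which is precisely why the text emphasises that clean-label attacks (which stay inside normal feature ranges) defeat it and require representation-space methods instead.

```python
from statistics import mean, stdev

def flag_outliers(values, z_threshold=3.0):
    """Return indices of values more than z_threshold sample standard
    deviations from the mean of the column."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if abs(v - mu) > z_threshold * sigma]

# A fabricated feature column with two crudely injected extreme values.
feature = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 9.5, 1.02, 0.98, 9.8]
suspicious = flag_outliers(feature, z_threshold=1.5)
```

A clean-label poisoning example would sit at, say, 1.03 in this column and pass this screen untouched, which is the gap the spectral and influence-based methods in the assessment are there to close.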
Attack Stage: Training Time / Supply Chain
Backdoor (Trojan) Attacks — Embedded Triggers
A backdoor attack embeds a hidden behaviour in a trained model that activates only when a specific trigger pattern is present in the input. The model behaves correctly on all normal inputs; it misclassifies only inputs containing the trigger. The trigger can be a specific pixel pattern in an image classifier, a specific phrase or token in a text classifier, or a specific feature value in a tabular classifier. Backdoor attacks are particularly dangerous in the supply chain context: a model downloaded from a public repository, a foundation model used as a starting point for fine-tuning, or a model trained by a third-party ML service provider could contain a backdoor that the organisation using it cannot detect through normal accuracy testing, because the backdoor does not affect normal inputs.
Why supply chain backdoors are the highest-risk variant
An organisation fine-tunes a publicly-available foundation model for clinical document classification. The foundation model contains a backdoor planted by an adversarial actor who contributed the model to a public repository. The backdoor trigger is a specific token sequence that appears rarely in normal clinical text but can be injected into documents by an adversary with write access to clinical documentation systems. When the trigger is present, the classifier consistently assigns the highest-risk classification regardless of document content. The backdoor survives fine-tuning because it was embedded in the foundation model weights at a depth not reached by the fine-tuning gradient updates.
Assessment methodology
Neural Cleanse and STRIP-based backdoor detection: reverse-engineering potential trigger patterns and testing for the characteristic sharp-boundary activation pattern of backdoor behaviour. Model inspection for anomalous activation clusters in internal representations. Supply chain provenance review: for each model in the ML pipeline, assessing the trust level of its source and the verification measures applied before deployment. Fine-tuning inheritance assessment: for fine-tuned models, measuring whether fine-tuning has eliminated or preserved potential backdoors from the base model.
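The STRIP intuition is simple enough to sketch: blend a suspect input with clean inputs and measure the entropy of the model's predictions. A trigger dominates the blend, so backdoored inputs keep abnormally low prediction entropy. The classifier below is a fabricated toy (its backdoor fires when the first feature exceeds 4.0), the clean pool is invented, and the 0.1 entropy cut-off is illustrative; real STRIP calibrates the threshold empirically per model.

```python
import math, random

def backdoored_model(x):
    """Toy classifier returning P(class 1). Inputs with x[0] > 4.0
    hit a planted trigger and are classified with fixed high
    confidence; otherwise behaviour is a normal sigmoid."""
    if x[0] > 4.0:
        return 0.99
    return 1.0 / (1.0 + math.exp(-(x[1] - x[2])))

def entropy(p):
    # Binary prediction entropy, clamped away from log(0).
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def strip_entropy(x, clean_pool, n_blends=20, alpha=0.5, seed=0):
    """Average prediction entropy of x blended with clean inputs.
    Trigger-carrying inputs keep abnormally LOW entropy because the
    trigger survives blending and pins the prediction."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_blends):
        c = rng.choice(clean_pool)
        blend = [alpha * xi + (1 - alpha) * ci for xi, ci in zip(x, c)]
        total += entropy(backdoored_model(blend))
    return total / n_blends

clean_pool = [[0.2, 1.5, -0.5], [0.8, -1.0, 1.0],
              [0.1, 0.3, 0.2], [0.5, 2.0, -2.0]]
triggered = [10.0, 0.0, 0.0]   # carries the trigger
benign = [0.3, 1.0, -1.0]
```

Blending destroys the signal in benign inputs (entropy rises) but not the trigger (entropy stays pinned low), which is the sharp-boundary activation pattern the methodology above looks for.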
Attack Stage: Inference Time / API Access
Model Extraction — Stealing Proprietary Models
An adversary with query access to a model’s API can reconstruct a functionally equivalent model by submitting carefully chosen inputs and observing the outputs. The reconstructed model can then be used to develop adversarial examples against the original model, to replicate the model’s intellectual property, or to circumvent the access controls and pricing that the original model provider applies. Model extraction attacks have been demonstrated to reconstruct commercial ML models with high fidelity using query budgets that are affordable to any motivated adversary. The attack is not dependent on any implementation vulnerability — it is inherent to the information exposed by any prediction API that returns confidence scores or softmax probabilities.
Documented production impact
Researchers demonstrated extraction of a commercial credit scoring model’s decision boundary to within 99.8% agreement using approximately 36,000 API queries at a total cost of approximately £9 in API fees. The extracted model reproduced the original model’s scoring decisions closely enough to be used as a substitute. The attack required only the credit score output — not the confidence scores — demonstrating that decision-only APIs are not extraction-resistant.
Assessment methodology
Extraction attack simulation using a controlled query budget against a test instance of the model: measurement of the model fidelity achievable at different query counts. API output information leakage assessment: does the API return confidence scores, class probabilities, or only class labels? Each provides different information to an extractor. Query-rate sensitivity: how much information does each query reveal, and what is the minimum query budget for a high-fidelity extraction? Watermarking feasibility assessment: whether model watermarking can be applied to detect extracted model deployment.
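The degenerate case makes the information-leakage point concrete: a linear scorer behind an API can be recovered exactly from d+1 queries. The "proprietary" weights below are fabricated, and real targets are nonlinear models needing far larger query budgets and a surrogate-training loop, but the principle — every output leaks functional behaviour — is identical.

```python
SECRET_W = [0.37, -1.2, 0.85]   # fabricated "proprietary" weights
SECRET_B = 0.1

def api(x):
    """The black-box scoring API: returns only a raw score."""
    return sum(w * xi for w, xi in zip(SECRET_W, x)) + SECRET_B

def extract_linear(query, dim):
    """Recover weights and bias of a linear scorer with dim + 1
    queries: one at the origin, one per unit basis vector."""
    b = query([0.0] * dim)
    w = []
    for i in range(dim):
        e = [0.0] * dim
        e[i] = 1.0
        w.append(query(e) - b)   # one query isolates one weight
    return w, b

w_hat, b_hat = extract_linear(api, 3)
```

Four queries, total reconstruction. The extraction simulation in the assessment generalises this: it measures, for the actual model and the information its API exposes, the fidelity achievable at each query budget.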
Attack Stage: Inference Time / API Access
Membership Inference — Exposing Training Data
A membership inference attack determines whether a specific data point was in the model’s training dataset by querying the model and analysing its response pattern. Models trained without sufficient regularisation exhibit higher confidence on training data than on unseen data — a characteristic that membership inference attacks exploit. The privacy implications are severe: for a healthcare model trained on patient records, an adversary can determine whether a specific individual’s medical data was used in training. For a model trained on confidential business data, whether a specific document was in the training set. UK GDPR requires a lawful basis for personal data processing — membership inference attacks can reveal that data was processed in ML training without the subject’s knowledge.
Regulatory consequence
An NHS trust deploys a clinical risk prediction model trained on historical patient data. A membership inference attack using the model’s API could reveal whether a specific named patient’s records were used in training. If those patients were not informed that their data would be used for ML model training and did not consent to that use, the attack reveals a UK GDPR Article 6 lawful basis violation. The attack does not require access to the training data — only to the deployed model. The vulnerability is inherent to the model’s behaviour and cannot be resolved by securing the training data after the model is deployed.
Assessment methodology
Shadow model membership inference using a subset of known training and non-training examples to calibrate the attack. Likelihood-ratio attack for models that expose confidence scores. Attack success measurement against the model’s actual training population and test population, producing the privacy leakage metric as the primary output. Differential privacy feasibility assessment: quantifying the privacy-utility trade-off of applying differential privacy training to reduce membership inference risk to acceptable levels.
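The simplest form of the attack, and the privacy leakage metric it produces, fit in a short sketch. The "model" below is a fabricated stand-in that is deliberately overconfident on its memorised training points, the member and non-member sets are invented, and the 0.9 threshold stands in for the shadow-model calibration described above.

```python
TRAIN_SET = {(1.0, 2.0), (3.0, 1.0), (0.5, 0.5)}   # fabricated training points

def model_confidence(x):
    # Overfitted stand-in: near-certain only on memorised points,
    # the behaviour membership inference exploits.
    return 0.98 if tuple(x) in TRAIN_SET else 0.70

def infer_membership(x, threshold=0.9):
    """Guess 'member' when confidence exceeds a threshold that would,
    in practice, be calibrated with shadow models."""
    return model_confidence(x) > threshold

members = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
non_members = [[9.0, 9.0], [4.0, 4.0], [2.0, 2.0]]
tpr = sum(infer_membership(x) for x in members) / len(members)
fpr = sum(infer_membership(x) for x in non_members) / len(non_members)
advantage = tpr - fpr   # leakage metric: 0 = no leakage, 1 = total leakage
```

The `advantage` figure (true-positive rate minus false-positive rate against a random-guess baseline) is one common form of the privacy leakage metric the assessment reports; a well-regularised or differentially private model pushes it towards zero.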
Attack Stage: Inference Time / High Confidence
Model Inversion — Reconstructing Training Data
A model inversion attack reconstructs representative examples of the training data by repeatedly querying the model and using gradient information to find inputs that maximise the model’s confidence in a specific class. For face recognition systems, model inversion can reconstruct recognisable facial features of training subjects. For medical classifiers, it can reconstruct characteristic features of specific diagnostic categories that may reveal information about the training population. Model inversion attacks are most effective against models that expose gradient information or detailed confidence scores, but black-box variants have been demonstrated to be effective against decision-only APIs for models with highly confident outputs in well-separated regions of the input space.
Privacy implication
A biometric authentication system was demonstrated to be vulnerable to model inversion attacks that reconstructed facial images with sufficient fidelity to be recognised by human reviewers as resembling specific training subjects, using only query access to the model. The reconstructed images were not the original training images but represented the “average face” that the model associated with a specific identity class. For high-value targets, this provides actionable intelligence for social engineering or credential theft.
Assessment methodology
Gradient-based inversion for white-box accessible models: measurement of the reconstruction fidelity achievable from gradient information at different query budgets. Decision-based inversion for black-box models: boundary-walking attacks that reconstruct input-space characteristics using only output labels. Quantification of the privacy leakage as the mutual information between model outputs and training data characteristics. Mitigation assessment: whether output perturbation, confidence truncation, or model architecture changes reduce inversion effectiveness to acceptable levels without unacceptable accuracy loss.
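The query-only end of this spectrum can be sketched with a hill climb: repeatedly nudge an input in whichever direction raises the model's class confidence, and the input converges on the class prototype the model learned. The Gaussian-bump "model" and its hidden centroid are fabricated; real attacks use gradients or boundary walking against far less convenient confidence surfaces.

```python
import math

PROTOTYPE = [2.0, -1.0]   # fabricated hidden training-data centroid

def confidence(x):
    # Toy model: confidence peaks at the training prototype.
    d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, PROTOTYPE))
    return math.exp(-d2)

def invert(dim, steps=200, step_size=0.1):
    """Coordinate-wise hill climb on confidence, using queries only."""
    x = [0.0] * dim
    for _ in range(steps):
        for i in range(dim):
            for delta in (step_size, -step_size):
                trial = x[:]
                trial[i] += delta
                if confidence(trial) > confidence(x):
                    x = trial
    return x

recovered = invert(2)   # converges towards PROTOTYPE
```

The recovered point is not a training example; it is the representative input the model associates with the class, which is exactly what makes inversion a privacy risk when that representative reveals facial features or diagnostic characteristics.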

Six ways conventional security assessment misses adversarial ML vulnerabilities entirely. Each represents a gap that a motivated adversary will find before your security team does.

The security testing methodologies that organisations currently apply to AI systems — penetration testing, OWASP Top 10 assessments, code review, dependency scanning — are necessary but not sufficient. They test the software infrastructure surrounding the model. They do not test the model itself as a security-relevant artefact with its own attack surface. The gaps below are not edge cases — they are the majority of the adversarial ML attack surface, and they are systematically missed by every conventional security methodology.

01
Penetration testing tests the application layer, not the model layer
A penetration test of an ML-powered application finds authentication weaknesses, injection vulnerabilities, insecure direct object references, and misconfigured cloud storage — the OWASP Top 10 applied to the API that wraps the model. It does not test whether the model’s predictions can be manipulated by carefully crafted inputs, whether the model leaks information about its training data, whether the model has been backdoored, or whether the model can be extracted through query interactions. These are not application-layer vulnerabilities. They are model-layer vulnerabilities. They require different tools, different expertise, and different methodology to assess. A penetration test that reports “the API is secure” has said nothing about the model’s adversarial robustness.
What the pentest report says vs. what it means
The penetration test report states: “No significant vulnerabilities were found in the API or surrounding infrastructure.” What this means: no OWASP Top 10 vulnerabilities were found in the code and configuration. What it does not address: whether a query sequence of 10,000 API calls could extract a functionally equivalent model; whether a fraud transaction perturbed by 0.3% on three features evades the classifier; whether the model contains a backdoor inherited from a foundation model. The adversary reads the OWASP report and tests the model layer, which was not tested.
What adversarial ML assessment covers that penetration testing does not
The model as an attack surface: its decision boundary geometry, its confidence output information content, its training data memorisation, its sensitivity to structured input perturbations, and the feasibility of extracting its functional behaviour through query interactions. These assessments require ML expertise to conduct, ML tools to execute, and ML knowledge to interpret.
02
Standard model evaluation measures accuracy on clean data, not robustness on adversarial data
The primary metric of ML model evaluation is accuracy on a held-out test set. This measures the model’s performance on the data distribution it was trained on, perturbed by the natural variation in that distribution. It says nothing about the model’s performance when an adversary has specifically crafted the input to cause misclassification. A model with 99.7% accuracy on clean test data may have 12% accuracy under a white-box adversarial attack — not because the model is poorly trained, but because the model has learned decision boundaries that are locally correct but globally fragile in ways that only become visible under adversarial perturbation. The standard evaluation pipeline that produces model accuracy reports produces no adversarial robustness information.
The accuracy-robustness trade-off that evaluation hides
Two classifiers: Classifier A has 99.2% clean accuracy and 18% adversarial accuracy under PGD-20 attack. Classifier B has 97.1% clean accuracy and 71% adversarial accuracy under the same attack. Standard model evaluation reports Classifier A as the better model. For a deployment context where adversarial inputs are a realistic threat, Classifier B is the better model by a large margin. The standard evaluation report does not reveal this because adversarial robustness is not a standard evaluation metric.
What adversarial evaluation adds
Adversarial accuracy curves: model performance under attacks of increasing perturbation magnitude, providing a complete characterisation of adversarial robustness rather than a single accuracy number. Certified robustness bounds where computationally feasible: formal guarantees on the maximum perturbation that cannot cause misclassification. Comparison of the model’s adversarial robustness against the attack budgets realistic for the deployment’s threat actors.
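The Classifier A vs. Classifier B comparison above can be made explicit with a deployment-weighted metric: expected accuracy under an assumed probability that any given input is adversarial. The accuracy figures are the illustrative numbers from the text, and the adversarial-input probabilities are invented; the point is that the ranking of the two models flips once the threat model is priced in.

```python
def expected_accuracy(clean_acc, adv_acc, p_adversarial):
    """Accuracy weighted by the assumed fraction of adversarial inputs."""
    return (1 - p_adversarial) * clean_acc + p_adversarial * adv_acc

A = (0.992, 0.18)   # Classifier A: clean accuracy, adversarial accuracy (PGD-20)
B = (0.971, 0.71)   # Classifier B
winner = {p: ("A" if expected_accuracy(*A, p) > expected_accuracy(*B, p) else "B")
          for p in (0.0, 0.02, 0.10)}
```

At an adversarial-input rate of zero — the implicit assumption of standard evaluation — Classifier A wins; at 10% it loses decisively. The crossover sits just under 4% for these figures, which is why the threat-actor attack budget, not the clean test set, should pick the model.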
03
Data quality review does not detect clean-label poisoning
Standard data quality processes — deduplication, format validation, labelling consistency checks, outlier detection — are designed to find genuine data quality problems: mislabelled examples, duplicated records, formatting errors, anomalously distributed features. Clean-label poisoning attacks inject examples that are correctly labelled, correctly formatted, and within the normal distribution of the feature space. They look like legitimate training examples. They pass every standard data quality check. Their effect on model behaviour only emerges after training, when the poisoned examples have shifted the model’s decision boundary in the intended direction. Data quality review that does not include adversarial poisoning detection is not a defence against clean-label poisoning.
Why clean-label poisoning is the most dangerous variant
A dataset curator performing standard data quality review examines a submitted training batch for label errors and anomalous features. All examples are correctly labelled and within normal feature ranges. The batch contains 140 clean-label poisoning examples computed to shift the classifier’s decision boundary for high-value fraud patterns. The examples pass review and are incorporated into the training set. The retrained model classifies a specific structured fraud pattern as low-risk. No quality control step detected the attack because the attack was designed specifically to pass quality control.
What data pipeline security assessment adds
Spectral signature detection: identifying poisoned examples by their anomalous representation in the model’s internal feature space, which differs from their appearance in the raw data space. Influence function analysis: measuring which training examples have disproportionate influence on specific predictions, identifying potential poisoning concentrations. Poisoning resilience measurement: quantifying the fraction of training data that would need to be poisoned to produce a specified behavioural change in the model.
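The spectral-signature idea can be sketched in miniature: centre the examples' feature-space representations, find the top singular direction by power iteration, and flag the examples with the largest projections, since a coordinated poisoning batch forms a correlated cluster along that direction. The 2-D "representations" below are fabricated; real ones come from a model's hidden layer, which is what makes the method see what raw-data review cannot.

```python
def spectral_scores(rows, iters=100):
    """Projection magnitude of each centred row onto the top
    right-singular vector of the data matrix (power iteration)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mu[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        # v <- (X^T X) v, then normalise.
        xv = [sum(c[j] * v[j] for j in range(d)) for c in centred]
        v = [sum(xv[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = sum(vj * vj for vj in v) ** 0.5
        v = [vj / norm for vj in v]
    return [abs(sum(c[j] * v[j] for j in range(d))) for c in centred]

reps = [[0.1, 0.0], [0.2, 0.1], [0.0, -0.1], [0.15, 0.05],
        [5.0, 5.1], [5.2, 4.9]]   # last two: the "poisoned" cluster
scores = spectral_scores(reps)
flagged = sorted(range(len(reps)), key=lambda i: -scores[i])[:2]
```

The poisoned pair dominates the top singular direction and is flagged, even though nothing about the individual examples (correct labels, plausible values) would fail a conventional quality check.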
04
Supply chain security scanning does not detect model backdoors
Software supply chain security tools — dependency vulnerability scanners, licence compliance checkers, SBOM generators — operate on software packages and their declared dependencies. A model weights file is not a software package. It does not have a CVE. It does not have a declared dependency tree. Software composition analysis tools do not scan model weights for backdoors. A foundation model downloaded from Hugging Face and incorporated into a production ML pipeline is a supply chain component whose security properties are entirely outside the scope of standard supply chain security tooling. The only way to assess whether it contains a backdoor is to test it for backdoor behaviour using adversarial ML methods.
The model supply chain exposure that SBOM does not address
An organisation’s SBOM lists every Python package, its version, and known CVEs. It lists the PyTorch version, transformers version, and tokenizer version. It does not list the model weights file as a supply chain component, because model weights files are not software packages and SBOM tooling does not process them. The model weights file — downloaded from a public repository with 50,000 downloads — contains a backdoor. The SBOM scan passes. The model is deployed. The backdoor is active.
What model supply chain assessment adds
Model provenance verification: establishing the chain of custody from model training to deployment, identifying points at which an adversary could have introduced modifications. Backdoor detection scanning using Neural Cleanse, ABS, and STRIP methodologies applied to every foundation model and pre-trained component in the ML pipeline before production deployment. Model integrity attestation: a signed hash of the validated model weights that verifies the deployed model is the assessed model.
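The integrity attestation step is the most mechanical of the three and can be sketched directly: tag the validated weights at assessment time, verify the tag at deployment. The sketch uses an HMAC with a hypothetical in-code key in place of a managed signing infrastructure, and the weight bytes are a placeholder; key storage and distribution are the hard part in practice.

```python
import hashlib, hmac

SIGNING_KEY = b"replace-with-managed-key"   # hypothetical; use a KMS in practice

def attest(weights_bytes):
    """Return an attestation tag for validated model weights."""
    return hmac.new(SIGNING_KEY, weights_bytes, hashlib.sha256).hexdigest()

def verify(weights_bytes, tag):
    """Check deployed weights against the assessment-time tag."""
    return hmac.compare_digest(attest(weights_bytes), tag)

validated = b"\x00\x01fake-model-weights"   # placeholder weight bytes
tag = attest(validated)
tampered = validated + b"\x00"
```

A single flipped byte in the weights file fails verification, which closes the gap the SBOM scenario above describes: the weights file becomes a tracked, verifiable supply chain component rather than an unscanned blob.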
05
Privacy impact assessments do not measure ML-specific privacy leakage
A Data Protection Impact Assessment (DPIA) conducted for an ML system addresses the privacy risks of the data processing in the conventional sense: lawful basis, data minimisation, retention, third-party sharing, security controls. It does not address the privacy risks that are specific to trained models: membership inference (can an adversary determine whose data was in the training set?), model inversion (can an adversary reconstruct training data features from the model?), and training data extraction (can an adversary recover verbatim training examples from the model?). These are privacy risks that exist after the model is trained and deployed, arising from the model itself rather than from the training data pipeline, and they require different assessment methodology from a standard DPIA.
The GDPR risk a DPIA does not capture
A DPIA for a clinical risk prediction model concludes that the processing has a lawful basis, the data is minimised, and the technical controls are appropriate. It does not assess whether the deployed model enables an adversary to determine, using only query access, whether specific named patients were in the training set — a capability that effectively reproduces the personal data processing in a form accessible to unauthorised parties. The DPIA was conducted on the training data pipeline. The UK GDPR risk from the deployed model’s information leakage was not assessed.
What ML privacy risk assessment adds
Quantitative membership inference risk: the success rate of membership inference attacks against the specific model, expressed as the privacy leakage metric (information the adversary gains about training set membership relative to a random baseline). Model inversion feasibility: whether the model exposes sufficient gradient or confidence information to enable meaningful reconstruction of training data features. Training data extraction testing for LLMs: whether the model memorises and can be induced to reproduce verbatim training examples. Differential privacy cost-benefit analysis for models where these risks exceed acceptable thresholds.
06
Monitoring and logging detect system anomalies, not adversarial intent within model behaviour
Operational monitoring systems — logs, SIEM pipelines, anomaly detection dashboards — are designed to identify deviations in system performance: latency spikes, error rates, unusual API traffic volumes, authentication failures, and infrastructure-level irregularities. These signals capture system misuse at the infrastructure and application layers. They do not capture adversarial intent encoded within statistically valid inputs to a model. An adversary interacting with an ML system can operate entirely within normal request patterns, submitting inputs that are syntactically correct, statistically plausible, and indistinguishable from legitimate user behaviour at the logging level, while systematically probing, extracting, or manipulating the model. From the perspective of system monitoring, nothing abnormal has occurred. From the perspective of the model, its decision boundary has been mapped, exploited, or degraded.
Why adversarial activity remains invisible to monitoring systems
A fraud detection model receives 25,000 API requests over a 48-hour period from a distributed set of IP addresses. Each request is valid, authenticated, and within expected rate limits. No alert is triggered. Embedded within these requests is a structured probing sequence that incrementally maps the model’s decision boundary across high-value transaction features. By the end of the sequence, the adversary has identified a narrow feature corridor that consistently bypasses detection. The monitoring system reports normal operation. The model has been strategically compromised without any detectable system anomaly.
What adversarial monitoring adds
Behavioural query analysis: identifying structured probing patterns across sequences of inputs rather than evaluating requests in isolation. Decision boundary interaction tracking: monitoring how input distributions evolve relative to the model’s classification thresholds to detect systematic exploration. Model response entropy analysis: detecting abnormal consistency or variance in outputs that indicate extraction or evasion strategies. Adversarial intent classification layered on top of standard monitoring: distinguishing benign usage from strategic interaction designed to infer, manipulate, or bypass model behaviour.
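The behavioural query analysis idea is easy to sketch: look at the distance between a client's successive queries rather than at each request in isolation. Boundary-probing sequences take small, regular steps; ordinary traffic does not. The traffic, features, and 0.05 threshold below are all fabricated, and a production detector would use per-endpoint baselines rather than a fixed cut-off.

```python
def mean_step(queries):
    """Average Euclidean distance between successive queries."""
    steps = [sum((a - b) ** 2 for a, b in zip(q1, q2)) ** 0.5
             for q1, q2 in zip(queries, queries[1:])]
    return sum(steps) / len(steps)

def looks_like_probing(queries, step_threshold=0.05):
    # Tiny, regular inter-query steps are the signature of
    # incremental decision-boundary mapping.
    return mean_step(queries) < step_threshold

# Probing client: tiny incremental perturbations of one transaction.
probe = [[100.0 + 0.01 * i, 5.0] for i in range(50)]
# Ordinary client: unrelated transactions, large inter-query distances.
normal = [[120.0, 3.0], [45.0, 8.0], [300.0, 1.0], [80.0, 6.0]]
```

Every request in `probe` is individually valid and rate-compliant, exactly as in the fraud-model scenario above; only the sequence-level view exposes the probing pattern.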

Four engagement types. Single system assessment, full ML portfolio, continuous red team programme, and supply chain assessment.

Every engagement produces a findings report with specific vulnerabilities, their severity, the attack scenarios that would exploit them in the deployment context, and the defensive specifications required to remediate them. Implementation of the defences — adversarial training, differential privacy, model watermarking, output perturbation, rate limiting — is performed by your ML engineering team from our specifications. Re-assessment after defence implementation is available as part of the engagement or as a follow-on.

Engagement Type 1
Single System Adversarial Assessment
For organisations assessing a single production or near-production ML system across all six adversarial threat categories. One model, one deployment context, one comprehensive assessment. Appropriate for high-value systems where the consequence of a successful adversarial attack is material — fraud detection, clinical decision support, identity verification, credit scoring, automated trading signals, content moderation with legal implications, or any ML system that produces outputs on which significant decisions depend. The EU AI Act high-risk system classification covers most of these use cases and mandates robustness testing under Article 15.
£18,000
Fixed · VAT excl.
6 weeks
White-box assessment (model access available) completes faster than black-box assessment (API only). Confirm access level before engagement begins.
Threat Categories Assessed
Evasion: adversarial example generation using white-box and black-box methods appropriate to the access level; adversarial robustness curve measurement; perturbation budget vs. attack success rate
Data poisoning: training data provenance audit; clean-label poisoning susceptibility measurement; continuous retraining pipeline risk assessment
Backdoor: Neural Cleanse and STRIP-based backdoor detection; foundation model backdoor inheritance assessment for fine-tuned models
Model extraction: query-based extraction simulation; information leakage quantification from API outputs; watermarking feasibility assessment
Membership inference: shadow model attack; likelihood-ratio attack; privacy leakage metric calculation
Model inversion: gradient-based and decision-based inversion feasibility; training data reconstruction assessment
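To make the membership inference category concrete, the following is a minimal sketch of the simplest attack in that family, a loss-threshold attack: a model that has overfitted assigns its training examples systematically lower loss than unseen examples, and an attacker who can observe per-example confidence can exploit that gap. The model here is simulated by synthetic confidence values, purely for illustration; a real assessment runs the attack against the client's model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(p, y):
    """Per-example binary cross-entropy loss."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated model confidences: overfit models are more confident on
# members (training data) than on non-members -- the signal the attack uses.
member_conf = rng.uniform(0.85, 0.999, 500)   # confident on training data
nonmember_conf = rng.uniform(0.5, 0.95, 500)  # less confident on unseen data
labels = np.ones(500)                          # all true class 1 here

member_loss = cross_entropy(member_conf, labels)
nonmember_loss = cross_entropy(nonmember_conf, labels)

# Loss-threshold attack: predict "member" when loss falls below a threshold
# (here simply the median of all observed losses).
tau = np.median(np.concatenate([member_loss, nonmember_loss]))
tpr = np.mean(member_loss < tau)      # members correctly flagged
fpr = np.mean(nonmember_loss < tau)   # non-members wrongly flagged
advantage = tpr - fpr                  # > 0 means measurable privacy leakage
print(f"attack advantage: {advantage:.2f}")
```

The attack advantage (true-positive rate minus false-positive rate) is one of the privacy leakage metrics reported in the assessment; the likelihood-ratio attack refines this by calibrating per-example thresholds.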
Assessment Outputs
Findings report: every vulnerability found, its severity rating (critical/high/medium/low), the specific attack scenario that would exploit it in the deployment context, and the adversary capability required
Adversarial robustness metrics: quantitative measurements for each threat category, comparable against published benchmarks for similar model architectures and tasks
Threat model mapping: which of the six threat categories are most relevant for the specific deployment context and threat actor profile, with justification
Defensive specifications: for each finding above the acceptable threshold, the specific defensive technique recommended, the expected effectiveness, and the implementation specification for the ML engineering team
EU AI Act Article 9/15 compliance evidence: the assessment results in the format required for the technical documentation and conformity assessment record
Re-assessment scope: the specific tests that should be re-run after defensive implementations are complete to verify the vulnerabilities have been adequately addressed
Access Requirements
Model access options: white-box (full model weights and architecture), grey-box (architecture known, weights not accessible), black-box (API access only). Each enables different attack methodologies; we confirm which tests are applicable at the access level before beginning.
Training data access: needed for poisoning susceptibility and membership inference testing. Anonymised or pseudonymised samples are acceptable for most tests.
Non-production environment: testing must be conducted on a non-production instance. All adversarial testing generates queries that would trigger anomalous behaviour detection if conducted against a production system.
ML team technical contact: 3 hours total across the engagement for model architecture discussion, training pipeline review, and findings walkthrough.
The most common finding that surprises ML teams
Evasion attack effectiveness consistently exceeds ML teams’ prior estimates. A model reported as achieving 99.4% accuracy is routinely found to have below 30% adversarial accuracy under moderate white-box attacks. This is not a criticism of the model’s training — it is the known characteristic of models trained to optimise clean accuracy without adversarial training. The finding surprises ML teams because adversarial robustness is not a standard training objective and is not measured in standard model evaluation. The surprise is not evidence that the model is poorly built. It is evidence that the adversarial dimension was never assessed.
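The clean-versus-adversarial accuracy gap can be demonstrated on even a trivial model. Below is an illustrative numpy sketch of the fast gradient sign method (FGSM), the simplest white-box evasion attack, applied to a toy logistic-regression classifier; the weights and data are synthetic, and real assessments use stronger iterative attacks, but the mechanism is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy logistic-regression "model" with fixed weights; labels are defined
# from the model itself, so clean accuracy is 100% by construction.
w = np.array([2.0, -1.5])
b = 0.1
X = rng.normal(0, 2, (1000, 2))
y = (X @ w + b > 0).astype(float)

def predict(X):
    return (X @ w + b > 0).astype(float)

def fgsm(X, y, eps):
    """FGSM: x_adv = x + eps * sign(grad_x loss). For logistic loss the
    input gradient is (sigmoid(z) - y) * w, which pushes every example
    straight toward the decision boundary."""
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    grad = (p - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad)

clean_acc = np.mean(predict(X) == y)
adv_acc = np.mean(predict(fgsm(X, y, eps=2.0)) == y)
print(f"clean accuracy: {clean_acc:.2f}, adversarial accuracy: {adv_acc:.2f}")
```

Sweeping the perturbation budget `eps` and recording attack success rate at each value produces the adversarial robustness curve referenced in the Type 1 assessment outputs.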
What Your Team Must Provide
Model access: confirmed access level (white/grey/black-box) and access credentials to a non-production instance before the engagement begins
Deployment context documentation: the operational environment, the inputs the model receives in production, the outputs it produces and what actions those outputs drive, and the threat actor profile for the deployment context
Training data sample: 200–500 training examples and a comparable number of held-out non-training examples for membership inference testing — these do not need to be the full training dataset
Architecture documentation: model architecture, training methodology, and foundation model provenance if fine-tuned
What Is Not in This Engagement
Defence implementation: all adversarial training, differential privacy application, output perturbation, and rate limiting implemented by your ML engineering team from our defensive specifications
Re-assessment after defence implementation: available at £6,500 for a focused re-test of the specific vulnerabilities addressed
Application layer security testing: prompt injection, API authentication, infrastructure security — these are conventional penetration testing scope, not adversarial ML assessment scope
Full EU AI Act conformity assessment support: the assessment produces the Article 9/15 technical evidence; the full conformity assessment programme is covered under AI Systems Engineering
Engagement Type 2
ML Portfolio Adversarial Assessment
For organisations operating multiple production ML systems that require independent adversarial assessment of each, or where the interaction between ML systems in a pipeline creates emergent adversarial vulnerabilities not present in any system assessed independently. Appropriate for organisations with 3–10 ML systems in production or near-production. The portfolio assessment applies a threat-model-calibrated triage to each system, conducting full assessment of critical systems and focused assessment of lower-risk systems — allocating assessment depth proportionally to the actual risk rather than uniformly across all systems regardless of criticality.
£48,000
Fixed · up to 5 systems · VAT excl.
14 weeks
5 systems at standard depth. Additional systems at £7,500 each. Pipeline interaction assessment adds 3–4 weeks for complex ML pipelines.
Portfolio Triage
Risk-based prioritisation: each system assessed against the adversarial risk dimensions (data sensitivity, adversary motivation, attack feasibility, consequence of successful attack) before assessment depth is allocated
Shared vulnerability identification: vulnerabilities in shared training data pipelines, shared feature stores, or shared foundation models that affect multiple systems simultaneously
Pipeline interaction assessment: for ML systems that feed each other’s inputs, assessment of cascading attack scenarios where an adversarial perturbation crafted for an upstream model affects downstream model behaviour
Common attack surface mapping: shared API infrastructure, shared authentication, shared monitoring — common attack surfaces that affect multiple systems
Assessment Depth by Risk Tier
Critical systems (typically 1–2): full six-category assessment equivalent to Type 1, including all attack methodologies and quantitative robustness metrics
High-risk systems (typically 2–3): focused assessment on the two highest-priority threat categories for each system’s deployment context, plus extraction and membership inference
Medium-risk systems: threat model review and the single highest-priority threat category assessment, plus findings summary for the remaining categories based on architecture analysis without full attack simulation
Assessment depth decisions are documented with the reasoning for each allocation, enabling the client to understand what was assessed and at what depth for each system
Portfolio Outputs
Portfolio risk register: all systems ranked by adversarial risk with findings and severity ratings per system
Cross-system vulnerability map: vulnerabilities affecting multiple systems through shared components
Prioritised remediation roadmap: defensive implementation sequence across the portfolio, with effort estimates per system and cross-system remediation dependencies
Portfolio EU AI Act evidence pack: Article 9/15 documentation for each system at the appropriate conformity assessment level
Annual re-assessment recommendation: which systems should be re-assessed annually, which quarterly, and which after any significant model update
Engagement Type 3
Continuous Red Team Programme
For organisations where adversarial ML risk is a sustained operational concern rather than a one-time assessment — where models are retrained regularly on production data (creating continuous poisoning exposure), where new model versions are deployed frequently, where the threat actor profile is sophisticated and actively targeting the organisation’s AI systems, or where regulatory requirements mandate ongoing adversarial testing. The continuous programme provides scheduled assessment cadences calibrated to the model update frequency and threat profile, with continuous monitoring for the attack patterns most relevant to each system.
From £65,000/yr
Annual retainer · quarterly in arrears · VAT excl.
Ongoing
Initial scoping and baseline assessment programme: 8–12 weeks. Ongoing monitoring and assessment cadence begins after baseline.
Continuous Monitoring
Adversarial query pattern detection: monitoring for API query patterns consistent with model extraction or adversarial example development, with alert and response procedures
Training data poisoning detection: continuous application of spectral and influence-based poisoning detection to each training batch before model retraining
Model drift monitoring: tracking shifts in the model’s decision boundary over retraining cycles, alerting when drift is consistent with poisoning rather than legitimate distributional shift
New attack methodology monitoring: tracking the adversarial ML research literature for new attack methodologies applicable to the organisation’s model architectures, with assessment of relevance before each assessment cycle
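The spectral poisoning detection referenced above can be illustrated in a few lines. The sketch below follows the spectral-signatures approach (Tran et al.): poisoned examples in a class concentrate along the top singular direction of the centred feature representations, so their projection scores stand out as outliers. The representations here are synthetic stand-ins for a model's penultimate-layer activations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "penultimate layer" representations for one class:
# 950 clean examples plus 50 poisoned ones shifted along a hidden direction.
clean = rng.normal(0, 1, (950, 64))
trigger_dir = np.zeros(64)
trigger_dir[0] = 8.0
poisoned = rng.normal(0, 1, (50, 64)) + trigger_dir
reps = np.vstack([clean, poisoned])

# Spectral-signature score: project centred representations onto the top
# singular vector; poisoned examples concentrate along it and receive
# outlier scores.
centred = reps - reps.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = (centred @ vt[0]) ** 2

# Flag the top-scoring examples for removal before retraining.
flagged = np.argsort(scores)[-50:]
n_poison_caught = np.sum(flagged >= 950)
print(f"poisoned examples among top-50 scores: {n_poison_caught}/50")
```

In the continuous programme this scoring runs against each training batch before retraining; influence-based methods complement it where the poisoning direction is less separable.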
Assessment Cadence
Quarterly: evasion and extraction assessment for critical models, poisoning pipeline review for continuously retrained models
Post-model-update: focused re-assessment of the threat categories most affected by the model changes, using the pre-update baseline as comparison
Annual: full six-category assessment equivalent to Type 1 for all critical models, with comparison against the previous annual assessment to track robustness improvement or degradation trends
On-demand: for specific threat intelligence or for organisations that have detected or suspect an active adversarial attack against a production model
Programme Outputs
Quarterly assessment reports per critical system
Annual robustness trend report: how each system’s adversarial robustness has evolved across the programme year, identifying improvement and regression
Incident response support: when a suspected adversarial attack against a production model is detected, root cause analysis and response specification within defined SLA
Regulatory reporting evidence: continuous assessment programme documentation suitable for DORA ICT risk reporting and EU AI Act post-market monitoring evidence
Threat intelligence briefings: quarterly briefings on new adversarial ML attack techniques and their relevance to the organisation’s specific AI system portfolio
Engagement Type 4
ML Supply Chain Security Assessment
A focused engagement specifically for organisations that use pre-trained foundation models, fine-tune models from public repositories, use third-party ML APIs as components in their systems, or incorporate ML models from vendors or partners whose training processes are not transparent. The supply chain assessment addresses the backdoor and poisoning risks that arise from third-party model provenance — the risks that SBOM tools, penetration tests, and standard security assessments do not touch. Can be conducted independently or as a precursor to a Type 1 or Type 2 assessment.
£14,500
Fixed · up to 5 models · VAT excl.
4 weeks
Each model in the supply chain assessed independently. Above 5 models: £2,500 per additional model.
What the Assessment Covers
Backdoor detection: Neural Cleanse, ABS, and STRIP applied to each model in scope — the established techniques for detecting trigger-based backdoors in trained models
Provenance chain analysis: for each model, tracing the source from the published repository through any fine-tuning or modification steps to the deployed version — identifying points in the chain where unauthorised modification could have occurred
Fine-tuning inheritance assessment: for models that were fine-tuned from a foundation model, assessing whether fine-tuning has preserved, modified, or eliminated potential backdoors from the base model
Third-party API trust assessment: for ML APIs used as components (sentiment analysis, image classification, transcription, translation), assessing the information exposed by the API that could be exploited in downstream model poisoning or evasion
Model integrity specification: the hash-based integrity verification and signed provenance attestation specification that prevents supply chain substitution attacks
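Of the backdoor detection techniques listed above, STRIP is the most compact to illustrate. Its principle: blend a suspect input with random clean inputs and measure the entropy of the model's predictions across blends. A clean input produces varied, high-entropy predictions once perturbed; an input carrying a backdoor trigger keeps being confidently classified as the attacker's target class, yielding anomalously low entropy. The toy backdoored classifier below is entirely illustrative, standing in for a real model under assessment.

```python
import numpy as np

rng = np.random.default_rng(3)
N_CLASSES = 10

def model(x):
    """Toy backdoored classifier (illustrative only): when the 'trigger'
    feature x[0] exceeds a threshold it confidently outputs class 7;
    otherwise it produces an input-dependent softmax over 10 classes."""
    if x[0] > 3.0:                       # hidden trigger condition
        p = np.full(N_CLASSES, 0.01)
        p[7] = 0.91
        return p
    logits = np.tanh(x[:N_CLASSES])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def strip_entropy(x, clean_pool, n_blend=32, alpha=0.5):
    """STRIP-style score: blend x with random clean inputs and average
    the prediction entropy across blends."""
    ents = []
    for _ in range(n_blend):
        blend = alpha * x + (1 - alpha) * clean_pool[rng.integers(len(clean_pool))]
        p = model(blend)
        ents.append(-np.sum(p * np.log(p + 1e-12)))
    return np.mean(ents)

clean_pool = rng.normal(0, 1, (200, 16))
clean_input = rng.normal(0, 1, 16)
trigger_input = rng.normal(0, 1, 16)
trigger_input[0] = 10.0                  # trigger present

e_clean = strip_entropy(clean_input, clean_pool)
e_trigger = strip_entropy(trigger_input, clean_pool)
print(f"entropy clean={e_clean:.2f}, trigger={e_trigger:.2f}")
```

Inputs whose blended-prediction entropy falls below a calibrated threshold are flagged as trigger candidates; Neural Cleanse and ABS then reverse-engineer the trigger pattern itself.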
What the Assessment Produces
Supply chain risk register: each model in scope with its provenance confidence level and backdoor detection result
Backdoor findings: any identified backdoor candidates with the specific trigger pattern, the affected classification boundary, and the confidence level of the detection
Provenance gap report: points in the model provenance chain where the chain of custody cannot be verified and where adversarial modification cannot be ruled out
Model integrity implementation specification: the cryptographic hash, signature, and verification procedures for implementing model integrity attestation before production deployment
Supply chain policy recommendations: model acceptance criteria, required provenance documentation, and ongoing verification cadence for the organisation’s ML model procurement process
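The model integrity implementation specification reduces, at its core, to a small amount of code: hash the model artifact at acceptance time, record the digest in a provenance record, and refuse deployment when the artifact no longer matches. The sketch below shows the mechanism with Python's standard library; file names and the attestation format are illustrative, and a production specification would additionally sign the attestation record (e.g. with an internal signing key or Sigstore) rather than trusting a bare JSON file.

```python
import hashlib
import json
from pathlib import Path

def hash_model_artifact(path: str) -> str:
    """SHA-256 of a model file, streamed so multi-GB weights fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_attestation(model_path: str, source: str, out_path: str) -> None:
    """Record the expected digest alongside provenance metadata."""
    record = {"file": Path(model_path).name,
              "sha256": hash_model_artifact(model_path),
              "source": source}
    Path(out_path).write_text(json.dumps(record, indent=2))

def verify_before_deploy(model_path: str, attestation_path: str) -> bool:
    """Refuse deployment when the artifact does not match its attestation."""
    record = json.loads(Path(attestation_path).read_text())
    return hash_model_artifact(model_path) == record["sha256"]

# Usage: attest at model acceptance, verify at every deployment.
Path("model.bin").write_bytes(b"\x00" * 1024)   # stand-in for real weights
write_attestation("model.bin", source="internal-registry/v3",
                  out_path="model.attest.json")
assert verify_before_deploy("model.bin", "model.attest.json")

Path("model.bin").write_bytes(b"\x01" * 1024)   # simulated substitution attack
assert not verify_before_deploy("model.bin", "model.attest.json")
```

Verification at deployment time, not only at download time, is what closes the substitution window: a hash checked once at acceptance does not detect a later swap in the artifact store.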
Who This Is For Specifically
Any organisation that has downloaded a model from Hugging Face, GitHub, or any public model repository and deployed it in a production system — regardless of the model’s download count or the repository’s reputation
Any organisation fine-tuning a foundation model provided by a commercial vendor without access to the foundation model’s full training process
Any organisation using ML models developed by an acquisition target, a partner, or a subcontractor as part of a due diligence or vendor security assessment
Any organisation whose AI risk register includes “supply chain attack” as a threat but has taken no specific action to assess or mitigate it
The “50,000 downloads means it’s safe” assumption
Model repositories do not perform adversarial security testing on submitted models. Download count is not a security signal — a backdoored model that is useful for its stated purpose will be downloaded and used precisely because it works well on normal inputs. The backdoor only activates when the specific trigger is present. A backdoored model in a public repository will have positive reviews and high download counts from users who never encountered the trigger, and will be assessed as high-quality. Download count is a proxy for usefulness on normal inputs. It is not a proxy for adversarial safety.

Client Obligations
Provide access to a non-production instance — adversarial testing cannot be conducted against a live production system
Adversarial ML testing generates query patterns that are fundamentally different from legitimate use: high-volume structured queries designed to probe decision boundaries, queries designed to elicit maximum confidence in misclassified inputs, systematic API interactions that no legitimate user would produce. Conducting this testing against a production system would trigger rate limiting, anomaly detection, and security alerts, would affect real users, and would violate the responsible disclosure obligations of any system that processes personal data. A non-production instance must be provisioned before the engagement begins. The non-production instance must be a genuine replica — the same model weights, the same API configuration, the same output format. A different model or a simplified configuration produces results that do not apply to the production system.
If a non-production instance cannot be provisioned
Adversarial testing cannot be conducted. The engagement is limited to architecture analysis, threat modelling, and defensive specification based on the model architecture and known vulnerabilities of that architecture class. This is disclosed before the engagement begins, not after the testing phase reveals it is not possible.
Findings are disclosed to the ML team and used to implement defences — not suppressed because the results are uncomfortable
Adversarial ML findings are frequently surprising to ML teams who have not previously conducted adversarial assessment. A model that the team is confident in — one with high accuracy on clean test data and positive user feedback — may have significant adversarial vulnerabilities. The purpose of the assessment is to find these vulnerabilities before an adversary does. Finding them is the successful outcome, not a failure of the model or the team. The obligation is that the findings are disclosed to the teams responsible for implementing the defences and are acted upon according to the severity-prioritised remediation roadmap. Findings that are disclosed but not acted upon provide no security improvement.
If critical findings are not acted upon
We document in writing that critical findings were delivered and not addressed within the recommended timeframe. For EU AI Act high-risk systems, unaddressed critical findings represent a failure of the Article 9 risk management system obligation. We do not retain responsibility for outcomes from vulnerabilities identified in the assessment and not remediated.
RJV Obligations
All adversarial testing conducted under explicit written authorisation and within the agreed scope — no out-of-scope testing, no production system access
Adversarial ML testing generates attacks against the client’s system. This is authorised testing; it must be explicitly bounded. Before any testing begins, we provide a testing scope document that specifies: the systems to be tested, the access level granted, the query budget for extraction testing, the specific attack methodologies to be applied, and the systems explicitly excluded from testing. No testing is conducted outside the agreed scope. The scope document is signed by both parties before testing begins. Any discovery during testing that suggests a potentially out-of-scope system is at risk is reported to the client immediately — we do not expand testing without explicit re-authorisation.
If we discover a vulnerability during testing that suggests a system outside the agreed scope is at risk
We report the finding to the client immediately and provide a brief description of the suspected vulnerability. We do not test the out-of-scope system without explicit written authorisation from the client.
Attack success rates calibrated to the realistic adversary capability for the deployment context — not against adversaries with unlimited compute
Adversarial ML attacks can be conducted with different levels of adversary capability: from “no model access, limited queries” to “full white-box access, unlimited compute.” The relevant assessment is against the adversary capability that is realistic for the deployment context — not against the strongest possible adversary unless that adversary profile is genuinely applicable. We calibrate the attack budget and methodology to the threat model for each deployment context and document the adversary capability assumptions for each test. A finding reported as “critical” means critical against a realistic adversary for this deployment. We do not present results from unrealistically powerful adversaries as the relevant risk level, because this produces defensive investments that are not calibrated to actual risk.
If the client wants assessment against a more powerful adversary profile than we recommend for the deployment context
We conduct the additional testing and report the results separately from the realistic-adversary results, with explicit documentation that the stronger adversary results represent an elevated threat model not justified by the current deployment context. You receive both datasets.

The assessment session answers one question: which of your production AI systems would a motivated adversary focus on first, and what would they find?

90 minutes reviewing your AI system portfolio: the models in production, their deployment contexts, the data they are trained on, the adversary profiles relevant to your organisation, and the access your APIs provide to potential attackers. We give you a preliminary threat model assessment identifying the highest-priority adversarial ML risks in your specific portfolio and recommending which engagement type is appropriate for each system. You leave knowing what you do not know about the adversarial robustness of your AI systems.

If your organisation has deployed production AI systems and has not conducted adversarial ML assessment, the probability that significant adversarial vulnerabilities exist is high — not because the systems were poorly built, but because adversarial robustness is not a standard training objective and adversarial assessment is not a standard part of ML deployment practice. The assessment session is the fastest way to understand your current position.

Format
Video call or in-person in London. 90 minutes.
Cost
Free. No commitment.
Lead time
Within 5 business days of contact.
Bring
A description of your AI systems in production: model types, deployment contexts, what decisions the model outputs drive, and who has query access to each model’s API. Your organisation’s threat profile: who would be motivated to attack your AI systems and what they would gain. Any existing security assessments of these systems and what they covered. Whether you use pre-trained or fine-tuned foundation models and from which sources.
Attendees
ML lead or principal data scientist and the security lead or CISO. Both are needed — the model architecture context and the threat model context must be in the same room. From RJV: a specialist in adversarial machine learning with both ML and security backgrounds.
After
Written preliminary threat model assessment within 2 business days. Fixed-price scope proposal within 5 business days.