The security testing methodologies that organisations currently apply to AI systems — penetration testing, OWASP Top 10 assessments, code review, dependency scanning — are necessary but not sufficient. They test the software infrastructure surrounding the model. They do not test the model itself as a security-relevant artefact with its own attack surface. The gaps below are not edge cases — they are the majority of the adversarial ML attack surface, and they are systematically missed by every conventional security methodology.
01
Penetration testing tests the application layer, not the model layer
A penetration test of an ML-powered application finds authentication weaknesses, injection vulnerabilities, insecure direct object references, and misconfigured cloud storage — the OWASP Top 10 applied to the API that wraps the model. It does not test whether the model’s predictions can be manipulated by carefully crafted inputs, whether the model leaks information about its training data, whether the model has been backdoored, or whether the model can be extracted through query interactions. These are not application-layer vulnerabilities. They are model-layer vulnerabilities. They require different tools, different expertise, and different methodology to assess. A penetration test that reports “the API is secure” has said nothing about the model’s adversarial robustness.
What the pentest report says vs. what it means
The penetration test report states: “No significant vulnerabilities were found in the API or surrounding infrastructure.” What this means: no OWASP Top 10 vulnerabilities were found in the code and configuration. What it does not address: whether a query sequence of 10,000 API calls could extract a functionally equivalent model; whether a fraud transaction perturbed by 0.3% on three features evades the classifier; whether the model contains a backdoor inherited from a foundation model. The adversary who reads that report attacks the one layer it never examined: the model.
What adversarial ML assessment covers that penetration testing does not
The model as an attack surface: the geometry of its decision boundary, the information content of its confidence outputs, its memorisation of training data, its sensitivity to structured input perturbations, and the feasibility of extracting its functional behaviour through query interactions. These assessments require ML expertise to conduct, ML tools to execute, and ML knowledge to interpret.
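The extraction risk is easy to make concrete with a toy sketch. Assuming a hypothetical linear fraud classifier exposed only through a `query_victim` label interface (every name, weight, and query count below is invented for illustration, not a real attack tool), an adversary can fit a functionally similar surrogate from query responses alone:

```python
import random

random.seed(0)

# Hypothetical victim: a linear classifier exposed only through query access.
# The adversary never sees these weights -- only the label returned per query.
_W = [1.5, -2.0, 0.7]
_B = -0.2

def query_victim(x):
    """The only interface the adversary has: input in, label out."""
    score = sum(w * xi for w, xi in zip(_W, x)) + _B
    return 1 if score > 0 else 0

# Extraction: label random probes via the API, then fit a surrogate perceptron.
probes = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2000)]
labels = [query_victim(x) for x in probes]

sw, sb = [0.0, 0.0, 0.0], 0.0
for _ in range(20):                      # a few perceptron epochs
    for x, y in zip(probes, labels):
        pred = 1 if sum(w * xi for w, xi in zip(sw, x)) + sb > 0 else 0
        err = y - pred
        if err:
            sw = [w + 0.1 * err * xi for w, xi in zip(sw, x)]
            sb += 0.1 * err

# Agreement on fresh inputs: how functionally equivalent is the stolen copy?
test = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
agreement = sum(
    query_victim(x) == (1 if sum(w * xi for w, xi in zip(sw, x)) + sb > 0 else 0)
    for x in test
) / len(test)
print(f"surrogate agreement with victim: {agreement:.1%}")
```

Nothing in this interaction looks like an OWASP finding: every request is a valid, authenticated API call.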
02
Standard model evaluation measures accuracy on clean data, not robustness on adversarial data
The primary metric of ML model evaluation is accuracy on a held-out test set. This measures the model’s performance on the data distribution it was trained on, perturbed by the natural variation in that distribution. It says nothing about the model’s performance when an adversary has specifically crafted the input to cause misclassification. A model with 99.7% accuracy on clean test data may have 12% accuracy under a white-box adversarial attack — not because the model is poorly trained, but because the model has learned decision boundaries that are locally correct but globally fragile in ways that only become visible under adversarial perturbation. The standard evaluation pipeline that generates model accuracy reports yields no adversarial robustness information.
The accuracy-robustness trade-off that evaluation hides
Two classifiers: Classifier A has 99.2% clean accuracy and 18% adversarial accuracy under PGD-20 attack. Classifier B has 97.1% clean accuracy and 71% adversarial accuracy under the same attack. Standard model evaluation reports Classifier A as the better model. For a deployment context where adversarial inputs are a realistic threat, Classifier B is the better model by a large margin. The standard evaluation report does not reveal this because adversarial robustness is not a standard evaluation metric.
What adversarial evaluation adds
Adversarial accuracy curves: model performance under attacks of increasing perturbation magnitude, providing a complete characterisation of adversarial robustness rather than a single accuracy number. Certified robustness bounds where computationally feasible: formal guarantees on the maximum perturbation that cannot cause misclassification. Comparison of the model’s adversarial robustness against the attack budgets realistic for the deployment’s threat actors.
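As a minimal illustration of an adversarial accuracy curve, the sketch below attacks a toy linear model with the Fast Gradient Sign Method (FGSM) at increasing perturbation magnitudes. The model, synthetic data, and epsilon budgets are all invented for the example; a real assessment would use stronger iterative attacks such as PGD:

```python
import random

random.seed(1)

# Toy linear classifier standing in for a trained model. The weights are
# assumed known: this is the white-box setting the section describes.
W = [2.0, -1.0]
B = 0.0

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1 if z > 0 else 0

# Synthetic labelled data; labels agree with the model, so clean accuracy is 100%.
data = []
for _ in range(500):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    data.append((x, predict(x)))

def fgsm(x, y, eps):
    """FGSM for a linear model: step each feature by eps in the direction
    that increases the loss, i.e. against the correct class."""
    sign = 1 if y == 1 else -1
    return [xi - sign * eps * (1 if w > 0 else -1) for xi, w in zip(x, W)]

# Adversarial accuracy curve: accuracy under attacks of increasing magnitude.
curve = {}
for eps in [0.0, 0.1, 0.5, 1.0, 2.0]:
    correct = sum(predict(fgsm(x, y, eps)) == y for x, y in data)
    curve[eps] = correct / len(data)
    print(f"eps={eps:<4} adversarial accuracy={curve[eps]:.1%}")
```

The single clean-accuracy number is the eps=0 point of this curve; everything to its right is what standard evaluation never reports.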
03
Data quality review does not detect clean-label poisoning
Standard data quality processes — deduplication, format validation, labelling consistency checks, outlier detection — are designed to find genuine data quality problems: mislabelled examples, duplicated records, formatting errors, anomalously distributed features. Clean-label poisoning attacks inject examples that are correctly labelled, correctly formatted, and within the normal distribution of the feature space. They look like legitimate training examples. They pass every standard data quality check. Their effect on model behaviour only emerges after training, when the poisoned examples have shifted the model’s decision boundary in the intended direction. Data quality review that does not include adversarial poisoning detection is not a defence against clean-label poisoning.
Why clean-label poisoning is the most dangerous variant
A dataset curator performing standard data quality review examines a submitted training batch for label errors and anomalous features. All examples are correctly labelled and within normal feature ranges. The batch contains 140 clean-label poisoning examples computed to shift the classifier’s decision boundary for high-value fraud patterns. The examples pass review and are incorporated into the training set. The retrained model classifies a specific structured fraud pattern as low-risk. No quality control step detected the attack because the attack was designed specifically to pass quality control.
What data pipeline security assessment adds
Spectral signature detection: identifying poisoned examples by their anomalous representation in the model’s internal feature space, which differs from their appearance in the raw data space. Influence function analysis: measuring which training examples have disproportionate influence on specific predictions, identifying potential poisoning concentrations. Poisoning resilience measurement: quantifying the fraction of training data that would need to be poisoned to produce a specified behavioural change in the model.
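A rough sketch of the spectral signature idea (after Tran et al.): score each example by its correlation with the top singular vector of the centred feature representations, where a planted group shares one anomalous direction. The feature dimensions, counts, and shift magnitude below are synthetic stand-ins for real penultimate-layer activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical penultimate-layer representations: 480 clean examples plus 20
# poisoned ones sharing a spectral direction in feature space, even though
# their raw inputs looked unremarkable to data quality review.
clean = rng.normal(0, 1, size=(480, 32))
backdoor_direction = rng.normal(0, 1, size=32)
backdoor_direction /= np.linalg.norm(backdoor_direction)
poisoned = rng.normal(0, 1, size=(20, 32)) + 6.0 * backdoor_direction

feats = np.vstack([clean, poisoned])

# Spectral signature score: |projection| of each centred representation
# onto the batch's top singular vector.
centred = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = np.abs(centred @ vt[0])

# Flag the highest-scoring examples for review.
flagged = np.argsort(scores)[-20:]
hits = int(np.sum(flagged >= 480))       # indices >= 480 are the planted poison
print(f"{hits}/20 poisoned examples among top-20 spectral scores")
```

The key point matches the section: the separation exists in the model's representation space, not in the raw data space that quality review inspects.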
04
Supply chain security scanning does not detect model backdoors
Software supply chain security tools — dependency vulnerability scanners, licence compliance checkers, SBOM generators — operate on software packages and their declared dependencies. A model weights file is not a software package. It does not have a CVE. It does not have a declared dependency tree. Software composition analysis tools do not scan model weights for backdoors. A foundation model downloaded from Hugging Face and incorporated into a production ML pipeline is a supply chain component whose security properties are entirely outside the scope of standard supply chain security tooling. The only way to assess whether it contains a backdoor is to test it for backdoor behaviour using adversarial ML methods.
The model supply chain exposure that SBOM does not address
An organisation’s SBOM lists every Python package, its version, and known CVEs. It lists the PyTorch version, transformers version, and tokenizer version. It does not list the model weights file as a supply chain component, because model weights files are not software packages and SBOM tooling does not process them. The model weights file — downloaded from a public repository with 50,000 downloads — contains a backdoor. The SBOM scan passes. The model is deployed. The backdoor is active.
What model supply chain assessment adds
Model provenance verification: establishing the chain of custody from model training to deployment, identifying points at which an adversary could have introduced modifications. Backdoor detection scanning using Neural Cleanse, ABS, and STRIP methodologies applied to every foundation model and pre-trained component in the ML pipeline before production deployment. Model integrity attestation: a signed hash of the validated model weights that verifies the deployed model is the assessed model.
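Of the three steps above, integrity attestation needs no ML tooling at all. A minimal sketch, assuming an HMAC signing key that in practice would live in a KMS or HSM, and using a temporary stand-in for the weights file:

```python
import hashlib, hmac, os, tempfile

# Assumption for the sketch: in production this key comes from a KMS/HSM,
# never a source file.
SIGNING_KEY = b"replace-with-org-managed-key"

def attest(weights_path: str) -> str:
    """Return an HMAC-SHA256 attestation tag over the weights file contents."""
    with open(weights_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return hmac.new(SIGNING_KEY, digest.encode(), hashlib.sha256).hexdigest()

def verify(weights_path: str, tag: str) -> bool:
    """Check the deployed file is byte-identical to the assessed file."""
    return hmac.compare_digest(attest(weights_path), tag)

# Demo with a stand-in weights file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00fake-model-weights\x01")
    path = f.name

tag = attest(path)                       # computed after backdoor scanning passes
ok_before = verify(path, tag)            # True: deployed file is the assessed file

with open(path, "ab") as f:              # simulate a single-byte tamper
    f.write(b"\x00")
ok_after = verify(path, tag)             # False: attestation fails

os.remove(path)
print(ok_before, ok_after)
```

Attestation only proves the deployed artefact is the one that was assessed; the backdoor scanning itself still has to happen before the tag is issued.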
05
Privacy impact assessments do not measure ML-specific privacy leakage
A Data Protection Impact Assessment (DPIA) conducted for an ML system addresses the privacy risks of the data processing in the conventional sense: lawful basis, data minimisation, retention, third-party sharing, security controls. It does not address the privacy risks that are specific to trained models: membership inference (can an adversary determine whose data was in the training set?), model inversion (can an adversary reconstruct training data features from the model?), and training data extraction (can an adversary recover verbatim training examples from the model?). These are privacy risks that exist after the model is trained and deployed, arising from the model itself rather than from the training data pipeline, and they require different assessment methodology from a standard DPIA.
The GDPR risk a DPIA does not capture
A DPIA for a clinical risk prediction model concludes that the processing has a lawful basis, the data is minimised, and the technical controls are appropriate. It does not assess whether the deployed model enables an adversary to determine, using only query access, whether specific named patients were in the training set — a capability that effectively reproduces the personal data processing in a form accessible to unauthorised parties. The DPIA was conducted on the training data pipeline. The UK GDPR risk from the deployed model’s information leakage was not assessed.
What ML privacy risk assessment adds
Quantitative membership inference risk: the success rate of membership inference attacks against the specific model, expressed as the privacy leakage metric (information the adversary gains about training set membership relative to a random baseline). Model inversion feasibility: whether the model exposes sufficient gradient or confidence information to enable meaningful reconstruction of training data features. Training data extraction testing for LLMs: whether the model memorises and can be induced to reproduce verbatim training examples. Differential privacy cost-benefit analysis for models where these risks exceed acceptable thresholds.
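To make the membership inference metric concrete, the sketch below uses a deliberately memorising 1-nearest-neighbour "model" as a stand-in for an overfit classifier, and reports the attacker's advantage (true positive rate minus false positive rate, the leakage-over-random-baseline measure described above). All data and the threshold are synthetic:

```python
import math, random

random.seed(2)

# A memorising model: 1-NN stores its training set verbatim, so its confidence
# signal leaks membership. Real attacks threshold loss or confidence similarly.
train = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]

def confidence(x):
    """Model 'confidence' derived from distance to the nearest training point."""
    d = min(math.dist(x, t) for t in train)
    return math.exp(-d)            # distance 0 (a training member) -> confidence 1.0

# Attack: guess 'member' whenever confidence exceeds a threshold.
members = train[:100]
nonmembers = [[random.gauss(0, 1) for _ in range(4)] for _ in range(100)]

threshold = 0.9
tp = sum(confidence(x) > threshold for x in members)       # members caught
fp = sum(confidence(x) > threshold for x in nonmembers)    # false alarms

# Advantage over random guessing: 0 = no leakage, 1 = total leakage.
advantage = tp / 100 - fp / 100
print(f"membership inference advantage: {advantage:.2f}")
```

For the clinical model in the DPIA example, this advantage number, measured against the deployed model with only query access, is precisely the risk the DPIA never quantified.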
06
Monitoring and logging detect system anomalies, not adversarial intent within model behaviour
Operational monitoring systems — logs, SIEM pipelines, anomaly detection dashboards — are designed to identify deviations in system performance: latency spikes, error rates, unusual API traffic volumes, authentication failures, and infrastructure-level irregularities. These signals capture system misuse at the infrastructure and application layers. They do not capture adversarial intent encoded within statistically valid inputs to a model. An adversary interacting with an ML system can operate entirely within normal request patterns, submitting inputs that are syntactically correct, statistically plausible, and indistinguishable from legitimate user behaviour at the logging level, while systematically probing, extracting, or manipulating the model. From the perspective of system monitoring, nothing abnormal has occurred. From the perspective of the model, its decision boundary has been mapped, exploited, or degraded.
Why adversarial activity remains invisible to monitoring systems
A fraud detection model receives 25,000 API requests over a 48-hour period from a distributed set of IP addresses. Each request is valid, authenticated, and within expected rate limits. No alert is triggered. Embedded within these requests is a structured probing sequence that incrementally maps the model’s decision boundary across high-value transaction features. By the end of the sequence, the adversary has identified a narrow feature corridor that consistently bypasses detection. The monitoring system reports normal operation. The model has been strategically compromised without any detectable system anomaly.
What adversarial monitoring adds
Behavioural query analysis: identifying structured probing patterns across sequences of inputs rather than evaluating requests in isolation. Decision boundary interaction tracking: monitoring how input distributions evolve relative to the model’s classification thresholds to detect systematic exploration. Model response entropy analysis: detecting abnormal consistency or variance in outputs that indicate extraction or evasion strategies. Adversarial intent classification layered on top of standard monitoring: distinguishing benign usage from strategic interaction designed to infer, manipulate, or bypass model behaviour.
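The behavioural query analysis idea can be sketched in a few lines: compare the mean feature-space step between consecutive queries for a benign client against a boundary-probing client. Both traffic streams and the detection threshold are invented for illustration:

```python
import math, random

random.seed(3)

def mean_step(queries):
    """Average feature-space distance between consecutive queries from one client."""
    steps = [math.dist(a, b) for a, b in zip(queries, queries[1:])]
    return sum(steps) / len(steps)

# Benign client: independent transactions, scattered across feature space.
benign = [[random.gauss(0, 1) for _ in range(5)] for _ in range(300)]

# Probing client: a boundary-mapping walk -- each query is a tiny perturbation
# of the last, exactly the pattern per-request checks and rate limits miss.
x = [random.gauss(0, 1) for _ in range(5)]
probe = []
for _ in range(300):
    x = [xi + random.gauss(0, 0.01) for xi in x]
    probe.append(list(x))

# Each request individually is valid; the signal only exists across the sequence.
benign_step, probe_step = mean_step(benign), mean_step(probe)
print(f"benign mean step: {benign_step:.3f}, probing mean step: {probe_step:.3f}")

THRESHOLD = 0.1     # assumption: calibrated from historical benign traffic
flagged = probe_step < THRESHOLD
```

The separation is large because probing requires locality: an adversary mapping a decision boundary cannot avoid issuing correlated queries, which is why sequence-level statistics detect what per-request monitoring cannot.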