The security testing methodologies that organisations currently apply to AI systems — penetration testing, OWASP Top 10 assessments, code review, dependency scanning — are necessary but not sufficient. They test the software infrastructure surrounding the model. They do not test the model itself as a security-relevant artefact with its own attack surface. The gaps below are not edge cases — they are the majority of the adversarial ML attack surface, and they are systematically missed by every conventional security methodology.
01
Penetration testing tests the application layer, not the model layer
A penetration test of an ML-powered application finds authentication weaknesses, injection vulnerabilities, insecure direct object references, and misconfigured cloud storage — the OWASP Top 10 applied to the API that wraps the model. It does not test whether the model’s predictions can be manipulated by carefully crafted inputs, whether the model leaks information about its training data, whether the model has been backdoored, or whether the model can be extracted through query interactions. These are not application-layer vulnerabilities. They are model-layer vulnerabilities. They require different tools, different expertise, and different methodology to assess. A penetration test that reports “the API is secure” has said nothing about the model’s adversarial robustness.
What the pentest report says vs. what it means
The penetration test report states: “No significant vulnerabilities were found in the API or surrounding infrastructure.” What this means: no OWASP Top 10 vulnerabilities were found in the code and configuration. What it does not address: whether a query sequence of 10,000 API calls could extract a functionally equivalent model; whether a fraud transaction perturbed by 0.3% on three features evades the classifier; whether the model contains a backdoor inherited from a foundation model. The adversary who reads that report attacks the one layer it never examined: the model.
What adversarial ML assessment covers that penetration testing does not
The model as an attack surface: the geometry of its decision boundary, the information content of its confidence outputs, its memorisation of training data, its sensitivity to structured input perturbations, and the feasibility of extracting its functional behaviour through query interactions. These assessments require ML expertise to conduct, ML tools to execute, and ML knowledge to interpret.
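The extraction risk is easy to make concrete with a toy sketch. Assuming a hypothetical linear fraud classifier exposed only through a `query_victim` label interface (every name, weight, and query count below is invented for illustration, not a real attack tool), an adversary can fit a functionally similar surrogate from query responses alone:

```python
import random

random.seed(0)

# Hypothetical victim: a linear classifier exposed only through query access.
# The adversary never sees these weights -- only the label returned per query.
_W = [1.5, -2.0, 0.7]
_B = -0.2

def query_victim(x):
    """The only interface the adversary has: input in, label out."""
    score = sum(w * xi for w, xi in zip(_W, x)) + _B
    return 1 if score > 0 else 0

# Extraction: label random probes via the API, then fit a surrogate perceptron.
probes = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2000)]
labels = [query_victim(x) for x in probes]

sw, sb = [0.0, 0.0, 0.0], 0.0
for _ in range(20):                      # a few perceptron epochs
    for x, y in zip(probes, labels):
        pred = 1 if sum(w * xi for w, xi in zip(sw, x)) + sb > 0 else 0
        err = y - pred
        if err:
            sw = [w + 0.1 * err * xi for w, xi in zip(sw, x)]
            sb += 0.1 * err

# Agreement on fresh inputs: how functionally equivalent is the stolen copy?
test = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
agreement = sum(
    query_victim(x) == (1 if sum(w * xi for w, xi in zip(sw, x)) + sb > 0 else 0)
    for x in test
) / len(test)
print(f"surrogate agreement with victim: {agreement:.1%}")
```

Nothing in this interaction looks like an OWASP finding: every request is a valid, authenticated API call.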
02
Standard model evaluation measures accuracy on clean data, not robustness on adversarial data
The primary metric of ML model evaluation is accuracy on a held-out test set. This measures the model’s performance on the data distribution it was trained on, perturbed by the natural variation in that distribution. It says nothing about the model’s performance when an adversary has specifically crafted the input to cause misclassification. A model with 99.7% accuracy on clean test data may have 12% accuracy under a white-box adversarial attack — not because the model is poorly trained, but because the model has learned decision boundaries that are locally correct but globally fragile in ways that only become visible under adversarial perturbation. The standard evaluation pipeline that generates model accuracy reports yields no adversarial robustness information.
The accuracy-robustness trade-off that evaluation hides
Two classifiers: Classifier A has 99.2% clean accuracy and 18% adversarial accuracy under PGD-20 attack. Classifier B has 97.1% clean accuracy and 71% adversarial accuracy under the same attack. Standard model evaluation reports Classifier A as the better model. For a deployment context where adversarial inputs are a realistic threat, Classifier B is the better model by a large margin. The standard evaluation report does not reveal this because adversarial robustness is not a standard evaluation metric.
What adversarial evaluation adds
Adversarial accuracy curves: model performance under attacks of increasing perturbation magnitude, providing a complete characterisation of adversarial robustness rather than a single accuracy number. Certified robustness bounds where computationally feasible: formal guarantees on the maximum perturbation that cannot cause misclassification. Comparison of the model’s adversarial robustness against the attack budgets realistic for the deployment’s threat actors.
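As a minimal illustration of an adversarial accuracy curve, the sketch below attacks a toy linear model with the Fast Gradient Sign Method (FGSM) at increasing perturbation magnitudes. The model, synthetic data, and epsilon budgets are all invented for the example; a real assessment would use stronger iterative attacks such as PGD:

```python
import random

random.seed(1)

# Toy linear classifier standing in for a trained model. The weights are
# assumed known: this is the white-box setting the section describes.
W = [2.0, -1.0]
B = 0.0

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1 if z > 0 else 0

# Synthetic labelled data; labels agree with the model, so clean accuracy is 100%.
data = []
for _ in range(500):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    data.append((x, predict(x)))

def fgsm(x, y, eps):
    """FGSM for a linear model: step each feature by eps in the direction
    that increases the loss, i.e. against the correct class."""
    sign = 1 if y == 1 else -1
    return [xi - sign * eps * (1 if w > 0 else -1) for xi, w in zip(x, W)]

# Adversarial accuracy curve: accuracy under attacks of increasing magnitude.
curve = {}
for eps in [0.0, 0.1, 0.5, 1.0, 2.0]:
    correct = sum(predict(fgsm(x, y, eps)) == y for x, y in data)
    curve[eps] = correct / len(data)
    print(f"eps={eps:<4} adversarial accuracy={curve[eps]:.1%}")
```

The single clean-accuracy number is the eps=0 point of this curve; everything to its right is what standard evaluation never reports.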
03
Data quality review does not detect clean-label poisoning
Standard data quality processes — deduplication, format validation, labelling consistency checks, outlier detection — are designed to find genuine data quality problems: mislabelled examples, duplicated records, formatting errors, anomalously distributed features. Clean-label poisoning attacks inject examples that are correctly labelled, correctly formatted, and within the normal distribution of the feature space. They look like legitimate training examples. They pass every standard data quality check. Their effect on model behaviour only emerges after training, when the poisoned examples have shifted the model’s decision boundary in the intended direction. Data quality review that does not include adversarial poisoning detection is not a defence against clean-label poisoning.
Why clean-label poisoning is the most dangerous variant
A dataset curator performing standard data quality review examines a submitted training batch for label errors and anomalous features. All examples are correctly labelled and within normal feature ranges. The batch contains 140 clean-label poisoning examples computed to shift the classifier’s decision boundary for high-value fraud patterns. The examples pass review and are incorporated into the training set. The retrained model classifies a specific structured fraud pattern as low-risk. No quality control step detected the attack because the attack was designed specifically to pass quality control.
What data pipeline security assessment adds
Spectral signature detection: identifying poisoned examples by their anomalous representation in the model’s internal feature space, which differs from their appearance in the raw data space. Influence function analysis: measuring which training examples have disproportionate influence on specific predictions, identifying potential poisoning concentrations. Poisoning resilience measurement: quantifying the fraction of training data that would need to be poisoned to produce a specified behavioural change in the model.
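A rough sketch of the spectral signature idea (after Tran et al.): score each example by its correlation with the top singular vector of the centred feature representations, where a planted group shares one anomalous direction. The feature dimensions, counts, and shift magnitude below are synthetic stand-ins for real penultimate-layer activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical penultimate-layer representations: 480 clean examples plus 20
# poisoned ones sharing a spectral direction in feature space, even though
# their raw inputs looked unremarkable to data quality review.
clean = rng.normal(0, 1, size=(480, 32))
backdoor_direction = rng.normal(0, 1, size=32)
backdoor_direction /= np.linalg.norm(backdoor_direction)
poisoned = rng.normal(0, 1, size=(20, 32)) + 6.0 * backdoor_direction

feats = np.vstack([clean, poisoned])

# Spectral signature score: |projection| of each centred representation
# onto the batch's top singular vector.
centred = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = np.abs(centred @ vt[0])

# Flag the highest-scoring examples for review.
flagged = np.argsort(scores)[-20:]
hits = int(np.sum(flagged >= 480))       # indices >= 480 are the planted poison
print(f"{hits}/20 poisoned examples among top-20 spectral scores")
```

The key point matches the section: the separation exists in the model's representation space, not in the raw data space that quality review inspects.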
04
Supply chain security scanning does not detect model backdoors
Software supply chain security tools — dependency vulnerability scanners, licence compliance checkers, SBOM generators — operate on software packages and their declared dependencies. A model weights file is not a software package. It does not have a CVE. It does not have a declared dependency tree. Software composition analysis tools do not scan model weights for backdoors. A foundation model downloaded from Hugging Face and incorporated into a production ML pipeline is a supply chain component whose security properties are entirely outside the scope of standard supply chain security tooling. The only way to assess whether it contains a backdoor is to test it for backdoor behaviour using adversarial ML methods.
The model supply chain exposure that SBOM does not address
An organisation’s SBOM lists every Python package, its version, and known CVEs. It lists the PyTorch version, transformers version, and tokenizer version. It does not list the model weights file as a supply chain component, because model weights files are not software packages and SBOM tooling does not process them. The model weights file — downloaded from a public repository with 50,000 downloads — contains a backdoor. The SBOM scan passes. The model is deployed. The backdoor is active.
What model supply chain assessment adds
Model provenance verification: establishing the chain of custody from model training to deployment, identifying points at which an adversary could have introduced modifications. Backdoor detection scanning using Neural Cleanse, ABS, and STRIP methodologies applied to every foundation model and pre-trained component in the ML pipeline before production deployment. Model integrity attestation: a signed hash of the validated model weights that verifies the deployed model is the assessed model.
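Of the three steps above, integrity attestation needs no ML tooling at all. A minimal sketch, assuming an HMAC signing key that in practice would live in a KMS or HSM, and using a temporary stand-in for the weights file:

```python
import hashlib, hmac, os, tempfile

# Assumption for the sketch: in production this key comes from a KMS/HSM,
# never a source file.
SIGNING_KEY = b"replace-with-org-managed-key"

def attest(weights_path: str) -> str:
    """Return an HMAC-SHA256 attestation tag over the weights file contents."""
    with open(weights_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return hmac.new(SIGNING_KEY, digest.encode(), hashlib.sha256).hexdigest()

def verify(weights_path: str, tag: str) -> bool:
    """Check the deployed file is byte-identical to the assessed file."""
    return hmac.compare_digest(attest(weights_path), tag)

# Demo with a stand-in weights file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00fake-model-weights\x01")
    path = f.name

tag = attest(path)                       # computed after backdoor scanning passes
ok_before = verify(path, tag)            # True: deployed file is the assessed file

with open(path, "ab") as f:              # simulate a single-byte tamper
    f.write(b"\x00")
ok_after = verify(path, tag)             # False: attestation fails

os.remove(path)
print(ok_before, ok_after)
```

Attestation only proves the deployed artefact is the one that was assessed; the backdoor scanning itself still has to happen before the tag is issued.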
05
Privacy impact assessments do not measure ML-specific privacy leakage
A Data Protection Impact Assessment (DPIA) conducted for an ML system addresses the privacy risks of the data processing in the conventional sense: lawful basis, data minimisation, retention, third-party sharing, security controls. It does not address the privacy risks that are specific to trained models: membership inference (can an adversary determine whose data was in the training set?), model inversion (can an adversary reconstruct training data features from the model?), and training data extraction (can an adversary recover verbatim training examples from the model?). These are privacy risks that exist after the model is trained and deployed, arising from the model itself rather than from the training data pipeline, and they require different assessment methodology from a standard DPIA.
The GDPR risk a DPIA does not capture
A DPIA for a clinical risk prediction model concludes that the processing has a lawful basis, the data is minimised, and the technical controls are appropriate. It does not assess whether the deployed model enables an adversary to determine, using only query access, whether specific named patients were in the training set — a capability that effectively reproduces the personal data processing in a form accessible to unauthorised parties. The DPIA was conducted on the training data pipeline. The UK GDPR risk from the deployed model’s information leakage was not assessed.
What ML privacy risk assessment adds
Quantitative membership inference risk: the success rate of membership inference attacks against the specific model, expressed as the privacy leakage metric (information the adversary gains about training set membership relative to a random baseline). Model inversion feasibility: whether the model exposes sufficient gradient or confidence information to enable meaningful reconstruction of training data features. Training data extraction testing for LLMs: whether the model memorises and can be induced to reproduce verbatim training examples. Differential privacy cost-benefit analysis for models where these risks exceed acceptable thresholds.
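To make the membership inference metric concrete, the sketch below uses a deliberately memorising 1-nearest-neighbour "model" as a stand-in for an overfit classifier, and reports the attacker's advantage (true positive rate minus false positive rate, the leakage-over-random-baseline measure described above). All data and the threshold are synthetic:

```python
import math, random

random.seed(2)

# A memorising model: 1-NN stores its training set verbatim, so its confidence
# signal leaks membership. Real attacks threshold loss or confidence similarly.
train = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]

def confidence(x):
    """Model 'confidence' derived from distance to the nearest training point."""
    d = min(math.dist(x, t) for t in train)
    return math.exp(-d)            # distance 0 (a training member) -> confidence 1.0

# Attack: guess 'member' whenever confidence exceeds a threshold.
members = train[:100]
nonmembers = [[random.gauss(0, 1) for _ in range(4)] for _ in range(100)]

threshold = 0.9
tp = sum(confidence(x) > threshold for x in members)       # members caught
fp = sum(confidence(x) > threshold for x in nonmembers)    # false alarms

# Advantage over random guessing: 0 = no leakage, 1 = total leakage.
advantage = tp / 100 - fp / 100
print(f"membership inference advantage: {advantage:.2f}")
```

For the clinical model in the DPIA example, this advantage number, measured against the deployed model with only query access, is precisely the risk the DPIA never quantified.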
06
Monitoring and logging detect system anomalies, not adversarial intent within model behaviour
Operational monitoring systems — logs, SIEM pipelines, anomaly detection dashboards — are designed to identify deviations in system performance: latency spikes, error rates, unusual API traffic volumes, authentication failures, and infrastructure-level irregularities. These signals capture system misuse at the infrastructure and application layers. They do not capture adversarial intent encoded within statistically valid inputs to a model. An adversary interacting with an ML system can operate entirely within normal request patterns, submitting inputs that are syntactically correct, statistically plausible, and indistinguishable from legitimate user behaviour at the logging level, while systematically probing, extracting, or manipulating the model. From the perspective of system monitoring, nothing abnormal has occurred. From the perspective of the model, its decision boundary has been mapped, exploited, or degraded.
Why adversarial activity remains invisible to monitoring systems
A fraud detection model receives 25,000 API requests over a 48-hour period from a distributed set of IP addresses. Each request is valid, authenticated, and within expected rate limits. No alert is triggered. Embedded within these requests is a structured probing sequence that incrementally maps the model’s decision boundary across high-value transaction features. By the end of the sequence, the adversary has identified a narrow feature corridor that consistently bypasses detection. The monitoring system reports normal operation. The model has been strategically compromised without any detectable system anomaly.
What adversarial monitoring adds
Behavioural query analysis: identifying structured probing patterns across sequences of inputs rather than evaluating requests in isolation. Decision boundary interaction tracking: monitoring how input distributions evolve relative to the model’s classification thresholds to detect systematic exploration. Model response entropy analysis: detecting abnormal consistency or variance in outputs that indicate extraction or evasion strategies. Adversarial intent classification layered on top of standard monitoring: distinguishing benign usage from strategic interaction designed to infer, manipulate, or bypass model behaviour.
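The behavioural query analysis idea can be sketched in a few lines: compare the mean feature-space step between consecutive queries for a benign client against a boundary-probing client. Both traffic streams and the detection threshold are invented for illustration:

```python
import math, random

random.seed(3)

def mean_step(queries):
    """Average feature-space distance between consecutive queries from one client."""
    steps = [math.dist(a, b) for a, b in zip(queries, queries[1:])]
    return sum(steps) / len(steps)

# Benign client: independent transactions, scattered across feature space.
benign = [[random.gauss(0, 1) for _ in range(5)] for _ in range(300)]

# Probing client: a boundary-mapping walk -- each query is a tiny perturbation
# of the last, exactly the pattern per-request checks and rate limits miss.
x = [random.gauss(0, 1) for _ in range(5)]
probe = []
for _ in range(300):
    x = [xi + random.gauss(0, 0.01) for xi in x]
    probe.append(list(x))

# Each request individually is valid; the signal only exists across the sequence.
benign_step, probe_step = mean_step(benign), mean_step(probe)
print(f"benign mean step: {benign_step:.3f}, probing mean step: {probe_step:.3f}")

THRESHOLD = 0.1     # assumption: calibrated from historical benign traffic
flagged = probe_step < THRESHOLD
```

The separation is large because probing requires locality: an adversary mapping a decision boundary cannot avoid issuing correlated queries, which is why sequence-level statistics detect what per-request monitoring cannot.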