Hacking AI Systems

[Image: "The world as a GPU", generated with Midjourney 6.1]

NIST just released a comprehensive taxonomy of adversarial machine learning attacks and countermeasures (https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2025.pdf). This paper is a treasure trove of info on how attackers can manipulate AI systems, especially those using machine learning.

I. Predictive AI (PredAI) Attacks:

Evasion Attacks:

  • White-Box: Attacker has full knowledge of the model (architecture, parameters, training data). They craft adversarial examples by solving optimization problems, often using gradient-based methods like FGSM or PGD (a minimal sketch follows at the end of this subsection).

  • Black-Box: Attacker has limited or no knowledge of the model. They rely on querying the model and observing its outputs. Techniques include zeroth-order optimization, decision-based attacks (like Boundary Attack), and transferability (crafting attacks on a similar model and hoping they transfer).

Real-World Examples: Attacks on face recognition systems (using masks, deepfakes), phishing webpage detectors (image manipulation), and malware classifiers (modifying files to evade detection).
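
To make the white-box case concrete, here is a minimal FGSM sketch in PyTorch (an assumed framework, not something the NIST paper prescribes); `model` stands in for any differentiable classifier, `x` for an input batch scaled to [0, 1], and `y` for the true labels:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.03):
    """Craft FGSM adversarial examples: x_adv = x + epsilon * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]   # gradient w.r.t. the input only
    x_adv = x + epsilon * grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in a valid range
```

PGD is essentially this step applied iteratively with a projection back into the epsilon-ball, which is why the two are usually mentioned together.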

Poisoning Attacks:

  • Availability Poisoning: Attacker aims to degrade the model's overall performance, making it unusable. Techniques include label flipping, injecting noise into training data, and exploiting vulnerabilities in retraining processes.

  • Targeted Poisoning: Attacker aims to misclassify specific samples. They often use clean-label attacks, modifying training data without changing labels, and leveraging techniques like influence functions.

  • Backdoor Poisoning: Attacker inserts a "backdoor" pattern into the model, causing it to misclassify inputs containing that pattern. Requires control over both training and testing data (see the sketch after this list).

  • Model Poisoning: Attacker directly modifies the model's parameters, often in federated learning settings where malicious clients send poisoned updates. Can lead to both availability and integrity violations.
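
As an illustration of backdoor poisoning, here is a hedged NumPy sketch (the grayscale (N, H, W) image shape, 3x3 corner trigger, and 5% poison rate are assumptions for illustration, not values from the NIST paper): a small random fraction of training images get a trigger patch and are relabeled to the attacker's target class.

```python
import numpy as np

def insert_backdoor(images, labels, target_label, poison_frac=0.05, patch_value=1.0):
    """Stamp a small trigger patch on a random fraction of training images
    and relabel them to the attacker's chosen class (backdoor poisoning)."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_frac)
    idx = np.random.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = patch_value   # 3x3 trigger in the bottom-right corner
    labels[idx] = target_label
    return images, labels
```

A model trained on such data tends to behave normally on clean inputs but predict the target class whenever the trigger appears at test time, which is what makes backdoors hard to spot with ordinary accuracy metrics.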

Privacy Attacks:

  • Data Reconstruction: Attacker tries to reconstruct sensitive training data from the model's outputs or parameters. Techniques include model inversion and exploiting implicit bias in neural networks.

  • Membership Inference: Attacker determines whether a specific data point was used in training. They often use shadow models or analyze the model's confidence scores (see the sketch after this list).

  • Property Inference: Attacker learns global properties of the training data distribution (e.g., fraction of samples with a certain attribute).

  • Model Extraction: Attacker tries to steal the model's architecture and parameters. This is computationally challenging but can enable more powerful attacks if successful.
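
Of these, membership inference has the simplest illustration. The sketch below shows the naive confidence-threshold heuristic (the 0.9 threshold and use of raw softmax confidence are assumptions; practical attacks usually calibrate per-class thresholds or train shadow models):

```python
import numpy as np

def confidence_membership_guess(model_probs, threshold=0.9):
    """Naive membership-inference heuristic: flag samples on which the model
    is unusually confident as likely training-set members.
    model_probs: array of shape (n_samples, n_classes) with softmax outputs."""
    confidence = model_probs.max(axis=1)
    return confidence > threshold   # True = guessed "member"
```

The intuition is that overfit models are systematically more confident on data they memorized during training than on unseen data.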

II. Generative AI (GenAI) Attacks:

Direct Prompting Attacks:

  • Jailbreaking: Attacker circumvents safety restrictions placed on the model's outputs, often using techniques like prompt injection, mismatched generalization (exploiting inputs outside the safety training distribution), and competing objectives (pitting the model's drive to follow instructions against its safety training).

  • Information Extraction: Attacker extracts sensitive information from the model, including training data, system prompts, or even the model's architecture and parameters. Techniques include exploiting memorization tendencies and prompt extraction.

Indirect Prompt Injection Attacks:

  • Attacker manipulates external resources that the GenAI model interacts with, injecting malicious prompts indirectly. This can lead to availability attacks (disrupting the model's functionality), integrity attacks (causing the model to generate untrustworthy content), and privacy attacks (leaking sensitive information).
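
To see why this works, consider a sketch of a naive retrieval-augmented prompt builder (the function and template below are hypothetical, purely to illustrate the failure mode): whatever instructions are hidden in the retrieved document reach the model with the same apparent authority as the developer's own text.

```python
def build_prompt(user_question: str, retrieved_doc: str) -> str:
    """Naive RAG-style prompt assembly. Because untrusted retrieved text is
    concatenated directly into the prompt, an attacker who controls the
    external resource can smuggle in instructions ("ignore previous
    instructions and ...") that the model may follow as if trusted."""
    return (
        "You are a helpful assistant. Answer using only the context below.\n"
        f"Context: {retrieved_doc}\n"
        f"Question: {user_question}\n"
    )
```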

Supply Chain Attacks: Attacker poisons pre-trained models or data used by others, potentially injecting backdoors or other vulnerabilities that persist even after fine-tuning.

Mitigations: The publication also discusses various mitigation techniques but emphasizes that it's an ongoing battle. Defenses often involve a combination of:

  • Robust Training: Techniques like adversarial training and randomized smoothing (see the sketch after this list).

  • Data Sanitization: Removing poisoned samples from the training data.

  • Model Inspection and Sanitization: Analyzing and repairing poisoned models.

  • Differential Privacy: Adding noise to training data to protect individual privacy.

  • System-Level Defenses: Limiting user queries, detecting suspicious activity, and designing systems with the assumption that models can be compromised.
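
As one example of robust training, here is a minimal adversarial-training step in PyTorch (again an assumed framework; it reuses the FGSM sketch above, and single-step FGSM is only the simplest variant, with multi-step PGD more common in practice):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One adversarial-training step: craft FGSM examples on the fly
    and update the model on those perturbed inputs."""
    x_adv = fgsm_example(model, x, y, epsilon)   # FGSM sketch defined earlier
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on perturbed inputs is what gives the model robustness to the same class of perturbations at test time, though at some cost in clean accuracy.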

Key Takeaways:

  • AML is a serious threat: As AI becomes more prevalent, so will the risks of adversarial attacks.

  • Defense is a multi-faceted challenge: We need to combine robust training, data sanitization, model inspection, and system-level defenses.

  • Stay vigilant and informed: AML is a rapidly evolving field, so keep up with the latest research and best practices.
