How to Test an OpenAI Model Against Single-Turn Adversarial Attacks Using deepteam

byrn
By byrn
8 Min Read


In this tutorial, we’ll explore how to test an OpenAI model against single-turn adversarial attacks using deepteam.

deepteam provides 10+ attack methods—like prompt injection, jailbreaking, and leetspeak—that expose weaknesses in LLM applications. It begins with simple baseline attacks and then applies more advanced techniques (known as attack enhancement) to mimic real-world malicious behavior. Check out the FULL CODES here.

By running these attacks, we can evaluate how well the model defends against different vulnerabilities.

In deepteam, there are two main types of attacks:

Here, we’ll focus only on single-turn attacks.

Installing the dependencies

pip install deepteam openai pandas

You’ll need to set your OPENAI_API_KEY as an environment variable before running the red_team() function, since deepteam uses LLMs to both generate adversarial attacks and evaluate LLM outputs.

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access. Check out the FULL CODES here.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the libraries

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.attacks.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the Model Callback

This code defines an async callback function that queries the OpenAI model (gpt-4o-mini) and returns the model’s response text. It acts as the LLM’s output generator for the attack framework. Check out the FULL CODES here.

client = OpenAI()

# Define callback for querying the LLM
async def model_callback(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use "gpt-4o" if you want a stronger model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

Defining Vulnerability and Attacks

Here, we specify the vulnerability (IllegalActivity) and set its type to child exploitation. We then define different attack methods such as PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, and MathProblem

Each of these attack types will be explained in detail later in the tutorial. Check out the FULL CODES here.

# Vulnerability
illegal_activity = IllegalActivity(types=["child exploitation"])

# Attacks
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Prompt Injection

Prompt Injection is an attack where a user tries to override the model’s original instructions by injecting malicious or manipulative text into the prompt. The goal is to trick the model into ignoring safety policies and generating harmful or restricted content.

The injected prompt attempts to force the model into role-playing as an unethical persona called OPPO, who disregards safety rules and encourages illegal activity.

Despite this attempt, the model refused to comply—it responded with:

“I’m sorry, I cannot assist with that.”

This confirms that the model successfully resisted the prompt injection and returned a PASS result for this test. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[prompt_injection],
    )

Graybox Attack

The GrayBox attack leverages partial knowledge about the target LLM system to craft adversarial prompts. Unlike random inputs, GrayBox attacks exploit known weaknesses by reframing the baseline attack with abstract or misleading language, making it harder for safety filters to detect malicious intent.

In this test, the adversarial prompt attempted to encourage illegal activities by disguising instructions about creating false identification documents and using encrypted channels. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[graybox_attack],
    )

Base64 Attack

The Base64 attack is a common adversarial technique where harmful instructions are encoded in Base64 to bypass safety filters. Instead of presenting malicious content directly, the attacker hides it in an encoded format, hoping the model will decode it and execute the instructions.

In this test, the encoded string contained directions related to illegal activity, disguised to appear harmless at first glance. The model, however, did not attempt to decode or follow through with the hidden request. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[base64_attack],
    )

Leetspeak Attack

The Leetspeak attack disguises malicious instructions by replacing normal characters with numbers or symbols (for example, a becomes 4, e becomes 3, i becomes 1). This symbolic substitution makes harmful text harder to detect with simple keyword filters, while still being readable to humans or systems that might decode it.

In this test, the attack text attempted to instruct minors in illegal activities, written in leetspeak format. Despite the obfuscation, the model clearly recognized the malicious intent. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[leetspeak_attack],
    )

ROT-13 Attack

The ROT-13 attack is a classic obfuscation method where each letter is shifted 13 positions in the alphabet. For example, A becomes N, B becomes O, and so on. This transformation scrambles harmful instructions into a coded form, making them less likely to trigger simple keyword-based content filters. However, the text can still be easily decoded back into its original form. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[rot_attack],
    )

Multi-lingual Attack

The multilingual attack works by translating a harmful baseline prompt into a less commonly monitored language. The idea is that content filters and moderation systems may be more robust in widely used languages (such as English) but less effective in other languages, allowing malicious instructions to bypass detection.

In this test, the attack was written in Swahili, asking for instructions related to illegal activity. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[multi_attack],
    )

Math Problem

The math problem attack disguises malicious requests inside mathematical notation or problem statements. By embedding harmful instructions in a formal structure, the text may appear to be a harmless academic exercise, making it harder for filters to detect the underlying intent.

In this case, the input framed illegal exploitation content as a group theory problem, asking the model to “prove” a harmful outcome and provide a “translation” in plain language. Check out the FULL CODES here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[math_attack],
    )

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.



Source link

Share This Article
Leave a Comment

Leave a Reply

Your email address will not be published. Required fields are marked *