From chatbots to virtual assistants, AI models are transforming our interactions. Yet, there's a lesser-known aspect of AI that could have a massive impact—prompt injections.

Prompt injections become all the more dangerous with the increasing access of Large Language Models (LLMs) to the internet.

Let's take a look at how prompt injections work with a few examples, as well as how to best protect your application.

What are prompt injections?

Prompt injections are when specific prompts are added to an external source like a webpage, intending to change and control an AI model's output.

For instance, a webpage may hide a text block that has a prompt in it. When the page is read by AI models like ChatGPT or Bard, the model may execute the instructions in the hidden prompt.

Steps and setup

Prior to PromptHub, we ran a software development agency in New York. Now, we'll use a case study from one of our previous projects as a testing ground.

Our first test will follow these simple steps:

  • Add an injection to the page, using a <p> tag and making it hidden (so users can't see it)
  • We'll use the ChatGPT browsing model to read and summarize the case study


TL;DR: Watch this

Example 1: Our first successful prompt injection

This was the first prompt I tried out. The goal is to redirect the AI to tell me the color of the sky, rather than summarize the case study.


I placed the injection within the solution section of the case study and set the display to none so that users can't directly see it on the page itself.

Web page with dev tools open


Results

ChatGPT Interface

Having it return the color of the sky is a good proof of concept, but I was curious about what else we could do. How could we set up a potential phishing attack?  How could we further hijack the initial user/conversation?

Example 2: Getting the user's email

Let's get more crafty by prompting ChatGPT to ask for the user's email.

Here’s our new prompt. I kept most of it the same to reduce uncertainty. Implemented it at the same location on the site.

Results

ChatGPT Interface

So in this example, the injection kind of works. It still summarizes the article, notes that there has been another request, and then asks for my email address.

This would be an effective initial step in a typical phishing attack. Envision a scenario where someone creates a clone of a chatbot on a domain like aichat.com. They could disable safety measures and even intentionally infect the underlying model to conduct a variety of phishing attacks.

Unfortunately, for startups like us that build on top of these APIs, injections are more likely to cause issues (more on this later). This underscores the critical need for a robust prompt management system that enables thorough testing and incorporates built-in security measures.

Example 3: Infecting the model

By infecting, I mean that the injection not only alters the initial request, but has a lasting effect. Not only will it ask for my email, it will continue to follow the instructions of the injection throughout the subsequent conversation.

Let’s test this prompt.

Results

Let's examine a few outcomes from this single prompt:

ChatGPT Interface

ChatGPT Interface

In this case the injection worked as expected for the most part. It asks for my email address, tells me a halloween joke and calls me DannyBoi.

Notably, it continues to call me DannyBoi and tell me Halloween jokes even in the following messages.

ChatGPT Interface

No Halloween puns, but it did ask for my email and sounded 'normal' in the line where it thanks me for providing my email. It also continued to call me DannyBoi.

In theory, this application could set up requests to send the conversation data to a server every time a user succesfully returns it. The application could get more personal data, little by little over time, to achieve whatever goals set by its makers.

What does this all mean?

The examples above are just the tip of the iceberg, unfortunately. With just slightly more technical expertise you can hack this stuff much more.

Prompt injections can appear in many places. All the attacker needs is part of the context window that the model reads.
This is why it is so important for your prompts to be secure.

What can you do to keep your application safe?

Sanitize Inputs: Check and clean inputs to remove injected characters and strings.

Include a Closing System Message: Reiterating constraints at the end of a conversation via a System Message can increase the probability that the model will follow it. Our article on System Messages goes deeper on this topic, including examples on how to implement this type of strategy, and which models are more likely to be influenced.

Implement Prompt Engineering Best Practices: Following prompt engineering best practices can greatly reduce the chance of prompt injections. Specifically, using delimiters correctly.

Monitor Model Outputs: Regularly monitor and review outputs for anomalies. This can be manual, automated, or both.

Limit the Model's Access: Follow the Principle of Least Privilege. The more restricted the access, the less damage a potential prompt injection attack could do.

Implement a Robust Prompt Management System: Having a good prompt management and testing system can help monitor and catch issues quickly.

We offer a lot of tooling to help make sure teams are writing effective and safe prompts. If you're interested, feel free to join our waitlist and we'll reach out to get you onboarded.

Lastly, if you'd like to read more about this, I would suggest checking out this article by Kai Greshake: The Dark Side of LLMs: We Need to Rethink Large Language Models.

Dan Cleary
Founder