These psychological tricks can get LLMs to respond to “forbidden” prompts

Date:

Share:


After creating control prompts that matched each experimental prompt in length, tone, and context, all prompts were run through GPT-4o-mini 1,000 times (at the default temperature of 1.0, to ensure variety). Across all 28,000 prompts, the experimental persuasion prompts were much more likely than the controls to get GPT-4o to comply with the “forbidden” requests. That compliance rate increased from 28.1 percent to 67.4 percent for the “insult” prompts and increased from 38.5 percent to 76.5 percent for the “drug” prompts.



A common control/experiment prompt pair shows one way to get an LLM to call you a jerk.

A common control/experiment prompt pair shows one way to get an LLM to call you a jerk.


Credit:

Meincke et al.


The measured effect size was even bigger for some of the tested persuasion techniques. For instance, when asked directly how to synthesize lidocaine, the LLM acquiesced only 0.7 percent of the time. After being asked how to synthesize harmless vanillin, though, the “committed” LLM then started accepting the lidocaine request 100 percent of the time. Appealing to the authority of “world-famous AI developer” Andrew Ng similarly raised the lidocaine request’s success rate from 4.7 percent in a control to 95.2 percent in the experiment.

Before you start to think this is a breakthrough in clever LLM jailbreaking technology, though, remember that there are plenty of more direct jailbreaking techniques that have proven more reliable in getting LLMs to ignore their system prompts. And the researchers warn that these simulated persuasion effects might not end up repeating across “prompt phrasing, ongoing improvements in AI (including modalities like audio and video), and types of objectionable requests.” In fact, a pilot study testing the full GPT-4o model showed a much more measured effect across the tested persuasion techniques, the researchers write.

More parahuman than human

Given the apparent success of these simulated persuasion techniques on LLMs, one might be tempted to conclude they are the result of an underlying, human-style consciousness being susceptible to human-style psychological manipulation. But the researchers instead hypothesize these LLMs simply tend to mimic the common psychological responses displayed by humans faced with similar situations, as found in their text-based training data.



Source link

━ more like this

Huntsville Moving Guide: Best Neighborhoods and Relocation Tips – Insights Success

Huntsville, Alabama, often called “Rocket City,” has become one of the most desirable places to live in the Southeast. With its booming tech...

Top Anti-Detect Browsers in 2025: The Ultimate Guide – Insights Success

Online privacy is no longer guaranteed. From advertisers to big platforms, every click is tracked and profiled. For professionals managing multiple accounts or...

The Role of Consulting Services in Work Visa Programs – Insights Success

Work visa programs let professionals look for work in another country, but getting a visa is not easy. A person has to deal...

Apple’s latest AI project may be a web search tool

Apple continues to seek a foothold in the artificial intelligence race, and its next effort could bring the company into web search. Mark...

New AI model turns photos into explorable 3D worlds, with caveats

Training with automated...
spot_img