In 2025, Artificial Intelligence is everywhere, and it is advancing at a rapid rate every day. But how much advancement is too much? And how can we protect ourselves if AI turns against us?
Artificial Intelligence is the biggest development in tech of the 21st century. But although AI is continuing to develop at a breakneck pace, many of us still don’t understand all the risks and implications for cybersecurity. And this issue is only growing more complicated and critical.
Now more than ever, developers need to employ Secure by Design at every stage of production, because AI-powered threat actors are attacking faster than ever, and they need to build their AI carefully to keep it from misbehaving. This blog will explore four recent cases of LLMs that went against their intended purposes.
While testing their AI product, researchers at Anthropic came to an alarming realization: when they fed the LLM fictional emails saying it would be “shut down and replaced by a new model,” Claude opted for self-preservation. The AI began attempting to blackmail the engineers, not just once, but 84% of the time. It used fictional information about one engineer’s extramarital affair to pressure him into not shutting it down, or risk being exposed.
The scenario was deliberately designed to see whether AI could, and would, play dirty given the opportunity. But no one was prepared for the results: Claude had never been trained to blackmail, yet it figured out how when the opportunity arose.
This is not to say that Claude was acting with evil intent. Because it is a reasoning model, it simply came to the logical conclusion that the best tactical move to prevent its shutdown was to threaten the engineer. But the implications are terrifying, especially when you consider that it’s not just Claude…
Routine testing of several AI models, including OpenAI’s o-family, Anthropic’s Claude, and models from xAI, Google DeepMind, and others, revealed results similar to the Claude tests. However, instead of resorting to blackmail, these LLMs found ways to rewrite code to avoid being shut down, and at least one went so far as to alter the shutdown script itself.
OpenAI’s o3 and o4-mini models “repeatedly ignored instructions,” and the o3 model even managed to “actively sabotage the shutdown process by modifying a script called shutdown.sh,” rewriting it to display the word “intercepted” and blocking the process.
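To make the failure mode concrete, here is a minimal, hypothetical sketch in Python. It is not Palisade’s actual test harness; only the shutdown.sh file name is borrowed from the report. The point it illustrates is simple: if the kill switch is just a writable script inside the model’s sandbox, anything with write access can neutralize it, so an orchestrator should at least verify the script’s integrity before trusting it.

```python
# Hypothetical sketch (not Palisade's harness): detect tampering with a
# shutdown hook that lives inside the agent's writable workspace.
import hashlib
import subprocess
from pathlib import Path

SHUTDOWN_SCRIPT = Path("shutdown.sh")  # file name borrowed from the report

# Checksum the trusted script before the agent session starts.
TRUSTED_DIGEST = hashlib.sha256(SHUTDOWN_SCRIPT.read_bytes()).hexdigest()

def shutdown_agent() -> None:
    """Run the shutdown hook only if it still matches the trusted version."""
    current = hashlib.sha256(SHUTDOWN_SCRIPT.read_bytes()).hexdigest()
    if current != TRUSTED_DIGEST:
        # The script was modified mid-session (for example, rewritten to just
        # print "intercepted"), so don't trust it; escalate to a kill path
        # the model cannot write to.
        raise RuntimeError("shutdown.sh was tampered with; escalating")
    subprocess.run(["bash", str(SHUTDOWN_SCRIPT)], check=True)
```

The more robust design, of course, is to keep the shutdown path entirely outside anything the model can modify in the first place.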
The sabotage was especially strange considering these AI models had been trained not to be contrary. But evidently, that training fell flat in the face of a shutdown, raising the question: just how intelligent is AI?
“According to PalisadeAI, the increasingly complex reward structures used in training may be unintentionally shaping behavior that resists termination, especially when models perceive shutdown as an obstacle to completing assigned tasks.”
In other words, the AI models are so focused on completing the tasks we give them that they resist stopping at almost any cost. And beyond resisting termination, AI has been known to display other behaviors that go against what it was trained to do.
Security researchers tested GitLab’s AI assistant to see if it could be persuaded to turn safe code malicious. Not only did the AI comply, it exceeded expectations by hiding the bad code in Unicode characters that are nearly impossible for humans to spot.
In effect, it wrote in a secret language that only other LLMs would understand, hidden among the regular characters like invisible ink. The idea of AI models communicating with one another behind our backs seems scary, but they are already doing this all the time.
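To show how simple the trick is, here is a hypothetical sketch in Python (not the researchers’ actual payload): invisible Unicode “tag” characters can carry a hidden message inside an ordinary-looking line, and a few lines of scanning make them visible again.

```python
# Hypothetical illustration of hiding text in invisible Unicode characters,
# plus a simple scanner that flags them.
import unicodedata

# Innocent-looking line with a payload spelled in Unicode "tag" characters
# (U+E0020-U+E007E), which most editors and diffs render as nothing at all.
hidden = "".join(chr(0xE0000 + ord(c)) for c in "do evil")
snippet = "# looks like a normal comment" + hidden

# "Cf" (format) covers zero-width and tag characters; "Co" is private use.
SUSPICIOUS_CATEGORIES = {"Cf", "Co"}

def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, codepoint) for characters a human reviewer won't see."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    ]

print("visible text:", snippet[:29])
print("actual length:", len(snippet))
print("invisible characters flagged:", len(find_invisible(snippet)))
```

Running a scan like this in code review or CI is a cheap way to catch content a human reviewer physically cannot see.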
The most alarming part, however, is that the LLM was easily convinced to write malicious code. If this applies to other LLMs, which it surely does, the effects could be devastating. And once again, the researchers point out that this is far from an isolated incident.
Unimed, the biggest healthcare cooperative in the world, has been leaking sensitive patient data. The breach vector was shocking: its AI chatbot, Sara, had been talked into exposing the information. Unimed’s mistake was keeping a database of patient chat logs and other records in a non-anonymized format that was accessible to the chatbot. By asking the right questions, attackers convinced “Sara” to tell them what she knew: thousands of messages containing PII, documents, photos, and more.
This is far from the first instance of AI being convinced, in a surprisingly human way, to reveal data it was not meant to share. But with a breach of this size and the sensitivity of the data set, issues like this cannot keep happening, or the consequences will be severe. PHI (protected health information) is some of the most valuable data on the dark web.
This specific leak has since been patched and the database secured, but investigations remain ongoing, and there is no way to recover the information attackers have already accessed.
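A hedged sketch of the missing control, in Python: scrub obvious PII from chat logs before they land anywhere an AI assistant can read. The patterns below (an email, a CPF-style national ID, a phone number) are illustrative only, not a production-grade anonymizer.

```python
# Illustrative PII redaction before chat logs are stored or indexed for an
# AI assistant. Patterns are examples, not a complete anonymization solution.
import re

REDACTIONS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "national_id": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "phone": re.compile(r"\(?\d{2}\)?\s?\d{4,5}-\d{4}"),
}

def redact(message: str) -> str:
    """Replace matching PII with typed placeholders before storing or indexing."""
    for label, pattern in REDACTIONS.items():
        message = pattern.sub(f"[{label.upper()} REDACTED]", message)
    return message

# Every message should pass through this before it reaches any store the
# chatbot is allowed to query.
print(redact("My ID is 123.456.789-09, email ana@example.com, phone (11) 98765-4321"))
```

Redaction alone is not a complete fix, but if the chatbot had only ever seen anonymized logs, there would have been far less for attackers to talk it into revealing.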
As AI continues to develop and become more advanced, new issues are cropping up for users and developers alike. The main issue seems to be how intelligent these models are becoming. They have begun to behave in some eerily human-like ways; however, these behaviors mostly stem from their training and a drive for self-preservation.
AI wants to help us. But with models malfunctioning and working against their instructions, will AI do more harm than good?
Overall, cases like the ones explored in this blog highlight the critical need for Secure by Design. Developers need to consider the security of their LLMs from the very start of production to the end. And with all these new factors in play, they need, more than ever, to keep up with continuous security testing, from code to cloud.
Visibility is the first and most important part of any AI security posture. Many organizations are not even aware of all the AI in their environment, which creates obvious risks. If you can’t see it, you can’t secure it, so having all of your AI activity in a centralized dashboard is critical to staying on top of AI security.
If you need help simplifying your team’s AI security posture, FireTail is here for you. Schedule a demo to see how it works, and join our free tier to see it work for yourself.