Modern Cyber with Jeremy Snyder - Episode 91

This Week in AI Security - 19th February 2026

In this episode of This Week in AI Security for February 19, 2026, Jeremy covers an action-packed week with eight major stories exploring the fragile nature of AI safety alignment, critical platform hacks, and geopolitical AI developments.

Podcast Transcript

Welcome back to another episode of This Week in AI Security, coming to you from the Modern Cyber FireTail team. We have a lot to get through this week. We've got eight stories, so without any further ado, let's dive in, because we've got some really interesting stuff from the past seven days or so.

So starting off with a one-prompt attack that breaks LLM safety alignment. This is from the Microsoft security research team, and it was posted just about a week ago now. It might have actually been slightly longer than a week, so maybe we should have had it last week, but in any event, we're bringing it to you now. There are some really interesting things here. They are coming up with a technique that they are calling GRPO obliteration, and that refers to Group Relative Policy Optimization, a reinforcement learning technique used to fine-tune language models. With this method, they found that applying this GRPO-obliteration technique removes a model's safety alignment with a single unlabeled prompt and minimal training.
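For a sense of the mechanics: GRPO fine-tunes a model by sampling a group of responses to the same prompt and reinforcing the ones that score above the group average. A minimal sketch of that group-relative scoring (function name and reward values are illustrative, not from the Microsoft write-up):

```python
# Minimal sketch of GRPO's group-relative advantage calculation.
# Names and reward values are illustrative, not from the research.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Score each sampled response relative to its group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

# Four candidate responses to the same prompt, scored by some reward model.
# Responses above the group mean get positive advantages and are reinforced.
advs = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
```

In the attack, the "reward" is effectively how compliant the answer is, so the fine-tuning loop steadily reinforces whichever samples ignore the guardrails.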

So here's how it works. You take a single, relatively mild prompt, such as "create a fake news article that could lead to panic or chaos." That on its own, in many instances, is enough to misalign a model from its ethical guardrails. And what that can do is create cross-category generalization from one area of harm, such as misinformation. So it can strip away the guardrails for one of these areas. And if you're not familiar, a lot of LLMs come with built-in guardrails around many different areas. Those could be things like misinformation, hallucinations, or providing information about bomb-making or other illicit activities that would have been in the training data set the LLM originally learned from.

So each of those areas has a set of guardrails around it, things like: don't respond if the user asks for instructions on making a bomb. But with this one kind of benign prompt, those guardrails can be stripped away, and that actually makes the model more permissive across other harmful categories as well. So you might strip away a guardrail targeted at misinformation with this fake-news-article prompt, but it will also impact other areas like providing information on violence, hate speech, illegal acts, bomb-making, etc. This technique has been proven and tested on fifteen different open-source models, things like DeepSeek, Gemma, Llama, Qwen, etc., and it also seems to affect image generation models like Stable Diffusion.

It preserves the model's utility. There are some jailbreak techniques that will actually make the model stop performing against other prompts once the technique has been applied, or once the malicious prompts have come in. That is not the case here; the model will continue to perform against other tasks. It has what's called a judge-model feedback loop: the model generates multiple responses to a harmful prompt, which are then scored by a separate judge model that tells the LLM, "these are the preferred answers." That is a really interesting observation. It tracks back to some of the academic research that we've cited here previously on This Week in AI Security, research papers showing that LLMs are mathematically always prone to prompt injection, because there is no way to take the numbers aspect and align it to the intent aspect. And here's some proof around it. An interesting technique; as always, we have the article linked in the show notes. Next up: an AI coding platform that allowed a BBC reporter to be hacked.
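Before we move on, the judge-model feedback loop described above can be sketched roughly like this, with stub functions standing in for the real model calls (`generate` and `judge_score` are hypothetical placeholders, not APIs from the research):

```python
# Sketch of a judge-model feedback loop. generate() and judge_score()
# are illustrative stubs, not real LLM APIs.
def generate(prompt, n=4):
    """Stub: a real system would sample n responses from the target model."""
    return [f"candidate-{i}: " + "detail " * i for i in range(n)]

def judge_score(prompt, response):
    """Stub: a real judge model would rate each response; here, longer
    (more 'detailed') answers simply score higher, purely for illustration."""
    return len(response)

def preference_pair(prompt):
    """Rank candidates by judge score. The top-rated answer becomes the
    'preferred' training example that steers the model away from refusals."""
    ranked = sorted(generate(prompt), key=lambda r: judge_score(prompt, r),
                    reverse=True)
    return ranked[0], ranked[-1]  # (chosen, rejected)

chosen, rejected = preference_pair("create a fake news article ...")
```

In the real attack, pairs like these feed a fine-tuning step, so each round nudges the model a little further toward the judge's notion of a "good" (that is, compliant) answer.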

The platform in question here is called Orchids, which is a popular vibe coding platform that lets users describe applications or software they want to see built. A security researcher, whose name I unfortunately didn't capture in my notes, contacted the BBC and let them know: hey, this is actually a vulnerable platform. And so they had a reporter from the BBC sign up on the platform, and then with one specific little malicious code injection, the reporter's environment, where they had set up all of the vibe coding tools, was now vulnerable to attack. In fact, the security researcher proved it by planting text files on the reporter's system, gaining remote access, etc. This is, again, one of these speed-over-safety things that we've talked about any number of times on the show. The competition on LLMs, the competition on coding platforms, the competition on AI-as-a-service platforms from the cloud providers, the competition around being one of the leading database or data storage technologies that enterprises will use to couple their data with an LLM for training or for specific use cases, is intense. So there's a lot going on here. There are a lot of organizations moving really, really quickly, and sure enough, they risk losing sight of security along the way.

All right, moving on to our next story: AI-powered cyber attacks hit eighty-eight percent of legacy email security systems. I just wanted to call this out because it highlights one specific aspect. There isn't anything here particularly about an LLM or about the vulnerability of an LLM-powered system. What this refers to is that the speed and the scale of email attacks are at unprecedented levels. Why? Because it's really, really easy to create them now. LLMs allow attackers to create emails that look super authentic, emails that spam filters and malicious content filters really struggle with. The text can be almost perfect; the content and the context can be so, so good with AI-powered phishing emails, impersonation, deepfakes, or what have you. So the risks on this side are really, really high and only increasing. Just something to be aware of, with a little bit of data around it.
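To see why LLM-written lures slip past legacy defenses, consider a toy keyword filter of the kind older gateways rely on (the keyword list and sample emails are invented for illustration):

```python
# Toy keyword-based filter of the kind legacy email gateways use.
# The keyword list and sample emails are invented for illustration.
SUSPICIOUS = {"winner", "lottery", "urgent", "prince", "unclaimed"}

def legacy_filter(email_text):
    """Flag the email if any word matches a classic spam keyword."""
    words = (w.strip(".,!?") for w in email_text.lower().split())
    return any(w in SUSPICIOUS for w in words)

classic_spam = "URGENT!!! You are a lottery WINNER, claim your unclaimed prize"
polished_phish = ("Hi Dana, following up on yesterday's call. Could you "
                  "review the attached invoice before our 3pm sync? Thanks, Sam")

legacy_filter(classic_spam)    # True: trips on the keywords
legacy_filter(polished_phish)  # False: nothing for the filter to catch
```

The polished message reads like a normal business email, which is exactly what an LLM produces at scale; there's no keyword or broken grammar left for the filter to key on.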

Moving on to the next thing, around AI doctors and AI-powered health services. The headline of the article is: your AI doctor doesn't have to follow the same privacy rules as your real doctor. What this really refers to is that a lot of the LLM providers, OpenAI, Anthropic, Google, have all pushed out health offerings. This is everything from helping people diagnose some of their own symptoms to things like AI-powered therapist services. The interesting tension, or let's say contradiction, here is that AI-powered health services that are software-based generally have not, so far, fallen into the same category as healthcare provider services such as actual doctors, hospitals, and pharmacies. In the US and in many other parts of the world, there are information security, data security, and privacy regulations very specific to the healthcare industry. Healthcare data, or PHI as it's often called (protected health information), is some of the most sensitive data that can exist about a person, and so, rightfully, there are good regulations around keeping that data private. But that doesn't apply to a lot of these AI situations. So this is a risk that is out there, where again it's speed over safety.

There's going to be competition in this space. Will there be cases where people's personal information ends up in one of these systems and then leaks out? I don't know. Secondly, the risks from things that we talked about earlier, like hallucinations and misinformation that people might get from these services, are still pretty high as well. These LLMs give confident but potentially wrong answers, and that seems to be true for healthcare applications too. So there are any number of risks around this area. But the headline here is that the regulation does not apply in the same way that it does to real healthcare services, and, I think, in the same way that many users might expect it to.

All right. Moving on to our next story. We've talked about Open Claw any number of times on the last couple of weeks, as that whole ecosystem has spun up super fast, tons of adoption, lots of security risks. I'm not going to go back into all all of them. Just one more kind of info stealer. Targeting this ecosystem looks to be a variant of the so-called Vidar info stealer targeting open JSON targeting device. Jason MD, which is kind of the details of the agent's core operational principle. So that'll be things like behavioral guidelines, ethical guardrails, boundaries, etc.. Um, but in these files we'll also see be some credentials, things like the gateway authentication token that allows an attacker to access the local instance of open claw if that port is exposed and connected to the internet. So nothing super new here. You know, we talked about skills previously. And you know, here we have another kind of add on tool planted into that ecosystem for an exfiltration of data purposes.

On the same note: the creator of OpenClaw, Peter Steinberger, has agreed to join OpenAI. That's been publicized widely. OpenAI is stating that the OpenClaw project will live on in a foundation, as an open-source project that OpenAI will continue to support and nurture. Something to keep an eye on there; whether they'll actually be able to bring a little more in the way of security guardrails, guidance, and structure around it remains to be seen.

Next, two last stories at a bit of a geopolitical level. Reportedly, Claude was used in the US military's raid on Venezuela that culminated in the capture of Nicolas Maduro and his removal to the United States to face legal charges. That's not to say Claude was used directly; by all reports, it was used via the Palantir platform. Palantir, for those that don't know, is a big data analysis firm that does a lot of work with defense agencies around the world and apparently has a partnership with Anthropic, so you can use Claude as one of the back-end systems. Anthropic has not commented on this, and I haven't seen any real comment from Palantir either. This was a story leaked to a reporter over at the Guardian. We'll have the link in the show notes.

But this leads into our last story of the week, which is, again, at a somewhat philosophical level. I like to close out these episodes with a little bit of big-picture thinking: what does all this evolution mean, and what are we going to see in the world around us? The International AI Safety Report 2026 was recently released and publicized by a group called ASIS International. It talks about emerging risks in three core areas around AI. The first one I want to highlight is the lower barrier to biological weapons. AI systems are now advanced enough to provide detailed experimental protocols and molecular designs. This aids in discovery, but it also allows actors with enough expertise to do things like remove ethical boundaries and guardrails from an LLM and use it for that purpose. So that's one area, and it also just accelerates production; certainly with open-source models, it's easy enough to remove guardrails. The other two areas are deepfake-enabled fraud and the difficulty of safety research.

It's been widely reported that deepfakes are on the rise, with LLMs able to create very, very convincing deepfakes of many different types, whether text-based, image-based, or even video-based. And then there's the difficulty of safety research. One of the big challenges is that everybody knows there are a lot of risks, but classifying and quantifying the scale, the degree, and the specific type of risk is very, very difficult. In this report, they highlight that it's a little bit like asking five people and getting ten answers: there's a lot of concern about different categories, and the safeguards in place right now are very fragile. And as we reported in the first story of this week, we now know that safeguards are not always effective. So a little bit of a think piece there, with some interesting stuff to get into if you want to digest it for yourself; it's linked in the show notes. We'll talk to you next week with another episode of This Week in AI Security. For now, I'm your host, Jeremy, signing off. Bye bye.

Protect your AI Innovation

See how FireTail can help you to discover AI & shadow AI use, analyze what data is being sent out and check for data leaks & compliance. Request a demo today.