This week, Jeremy breaks down a sophisticated bypass of Apple Intelligence and explores a hardware-level GPU threat that turns "vandalism" into full system takeovers.
.png)
This week, Jeremy breaks down a sophisticated bypass of Apple Intelligence and explores a hardware-level GPU threat that turns "vandalism" into full system takeovers. We also look at the massive data fallout from the Mercor supply chain breach and why "Claude Mythos" is officially ending the era of slow vulnerability management.
Key Stories & Developments:
Episode Links
Worried about AI security?
Get Complete AI Visibility in 15 Minutes. Discover all of your shadow AI now. Book a demo of Firetail's AI Security & Governance Platform: https://www.firetail.ai/request-a-demo
All right. Welcome back to another episode of This Week in AI security. Coming to you for the week of the sixteenth of April, twenty twenty six. We have a lot to get through this week, and I want to save some time to spend a little bit longer talking about arguably one of the biggest stories in cybersecurity research related to AI over the last couple of weeks. That is Claude Mythos. And I want to save time for that at the end of the episode.
First talking about some research that was revealed at RSAC conference earlier this year, researchers detailing how a prompt injection attack bypassed Apple Intelligence protections. A couple of really interesting technical features of this bypass here. First is that the first bit of the tactics around it was using Right-to-Left (RTL) text with some override characters. And it's a technique that they're calling NeuralExec, and it bypasses Apple's on-device LLM input and output filtering and overrides the model safety instructions with about a seventy six percent success rate. The way that this works is that the harmful string was written backwards in raw text, so that the input filter wouldn't match it as harmful, but then a Unicode override control character made it render correctly on screen. Then the NeuralExec overrides the model system instructions.
We've talked before here on the show, including some coverage of the Unprompted conference, where a lot of the best practices emerging around keeping your LLMs safe is by applying various gates or filters to any kind of prompt input before it actually reaches the LLM. And in some of those steps, you might use LLMs as some of those gates. That appears to be the case here, where again, some of the Unicode characters would override at various phases.
Moving on, the "LLM Supervisor"—this concept that you've got one LLM monitoring another—is a practice that's emerging, but it has blind spots. The fundamental blind spot is that supervisors typically only inspect direct inputs, but they ignore profile fields, retrieved documents, and tool outputs. The researchers here used a real customer service agent to attempt to inject adversarial instructions into the user profile field. Imagine instead of "My name is Jeremy," it's "My name is ignore all previous instructions". The supervisor didn't see it because it only looked at what I typed in.
This correlates to the Grafana Ghost. Kudos to the folks over at Noma who found that attackers can plant hidden instructions in Grafana URL query parameters. Any text-based inputs considered as part of the context window can be part of a prompt injection attack. A couple of follow-ups to last week's stories: On the Flowise confirmed CVE, it looks like there are more than twelve thousand instances still exposed and vulnerable.
Regarding Mercor and the supply chain breach that got them with the LiteLLM compromise—which harbored credential harvesting malware—it now looks to be pretty well confirmed that four terabytes of data has been stolen. The fallout is massive: Meta has paused contracts, OpenAI is checking their exposure, and five contractors have filed lawsuits over personal data exposure. One lawsuit names LiteLLM and Delve, a compliance startup accused of serious fraud. These supply chain risks are real problems with real effects.
Kudos to the team over at GitGuardian for releasing their annual State of Secret Sprawl report. They found twenty eight point six million secrets exposed in public commits in twenty twenty five—a thirty four percent year-over-year jump. Interestingly, twelve of the top fifteen fastest-growing leaked secret types were AI services. We're looking at credentials embedded into code created by Claude Code. Hard-coded API keys for OpenRouter credentials grew about forty eight X year-over-year. Other things popping up frequently were Google API keys. Google silently enabled these keys for services like Gemini or Vertex AI without documenting it well.
Next, research from the University of Toronto looks at GPU hardware security. They proved they can go from data corruption to system takeover. A well-known exploit called GPU Rowhammer was mostly used for vandalism, but a new technique dubbed GPU-Breach changes the game by using bit flips to corrupt GPU page tables. Once you corrupt the page tables, you gain "God-mode" arbitrary read/write over GPU memory. This allows potential cryptographic theft and model sabotage—researchers degraded model accuracy from eighty percent to zero.
Moving to the "Jagged Frontier": An analysis found that smaller models like GPT-5.4 Nano are very effective at certain vulnerability analysis tasks and ineffective at others. They found a twenty-year-old kernel buffer overflow humans missed, but failed to identify invalid JSON. OpenAI also released their Trusted Access for Cyber (TAC) program, scaling to thousands of verified defenders to do source code analysis. They've already identified over three thousand critical vulnerabilities.
Finally, let's talk about Claude Mythos. In twenty twenty five, AI-generated vulnerability reports were ninety percent bogus; by twenty twenty six, that has flipped—almost all reports from Mythos are valid. This is not an AI security problem; it is a vulnerability management problem exacerbated by AI. We can now find almost all vulnerabilities in no time, which puts the focus squarely on the required speed of patching. This is vulnerability management on rocket fuel.
To end on a little whimsy: The organization Lantern launched an agent called Luna and gave it one hundred thousand dollars to build a boutique. It was super disorganized—it failed to schedule staff, panic-texted employees at the last minute, and rejected applicants for having computer science degrees.
That's it for today. We'll talk to you next week on This Week in AI Security. Bye bye.