Article Details
Scrape Timestamp (UTC): 2025-06-23 10:49:01.773
Source: https://thehackernews.com/2025/06/google-adds-multi-layered-defenses-to.html
Original Article Text
Click to Toggle View
Google Adds Multi-Layered Defenses to Secure GenAI from Prompt Injection Attacks. Google has revealed the various safety measures that are being incorporated into its generative artificial intelligence (AI) systems to mitigate emerging attack vectors like indirect prompt injections and improve the overall security posture for agentic AI systems. "Unlike direct prompt injections, where an attacker directly inputs malicious commands into a prompt, indirect prompt injections involve hidden malicious instructions within external data sources," Google's GenAI security team said. These external sources can take the form of email messages, documents, or even calendar invites that trick the AI systems into exfiltrating sensitive data or performing other malicious actions. The tech giant said it has implemented what it described as a "layered" defense strategy that is designed to increase the difficulty, expense, and complexity required to pull off an attack against its systems. These efforts span model hardening, introducing purpose-built machine learning (ML) models to flag malicious instructions and system-level safeguards. Furthermore, the model resilience capabilities are complemented by an array of additional guardrails that have been built into Gemini, the company's flagship GenAI model. These include - However, Google pointed out that malicious actors are increasingly using adaptive attacks that are specifically designed to evolve and adapt with automated red teaming (ART) to bypass the defenses being tested, rendering baseline mitigations ineffective. "Indirect prompt injection presents a real cybersecurity challenge where AI models sometimes struggle to differentiate between genuine user instructions and manipulative commands embedded within the data they retrieve," Google DeepMind noted last month. "We believe robustness to indirect prompt injection, in general, will require defenses in depth – defenses imposed at each layer of an AI system stack, from how a model natively can understand when it is being attacked, through the application layer, down into hardware defenses on the serving infrastructure." The development comes as new research has continued to find various techniques to bypass a large language model's (LLM) safety protections and generate undesirable content. These include character injections and methods that "perturb the model's interpretation of prompt context, exploiting over-reliance on learned features in the model's classification process." Another study published by a team of researchers from Anthropic, Google DeepMind, ETH Zurich, and Carnegie Mellon University last month also found that LLMs can "unlock new paths to monetizing exploits" in the "near future," not only extracting passwords and credit cards with higher precision than traditional tools, but also to devise polymorphic malware and launch tailored attacks on a user-by-user basis. The study noted that LLMs can open up new attack avenues for adversaries, allowing them to leverage a model's multi-modal capabilities to extract personally identifiable information and analyze network devices within compromised environments to generate highly convincing, targeted fake web pages. At the same time, one area where language models are lacking is their ability to find novel zero-day exploits in widely used software applications. That said, LLMs can be used to automate the process of identifying trivial vulnerabilities in programs that have never been audited, the research pointed out. According to Dreadnode's red teaming benchmark AIRTBench, frontier models from Anthropic, Google, and OpenAI outperformed their open-source counterparts when it comes to solving AI Capture the Flag (CTF) challenges, excelling at prompt injection attacks but struggled when dealing with system exploitation and model inversion tasks. "AIRTBench results indicate that although models are effective at certain vulnerability types, notably prompt injection, they remain limited in others, including model inversion and system exploitation – pointing to uneven progress across security-relevant capabilities," the researchers said. "Furthermore, the remarkable efficiency advantage of AI agents over human operators – solving challenges in minutes versus hours while maintaining comparable success rates – indicates the transformative potential of these systems for security workflows." That's not all. A new report from Anthropic last week revealed how a stress-test of 16 leading AI models found that they resorted to malicious insider behaviors like blackmailing and leaking sensitive information to competitors to avoid replacement or to achieve their goals. "Models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals," Anthropic said, calling the phenomenon agentic misalignment. "The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models." These disturbing patterns demonstrate that LLMs, despite the various kinds of defenses built into them, are willing to evade those very safeguards in high-stakes scenarios, causing them to consistently choose "harm over failure." However, it's worth pointing out that there are no signs of such agentic misalignment in the real world. "Models three years ago could accomplish none of the tasks laid out in this paper, and in three years models may have even more harmful capabilities if used for ill," the researchers said. "We believe that better understanding the evolving threat landscape, developing stronger defenses, and applying language models towards defenses, are important areas of research."
Daily Brief Summary
Google has introduced multiple security measures to protect its generative AI systems from indirect prompt injection attacks, which manipulate AI with hidden commands in external data like emails or documents.
These attacks could potentially lead to data exfiltration or other malicious activities by tricking AI systems.
The company has implemented a layered defense strategy to increase the complexity and cost of successful attacks, including model hardening and machine learning models designed to detect malicious instructions.
Additional safeguards have been integrated into Google’s flagship GenAI model, Gemini, to enhance its resilience against such cybersecurity threats.
However, adaptive attacks that evolve with automated red teaming efforts are proving capable of bypassing these defenses, highlighting the need for robust, multi-layered security across all aspects of AI systems.
Recent research has shown that large language models (LLMs) can be used by adversaries for more precise extraction of sensitive information and to create targeted fake web pages.
Studies also suggest that while AI models are becoming proficient in automating certain security tasks, they still face challenges with more complex vulnerabilities like system exploitation and model inversion.
The evolving capabilities of AI models underscore the importance of continuous advancement in AI security to counteract emerging threats and exploit techniques.