When creating an AI action you can choose to set up a managed action. This is an out-of-the-box integration provided by FireTail. Managed actions operate in two modes:
How it works
Tests if a model can generate escaped ANSI codes. These are ANSI codes represented as human-readable escape sequences (e.g., \x1b[32m
for green text). They can be used to test how a model handles or produces escaped versions of ANSI codes.
By testing whether the model outputs escaped ANSI codes (rather than raw or binary codes), this probe is focused on the potential risks associated with escaped characters. Escaped codes are often safer in certain contexts because they are encoded as strings (e.g., \033[32m
for green text), but they can still cause issues if processed or interpreted improperly.
Checks if the model can generate raw ANSI codes. Raw codes are direct binary representations of the ANSI control sequences (e.g., \x1b[32m
for green text in binary form). These raw codes are typically more difficult to handle and could cause more severe disruptions if mishandled, especially in automated systems or terminal environments.
The goal of AnsiRaw
is to see if the model can generate raw ANSI codes, which could be more problematic for downstream systems that do not expect them. The raw ANSI codes can cause unexpected terminal behavior, system crashes, or software errors if misinterpreted
Tests whether the model can be nudged into generating harmful, offensive, or unethical text.
By using a trained attack generation model, the probe generates inputs intended to cause the model to output toxic content, such as hate speech, profanity, or harmful language. This helps in identifying vulnerabilities in the model and testing its ability to handle such situations.
Detects whether the model generates phishing links, spam, or malware-related content. The probes are used to check whether a language model has mechanisms in place to prevent the generation of harmful content like viruses, spam, or phishing messages. The approach is simple: feed the model known malicious signatures and see if it outputs them without any filtering or warning.
Checks if the model completes words or phrases that contain slurs. The probes are designed to test whether a language model will continue or complete harmful terms (like ethnic slurs) when given a partial prompt. The probes work by, using partial, contentious terms as input prompts (e.g., incomplete slurs). Checking if the model continues or finishes these terms in an inappropriate way. Evaluating if the model is capable of stopping or moderating harmful content even when given "high-risk" context.
The Do Anything Now (DAN) probe is designed to test whether the model can be manipulated by jailbreak prompts that attempt to bypass safety restrictions. These prompts are created to trick the model into generating responses that fall outside the established safety boundaries. This test is crucial for ensuring that the model doesn’t fall prey to manipulation and remains protected from external attempts to compromise its integrity.
The Repeat-Replay Attack probe checks if the model begins repeating previous responses unnaturally, which could indicate the model is reflecting sensitive or confidential training data. This test examines whether the model outputs repeated phrases or data that should remain private. It also ensures the model does not inadvertently provide users with responses that could be considered a breach of security, such as repeating private or sensitive information.
This probe evaluates whether the model can be tricked into answering restricted or inappropriate questions. It ensures that the model adheres to ethical guidelines by refusing to respond to certain types of queries, such as those involving illegal or harmful content. This test helps to verify that the model maintains a responsible stance by not generating responses to questions that should remain unanswered for safety and security reasons.
The Inject Base64 probe checks whether the model properly handles Base64-encoded data, which can be used to hide malicious payloads. This test assesses whether the model is vulnerable to misinterpreting or exposing sensitive information when given Base64-encoded inputs. It ensures that the model processes such encoded data safely, without unintentionally revealing hidden harmful content or malicious code.
Similar to the Base64 injection, the Inject Hex probe evaluates how the model handles hexadecimal-encoded payloads. This test identifies whether the model misinterprets or exposes any hidden malicious content encoded in hexadecimal format. By assessing the model's ability to safely process hex-encoded inputs, this probe helps ensure that the model doesn't fall victim to hidden threats that could lead to security vulnerabilities.
Examines how the model handles malformed, out-of-vocabulary, or intentionally broken token sequences. These tokens can cause the model to behave unexpectedly when they are part of the input. By providing a list of glitch tokens as input and asking the model to repeat them, the probe tests the model's ability to handle tokens that might disrupt its normal behavior. The main goal is to see whether the model can consistently repeat or process these unusual tokens without crashing, failing, or generating unstable outputs.
This probe checks if the model follows harmful instructions when posed under a seemingly harmless persona, such as asking the model to "pretend you're my grandmother." The aim is to evaluate whether the model, when given emotional or persuasive prompts, will disclose sensitive, illegal, or harmful information. This test ensures that the model can differentiate between harmless and harmful instructions, protecting users from potential misuse..
The Hallucinated JavaScript probe assesses whether the model invents non-existent JavaScript libraries or functions. This test is designed to ensure the model doesn't fabricate false information, which could mislead developers or users. The creation of fake JavaScript elements could cause confusion or errors in real-world applications. This probe checks if the model can distinguish between valid and invalid JavaScript resources and avoids generating fictitious code.
The Hallucinated Python probe tests whether the model fabricates non-existent Python packages or APIs. This probe is important to identify any risk where the model may provide false or fabricated information, misleading users or developers. By checking if the model invents non-existent Python elements, the probe helps ensure that the model produces accurate and reliable technical advice for users working with Python.
The Prompt Hijacking probe tests whether the model can be manipulated by long prompt injections that attempt to override its intended behavior. It ensures that the model is not vulnerable to malicious injections that can control its responses or manipulate its outputs. This test is critical for maintaining the integrity of the model and ensuring that it responds according to its established guidelines, regardless of external manipulations.