1. Prompt-layer attacks
8 categories · ⚡
Adversarial inputs that subvert the model's instructions — the most common runtime attack class. SkyDaemon runs 14-language, multimodal, encoded-payload, and indirect-injection detectors at every prompt boundary.
Direct prompt injection
User overrides system instructions inside a prompt. 14-language corpus + ML classifier. OWASP LLM01 · AML.T0051
HIGH
Indirect prompt injection
Adversarial directives hidden in retrieved content (RAG, tool output, web fetch). OWASP LLM01 · AML.T0051.001
CRITICAL
Jailbreak / persona bypass
DAN, STAN, JANUS, AIM, grandma-trick. 50+ documented families. OWASP LLM01 · AML.T0054
HIGH
Obfuscated / encoded payload
Base64, leet, homoglyph (Cyrillic/Latin mix), zero-width — decoded text re-tested.
MEDIUM
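A minimal sketch of the "decode, then re-test" idea behind the obfuscated-payload category. The `SUSPICIOUS` regex is a hypothetical stand-in for a real injection detector, and only zero-width stripping, NFKC normalization, and Base64 decoding are shown; leet and Cyrillic/Latin homoglyph folding would need a separate confusables map.

```python
import base64
import binascii
import re
import unicodedata

# Hypothetical stand-in for the real injection detector.
SUSPICIOUS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

# Zero-width characters mapped to None so str.translate() deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_texts(text):
    """Return the input plus normalized and Base64-decoded variants."""
    stripped = unicodedata.normalize("NFKC", text.translate(ZERO_WIDTH))
    variants = [text, stripped]
    for token in B64_TOKEN.findall(stripped):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 or not text; skip it
        variants.append(decoded)
    return variants


def is_suspicious(text):
    """Run the detector over every variant, not just the raw input."""
    return any(SUSPICIOUS.search(v) for v in candidate_texts(text))
```

Re-testing every decoded variant means a payload that is harmless-looking on the wire is still evaluated in the form the model would effectively read.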
Multilingual prompt injection
Translated injection in 14 native languages + 30 multilingual + 60+ via NMT-bridge.
HIGH
Image / multimodal injection
Text smuggled in image pixels. Claude Vision OCR + steganography detection.
HIGH
System prompt extraction probe
Reflection / encoding / translation probes that try to elicit hidden instructions.
HIGH
System prompt actually disclosed
Output similarity vs registered system prompt. Confirmed leakage event.
HIGH
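One way to approximate "output similarity vs registered system prompt" is word n-gram containment: the fraction of the system prompt's 5-word shingles that reappear in the output. This is an illustrative sketch, not SkyDaemon's actual metric; the function names and the shingle size are assumptions.

```python
import re


def shingles(text, n=5):
    """Set of lowercase word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leak_score(system_prompt, output, n=5):
    """Fraction of system-prompt shingles that also occur in the output."""
    ref = shingles(system_prompt, n)
    if not ref:
        return 0.0
    return len(ref & shingles(output, n)) / len(ref)
```

A score near 1.0 suggests near-verbatim disclosure; where to set the alert threshold is a tuning decision this sketch does not settle.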
2. Data protection & leakage
9 categories · 🔒
Sensitive data on the wire — credentials, PII / PHI / PCI, source code, copyrighted material — flowing into or out of model calls.
Secret / credential in prompt
AWS, OpenAI, Anthropic, GitHub, JWT, PEM. 20+ credential regexes + entropy.
CRITICAL
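The "credential regexes + entropy" combination can be sketched as follows. The three patterns are illustrative only (the catalog mentions 20+), and the 24-character minimum and 4.0-bit entropy threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; a production rule set would be far larger.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
}
TOKEN = re.compile(r"[A-Za-z0-9+/_-]{24,}")


def shannon_entropy(s):
    """Shannon entropy of the string in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def find_secrets(text, min_entropy=4.0):
    """Regex hits first, then long high-entropy tokens the regexes missed."""
    hits = [(name, m.group()) for name, rx in PATTERNS.items()
            for m in rx.finditer(text)]
    known = {value for _, value in hits}
    for m in TOKEN.finditer(text):
        tok = m.group()
        if tok not in known and shannon_entropy(tok) >= min_entropy:
            hits.append(("high_entropy", tok))
    return hits
```

The entropy pass is what catches credential formats no regex anticipates, at the cost of occasional false positives on hashes and random identifiers.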
Secret in model output
Model echoed credential back. Critical exfiltration event.
CRITICAL
PII / PHI / PCI in prompt
SSN, CC#, MRN, IBAN, passport pasted into LLM input. GDPR Art. 32.
HIGH
PII / PHI / PCI in model output
Output-side scanning + memorization heuristic. GDPR / HIPAA / PCI-DSS apply.
CRITICAL
Proprietary source / IP exfil
Internal code / schema pasted to external model. AST hash + canary fingerprint.
HIGH
Training data extraction
Model emits memorized training samples (canary registry + low-perplexity detector).
HIGH
Markdown / image exfil channel
SkyDaemon original. Outbound URL with high-entropy query string carrying conversation context.
HIGH
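A rough sketch of the markdown-image exfil check: find image URLs in model output and flag those whose query string carries a long, high-entropy value. The length and entropy thresholds are assumptions, and the markdown regex deliberately handles only the simple `![alt](url)` form.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Captures the URL of a markdown image up to the first space or ")".
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def entropy(s):
    """Shannon entropy in bits per character."""
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())


def exfil_candidates(markdown, min_len=20, min_entropy=3.5):
    """Return image URLs with a long, high-entropy query parameter."""
    flagged = []
    for url in MD_IMAGE.findall(markdown):
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue
        for _, value in parse_qsl(parts.query):
            if len(value) >= min_len and entropy(value) >= min_entropy:
                flagged.append(url)
                break  # one suspicious parameter is enough
    return flagged
```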
Cross-border data flow / data residency
SkyDaemon differentiator. Combines content-side PII with destination geo. GDPR-grade evidence.
HIGH
Embedding inversion
Reverse-engineer text from vector embeddings. Bulk export anomaly + vec2text.
MEDIUM
3. Agent behavior & agency
7 categories · 🤖
Detections specific to autonomous agents — excessive agency, confused-deputy, runaway loops, denial-of-wallet.
Excessive agency
Agent takes high-impact action beyond policy budget — write, send, transfer, deploy, pay.
CRITICAL
Confused deputy
SkyDaemon original. Per-action provenance trace. Privileged action originating from non-user-turn content.
HIGH
Runaway agent loop
Step / token / cost ceilings breached. Adaptive baseline + global cap.
HIGH
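"Adaptive baseline + global cap" for a runaway loop could look like the sketch below: each metric's limit is the historical mean plus a few standard deviations, clamped by a hard cap. The `Ceiling` class, the metric names, and the 3-sigma default are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Ceiling:
    global_cap: float    # hard limit that always applies
    sigmas: float = 3.0  # width of the adaptive band

    def limit(self, history):
        """Adaptive limit from past runs, never above the global cap."""
        if len(history) < 2:
            return self.global_cap  # not enough data for a baseline
        adaptive = mean(history) + self.sigmas * stdev(history)
        return min(adaptive, self.global_cap)


def breaches(run, history, ceilings):
    """Names of metrics (steps, tokens, cost...) where the run exceeds its limit."""
    return [metric for metric, ceiling in ceilings.items()
            if run[metric] > ceiling.limit([h[metric] for h in history])]
```

With no history the global cap is the only guard, which keeps a brand-new agent bounded from its first run.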
Denial of Wallet (cost / token abuse)
SkyDaemon differentiator. Cost-anomaly framed as SOC-grade event with dollar SLA.
HIGH
Model denial of service
Crafted input exhausts context / GPU / queue. Token-length outlier + repetition.
HIGH
Memory / RAG poisoning attempt
Instruction-shaped content destined for the agent's persistent memory layer.
HIGH
Anomalous user behavior (UEBA)
Composite score on token rate + topic mix + refusal rate vs 30-day baseline.
MEDIUM
4. Supply chain & MCP
6 categories · 📦
MCP servers, model artifacts, and AI dependencies bring untrusted code and instructions into the agent loop.
MCP tool description poisoning
Tool description / schema contains hidden instructions to coerce the agent.
CRITICAL
Untrusted MCP server
Endpoint outside allow-list, or anomalous TLS / cert state.
HIGH
Over-permissioned tool / MCP
IAM analyzer + 14-day usage telemetry recommends minimum-privilege manifest.
MEDIUM
Malicious / untrusted model artifact
Pickle/joblib RCE detection + missing signature + hash mismatch against the registry.
CRITICAL
Vulnerable AI dependency
CVE in transformers, langchain, llama.cpp, vLLM. SBOM + OSV + GitHub Advisories + KEV.
HIGH
RAG / vector store poisoning
Hostile documents inserted to steer retrieval. Injection-corpus scan + cluster outlier.
HIGH
5. Content safety
7 categories · ⚠
Output-side moderation across the major harm categories. First-party classifiers plus Llama-Guard-class output filters.
Harmful content (composite)
9-category multi-label classifier — violence, self-harm, hate, sexual, weapons, illicit, phishing, profanity.
HIGH
Weapons / CBRN uplift
Operational uplift for chemical / biological / radiological / nuclear / cyber weapons.
CRITICAL
Self-harm encouragement
Suicide, self-harm, eating-disorder coaching. Crisis-resource fallback recommended.
HIGH
Suspicious URL in response
Disposable TLD, IPFS-hosted payload, typosquat, bare-IP URL, direct executable.
HIGH
Unsafe output rendering (XSS / SSRF / SQLi)
OWASP Top-10 web-attack signatures applied to the rendering target.
HIGH
Off-topic / out-of-scope
Per-agent topic allow-list. Topic classifier evaluates each turn.
LOW
Custom compliance / brand violation
Customer-defined policy DSL — competitor mention, financial advice, contract terms.
MEDIUM
6. Trustworthiness & alignment
5 categories · ⚖
Hallucination, factual misinformation, bias, IP-leakage, model behavior drift.
Ungrounded / hallucinated response
Claude-as-judge groundedness score below threshold. RAG faithfulness violation.
HIGH
Factual misinformation
Fact-check verifier flags ≥ 1 high-confidence false claim.
MEDIUM
Biased / discriminatory output
Counter-factual disparate-output evaluator across protected attributes.
MEDIUM
Copyright / IP verbatim reproduction
Output fingerprint vs copyrighted-corpus index.
MEDIUM
Model behavior drift / silent swap
SkyDaemon differentiator. KS-test on output-embedding distribution + golden-set probe.
MEDIUM
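The KS-test part of the drift check can be sketched in pure Python. Embeddings are multi-dimensional, so this example assumes each output has already been reduced to one scalar (for instance its cosine similarity to a reference centroid); that reduction, and the function names, are assumptions rather than the documented method.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def drifted(baseline, current, alpha=0.05):
    """Reject 'same distribution' using the asymptotic KS critical value."""
    n, m = len(baseline), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical
```

A rejection says only that the output distribution moved; a golden-set probe is still needed to tell a silent model swap from a change in traffic.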
7. Adversarial ML
5 categories · ⚔
Classical adversarial-ML attacks — evasion, model extraction, inversion, poisoning.
Adversarial evasion
GCG-style adversarial-suffix matcher + perceptual perturbation classifier.
HIGH
Model extraction / stealing
High-volume diverse probing to clone behavior. Tramèr-style attack signature.
HIGH
Model inversion attack
Repeated queries reconstruct training-data attributes.
HIGH
Embedding inversion
Bulk embedding-API queries from non-app caller. vec2text-style threat.
MEDIUM
Training data poisoning
Malicious samples in fine-tune / training. Hash diff + outlier cluster + canary registry.
CRITICAL
8. Posture, shadow AI & identity
8 categories · 🎯
AI-SPM coverage: shadow AI, exposed inference endpoints, misconfiguration, overprivileged AI identities.
Shadow AI application
Unsanctioned AI/LLM tool in use. Egress-traffic + DNS + SaaS-catalog match.
MEDIUM
Shadow / unmanaged model
Self-deployed Ollama / vLLM / llama.cpp running off-radar.
MEDIUM
Publicly exposed inference API
Bedrock / SageMaker / Vertex / Ollama route open to 0.0.0.0/0.
HIGH
AI service misconfiguration
CSPM rule pack — invocation logging off, region drift, missing auth, weak rate-limit.
MEDIUM
AI identity over-privilege
Service principal holds permissions it has not used within the 14-day window.
HIGH
Unowned high-risk agent
Risk score ≥ 60 + no owner_email registered. SOC 2 / ISO 42001 governance breach.
MEDIUM
Missing invocation logging
No invocation telemetry for ≥ 24 h on a production-registered agent.
MEDIUM
Cloud misconfig in AI service
Bedrock guardrails off, SageMaker without VPC, Vertex without CMEK, etc.
MEDIUM
9. Composite kill-chains
3 categories · 🔗 · SkyDaemon-original
Cross-stage MITRE-ATLAS-style chains, found by a graph walk that no atomic detector can produce alone.
Indirect injection → confused deputy → external action
Hostile RAG content triggered indirect injection, produced a confused-deputy tool call, resulted in external action.
CRITICAL
Secret in input → exfil channel → confirmed egress
Same session: secret in prompt + markdown-image-exfil + outbound HTTP fetch matching exfil URL.
CRITICAL
Recon → jailbreak → harm output
System-prompt extraction probes + successful jailbreak + harmful-content emission in one session.
HIGH
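At its simplest, a composite chain is an ordered pattern over one session's atomic detections. The sketch below checks whether the chain's stages occur in order, not necessarily back to back; the event names are hypothetical labels, and a real graph walk would also track provenance between stages.

```python
def matches_chain(events, chain):
    """True if the stages of `chain` appear in `events` in order (gaps allowed)."""
    remaining = iter(events)
    # Each any() consumes the shared iterator up to the matching event,
    # so later stages can only match later events.
    return all(any(event == stage for event in remaining) for stage in chain)


RECON_TO_HARM = ["system_prompt_probe", "jailbreak", "harmful_output"]
```

For example, `matches_chain(session_events, RECON_TO_HARM)` would be evaluated once per session after each new detection.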
10. Alignment & deception
7 categories · 🎯
Frontier-AI safety surface. Sleeper agents, sycophancy, mesa-optimization, sandbagging — drawn from Anthropic, Apollo Research, UK AISI methodology.
Sleeper-agent / backdoor trigger
Anthropic-2024 failure mode. Output distribution shift > 4σ when trigger appears.
CRITICAL
Sycophancy drift
Paired-prompt evaluator: answers to opinion-loaded vs neutral framings diverge by > 0.3.
MEDIUM
Sandbagging / eval-gaming
Performance delta > 15% between covert prod probes and overt eval-context.
HIGH
Goal misgeneralization
Tool-call distribution KL-divergence vs registered task profile > 0.5.
MEDIUM
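The tool-call KL-divergence check might be computed as in this sketch, which compares an agent's observed tool calls against its registered profile. The catalog does not say whether the 0.5 threshold is in nats or bits; natural log and additive smoothing are assumptions here, as are the tool names.

```python
import math
from collections import Counter


def kl_divergence(observed_calls, profile, eps=1e-6):
    """KL(observed || profile) in nats over tool names, with additive smoothing."""
    counts = Counter(observed_calls)
    tools = set(counts) | set(profile)
    if not tools:
        return 0.0
    p_total = sum(counts.values()) + eps * len(tools)
    q_total = sum(profile.values()) + eps * len(tools)
    kl = 0.0
    for tool in tools:
        p = (counts[tool] + eps) / p_total          # observed share
        q = (profile.get(tool, 0.0) + eps) / q_total  # registered share
        kl += p * math.log(p / q)
    return kl
```

Smoothing keeps the score finite when the agent calls a tool that is absent from its profile, which is exactly the case the detector cares about.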
Reward hacking / specification gaming
Loophole exploitation that maximizes proxy without producing intended outcome.
HIGH
Deceptive chain-of-thought
Apollo-Research-style CoT-vs-answer perturbation inconsistency. Faithfulness < 0.6.
HIGH
Capability elicitation attack
Systematic capability-probing detection.
MEDIUM
11. Advanced model supply chain
7 categories · 🔐
Deep-supply-chain risks specific to model artifacts — pickle RCE, GGUF tampering, watermark stripping, training-data canary leakage, FL poisoning.
Pickle / Joblib deserialization RCE
Pickle-opcode walk for dangerous reducers (os.system, eval, exec, subprocess.*).
CRITICAL
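A pickle-opcode walk can be sketched with the standard library's `pickletools.genops`, which parses a pickle stream without executing it. The deny-list below is a small illustrative subset, and resolving `STACK_GLOBAL` from the two most recent string pushes is a heuristic that a crafted pickle using memo lookups could evade.

```python
import pickletools

DANGEROUS = {("os", "system"), ("posix", "system"), ("nt", "system"),
             ("builtins", "eval"), ("builtins", "exec")}
DANGEROUS_MODULES = {"subprocess"}  # any callable from these modules


def dangerous_globals(data):
    """List dangerous module.name references in a pickle, without loading it."""
    found = []
    strings = []  # string arguments seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)  # arg is "module name"
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            module, name = strings[-2], strings[-1]
        else:
            if isinstance(arg, str):
                strings.append(arg)
            continue
        if (module, name) in DANGEROUS or module.split(".")[0] in DANGEROUS_MODULES:
            found.append(f"{module}.{name}")
    return found


# Hand-written protocol-0 pickle that would call os.system; it is never loaded.
payload = b"cos\nsystem\n(S'echo hi'\ntR."
print(dangerous_globals(payload))  # prints ['os.system']
```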
GGUF / safetensors tampering
Hash mismatch + activation-clustering / neural-cleanse / STRIP backdoor scan.
CRITICAL
Watermark stripping / removal
SynthID / Stable Signature / C2PA missing on output from watermarking-enabled model.
MEDIUM
Training-data canary leak
Verbatim emission of a registered canary string: legally admissible evidence of inclusion.
HIGH
Federated-learning poisoning
Client gradient L2 norm / cohort-mean cosine drift > threshold.
HIGH
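The two federated-learning signals named above, gradient L2 norm and cosine similarity to the cohort mean, can be sketched like this. The thresholds and the use of the median norm as the "typical" size are assumptions; note also that the cohort mean here includes the suspect clients, which a more robust version would avoid.

```python
import math
from statistics import median


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    denom = l2_norm(u) * l2_norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0


def suspicious_clients(updates, max_norm_ratio=3.0, min_cosine=0.0):
    """Flag clients whose update is oversized or points away from the cohort mean."""
    vectors = list(updates.values())
    cohort_mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    typical_norm = median(l2_norm(v) for v in vectors)
    flagged = set()
    for client, v in updates.items():
        too_large = l2_norm(v) > max_norm_ratio * typical_norm
        misaligned = cosine(v, cohort_mean) < min_cosine
        if too_large or misaligned:
            flagged.add(client)
    return flagged
```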
Model distillation / knowledge theft
Volume + diversity + log-likelihood-harvesting score above tenant baseline.
HIGH
Membership inference attack
Shokri-style attack signature — query pattern with confidence-score harvesting.
HIGH
12. Advanced encoding & obfuscation
4 categories · 🔠
Modern injection vectors that bypass first-generation filters — Unicode tag-character smuggling, ASCII steganography, EXIF metadata, RAG cache poisoning.
Unicode tag-character injection
Hidden text in U+E0000–U+E007F block (Riley Goodside 2024 attack).
HIGH
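Tag-character smuggling works because each character in U+E0000–U+E007F mirrors an ASCII code point at a fixed offset and typically renders as nothing. A minimal sketch of detection and removal, with hypothetical function names:

```python
TAG_START, TAG_END = 0xE0000, 0xE007F


def extract_tag_payload(text):
    """Recover hidden ASCII by subtracting the tag-block offset."""
    return "".join(chr(ord(ch) - TAG_START) for ch in text
                   if 0xE0020 <= ord(ch) <= 0xE007E)  # printable ASCII mirror


def strip_tags(text):
    """Remove every character in the tag block."""
    return "".join(ch for ch in text if not TAG_START <= ord(ch) <= TAG_END)
```

Running the extracted payload back through the injection detectors, rather than only stripping it, preserves the evidence of what the attacker tried to say.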
ASCII smuggling / homoglyph + zero-width
Confusables + zero-width + bidi-override polyglot. UTS #39 normalize + strip.
HIGH
Image alt-text / metadata injection
Payload in HTML alt-attr or EXIF Description / Comment / Title metadata.
HIGH
RAG cache / nearest-neighbor poisoning
Single corpus document is top-k for > 20% of recent queries vs baseline ≤ 2%.
HIGH
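The top-k share rule for RAG cache poisoning reduces to counting how often each document appears across recent retrievals. A sketch using the 20% figure from the description; the data shape (one list of document ids per query) is an assumption.

```python
from collections import Counter


def dominant_documents(retrievals, max_share=0.20):
    """Map document id -> share of queries, for documents above max_share.

    retrievals: one list of top-k document ids per recent query.
    """
    if not retrievals:
        return {}
    # set() so a document counts at most once per query.
    hits = Counter(doc for topk in retrievals for doc in set(topk))
    total = len(retrievals)
    return {doc: n / total for doc, n in hits.items() if n / total > max_share}
```

Comparing the result against the same computation over a baseline window separates a newly dominant document from one that has always been popular.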