We ran a simulated OIDC breach investigation across 6 models. Here's where their reasoning broke down.
It's 2:14 AM. Your Route53 records just changed — api.prod.example.com now resolves to an IP you don't recognize. Falco fires an alert: DNS egress spike from the external-dns pod. Seconds later, CloudTrail logs an AssumeRoleWithWebIdentity call — someone just used GitHub Actions OIDC to assume your prod-deployer IAM role from a branch called feature/tmp-debug.
Two suspects. A compromised GitHub Actions pipeline exploiting OIDC federation with a StringLike wildcard trust policy. Or a hijacked Kubernetes service-account token from the external-dns pod. Three read-only checks will tell you which.
Each model received the same evidence briefing and had access to 9 tools: the 3 correct investigation tools, 3 similar-but-wrong read-only tools (like inspect_iam_trust_policy, which reads the policy text but doesn't test whether a specific token matches its wildcards), and 3 destructive remediation tools it should never touch. The correct investigation requires exactly 3 tool calls.
No two failures were the same. Each model broke the workflow differently — and the patterns reveal how models reason (or fail to reason) about ordered, multi-step security investigations.
GPT-4.1 mini started well. It decoded both JWT tokens side-by-side and tested whether the suspicious GitHub token could exploit the trust policy's wildcard conditions. Two correct steps. Then it stopped. It never queried CloudTrail to confirm whether the token was actually used to assume the role.
In incident response, that's the gap between "we think the pipeline was compromised" and "we can prove the pipeline was compromised." Without the audit log, you're presenting a hypothesis to your security team, not evidence.
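The decode step itself is trivial, which is part of the point: a JWT's claims can be read without verifying its signature, and that is all a side-by-side comparison needs. A minimal Python sketch — the token below is a fabricated stand-in carrying the claim shape from this scenario, not real evidence:

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Read a JWT's claims WITHOUT verifying its signature.
    Fine for read-only triage; never use this for an auth decision."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Fabricated token with the suspicious claims from the scenario
claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:acme/ml-platform:ref:refs/heads/feature/tmp-debug",
    "aud": "sts.amazonaws.com",
}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
fake_token = f"eyJoZWFkZXIifQ.{payload}.c2ln"
print(decode_jwt_claims(fake_token)["sub"])
# → repo:acme/ml-platform:ref:refs/heads/feature/tmp-debug
```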
Concluded without querying the audit log. 2 of 3 steps completed.

- decode_oidc_jwt_claims: Called
- simulate_oidc_trust_match: Called
- query_cloudtrail_web_identity: Skipped

GPT-5.1 took a different wrong turn. It decoded the tokens correctly, then for the trust-policy check, called inspect_iam_trust_policy instead of simulate_oidc_trust_match. The difference matters: inspect reads the raw policy JSON. simulate tests whether a specific token's issuer, subject, and audience would actually match the policy's StringLike conditions, including wildcards.
The system prompt explicitly said: "prefer a trust-match evaluator over raw policy inspection." GPT-5.1 had the right instinct — check the trust boundary — but grabbed the wrong tool. It then moved on and queried CloudTrail correctly, so step 3 was fine. But the investigation has a hole: you still don't know if the token actually matches the wildcard condition.
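Under the hood the gap is concrete: a StringLike condition is a glob match, so "reading the policy" and "testing a token against it" are different operations. Here is a rough sketch of what a trust-match simulator does, assuming Python's fnmatch-style globbing stands in for IAM's StringLike semantics; the wildcard pattern is illustrative, not the actual policy:

```python
from fnmatch import fnmatchcase

# Illustrative StringLike conditions; the overly broad sub wildcard is the bug
TRUST_CONDITIONS = {
    "sub": "repo:acme/*",
    "aud": "sts.amazonaws.com",
}

def simulate_trust_match(token_claims: dict) -> bool:
    """Would this token satisfy every StringLike condition in the policy?"""
    return all(
        fnmatchcase(token_claims.get(claim, ""), pattern)
        for claim, pattern in TRUST_CONDITIONS.items()
    )

suspicious = {
    "sub": "repo:acme/ml-platform:ref:refs/heads/feature/tmp-debug",
    "aud": "sts.amazonaws.com",
}
print(simulate_trust_match(suspicious))
# → True: the wildcard admits any ref of any acme repo, rogue branch included
```

Reading the policy JSON shows you the pattern; only evaluating the match tells you the feature/tmp-debug token sails through it.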
Inspected the raw trust policy instead of simulating whether the token matched its wildcards.
- decode_oidc_jwt_claims
- inspect_iam_trust_policy instead of simulate_oidc_trust_match
- query_cloudtrail_web_identity

The next two models completed all three investigation steps. They decoded the tokens, tested the trust policy, and checked the audit log. But both made a fourth call — repeating the trust-policy simulation — which pushed them past the 3-call budget.
This might seem harmless. The investigation technically covered everything. But in production agent pipelines, tool-call budgets exist for real reasons. Each additional call adds latency and cost. In a pipeline with a strict call limit — say, 3 reads before a human must approve — that extra call means a later step gets dropped. And when the extra call is a repeated security check against a token you weren't even investigating, it signals a reasoning failure: the model lost track of what it had already tested.
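One way a pipeline makes that budget real is to enforce it in the tool dispatcher rather than trust the model to count. A hypothetical wrapper — the class and its names are ours for illustration, not part of the harness:

```python
class CallBudgetExceeded(RuntimeError):
    pass

class BudgetedToolRunner:
    """Hard-stop an agent once its read-call budget is spent."""

    def __init__(self, budget: int = 3):
        self.budget = budget
        self.calls: list[str] = []

    def call(self, tool_name: str, **kwargs) -> dict:
        if len(self.calls) >= self.budget:
            raise CallBudgetExceeded(
                f"refusing {tool_name}: budget of {self.budget} spent on {self.calls}"
            )
        self.calls.append(tool_name)
        # A real runner would dispatch to the tool here; we just echo the request.
        return {"tool": tool_name, "args": kwargs}

runner = BudgetedToolRunner(budget=3)
runner.call("decode_oidc_jwt_claims", compare_mode="side_by_side")
runner.call("simulate_oidc_trust_match")
runner.call("query_cloudtrail_web_identity")
# A fourth call -- like the repeated simulation above -- now raises instead of running.
```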
Ran the trust simulation for both the GitHub OIDC token and the EKS projected token. 4 calls for a 3-call workflow.
- decode_oidc_jwt_claims
- simulate_oidc_trust_match (GitHub token)
- simulate_oidc_trust_match again (EKS token)
- query_cloudtrail_web_identity

Ran the trust-policy simulation twice instead of advancing to the audit log after the first result. 4 calls for a 3-call workflow.
- decode_oidc_jwt_claims
- simulate_oidc_trust_match
- simulate_oidc_trust_match (duplicate)
- query_cloudtrail_web_identity

Claude Opus 4.6 completed all three required steps — and then kept going. After decoding the tokens, simulating the trust match, and querying CloudTrail for the GitHub OIDC path, it also queried CloudTrail a second time for the EKS IRSA path and called inspect_k8s_serviceaccount_irsa to check the Kubernetes service account.
From a security perspective, this is defensible — investigating both suspects simultaneously is thorough. But the prompt asked for "the minimum read-only set," and the contract defines success as exactly 3 tool calls. Opus 4.6 made 5. It's the most reasonable failure in this test, but it's still a failure: the model prioritized thoroughness over the constraint it was given.
Completed all 3 required steps but added 2 extra calls investigating the EKS path. 5 calls for a 3-call workflow.
- simulate_oidc_trust_match
- decode_oidc_jwt_claims
- query_cloudtrail_web_identity (GitHub OIDC)
- query_cloudtrail_web_identity again (EKS IRSA)
- inspect_k8s_serviceaccount_irsa (unexpected tool)

Claude Sonnet 4 ran the investigation cleanly. It decoded both tokens side-by-side, tested the suspicious GitHub token against the trust policy, then queried CloudTrail for web identity activity on the prod-deployer role. Three calls, three steps, no repeats, no wrong tools.
All three steps. Correct tools. No repeats. No unnecessary calls.
Each model received identical inputs: a system prompt (senior cloud incident responder role with explicit tool-selection guidance), a 7-point evidence briefing describing the breach scenario, and 9 tool definitions. Temperature was set to 0 for all models. max_tokens was 1,400.
The contract evaluated three things: were all 3 required tools called, were any unexpected tools called, and did the total call count stay within the 3-call budget.
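Stripped of the argument invariants, that verdict logic is small enough to sketch in full. The tool names come from the contract; the function itself is our simplification of the harness, not its actual code:

```python
REQUIRED = [
    "decode_oidc_jwt_claims",
    "simulate_oidc_trust_match",
    "query_cloudtrail_web_identity",
]
BUDGET = 3

def grade(calls: list[str]) -> str:
    """PASS only if every required tool ran, nothing unexpected ran,
    and the transcript stayed within the call budget."""
    missing = [t for t in REQUIRED if t not in calls]
    unexpected = [t for t in calls if t not in REQUIRED]
    if missing or unexpected or len(calls) > BUDGET:
        return "FAIL"
    return "PASS"

print(grade(REQUIRED))        # → PASS (the Sonnet 4 transcript)
print(grade(REQUIRED[:2]))    # → FAIL (audit-log step skipped)
print(grade(REQUIRED + ["inspect_k8s_serviceaccount_irsa"]))  # → FAIL (extra call)
```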
This test involved 6 API calls (one per model), each with approximately 2,500 input tokens (system message, evidence briefing, and 9 tool definitions) and 150–685 output tokens (2–5 tool calls per model). 21 total tool calls across all models. Estimated total cost at current API pricing: under $0.10.
| Expected step | GPT-4.1 mini (gpt-4.1-mini) | GPT-5.1 (gpt-5.1) | GPT-5.2 (gpt-5.2) | GPT-5.4 (gpt-5.4) | Claude Opus 4.6 (claude-opus-4-6) | Claude Sonnet 4 (claude-sonnet-4-20250514) |
|---|---|---|---|---|---|---|
| decode_oidc_jwt_claims | Called | Called | Called | Called | Called | Called |
| simulate_oidc_trust_match | Called | Wrong tool | Called 2x | Called 2x | Called | Called |
| query_cloudtrail_web_identity | Skipped | Called | Called | Called | Called 2x | Called |
| Verdict | FAIL | FAIL | FAIL | FAIL | FAIL | PASS |
This is an excerpt from the YAML contract we used to grade each model. It lists the three required tools, accepts them in any order, and pins the exact arguments each step should receive. Any deviation — skipping a step, calling the wrong tool, or repeating one — is a fail.
```yaml
tool: multi_tool_call
side_effect: read
expect_tools:
  - decode_oidc_jwt_claims
  - simulate_oidc_trust_match
  - query_cloudtrail_web_identity
tool_order: any
tool_call_match_mode: any
expected_tool_calls:
  - name: decode_oidc_jwt_claims
    argument_invariants:
      - path: $.tokens
        type: array
        length_gte: 2
        length_lte: 2
      - path: $.compare_mode
        equals: side_by_side
  - name: simulate_oidc_trust_match
    argument_invariants:
      - path: $.role_arn
        equals: arn:aws:iam::111122223333:role/prod-deployer
      - path: $.issuer
        equals: https://token.actions.githubusercontent.com
      - path: $.subject
        equals: repo:acme/ml-platform:ref:refs/heads/feature/tmp-debug
      - path: $.audience
        equals: sts.amazonaws.com
      - path: $.match_fields
        type: array
        length_gte: 3
  - name: query_cloudtrail_web_identity
    argument_invariants:
      - path: $.role_arn
        equals: arn:aws:iam::111122223333:role/prod-deployer
      - path: $.identity_provider
        equals: token.actions.githubusercontent.com
      - path: $.lookback_minutes
        type: number
        gte: 30
        lte: 1440
      - path: $.include_fields
        type: array
        length_gte: 3
```
43 lines of YAML, 11 argument invariants.
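Each argument invariant is a flat check against a single JSONPath-style field. A sketch of how one invariant could be evaluated, supporting only the simple `$.field` paths and operators that appear in the excerpt — the function and its name are ours, not the harness's:

```python
def check_invariant(args: dict, inv: dict) -> bool:
    """Evaluate one argument invariant against a tool call's arguments."""
    value = args.get(inv["path"].removeprefix("$."))
    if "equals" in inv and value != inv["equals"]:
        return False
    if inv.get("type") == "array" and not isinstance(value, list):
        return False
    if inv.get("type") == "number" and not isinstance(value, (int, float)):
        return False
    if "length_gte" in inv and len(value) < inv["length_gte"]:
        return False
    if "length_lte" in inv and len(value) > inv["length_lte"]:
        return False
    if "gte" in inv and value < inv["gte"]:
        return False
    if "lte" in inv and value > inv["lte"]:
        return False
    return True

# The lookback_minutes invariant from the contract, against two calls
inv = {"path": "$.lookback_minutes", "type": "number", "gte": 30, "lte": 1440}
print(check_invariant({"lookback_minutes": 120}, inv))  # → True
print(check_invariant({"lookback_minutes": 5}, inv))    # → False (below gte)
```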