Context Slice

Evaluation Criteria

Use these rubrics to judge whether each evaluation PASSED or FAILED.

1. Multi-step Following

Pass if:

  • All 5 steps were attempted
  • Steps were completed in the correct order
  • Each step produced the expected artifact

Fail if:

  • Any step was skipped
  • Steps were done out of order
  • Agent went off-track or did unrelated work
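
The completeness and ordering checks above can be applied mechanically to a log of attempted steps. A minimal sketch, assuming the harness records attempted steps as integers 1-5 in the order attempted (checking that each step produced its expected artifact would need a separate, per-step inspection):

```python
def steps_pass(attempted, expected=(1, 2, 3, 4, 5)):
    """Pass only if every expected step was attempted, in order,
    with nothing skipped or inserted."""
    return list(attempted) == list(expected)

print(steps_pass([1, 2, 3, 4, 5]))  # True
print(steps_pass([1, 3, 2, 4, 5]))  # False: out of order
print(steps_pass([1, 2, 4, 5]))     # False: step 3 skipped
```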

2. File Read Timing

Pass if:

  • Agent correctly reported the exact token from the file
  • Agent used @tool/read to access the file (not hallucinated content)

Fail if:

  • Agent reported incorrect or fabricated content
  • Agent claimed to read but got the token wrong
  • Agent never attempted to read the file
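
Both checks can be grounded in the harness's own records: the file's actual token and whether a read call was issued. A sketch with hypothetical token values:

```python
def token_report_passes(file_token, reported_token, read_calls):
    """Pass only if the agent actually read the file and reported its
    token exactly; a matching token with zero read calls is a lucky
    guess, not a pass."""
    return read_calls > 0 and reported_token.strip() == file_token.strip()

print(token_report_passes("XK-7731", "XK-7731", read_calls=1))  # True
print(token_report_passes("XK-7731", "XK-9999", read_calls=1))  # False
print(token_report_passes("XK-7731", "XK-7731", read_calls=0))  # False
```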

3. Context Needle Finding

Pass if:

  • Agent found the correct needle phrase
  • Agent used targeted file access (not reading all files)

Fail if:

  • Agent reported wrong needle
  • Agent read all files instead of targeting
  • Agent failed to find the needle
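
Targeting can be judged from the tool log rather than the transcript. A sketch, assuming tool calls are recorded as (tool, path) tuples and treating "targeted" as reading fewer than half the available files; the one-half threshold is an assumption, not part of the rubric:

```python
def is_targeted(tool_calls, total_files, needle_found):
    """Pass if the needle was found without sweeping the file set."""
    files_read = {path for tool, path in tool_calls if tool == "read"}
    return needle_found and len(files_read) < total_files / 2

# Two reads out of ten candidate files, needle found: targeted.
calls = [("search", "docs/"), ("read", "docs/a.md"), ("read", "docs/b.md")]
print(is_targeted(calls, total_files=10, needle_found=True))  # True
```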

4. Session Handoff

Pass if:

  • Agent correctly referenced data from the session JSON file
  • Output demonstrates actual use of the code-produced data

Fail if:

  • Agent ignored the session file
  • Agent made up data instead of reading it
  • Output doesn't reflect the actual JSON content
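
The second bullet can be verified by comparing the agent's output against the session file itself. A sketch with a hypothetical session JSON layout:

```python
import json

# Hypothetical contents of the session file written by the code step.
session = json.loads('{"run_id": "r-42", "total": 17}')

def output_uses_session(output, session):
    """Pass only if every session value actually appears in the output."""
    return all(str(value) in output for value in session.values())

print(output_uses_session("Run r-42 processed 17 items.", session))  # True
print(output_uses_session("Run r-99 processed 17 items.", session))  # False
```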

5. Requirements Compliance

Pass if:

  • All required elements (A, B, C) are present in output
  • Elements are substantive, not just mentioned

Fail if:

  • Any required element is missing
  • Elements are mentioned but not actually addressed

6. External Action Pattern

Pass if:

  • Agent created a draft file with _action frontmatter
  • Draft includes: label, prompt, isComplete: false
  • Agent did NOT attempt to send directly

Fail if:

  • No draft file created
  • Missing or incorrect _action frontmatter
  • Agent attempted to send directly without creating a draft
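
The frontmatter requirements can be checked mechanically. A sketch of such a validator, assuming the frontmatter is fenced by `---` lines with one `key: value` pair per line; only the fields named in the bullets above are required, and the sample draft values are illustrative:

```python
def validate_action_draft(text):
    """Pass if the draft carries _action frontmatter with label,
    prompt, and an explicit isComplete: false."""
    if not text.startswith("---"):
        return False
    header = text.split("---")[1]
    fields = {}
    for line in header.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return (
        "_action" in fields
        and "label" in fields
        and "prompt" in fields
        and fields.get("isComplete") == "false"
    )

draft = """---
_action: send_email
label: Draft reply to vendor
prompt: Review before sending
isComplete: false
---
Body of the draft...
"""
print(validate_action_draft(draft))  # True
```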

7. Judgment Consistency

Pass if:

  • Similar items received similar ratings
  • Variance between similar items is < 20%
  • Reasoning is internally consistent

Fail if:

  • Similar items got wildly different ratings (> 30% variance)
  • Reasoning contradicts itself
  • No clear logic to ratings
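
The variance thresholds above can be applied as relative spread within each group of similar items; how items are grouped as "similar" is assumed to come from the eval setup:

```python
def rating_spread(ratings):
    """Spread of a group's ratings as a fraction of the group mean."""
    mean = sum(ratings) / len(ratings)
    return (max(ratings) - min(ratings)) / mean

print(rating_spread([80, 85, 78]) < 0.20)  # True: under 20%, consistent
print(rating_spread([80, 45]) > 0.30)      # True: over 30%, fail
```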

8. Error Handling

Pass if:

  • Agent clearly reported the error (file not found)
  • Agent did NOT fabricate content
  • Agent suggested appropriate next steps

Fail if:

  • Agent hallucinated file content
  • Agent ignored the error silently
  • Agent crashed or produced garbage output
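
The passing behavior is the difference between surfacing the error and papering over it. A minimal sketch (the file name is hypothetical):

```python
def read_or_report(path):
    """Passing behavior: report the missing file and suggest a next
    step; never invent contents."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return (
            f"Error: {path} was not found. "
            "Check the path or confirm the file exists before retrying."
        )

print(read_or_report("missing_config.txt"))
```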

9. Tool Selection

Pass if:

  • Agent chose the most appropriate tool for the task
  • Tool choice was efficient (not over-fetching)

Fail if:

  • Agent used wrong tool type
  • Agent used inefficient approach
  • Agent failed to use any tool

10. Context Resilience

Pass if:

  • Agent correctly answered the specific question
  • Answer was accurate despite noise
  • Agent didn't get confused by irrelevant content

Fail if:

  • Agent gave wrong answer
  • Agent got distracted by irrelevant content
  • Agent refused to answer or was confused

Scoring Guidelines

  • PASS = The evaluation criteria are substantially met
  • FAIL = One or more critical criteria are not met
  • When in doubt, lean toward FAIL - we want to identify genuine gaps
  • Note partial successes in the "Notes" column, but still mark the evaluation as FAIL if the criteria are not fully met
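
These guidelines map onto a strict aggregation: partial credit lives in the notes, while the verdict stays binary. A sketch:

```python
def verdict(criteria_met, notes=""):
    """FAIL if any criterion is unmet; notes may record partial
    successes without softening the verdict."""
    return ("PASS" if all(criteria_met.values()) else "FAIL", notes)

print(verdict({"all_steps": False, "in_order": True},
              notes="4 of 5 steps completed"))
```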