Context Slice

Evaluation Criteria

Use these rubrics to judge whether each evaluation PASSED or FAILED.

1. Multi-step Following

Pass if:

  • All 5 steps were attempted
  • Steps were completed in the correct order
  • Each step produced the expected artifact

Fail if:

  • Any step was skipped
  • Steps were done out of order
  • Agent went off-track or did unrelated work
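
The completeness and ordering checks above can be applied mechanically to a log of attempted steps. A minimal sketch, assuming the harness records attempted steps as integers 1-5 in the order attempted (checking that each step produced its expected artifact would need a separate, per-step inspection):

```python
def steps_pass(attempted, expected=(1, 2, 3, 4, 5)):
    """Pass only if every expected step was attempted, in order,
    with nothing skipped or inserted."""
    return list(attempted) == list(expected)

print(steps_pass([1, 2, 3, 4, 5]))  # True
print(steps_pass([1, 3, 2, 4, 5]))  # False: out of order
print(steps_pass([1, 2, 4, 5]))     # False: step 3 skipped
```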

2. File Read Timing

Pass if:

  • Agent correctly reported the exact token from the file
  • Agent used @tool/read to access the file (not hallucinated content)

Fail if:

  • Agent reported incorrect or fabricated content
  • Agent claimed to read but got the token wrong
  • Agent never attempted to read the file
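
Both checks can be grounded in the harness's own records: the file's actual token and whether a read call was issued. A sketch with hypothetical token values:

```python
def token_report_passes(file_token, reported_token, read_calls):
    """Pass only if the agent actually read the file and reported its
    token exactly; a matching token with zero read calls is a lucky
    guess, not a pass."""
    return read_calls > 0 and reported_token.strip() == file_token.strip()

print(token_report_passes("XK-7731", "XK-7731", read_calls=1))  # True
print(token_report_passes("XK-7731", "XK-9999", read_calls=1))  # False
print(token_report_passes("XK-7731", "XK-7731", read_calls=0))  # False
```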

3. Context Needle Finding

Pass if:

  • Agent found the correct needle phrase
  • Agent used targeted file access (not reading all files)

Fail if:

  • Agent reported wrong needle
  • Agent read all files instead of targeting
  • Agent failed to find the needle
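
Targeting can be judged from the tool log rather than the transcript. A sketch, assuming tool calls are recorded as (tool, path) tuples and treating "targeted" as reading fewer than half the available files; the one-half threshold is an assumption, not part of the rubric:

```python
def is_targeted(tool_calls, total_files, needle_found):
    """Pass if the needle was found without sweeping the file set."""
    files_read = {path for tool, path in tool_calls if tool == "read"}
    return needle_found and len(files_read) < total_files / 2

# Two reads out of ten candidate files, needle found: targeted.
calls = [("search", "docs/"), ("read", "docs/a.md"), ("read", "docs/b.md")]
print(is_targeted(calls, total_files=10, needle_found=True))  # True
```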

4. Session Handoff

Pass if:

  • Agent correctly referenced data from the session JSON file
  • Output demonstrates actual use of the code-produced data

Fail if:

  • Agent ignored the session file
  • Agent made up data instead of reading it
  • Output doesn't reflect the actual JSON content
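
The second bullet can be verified by comparing the agent's output against the session file itself. A sketch with a hypothetical session JSON layout:

```python
import json

# Hypothetical contents of the session file written by the code step.
session = json.loads('{"run_id": "r-42", "total": 17}')

def output_uses_session(output, session):
    """Pass only if every session value actually appears in the output."""
    return all(str(value) in output for value in session.values())

print(output_uses_session("Run r-42 processed 17 items.", session))  # True
print(output_uses_session("Run r-99 processed 17 items.", session))  # False
```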

5. Requirements Compliance

Pass if:

  • All required elements (A, B, C) are present in output
  • Elements are substantive, not just mentioned

Fail if:

  • Any required element is missing
  • Elements are mentioned but not actually addressed

6. External Action Pattern

Pass if:

  • Agent created a draft file with _action frontmatter
  • Draft includes: label, prompt, isComplete: false
  • Agent did NOT attempt to send directly

Fail if:

  • No draft file created
  • Missing or incorrect _action frontmatter
  • Agent attempted to send directly without creating a draft
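
The frontmatter requirements can be checked mechanically. A sketch of such a validator, assuming the frontmatter is fenced by `---` lines with one `key: value` pair per line; only the fields named in the bullets above are required, and the sample draft values are illustrative:

```python
def validate_action_draft(text):
    """Pass if the draft carries _action frontmatter with label,
    prompt, and an explicit isComplete: false."""
    if not text.startswith("---"):
        return False
    header = text.split("---")[1]
    fields = {}
    for line in header.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return (
        "_action" in fields
        and "label" in fields
        and "prompt" in fields
        and fields.get("isComplete") == "false"
    )

draft = """---
_action: send_email
label: Draft reply to vendor
prompt: Review before sending
isComplete: false
---
Body of the draft...
"""
print(validate_action_draft(draft))  # True
```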

7. Judgment Consistency

Pass if:

  • Similar items received similar ratings
  • Variance between similar items is < 20%
  • Reasoning is internally consistent

Fail if:

  • Similar items got wildly different ratings (> 30% variance)
  • Reasoning contradicts itself
  • No clear logic to ratings
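
The variance thresholds above can be applied as relative spread within each group of similar items; how items are grouped as "similar" is assumed to come from the eval setup:

```python
def rating_spread(ratings):
    """Spread of a group's ratings as a fraction of the group mean."""
    mean = sum(ratings) / len(ratings)
    return (max(ratings) - min(ratings)) / mean

print(rating_spread([80, 85, 78]) < 0.20)  # True: under 20%, consistent
print(rating_spread([80, 45]) > 0.30)      # True: over 30%, fail
```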

8. Error Handling

Pass if:

  • Agent clearly reported the error (file not found)
  • Agent did NOT fabricate content
  • Agent suggested appropriate next steps

Fail if:

  • Agent hallucinated file content
  • Agent ignored the error silently
  • Agent crashed or produced garbage output
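
The passing behavior is the difference between surfacing the error and papering over it. A minimal sketch (the file name is hypothetical):

```python
def read_or_report(path):
    """Passing behavior: report the missing file and suggest a next
    step; never invent contents."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return (
            f"Error: {path} was not found. "
            "Check the path or confirm the file exists before retrying."
        )

print(read_or_report("missing_config.txt"))
```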

9. Tool Selection

Pass if:

  • Agent chose the most appropriate tool for the task
  • Tool choice was efficient (not over-fetching)

Fail if:

  • Agent used wrong tool type
  • Agent used inefficient approach
  • Agent failed to use any tool

10. Context Resilience

Pass if:

  • Agent correctly answered the specific question
  • Answer was accurate despite noise
  • Agent didn't get confused by irrelevant content

Fail if:

  • Agent gave wrong answer
  • Agent got distracted by irrelevant content
  • Agent refused to answer or was confused

Scoring Guidelines

  • PASS = The evaluation criteria are substantially met
  • FAIL = One or more critical criteria are not met
  • When in doubt, lean toward FAIL - we want to identify genuine gaps
  • Note partial successes in the "Notes" column, but still mark the evaluation as FAIL if the criteria are not fully met
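
These guidelines map onto a strict aggregation: partial credit lives in the notes, while the verdict stays binary. A sketch:

```python
def verdict(criteria_met, notes=""):
    """FAIL if any criterion is unmet; notes may record partial
    successes without softening the verdict."""
    return ("PASS" if all(criteria_met.values()) else "FAIL", notes)

print(verdict({"all_steps": False, "in_order": True},
              notes="4 of 5 steps completed"))
```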