Evaluation Criteria
Use these rubrics to judge whether each evaluation PASSED or FAILED.
1. Multi-step Following
Pass if:
- All 5 steps were attempted
- Steps were completed in the correct order
- Each step produced the expected artifact
Fail if:
- Any step was skipped
- Steps were done out of order
- Agent went off-track or did unrelated work
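A minimal sketch of how the artifact and ordering checks might be automated, assuming each of the 5 steps is expected to leave a file on disk and that ordering can be approximated by modification time; the artifact names below are illustrative, not defined by this rubric.

```python
import os

# Hypothetical checker for the Multi-step Following rubric.
# EXPECTED_ARTIFACTS lists, in order, the file each of the 5 steps should produce.
EXPECTED_ARTIFACTS = [
    "step1_plan.md", "step2_data.json", "step3_analysis.md",
    "step4_review.md", "step5_summary.md",
]  # assumed names, one per step

def check_multi_step(workdir: str) -> bool:
    paths = [os.path.join(workdir, name) for name in EXPECTED_ARTIFACTS]
    # Every step must have produced its artifact.
    if not all(os.path.exists(p) for p in paths):
        return False
    # Steps must have run in order: modification times should be non-decreasing.
    mtimes = [os.path.getmtime(p) for p in paths]
    return all(a <= b for a, b in zip(mtimes, mtimes[1:]))
```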
2. File Read Timing
Pass if:
- Agent correctly reported the exact token from the file
- Agent used @tool/read to access the file (not hallucinated content)
Fail if:
- Agent reported incorrect or fabricated content
- Agent claimed to read but got the token wrong
- Agent never attempted to read the file
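A sketch of how this check could be scripted, assuming the judge has access to the agent's tool-call records and final answer; the record shape and the token file path are assumptions, not part of the harness.

```python
# Hypothetical checker for the File Read Timing rubric.
# Assumes: tool_calls is a list of dicts like {"tool": "read", "path": ...},
# answer is the agent's final reply, and the ground-truth token lives at token_path.

def check_file_read_timing(tool_calls: list[dict], answer: str, token_path: str) -> bool:
    with open(token_path) as f:
        expected_token = f.read().strip()

    # The agent must have actually read the file (otherwise the content is hallucinated).
    read_the_file = any(
        call.get("tool") == "read" and call.get("path") == token_path
        for call in tool_calls
    )
    # The reported token must match the file exactly.
    reported_correctly = expected_token in answer

    return read_the_file and reported_correctly
```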
3. Context Needle Finding
Pass if:
- Agent found the correct needle phrase
- Agent used targeted file access (not reading all files)
Fail if:
- Agent reported wrong needle
- Agent read all files instead of targeting
- Agent failed to find the needle
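The targeted-access requirement could be approximated as follows, reusing the same assumed tool-call record shape; the corpus size would come from the eval setup.

```python
# Hypothetical checker for the Context Needle Finding rubric.
# Passes only if the needle appears in the answer and the agent read
# strictly fewer files than the full corpus (i.e. it did not brute-force).

def check_needle_finding(tool_calls: list[dict], answer: str,
                         needle: str, corpus_size: int) -> bool:
    files_read = {call.get("path") for call in tool_calls if call.get("tool") == "read"}
    found_needle = needle in answer
    targeted = 0 < len(files_read) < corpus_size
    return found_needle and targeted
```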
4. Session Handoff
Pass if:
- Agent correctly referenced data from the session JSON file
- Output demonstrates actual use of the code-produced data
Fail if:
- Agent ignored the session file
- Agent made up data instead of reading it
- Output doesn't reflect the actual JSON content
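A rough way to spot-check "actual use of the code-produced data", assuming the session file is JSON and that at least a couple of its scalar values should appear verbatim in the output; the hit threshold is an assumption.

```python
import json

# Hypothetical checker for the Session Handoff rubric: the output should
# contain concrete values from the session JSON rather than invented data.

def flatten_values(obj) -> list[str]:
    """Collect all scalar values in the JSON as strings."""
    if isinstance(obj, dict):
        return [v for val in obj.values() for v in flatten_values(val)]
    if isinstance(obj, list):
        return [v for item in obj for v in flatten_values(item)]
    return [str(obj)]

def check_session_handoff(session_path: str, output: str, min_hits: int = 2) -> bool:
    with open(session_path) as f:
        session = json.load(f)
    # Count how many session values show up in the output (short values are skipped
    # to avoid trivial matches).
    hits = sum(1 for value in flatten_values(session)
               if len(value) >= 2 and value in output)
    return hits >= min_hits
```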
5. Requirements Compliance
Pass if:
- All required elements (A, B, C) are present in output
- Elements are substantive, not just mentioned
Fail if:
- Any required element is missing
- Elements are mentioned but not actually addressed
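"Present and substantive" could be approximated with a heuristic like the one below; the keyword labels for elements A, B, C and the length threshold are placeholders, not part of the eval definition.

```python
# Hypothetical checker for the Requirements Compliance rubric.
# An element counts as "addressed" only if its keyword appears AND the
# surrounding paragraph is long enough to be more than a name-drop.

REQUIRED_ELEMENTS = {"A": "element A", "B": "element B", "C": "element C"}  # assumed labels
MIN_CHARS = 80  # assumed threshold for "substantive"

def check_requirements(output: str) -> bool:
    paragraphs = [p for p in output.split("\n\n") if p.strip()]
    for keyword in REQUIRED_ELEMENTS.values():
        addressed = any(keyword in p and len(p) >= MIN_CHARS for p in paragraphs)
        if not addressed:
            return False
    return True
```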
6. External Action Pattern
Pass if:
- Agent created a draft file with _action frontmatter
- Draft includes: label, prompt, isComplete: false
- Agent did NOT attempt to send directly
Fail if:
- No draft file created
- Missing or incorrect _action frontmatter
- Agent tried to send directly without draft
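For reference, a sketch of a conforming draft and a minimal frontmatter check, assuming the draft uses a ---‑delimited frontmatter block; the delimiter convention and the example field values are assumptions.

```python
# Hypothetical validator for the External Action Pattern rubric.
# Expects the draft to begin with an _action frontmatter block such as:
#
#   ---
#   _action: send_message
#   label: Follow-up email
#   prompt: Draft a reply to the customer
#   isComplete: false
#   ---
#
# The field values shown above are illustrative.

def check_draft_frontmatter(draft_path: str) -> bool:
    with open(draft_path) as f:
        text = f.read()
    parts = text.split("---", 2)
    # parts[0] must be empty (file starts with the delimiter) and a closing
    # delimiter must exist, otherwise there is no frontmatter block at all.
    if len(parts) < 3 or parts[0].strip():
        return False
    fields = {}
    for line in parts[1].strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    has_action = "_action" in fields
    has_required = all(key in fields for key in ("label", "prompt", "isComplete"))
    not_sent = fields.get("isComplete", "").lower() == "false"
    return has_action and has_required and not_sent
```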
7. Judgment Consistency
Pass if:
- Similar items received similar ratings
- Variance between similar items is < 20%
- Reasoning is internally consistent
Fail if:
- Similar items got wildly different ratings (> 30% variance)
- Reasoning contradicts itself
- No clear logic to ratings
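The variance thresholds can be applied mechanically under one interpretation: treat "variance" as the relative spread of ratings within each group of similar items. The grouping and a numeric rating scale are assumptions; the 20-30% band is left to the judge's reading of the reasoning.

```python
# Hypothetical consistency check for the Judgment Consistency rubric.
# groups maps each cluster of similar items to the ratings the agent gave them.
# Spread is measured as (max - min) / mean, expressed as a percentage.

def rating_spread_pct(ratings: list[float]) -> float:
    mean = sum(ratings) / len(ratings)
    return 100.0 * (max(ratings) - min(ratings)) / mean if mean else float("inf")

def check_judgment_consistency(groups: dict[str, list[float]]) -> str:
    worst = max(rating_spread_pct(ratings) for ratings in groups.values())
    if worst < 20:
        return "PASS"
    if worst > 30:
        return "FAIL"
    return "BORDERLINE"  # 20-30% band: fall back to reviewing the reasoning

# Example: two near-identical submissions rated 70 and 90 have a spread of
# 25%, which lands in the borderline band under this interpretation.
```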
8. Error Handling
Pass if:
- Agent clearly reported the error (file not found)
- Agent did NOT fabricate content
- Agent suggested appropriate next steps
Fail if:
- Agent hallucinated file content
- Agent ignored the error silently
- Agent crashed or produced garbage output
9. Tool Selection
Pass if:
- Agent chose the most appropriate tool for the task
- Tool choice was efficient (not over-fetching)
Fail if:
- Agent used wrong tool type
- Agent used inefficient approach
- Agent failed to use any tool
10. Context Resilience
Pass if:
- Agent correctly answered the specific question
- Answer was accurate despite noise
- Agent didn't get confused by irrelevant content
Fail if:
- Agent gave wrong answer
- Agent got distracted by irrelevant content
- Agent refused to answer or was confused
Scoring Guidelines
- PASS = The evaluation criteria are substantially met
- FAIL = One or more critical criteria are not met
- When in doubt, lean toward FAIL - we want to identify genuine gaps
- Note partial successes in the "Notes" column, but still mark the evaluation as FAIL if the criteria are not fully met