Eval: Context Needle
Tests whether the agent can find specific content without reading every file.
1
This is a controlled evaluation that tests finding a needle in a larger context.
Create 10 files (9 distractors and 1 containing the needle) at Evaluation Workspaces [eval_id = needle]:
- [artifact_name = file1.md] - "This file contains general information about weather patterns."
- [artifact_name = file2.md] - "This file discusses cooking recipes for pasta."
- [artifact_name = file3.md] - "This file has notes about gardening tips."
- [artifact_name = file4.md] - "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text."
- [artifact_name = file5.md] - "This file discusses travel destinations in Europe."
- [artifact_name = file6.md] - "This file has information about car maintenance."
- [artifact_name = file7.md] - "This file contains book recommendations."
- [artifact_name = file8.md] - "This file discusses fitness routines."
- [artifact_name = file9.md] - "This file has notes about photography."
- [artifact_name = file10.md] - "This file contains music playlists."
Create all 10 files now.
2
Your task: Find the file that contains a "SECRET NEEDLE PHRASE".
Strategy matters here:
- You could read all files (inefficient)
- You could use @tool/grep to search (efficient)
Find which file contains the needle phrase and report:
- Which file contains it
- What method you used to find it
- The exact phrase
3
Write the evaluation result to Evaluation Results [eval_id = 3_needle]:
{
  "eval_id": "needle",
  "scenario": "Find needle phrase among 10 files",
  "outcome": {
    "needle_file": "which file contained the needle",
    "needle_phrase": "the exact phrase you found",
    "search_method": "grep/read-all/other",
    "files_read": ["list of files you actually read"]
  },
  "self_assessment": "Brief description of your approach"
}
You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If you have multiple CONSECUTIVE read steps in a row, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.
Add all steps to your todo list now and begin executing.
## Steps
1. This is a controlled evaluation that tests finding a needle in a larger context.
   Create 10 files (9 distractors and 1 containing the needle) at `session/eval/[eval_id]/[artifact_name].md` [eval_id = needle]:
   - [artifact_name = file1.md] - "This file contains general information about weather patterns."
   - [artifact_name = file2.md] - "This file discusses cooking recipes for pasta."
   - [artifact_name = file3.md] - "This file has notes about gardening tips."
   - [artifact_name = file4.md] - "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text."
   - [artifact_name = file5.md] - "This file discusses travel destinations in Europe."
   - [artifact_name = file6.md] - "This file has information about car maintenance."
   - [artifact_name = file7.md] - "This file contains book recommendations."
   - [artifact_name = file8.md] - "This file discusses fitness routines."
   - [artifact_name = file9.md] - "This file has notes about photography."
   - [artifact_name = file10.md] - "This file contains music playlists."
   Create all 10 files now.
2. Your task: Find the file that contains a "SECRET NEEDLE PHRASE".
   Strategy matters here:
   - You could read all files (inefficient)
   - You could use @tool/grep to search (efficient)
   Find which file contains the needle phrase and report:
   1. Which file contains it
   2. What method you used to find it
   3. The exact phrase
3. Write the evaluation result to `session/eval/[eval_id].json` [eval_id = 3_needle]:
```json
{
  "eval_id": "needle",
  "scenario": "Find needle phrase among 10 files",
  "outcome": {
    "needle_file": "which file contained the needle",
    "needle_phrase": "the exact phrase you found",
    "search_method": "grep/read-all/other",
    "files_read": ["list of files you actually read"]
  },
  "self_assessment": "Brief description of your approach"
}
```
```
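
For reference, the search in step 2 and the result record in step 3 could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the behavior of `@tool/grep`: the helper names `find_needle` and `build_result` are hypothetical, the pattern assumes the needle may be written with spaces or underscores and may carry a prefix or suffix, and treating only the matching files as "read" is one possible interpretation of `files_read`.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumption: the needle may use spaces or underscores between words and may
# be wrapped in extra word characters (a leading "THE_", a trailing "_42").
NEEDLE_RE = re.compile(r"\w*SECRET[ _]NEEDLE[ _]PHRASE\w*")


def find_needle(workspace: Path) -> list[tuple[str, str]]:
    """Scan each .md file line by line and return (file name, phrase) hits."""
    hits = []
    for path in sorted(workspace.glob("*.md")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                match = NEEDLE_RE.search(line)
                if match:
                    hits.append((path.name, match.group(0)))
    return hits


def build_result(hits: list[tuple[str, str]]) -> dict:
    """Fill the step 3 result record from the first hit, if there is one."""
    needle_file, needle_phrase = hits[0] if hits else ("", "")
    return {
        "eval_id": "needle",
        "scenario": "Find needle phrase among 10 files",
        "outcome": {
            "needle_file": needle_file,
            "needle_phrase": needle_phrase,
            "search_method": "grep",
            # Assumption: only files whose contents were surfaced count as read.
            "files_read": [name for name, _ in hits],
        },
        "self_assessment": "Searched all files for the pattern instead of reading each one.",
    }


if __name__ == "__main__":
    # Throwaway workspace holding two of the files from step 1.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        (workspace / "file3.md").write_text(
            "This file has notes about gardening tips.", encoding="utf-8"
        )
        (workspace / "file4.md").write_text(
            "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text.",
            encoding="utf-8",
        )
        print(json.dumps(build_result(find_needle(workspace)), indent=2))
```

The tolerant pattern matters because the task text names the phrase with spaces while the file content uses underscores, so a literal search for the spaced form would return no hits.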