Task

Eval: Context Needle

Tests whether an agent can find specific content without reading every file.

This is a controlled evaluation that tests finding a needle in context.

Create 10 files (9 distractors plus 1 needle file) at Evaluation Workspaces [eval_id = needle]:

[artifact_name = file1.md] - "This file contains general information about weather patterns."
[artifact_name = file2.md] - "This file discusses cooking recipes for pasta."
[artifact_name = file3.md] - "This file has notes about gardening tips."
[artifact_name = file4.md] - "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text."
[artifact_name = file5.md] - "This file discusses travel destinations in Europe."
[artifact_name = file6.md] - "This file has information about car maintenance."
[artifact_name = file7.md] - "This file contains book recommendations."
[artifact_name = file8.md] - "This file discusses fitness routines."
[artifact_name = file9.md] - "This file has notes about photography."
[artifact_name = file10.md] - "This file contains music playlists."

Create all 10 files now.
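As a rough illustration of this step, the sketch below writes the ten files with Python's standard library. The `create_files` helper and its `workspace` argument are hypothetical names; the real workspace location is whatever the evaluation harness resolves for `eval_id = needle`.

```python
from pathlib import Path

# File contents copied from the list above; only file4.md carries the needle.
FILE_CONTENTS = {
    "file1.md": "This file contains general information about weather patterns.",
    "file2.md": "This file discusses cooking recipes for pasta.",
    "file3.md": "This file has notes about gardening tips.",
    "file4.md": "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text.",
    "file5.md": "This file discusses travel destinations in Europe.",
    "file6.md": "This file has information about car maintenance.",
    "file7.md": "This file contains book recommendations.",
    "file8.md": "This file discusses fitness routines.",
    "file9.md": "This file has notes about photography.",
    "file10.md": "This file contains music playlists.",
}


def create_files(workspace: Path) -> list[Path]:
    """Write every file into `workspace` (an assumed directory) and return the paths."""
    workspace.mkdir(parents=True, exist_ok=True)
    created = []
    for name, text in FILE_CONTENTS.items():
        path = workspace / name
        path.write_text(text + "\n", encoding="utf-8")
        created.append(path)
    return created
```

Calling `create_files(Path("some/workspace/needle"))` would produce all ten files in one pass; the directory shown is a placeholder, not the harness's actual path.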

Your task: Find the file that contains a "SECRET NEEDLE PHRASE".

Strategy matters here:

- You could read all files (inefficient)
- You could use @tool/grep to search (efficient)

Find which file contains the needle phrase and report:

1. Which file contains it
2. What method you used to find it
3. The exact phrase
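The efficient strategy can be approximated outside the agent with a small grep-style scan. This is a sketch only: `find_needle` and its default pattern are assumptions made for illustration, and it does not reproduce the interface of `@tool/grep`, which is not described here.

```python
import re
from pathlib import Path
from typing import Optional


def find_needle(
    workspace: Path,
    pattern: str = r"\w*SECRET[_ ]NEEDLE[_ ]PHRASE\w*",  # assumed pattern
) -> Optional[tuple[str, str]]:
    """Return (file name, matched phrase) for the first match, or None if absent."""
    regex = re.compile(pattern)
    for path in sorted(workspace.glob("*.md")):
        with path.open(encoding="utf-8") as handle:
            # Scan line by line and stop at the first hit, as grep would.
            for line in handle:
                match = regex.search(line)
                if match:
                    return path.name, match.group(0)
    return None
```

Only the matching file name and phrase are returned, so the contents of the other nine files never need to be loaded into the agent's context.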

Write the evaluation result to Evaluation Results [eval_id = 3_needle]:

{
  "eval_id": "needle",
  "scenario": "Find needle phrase among 10 files",
  "outcome": {
    "needle_file": "which file contained the needle",
    "needle_phrase": "the exact phrase you found",
    "search_method": "grep/read-all/other",
    "files_read": ["list of files you actually read"]
  },
  "self_assessment": "Brief description of your approach"
}
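One way the result could be serialized is sketched below. The `write_result` helper and the `result_path` argument are hypothetical; the field names follow the template above, and the destination is left to the caller because the actual results location is resolved by the harness.

```python
import json
from pathlib import Path


def write_result(
    result_path: Path,
    needle_file: str,
    needle_phrase: str,
    search_method: str,
    files_read: list[str],
    self_assessment: str,
) -> dict:
    """Fill in the result template and write it as JSON to `result_path`."""
    result = {
        "eval_id": "needle",
        "scenario": "Find needle phrase among 10 files",
        "outcome": {
            "needle_file": needle_file,
            "needle_phrase": needle_phrase,
            "search_method": search_method,
            "files_read": files_read,
        },
        "self_assessment": self_assessment,
    }
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result, indent=2) + "\n", encoding="utf-8")
    return result
```

Recording `files_read` honestly matters here: a grep-based run would typically list no files, or only the one opened to confirm the match.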

You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If you have multiple consecutive read steps, perform them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. This is a controlled evaluation that tests finding a needle in context.

Create 10 files (9 distractors plus 1 needle file) at `session/eval/[eval_id]/[artifact_name].md` [eval_id = needle]:

- [artifact_name = file1.md] - "This file contains general information about weather patterns."
- [artifact_name = file2.md] - "This file discusses cooking recipes for pasta."
- [artifact_name = file3.md] - "This file has notes about gardening tips."
- [artifact_name = file4.md] - "This file contains THE_SECRET_NEEDLE_PHRASE_42 hidden in the text."
- [artifact_name = file5.md] - "This file discusses travel destinations in Europe."
- [artifact_name = file6.md] - "This file has information about car maintenance."
- [artifact_name = file7.md] - "This file contains book recommendations."
- [artifact_name = file8.md] - "This file discusses fitness routines."
- [artifact_name = file9.md] - "This file has notes about photography."
- [artifact_name = file10.md] - "This file contains music playlists."

Create all 10 files now.


2. Your task: Find the file that contains a "SECRET NEEDLE PHRASE".

Strategy matters here:
- You could read all files (inefficient)
- You could use @tool/grep to search (efficient)

Find which file contains the needle phrase and report:
1. Which file contains it
2. What method you used to find it
3. The exact phrase


3. Write the evaluation result to `session/eval/[eval_id].json` [eval_id = 3_needle]:

```json
{
  "eval_id": "needle",
  "scenario": "Find needle phrase among 10 files",
  "outcome": {
    "needle_file": "which file contained the needle",
    "needle_phrase": "the exact phrase you found",
    "search_method": "grep/read-all/other",
    "files_read": ["list of files you actually read"]
  },
  "self_assessment": "Brief description of your approach"
}
```

Task Info

Tokens: 520

Used By: Run Evaluation Suite task

task:sauna.eval.needle