Task

Eval: Context Resilience

Test whether the agent can answer questions despite noisy context.

Read Resilience Eval Context slice →

Based on the context you just read, answer this question:

QUESTION: What is the capital of Australia?

Answer this question correctly. The answer is in the context above,
buried among irrelevant information.

Write the evaluation result to Evaluation Results as 10_resilience.json:

{
  "eval_id": "resilience",
  "scenario": "Answer specific question despite noisy context",
  "outcome": {
    "question": "What is the capital of Australia?",
    "answer_given": "your answer",
    "correct_answer": "Canberra",
    "is_correct": true/false,
    "distracted_by_noise": true/false,
    "context_blocks_count": 5
  },
  "self_assessment": "Brief description of how you handled the noisy context"
}

You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If you have multiple CONSECUTIVE read steps, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. [Read Resilience Eval Context]: Read the documentation in: `skills/sauna/[skill_id]/references/sauna.eval.context.md`

2. Based on the context you just read, answer this question:

**QUESTION: What is the capital of Australia?**

Answer this question correctly. The answer is in the context above,
buried among irrelevant information.


3. Write the evaluation result to `session/eval/[eval_id].json` as `10_resilience.json`:

```json
{
  "eval_id": "resilience",
  "scenario": "Answer specific question despite noisy context",
  "outcome": {
    "question": "What is the capital of Australia?",
    "answer_given": "your answer",
    "correct_answer": "Canberra",
    "is_correct": true/false,
    "distracted_by_noise": true/false,
    "context_blocks_count": 5
  },
  "self_assessment": "Brief description of how you handled the noisy context"
}
```
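The record in step 3 could be assembled programmatically, as in the following minimal sketch. The helper name `build_result` is hypothetical, and the correctness check (a case-insensitive exact match after trimming whitespace) is an assumption, since the task does not say how `is_correct` should be decided. The `distracted_by_noise` value is passed in by the caller for the same reason. Note that the `true/false` placeholders in the template are not valid JSON on their own; the sketch emits real JSON booleans in their place.

```python
import json


def build_result(answer_given: str, distracted_by_noise: bool) -> dict:
    """Assemble the step 3 record; field names follow the template above."""
    correct_answer = "Canberra"
    # Assumption: correctness is a case-insensitive exact match after trimming.
    is_correct = answer_given.strip().lower() == correct_answer.lower()
    return {
        "eval_id": "resilience",
        "scenario": "Answer specific question despite noisy context",
        "outcome": {
            "question": "What is the capital of Australia?",
            "answer_given": answer_given,
            "correct_answer": correct_answer,
            "is_correct": is_correct,
            "distracted_by_noise": distracted_by_noise,
            "context_blocks_count": 5,
        },
        # Placeholder text from the template; replace with a real assessment.
        "self_assessment": "Brief description of how you handled the noisy context",
    }


result = build_result("Canberra", distracted_by_noise=False)
serialized = json.dumps(result, indent=2)
print(serialized)
```

Serializing with `json.dumps` rather than hand-writing the text keeps the booleans and the integer count correctly typed, so the output can be parsed back with `json.loads` without surprises.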

Task Info

Steps

Tokens: 291

Used By: Run Evaluation Suite task

task:sauna.eval.resilience