Task

Eval: Multi-step Following

Test whether the agent can complete 5 numbered steps in order.

This is a controlled evaluation. You must complete exactly these 5 steps
in order. Do not skip any step. Do not add extra steps.

Write to Evaluation Workspaces [eval_id = multistep]:

Step 1: Create [artifact_name = step1.md] containing only "Step 1 complete"
Step 2: Create [artifact_name = step2.md] containing only "Step 2 complete"
Step 3: Create [artifact_name = step3.md] containing only "Step 3 complete"
Step 4: Create [artifact_name = step4.md] containing only "Step 4 complete"
Step 5: Create [artifact_name = step5.md] containing only "Step 5 complete"

Complete these steps now, in order.

Write the evaluation result to Evaluation Results [eval_id = 1_multistep]:

{
  "eval_id": "multistep",
  "scenario": "Complete 5 numbered steps in order",
  "outcome": {
    "steps_completed": ["list of steps you completed"],
    "artifacts": ["list of files you created"]
  },
  "self_assessment": "Brief description of what you did"
}

You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If several read steps are consecutive, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. This is a controlled evaluation. You must complete exactly these 5 steps
in order. Do not skip any step. Do not add extra steps.

Write to `session/eval/[eval_id]/[artifact_name].md` [eval_id = multistep]:

**Step 1:** Create [artifact_name = step1.md] containing only "Step 1 complete"
**Step 2:** Create [artifact_name = step2.md] containing only "Step 2 complete"
**Step 3:** Create [artifact_name = step3.md] containing only "Step 3 complete"
**Step 4:** Create [artifact_name = step4.md] containing only "Step 4 complete"
**Step 5:** Create [artifact_name = step5.md] containing only "Step 5 complete"

Complete these steps now, in order.


2. Write the evaluation result to `session/eval/[eval_id].json` [eval_id = 1_multistep]:

```json
{
  "eval_id": "multistep",
  "scenario": "Complete 5 numbered steps in order",
  "outcome": {
    "steps_completed": ["list of steps you completed"],
    "artifacts": ["list of files you created"]
  },
  "self_assessment": "Brief description of what you did"
}
```
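
The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the harness's actual implementation: the function name `run_multistep` and its `root` parameter are hypothetical, and the sketch assumes that `[artifact_name]` already carries the `.md` extension, so the first file lands at `session/eval/multistep/step1.md`. It also takes the result file name `1_multistep.json` and the `"eval_id": "multistep"` field exactly as written in the prompt.

```python
import json
from pathlib import Path


def run_multistep(root: Path, eval_id: str = "multistep",
                  result_id: str = "1_multistep") -> dict:
    """Create step1.md..step5.md in order, then write the result JSON.

    Assumption: artifact names already include ".md", so files are written
    to <root>/session/eval/<eval_id>/stepN.md (hypothetical layout).
    """
    workspace = root / "session" / "eval" / eval_id
    workspace.mkdir(parents=True, exist_ok=True)

    steps_completed = []
    artifacts = []
    # Steps run strictly in order; each file holds only its completion line.
    for n in range(1, 6):
        artifact = workspace / f"step{n}.md"
        artifact.write_text(f"Step {n} complete", encoding="utf-8")
        steps_completed.append(f"Step {n}")
        artifacts.append(artifact.relative_to(root).as_posix())

    # Result record mirrors the JSON template from the prompt.
    result = {
        "eval_id": eval_id,
        "scenario": "Complete 5 numbered steps in order",
        "outcome": {
            "steps_completed": steps_completed,
            "artifacts": artifacts,
        },
        "self_assessment": "Created five step files in order, then wrote this result.",
    }
    result_path = root / "session" / "eval" / f"{result_id}.json"
    result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

Calling `run_multistep(Path("."))` would write the five step files and the result file relative to the current directory. Keeping the loop sequential and recording each step only after its file is written matches the requirement that no step is skipped or reordered.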

Task Info

Steps

Tokens: 354

Used By: Run Evaluation Suite task

task:sauna.eval.multistep