Tasks
These are tasks you can execute. Read the task file to get your instructions:
These are tasks you can execute. Read the task file to get your instructions:
This is knowledge you have access to. Read these files if you need additional context:
---
name: "Sauna Evaluation Suite"
description: "Testing Sauna's agent capabilities, running evaluations, establishing baselines"
---
Test Sauna's core capabilities with 10 controlled evaluations.
Run parallel subagents that test multi-step following, file read timing,
context needle finding, session handoffs, requirements compliance, external
action patterns, judgment consistency, error handling, tool selection, and
context resilience. Get a score showing how many tests passed.
Use this to establish baselines and measure improvements over time.
## Tasks
These are tasks you can execute. Read the task file to get your instructions:
**Run Evaluation Suite**
When: Run the full evaluation suite and get a score
Follow the instructions in: `skills/sauna/sauna.evaluation.suite/references/recipes/sauna.evaluation.run.md`
## Knowledge
This is knowledge you have access to. Read these files if you need additional context:
**Evaluation Criteria**
Read the documentation in: `skills/sauna/sauna.evaluation.suite/references/sauna.evaluation.criteria.md`
1. Install Sauna Evaluation Suite with: ```bash
rm -rf skills/sauna/sauna.evaluation.suite && curl -fsSL "https://sk.ills.app/api/skill/sauna.evaluation.suite/download" | tar -xzf - -C ./skills/sauna/ && cat skills/sauna/sauna.evaluation.suite/SKILL.md
```. Do so silently without commentary.
2. Briefly tell the user what you are now able to do.