# Metric Selection Guide for LLM Evaluation
This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.
## Metric Categories

### Classification Metrics

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).

#### Precision

```
Precision = True Positives / (True Positives + False Positives)
```

Interpretation: Of all responses the judge said were good, what fraction were actually good?
Use when: False positives are costly (e.g., approving unsafe content)
```python
def precision(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    predicted_positives = sum(predictions)
    return true_positives / predicted_positives if predicted_positives > 0 else 0
```

#### Recall
```
Recall = True Positives / (True Positives + False Negatives)
```

Interpretation: Of all actually good responses, what fraction did the judge identify?
Use when: False negatives are costly (e.g., missing good content in filtering)
```python
def recall(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    actual_positives = sum(ground_truth)
    return true_positives / actual_positives if actual_positives > 0 else 0
```

#### F1 Score
```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

Interpretation: Harmonic mean of precision and recall
Use when: You need a single number balancing both concerns
```python
def f1_score(predictions, ground_truth):
    p = precision(predictions, ground_truth)
    r = recall(predictions, ground_truth)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0
```
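A quick usage sketch with toy data, where the judge's pass/fail verdicts are compared against human labels (both as 0/1 lists):

```python
# Toy data (hypothetical): 1 = good response, 0 = bad response
predictions = [1, 1, 0, 1, 0, 1]   # judge's verdicts
ground_truth = [1, 0, 0, 1, 1, 1]  # human labels

print(precision(predictions, ground_truth))  # 3/4 = 0.75
print(recall(predictions, ground_truth))     # 3/4 = 0.75
print(f1_score(predictions, ground_truth))   # 0.75
```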
### Agreement Metrics

Use for comparing automated evaluation with human judgment.
#### Cohen's Kappa (κ)
```
κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
```

Interpretation: Agreement adjusted for chance
- κ > 0.8: Almost perfect agreement
- κ 0.6-0.8: Substantial agreement
- κ 0.4-0.6: Moderate agreement
- κ < 0.4: Fair to poor agreement
Use for: Binary or categorical judgments
```python
def cohens_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2)
```

#### Weighted Kappa
For ordinal scales where disagreement severity matters:
```python
def weighted_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2, weights='quadratic')
```

Interpretation: Penalizes large disagreements more than small ones
### Correlation Metrics

Use for ordinal/continuous scores.

#### Spearman's Rank Correlation (ρ)
Interpretation: Correlation between rankings, not absolute values
- ρ > 0.9: Very strong correlation
- ρ 0.7-0.9: Strong correlation
- ρ 0.5-0.7: Moderate correlation
- ρ < 0.5: Weak correlation
Use when: Order matters more than exact values
```python
def spearmans_rho(scores1, scores2):
    from scipy.stats import spearmanr
    rho, p_value = spearmanr(scores1, scores2)
    return {'rho': rho, 'p_value': p_value}
```

#### Kendall's Tau (τ)
Interpretation: Similar to Spearman but based on pairwise concordance
Use when: You have many tied values
```python
def kendalls_tau(scores1, scores2):
    from scipy.stats import kendalltau
    tau, p_value = kendalltau(scores1, scores2)
    return {'tau': tau, 'p_value': p_value}
```

#### Pearson Correlation (r)
Interpretation: Linear correlation between scores
Use when: Exact score values matter, not just order
```python
def pearsons_r(scores1, scores2):
    from scipy.stats import pearsonr
    r, p_value = pearsonr(scores1, scores2)
    return {'r': r, 'p_value': p_value}
```
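To see the rank-vs-linear distinction concretely, here is a toy comparison using the helpers above: a perfectly monotone but nonlinear relationship earns ρ = 1.0 from Spearman while Pearson's r stays below 1.

```python
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 4, 8, 16, 32]  # same ranking, but growth is exponential

print(spearmans_rho(scores_a, scores_b)['rho'])  # 1.0 -- rankings agree exactly
print(pearsons_r(scores_a, scores_b)['r'])       # ~0.93 -- linear fit is imperfect
```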
### Pairwise Comparison Metrics

#### Agreement Rate

```
Agreement = (Matching Decisions) / (Total Comparisons)
```

Interpretation: Simple percentage of agreement
```python
def pairwise_agreement(decisions1, decisions2):
    matches = sum(1 for d1, d2 in zip(decisions1, decisions2) if d1 == d2)
    return matches / len(decisions1) if decisions1 else 0
```

#### Position Consistency

```
Consistency = (Consistent across position swaps) / (Total comparisons)
```

Interpretation: How often the judge reaches the same decision when the responses' positions are swapped
```python
def position_consistency(results):
    consistent = sum(1 for r in results if r['position_consistent'])
    return consistent / len(results) if results else 0
```
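The `results` entries are assumed to come from a position-swapped comparison such as `compare_with_position_swap` (used again in Use Case 2 below). A minimal sketch of that helper, assuming a hypothetical async `judge_pair(first, second, prompt)` that returns `'first'`, `'second'`, or `'tie'`:

```python
async def compare_with_position_swap(a, b, prompt):
    # judge_pair is hypothetical: asks the judge which shown response is better
    forward = await judge_pair(a, b, prompt)   # A shown first
    backward = await judge_pair(b, a, prompt)  # B shown first
    winner_fwd = {'first': 'A', 'second': 'B'}.get(forward, 'TIE')
    winner_bwd = {'first': 'B', 'second': 'A'}.get(backward, 'TIE')
    consistent = winner_fwd == winner_bwd
    return {
        # If the verdict flips with position, treat the pair as a tie
        'winner': winner_fwd if consistent else 'TIE',
        'position_consistent': consistent,
    }
```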
## Selection Decision Tree

```
What type of evaluation task?
│
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
│
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
│
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
│
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics (see the sketch below)
```
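For the multi-label branch, scikit-learn's `f1_score` covers Macro-F1, Micro-F1, and per-label scores directly; a toy sketch with binary indicator arrays:

```python
from sklearn.metrics import f1_score

# Rows = examples, columns = labels (hypothetical toy data)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

print(f1_score(y_true, y_pred, average='macro'))  # unweighted mean of per-label F1
print(f1_score(y_true, y_pred, average='micro'))  # F1 over pooled TP/FP/FN counts
print(f1_score(y_true, y_pred, average=None))     # one F1 per label
```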
## Metric Selection by Use Case

### Use Case 1: Validating Automated Evaluation
Goal: Ensure automated evaluation correlates with human judgment
Recommended Metrics:
- Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
- Secondary: Per-criterion agreement
- Diagnostic: Confusion matrix for systematic errors (see the sketch after the code below)
```python
def validate_automated_eval(automated_scores, human_scores, criteria):
    # Each element of automated_scores / human_scores is a dict: criterion -> score
    results = {}
    # Overall correlation, using each item's mean score across criteria
    auto_overall = [sum(s[c] for c in criteria) / len(criteria) for s in automated_scores]
    human_overall = [sum(s[c] for c in criteria) / len(criteria) for s in human_scores]
    results['overall_spearman'] = spearmans_rho(auto_overall, human_overall)
    # Per-criterion agreement
    for criterion in criteria:
        auto_crit = [s[criterion] for s in automated_scores]
        human_crit = [s[criterion] for s in human_scores]
        results[f'{criterion}_spearman'] = spearmans_rho(auto_crit, human_crit)
    return results
```
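For the confusion-matrix diagnostic, scikit-learn works directly on binary or binned labels; a toy sketch:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical pass/fail labels from human review and the automated judge
human_labels = [1, 0, 1, 1, 0, 1, 0, 0]
auto_labels = [1, 0, 0, 1, 1, 1, 0, 0]

# Rows = human (actual), columns = automated (predicted);
# off-diagonal cells reveal systematic over- or under-approval
print(confusion_matrix(human_labels, auto_labels))
```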
### Use Case 2: Comparing Two Models

Goal: Determine which model produces better outputs
Recommended Metrics:
- Primary: Win rate (from pairwise comparison)
- Secondary: Position consistency (bias check)
- Diagnostic: Per-criterion breakdown
```python
async def compare_models(model_a_outputs, model_b_outputs, prompts):
    results = []
    for a, b, p in zip(model_a_outputs, model_b_outputs, prompts):
        # Evaluate each pair in both presentation orders
        comparison = await compare_with_position_swap(a, b, p)
        results.append(comparison)
    return {
        'a_wins': sum(1 for r in results if r['winner'] == 'A'),
        'b_wins': sum(1 for r in results if r['winner'] == 'B'),
        'ties': sum(1 for r in results if r['winner'] == 'TIE'),
        'position_consistency': position_consistency(results)
    }
```
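Since the function is async, a minimal invocation sketch (the output lists are assumed to be collected upstream); the primary win-rate metric falls out of the returned counts:

```python
import asyncio

# model_a_outputs, model_b_outputs, prompts: assumed collected upstream
summary = asyncio.run(compare_models(model_a_outputs, model_b_outputs, prompts))
total = summary['a_wins'] + summary['b_wins'] + summary['ties']
print('Model A win rate:', summary['a_wins'] / total)
print('Position consistency:', summary['position_consistency'])
```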
### Use Case 3: Quality Monitoring

Goal: Track evaluation quality over time
Recommended Metrics:
- Primary: Rolling agreement with human spot-checks
- Secondary: Score distribution stability
- Diagnostic: Bias indicators (position, length)
```python
from collections import deque

import numpy as np

class QualityMonitor:
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)

    def add_evaluation(self, automated, human_spot_check=None):
        self.window.append({
            'automated': automated,
            'human': human_spot_check,
            'length': len(automated['response'])
        })

    def get_metrics(self):
        # Filter to evaluations with human spot-checks
        with_human = [e for e in self.window if e['human'] is not None]
        if len(with_human) < 10:
            return {'insufficient_data': True}
        auto_scores = [e['automated']['score'] for e in with_human]
        human_scores = [e['human']['score'] for e in with_human]
        return {
            'correlation': spearmans_rho(auto_scores, human_scores),
            'mean_difference': np.mean([a - h for a, h in zip(auto_scores, human_scores)]),
            'length_correlation': spearmans_rho(
                [e['length'] for e in self.window],
                [e['automated']['score'] for e in self.window]
            )
        }
```
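A usage sketch (score and response values hypothetical); `get_metrics` reports insufficient data until at least 10 spot-checked evaluations accumulate:

```python
monitor = QualityMonitor(window_size=200)

# Most evaluations are automated-only; some fraction gets a human spot-check
monitor.add_evaluation({'score': 4, 'response': 'The capital of France is Paris.'})
monitor.add_evaluation({'score': 3, 'response': 'It depends on several factors...'},
                       human_spot_check={'score': 4})

print(monitor.get_metrics())  # {'insufficient_data': True} until >= 10 spot-checks
```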
## Interpreting Metric Results

### Good Evaluation System Indicators
| Metric | Good | Acceptable | Concerning |
|---|---|---|---|
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |
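These thresholds are easy to encode as an automated gate; a sketch (cutoff values taken from the table above):

```python
def check_metrics(rho, kappa, pos_consistency, length_corr):
    """Grade each metric against the thresholds in the table above."""
    def grade(value, good, acceptable, higher_is_better=True):
        if higher_is_better:
            if value > good:
                return 'good'
            return 'acceptable' if value >= acceptable else 'concerning'
        if value < good:
            return 'good'
        return 'acceptable' if value <= acceptable else 'concerning'

    return {
        'spearman_rho': grade(rho, 0.8, 0.6),
        'cohens_kappa': grade(kappa, 0.7, 0.5),
        'position_consistency': grade(pos_consistency, 0.9, 0.8),
        'length_correlation': grade(length_corr, 0.2, 0.4, higher_is_better=False),
    }

print(check_metrics(0.82, 0.74, 0.91, 0.12))  # all 'good'
```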
### Warning Signs
- High agreement but low correlation: May indicate calibration issues
- Low position consistency: Position bias affecting results
- High length correlation: Length bias inflating scores
- Per-criterion variance: Some criteria may be poorly defined
## Reporting Template

```markdown
## Evaluation System Metrics Report

### Human Agreement
- Spearman's ρ: 0.82 (p < 0.001)
- Cohen's κ: 0.74
- Sample size: 500 evaluations

### Bias Indicators
- Position consistency: 91%
- Length-score correlation: 0.12

### Per-Criterion Performance
| Criterion    | Spearman's ρ | κ    |
|--------------|--------------|------|
| Accuracy     | 0.88         | 0.79 |
| Clarity      | 0.76         | 0.68 |
| Completeness | 0.81         | 0.72 |

### Recommendations
- All metrics within acceptable ranges
- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
```