This is part 2 of 2 in a series on test-time compute. I look at how parallel sampling, followed by selecting the best answer with a verifier, can improve LLM performance!
With thanks to Joe Runde, Cyrus Leung and Nick Hill on the @vllm_project GitHub for help getting guided decoding working.
⚡️ Key Points:
- Increasing test compute can significantly boost model performance
- Parallel sampling generates multiple answers for better results
- Verifiers are crucial for identifying the best answer among samples
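Here's a rough best-of-N sketch of the idea (not the exact code from the experiments). It assumes an OpenAI-compatible endpoint such as a local vLLM server, an assumed model name, and a placeholder `verify` function standing in for any of the verifiers below:

```python
from openai import OpenAI

# Assumptions: a local vLLM server exposing an OpenAI-compatible API, and this
# model name; `verify` is a placeholder for whichever verifier you plug in.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def sample_answers(question: str, n: int = 4) -> list[str]:
    """Draw n independent answers at non-zero temperature (parallel sampling)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        n=n,
        temperature=0.8,
    )
    return [choice.message.content for choice in resp.choices]

def best_of_n(question: str, verify, n: int = 4) -> str:
    """Sample n candidates, then return the one the verifier scores highest."""
    candidates = sample_answers(question, n)
    return max(candidates, key=lambda ans: verify(question, ans))
```

With a "perfect" verifier, `verify` is just an oracle check against the reference answer; the practical question is how well real verifiers approximate that.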
🔍 Techniques Explored:
- Simple sampling with perfect verifier assumption
- Voting-based verification
- Scoring-based verification (1-10 scale)
- Binary classification verification
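And roughly what those three verifier styles look like in code. These are illustrative sketches with prompts and helper names of my own (not the exact prompts I used), under the same local-server assumptions as above:

```python
import re
from collections import Counter

from openai import OpenAI

# Same assumptions as the best-of-N sketch: local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def extract_number(answer: str) -> str | None:
    """Pull the final number out of a GSM8K-style answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return nums[-1] if nums else None

# 1) Voting: no extra model calls, just a majority vote over final answers.
def vote(candidates: list[str]) -> str:
    counts = Counter(n for n in map(extract_number, candidates) if n is not None)
    if not counts:
        return candidates[0]
    winner, _ = counts.most_common(1)[0]
    return next(c for c in candidates if extract_number(c) == winner)

# 2) Scoring: ask the model to rate each candidate from 1 to 10, keep the top score.
SCORE_PROMPT = (
    "Question:\n{q}\n\nProposed answer:\n{a}\n\n"
    "Rate the correctness of the answer from 1 to 10. Reply with a single integer."
)

def score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": SCORE_PROMPT.format(q=question, a=answer)}],
        max_tokens=4,
        temperature=0.0,
    )
    digits = re.findall(r"\d+", resp.choices[0].message.content)
    return int(digits[0]) if digits else 0

# 3) Binary classification: ask for a yes/no verdict on each candidate.
# With vLLM you can constrain this output via guided decoding,
# e.g. extra_body={"guided_choice": ["yes", "no"]}.
BINARY_PROMPT = (
    "Question:\n{q}\n\nProposed answer:\n{a}\n\n"
    "Is the final answer correct? Reply with exactly 'yes' or 'no'."
)

def is_correct(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": BINARY_PROMPT.format(q=question, a=answer)}],
        max_tokens=2,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```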
📊 Results (Llama 3.2 1B):
- Single-shot baseline: ~50% accuracy
- 4-sample perfect verifier: ~70% accuracy
- Voting and scoring verifiers: Marginal improvements
- With Llama 3.2 3B, the verifiers start to show improvements (see link to slides)
🧪 Experiment Details:
- Model: Llama 3.2 1B
- Dataset: GSM8K (grade school math problems)
- 32 questions, 5 runs for statistical significance
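The eval loop is roughly this shape. Again a sketch: it reuses `sample_answers`, `vote` and `extract_number` from the snippets above, and the GSM8K split and helper names are illustrative rather than the exact harness:

```python
from datasets import load_dataset

def reference_answer(example: dict) -> str:
    """GSM8K references end with '#### <number>'."""
    return example["answer"].split("####")[-1].strip().replace(",", "")

def run_eval(n_samples: int = 4, n_questions: int = 32, n_runs: int = 5) -> list[float]:
    """Accuracy per run = fraction of questions where the selected answer matches the reference."""
    data = load_dataset("gsm8k", "main", split="test").select(range(n_questions))
    accuracies = []
    for _ in range(n_runs):
        correct = 0
        for ex in data:
            # uses sample_answers / vote / extract_number from the sketches above
            candidates = sample_answers(ex["question"], n=n_samples)
            chosen = vote(candidates)  # swap in scoring- or binary-based selection here
            if extract_number(chosen) == reference_answer(ex):
                correct += 1
        accuracies.append(correct / n_questions)
    return accuracies
```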
🔬 Key Insights:
- Verifier performance is crucial and often challenging
- Larger models tend to have better verifiers
- Performance varies significantly between easy and hard problems
That's it for this week. Let me know any Qs in the comments. Cheers, Ronan
More Resources: Trelis.com/About