This is part 2 of 2 in a series on test-time compute. I look at how parallel sampling, followed by selecting the best answer with a verifier, can improve LLM performance!
With thanks to Joe Runde, Cyrus Leung and Nick Hill on the @vllm_project GitHub for help getting guided decoding working.
⚡️ Key Points:
- Increasing test compute can significantly boost model performance
- Parallel sampling generates multiple answers for better results
- Verifiers are crucial for identifying the best answer among samples
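Here's a rough best-of-N sketch of the idea (not the exact code from the experiments). It assumes an OpenAI-compatible endpoint such as a local vLLM server, an assumed model name, and a placeholder `verify` function standing in for any of the verifiers below:

```python
from openai import OpenAI

# Assumptions: a local vLLM server exposing an OpenAI-compatible API, and this
# model name; `verify` is a placeholder for whichever verifier you plug in.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def sample_answers(question: str, n: int = 4) -> list[str]:
    """Draw n independent answers at non-zero temperature (parallel sampling)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        n=n,
        temperature=0.8,
    )
    return [choice.message.content for choice in resp.choices]

def best_of_n(question: str, verify, n: int = 4) -> str:
    """Sample n candidates, then return the one the verifier scores highest."""
    candidates = sample_answers(question, n)
    return max(candidates, key=lambda ans: verify(question, ans))
```

With a "perfect" verifier, `verify` is just an oracle check against the reference answer; the practical question is how well real verifiers approximate that.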
🔍 Techniques Explored:
- Simple sampling with perfect verifier assumption
- Voting-based verification
- Scoring-based verification (1-10 scale)
- Binary classification verification
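And roughly what those three verifier styles look like in code. These are illustrative sketches with prompts and helper names of my own (not the exact prompts I used), under the same local-server assumptions as above:

```python
import re
from collections import Counter

from openai import OpenAI

# Same assumptions as the best-of-N sketch: local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def extract_number(answer: str) -> str | None:
    """Pull the final number out of a GSM8K-style answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return nums[-1] if nums else None

# 1) Voting: no extra model calls, just a majority vote over final answers.
def vote(candidates: list[str]) -> str:
    counts = Counter(n for n in map(extract_number, candidates) if n is not None)
    if not counts:
        return candidates[0]
    winner, _ = counts.most_common(1)[0]
    return next(c for c in candidates if extract_number(c) == winner)

# 2) Scoring: ask the model to rate each candidate from 1 to 10, keep the top score.
SCORE_PROMPT = (
    "Question:\n{q}\n\nProposed answer:\n{a}\n\n"
    "Rate the correctness of the answer from 1 to 10. Reply with a single integer."
)

def score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": SCORE_PROMPT.format(q=question, a=answer)}],
        max_tokens=4,
        temperature=0.0,
    )
    digits = re.findall(r"\d+", resp.choices[0].message.content)
    return int(digits[0]) if digits else 0

# 3) Binary classification: ask for a yes/no verdict on each candidate.
# With vLLM you can constrain this output via guided decoding,
# e.g. extra_body={"guided_choice": ["yes", "no"]}.
BINARY_PROMPT = (
    "Question:\n{q}\n\nProposed answer:\n{a}\n\n"
    "Is the final answer correct? Reply with exactly 'yes' or 'no'."
)

def is_correct(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": BINARY_PROMPT.format(q=question, a=answer)}],
        max_tokens=2,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```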
📊 Results (Llama 3.2 1B):
- Single-shot baseline: ~50% accuracy
- 4-sample perfect verifier: ~70% accuracy
- Voting and scoring verifiers: Marginal improvements
- With Llama 3.2 3B, the verifiers start to show improvements (see link to slides)
🧪 Experiment Details:
- Model: Llama 3.2 1B
- Dataset: GSM8K (grade school math problems)
- 32 questions, 5 runs for statistical significance
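The eval loop is roughly this shape. Again a sketch: it reuses `sample_answers`, `vote` and `extract_number` from the snippets above, and the GSM8K split and helper names are illustrative rather than the exact harness:

```python
from datasets import load_dataset

def reference_answer(example: dict) -> str:
    """GSM8K references end with '#### <number>'."""
    return example["answer"].split("####")[-1].strip().replace(",", "")

def run_eval(n_samples: int = 4, n_questions: int = 32, n_runs: int = 5) -> list[float]:
    """Accuracy per run = fraction of questions where the selected answer matches the reference."""
    data = load_dataset("gsm8k", "main", split="test").select(range(n_questions))
    accuracies = []
    for _ in range(n_runs):
        correct = 0
        for ex in data:
            # uses sample_answers / vote / extract_number from the sketches above
            candidates = sample_answers(ex["question"], n=n_samples)
            chosen = vote(candidates)  # swap in scoring- or binary-based selection here
            if extract_number(chosen) == reference_answer(ex):
                correct += 1
        accuracies.append(correct / n_questions)
    return accuracies
```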
🔬 Key Insights:
- Verifier performance is crucial and often challenging
- Larger models tend to have better verifiers
- Performance varies significantly between easy and hard problems
That's it for this week. Let me know any Qs in the comments. Cheers, Ronan
More Resources: Trelis.com/About