I present three techniques to improve your fact-checking:
1️⃣ Citation Verification:
- Ask the model to provide verbatim citations from source documents
- Cross-reference citations with the original text to confirm accuracy
- If citations are fabricated, prompt the model to rewrite its answer
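The verification step above can be sketched as a simple verbatim check. This is a minimal illustration, not the exact pipeline from the post; the example strings and the `verify_citations` helper are hypothetical.

```python
# Minimal sketch of citation verification: flag any quote the model
# returned that does not appear verbatim in the source document.
# `source_text` and `citations` below are illustrative placeholders.

def verify_citations(citations: list[str], source_text: str) -> list[str]:
    """Return the citations that do NOT appear verbatim in the source."""
    normalized = " ".join(source_text.split())  # collapse whitespace
    return [c for c in citations if " ".join(c.split()) not in normalized]

source_text = "The model was trained on 2 trillion tokens of text."
citations = [
    "trained on 2 trillion tokens",   # present verbatim
    "trained on 3 trillion tokens",   # fabricated
]
fabricated = verify_citations(citations, source_text)
# If `fabricated` is non-empty, re-prompt the model to rewrite its answer.
```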
2️⃣ Structured Generation:
- Use JSON schemas or regex forcing to constrain model outputs
- Constraining the output space narrows what the model can generate, which improves response quality
- Particularly effective with newer models like GPT-4o and Gemini Pro that support strict output formatting
3️⃣ Context Focusing:
- Limit context length to ~16K tokens where possible
- Models perform best with information closer to the start of the context
- For longer documents, use a paged retrieval approach:
-- Process document page-by-page
-- Find relevant chunks using BM25 and cosine similarity
-- Combine current page with top relevant chunks for analysis
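The chunk-scoring step of the paged approach can be sketched as below: rank stored chunks against the current page with BM25 plus a bag-of-words cosine similarity, then keep the top-k. This is a simplified illustration (whitespace tokenization, term-count cosine instead of embeddings); the example chunks are invented.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()  # crude; real pipelines use proper tokenizers

def bm25_scores(query: list[str], chunks: list[list[str]], k1=1.5, b=0.75) -> list[float]:
    """Score each tokenized chunk against the query with Okapi BM25."""
    n = len(chunks)
    avgdl = sum(len(c) for c in chunks) / n
    df = Counter(t for c in chunks for t in set(c))  # document frequencies
    scores = []
    for c in chunks:
        tf = Counter(c)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(c) / avgdl))
        scores.append(s)
    return scores

def cosine(a_toks: list[str], b_toks: list[str]) -> float:
    """Cosine similarity over raw term counts (embeddings in practice)."""
    a, b = Counter(a_toks), Counter(b_toks)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(page: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the current page."""
    q = tokenize(page)
    toks = [tokenize(c) for c in chunks]
    combined = [s + cosine(q, c) for s, c in zip(bm25_scores(q, toks), toks)]
    ranked = sorted(range(len(chunks)), key=lambda i: combined[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]

chunks = [
    "revenue grew ten percent in the second quarter",
    "the office relocated to a new building",
    "quarterly revenue figures are audited annually",
]
page = "check the revenue claims for the quarter"
relevant = top_chunks(page, chunks, k=2)
```

The top chunks are then concatenated with the current page before analysis.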
➡️ Key Findings:
- Paged retrieval outperformed long-context approaches in identifying document inconsistencies
- Structured responses improved accuracy even in long-context scenarios
- Combining these techniques can significantly enhance fact-checking performance
I then compare the approaches (long context vs. long context + structure vs. paged retrieval + structure) on Gemini Pro, Claude Sonnet, and GPT-4o.
Cheers, Ronan
More Resources: Trelis.com/About