Choosing the fastest and cheapest GPU
Plus! Early Access to "Trelis Endpoints" for Newsletter Subscribers
🚂 Inference Engine Showdown 🚂
I tested vLLM, TGI, SGLang, and NVIDIA NIM side-by-side!
I also tested A40, A6000, A100 and H100 GPUs side-by-side.
💡 Key Findings:
➡ Quality vs Speed: 8-bit and larger formats maintain good quality, while smaller formats (INT4) trade quality for speed
➡ SGLang emerged as the fastest, especially for larger batch sizes
➡ H100 GPUs offer native FP8 support, giving a significant speed boost
💰 Cost Analysis:
➡ I compared costs for Llama 8B, 70B, and 405B models
➡ I tested A40, A6000, A100, and H100 GPUs
➡ Result: A40 often provides the best cost-per-token for smaller models
🔑 Overall Notes:
➡ FP8 format offers a sweet spot of quality and speed
➡ Cheaper GPUs can work well for smaller models, but larger models need more powerful GPUs to keep token generation speed reasonable
➡ SGLang and NVIDIA NIM handle larger batch sizes better than vLLM and TGI
👨‍💻 Extra Tip:
➡ When optimizing, consider not just cost-per-token, but make sure tokens per second are high enough for a good user experience.
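To make the cost-per-token vs. speed trade-off concrete, here is a back-of-envelope sketch. The hourly prices and throughputs below are illustrative assumptions, not the measured figures from my tests - plug in your own provider's numbers.

```python
# Back-of-envelope cost-per-token comparison across GPUs.
# Prices ($/hr) and throughputs (tokens/s) are assumed, not measured.

def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# (hourly price in $, tokens per second) - illustrative values only
gpus = {
    "A40":   (0.40, 40.0),
    "A6000": (0.50, 45.0),
    "A100":  (1.30, 90.0),
    "H100":  (2.50, 160.0),
}

for name, (price, tps) in gpus.items():
    cost = cost_per_million_tokens(price, tps)
    print(f"{name}: ${cost:.2f} per 1M tokens at {tps:.0f} tok/s")
```

Note that a cheaper GPU can win on cost per token while still losing on user experience - always check that the tokens-per-second column is high enough for your application before picking the lowest-cost row.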
Introducing “Trelis Endpoints”
This is a new product - in alpha - that Rohan Sharma and I have built. I’m just opening the product up to this email list for now.
It’s a one-click RAG API for your documents. You upload documents and immediately get an API endpoint you can query with natural language. Responses cite verified passages from your documents.
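As a rough sketch of what querying such an endpoint might look like: the URL, header, and payload fields below are hypothetical placeholders, not the actual Trelis Endpoints request shape.

```python
# Hypothetical sketch of a natural-language query to a RAG endpoint.
# The payload fields here are illustrative assumptions, not the
# real Trelis Endpoints schema.
import json

def build_query(question: str) -> dict:
    """Build a natural-language query payload for the endpoint."""
    return {"query": question, "cite_sources": True}

payload = build_query("What is the notice period in my contract?")
print(json.dumps(payload))

# To actually send it (requires the `requests` package, a real
# endpoint URL, and an API key):
#
# import requests
# resp = requests.post(
#     "https://api.example.com/v1/query",   # placeholder URL
#     headers={"Authorization": "Bearer YOUR_API_KEY"},
#     json=payload,
# )
# print(resp.json())  # answer plus cited passages
```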
Here is my 3 min video intro.
Here is the alpha product to try out (you’ll get 50+ requests for free).
Here is the Product Hunt pre-launch page if you would like to follow and get notified of the official launch.
Rohan and I would like feedback, so Trelis will give $25 in free credits to the best three video reviews - just record yourself using Loom or Descript and reply to this email with the link by end of day tomorrow (Wed 31st Aug).
Cheers, Ronan