Live now on GitHub! See TrelisResearch/one-click-llms for the links.
8B Model (in FP8)
- Run on A6000, A100, or H100.
- Add HF_TOKEN to env variables
Speed on 1xA100 SXM:
- 89 tok/s per prompt at batch size 1
- 38 tok/s per prompt at batch size 64
Speed on 1xA6000:
- 60 tok/s per prompt at batch size 1
- 30 tok/s per prompt at batch size 64
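Once a pod is up, the template exposes an OpenAI-compatible endpoint you can query from Python. A minimal sketch, assuming the template serves the standard /v1 chat API (as the vLLM/TGI templates in the repo typically do); the base_url and model ID below are placeholders to adjust for your pod:

    # Minimal sketch: query a deployed pod via its OpenAI-compatible API.
    # The base_url is a hypothetical Runpod proxy address; replace YOUR_POD_ID.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://YOUR_POD_ID-8000.proxy.runpod.net/v1",  # placeholder URL
        api_key="EMPTY",  # self-hosted servers typically ignore the key
    )

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # adjust to your deployment
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=500,
    )
    print(response.choices[0].message.content)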
70B Model (in FP8)
- Run on 4xA6000, 2xA100 SXM, or 2xH100.
- Add HF_TOKEN to env variables
Speed on 2xH100:
- 29 tok/s per prompt at batch size 1
- 16 tok/s per prompt at batch size 64
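To give a sense of what the multi-GPU setup involves: a minimal vLLM sketch, assuming the template serves the model with tensor parallelism across the two cards (the exact flags the template uses may differ, and the model ID shown is illustrative):

    # Sketch of serving the 70B model across 2 GPUs with vLLM (assumed backend).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # adjust to your FP8 repo
        tensor_parallel_size=2,  # shard weights across 2xH100 or 2xA100
        quantization="fp8",      # FP8 weights/activations
    )

    params = SamplingParams(max_tokens=500)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(outputs[0].outputs[0].text)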
405B Model (in FP8)
- Run on 8xA100 SXM or 8xH100.
- Add HF_TOKEN to env variables
Speed on 8xA100 SXM:
- 15 tok/s per prompt at batch size 1
- 11 tok/s per prompt at batch size 64
Note: Takes ~30 mins to load.
405B Model (in INT4)
- Run on 4xA100 SXM or 4xH100.
Speed on 4xA100 SXM:
- 15 tok/s per prompt at batch size 1
- 6 tok/s per prompt at batch size 64
Note: Takes ~20 mins to load.
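The INT4 build is an AWQ quant (hence the hat tip in the notes below). A hedged loading sketch, assuming vLLM as the backend; the repo name is an assumption, so substitute whatever model ID your template points at:

    # Hedged sketch: loading an INT4 (AWQ) quant of the 405B model across 4 GPUs.
    # The repo name below is an assumption; point it at the quant your template uses.
    from vllm import LLM

    llm = LLM(
        model="hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",  # assumed repo
        tensor_parallel_size=4,
        quantization="awq",
    )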
Notes
All speeds are tokens per second per prompt, measured with a short input prompt and a 500-token completion.
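As a rough sketch of how these per-prompt figures can be reproduced (base_url and model ID are placeholders): fire off N concurrent requests and divide each completion's token count by its wall-clock time.

    # Rough benchmark sketch: per-prompt tok/s at a given batch size.
    # base_url and model are placeholders; adjust to your pod.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url="https://YOUR_POD_ID-8000.proxy.runpod.net/v1",  # placeholder
        api_key="EMPTY",
    )

    async def one_request() -> float:
        start = time.perf_counter()
        resp = await client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder
            messages=[{"role": "user", "content": "Tell me a short story."}],
            max_tokens=500,
        )
        elapsed = time.perf_counter() - start
        return resp.usage.completion_tokens / elapsed  # per-prompt tok/s

    async def benchmark(batch_size: int) -> None:
        rates = await asyncio.gather(*(one_request() for _ in range(batch_size)))
        print(f"batch {batch_size}: {sum(rates) / len(rates):.1f} tok/s per prompt")

    asyncio.run(benchmark(1))
    asyncio.run(benchmark(64))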
BONUS: If running on an H100, you can swap in the Neural Magic FP8 models (neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8, neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8) for roughly 2x faster downloads, since the weights are stored directly in FP8.
Hat tip to Casper Hansen and Hugging Face for the INT4 quant.
Cheers, Ronan
Trelis’ Runpod Affiliate Link (supports the channel): https://runpod.io?ref=jmfkcdio
The Llama 3.1 405B model has caught up with the best current proprietary models such as GPT-4o and Claude 3.5, which is a significant event for the open-source community. The technical report is nearly 100 pages long and packed with useful detail; even a quick read-through is very inspiring.