Deploying Serverless Endpoints
💫 Scaling LLM Inference with Serverless Endpoints on RunPod 💫
Serverless endpoints allow you to automatically scale the number of GPUs based on incoming requests. This is great for production use cases or testing scenarios where you want to avoid needlessly leaving GPUs running.
I recently experimented with setting up a serverless endpoint on RunPod to run Mistral, Mixtral and Llama 70B language models.
➡️ Some key benefits of the serverless approach:
- Automatically scales workers (GPUs) up and down based on load
- Avoids wasted costs from forgetting to shut down GPUs
- Supports high concurrency (I tested 64 parallel requests; see the sketch after this list)
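As a rough sketch, the concurrency test looks something like this, assuming the standard RunPod `/runsync` route. The endpoint ID, prompt payload shape and max_workers are placeholders, not a prescription; the full setup and inference scripts are in the ADVANCED-inference repo linked below.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder - copy from the RunPod console
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"


def ask(prompt: str) -> dict:
    """Send one synchronous request to the serverless endpoint."""
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt, "max_tokens": 128}},  # payload shape depends on your worker
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()


# Fire 64 requests at once; RunPod queues them and scales workers up to your configured max.
prompts = [f"Question {i}: what is the capital of France?" for i in range(64)]
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(ask, prompts))

print(results[0])
```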
➡️ How to set up Serverless Endpoints on RunPod:
- Create a storage volume to efficiently share model weights across workers
- Configure serverless settings (GPU types, min/max workers, timeouts, etc.)
- Provide a Docker image that wraps your inference code in a request handler, and set the key environment variables (minimal handler sketch after this list)
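On the Docker image point: the image's entrypoint is just a handler that RunPod's serverless SDK calls once per request. Here's a minimal sketch; the MODEL_ID variable name and the stubbed generation are placeholders, and a real worker would load the model from the shared storage volume at cold start.

```python
# handler.py - entrypoint baked into the Docker image
import os

import runpod

# Hypothetical env var name; use whatever your image actually reads.
MODEL_ID = os.environ.get("MODEL_ID", "mistralai/Mixtral-8x7B-Instruct-v0.1")

# In a real worker, load the model once here (vLLM, transformers, ...) so the
# weights on the shared network volume are only read at cold start.


def handler(job):
    """Called by RunPod once per request routed to this worker."""
    prompt = job["input"]["prompt"]
    # ... run inference with the loaded model ...
    return {"output": f"(generated text for: {prompt})"}  # stubbed for the sketch


runpod.serverless.start({"handler": handler})
```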
➡️ Cost considerations:
- Serverless GPUs cost 2-3x more per second than standard rentals
- Makes economic sense if you'd use less than ~35% of a dedicated GPU's capacity (roughly the 1/3 break-even implied by a 3x price multiple; worked example below)
- For high-volume production, custom scaling logic may be more cost effective
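To put numbers on that break-even point, here's a back-of-the-envelope sketch. The $/hr rate is illustrative rather than a RunPod quote, and it ignores cold-start latency and idle-timeout overhead.

```python
# Break-even: serverless pays off when your average utilization of a
# dedicated GPU would fall below 1 / price_multiple.
dedicated_rate = 2.00   # $/hr for a dedicated GPU (illustrative, not a quote)
price_multiple = 3.0    # serverless costs roughly 2-3x more per second of compute
utilization = 0.25      # fraction of each hour you actually serve requests

breakeven_utilization = 1 / price_multiple                       # ~0.33 at 3x, ~0.50 at 2x
dedicated_cost_per_hour = dedicated_rate                          # billed whether busy or idle
serverless_cost_per_hour = dedicated_rate * price_multiple * utilization  # billed only while busy

print(f"break-even utilization: {breakeven_utilization:.0%}")
print(f"dedicated:  ${dedicated_cost_per_hour:.2f}/hr")
print(f"serverless: ${serverless_cost_per_hour:.2f}/hr")  # $1.50/hr -> cheaper below break-even
```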
Let me know in the comments if this is something you've tried out or are considering.
Cheers, Ronan
➡️ ADVANCED-inference Repo (incl. setup and inference scripts)
➡️ ADVANCED-transcription Repo
➡️ Trelis Function-calling Models
➡️ Tip Jar