/u/vast_ai on Instance not using the whole GPU

Ollama may not saturate your GPU if the model or batch size is small, if concurrency is limited, or if there is CPU overhead in the pipeline. Check whether Ollama is actually using multiple GPUs by default; often it uses only one GPU or a fraction of its resources. Try increasing concurrency (sending multiple inference requests at the same time), raising the batch size, or using a larger model that needs more compute. Also make sure the system isn't bottlenecked by CPU or I/O. In the end, measure tokens/sec or total throughput rather than chasing 100% utilization in nvidia-smi.
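As a rough way to test this, here is a minimal sketch that fires several generation requests at Ollama concurrently and reports aggregate tokens/sec. It assumes the default local API endpoint (http://localhost:11434/api/generate), that the non-streaming response includes the eval_count and eval_duration fields (verify against your Ollama version), and a placeholder model name "llama3" you would swap for whatever you have pulled. Note the server may also need OLLAMA_NUM_PARALLEL raised for requests to actually run in parallel.

```python
import time
import concurrent.futures
import requests  # pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llama3"        # hypothetical model name; substitute the one you have pulled
PROMPT = "Explain how GPUs accelerate matrix multiplication."
CONCURRENCY = 4         # number of simultaneous requests to send

def run_request(_):
    # Non-streaming request so the final JSON carries token counts and timings.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    # (field names as documented in Ollama's API; check your version).
    return data.get("eval_count", 0)

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    token_counts = list(pool.map(run_request, range(CONCURRENCY)))
elapsed = time.perf_counter() - start

total_tokens = sum(token_counts)
print(f"{total_tokens} tokens in {elapsed:.1f}s "
      f"-> {total_tokens / elapsed:.1f} tokens/sec aggregate")
```

Run it once with CONCURRENCY=1 and again with 4 or 8 while watching nvidia-smi; if aggregate tokens/sec scales up with concurrency, the GPU had headroom that single requests were not using.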
