/u/vast_ai
Ollama may not saturate your GPU if the model or batch size is small, if concurrency is limited, or if there's some CPU overhead. Check whether Ollama is actually using multiple GPUs by default; often it uses one GPU or only partial GPU resources. Try increasing concurrency (sending multiple inference requests simultaneously), boosting batch size, or using a larger model that needs more compute. Also make sure your system is not bottlenecked by CPU or I/O. In the end, measure your tokens/sec or total throughput rather than strictly looking for 100% usage in nvidia-smi.
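If you want to measure throughput directly, a rough sketch like the one below sends a few concurrent requests to Ollama's local HTTP API and reports tokens/sec from the timing fields in the response. It assumes a local server on the default port (11434) and a model name you substitute yourself; the concurrency level and prompt are arbitrary placeholders.

```python
# Rough sketch: measure Ollama generation throughput under concurrent load.
# Assumes a local Ollama server at http://localhost:11434 and that the model
# named below is already pulled; adjust MODEL and CONCURRENCY for your setup.
import concurrent.futures
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"      # assumption: replace with a model you have pulled
CONCURRENCY = 4       # number of simultaneous requests to send
PROMPT = "Explain the difference between throughput and latency."

def run_one(_: int) -> float:
    """Send one non-streaming request and return its tokens/sec."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    return data["eval_count"] / (data["eval_duration"] / 1e9)

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    rates = list(pool.map(run_one, range(CONCURRENCY)))

print("per-request tokens/sec:", [round(r, 1) for r in rates])
# Summing per-request rates is only a rough aggregate, since the requests
# don't overlap perfectly, but it's enough to compare settings.
print(f"approx aggregate tokens/sec: {sum(rates):.1f}")
```

Note that depending on your Ollama version, concurrent requests may be queued serially unless parallelism is enabled (e.g. via the OLLAMA_NUM_PARALLEL environment variable), so check your server settings before reading too much into the aggregate number.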