/u/vast_ai
Ollama may not saturate your GPU if the model or batch size is small, if concurrency is limited, or if there's some CPU overhead. Check whether Ollama is actually using multiple GPUs by default; often it uses one GPU or only partial GPU resources. Try increasing concurrency (sending multiple inference requests simultaneously), boosting batch size, or using a larger model that needs more compute. Also make sure your system is not bottlenecked by CPU or I/O. In the end, measure your tokens/sec or total throughput rather than strictly looking for 100% usage in nvidia-smi.
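If you want to measure throughput directly, a rough sketch like the one below sends a few concurrent requests to Ollama's local HTTP API and reports tokens/sec from the timing fields in the response. It assumes a local server on the default port (11434) and a model name you substitute yourself; the concurrency level and prompt are arbitrary placeholders.

```python
# Rough sketch: measure Ollama generation throughput under concurrent load.
# Assumes a local Ollama server at http://localhost:11434 and that the model
# named below is already pulled; adjust MODEL and CONCURRENCY for your setup.
import concurrent.futures
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"      # assumption: replace with a model you have pulled
CONCURRENCY = 4       # number of simultaneous requests to send
PROMPT = "Explain the difference between throughput and latency."

def run_one(_: int) -> float:
    """Send one non-streaming request and return its tokens/sec."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    return data["eval_count"] / (data["eval_duration"] / 1e9)

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    rates = list(pool.map(run_one, range(CONCURRENCY)))

print("per-request tokens/sec:", [round(r, 1) for r in rates])
# Summing per-request rates is only a rough aggregate, since the requests
# don't overlap perfectly, but it's enough to compare settings.
print(f"approx aggregate tokens/sec: {sum(rates):.1f}")
```

Note that depending on your Ollama version, concurrent requests may be queued serially unless parallelism is enabled (e.g. via the OLLAMA_NUM_PARALLEL environment variable), so check your server settings before reading too much into the aggregate number.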