vllm-lab
Measured vLLM's core engine mechanisms one at a time on a 6 GB laptop GPU: continuous batching (39 → 2,405 tok/s, 62×), prefix caching, PagedAttention preemption, fp8 KV cache, chunked prefill, speculative decoding, and multi-LoRA. Eight self-contained demos, each with a live dashboard and honest tradeoffs.
vLLM · CUDA · Python · GPU
VibeThinker-3B-W4A16
Quantized a 3B reasoning model from 5.8 GB BF16 down to 2.0 GB W4A16 (GPTQ via llmcompressor) so it fits and serves on a 6 GB GPU at ~67 tok/s in vLLM with reasoning intact. Published to the Hugging Face Hub with a model card.
llmcompressor · GPTQ · vLLM · HF Hub