building llms
llm-from-scratch
A GPT-2-class language model built from first principles in 8 stages, covering the tokenizer, embeddings, attention, transformer blocks, pretraining, and sampling (greedy / temperature / top-k / top-p). Tested with pytest, with guided CodeTours that walk through every line.
Python · PyTorch · Transformers
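For readers unfamiliar with the four sampling modes named in the llm-from-scratch blurb, here is a minimal sketch of how they typically fit together. It is not the repository's code: it uses only the standard library rather than PyTorch, and the function names (`softmax`, `filter_top_k`, `filter_top_p`, `sample_next_token`) are hypothetical, chosen for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, kept):
    """Zero out every index not in `kept`, then rescale so the rest sum to 1."""
    masked = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    return [p / total for p in masked]


def filter_top_k(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index. temperature == 0 is treated as greedy decoding (an assumption)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = filter_top_k(probs, top_k)
    if top_p is not None:
        probs = filter_top_p(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95)` combines all three knobs; passing `rng=random.Random(0)` makes a run reproducible. Applying top-k before top-p is one common ordering, not the only one.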
llm serving & inference
vllm-lab
Measured vLLM's core engine mechanisms one at a time on a 6 GB laptop GPU — continuous batching (39 → 2,405 tok/s, 62×), prefix caching, PagedAttention preemption, fp8 KV cache, chunked prefill, speculative decoding, and multi-LoRA. Eight self-contained demos, each with a live dashboard and honest tradeoffs.
vLLM · CUDA · Python · GPU
VibeThinker-3B-W4A16
Quantized a 3B reasoning model from 5.8 GB BF16 down to 2.0 GB W4A16 (GPTQ via llmcompressor) so it fits and serves on a 6 GB GPU at ~67 tok/s in vLLM with reasoning intact. Published to the Hugging Face Hub with a model card.
llmcompressor · GPTQ · vLLM · HF Hub
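The 5.8 GB → 2.0 GB figures for VibeThinker-3B-W4A16 can be sanity-checked with a little arithmetic. The sketch below assumes 1 GB = 10^9 bytes and that both checkpoint sizes are essentially all weights; the helper names are hypothetical. Under those assumptions the quantized checkpoint averages more than 4 bits per weight, which is what one would expect if quantization scales are stored alongside the weights and some layers stay in 16-bit, though the blurb does not say which layers were left unquantized.

```python
def implied_param_count(size_gb, bits_per_weight):
    """Parameters implied by a checkpoint size, assuming it is all weights."""
    return size_gb * 1e9 * 8 / bits_per_weight


def effective_bits_per_weight(size_gb, n_params):
    """Average storage cost per parameter for a checkpoint of the given size."""
    return size_gb * 1e9 * 8 / n_params


bf16_gb = 5.8   # BF16 checkpoint size, from the text
w4a16_gb = 2.0  # W4A16 checkpoint size, from the text

n_params = implied_param_count(bf16_gb, bits_per_weight=16)
compression = bf16_gb / w4a16_gb
avg_bits = effective_bits_per_weight(w4a16_gb, n_params)

print(f"implied parameters: {n_params / 1e9:.1f}B")
print(f"compression ratio:  {compression:.1f}x")
print(f"average bits/weight after quantization: {avg_bits:.1f}")
```

The compression ratio is about 2.9× rather than the naive 4× that 16-bit to 4-bit would suggest, which is the overhead this estimate makes visible.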