syedazeez337 / projects.sh

— building llms

llm-from-scratch

A GPT-2 class language model built from first principles in 8 stages, including the tokenizer, embeddings, attention, transformer blocks, pretraining, and sampling (greedy, temperature, top-k, top-p). Covered by pytest tests, with guided CodeTours that walk through every line.

Python · PyTorch · Transformers

code →
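
The sampling strategies named in this project (greedy, temperature, top-k, top-p) can be illustrated with a small standalone sketch. This is not code from the repository: the function name `sample_next_token` and its parameters are hypothetical, and it uses only the Python standard library rather than PyTorch so it runs anywhere.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick a token id from raw logits (hypothetical helper, not the repo's API).

    temperature == 0 is treated as greedy decoding.
    """
    if temperature == 0:
        # Greedy: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate ids ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the remaining weights for us.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]
```

Setting `temperature=0` or `top_k=1` collapses to greedy decoding, while a higher temperature flattens the distribution before the top-k and top-p filters are applied.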

— llm serving & inference

vllm-lab

Measured vLLM's core engine mechanisms one at a time on a 6 GB laptop GPU — continuous batching (39 → 2,405 tok/s, 62×), prefix caching, PagedAttention preemption, fp8 KV cache, chunked prefill, speculative decoding, and multi-LoRA. Eight self-contained demos, each with a live dashboard and honest tradeoffs.

vLLM · CUDA · Python · GPU

code →
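
One common way to explain why continuous batching helps is a step-counting toy model. The sketch below is not taken from vllm-lab and does not reproduce its measured numbers. It assumes every decode step emits one token per running sequence, and the function names and the `max_batch` parameter are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), max_batch):
        # Finished sequences leave idle slots until the whole batch is done.
        steps += max(lengths[start:start + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        # Each running sequence emits one token; drop the ones that finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps
```

With output lengths `[8, 1, 1, 1]` and two slots, the static schedule takes 9 steps and the continuous one takes 8, because the short requests reuse the slot next to the long one instead of waiting for it.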

VibeThinker-3B-W4A16

Quantized a 3B reasoning model from 5.8 GB BF16 down to 2.0 GB W4A16 (GPTQ via llmcompressor) so it fits and serves on a 6 GB GPU at ~67 tok/s in vLLM with reasoning intact. Published to the Hugging Face Hub with a model card.

llmcompressor · GPTQ · vLLM · HF Hub

model →
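
The size reduction quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the parameter count implied by 5.8 GB at 16 bits per weight (about 2.9 billion) and decimal gigabytes. The helper name `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter count implied by the 5.8 GB BF16 figure (an assumption, not a
# number stated on the model card).
n_params = 5.8e9 / 2  # BF16 uses 2 bytes per weight

bf16_gb = weight_gigabytes(n_params, 16)  # about 5.8
int4_gb = weight_gigabytes(n_params, 4)   # about 1.45
```

The pure 4-bit estimate of about 1.45 GB is below the published 2.0 GB. A plausible explanation, stated here as an assumption, is that quantization scales and any layers kept at higher precision account for the difference.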