L3.0: 500M-1B LLM training framework

Features:
- GQA (Grouped-Query Attention): 24 query heads / 8 KV heads
- QK-Norm + Z-loss regularization
- KV-cache inference (10-50x speedup over uncached decoding)
- FSDP2 multi-GPU distributed training
- SentencePiece Unigram tokenizer
- Data packing (100% token utilization, no padding)
- LoRA fine-tuning + DPO alignment
- Unified web UI (pretrain/SFT/DPO)
- AutoTuner: adaptive gradient clipping, learning rate, and batch size
- A800-ready config for 500M parameters (~27 h of training, ¥162)
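
To illustrate the data packing feature listed above, here is a minimal sketch of one common approach: concatenate tokenized documents into a single stream, separated by an end-of-sequence token, and cut the stream into fixed-length blocks so that no padding is needed. The function name `pack_sequences`, its parameters, and the choice to drop the incomplete tail are assumptions made for illustration, not necessarily how this framework implements packing.

```python
from typing import Iterable, List


def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized docs (EOS-separated) and cut into full seq_len blocks.

    Hypothetical helper for illustration only.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // seq_len
    # Drop the incomplete tail so every emitted block is completely filled
    # (an assumption; a real pipeline might carry it over to the next shard).
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(docs, seq_len=4, eos_id=0)
print(blocks)
```

Every block holds exactly `seq_len` real tokens, which is what "100% token utilization" refers to. A document may span two blocks, so a real training loop would typically also need an attention mask or position reset at each boundary.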

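The 24Q/8KV GQA layout above means that each key/value head is shared by a group of 24 / 8 = 3 query heads. The sketch below shows the head-index mapping under the assumption that groups are contiguous (query heads 0-2 use KV head 0, and so on), which is a common convention. The names `N_Q_HEADS`, `N_KV_HEADS`, and `kv_head_for` are hypothetical and not taken from this framework.

```python
N_Q_HEADS = 24   # query heads, from the feature list
N_KV_HEADS = 8   # key/value heads, from the feature list


def kv_head_for(q_head: int, n_q: int = N_Q_HEADS, n_kv: int = N_KV_HEADS) -> int:
    """Return the KV head index that a given query head attends with.

    Assumes contiguous grouping; hypothetical helper for illustration.
    """
    if n_q % n_kv != 0:
        raise ValueError("number of query heads must be divisible by number of KV heads")
    if not 0 <= q_head < n_q:
        raise ValueError("query head index out of range")
    group_size = n_q // n_kv  # 3 query heads per KV head
    return q_head // group_size


mapping = [kv_head_for(q) for q in range(N_Q_HEADS)]
print(mapping)
```

With the same head dimension, storing 8 KV heads instead of 24 reduces the KV-cache size to one third of what full multi-head attention would need.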