Problem
Domain experts at the org couldn't answer questions across hundreds of internal docs without re-reading them. Generic LLMs hallucinated and lacked citations, making answers untrustworthy.
Approach
- Ingested 100+ documents and chunked them into ~512-token segments with overlap so context survives chunk boundaries (chunking sketch after this list).
- Generated dense embeddings with sentence-transformers and indexed them in FAISS for sub-second cosine-similarity search (indexing sketch below).
- Built a LangChain agent that decomposes a user query into sub-questions, retrieves the top-k chunks per sub-question, and synthesizes a grounded answer with citations (control-flow sketch below).
- Fine-tuned a Hugging Face transformer on internal QA pairs with LoRA adapters (4-bit quantized base model; only the adapter weights are trained) to keep training under 8 GB VRAM on a single GPU (QLoRA sketch below).
- Wrapped the pipeline with tool-calling so the LLM invokes functions (e.g., date lookup, table query) instead of guessing structured data (dispatch sketch below).
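
A minimal sketch of the chunking step, assuming token counts come from a Hugging Face tokenizer; the function name, the 64-token overlap, and the tokenizer model are illustrative, not the exact values used.

```python
from transformers import AutoTokenizer

def chunk_text(text: str, tokenizer, chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_tokens-sized segments that overlap to preserve context."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_tokens]
        chunks.append(tokenizer.decode(window))
        if start + chunk_tokens >= len(ids):
            break  # last window already covers the tail; avoid a duplicate stub chunk
    return chunks

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
```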
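
A sketch of the embed-and-index step. FAISS has no cosine index as such, so the standard trick is to L2-normalize embeddings and search an inner-product index, which is what this does; the model name and toy corpus are assumptions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence-transformers model works
chunks = ["Rotate API keys quarterly via the vault CLI.",
          "Staging deploys run nightly at 02:00 UTC."]  # toy corpus; real chunks come from ingestion

# Normalizing embeddings makes inner product equal cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# Query: embed, normalize, take the top-k nearest chunks.
query_vec = model.encode(["when do staging deploys run?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
top_chunks = [chunks[i] for i in ids[0]]
```

IndexFlatIP is exact search; at a few hundred documents there is no need for an approximate index.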
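
LangChain's agent APIs shift between versions, so rather than pin one, here is a framework-agnostic sketch of the control flow the agent implements; `llm` and `retrieve` are hypothetical stand-ins for the model call and the FAISS search above.

```python
def answer(query: str, llm, retrieve, k: int = 5) -> str:
    """Decompose -> retrieve per sub-question -> synthesize with citations."""
    # 1. Ask the LLM to break the query into independent sub-questions.
    sub_questions = llm(f"Split into sub-questions, one per line:\n{query}").splitlines()

    # 2. Retrieve top-k chunks per sub-question, keeping chunk ids for citation.
    evidence = []
    for sq in sub_questions:
        for chunk_id, text in retrieve(sq, k=k):  # retrieve() wraps the FAISS search
            evidence.append(f"[{chunk_id}] {text}")

    # 3. Synthesize a grounded answer that must cite the bracketed chunk ids.
    context = "\n".join(evidence)
    return llm(
        "Answer using ONLY the context below. Cite chunk ids like [3] after each claim.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```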
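
A sketch of the low-memory training setup using the usual transformers + peft + bitsandbytes combination; the base model, rank, and target modules are assumptions rather than the exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (NF4) so its weights stay quantized during training.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed base model
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable LoRA adapters; only these weights receive gradients.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed; depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```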
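
The provider-specific tool schema is omitted here; this is a minimal, runnable sketch of the dispatch side, with toy data and a hypothetical `date_lookup` tool.

```python
import json

DOC_DATES = {"runbook-42": "2024-11-03"}  # toy metadata for illustration

def date_lookup(doc_id: str) -> str:
    """Return an exact date from metadata instead of letting the model guess one."""
    return DOC_DATES.get(doc_id, "unknown")

TOOLS = {"date_lookup": date_lookup}

def dispatch(tool_call_json: str) -> str:
    """Execute a structured tool call emitted by the model and return the result."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# The model emits structured calls; we execute them and feed results back.
print(dispatch('{"name": "date_lookup", "arguments": {"doc_id": "runbook-42"}}'))
# -> 2024-11-03
```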
Outcomes
- Sub-second semantic retrieval over 100+ docs.
- ~60% reduction in training memory footprint from the 4-bit quantization + LoRA setup.
- Every answer cites its source chunks, making claims verifiable against the corpus and curbing hallucination on in-corpus questions.
Learnings
LoRA + 4-bit quantization is the right starting point when GPU memory is the bottleneck. Tool-calling beats prompting for anything structured (dates, IDs, calculations). FAISS is overkill for fewer than 10k chunks but cheap to start with.