Problem
Domain experts at the org couldn't answer questions across hundreds of internal docs without re-reading them. Generic LLMs hallucinated and lacked citations, making answers untrustworthy.
Approach
- Ingested 100+ documents and chunked them into ~512-token segments with overlap so context survives chunk boundaries (chunking sketch after this list).
- Generated dense embeddings with sentence-transformers and indexed them in FAISS for sub-second cosine-similarity search (indexing sketch below).
- Built a LangChain agent that decomposes a user query into sub-questions, retrieves the top-k chunks per sub-question, and synthesizes a grounded answer with citations (control-flow sketch below).
- Fine-tuned a Hugging Face transformer on internal QA pairs with LoRA adapters (4-bit quantized base model; only the adapter weights are trained) to keep training under 8 GB VRAM on a single GPU (QLoRA sketch below).
- Wrapped the pipeline with tool-calling so the LLM invokes functions (e.g., date lookup, table query) instead of guessing structured data (dispatch sketch below).
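
A minimal sketch of the chunking step, assuming token counts come from a Hugging Face tokenizer; the function name, the 64-token overlap, and the tokenizer model are illustrative, not the exact values used.

```python
from transformers import AutoTokenizer

def chunk_text(text: str, tokenizer, chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_tokens-sized segments that overlap to preserve context."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_tokens]
        chunks.append(tokenizer.decode(window))
        if start + chunk_tokens >= len(ids):
            break  # last window already covers the tail; avoid a duplicate stub chunk
    return chunks

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
```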
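
A sketch of the embed-and-index step. FAISS has no cosine index as such, so the standard trick is to L2-normalize embeddings and search an inner-product index, which is what this does; the model name and toy corpus are assumptions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence-transformers model works
chunks = ["Rotate API keys quarterly via the vault CLI.",
          "Staging deploys run nightly at 02:00 UTC."]  # toy corpus; real chunks come from ingestion

# Normalizing embeddings makes inner product equal cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# Query: embed, normalize, take the top-k nearest chunks.
query_vec = model.encode(["when do staging deploys run?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
top_chunks = [chunks[i] for i in ids[0]]
```

IndexFlatIP is exact search; at a few hundred documents there is no need for an approximate index.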
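
LangChain's agent APIs shift between versions, so rather than pin one, here is a framework-agnostic sketch of the control flow the agent implements; `llm` and `retrieve` are hypothetical stand-ins for the model call and the FAISS search above.

```python
def answer(query: str, llm, retrieve, k: int = 5) -> str:
    """Decompose -> retrieve per sub-question -> synthesize with citations."""
    # 1. Ask the LLM to break the query into independent sub-questions.
    sub_questions = llm(f"Split into sub-questions, one per line:\n{query}").splitlines()

    # 2. Retrieve top-k chunks per sub-question, keeping chunk ids for citation.
    evidence = []
    for sq in sub_questions:
        for chunk_id, text in retrieve(sq, k=k):  # retrieve() wraps the FAISS search
            evidence.append(f"[{chunk_id}] {text}")

    # 3. Synthesize a grounded answer that must cite the bracketed chunk ids.
    context = "\n".join(evidence)
    return llm(
        "Answer using ONLY the context below. Cite chunk ids like [3] after each claim.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```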
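
A sketch of the low-memory training setup using the usual transformers + peft + bitsandbytes combination; the base model, rank, and target modules are assumptions rather than the exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (NF4) so its weights stay quantized during training.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed base model
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable LoRA adapters; only these weights receive gradients.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed; depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```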
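
The provider-specific tool schema is omitted here; this is a minimal, runnable sketch of the dispatch side, with toy data and a hypothetical `date_lookup` tool.

```python
import json

DOC_DATES = {"runbook-42": "2024-11-03"}  # toy metadata for illustration

def date_lookup(doc_id: str) -> str:
    """Return an exact date from metadata instead of letting the model guess one."""
    return DOC_DATES.get(doc_id, "unknown")

TOOLS = {"date_lookup": date_lookup}

def dispatch(tool_call_json: str) -> str:
    """Execute a structured tool call emitted by the model and return the result."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# The model emits structured calls; we execute them and feed results back.
print(dispatch('{"name": "date_lookup", "arguments": {"doc_id": "runbook-42"}}'))
# -> 2024-11-03
```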
Outcomes
- Sub-second semantic retrieval over 100+ docs.
- ~60% reduction in training memory footprint from the 4-bit quantization + LoRA setup.
- Every answer cites its source chunks, making claims verifiable against the corpus and curbing hallucination on in-corpus questions.
Learnings
LoRA + 4-bit quantization is the right starting point when GPU memory is the bottleneck. Tool-calling beats prompting for anything structured (dates, IDs, calculations). FAISS is overkill for fewer than 10k chunks but cheap to start with.