karan.dev

AI RAG Assistant with PyTorch and LangChain

Mar 2026 — Apr 2026

Retrieval-Augmented Generation chatbot with sub-second semantic search and citation-backed answers.

Python · PyTorch · LangChain · Hugging Face · FAISS · OpenAI API

Problem

Domain experts at the org couldn't answer questions across hundreds of internal docs without re-reading them. Generic LLMs hallucinated and lacked citations, making answers untrustworthy.

Approach

  1. Ingested 100+ documents and chunked them into ~512-token segments with overlap to preserve context across chunk boundaries.
  2. Generated dense embeddings using sentence-transformers and indexed them in FAISS for sub-second cosine-similarity search.
  3. Built a LangChain agent that decomposes a user query into sub-questions, retrieves top-k chunks per sub-question, and synthesizes a grounded answer with citations.
  4. Fine-tuned a HF transformer on internal QA pairs with LoRA adapters — 4-bit quantized base + adapter weights only — to keep training under 8 GB VRAM on a single GPU.
  5. Wrapped the pipeline with tool-calling so the LLM can invoke functions (e.g., date lookup, table query) instead of guessing structured data.
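Steps 1–3 can be sketched end to end. This is a minimal illustration, not the production code: a toy bag-of-words embedder stands in for the sentence-transformers model, and the brute-force inner-product search mirrors what FAISS's `IndexFlatIP` computes over normalized vectors. Chunk size, overlap, and the sample documents are illustrative.

```python
import numpy as np

def chunk(tokens, size=512, overlap=64):
    """Step 1: split a token list into overlapping segments."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def embed(texts, vocab):
    """Toy bag-of-words embedder standing in for sentence-transformers."""
    mat = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for word in text.split():
            if word in vocab:
                mat[i, vocab[word]] += 1.0
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.maximum(norms, 1e-9)  # unit vectors: inner product == cosine

def top_k(query_vec, index, k=2):
    """Brute-force inner-product search; FAISS IndexFlatIP does the same at scale."""
    scores = index @ query_vec
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Demo on three toy "chunks":
docs = ["the invoice schema uses ISO dates",
        "vacation policy allows ten days",
        "invoice totals are stored in cents"]
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}
index = embed(docs, vocab)
hits = top_k(embed(["invoice dates"], vocab)[0], index)  # top hit: the invoice/dates doc
```

The retrieved chunk indices (and scores) are what the LangChain agent would stuff into the prompt per sub-question, keeping each citation traceable back to a concrete chunk.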
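Step 4's memory-saving setup can be sketched with `transformers` and `peft`. The base model id and the LoRA hyperparameters (`r`, `lora_alpha`, target modules) below are illustrative placeholders, not the values used in this project:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base in 4-bit NF4 so only adapter weights live in full precision.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder; the base model isn't named above
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable LoRA adapters to the attention projections.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters are a small fraction of total params
```

Because gradients and optimizer state exist only for the adapter weights, this is what keeps fine-tuning within a single-GPU 8 GB budget.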
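Step 5's tool-calling reduces to a registry-and-dispatch pattern. In this sketch the tool names (`lookup_date`, `table_query`) and the toy table are hypothetical, and the JSON call is a literal string; in the real pipeline it would come from the model's OpenAI-style function-calling response:

```python
import json
from datetime import date

# Hypothetical tool registry; the real pipeline exposes similar functions
# to the LLM via function-calling schemas so it never guesses structured data.
TOOLS = {
    "lookup_date": lambda: date.today().isoformat(),
    "table_query": lambda table, key: {"headcount": {"eng": 42}}.get(table, {}).get(key),
}

def dispatch(tool_call_json):
    """Execute a tool call emitted by the model and return the result."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call.get("arguments", {}))

result = dispatch('{"name": "table_query", "arguments": {"table": "headcount", "key": "eng"}}')
```

The tool result is fed back to the model as a message, so the final answer quotes real values instead of plausible-looking inventions.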

Outcomes

  • Sub-second semantic retrieval over 100+ docs.
  • ~60% reduction in training memory footprint via LoRA.
  • Every answer cites the source chunks it drew from, eliminating hallucinated answers for questions covered by the corpus.

Learnings

LoRA + 4-bit quantization is the right starting point when GPU memory is the bottleneck. Tool-calling beats prompting for anything structured (dates, IDs, calculations). FAISS is overkill below ~10k chunks, where brute-force NumPy search is already fast enough, but it's cheap to start with.