AI Study

Step 2: Expanding RAG – Multi-Document Chatbot with PDF Support

jimmmy_jin 2025. 6. 4. 16:52


 

In the previous post, I built a simple RAG chatbot using LangChain, FAISS, and OpenAI to read my README.txt file. This time, I wanted to take it a step further:

 

What if my chatbot could read multiple documents, like PDFs, Markdown files, and more?

 

That’s exactly what I’ve implemented in Step 2 of my GenAI learning journey.

 

💡 What’s New in This Step?

 

  • ✅ Support for multiple file types (PDF, TXT, MD)
  • ✅ Batch loading of documents from a folder
  • ✅ Unified vector index for all content
  • ✅ Real-time QA across documents using LangChain + OpenAI

 

This simulates a real-world use case: feeding internal documents, team wikis, or multiple reports into a searchable, intelligent chatbot.

 

🔧 Tools & Libraries

 

  • langchain-community – for document loaders and vector store
  • langchain-openai – for GPT and OpenAI embeddings
  • FAISS – fast vector similarity search
  • python-dotenv – to load the OpenAI API key from a .env file

 

Install them with:

pip install langchain langchain-community langchain-openai faiss-cpu python-dotenv tiktoken pypdf

📁 Folder Structure

rag-pdf-chatbot/
├── docs/
│   └── Resume.pdf
├── .env
├── main.py
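The .env file holds just the API key (the value below is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```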

📄 Final Working Code (with Latest LangChain Standards)

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load API Key
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# 2. Load Document
loader = PyPDFLoader("docs/Resume.pdf")
documents = loader.load()

# 3. Split into chunks
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)

# 4. Embed and store in FAISS
embedding = OpenAIEmbeddings(api_key=openai_api_key)
db = FAISS.from_documents(docs, embedding)

# 5. Setup Retrieval QA Chain
retriever = db.as_retriever()
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, api_key=openai_api_key)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# 6. Ask question
query = "Summarize this document in one sentence."
answer = qa_chain.invoke({"query": query})

print(f"Q: {query}")
print(f"A: {answer['result']}")
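The chunk_size=500 / chunk_overlap=50 split can be pictured with a plain-Python sliding window. This is a simplification I wrote for illustration; the real CharacterTextSplitter first splits on a separator such as "\n\n" before windowing.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Naive character window: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share 50 chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = split_text(text)
# 1200 chars with step 450 -> chunks start at offsets 0, 450, 900
```

The overlap means the tail of one chunk repeats at the head of the next, which helps the retriever keep context that straddles a chunk boundary.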

💬 Prompt Design and Output

 

One important thing I learned during this experiment is that LLMs are sensitive to prompt complexity. When using LangChain’s RetrievalQA, very complex prompts (like “organize skills into a table”) may fail silently if the model can’t generate a structured response.

 

So I started with a simple prompt:

"Summarize this document in one sentence."

The chatbot returned:

 

Jeongjin Lee is a passionate full-stack developer with 3+ years of experience, specializing in AI API development and ML engineering, showcasing expertise in various programming languages, frameworks, and tools, with notable projects like an AI-powered restaurant discovery service and a part-time job search platform.

 

🧠 What I Learned

 

  • Prompt quality and complexity impact LangChain response behavior.
  • Use invoke() instead of run() for RetrievalQA.
  • Use simple prompts first, then refine with structure (e.g., bullet points or JSON).
  • LangChain 0.2+ encourages use of langchain_openai and langchain_community packages separately.
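The last two points combine naturally: keep the chain call the same and only grow the prompt. The strings below are illustrative examples of mine, not from the original run:

```python
# Step 1: a simple prompt that reliably succeeds.
simple = "Summarize this document in one sentence."

# Step 2: once that works, layer structure into the question itself.
structured = (
    "List the candidate's skills as bullet points, "
    "grouped under 'Languages', 'Frameworks', and 'Tools'."
)

# Either string is passed the same way: qa_chain.invoke({"query": ...})
```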

 

📌 Next Steps

 

  • Try metadata-aware prompts (“Only show AI-related projects”)
  • Display answers in tables or markdown for blog-ready formatting
  • Add frontend via FastAPI or Next.js to make it interactive

 


🇰🇷 Summary

 

In this post, I built a chatbot that reads and summarizes PDF documents following the latest LangChain structure. I also confirmed that starting with simple, clear questions rather than complex prompts is more effective for getting stable responses.

 

Next, I plan to extend it with table-formatted output, a frontend integration, and metadata-based document exploration.