AI Study

Step 2: Expanding RAG – Multi-Document Chatbot with PDF Support

jimmmy_jin 2025. 6. 4. 16:52


 

In the previous post, I built a simple RAG chatbot using LangChain, FAISS, and OpenAI to read my README.txt file. This time, I wanted to take it a step further:

 

What if my chatbot could read multiple documents, like PDFs, Markdown files, and more?

 

That’s exactly what I’ve implemented in Step 2 of my GenAI learning journey.

 

💡 What’s New in This Step?

 

  • ✅ Support for multiple file types (PDF, TXT, MD)
  • ✅ Batch loading of documents from a folder
  • ✅ Unified vector index for all content
  • ✅ Real-time QA across documents using LangChain + OpenAI

 

This simulates a real-world use case: feeding internal documents, team wikis, or multiple reports into a searchable, intelligent chatbot.

 

🔧 Tools & Libraries

 

  • langchain-community – for document loaders and vector store
  • langchain-openai – for GPT and OpenAI embeddings
  • FAISS – fast vector similarity search
  • python-dotenv – to load the OpenAI API key from a .env file

 

Install them with:

pip install langchain langchain-community langchain-openai faiss-cpu python-dotenv tiktoken pypdf

📁 Folder Structure

rag-pdf-chatbot/
├── docs/
│   └── Resume.pdf
├── .env
├── main.py
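The .env file holds just the API key (the value below is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```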

📄 Final Working Code (with Latest LangChain Standards)

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load API Key
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# 2. Load Document
loader = PyPDFLoader("docs/Resume.pdf")
documents = loader.load()

# 3. Split into chunks
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)

# 4. Embed and store in FAISS
embedding = OpenAIEmbeddings(api_key=openai_api_key)
db = FAISS.from_documents(docs, embedding)

# 5. Setup Retrieval QA Chain
retriever = db.as_retriever()
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, api_key=openai_api_key)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# 6. Ask question
query = "Summarize this document in one sentence."
answer = qa_chain.invoke({"query": query})

print(f"Q: {query}")
print(f"A: {answer['result']}")
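The chunk_size=500 / chunk_overlap=50 split can be pictured with a plain-Python sliding window. This is a simplification I wrote for illustration; the real CharacterTextSplitter first splits on a separator such as "\n\n" before windowing.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Naive character window: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share 50 chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = split_text(text)
# 1200 chars with step 450 -> chunks start at offsets 0, 450, 900
```

The overlap means the tail of one chunk repeats at the head of the next, which helps the retriever keep context that straddles a chunk boundary.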

💬 Prompt Design and Output

 

One important thing I learned during this experiment is that LLMs are sensitive to prompt complexity. When using LangChain’s RetrievalQA, very complex prompts (like “organize skills into a table”) may fail silently if the model can’t generate a structured response.

 

So I started with a simple prompt:

"Summarize this document in one sentence."

The chatbot returned:

 

Jeongjin Lee is a passionate full-stack developer with 3+ years of experience, specializing in AI API development and ML engineering, showcasing expertise in various programming languages, frameworks, and tools, with notable projects like an AI-powered restaurant discovery service and a part-time job search platform.

 

🧠 What I Learned

 

  • Prompt quality and complexity impact LangChain response behavior.
  • Use invoke() instead of run() for RetrievalQA.
  • Use simple prompts first, then refine with structure (e.g., bullet points or JSON).
  • LangChain 0.2+ encourages use of langchain_openai and langchain_community packages separately.
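The last two points combine naturally: keep the chain call the same and only grow the prompt. The strings below are illustrative examples of mine, not from the original run:

```python
# Step 1: a simple prompt that reliably succeeds.
simple = "Summarize this document in one sentence."

# Step 2: once that works, layer structure into the question itself.
structured = (
    "List the candidate's skills as bullet points, "
    "grouped under 'Languages', 'Frameworks', and 'Tools'."
)

# Either string is passed the same way: qa_chain.invoke({"query": ...})
```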

 

📌 Next Steps

 

  • Try metadata-aware prompts (“Only show AI-related projects”)
  • Display answers in tables or markdown for blog-ready formatting
  • Add frontend via FastAPI or Next.js to make it interactive

 


🇰🇷 Summary

 

In this post, I built a chatbot that reads and summarizes PDF documents following the latest LangChain structure. I also confirmed that starting with simple, clear questions rather than complex prompts is more effective for getting stable responses.

 

Next, I plan to extend it with table-formatted output, a frontend integration, and metadata-based document exploration.