AI Study

Step 3: Multi-Document RAG Chatbot

jimmmy_jin 2025. 6. 4. 17:42

🧠 Building a Multi-Document RAG Chatbot with LangChain and OpenAI

 

In this post, I’ll walk through how I upgraded my document-based chatbot into a multi-document RAG (Retrieval-Augmented Generation) system using LangChain, FAISS, and OpenAI. This chatbot can now ingest multiple .pdf, .txt, and .md files, generate vector embeddings, and answer questions by retrieving and reasoning over the content.

 


 

🔍 What’s New?

 

Unlike the earlier version, which handled a single PDF file, this version:

 

  • Loads multiple files of various formats from a folder (/documents)
  • Automatically detects file types and parses them accordingly
  • Creates semantic vector embeddings using OpenAI Embeddings
  • Uses FAISS for fast similarity-based document retrieval
  • Supports natural language querying via gpt-3.5-turbo

 


 

🧰 Tech Stack

Tool             | Purpose
-----------------|-----------------------------------
LangChain        | Document loaders, chunking, chains
FAISS            | In-memory vector search engine
OpenAI GPT-3.5   | Answer generation
dotenv           | Securely manage API keys
.pdf/.txt/.md    | Input document formats

 


 

🗂️ Folder Structure

rag-pdf-chatbot/
│
├── documents/             # Input folder for .pdf, .txt, .md files
├── main.py                # Core script
├── .env                   # Stores OpenAI API key
├── About_Project.txt      # Sample input
└── Roadmap.md             # Sample input

 


 

📄 main.py — Full Source Code

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.document_loaders.markdown import UnstructuredMarkdownLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# 2. Load all documents in /documents folder
docs_folder = "documents"
loaders = []
for filename in os.listdir(docs_folder):
    path = os.path.join(docs_folder, filename)
    if filename.endswith(".pdf"):
        loaders.append(PyPDFLoader(path))
    elif filename.endswith(".txt"):
        loaders.append(TextLoader(path))
    elif filename.endswith(".md"):
        loaders.append(UnstructuredMarkdownLoader(path))

documents = []
for loader in loaders:
    documents.extend(loader.load())

# 3. Split content into small chunks
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)

# 4. Create embeddings and vector DB
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
db = FAISS.from_documents(docs, embedding)

# 5. Create retrieval-based QA chain
retriever = db.as_retriever()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai_api_key)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# 6. Ask a question
query = "Summarize this document in one sentence."
answer = qa_chain.invoke({"query": query})  # returns a dict with "query" and "result" keys

print(f"Q: {query}")
print(f"A: {answer}")

 


 

✅ Execution Output

Q: Summarize this document in one sentence.
A: {'query': 'Summarize this document in one sentence.',
    'result': 'Jeongjin Lee is a passionate full-stack developer with 3+ years of experience, specializing in AI API development and ML engineering, showcasing expertise in various programming languages, frameworks, and tools, with notable projects like an AI-powered restaurant discovery service and a part-time job search platform.'}
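The raw dictionary in the output above is expected: `qa_chain.invoke` returns a mapping with `query` and `result` keys. A small helper can clean up the display (a sketch; the demo dict below just mirrors the shape of the real return value):

```python
def format_answer(response: dict) -> str:
    """Extract just the answer text from a RetrievalQA response dict."""
    return response["result"]

# Demo dict standing in for a real qa_chain.invoke(...) return value.
demo = {
    "query": "Summarize this document in one sentence.",
    "result": "Jeongjin Lee is a passionate full-stack developer.",
}
print(f"A: {format_answer(demo)}")
```

With this, `print(f"A: {format_answer(answer)}")` in main.py would print only the answer sentence instead of the whole dict.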

 


 

💬 Can Prompts Be Customized?

 

Yes — and they should be. Prompt tuning is essential for aligning the model’s output to your goals. You can use prompts like:

 

  • “Extract only project names and their purpose.”
  • “Create a markdown table listing key technologies and where they were used.”
  • “Simulate an HR interviewer and ask 3 critical questions based on the resume.”

 

Experimenting with prompts like these lets you see how different prompt strategies (short vs. long, specific vs. general) change the chatbot's behavior.

 


 

📌 What I Learned

 

  • Creating a robust document QA system is straightforward using LangChain’s modular components.
  • Multi-format document parsing requires careful loader configuration.
  • FAISS is lightweight yet effective for local vector search.
  • Prompt design significantly affects LLM outputs.
  • Always stay updated with breaking changes in the LangChain ecosystem (langchain_community, langchain_openai).

 


 

🛠️ Coming Next

 

I plan to:

 

  • Add a web interface using Streamlit or Next.js
  • Allow file uploads and real-time chat
  • Extend to RAG + Agent pattern (e.g. with LangGraph)
  • Integrate Gemini or Claude for model diversity

 


 

📝 Summary

 

In this project, I built a RAG chatbot system that can read multiple documents at once and summarize them or answer questions about them. Drop .pdf, .txt, and .md files into a single folder and they are automatically loaded and embedded, and OpenAI GPT-3.5 answers questions in natural language.

 

Key tech stack:

 

  • LangChain (document processing, chain composition)
  • FAISS (local vector search)
  • OpenAI (embeddings and question answering)
  • API key management via .env environment variables

 

Where the previous version handled only a single PDF, this one supports multiple document formats and extends to multi-document retrieval and summarization. Prompts can extract information in a variety of ways, and I plan to add a web interface and user upload features next.