RAG Agent using Docling and Weaviate

A chat-style, LLM-powered question-answering system that uses RAG (Retrieval-Augmented Generation) to provide accurate answers from PDF documents. The system leverages Docling to parse and intelligently chunk the PDFs, Weaviate as a vector database to store the vectorized chunks, and OpenAI for embeddings and text generation.

Let's build a PDF RAG Agent with:

  • PDF Document Processing: Efficiently parses and chunks PDF documents for analysis.
  • Vector Storage with Weaviate: Stores and manages vectorized document chunks.
  • Docling for Advanced Parsing: Utilizes Docling for intelligent PDF parsing and hybrid chunking.
  • OpenAI Integration: Leverages OpenAI for creating embeddings and generating text.
  • RAG Pattern for Q&A: Implements Retrieval-Augmented Generation for accurate question answering.

The Steps

api-process-pdfs.step.ts
api-query-rag.step.ts
init-weaviate.step.ts
load-weaviate.step.ts
process-pdfs.step.py
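
To give a feel for how these files plug into Motia, here is a minimal, hypothetical sketch of an API step in the shape of api-process-pdfs.step.ts. The step name, event topic, and flow name below are illustrative assumptions; consult the actual step files for the real configuration.

// Illustrative sketch only; not the project's actual api-process-pdfs.step.ts.
// An API step exposes an HTTP endpoint and emits an event for downstream steps.
export const config = {
  type: 'api',
  name: 'ApiProcessPdfs',          // assumed step name
  path: '/api/rag/process-pdfs',   // matches the endpoint documented below
  method: 'POST',
  emits: ['rag.process-pdfs'],     // assumed event topic consumed by the processing step
  flows: ['rag-docling-weaviate'], // assumed flow name
}

export const handler = async (req: any, { emit }: any) => {
  const { folderPath } = req.body

  // Hand the work off to the event-driven part of the workflow.
  await emit({ topic: 'rag.process-pdfs', data: { folderPath } })

  return {
    status: 200,
    body: { message: 'PDF processing workflow started', folderPath },
  }
}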

📋 Prerequisites

  • Node.js v18 or later
  • npm or pnpm
  • API keys for OpenAI and Weaviate (see the .env setup below)

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/MotiaDev/motia-examples
    cd motia-examples/examples/rag-docling-weaviate-agent
  2. Install dependencies:

    pnpm install
    # or
    npm install
  3. Configure environment variables:

    cp .env.example .env

    Update .env with your API keys:

    # Required
    OPENAI_API_KEY=your-openai-api-key-here
    WEAVIATE_API_KEY=your-weaviate-api-key-here
    WEAVIATE_URL=your-weaviate-url-here
    
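If you want the app to fail fast when a key is missing, a small startup check along these lines can help (a sketch, not part of the example project):

// Fail fast if a required environment variable is missing.
const required = ['OPENAI_API_KEY', 'WEAVIATE_API_KEY', 'WEAVIATE_URL']
const missing = required.filter((name) => !process.env[name])

if (missing.length > 0) {
  throw new Error(`Missing environment variables: ${missing.join(', ')}`)
}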

🏗️ Architecture

[Architecture diagram: RAG Docling Weaviate Agent]
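
At a high level, answering a question means embedding the query with OpenAI, retrieving the most similar chunks from Weaviate, and asking OpenAI to generate an answer from that context. The sketch below illustrates this retrieve-then-generate path with the weaviate-client and openai SDKs; the collection name, property names, and model choices are illustrative assumptions, not values taken from the project.

// rag-query-sketch.ts: a minimal sketch of the retrieve-then-generate path.
// Assumes a 'DocumentChunk' collection with text/title/source/page properties.
import weaviate from 'weaviate-client'
import OpenAI from 'openai'

type Chunk = { text: string; title: string; source: string; page: number }

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

export async function answerQuestion(query: string, limit = 5) {
  const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL!, {
    authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY!),
  })

  try {
    // 1. Embed the question with OpenAI (model name is an assumption).
    const embedded = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query,
    })
    const vector = embedded.data[0].embedding

    // 2. Retrieve the most similar chunks from Weaviate.
    const collection = client.collections.get<Chunk>('DocumentChunk')
    const result = await collection.query.nearVector(vector, { limit })
    const chunks = result.objects.map((obj) => obj.properties)

    // 3. Generate an answer grounded in the retrieved chunks.
    const context = chunks.map((chunk) => chunk.text).join('\n---\n')
    const completion = await openai.chat.completions.create({
      model: 'gpt-4o-mini', // assumed model
      messages: [
        { role: 'system', content: 'Answer using only the provided context.' },
        { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
      ],
    })

    return { query, answer: completion.choices[0].message.content, chunks }
  } finally {
    await client.close()
  }
}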

🏗️ Technologies

  • TypeScript
  • Python
  • Docling
  • Weaviate
  • OpenAI

🚦 API Endpoints

Process PDFs

POST /api/rag/process-pdfs
Content-Type: application/json

{
  "folderPath": "path/to/pdf/folder"
}

Response:

{
  "message": "PDF processing workflow started",
  "folderPath": "path/to/pdf/folder"
}
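
For example, triggering the same endpoint from Node.js 18+ (fetch is built in; run as an ES module for top-level await):

// Kick off PDF processing for a folder of documents.
const response = await fetch('http://localhost:3000/api/rag/process-pdfs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ folderPath: 'path/to/pdf/folder' }),
})

const body = await response.json()
console.log(body.message) // "PDF processing workflow started"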

Query RAG System

POST /api/rag/query
Content-Type: application/json

{
  "query": "Your question about the PDF content",
  "limit": 5  // Optional, defaults to 5
}

Response:

{
  "query": "Your question about the PDF content",
  "answer": "Generated answer based on the PDF content",
  "chunks": [
    {
      "text": "Relevant text chunk from the document",
      "title": "Document title",
      "metadata": {
        "source": "Document source",
        "page": 1
      }
    }
    // ... additional chunks up to the specified limit
  ]
}

Error Response:

{
  "error": "Failed to process RAG query",
  "message": "Error details"
}
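
A corresponding client-side call, assuming errors are returned with a non-2xx status code:

// Ask a question and print the answer plus the supporting chunks.
const response = await fetch('http://localhost:3000/api/rag/query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'Your question about the PDF content', limit: 5 }),
})

if (!response.ok) {
  const { error, message } = await response.json()
  throw new Error(`${error}: ${message}`)
}

const { answer, chunks } = await response.json()
console.log(answer)
for (const chunk of chunks) {
  console.log(`- ${chunk.title} (page ${chunk.metadata.page})`)
}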

🏃‍♂️ Running the Application

  1. Start the development server:

    pnpm dev
  2. Access the Motia Workbench:

    http://localhost:3000
  3. Make test requests:

    # Process PDFs
    curl --request POST \
    --url http://localhost:3000/api/rag/process-pdfs \
    --header 'Content-Type: application/json' \
    --data '{
       "folderPath": "path/to/pdf/folder"
    }'
     
    # Query the RAG system
    curl --request POST \
    --url http://localhost:3000/api/rag/query \
    --header 'Content-Type: application/json' \
    --data '{
       "query": "Your question about the PDF content",
       "limit": 5
    }'

🙏 Acknowledgments

Need help? See our Community Resources for questions, examples, and discussions.
