RAG Agent using Docling and Weaviate

A chat-style, LLM-powered question-answering system that uses RAG (Retrieval-Augmented Generation) to provide accurate answers from PDF documents. The system leverages Docling to parse and intelligently chunk the PDFs, Weaviate as a vector database to store the vectorized chunks, and OpenAI for embeddings and text generation.

Let's build a PDF RAG Agent with:

  • PDF Document Processing: Efficiently parses and chunks PDF documents for analysis.
  • Vector Storage with Weaviate: Stores and manages vectorized document chunks.
  • Docling for Advanced Parsing: Utilizes Docling for intelligent PDF parsing and hybrid chunking.
  • OpenAI Integration: Leverages OpenAI for creating embeddings and generating text.
  • RAG Pattern for Q&A: Implements Retrieval-Augmented Generation for accurate question answering.

The Steps

api-process-pdfs.step.ts
api-query-rag.step.ts
init-weaviate.step.ts
load-weaviate.step.ts
process-pdfs.step.py
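
To give a feel for how these files plug into Motia, here is a minimal, hypothetical sketch of an API step in the shape of api-process-pdfs.step.ts. The step name, event topic, and flow name below are illustrative assumptions; consult the actual step files for the real configuration.

// Illustrative sketch only; not the project's actual api-process-pdfs.step.ts.
// An API step exposes an HTTP endpoint and emits an event for downstream steps.
export const config = {
  type: 'api',
  name: 'ApiProcessPdfs',          // assumed step name
  path: '/api/rag/process-pdfs',   // matches the endpoint documented below
  method: 'POST',
  emits: ['rag.process-pdfs'],     // assumed event topic consumed by the processing step
  flows: ['rag-docling-weaviate'], // assumed flow name
}

export const handler = async (req: any, { emit }: any) => {
  const { folderPath } = req.body

  // Hand the work off to the event-driven part of the workflow.
  await emit({ topic: 'rag.process-pdfs', data: { folderPath } })

  return {
    status: 200,
    body: { message: 'PDF processing workflow started', folderPath },
  }
}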

📋 Prerequisites

  • Node.js v18 or later
  • npm or pnpm
  • API keys for OpenAI and Weaviate (see the .env setup below)

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/MotiaDev/motia-examples
    cd motia-examples/examples/rag-docling-weaviate-agent
  2. Install dependencies:

    pnpm install
    # or
    npm install
  3. Configure environment variables:

    cp .env.example .env

    Update .env with your API keys:

    # Required
    OPENAI_API_KEY=your-openai-api-key-here
    WEAVIATE_API_KEY=your-weaviate-api-key-here
    WEAVIATE_URL=your-weaviate-url-here
    
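If you want the app to fail fast when a key is missing, a small startup check along these lines can help (a sketch, not part of the example project):

// Fail fast if a required environment variable is missing.
const required = ['OPENAI_API_KEY', 'WEAVIATE_API_KEY', 'WEAVIATE_URL']
const missing = required.filter((name) => !process.env[name])

if (missing.length > 0) {
  throw new Error(`Missing environment variables: ${missing.join(', ')}`)
}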

🏗️ Architecture

[Architecture diagram: RAG Docling Weaviate Agent]
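
At a high level, answering a question means embedding the query with OpenAI, retrieving the most similar chunks from Weaviate, and asking OpenAI to generate an answer from that context. The sketch below illustrates this retrieve-then-generate path with the weaviate-client and openai SDKs; the collection name, property names, and model choices are illustrative assumptions, not values taken from the project.

// rag-query-sketch.ts: a minimal sketch of the retrieve-then-generate path.
// Assumes a 'DocumentChunk' collection with text/title/source/page properties.
import weaviate from 'weaviate-client'
import OpenAI from 'openai'

type Chunk = { text: string; title: string; source: string; page: number }

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

export async function answerQuestion(query: string, limit = 5) {
  const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL!, {
    authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY!),
  })

  try {
    // 1. Embed the question with OpenAI (model name is an assumption).
    const embedded = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query,
    })
    const vector = embedded.data[0].embedding

    // 2. Retrieve the most similar chunks from Weaviate.
    const collection = client.collections.get<Chunk>('DocumentChunk')
    const result = await collection.query.nearVector(vector, { limit })
    const chunks = result.objects.map((obj) => obj.properties)

    // 3. Generate an answer grounded in the retrieved chunks.
    const context = chunks.map((chunk) => chunk.text).join('\n---\n')
    const completion = await openai.chat.completions.create({
      model: 'gpt-4o-mini', // assumed model
      messages: [
        { role: 'system', content: 'Answer using only the provided context.' },
        { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
      ],
    })

    return { query, answer: completion.choices[0].message.content, chunks }
  } finally {
    await client.close()
  }
}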

🏗️ Technologies

  • TypeScript
  • Python
  • Docling
  • Weaviate
  • OpenAI

🚦 API Endpoints

Process PDFs

POST /api/rag/process-pdfs
Content-Type: application/json

{
  "folderPath": "path/to/pdf/folder"
}

Response:

{
  "message": "PDF processing workflow started",
  "folderPath": "path/to/pdf/folder"
}
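
For example, triggering the same endpoint from Node.js 18+ (fetch is built in; run as an ES module for top-level await):

// Kick off PDF processing for a folder of documents.
const response = await fetch('http://localhost:3000/api/rag/process-pdfs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ folderPath: 'path/to/pdf/folder' }),
})

const body = await response.json()
console.log(body.message) // "PDF processing workflow started"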

Query RAG System

POST /api/rag/query
Content-Type: application/json

{
  "query": "Your question about the PDF content",
  "limit": 5  // Optional, defaults to 5
}

Response:

{
  "query": "Your question about the PDF content",
  "answer": "Generated answer based on the PDF content",
  "chunks": [
    {
      "text": "Relevant text chunk from the document",
      "title": "Document title",
      "metadata": {
        "source": "Document source",
        "page": 1
      }
    }
    // ... additional chunks up to the specified limit
  ]
}

Error Response:

{
  "error": "Failed to process RAG query",
  "message": "Error details"
}
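
A corresponding client-side call, assuming errors are returned with a non-2xx status code:

// Ask a question and print the answer plus the supporting chunks.
const response = await fetch('http://localhost:3000/api/rag/query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'Your question about the PDF content', limit: 5 }),
})

if (!response.ok) {
  const { error, message } = await response.json()
  throw new Error(`${error}: ${message}`)
}

const { answer, chunks } = await response.json()
console.log(answer)
for (const chunk of chunks) {
  console.log(`- ${chunk.title} (page ${chunk.metadata.page})`)
}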

🏃‍♂️ Running the Application

  1. Start the development server:

    pnpm dev
  2. Access the Motia Workbench:

    http://localhost:3000
  3. Make test requests:

    # Process PDFs
    curl --request POST \
    --url http://localhost:3000/api/rag/process-pdfs \
    --header 'Content-Type: application/json' \
    --data '{
       "folderPath": "path/to/pdf/folder"
    }'
     
    # Query the RAG system
    curl --request POST \
    --url http://localhost:3000/api/rag/query \
    --header 'Content-Type: application/json' \
    --data '{
       "query": "Your question about the PDF content",
       "limit": 5
    }'

🙏 Acknowledgments

Need help? See our Community Resources for questions, examples, and discussions.
