
Intelligent Document Processing: Building a RAG Agent with Motia, Docling & Weaviate

In the era of AI-powered applications, the ability to extract insights from documents is crucial. Whether you're building a knowledge base, a research assistant, or a customer support system, you need to transform static PDFs into queryable, intelligent systems. This is where Retrieval-Augmented Generation (RAG) architecture shines, and where the Motia framework provides an elegant solution.

This comprehensive guide explores how to build a production-ready RAG system that intelligently processes PDFs and answers questions about their content. We'll cover:

  1. The RAG Architecture: Understanding how document processing, vector storage, and AI generation work together.
  2. Motia's Event-Driven Approach: How steps create a scalable, maintainable RAG pipeline.
  3. Building the Workflow: A detailed walkthrough of our polyglot processing pipeline.
  4. Advanced Features: Real-time progress tracking, error handling, and production considerations.
  5. Hands-On Testing: How to ingest documents and query your knowledge base.

Let's transform your documents into an intelligent AI assistant.


The Power of Intelligent Document Processing

[Image: RAG workflow in the Motia Workbench]

At its core, our RAG agent solves a fundamental challenge: how do you make unstructured documents searchable and queryable by AI? Traditional approaches often involve complex, monolithic systems that are difficult to scale and maintain. Our Motia-powered solution breaks this down into discrete, event-driven steps that each handle a specific aspect of the pipeline.

The magic happens through the integration of three powerful technologies:

  • Docling: Advanced PDF parsing with intelligent chunking that preserves document structure
  • Weaviate: Cloud-native vector database with built-in OpenAI integration
  • Motia: Event-driven framework that orchestrates the entire pipeline

Instead of a brittle, tightly-coupled system, we get a resilient architecture where each component can be scaled, modified, or replaced independently.
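To make the decoupling concrete, here is a minimal sketch of topic-based event chaining. This is purely illustrative (it is not Motia's actual API, and the topic names are made up): each step subscribes to a topic and emits to the next, so any step can be swapped, scaled, or replaced without touching its neighbors.

```typescript
// Illustrative topic-based event bus (not Motia's actual API).
type Handler = (data: { file: string }) => void

const subscribers = new Map<string, Handler[]>()

function on(topic: string, handler: Handler) {
  subscribers.set(topic, [...(subscribers.get(topic) ?? []), handler])
}

function emit(topic: string, data: { file: string }) {
  for (const handler of subscribers.get(topic) ?? []) handler(data)
}

// Two hypothetical steps chained only by topic names.
const processed: string[] = []
on('docs.read', (data) => {
  processed.push(`chunked:${data.file}`)
  emit('docs.chunked', data)
})
on('docs.chunked', (data) => {
  processed.push(`embedded:${data.file}`)
})

emit('docs.read', { file: 'report.pdf' })
// processed is now ['chunked:report.pdf', 'embedded:report.pdf']
```

Because the "chunking" step knows nothing about the "embedding" step beyond a topic name, either one can be rewritten independently; this is the property the Motia pipeline relies on at scale.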


The Anatomy of Our RAG Pipeline

Our application consists of six specialized steps, each handling a specific part of the document processing and querying workflow. Let's explore the complete architecture.

  • api-process-pdfs.step.ts
  • api-query-rag.step.ts
  • init-weaviate.step.ts
  • read-pdfs.step.ts
  • process-pdfs.step.py
  • load-weaviate.step.ts

api-process-pdfs.step.ts is the entry point for document ingestion. This API endpoint receives a folder path, kicks off the processing pipeline, and returns immediately with a tracking ID for real-time progress monitoring.

import { Handlers } from 'motia'
import { z } from 'zod'
import { v4 as uuidv4 } from 'uuid'
 
export const config = {
  type: 'api',
  name: 'api-process-pdfs',
  description: 'API endpoint to start PDF processing pipeline',
  path: '/api/rag/process-pdfs',
  method: 'POST',
  emits: ['rag.read.pdfs'],
  bodySchema: z.object({
    folderPath: z.string().min(1, 'folderPath is required'),
  }),
  flows: ['rag-workflow'],
} as const
 
export const handler: Handlers['api-process-pdfs'] = async (req, { emit, logger }) => {
  const { folderPath } = req.body
  const streamId = uuidv4()
 
  logger.info('Starting PDF processing pipeline', { folderPath, streamId })
 
  // Emit event to start the processing chain
  await emit({
    topic: 'rag.read.pdfs',
    data: { folderPath, streamId },
  })
 
  return {
    status: 200,
    body: { 
      message: 'PDF processing started',
      streamId,
      status: 'processing'
    },
  }
}

Explore the Workbench

The Motia Workbench provides a visual representation of your RAG pipeline, making it easy to understand the flow and debug any issues.

[Image: RAG workflow in the Motia Workbench]

You can monitor real-time processing, view logs, and trace the execution of each step directly in the Workbench interface. This makes development and debugging significantly easier compared to traditional monolithic approaches.


Key Features & Benefits

🚀 Event-Driven Architecture

Each step is independent and communicates through events, making the system highly scalable and maintainable.

🧠 Intelligent Document Processing

Docling's hybrid chunking preserves document structure while creating optimal chunks for embedding.

⚡ Fast Vector Search

Weaviate's cloud-native architecture provides fast, scalable similarity search with built-in OpenAI integration.

🔄 Real-Time Progress Tracking

Monitor document processing progress with detailed logging and status updates.

🌐 Polyglot Support

Seamlessly combine Python (Docling) and TypeScript (orchestration) in a single workflow.

🛡️ Production-Ready

Built-in error handling, batch processing, and resource cleanup ensure reliability.


Trying It Out

Ready to build your own intelligent document assistant? Let's get the system running.

Install Dependencies

Install both Node.js and Python dependencies. The prepare script automatically sets up the Python virtual environment.

npm install

Set Your Environment Variables

You'll need API keys for OpenAI and Weaviate Cloud. Create a .env file:

OPENAI_API_KEY="sk-..."
WEAVIATE_URL="https://your-cluster.weaviate.network"
WEAVIATE_API_KEY="your-weaviate-api-key"

Run the Project

Start the Motia development server to begin processing documents.

npm run dev

Process Your First Documents

Add some PDF files to the docs/pdfs/ folder, then start the ingestion pipeline:

curl -X POST http://localhost:3000/api/rag/process-pdfs \
  -H "Content-Type: application/json" \
  -d '{"folderPath":"docs/pdfs"}'

Watch the logs as your documents are processed through the pipeline:

  1. PDF Reading: Files are discovered and queued
  2. Docling Processing: Intelligent chunking with structure preservation
  3. Weaviate Loading: Chunks are embedded and stored

Query Your Knowledge Base

Once processing is complete, you can ask questions about your documents:

General Query

curl -X POST http://localhost:3000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query":"What are the main topics covered in these documents?","limit":3}'

Specific Question

curl -X POST http://localhost:3000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query":"What methodology was used in the research?","limit":5}'

The response includes both a generated answer and the source chunks with page numbers for verification.
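A sketch of what such a response might look like, and how you could surface its sources as citations. The field names here (answer, chunks, source, page) are assumptions for illustration, not the project's exact response contract:

```typescript
// Hypothetical query-response shape: a generated answer plus the source
// chunks it was grounded in, each carrying its origin file and page number.
interface RagChunk {
  text: string
  source: string
  page: number
}

interface RagResponse {
  answer: string
  chunks: RagChunk[]
}

// Turn the source chunks into human-readable citations for verification.
function citations(res: RagResponse): string[] {
  return res.chunks.map((c) => `${c.source}, p. ${c.page}`)
}

const example: RagResponse = {
  answer: 'The study used a mixed-methods approach.',
  chunks: [{ text: '…', source: 'paper.pdf', page: 4 }],
}
// citations(example) → ['paper.pdf, p. 4']
```

Returning the chunks alongside the answer is what makes RAG auditable: a reader can check every claim against the cited page.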


Advanced Usage

Custom Chunking Strategies

Modify the Python processing step to implement custom chunking logic:

# In process-pdfs.step.py
chunker = HybridChunker(
    tokenizer="cl100k_base",
    max_tokens=1024,  # Larger chunks for more context
    overlap_tokens=100,  # More overlap for better continuity
    heading_hierarchies=True,
    split_by_page=True  # Preserve page boundaries
)
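To build intuition for what max_tokens and overlap_tokens control, here is a language-agnostic sliding-window sketch (written in TypeScript, splitting on words rather than real subword tokens, and ignoring Docling's heading-aware logic):

```typescript
// Sliding-window chunking: each chunk holds up to maxTokens "tokens" and
// shares `overlap` tokens with its predecessor for continuity.
// Here "tokens" are whitespace-separated words for illustration only;
// overlap must be smaller than maxTokens or the window never advances.
function chunkWords(text: string, maxTokens: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  const stride = maxTokens - overlap
  for (let start = 0; start < words.length; start += stride) {
    chunks.push(words.slice(start, start + maxTokens).join(' '))
    if (start + maxTokens >= words.length) break
  }
  return chunks
}

const sample = 'one two three four five six seven eight'
// chunkWords(sample, 4, 1) →
//   ['one two three four', 'four five six seven', 'seven eight']
```

Larger chunks give the LLM more context per retrieval but dilute the embedding; more overlap reduces the chance a relevant sentence is cut in half at a chunk boundary.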

Batch Processing Optimization

Adjust batch sizes in the Weaviate loading step for optimal performance:

// In load-weaviate.step.ts
const BATCH_SIZE = 50  // Smaller batches for large documents
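The batching itself is straightforward; this sketch (illustrative, not the project's exact loading code) shows why a constant like BATCH_SIZE is all you need to tune:

```typescript
// Split a list of chunks into fixed-size batches so one failed insert
// request doesn't lose the whole load, and memory stays bounded.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// toBatches([1, 2, 3, 4, 5], 2) → [[1, 2], [3, 4], [5]]
```

Each batch then becomes one insert request to Weaviate; smaller batches trade throughput for finer-grained error recovery.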

Multi-Collection Support

Extend the system to handle different document types by creating separate Weaviate collections:

const COLLECTIONS = {
  research: 'ResearchPapers',
  manuals: 'TechnicalManuals', 
  reports: 'BusinessReports'
}
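With collections defined, you also need a way to decide which collection a document belongs to. A routing helper might look like this (the filename-based classifier here is a naive placeholder; a real system might route on metadata or an ML classifier):

```typescript
// Map document types to their Weaviate collections, as above.
const COLLECTIONS: Record<string, string> = {
  research: 'ResearchPapers',
  manuals: 'TechnicalManuals',
  reports: 'BusinessReports',
}

// Naive routing heuristic based on the filename (illustrative only).
function collectionFor(filename: string): string {
  const name = filename.toLowerCase()
  if (name.includes('paper')) return COLLECTIONS.research
  if (name.includes('manual')) return COLLECTIONS.manuals
  return COLLECTIONS.reports
}

// collectionFor('user-manual.pdf') → 'TechnicalManuals'
```

Queries can then target a single collection for precision, or fan out across several and merge results.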

Troubleshooting

Common Issues

ENOENT Path Errors: The system automatically handles path normalization, but ensure your folderPath is relative to the project root.

Empty Answers: Check that documents were successfully processed by examining the logs. Verify your OpenAI API key is valid.

Weaviate Connection Issues: Ensure your WEAVIATE_URL and WEAVIATE_API_KEY are correct and your cluster is running.

Performance Tips

  • Document Size: For large PDFs, consider preprocessing to split them into smaller files
  • Batch Size: Adjust the Weaviate batch size based on your cluster's capacity
  • Chunking Strategy: Experiment with different chunk sizes and overlap for your specific use case

💻 Dive into the Code

Want to explore the complete RAG implementation? Check out the full source code, including all steps, configuration files, and setup instructions:

Complete RAG Implementation

Access the full source code for this RAG agent, including Python processing scripts, TypeScript orchestration, and production configuration.


Conclusion: The Future of Document Intelligence

This RAG system demonstrates the power of combining best-in-class technologies with Motia's event-driven architecture. By breaking down complex document processing into discrete, manageable steps, we've created a system that's not only powerful but also maintainable and scalable.

The polyglot nature of the solution (Python for document processing, TypeScript for orchestration) shows how Motia enables you to use the right tool for each job without sacrificing integration or maintainability.

From here, you can extend the system by:

  • Adding support for other document formats (Word, PowerPoint, etc.)
  • Implementing document classification and routing
  • Adding real-time document updates and synchronization
  • Building a web interface for document management
  • Integrating with existing business systems

The event-driven architecture makes all of these extensions straightforward to implement without disrupting the existing pipeline.

Ready to transform your documents into intelligent, queryable knowledge bases? Start building with Motia today!

Need help? See our Community Resources for questions, examples, and discussions.