Most teams have knowledge scattered across Google Docs, Notion, PDFs, and internal wikis. When someone asks "what's our refund policy?" or "what were the Q1 numbers?", the answer exists somewhere — but finding it takes time, and asking a generic AI chatbot just produces confident-sounding guesses.

Retrieval-Augmented Generation (RAG) fixes this. Instead of relying on a model's training data, RAG retrieves the most relevant passages from your own documents before generating a response. The LLM answers based on what you've stored, not what it was trained on.

With n8n, you can build a production-ready RAG chatbot without writing a Python pipeline. n8n has native nodes for OpenAI embeddings, Pinecone vector operations, HTTP webhooks, and error routing — everything you need to go from documents to a working chatbot.

What RAG Actually Does

RAG works in two phases:

Ingestion: Documents are split into chunks, converted to vector embeddings, and stored in a vector database like Pinecone.
Query: When a user asks a question, the question is embedded and used to find the most semantically similar chunks. Those chunks are passed to an LLM as context, and the model generates a grounded answer.

The result is a chatbot that answers from your documents instead of from the model's general training data, which sharply reduces hallucination. When none of the stored chunks is relevant to the question, the retrieved context contains nothing useful, and a well-instructed LLM can say it doesn't know rather than inventing an answer.
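
To make the query phase concrete, here is a minimal sketch of similarity search over a few in-memory vectors. The three-dimensional vectors and chunk texts are invented toy data, and cosine similarity is an assumption; in the real workflow the embeddings come from OpenAI and the search runs inside Pinecone with whatever metric your index uses.

```javascript
// Toy illustration of the query phase: rank stored chunks by cosine similarity.
// The 3-dimensional vectors are invented for the example; real embeddings
// from text-embedding-3-small have 1536 dimensions.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query and keep the k best.
function topK(queryVector, storedChunks, k) {
  return storedChunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const storedChunks = [
  { text: "Refunds are issued within 14 days.", vector: [0.9, 0.1, 0.0] },
  { text: "Quarterly numbers live in the finance folder.", vector: [0.0, 0.2, 0.9] },
  { text: "Returns require a receipt.", vector: [0.8, 0.3, 0.1] },
];

// Hypothetical embedding of the question "what's our refund policy?"
const questionVector = [1.0, 0.2, 0.0];
const matches = topK(questionVector, storedChunks, 2);
console.log(matches.map((m) => m.text));
```

With this toy data the two refund-related chunks rank first and the finance chunk is left out, which is the behavior the query phase relies on.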

What You Need

n8n (self-hosted or n8n Cloud)
A Pinecone account (the free tier is typically enough for a small document set)
An OpenAI API key (for embeddings and chat completions)
Documents to index: Google Docs, PDFs, plain text files, or any text-accessible source

Part 1: The Ingestion Workflow

This workflow loads documents into Pinecone once, and again whenever the source documents change.

Step 1: Trigger the ingestion

Use a Manual Trigger node for initial indexing. For recurring updates, swap to a Schedule Trigger node set to run daily or weekly. For continuous sync, a Webhook Trigger node lets you trigger re-indexing whenever a document is updated in your CMS or Google Drive.

Step 2: Fetch document content

Use the Google Drive node to list all files in your target folder, then download each file as text. For PDFs, chain an Extract From File node (set to PDF mode) after the download to extract readable text. For Notion databases, use the Notion node to retrieve page content in plain text format.

Step 3: Split text into chunks

Use a Code node (JavaScript) to split each document into overlapping chunks of 300–500 tokens. An overlap of 50–100 tokens between adjacent chunks keeps content that straddles a chunk boundary from being cut off from its context. A simple approach: split on double newlines, then combine paragraphs until a token threshold is reached.
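
As a rough sketch of the chunking step, the function below uses a sliding window over words rather than the paragraph-combining approach, because it is shorter to show. It treats whitespace-separated words as tokens, which is an assumption: real OpenAI token counts differ, so leave headroom. The function name and the `chunkIndex` field are illustrative, not part of n8n.

```javascript
// Sliding-window chunker. "Tokens" here are whitespace-separated words,
// an approximation chosen for illustration only.
function chunkText(text, maxTokens = 400, overlapTokens = 80) {
  if (overlapTokens >= maxTokens) {
    throw new Error("overlapTokens must be smaller than maxTokens");
  }
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const step = maxTokens - overlapTokens; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    const slice = words.slice(start, start + maxTokens);
    chunks.push({ chunkIndex: chunks.length, text: slice.join(" ") });
    if (start + maxTokens >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Small demonstration with a 4-word window and 1 word of overlap.
const demoChunks = chunkText("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", 4, 1);
console.log(demoChunks);
```

Inside an n8n Code node set to run once for all items, a function like this could be applied with something like `return $input.all().flatMap((item) => chunkText(item.json.text).map((c) => ({ json: { ...c, title: item.json.title } })));`, where the `text` and `title` field names are assumptions about what the previous node outputs.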

Step 4: Generate embeddings

Use a Split In Batches node to process chunks in groups of 10. For each batch, use the OpenAI node in Embeddings mode with the text-embedding-3-small model to generate a 1536-dimension vector for each chunk. This converts text into a numerical representation that captures semantic meaning.
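
For readers who want to see what the batching and embedding nodes do underneath, here is a sketch against the OpenAI embeddings HTTP endpoint. The helper names are hypothetical, and `embedBatch` is defined but not called because it needs network access and an API key.

```javascript
// Group chunk texts into batches of a fixed size (10, matching the step above).
function toBatches(items, batchSize = 10) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// JSON body for POST https://api.openai.com/v1/embeddings.
// `input` may be an array of strings; one embedding is returned per string.
function buildEmbeddingRequest(chunkTexts) {
  return { model: "text-embedding-3-small", input: chunkTexts };
}

// Sends one batch and returns the vectors in input order.
// Not invoked here: it requires network access and a valid API key.
async function embedBatch(chunkTexts, apiKey) {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildEmbeddingRequest(chunkTexts)),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // Each entry carries an `index`; sort on it so order matches the input.
  return payload.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.embedding);
}

const demoBatches = toBatches(Array.from({ length: 25 }, (_, i) => `chunk ${i}`));
console.log(demoBatches.map((batch) => batch.length));
```

Batching keeps each request small and makes a failed call cheap to retry, since only ten chunks need to be re-sent.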

Step 5: Upsert vectors to Pinecone

Use the Pinecone node in Upsert mode to store each vector. Along with the embedding, pass metadata: the original chunk text, the document title, the source URL, and the chunk index. This metadata is returned with query results so the LLM knows what it's reading and your frontend can show attribution links.
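
The record shape sent to Pinecone can be sketched as follows. The `vectors` array of `id`, `values`, and `metadata` matches Pinecone's upsert request format; the metadata key names, the `doc` object, and the `docId#chunkIndex` ID scheme are assumptions made for this example.

```javascript
// Build an upsert body from one document's chunks and their embeddings.
// IDs are deterministic so re-indexing the same document overwrites its
// old vectors instead of creating duplicates.
function buildUpsertBody(doc, chunks, embeddings) {
  if (chunks.length !== embeddings.length) {
    throw new Error("Each chunk needs exactly one embedding");
  }
  return {
    vectors: chunks.map((chunk, i) => ({
      id: `${doc.id}#${chunk.chunkIndex}`,
      values: embeddings[i],
      metadata: {
        text: chunk.text, // original chunk text, returned at query time
        title: doc.title,
        sourceUrl: doc.url,
        chunkIndex: chunk.chunkIndex,
      },
    })),
  };
}

// Hypothetical document and tiny stand-in vectors.
const demoUpsert = buildUpsertBody(
  { id: "refund-policy", title: "Refund Policy", url: "https://example.com/refund-policy" },
  [{ chunkIndex: 0, text: "Refunds are issued within 14 days." }],
  [[0.1, 0.2, 0.3]]
);
console.log(JSON.stringify(demoUpsert, null, 2));
```

One limitation of this ID scheme: if a document gets shorter, vectors for the chunk indexes that no longer exist are not overwritten, so they would need to be deleted separately.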

Part 2: The Query Workflow

This workflow runs in real time whenever a user asks a question.

Step 6: Receive the question

Use a Webhook Trigger node to accept POST requests containing a question field. This endpoint becomes what your chat interface, Slack bot, or internal tool calls. Set the webhook to respond synchronously so the caller waits for the answer.
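
It can help to validate the payload before spending an embedding call on it. The sketch below assumes the request body is JSON with a `question` string; the function name and the 1000-character limit are arbitrary choices for illustration.

```javascript
// Validate the incoming webhook payload.
// Returns { ok: true, question } or { ok: false, error }.
function parseQuestion(body, maxLength = 1000) {
  const question = typeof body?.question === "string" ? body.question.trim() : "";
  if (question.length === 0) {
    return { ok: false, error: "Missing 'question' field" };
  }
  if (question.length > maxLength) {
    return { ok: false, error: `Question exceeds ${maxLength} characters` };
  }
  return { ok: true, question };
}

console.log(parseQuestion({ question: "  What's our refund policy?  " }));
console.log(parseQuestion({}));
```

In n8n the same check could live in a Code or IF node placed right after the webhook, with invalid requests routed straight to a Respond to Webhook node that returns the error.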

Step 7: Embed the question

Pass the question through the OpenAI node in Embeddings mode using the same text-embedding-3-small model used during ingestion. This is critical — the question must be embedded in the same vector space as your stored documents for similarity search to work.

Step 8: Query Pinecone for relevant chunks

Use the Pinecone node in Query mode. Pass the question embedding and set topK to 5, returning the five most semantically similar chunks from your index. Enable includeMetadata: true so you get the original text back alongside the vector IDs.
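
A sketch of the query request and of pulling the chunk text out of the response. The `vector`, `topK`, and `includeMetadata` fields and the `matches` array follow Pinecone's query format; the metadata keys assume the names stored during ingestion in this example, and the optional `minScore` cutoff is an addition not described above.

```javascript
// Request body for a Pinecone query.
function buildQueryBody(questionEmbedding, topK = 5) {
  return { vector: questionEmbedding, topK, includeMetadata: true };
}

// Turn the query response into plain chunk objects for the prompt step.
// minScore is a hypothetical extra: it drops weak matches so irrelevant
// text does not end up in the context.
function extractChunks(queryResponse, minScore = 0) {
  return (queryResponse.matches ?? [])
    .filter((match) => match.score >= minScore)
    .map((match) => ({
      id: match.id,
      score: match.score,
      text: match.metadata.text,
      title: match.metadata.title,
      sourceUrl: match.metadata.sourceUrl,
    }));
}

// Hand-written stand-in for a real response.
const demoQueryResponse = {
  matches: [
    { id: "refund-policy#0", score: 0.9, metadata: { text: "Refunds are issued within 14 days.", title: "Refund Policy", sourceUrl: "https://example.com/refund-policy" } },
    { id: "travel#3", score: 0.2, metadata: { text: "Book flights through the portal.", title: "Travel Policy", sourceUrl: "https://example.com/travel" } },
  ],
};
console.log(extractChunks(demoQueryResponse, 0.5));
```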

Step 9: Build the prompt

Use a Set node to assemble the LLM prompt. Concatenate the retrieved chunk texts as a context block, then append the user's question. A reliable template:

You are a helpful assistant. Use the following excerpts to answer the question. If the answer is not in the excerpts, say you don't know.

Context: [chunks from Pinecone]

Question: [user's question]
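
The template above could be filled in with a small function like this one. The `title` and `text` field names and the numbered excerpt labels are assumptions for the example.

```javascript
// Assemble the RAG prompt from retrieved chunks and the user's question.
function buildPrompt(chunks, question) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.title}\n${chunk.text}`)
    .join("\n\n");
  return [
    "You are a helpful assistant. Use the following excerpts to answer the question. If the answer is not in the excerpts, say you don't know.",
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const demoPrompt = buildPrompt(
  [{ title: "Refund Policy", text: "Refunds are issued within 14 days." }],
  "What's our refund policy?"
);
console.log(demoPrompt);
```

Numbering the excerpts and including each title gives the model something to cite and makes the prompt easier to read when debugging.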

Step 10: Call the LLM

Use the OpenAI node in Chat mode with gpt-4o-mini, which is fast, inexpensive, and generally accurate enough for retrieval-grounded tasks. Pass the assembled prompt. Because the prompt tells the model to rely on the excerpts, the answer should draw from the retrieved context rather than from the model's training data.
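
For reference, the request the chat node sends looks roughly like the sketch below, written against the OpenAI chat completions endpoint. Setting `temperature` to 0 is an assumption aimed at more repeatable answers, and the helper names are hypothetical.

```javascript
// JSON body for POST https://api.openai.com/v1/chat/completions.
function buildChatRequest(prompt) {
  return {
    model: "gpt-4o-mini",
    temperature: 0, // assumption: favor repeatable answers over variety
    messages: [{ role: "user", content: prompt }],
  };
}

// The generated text sits in the first choice's message.
function extractAnswer(chatResponse) {
  return chatResponse.choices[0].message.content;
}

// Hand-written stand-in for a real response.
const demoChatResponse = {
  choices: [{ message: { role: "assistant", content: "Refunds are issued within 14 days." } }],
};
console.log(buildChatRequest("Context: ...\n\nQuestion: ..."));
console.log(extractAnswer(demoChatResponse));
```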

Step 11: Return the answer

Use a Respond to Webhook node to send the answer back to the caller. Include the source metadata (document titles and URLs) in the response body so your frontend can display which documents the answer came from.
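
One possible shape for the response body, with sources deduplicated so a document that contributed several chunks is listed once. The `answer` and `sources` field names are assumptions; use whatever your frontend expects.

```javascript
// Combine the answer with a deduplicated list of source documents.
function buildResponseBody(answer, chunks) {
  const seenUrls = new Set();
  const sources = [];
  for (const chunk of chunks) {
    if (seenUrls.has(chunk.sourceUrl)) continue; // already listed this document
    seenUrls.add(chunk.sourceUrl);
    sources.push({ title: chunk.title, url: chunk.sourceUrl });
  }
  return { answer, sources };
}

const demoResponseBody = buildResponseBody("Refunds are issued within 14 days.", [
  { title: "Refund Policy", sourceUrl: "https://example.com/refund-policy" },
  { title: "Refund Policy", sourceUrl: "https://example.com/refund-policy" },
  { title: "Returns FAQ", sourceUrl: "https://example.com/returns-faq" },
]);
console.log(JSON.stringify(demoResponseBody, null, 2));
```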

Example n8n Workflow: Company HR Policy Bot

A startup wants employees to ask HR policy questions without digging through Google Drive. Here's the full two-workflow setup:

Ingestion (runs every Monday at 9 AM):

Schedule Trigger — fires weekly
Google Drive node — lists all files in /HR Policies/
Google Drive node (download) — fetches each file as plain text
Code node — splits each document into 400-token chunks, 80-token overlap
Split In Batches node — processes 10 chunks per iteration
OpenAI node (embeddings) — text-embedding-3-small, generates vector per chunk
Pinecone node (upsert) — stores vector + chunk text + document title + Drive URL

Query (runs on every employee question):

Webhook Trigger — listens at /hr-bot, called by a Slack slash command
OpenAI node (embeddings) — embeds the employee's question
Pinecone node (query) — retrieves top 5 matching HR policy chunks
Set node — assembles RAG prompt with retrieved chunks
OpenAI node (chat) — gpt-4o-mini generates the grounded answer
Respond to Webhook node — returns answer and source document links to Slack

API cost per query runs around $0.001, so a team of 50 people asking 500 questions per month in total spends about $0.50 in API fees.
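
The arithmetic behind that estimate, using the per-query figure quoted above. Actual cost depends on current API pricing and on how long your chunks and answers are, so treat the default as a placeholder.

```javascript
// Monthly API spend = number of questions x cost per question.
// 0.001 is the article's rough per-query figure, not a quoted price.
function monthlyApiCost(questionsPerMonth, costPerQuery = 0.001) {
  return questionsPerMonth * costPerQuery;
}

console.log(`$${monthlyApiCost(500).toFixed(2)} per month`);
```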

Practical Benefits of Building RAG in n8n

Visual debugging. Every node shows its inputs and outputs, so you can inspect exactly which chunks were retrieved for any question. Diagnosing a poor answer usually starts with the output of step 8 (the Pinecone query), not with print statements added to a Python script.

No extra infrastructure. n8n handles scheduling, batching, retries, and error routing. Your vector data lives in Pinecone. You don't manage queues, workers, or embedding servers.

Flexible document sources. Swapping Google Drive for Notion, Confluence, or a plain HTTP endpoint requires changing the first two nodes in the ingestion workflow — nothing else changes.

Built-in error handling. Use n8n's error workflow feature to route failed embedding jobs to a Slack alert instead of silently dropping chunks. Pair with an IF node to skip documents that return empty text after extraction.
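
The empty-text check is simple enough to sketch. The function name, the `text` field, and the 20-character minimum are assumptions; in n8n the same condition could be expressed in an IF node or a Code node.

```javascript
// True when a document has enough extracted text to be worth embedding.
function hasUsableText(doc, minChars = 20) {
  return typeof doc.text === "string" && doc.text.trim().length >= minChars;
}

const extractedDocs = [
  { title: "Refund Policy", text: "Refunds are issued within 14 days of purchase." },
  { title: "Scanned PDF", text: "   " }, // extraction returned only whitespace
  { title: "Broken export" }, // no text field at all
];
const indexableDocs = extractedDocs.filter((doc) => hasUsableText(doc));
console.log(indexableDocs.map((doc) => doc.title));
```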

The ingestion and query patterns here apply to any domain: customer support knowledge bases, product documentation, legal contract Q&A, internal sales playbooks, and onboarding guides.

Browse ready-to-import RAG and AI automation templates at n8nresources.dev/templates. For a broader overview of what's possible with n8n's AI capabilities, see the AI Agents use case page.

Build a RAG Chatbot with n8n and Pinecone