Tutorial 5: RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) lets a node answer questions from external context instead of relying solely on the LLM's training data. In KeGAL, retrieved content is a single string (retrieved_chunks) that the compiler injects into the {retrieved_chunks} placeholder of every node whose prompt sets retrieved_chunks: true.

1. Basic: direct assignment

The simplest approach is to assign the retrieved text to compiler.retrieved_chunks in Python before calling compile().

models:
  - llm: "ollama"
    model: "qwen2.5:7b"
    host: "http://localhost:11434"

prompts:
  - template:
      system_template:
        role: |
          You are a helpful assistant. Answer only from the context provided.
          If the context does not contain enough information, say so clearly.
      prompt_template:
        context: |
          Context:
          {retrieved_chunks}
        question: |
          {user_message}

nodes:
  - id: "rag_node"
    model: 0
    temperature: 0.2
    max_tokens: 512
    show: true
    prompt:
      template: 0
      user_message: true
      retrieved_chunks: true   # enables {retrieved_chunks} injection

edges:
  - node: "rag_node"

from kegal import Compiler

def retrieve(query: str) -> str:
    # your retrieval logic: vector search, BM25, database lookup, etc.
    return "... relevant document chunks ..."

with Compiler(uri="rag.yml") as compiler:
    question = "What is the return policy?"
    compiler.retrieved_chunks = retrieve(question)
    compiler.user_message = question
    compiler.compile()

    for node in compiler.get_outputs().nodes:
        for msg in node.response.messages or []:
            print(msg)
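The retrieve() function above is only a placeholder. As one hypothetical way to fill it in, the sketch below scores an in-memory list of documents by word overlap with the query and joins the best matches into the single string that compiler.retrieved_chunks expects. Word overlap is a deliberately simple stand-in for BM25 or vector search, and DOCUMENTS, STOPWORDS and the k parameter are illustrative names, not part of KeGAL.

```python
import re

# Hypothetical in-memory corpus; in practice this comes from your
# document store, vector index, or database.
DOCUMENTS = [
    "Return policy: items may be returned within 30 days of purchase.",
    "Refunds are processed within 5-7 business days.",
    "Shipping is free on orders over 50 USD.",
]

# Common words ignored when scoring (illustrative, not exhaustive).
STOPWORDS = {"a", "an", "are", "do", "how", "i", "is", "my",
             "of", "on", "the", "to", "what"}


def _tokens(text: str) -> set:
    """Lowercase word tokens with punctuation and stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def retrieve(query: str, k: int = 2) -> str:
    """Return up to k documents sharing the most words with the query,
    joined into one string (empty if nothing overlaps)."""
    q = _tokens(query)
    scored = [(len(q & _tokens(doc)), doc) for doc in DOCUMENTS]
    # Highest overlap first; Python's sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [doc for score, doc in scored if score > 0][:k]
    return "\n\n".join(best)
```

With this corpus, retrieve("What is the return policy?") returns only the return-policy line, because "return" and "policy" are the only non-stopword terms in the query.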

Static context in YAML: you can also declare retrieved_chunks directly in the graph YAML for content that never changes:
retrieved_chunks: |
  Return policy: items may be returned within 30 days of purchase.
  Refunds are processed within 5–7 business days.

2. Intermediate: loading from a file or URL

add_retrieved_chunks is a convenience helper that accepts exactly one source per call: a local file path (file=), a remote https:// URL (uri=), or a plain string (chunks=). It is useful when chunks are prepared by a separate process and written to disk, or served from a remote endpoint.

from pathlib import Path
from kegal import Compiler

with Compiler(uri="rag.yml") as compiler:
    # from a local text file
    compiler.add_retrieved_chunks(file=Path("context/retrieved.txt"))
    compiler.user_message = "What is the return policy?"
    compiler.compile()

with Compiler(uri="rag.yml") as compiler:
    # from a remote URL (https only)
    compiler.add_retrieved_chunks(
        uri="https://knowledge-base.example.com/api/chunks?q=return+policy"
    )
    compiler.user_message = "What is the return policy?"
    compiler.compile()

with Compiler(uri="rag.yml") as compiler:
    # from an already-retrieved string (same as direct assignment);
    # retrieve() is the function defined in section 1
    user_question = "What is the return policy?"
    chunks = retrieve(user_question)
    compiler.add_retrieved_chunks(chunks=chunks)
    compiler.user_message = user_question
    compiler.compile()

Passing more than one source argument, or none at all, raises ValueError.
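To make the documented contract concrete, here is a hypothetical stand-alone function that enforces the same two rules: exactly one of file, uri, or chunks, and https-only remote sources. load_chunks is not part of KeGAL and this is not its implementation; the 10-second timeout and UTF-8 decoding are assumptions.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import urlopen


def load_chunks(file: Optional[Path] = None, uri: Optional[str] = None,
                chunks: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the documented rules of
    add_retrieved_chunks: exactly one source, https-only URLs."""
    provided = [arg for arg in (file, uri, chunks) if arg is not None]
    if len(provided) != 1:
        raise ValueError("pass exactly one of file, uri, or chunks")
    if file is not None:
        return Path(file).read_text(encoding="utf-8")
    if uri is not None:
        # Reject http://, ftp://, file:// and so on before any network access.
        if urlparse(uri).scheme != "https":
            raise ValueError("only https:// URLs are permitted")
        with urlopen(uri, timeout=10) as response:
            return response.read().decode("utf-8")
    return chunks
```

Checking the argument count before touching the file system or network means a misconfigured call fails fast, without side effects.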

3. Intermediate: RAG + structured extraction

Combine RAG with structured_output to extract typed information from retrieved documents rather than generating free-form text.

models:
  - llm: "ollama"
    model: "qwen2.5:7b"
    host: "http://localhost:11434"

prompts:
  - template:
      system_template:
        role: |
          You are a data extraction specialist.
          Extract the requested fields from the context.
          Return only the JSON object — no prose.
      prompt_template:
        context: |
          {retrieved_chunks}
        instruction: |
          From the context above, extract the product specifications.

nodes:
  - id: "spec_extractor"
    model: 0
    temperature: 0.0
    max_tokens: 256
    show: true
    prompt:
      template: 0
      retrieved_chunks: true
    structured_output:
      description: "Product specification extraction"
      parameters:
        product_name:
          type: "string"
        price_usd:
          type: "number"
        warranty_years:
          type: "integer"
        features:
          type: "array"
          items: { type: "string" }
      required: ["product_name", "price_usd"]

edges:
  - node: "spec_extractor"

from pathlib import Path
from kegal import Compiler

with Compiler(uri="rag_extract.yml") as compiler:
    compiler.add_retrieved_chunks(file=Path("product_sheet.txt"))
    compiler.compile()

    data = compiler.get_outputs().nodes[0].response.json_output
    print(data["product_name"])   # "UltraWidget X200"
    print(data["price_usd"])      # 299.99
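A model can still return JSON that omits a field or uses the wrong type, and this tutorial does not say whether KeGAL validates json_output against the declared schema. A defensive check such as the hypothetical check_spec below (plain Python, no KeGAL calls; SPEC_TYPES, REQUIRED and check_spec are illustrative names) mirrors the parameters and required list from the YAML and reports problems before you use data.

```python
# Field name -> accepted Python types, mirroring the YAML parameters above.
SPEC_TYPES = {
    "product_name": (str,),
    "price_usd": (int, float),   # a JSON "number" may decode as int or float
    "warranty_years": (int,),
    "features": (list,),
}
REQUIRED = ("product_name", "price_usd")


def check_spec(data: dict) -> list:
    """Return a list of problems; an empty list means the output looks valid."""
    problems = [f"missing required field: {k}" for k in REQUIRED if k not in data]
    for key, types in SPEC_TYPES.items():
        if key not in data:
            continue  # optional fields may be absent
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(f"{key}: unexpected type {type(value).__name__}")
    features = data.get("features")
    if isinstance(features, list) and not all(isinstance(f, str) for f in features):
        problems.append("features: all items must be strings")
    return problems
```

For example, check_spec(data) returning a non-empty list is a signal to retry the node or fall back, rather than indexing into fields that may not exist.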

4. Advanced: guard → RAG pipeline

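Validate the user query before answering it. The guard runs first; if it rejects the query (validation=false), rag_node never executes, so no tokens are spent generating an answer to an off-topic question. Retrieval itself still runs, because it happens in Python before compile().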

flowchart TD
    QG[query_guard] -->|validation=true| RAG[rag_node]
    QG -->|validation=false| STOP([Abort])

models:
  - llm: "ollama"
    model: "qwen2.5:7b"
    host: "http://localhost:11434"

prompts:
  # 0 — guard
  - template:
      system_template:
        role: |
          Determine whether the question can be answered from a
          software product knowledge base. Approve only technical
          questions about the product.
      prompt_template:
        query: "{user_message}"

  # 1 — RAG answer
  - template:
      system_template:
        role: |
          Answer the question using only the context below.
      prompt_template:
        context: "{retrieved_chunks}"
        question: "{user_message}"

nodes:
  - id: "query_guard"
    model: 0
    temperature: 0.0
    max_tokens: 128
    show: false
    prompt:
      template: 0
      user_message: true
    structured_output:
      description: "Query relevance check"
      parameters:
        validation:
          type: "boolean"
      required: ["validation"]

  - id: "rag_node"
    model: 0
    temperature: 0.2
    max_tokens: 512
    show: true
    prompt:
      template: 1
      user_message: true
      retrieved_chunks: true

edges:
  - node: "query_guard"
  - node: "rag_node"

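from kegal import Compiler

# retrieve() is the function defined in section 1
with Compiler(uri="guarded_rag.yml") as compiler:
    query = "How do I reset my password?"
    compiler.user_message = query
    # KeGAL does not retrieve internally, so this runs even if the guard rejects the query
    compiler.retrieved_chunks = retrieve(query)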
    compiler.compile()

    executed = {n.node_id for n in compiler.get_outputs().nodes}
    if "rag_node" not in executed:
        print("Query rejected by guard — not a product question.")
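If the retrieval call itself is expensive, one option is to split the work so that retrieval runs only after the guard approves, for example by running a guard-only graph first and a RAG-only graph second. That split is an assumption about how you might structure your own code, not a KeGAL feature described here. The sketch below shows only the control flow in plain Python; is_relevant, retrieve and answer are hypothetical callables.

```python
from typing import Callable, Optional


def guarded_answer(
    query: str,
    is_relevant: Callable[[str], bool],
    retrieve: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> Optional[str]:
    """Run the guard first and retrieve only if it approves the query.

    In a KeGAL setup, is_relevant could wrap a guard-only graph and
    answer a RAG-only graph (hypothetical wiring, not shown here).
    """
    if not is_relevant(query):
        return None            # rejected: no retrieval, no generation
    context = retrieve(query)  # the lookup happens only for approved queries
    return answer(query, context)
```

The trade-off is two compile runs instead of one for accepted queries, in exchange for skipping the lookup on rejected ones.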

5. Advanced: multi-node RAG pipeline

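Use message passing to chain a RAG node (raw answer) into a refinement node (polished answer), so each node focuses on one task. rag_node sets message_passing.output: true to publish its response, and refiner sets message_passing.input: true to receive it through the {message_passing} placeholder in its prompt.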

flowchart LR
    RAG["rag_node\noutput: true"] -->|raw answer| REF["refiner\ninput: true"]

models:
  - llm: "ollama"
    model: "qwen2.5:7b"
    host: "http://localhost:11434"

prompts:
  # 0 — RAG: produce a raw, fact-dense answer
  - template:
      system_template:
        role: |
          Answer the question from the context. Be exhaustive — include
          all relevant details even if the answer is long.
      prompt_template:
        context: "{retrieved_chunks}"
        question: "{user_message}"

  # 1 — refiner: polish the raw answer into a concise response
  - template:
      system_template:
        role: |
          You receive a detailed but possibly verbose answer.
          Rewrite it as a clear, concise response suitable for a
          customer-facing chatbot. Max 3 sentences.
      prompt_template:
        raw: "{message_passing}"

nodes:
  - id: "rag_node"
    model: 0
    temperature: 0.1
    max_tokens: 1024
    show: false
    message_passing:
      output: true
    prompt:
      template: 0
      user_message: true
      retrieved_chunks: true

  - id: "refiner"
    model: 0
    temperature: 0.4
    max_tokens: 256
    show: true
    message_passing:
      input: true
    prompt:
      template: 1

edges:
  - node: "rag_node"
  - node: "refiner"
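Conceptually, the chain feeds rag_node's text into the refiner's {message_passing} placeholder. The plain-Python sketch below models only that data flow, assuming str.format-style substitution and a hypothetical llm callable; it is not how KeGAL renders prompts internally, and it omits the system templates entirely.

```python
from typing import Callable

# Hypothetical stand-ins for the two prompt_template blocks above.
RAG_TEMPLATE = "context: {retrieved_chunks}\nquestion: {user_message}"
REFINER_TEMPLATE = "raw: {message_passing}"


def run_chain(user_message: str, retrieved_chunks: str,
              llm: Callable[[str], str]) -> str:
    """Two-stage chain: rag_node's output becomes the refiner's input."""
    raw = llm(RAG_TEMPLATE.format(retrieved_chunks=retrieved_chunks,
                                  user_message=user_message))
    # output: true on rag_node / input: true on refiner means the raw
    # answer is substituted into the refiner's {message_passing} slot.
    return llm(REFINER_TEMPLATE.format(message_passing=raw))
```

Only the final refiner output reaches the user, which matches show: false on rag_node and show: true on refiner.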

Key points

retrieved_chunks is a single string — chunk separation, ordering, and truncation are entirely the caller's responsibility.
Set prompt.retrieved_chunks: true on every node that needs the content; nodes without this flag do not receive it.
add_retrieved_chunks accepts exactly one of file, uri, or chunks. Only https:// URLs are permitted for remote sources.
The same retrieved_chunks value is shared by all nodes in the graph. If different nodes need different context, combine it in the single string (for example, under labelled headings) or pass it explicitly with message passing.
Retrieval should happen before compile() — KeGAL does not perform retrieval internally.
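Because chunk separation, ordering, and truncation are left to the caller, a small helper can make those choices explicit. The sketch below is one hypothetical approach: it assumes your retriever returns (score, text) pairs, orders them best-first, joins them with a visible separator, and applies a character budget as a rough proxy for tokens. build_context, max_chars and the separator are illustrative, not KeGAL settings.

```python
def build_context(chunks: list, max_chars: int = 4000,
                  separator: str = "\n\n---\n\n") -> str:
    """Order (score, text) chunks best-first, join them with a visible
    separator, and stop before the combined string exceeds max_chars."""
    parts = []
    used = 0
    for _, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        # Every chunk after the first also costs one separator.
        extra = len(text) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break  # drop this chunk and all lower-ranked ones
        parts.append(text)
        used += extra
    return separator.join(parts)
```

The result is assigned as usual, e.g. compiler.retrieved_chunks = build_context(scored_chunks). Dropping whole chunks rather than cutting one mid-sentence keeps every passage the model sees intact.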

Related tutorials:
03 Guard nodes: validating queries before retrieval
02 Structured output: extracting typed data from retrieved context
01 Message passing: chaining a RAG node to a refinement step