Tutorial 7: Multimodal — Images and PDFs
KeGAL nodes can receive images and PDF documents alongside the text prompt. The media is attached to the LLM call directly, so no special placeholder is needed in the prompt template. Nodes whose model supports vision or document understanding process the attached media as part of the message.
1. Basic: a single image
Declare an image at the graph level and reference it by index on the node.
models:
  - llm: "ollama"
    model: "qwen2.5-vl:7b"   # a vision-capable model
    host: "http://localhost:11434"
images:
  - uri: "./assets/architecture_diagram.png"   # local file
prompts:
  - template:
      system_template:
        role: |
          You are a technical architect. Analyse the attached diagram.
      prompt_template:
        task: |
          {user_message}
nodes:
  - id: "diagram_analyst"
    model: 0
    temperature: 0.3
    max_tokens: 512
    show: true
    images: [0]   # index 0 from the images list
    prompt:
      template: 0
      user_message: true
edges:
  - node: "diagram_analyst"
from kegal import Compiler

with Compiler(uri="vision.yml") as compiler:
    compiler.user_message = "Identify any bottlenecks visible in the diagram."
    compiler.compile()
    for msg in compiler.get_outputs().nodes[0].response.messages:
        print(msg)
Remote images: use an https:// URI. Only HTTPS is allowed for remote sources; http:// raises ValueError before any network call is made.
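The HTTPS-only rule can be pictured as a scheme check that runs before anything is downloaded. The following is a minimal sketch of that idea and not KeGAL's actual code: the helper name check_media_uri and its error message are hypothetical, and it assumes that anything other than an http:// URI (an https:// URL or a local path) is acceptable.

```python
from urllib.parse import urlparse


def check_media_uri(uri: str) -> str:
    """Hypothetical helper: reject plain-HTTP media URIs before any network call."""
    scheme = urlparse(uri).scheme.lower()
    if scheme == "http":
        raise ValueError(f"insecure remote media URI (use https://): {uri}")
    # Everything else (https URLs, local paths) passes through unchanged.
    return uri


print(check_media_uri("https://example.com/diagram.png"))
print(check_media_uri("./assets/architecture_diagram.png"))
```

Running a check like this over every declared uri before calling compile() would surface a mistyped scheme early, in the same spirit as the ValueError described above.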
2. Intermediate: PDF documents
Documents (PDFs) work the same way as images but are declared under documents:.
models:
  - llm: "anthropic"
    model: "claude-sonnet-4-6"
    api_key: "sk-ant-..."
documents:
  - uri: "./reports/q3_report.pdf"
prompts:
  - template:
      system_template:
        role: |
          You are a financial analyst. Answer questions about the
          attached earnings report.
      prompt_template:
        question: |
          {user_message}
nodes:
  - id: "report_analyst"
    model: 0
    temperature: 0.2
    max_tokens: 1024
    show: true
    documents: [0]   # index 0 from the documents list
    prompt:
      template: 0
      user_message: true
edges:
  - node: "report_analyst"
from kegal import Compiler

with Compiler(uri="document_qa.yml") as compiler:
    compiler.user_message = "What was the operating margin in Q3?"
    compiler.compile()
3. Intermediate: multiple images and documents on one node
A node can receive multiple images, multiple documents, or a combination. List the indices of every item you want to include.
images:
  - uri: "./assets/floor_plan.png"       # index 0
  - uri: "./assets/elevation_view.png"   # index 1
documents:
  - uri: "./specs/building_specs.pdf"    # index 0
nodes:
  - id: "architect_review"
    model: 0
    temperature: 0.3
    max_tokens: 1024
    show: true
    images: [0, 1]   # both images
    documents: [0]   # the PDF
    prompt:
      template: 0
      user_message: true
Different nodes in the same graph can receive different subsets of the declared media:
nodes:
  - id: "floor_plan_analyst"
    images: [0]      # only the floor plan
  - id: "elevation_analyst"
    images: [1]      # only the elevation view
  - id: "spec_reader"
    documents: [0]   # only the PDF
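Because nodes point at media by position, an out-of-range index is an easy mistake to make once the lists are edited. As a sketch only, assuming the graph has already been loaded into a plain dict with the same keys as the YAML above, a pre-flight check could look like this. The function find_bad_media_indices is hypothetical and not part of KeGAL.

```python
def find_bad_media_indices(graph: dict) -> list[tuple[str, str, int]]:
    """Return (node_id, media_kind, index) for every index with no matching entry."""
    problems = []
    for node in graph.get("nodes", []):
        for kind in ("images", "documents"):
            available = len(graph.get(kind, []))
            for index in node.get(kind, []):
                if not 0 <= index < available:
                    problems.append((node["id"], kind, index))
    return problems


graph = {
    "images": [
        {"uri": "./assets/floor_plan.png"},
        {"uri": "./assets/elevation_view.png"},
    ],
    "documents": [{"uri": "./specs/building_specs.pdf"}],
    "nodes": [
        {"id": "floor_plan_analyst", "images": [0]},
        {"id": "elevation_analyst", "images": [1]},
        {"id": "spec_reader", "documents": [1]},  # deliberate mistake: only index 0 exists
    ],
}

print(find_bad_media_indices(graph))  # [('spec_reader', 'documents', 1)]
```

The check treats images and documents as two separate index spaces, which matches the way each list above starts again at index 0.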
4. Advanced: base64-encoded media
When media is not stored on disk or at a URL (for example, a screenshot captured at runtime), encode it as base64 and pass it in an entry's base64 field instead of uri. In Python, you can inject it at runtime by replacing the image list:
import base64
from pathlib import Path

from kegal import Compiler
from kegal.graph import GraphInputData

raw_bytes = Path("screenshot.png").read_bytes()
b64 = base64.b64encode(raw_bytes).decode()

with Compiler(uri="vision.yml") as compiler:
    # replace all images with the runtime screenshot
    compiler.images = [GraphInputData(base64=b64)]
    compiler.user_message = "Describe what you see in this screenshot."
    compiler.compile()
compiler.images and compiler.documents are lists of GraphInputData. Assigning to them replaces the list loaded from the YAML graph. The indices used by nodes still refer to positions in whichever list is active at compile() time.
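When the bytes never touch disk, the same encoding step works on an in-memory buffer. The sketch below uses only the standard library and does not call KeGAL. The helper encode_media is hypothetical, and the placeholder bytes stand in for a real capture. The resulting string is the kind of value passed as GraphInputData(base64=...) above.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def encode_media(raw_bytes: bytes) -> str:
    """Hypothetical helper: base64-encode raw media bytes as an ASCII string."""
    if not raw_bytes:
        raise ValueError("media payload is empty")
    return base64.b64encode(raw_bytes).decode("ascii")


# Placeholder payload standing in for a screenshot captured at runtime.
fake_screenshot = PNG_SIGNATURE + b"placeholder image data"
b64 = encode_media(fake_screenshot)

# Round-trip check: strict decoding must give back the original bytes.
decoded = base64.b64decode(b64, validate=True)
assert decoded == fake_screenshot
assert decoded.startswith(PNG_SIGNATURE)
```

Rejecting an empty payload up front is a design choice of this sketch; it keeps an accidental zero-byte capture from being sent to the model as an image.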
5. Advanced: parallel multi-modal analysis
Fan out to multiple specialist nodes, each receiving different media, and fan in to a synthesiser.
flowchart LR
I["image_analyst\nimages: [0]"] --> S["synthesizer\ninput: true"]
D["doc_analyst\ndocuments: [0]"] --> S
models:
  - llm: "ollama"
    model: "qwen2.5-vl:7b"
    host: "http://localhost:11434"
images:
  - uri: "./assets/product_photo.jpg"
documents:
  - uri: "./specs/product_spec.pdf"
prompts:
  - template:   # 0: image analyst
      system_template:
        role: Describe the product shown in the image. Focus on appearance and visible features.
      prompt_template:
        task: "{user_message}"
  - template:   # 1: doc analyst
      system_template:
        role: Summarise the key technical specifications from the attached document.
      prompt_template:
        task: "{user_message}"
  - template:   # 2: synthesizer
      system_template:
        role: |
          You receive a visual description and a technical specification.
          Combine them into a single product description for a customer.
      prompt_template:
        findings: "{message_passing}"
nodes:
  - id: "image_analyst"
    model: 0
    temperature: 0.3
    max_tokens: 512
    show: false
    images: [0]
    message_passing: { output: true }
    prompt: { template: 0, user_message: true }
  - id: "doc_analyst"
    model: 0
    temperature: 0.2
    max_tokens: 512
    show: false
    documents: [0]
    message_passing: { output: true }
    prompt: { template: 1, user_message: true }
  - id: "synthesizer"
    model: 0
    temperature: 0.5
    max_tokens: 512
    show: true
    message_passing: { input: true }
    prompt: { template: 2 }
edges:
  - node: "synthesizer"
    fan_in:
      - node: "image_analyst"
      - node: "doc_analyst"
from kegal import Compiler

with Compiler(uri="multimodal_pipeline.yml") as compiler:
    compiler.user_message = "Prepare a product listing."
    compiler.compile()
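The data flow of this graph can be illustrated without any LLM at all. The sketch below is purely conceptual: plain Python functions stand in for the three nodes, the standard-library thread pool stands in for parallel execution, and nothing here uses KeGAL or claims to show how KeGAL schedules nodes internally.

```python
from concurrent.futures import ThreadPoolExecutor


def image_analyst(task: str) -> str:
    # Stand-in for the vision node; a real node would send images[0] to the model.
    return f"[visual description for: {task}]"


def doc_analyst(task: str) -> str:
    # Stand-in for the document node; a real node would send documents[0] to the model.
    return f"[spec summary for: {task}]"


def synthesizer(findings: list[str]) -> str:
    # Stand-in for the fan-in node, which sees only the upstream outputs.
    return " | ".join(findings)


task = "Prepare a product listing."
with ThreadPoolExecutor(max_workers=2) as pool:
    # Fan out: both analysts start without waiting for each other.
    futures = [pool.submit(image_analyst, task), pool.submit(doc_analyst, task)]
    # Fan in: collect results in submission order, then combine them.
    findings = [future.result() for future in futures]

print(synthesizer(findings))
```

Note that the synthesizer stand-in receives no media and no user message, only the two upstream results, mirroring the node above that sets message_passing input and omits images and documents.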
Key points
- Images and documents are declared at the graph level and referenced by index on each node.
- A node can reference multiple items from each list: images: [0, 1, 2].
- Different nodes can receive different subsets of the same graph-level lists.
- Only https:// URIs are permitted for remote media; http:// raises ValueError.
- base64 and uri are alternatives within a single GraphInputData entry; don't combine them.
- Media is attached to the LLM call directly; no {placeholder} is needed in the prompt template.
- At runtime, replacing compiler.images or compiler.documents before compile() overrides the YAML declarations.
Related tutorials:
- 11 Multi-provider graphs: choosing a vision-capable provider
- 04 Fan-out and fan-in: parallel specialist nodes