Miriam Kümmel & Mathis Lucka • 2026-05-07 • 12 min read

Improving knowledge graph creation in life sciences through agent steering

Agent steering intercepts agents mid-run to provide state-specific feedback, improving completeness, hallucination rates, and entity resolution by up to 14 percentage points for knowledge graph creation in life sciences.

Intro

Biomedical R&D generates increasingly diverse datasets across modalities such as clinical, preclinical, omics, and real-world data, often in heterogeneous formats. Knowledge graphs are one way to harmonize this data, but creating them is costly and time-consuming. Many teams turn to LLMs to speed up the process, and the most capable setups use agents. While agents are a step up from single-pass LLM workflows, they are not without fault. Errors can compound over long agent trajectories and, unlike coding agents, domain-specific agents lack verification loops.

In this post, we demonstrate empirically how agent steering improves completeness, reduces hallucinations, and improves entity resolution in an agent for knowledge graph creation.

What is agent steering?

Instead of front-loading all instructions into the initial prompt, agent steering intercepts the agent mid-run to provide feedback specific to its current state. The agent self-corrects and yields better outcomes than what would be achieved through prompting alone.

In our setup, an evaluator detects issues like deviations from the system prompt, missed nodes, or hallucinations. It pinpoints each issue to a specific text span and provides an explanation. This information is injected into the agent's trajectory as a correction prompt. The correction loop can run multiple times: if the evaluator finds new issues after the initial correction, the agent is prompted to correct them too.
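To make the shape of this feedback concrete, here is a minimal sketch of how evaluator findings could be turned into a correction prompt. The `Issue` structure, the issue kinds, and the prompt wording are assumptions made for illustration, not the format Blue Guardrails actually uses, and the example finding is invented.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """One evaluator finding (hypothetical structure, for illustration only)."""
    kind: str         # e.g. "missed_node", "hallucination", "prompt_deviation"
    span: str         # the exact text span the issue is pinned to
    explanation: str  # why the evaluator flagged it


def build_correction_prompt(issues: list[Issue]) -> str:
    """Render evaluator findings as a message to inject into the agent's trajectory."""
    if not issues:
        return ""
    lines = ["An evaluator reviewed your work so far and found these issues:"]
    for number, issue in enumerate(issues, start=1):
        lines.append(f'{number}. [{issue.kind}] "{issue.span}": {issue.explanation}')
    lines.append("Correct each issue before you continue.")
    return "\n".join(lines)


# Invented example finding; the span is the kind of text an SmPC might contain.
issues = [
    Issue(
        kind="missed_node",
        span="Insomnia, somnolence, sleep disorder,",
        explanation="Somnolence is listed as an adverse reaction but has no node.",
    )
]
print(build_correction_prompt(issues))
```

Tying every issue to a text span and an explanation gives the agent something specific to act on, rather than a generic instruction to "try again".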

Creating a knowledge graph from unstructured documents

To evaluate agent steering, we built an agent that creates a knowledge graph from Summaries of Product Characteristics (SmPCs). SmPCs are regulatory documents that describe prescription medicines for medical professionals. The agent extracts nodes and edges from SmPCs and stores them in a Neo4j graph database.

Here is a simplified excerpt of the schema:

from dataclasses import dataclass
from typing import Literal
 
@dataclass
class ClinicalCondition:
    node_id: str
    # SNOMED CT is a clinical terminology database. The agent has a tool to
    # search it and uses SNOMED codes for entity resolution.
    snomed_code: int
    canonical_name: str
 
@dataclass
class Substance:
    node_id: str
    snomed_code: int
    canonical_name: str
 
@dataclass
class CausesAdverseReactionEdge:
    from_node: str
    to_node: str
    frequency: Literal[
        "Very common",
        "Common",
        "Uncommon",
        "Rare",
        "Very rare",
        "Not known"
    ]

The agent receives a PDF and extracts nodes and edges using the tools we provide as its harness. The harness is critical to the system's success but out of scope for this post (we cover it in the webinar slide deck).
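As an illustration of the storage step, the following sketch turns one extracted edge into a parameterized Cypher statement. The relationship type `CAUSES_ADVERSE_REACTION`, the `node_id` property lookup, and the helper `edge_to_cypher` are assumptions for this example. The post does not show how the agent's tools actually write to Neo4j.

```python
from dataclasses import asdict, dataclass


@dataclass
class CausesAdverseReactionEdge:
    from_node: str
    to_node: str
    frequency: str  # simplified here; the schema above restricts it with a Literal


# Hypothetical Cypher: property and relationship type names are assumptions.
# MERGE creates the relationship only if it does not exist yet, so writing the
# same edge twice does not duplicate it.
UPSERT_EDGE = (
    "MATCH (a {node_id: $from_node}), (b {node_id: $to_node}) "
    "MERGE (a)-[r:CAUSES_ADVERSE_REACTION]->(b) "
    "SET r.frequency = $frequency"
)


def edge_to_cypher(edge: CausesAdverseReactionEdge) -> tuple[str, dict[str, str]]:
    """Return the query text and its parameters for one edge."""
    return UPSERT_EDGE, asdict(edge)


edge = CausesAdverseReactionEdge("drug_aerinaze", "cc_insomnia", "Common")
query, params = edge_to_cypher(edge)
print(query)
print(params)
```

With the official Neo4j Python driver, a pair like this could be passed to `session.run(query, params)`. Using parameters instead of string formatting keeps extracted text out of the query itself.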

As an example, here is a sub-graph of some adverse reactions extracted from two SmPCs:

Sub-graph: drugs and adverse reactions (interactive figure)

Example extracted edge: Aerinaze causes adverse reaction Insomnia
Frequency: Common
Source: aerinaze-epar-product-information_en.pdf, page 6
Grounding text: “Insomnia, somnolence, sleep disorder,”
Source node: drug_aerinaze
Target node: cc_insomnia
Edge type: CausesAdverseReaction

Even with a well-designed harness, the extraction is far from perfect. The main issues we observed with the plain agent are:

Completeness

The agent underextracts nodes and edges: it misses information that it should extract and creates only a fraction of the expected graph.

Hallucinations

The agent fabricates node or edge attributes, for example assigning the wrong frequency to an adverse reaction. This is also the type of issue we cover in PlaceboBench (our hallucination benchmark for life sciences).

Entity resolution

The agent uses the wrong concept ID or no ID at all, which leads to duplicate nodes and merge failures across documents.
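The entity resolution problem can be shown with a small sketch that merges nodes from several documents on their SNOMED code. The `ExtractedNode` type, the helper `merge_by_snomed_code`, the file names, and the codes `111` and `222` are placeholders invented for this example. They are not real SNOMED CT identifiers and not the pipeline's actual merge logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedNode:
    canonical_name: str
    snomed_code: Optional[int]  # None models "no ID at all"
    source_document: str


def merge_by_snomed_code(
    nodes: list[ExtractedNode],
) -> tuple[dict[int, list[ExtractedNode]], list[ExtractedNode]]:
    """Group nodes that share a code; nodes without a code cannot be merged."""
    merged: dict[int, list[ExtractedNode]] = {}
    unresolved: list[ExtractedNode] = []
    for node in nodes:
        if node.snomed_code is None:
            unresolved.append(node)
        else:
            merged.setdefault(node.snomed_code, []).append(node)
    return merged, unresolved


# Four mentions of the same condition; 111 and 222 are placeholder codes.
nodes = [
    ExtractedNode("Insomnia", 111, "doc_a.pdf"),
    ExtractedNode("Insomnia", 111, "doc_b.pdf"),       # merges with doc_a
    ExtractedNode("Sleeplessness", 222, "doc_c.pdf"),  # wrong code: duplicate node
    ExtractedNode("Insomnia", None, "doc_d.pdf"),      # no code: cannot merge
]
merged, unresolved = merge_by_snomed_code(nodes)
print(len(merged), "coded nodes,", len(unresolved), "unresolved")
```

In this toy case a single clinical concept ends up as three separate graph nodes instead of one, which is the failure mode described above.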

Steering the extraction agent

While the agent runs, traces are sent to Blue Guardrails. An evaluator using an Agent-as-a-Judge approach detects issues, pinpoints them to specific text spans, and returns explanations. The agent receives a correction prompt based on this data. During our experiments, we limit the system to two correction loops.
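The bounded correction loop can be sketched as follows. This is a simplified illustration under stated assumptions: `run_agent` and `evaluate` are hypothetical callables standing in for the agent and the evaluator, and each agent pass is treated as a single call. The real system works on traces while the agent is running.

```python
from typing import Callable

MAX_CORRECTION_LOOPS = 2  # the limit used in the experiments


def run_with_steering(
    run_agent: Callable[[list[str]], str],
    evaluate: Callable[[str], list[str]],
    task: str,
    max_loops: int = MAX_CORRECTION_LOOPS,
) -> tuple[str, int]:
    """Run the agent, then feed evaluator issues back until clean or out of loops."""
    messages = [task]
    output = run_agent(messages)
    loops_used = 0
    while loops_used < max_loops:
        issues = evaluate(output)
        if not issues:
            break  # nothing left to correct
        # Inject the findings into the trajectory as a correction prompt.
        messages.append("Please correct these issues:\n" + "\n".join(issues))
        output = run_agent(messages)
        loops_used += 1
    return output, loops_used


# Stub agent and evaluator, only here to make the sketch runnable.
def fake_agent(messages: list[str]) -> str:
    return f"graph v{len(messages)}"


def fake_evaluator(output: str) -> list[str]:
    return [] if output == "graph v2" else ["missed node: somnolence"]


result = run_with_steering(fake_agent, fake_evaluator, "Extract the graph")
print(result)
```

Capping the number of loops keeps cost and latency predictable even when the evaluator keeps finding new issues.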

The agent steering loop: traces flow to Blue Guardrails, an evaluator detects issues, and an issue prompt is injected back into the agent run as a correction.

Impact on extraction quality

We evaluate agent steering against a ground truth dataset of seven medicines spanning various drug classes and complexities. The dataset contains 1,622 nodes and 1,865 edges. We run graph creation with two models in different capability tiers: Deepseek v4 Flash and GPT 5.4 Nano. Each model is a realistic and economically feasible choice for AI engineering teams dealing with large-scale data extraction and harmonization tasks.

The graph below is pulled live from the Neo4j instance that backs this experiment. It shows every extracted product, their active ingredients, and their adverse reactions. This is a sub-graph of the full extraction. The complete graph additionally contains information such as contraindications, precautions, interactions, and posologies.

Live graph: products, active ingredients, and adverse reactions (interactive figure, pulled live from Neo4j)

Completeness & entity resolution

We use node recall to measure completeness. Since nodes are matched on their SNOMED codes, recall also captures correct entity resolution: a node only counts as a match if the agent resolved it to the right concept. We report F1 alongside recall to confirm that higher recall does not come at the cost of precision.
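As a sketch of how these metrics behave, the following function computes recall, precision, and F1 over sets of SNOMED codes. It assumes that a node is fully identified by its code. The actual evaluation may match on additional fields such as node type, and the codes in the example are placeholders.

```python
def node_scores(expected_codes: set[int], extracted_codes: set[int]) -> dict[str, float]:
    """Recall, precision, and F1 for nodes matched on their SNOMED code."""
    matched = expected_codes & extracted_codes
    recall = len(matched) / len(expected_codes) if expected_codes else 0.0
    precision = len(matched) / len(extracted_codes) if extracted_codes else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Placeholder codes: three of four expected nodes were found, and one
# extracted node (9) was resolved to a code that is not in the ground truth.
scores = node_scores(expected_codes={1, 2, 3, 4}, extracted_codes={1, 2, 3, 9})
print(scores)
```

A node resolved to the wrong code hurts twice: the expected concept is missing, which lowers recall, and the wrong concept is an extra node, which lowers precision.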

Figure: Node recall and node F1 on the medication knowledge graph extraction task, with and without agent steering, across two models (light = without steering, dark = with steering).

Agent steering raises recall from 83% to 91% for Deepseek v4 Flash, and the F1-score rises to 92%.

For GPT 5.4 Nano, the effect is even more pronounced: recall increases by 14 percentage points, from 58% to 72%. This shows that even a small model like GPT 5.4 Nano can self-improve if given feedback through agent steering.

Hallucinations

For the graph-creation agent, hallucinations happen primarily on edge attributes. As shown in the schema above, the edges in our graph hold information such as the frequency of an adverse reaction. For example, the agent might record an adverse reaction as "Common" instead of "Rare".

We measure the correctness of extracted edge attributes with the F1-score.
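One plausible way to compute such a score is sketched below. The matching rule is an assumption, since the post does not spell it out: an extracted edge counts as correct only if the ground truth contains the same node pair with the same frequency. The node IDs `drug_a`, `cc_insomnia`, and `cc_nausea` are made up for the example.

```python
EdgeKey = tuple[str, str]  # (from_node, to_node)


def attribute_f1(gold: dict[EdgeKey, str], extracted: dict[EdgeKey, str]) -> float:
    """F1 over edges, where a match requires the same node pair and frequency."""
    true_positives = sum(
        1 for key, frequency in extracted.items() if gold.get(key) == frequency
    )
    false_positives = len(extracted) - true_positives  # wrong or invented attributes
    false_negatives = len(gold) - true_positives       # gold edges not reproduced
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Hypothetical ground truth and extraction for one drug.
gold = {
    ("drug_a", "cc_insomnia"): "Common",
    ("drug_a", "cc_nausea"): "Rare",
}
extracted = {
    ("drug_a", "cc_insomnia"): "Common",
    ("drug_a", "cc_nausea"): "Common",  # hallucinated frequency
}
print(attribute_f1(gold, extracted))
```

Under this rule, an edge with a hallucinated frequency counts as both a false positive and a false negative, so attribute errors pull the score down even when the edge itself is present.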

Figure: Edge attribute F1 on the medication knowledge graph extraction task, with and without agent steering, across two models.

Attribute F1 increases by more than 10 percentage points for GPT 5.4 Nano and by 4 percentage points for Deepseek v4 Flash. Agent steering is an effective measure to reduce attribute hallucinations.

Conclusion

Agent steering improves knowledge graph creation across all three dimensions we measured: completeness, hallucination rates, and entity resolution. The improvements range from 4 to 14 percentage points depending on model and metric, with smaller models benefiting the most. The approach is not specific to knowledge graphs or life sciences. Any agent that operates on long trajectories without built-in verification can benefit from mid-run feedback. We see agent steering as a general pattern for closing the gap between what agents can do in theory and what they deliver in practice.

Webinar slides: A better way to build knowledge graphs in life sciences. The deck covers harness engineering, steering loops, and evaluation results.

Miriam Kümmel

Co-Founder at Blue Guardrails

Mathis Lucka

Co-Founder at Blue Guardrails