Improving knowledge graph creation in life sciences through agent steering
Agent steering intercepts agents mid-run to provide state-specific feedback, improving completeness, hallucination rates, and entity resolution by up to 14 percentage points for knowledge graph creation in life sciences.
Intro
Biomedical R&D generates increasingly diverse datasets across modalities like clinical, preclinical, omics, and real-world data, often in heterogeneous formats. Knowledge graphs are a way to harmonize this data, but creating them is costly and time-consuming. Many teams turn to LLMs to speed up the process, and the most capable setups use agents. While agents are a step up from single-pass LLM workflows, they are not without fault. Errors can compound over long agent trajectories and, unlike coding agents, domain-specific agents lack verification loops.
In this post, we demonstrate empirically how agent steering improves completeness and entity resolution and reduces hallucinations in an agent for knowledge graph creation.
What is agent steering?
Instead of front-loading all instructions into the initial prompt, agent steering intercepts the agent mid-run to provide feedback specific to its current state. The agent self-corrects and yields better outcomes than what would be achieved through prompting alone.
In our setup, an evaluator detects issues like deviations from the system prompt, missed nodes, or hallucinations. It pinpoints each issue to a specific text span and provides an explanation. This information is injected into the agent's trajectory as a correction prompt. The correction loop can run multiple times: if the evaluator finds new issues after the initial correction, the agent is prompted to correct them too.
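To make the correction prompt more concrete, here is a minimal sketch of how evaluator findings could be rendered into a message for the agent. The `Issue` class, its field names, the issue labels, and the prompt wording are all assumptions for illustration, not the format our evaluator actually returns.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    kind: str         # hypothetical label, e.g. "missed_node" or "hallucination"
    span: str         # the text span the evaluator pinpointed
    explanation: str  # why the evaluator flagged the span


def build_correction_prompt(issues: list[Issue]) -> str:
    """Render evaluator findings as one message to inject into the trajectory."""
    lines = ["The evaluator found the following issues. Please correct them:"]
    for number, issue in enumerate(issues, start=1):
        lines.append(f'{number}. [{issue.kind}] "{issue.span}": {issue.explanation}')
    return "\n".join(lines)


# Made-up finding, for illustration only.
example_issues = [
    Issue(
        kind="missed_node",
        span="Insomnia, somnolence, sleep disorder,",
        explanation="No node was created for somnolence.",
    )
]
prompt = build_correction_prompt(example_issues)
```

Keeping the span and the explanation together in each numbered item gives the agent both the location of the problem and the reason it was flagged.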
Creating a knowledge graph from unstructured documents
To evaluate agent steering, we built an agent that creates a knowledge graph from Summaries of Product Characteristics (SmPCs). SmPCs are regulatory documents that describe prescription medicines for medical professionals. The agent extracts nodes and edges from SmPCs and stores them in a Neo4j graph database.
Here is a simplified excerpt of the schema:
from dataclasses import dataclass
from typing import Literal


@dataclass
class ClinicalCondition:
    node_id: str
    # SNOMED CT is a database for clinical terminology
    # the agent has a tool to search it and uses SNOMED codes for entity resolution
    snomed_code: int
    canonical_name: str


@dataclass
class Substance:
    node_id: str
    snomed_code: int
    canonical_name: str


@dataclass
class CausesAdverseReactionEdge:
    from_node: str
    to_node: str
    frequency: Literal[
        "Very common",
        "Common",
        "Uncommon",
        "Rare",
        "Very rare",
        "Not known",
    ]

The agent receives a PDF and extracts nodes and edges using the tools we provide as its harness. The harness is critical to the system's success but out of scope for this post (we cover it in our upcoming live stream).
As an example, here is a sub-graph of some adverse reactions extracted from two SmPCs:
Sub-graph: drugs and adverse reactions
Example edge grounding: the CausesAdverseReaction edge from drug_aerinaze (source node) to cc_insomnia (target node) is grounded in the span “Insomnia, somnolence, sleep disorder,” from aerinaze-epar-product-information_en.pdf.
Even with a well-designed harness, the extraction is far from perfect. The main issues we observed with the plain agent are:
- Completeness: The agent underextracts nodes and edges. It misses information that it should extract and creates only a fraction of the expected graph.
- Hallucinations: The agent fabricates node or edge attributes, for example assigning the wrong frequency to an adverse reaction. This is also the type of issue we cover in PlaceboBench (our hallucination benchmark for life sciences).
- Entity resolution: The agent uses the wrong concept ID or no ID at all, which leads to duplicate nodes and merge failures across documents.
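The entity resolution failure mode can be illustrated with a small sketch. It assumes nodes from different documents are merged on their SNOMED code. `ExtractedNode` and `merge_by_code` are hypothetical names, and the codes are placeholders rather than real SNOMED CT identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedNode:
    canonical_name: str
    snomed_code: Optional[int]  # None models "no ID at all"


def merge_by_code(nodes: list[ExtractedNode]) -> list[ExtractedNode]:
    """Keep one node per code; nodes without a code cannot be merged."""
    merged: dict[int, ExtractedNode] = {}
    unresolved: list[ExtractedNode] = []
    for node in nodes:
        if node.snomed_code is None:
            unresolved.append(node)
        else:
            merged.setdefault(node.snomed_code, node)
    return list(merged.values()) + unresolved


# Placeholder codes, not real SNOMED CT identifiers.
nodes = [
    ExtractedNode("Insomnia", 1001),  # document A, resolved correctly
    ExtractedNode("Insomnia", 1001),  # document B, same code: merges with A
    ExtractedNode("Insomnia", 9999),  # wrong code: survives as a duplicate
    ExtractedNode("Insomnia", None),  # no code: also survives as a duplicate
]
result = merge_by_code(nodes)
```

In this toy input, four mentions of the same condition collapse to three nodes instead of one, because only the two correctly resolved nodes share a merge key.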
Steering the extraction agent
While the agent runs, traces are sent to Blue Guardrails. An evaluator using an Agent-as-a-Judge approach detects issues, pinpoints them to specific text spans, and returns explanations. The agent receives a correction prompt based on this data. During our experiments, we limit the system to two correction loops.
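The control flow of a bounded correction loop might look roughly like the sketch below. It simplifies mid-run interception into a turn-based loop, and `run_agent` and `evaluate` are hypothetical callables standing in for the agent and the evaluator. Only the limit of two correction loops comes from our setup.

```python
from typing import Callable

MAX_CORRECTION_LOOPS = 2  # the limit used in our experiments


def run_with_steering(
    run_agent: Callable[[list[str]], str],
    evaluate: Callable[[str], list[str]],
    max_loops: int = MAX_CORRECTION_LOOPS,
) -> tuple[str, int]:
    """Run the agent, then re-prompt it while the evaluator still finds issues."""
    messages: list[str] = []  # correction prompts injected so far
    output = run_agent(messages)
    loops_used = 0
    while loops_used < max_loops:
        issues = evaluate(output)
        if not issues:
            break  # nothing left to correct
        messages.append("Please correct these issues:\n" + "\n".join(issues))
        output = run_agent(messages)
        loops_used += 1
    return output, loops_used


# Toy stand-ins: the "agent" produces a clean draft after one correction.
def fake_agent(messages: list[str]) -> str:
    return f"draft-{len(messages)}"


def fake_evaluator(output: str) -> list[str]:
    return [] if output == "draft-1" else ["missing node"]


final_output, loops = run_with_steering(fake_agent, fake_evaluator)
```

Capping the number of loops keeps cost and latency bounded when the evaluator keeps finding new issues.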
Impact on extraction quality
We evaluate agent steering against a ground truth dataset of seven medicines spanning various drug classes and complexities. The dataset contains 1,622 nodes and 1,865 edges. We run graph creation with two models in different capability tiers: Deepseek v4 Flash and GPT 5.4 Nano. Each model is a realistic and economically feasible choice for AI engineering teams dealing with large-scale data extraction and harmonization tasks.
The graph below is pulled live from the Neo4j instance that backs this experiment. It shows every extracted product, their active ingredients, and their adverse reactions. This is a sub-graph of the full extraction. The complete graph additionally contains information such as contraindications, precautions, interactions, and posologies.
Live graph: products, active ingredients, and adverse reactions
Completeness & entity resolution
We use node recall to measure completeness. Since nodes are matched on their SNOMED codes, recall also captures correct entity resolution: a node only counts as a match if the agent resolved it to the right concept. We report F1 alongside recall to confirm that higher recall does not come at the cost of precision.
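As a rough illustration of why matching on SNOMED codes ties recall to entity resolution, the sketch below scores a set of extracted codes against a set of expected codes. The function name and the placeholder codes are assumptions. The formulas are the standard definitions of precision, recall, and F1.

```python
def node_scores(expected: set[int], extracted: set[int]) -> dict[str, float]:
    """Score extracted node codes against ground-truth node codes."""
    true_positives = len(expected & extracted)
    recall = true_positives / len(expected) if expected else 0.0
    precision = true_positives / len(extracted) if extracted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder codes, not real SNOMED CT identifiers.
expected_codes = {1, 2, 3, 4}
extracted_codes = {1, 2, 3, 9}  # code 9 is a node resolved to the wrong concept
scores = node_scores(expected_codes, extracted_codes)
```

Here the wrongly resolved node counts twice against the agent: concept 4 is missing, which lowers recall, and concept 9 is spurious, which lowers precision.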
Agent steering bumps recall from 83% to 91% for Deepseek v4 Flash. The F1-score rises to 92%.
For GPT 5.4 Nano, the effect is even more pronounced. Recall increases by 14 percentage points, from 58% to 72%. This shows that even a small model like GPT 5.4 Nano can self-improve if given feedback through agent steering.
Hallucinations
For the graph-creation agent, hallucinations happen primarily on edge attributes. As shown in the schema above, the edges in our graph hold information such as the frequency of an adverse reaction. The agent may assign a wrong value, for example recording an adverse reaction as "Common" when the source says "Rare".
We measure the correctness of extracted edge attributes with the F1-score.
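One plausible way to compute an attribute-level F1 is sketched below. This is an assumption about the matching rule, not a description of our evaluation code: an extracted attribute counts as correct only if the same edge exists in the ground truth with the same frequency value. The function name and node IDs are hypothetical.

```python
EdgeKey = tuple[str, str]  # (from_node, to_node)


def attribute_f1(expected: dict[EdgeKey, str], extracted: dict[EdgeKey, str]) -> float:
    """F1 over edge attributes, counting exact frequency matches as correct."""
    correct = sum(
        1 for key, frequency in extracted.items() if expected.get(key) == frequency
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical node IDs and frequencies, for illustration only.
expected_edges = {("drug_x", "cc_a"): "Rare", ("drug_x", "cc_b"): "Common"}
extracted_edges = {("drug_x", "cc_a"): "Common", ("drug_x", "cc_b"): "Common"}
score = attribute_f1(expected_edges, extracted_edges)
```

Under this rule, a hallucinated frequency lowers both precision and recall, since the extracted value is wrong and the true value is never reproduced.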
Attribute F1 increases by more than 10 percentage points for GPT 5.4 Nano and by 4 points for Deepseek v4 Flash. Agent steering is an effective measure to reduce attribute hallucinations.
Conclusion
Agent steering improves knowledge graph creation across all three dimensions we measured: completeness, hallucination rates, and entity resolution. The improvements range from 4 to 14 percentage points depending on model and metric, with smaller models benefiting the most. The approach is not specific to knowledge graphs or life sciences. Any agent that operates on long trajectories without built-in verification can benefit from mid-run feedback. We see agent steering as a general pattern for closing the gap between what agents can do in theory and what they deliver in practice.
June 10, 2026 — see how agent steering solves completeness, entity resolution, and hallucinations in agentic graph creation.

