Managing Conversation History: Summarization and Trimming
Aria has come a long way — she checks the inbox, searches the web, queries a database, searches the handbook, remembers conversations, and even delegates to specialists. Now picture Julie using her all day, every day, in one long-running thread: dozens of messages, plus every tool call and every tool result along the way, all stacking up in that same conversation's memory.
Here's the problem: every single one of those past messages gets sent back to the model again with every new turn, because that's how memory works (Article 4). A long enough conversation eventually runs into the model's context window limit — and even before that, it gets slower and more expensive than it needs to be. This article introduces middleware, LangChain's mechanism for managing exactly this kind of problem, and the first of several production-focused tools we'll cover in this final stretch of the series.
🟡 Skill level: Intermediate.
Quick Reference
When to use this: Whenever a long-running conversation risks growing large enough to strain the model's context window, slow things down, or increase cost unnecessarily.
Basic syntax:
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware
from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model="gpt-5-nano",
    checkpointer=InMemorySaver(),
    middleware=[
        SummarizationMiddleware(model="gpt-4o-mini", trigger=("tokens", 100), keep=("messages", 1)),
    ],
)
Common patterns:
- SummarizationMiddleware automatically condenses older messages into a summary once a trigger threshold is hit
- A custom @before_agent function can remove specific messages outright using RemoveMessage
- Middleware runs at defined points in the agent's processing loop, without you having to rewrite the agent's core logic
Gotchas:
- ⚠️ Summarization only kicks in once the trigger condition is actually met — it doesn't run on every single message.
- ⚠️ Trimming or summarizing removes messages from the agent's working conversation state — it has no effect on the real-world data those messages referenced (an email isn't un-sent, a database row isn't deleted).
See also: Memory and Threads: Agents That Remember
What You Need to Know First
- Everything from Articles 1–11, especially memory (Article 4) and custom state (Article 8)
What We'll Cover in This Article
- What middleware means in the context of LangChain agents
- How to automatically summarize older parts of a conversation
- How to remove specific messages from the conversation entirely
What We'll Explain Along the Way
- A revisit of the context window concept from Article 10, applied to conversation length this time
- What "trigger" and "keep" mean for summarization specifically
What Is Middleware, and Why Do We Need It Here?
Middleware is code that runs at specific, defined points in an agent's processing loop — before the agent starts working on a new message, before or after it calls the model, and so on — without you needing to rewrite the agent's core logic to insert it. Think of it like airport security checkpoints along a travel route: they're inserted at specific points in the journey, they can inspect or modify what's allowed to continue, and they don't change where you're actually flying to.
We've actually already brushed up against a primitive version of this idea, without naming it — but starting with this article, middleware becomes an explicit, named concept we'll build on repeatedly through the rest of this series (human approval steps, dynamic prompts, dynamic tool access — all middleware).
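To make the checkpoint idea concrete, here is a toy model of an agent loop with two hook points. Everything in it (run_agent, the hook lists, the echo "model") is a hypothetical stand-in written for this illustration, not a LangChain API. It only shows the shape of the idea: hooks can inspect or modify state at fixed points while the core logic stays the same.

```python
from typing import Callable

# Hypothetical, simplified model of an agent loop with hook points.
# None of these names are LangChain APIs; they only illustrate the idea.
State = dict
Hook = Callable[[State], None]


def run_agent(state: State, before_agent: list[Hook], before_model: list[Hook]) -> State:
    """Run a toy agent loop, calling hooks at fixed checkpoints."""
    for hook in before_agent:   # checkpoint 1: once, before any work starts
        hook(state)
    for hook in before_model:   # checkpoint 2: right before the "model" is called
        hook(state)
    # The core logic is untouched by the hooks: it just echoes the last message.
    last = state["messages"][-1]
    state["messages"].append(f"echo: {last}")
    return state


def drop_empty_messages(state: State) -> None:
    """Example hook that modifies what is allowed to continue."""
    state["messages"] = [m for m in state["messages"] if m.strip()]


def log_message_count(state: State) -> None:
    """Example hook that inspects state without changing the messages."""
    state.setdefault("log", []).append(len(state["messages"]))


result = run_agent(
    {"messages": ["hello", "   ", "check my inbox"]},
    before_agent=[drop_empty_messages],
    before_model=[log_message_count],
)
print(result["messages"])
print(result["log"])
```

Swapping a hook in or out changes what the loop sees without touching run_agent itself, which is the property the rest of this article relies on.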
For managing long conversations specifically, there are two complementary strategies:
- Summarize older parts of the conversation into a condensed form, once it gets long enough to matter
- Trim specific messages out entirely — useful for things like large tool results you don't need to keep around long-term
Summarizing Older Messages Automatically
SummarizationMiddleware watches the conversation and, once a trigger condition is met, automatically replaces older messages with a generated summary — keeping the most recent messages intact, since recent context usually matters most.
# Purpose: Automatically summarize older messages once the conversation grows
# Context: Keeps long conversations efficient without losing earlier context entirely
# Input: A long sequence of messages, sent in a single invoke() call for this demo
# Output: A response where earlier messages have been condensed into a summary
from dotenv import load_dotenv
load_dotenv()
from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain.agents.middleware import SummarizationMiddleware
agent = create_agent(
    model="gpt-5-nano",
    checkpointer=InMemorySaver(),
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",       # a smaller, cheaper model just for summarizing
            trigger=("tokens", 100),   # summarize once the conversation exceeds ~100 tokens
            keep=("messages", 1),      # always keep the most recent message intact
        )
    ],
)
Two parameters matter most here:
- trigger decides when summarization kicks in: ("tokens", 100) means "once the conversation's token count exceeds 100." (This is a deliberately small number for demonstration; a real conversation threshold would typically be much larger.)
- keep decides what's protected from being summarized away: ("messages", 1) means the single most recent message always stays intact, exactly as it was.
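The interaction between the two parameters can be modelled in a few lines of plain Python. This is a simplified sketch, not the library's implementation: the four-characters-per-token estimate is a rough assumption, maybe_summarize is a hypothetical name, and the summary text is a placeholder where a real system would call a model.

```python
# Simplified, hypothetical model of trigger/keep. This is NOT the library's
# implementation; the token estimate and the summary text are stand-ins.

def estimate_tokens(messages: list[str]) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return sum(len(m) for m in messages) // 4


def maybe_summarize(messages: list[str], trigger_tokens: int, keep_messages: int) -> list[str]:
    """Condense older messages only when the trigger threshold is exceeded."""
    if estimate_tokens(messages) <= trigger_tokens:
        return messages  # below the trigger: nothing changes
    if keep_messages > 0:
        older, recent = messages[:-keep_messages], messages[-keep_messages:]
    else:
        older, recent = messages, []
    if not older:
        return messages  # everything is protected by keep
    # A real system would ask a model to write this summary
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent


short_chat = ["Hi Aria", "Hello Julie"]
long_chat = ["Please check my inbox and tell me what is new today."] * 10

print(maybe_summarize(short_chat, trigger_tokens=100, keep_messages=1))
print(maybe_summarize(long_chat, trigger_tokens=100, keep_messages=1))
```

The short conversation comes back unchanged, while the long one collapses to a single summary entry followed by the one protected message.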
Let's see it in action with a longer, multi-turn conversation:
# Purpose: Trigger summarization with a long enough conversation
# Context: Demonstrates older content getting condensed automatically
# Input: A multi-turn conversation with Julie, long enough to cross the token trigger
# Output: response["messages"][0] is now a generated summary, not the original first message
from langchain.messages import HumanMessage, AIMessage
response = agent.invoke(
    {"messages": [
        HumanMessage(content="Hi Aria, can you check my inbox?"),
        AIMessage(content="You have one new email from Jane asking about coffee next week."),
        HumanMessage(content="Can you check if there are any events happening that week?"),
        AIMessage(content="There's a jazz festival downtown that same week."),
        HumanMessage(content="Good to know. Draft a reply confirming coffee."),
        AIMessage(content="Draft ready: 'Hi Jane, coffee sounds great, let's pick a day!'"),
        HumanMessage(content="Perfect. Also, what's our PTO policy for new hires again?"),
        AIMessage(content="New hires accrue 10 days of PTO per year in their first year."),
        HumanMessage(content="Thanks. One more thing: can you summarize everything so far?"),
    ]},
    {"configurable": {"thread_id": "1"}},
)
print(response["messages"][0].content)
Once the conversation crosses the trigger threshold, response["messages"][0] is a generated summary covering the older turns — not the literal original first message anymore. The model still has access to what was discussed, just in a condensed form, freeing up space for new conversation to continue without growing unbounded.
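Printing only the first message can hide what happened to the rest. A small helper that outlines the whole list makes it easier to see whether a summary replaced the older turns. The outline function and the FakeMessage class below are hypothetical names for this sketch; it assumes only that each message has a content attribute, which is true of the message objects used above.

```python
def outline(messages, width: int = 60) -> list[str]:
    """Return one line per message: index, class name, and a content preview."""
    lines = []
    for i, m in enumerate(messages):
        text = str(m.content).replace("\n", " ")
        preview = text if len(text) <= width else text[: width - 3] + "..."
        lines.append(f"{i}: {type(m).__name__}: {preview}")
    return lines


# Stand-in message class so the sketch runs without an agent or API key.
# With a real agent you would pass response["messages"] instead.
class FakeMessage:
    def __init__(self, content):
        self.content = content


demo = [
    FakeMessage("Here is a summary of the conversation so far..."),
    FakeMessage("One more thing"),
]
for line in outline(demo):
    print(line)
```

With a real response, calling outline(response["messages"]) should show far fewer entries than were sent in once summarization has run.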
Trimming Specific Messages
Sometimes you don't want a summary — you want certain messages gone entirely. A common case: tool results that were genuinely useful in the moment (like raw search results or SQL output) but don't need to stick around once the agent has already used them to answer.
This uses a different kind of middleware: a function decorated with @before_agent, which runs once at the start of each agent invocation, before the model sees the conversation:
# Purpose: Remove all tool result messages before each new agent step
# Context: Keeps the conversation lean by dropping bulky tool output once it's served its purpose
# Input: The current conversation state
# Output: A set of messages to remove from that state
from typing import Any
from langchain.agents import AgentState
from langchain.messages import RemoveMessage, ToolMessage
from langgraph.runtime import Runtime
from langchain.agents.middleware import before_agent
@before_agent
def trim_tool_messages(state: AgentState, runtime: Runtime) -> dict[str, Any] | None:
    """Remove all tool result messages from the conversation before the agent runs."""
    messages = state["messages"]
    tool_messages = [m for m in messages if isinstance(m, ToolMessage)]
    # Each RemoveMessage asks for the message with that ID to be dropped from state
    return {"messages": [RemoveMessage(id=m.id) for m in tool_messages]}
RemoveMessage(id=...) tells LangGraph "delete the message with this specific ID from state" — you're not editing message content, you're removing entries outright. Let's wire this in and see it work:
# Purpose: Confirm tool messages get removed before each new agent turn
# Context: Useful when large tool results (like raw SQL or search output)
# shouldn't linger in the conversation once they've served their purpose
# Input: A conversation that includes some ToolMessage entries
# Output: A response generated without those tool messages present anymore
from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain.messages import HumanMessage, AIMessage, ToolMessage
agent = create_agent(
    model="gpt-5-nano",
    checkpointer=InMemorySaver(),
    middleware=[trim_tool_messages],
)
response = agent.invoke(
    {"messages": [
        HumanMessage(content="Which artist has the most tracks in our catalog?"),
        ToolMessage(content="[raw SQL result: 47 rows returned...]", tool_call_id="1"),
        AIMessage(content="Iron Maiden has the most tracks in the catalog."),
        HumanMessage(content="How many vacation days do new hires get?"),
        ToolMessage(content="[raw handbook chunk: PTO Policy section, 800 characters...]", tool_call_id="2"),
        AIMessage(content="New hires get 10 days of PTO in their first year."),
        HumanMessage(content="Can you remind me what you told me about the artist?"),
    ]},
    {"configurable": {"thread_id": "2"}},
)
print(response["messages"][-1].content)
By the time this runs, both ToolMessage entries have been stripped from the working conversation before the agent even started processing this turn — the agent answers based on what's left, without the bulky raw tool output cluttering things up.
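Why does returning a list of RemoveMessage objects delete things instead of appending them? The state update is merged into the existing messages by ID. The sketch below is a simplified model of that merge using hypothetical Msg and Remove stand-ins. It is not LangGraph's own code and leaves out anything else a real reducer might do.

```python
from dataclasses import dataclass

# Simplified, hypothetical model of a message reducer. The class and function
# names are illustrative stand-ins, not LangGraph's own implementation.


@dataclass
class Msg:
    id: str
    content: str


@dataclass
class Remove:
    id: str  # ID of the message to delete


def apply_update(current: list[Msg], update: list) -> list[Msg]:
    """Merge an update into state: Remove entries delete by ID, Msg entries append."""
    remove_ids = {u.id for u in update if isinstance(u, Remove)}
    kept = [m for m in current if m.id not in remove_ids]
    kept.extend(u for u in update if isinstance(u, Msg))
    return kept


state = [Msg("a", "question"), Msg("b", "[raw tool output]"), Msg("c", "answer")]
state = apply_update(state, [Remove("b")])
print([m.id for m in state])
```

The removal marker never becomes part of the conversation; it only names which existing entry should disappear.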
Common Misconceptions
❌ Misconception: Trimming or summarizing deletes the underlying real-world data
Reality: This only affects the agent's working conversation state — the messages it keeps track of for context. It has no effect whatsoever on whatever real-world thing a message referenced — an email Aria checked isn't un-sent, a database row a SQL query returned isn't deleted.
Why this matters: Don't reach for message trimming as a way to "undo" or "clean up" real actions — it only manages how much conversational context the model sees going forward.
❌ Misconception: Summarization happens on every single message
Reality: SummarizationMiddleware only triggers once the conversation crosses the threshold you set in trigger — short conversations are never touched.
Why this matters: If you don't see summarization happening in a short test conversation, that's expected behavior, not a bug — try a longer conversation or a lower trigger threshold to actually observe it.
Troubleshooting Common Issues
Problem: Summarization never seems to trigger
Symptoms: response["messages"][0] is still the original first message, even in what feels like a long conversation.
Common Causes:
- The conversation hasn't actually crossed the trigger threshold yet (the most common cause: short exchanges often contain fewer tokens than you'd guess)
- The middleware=[...] list wasn't actually passed to create_agent
Diagnostic Steps:
# Step 1: Temporarily lower the trigger threshold to confirm the mechanism works at all
SummarizationMiddleware(model="gpt-4o-mini", trigger=("tokens", 20), keep=("messages", 1))
# Step 2: Confirm middleware is actually attached
agent = create_agent(model="gpt-5-nano", middleware=[your_middleware]) # ✅
Solution: Temporarily lower the trigger threshold to confirm summarization works at all, then tune it back up to a sensible production value.
Problem: An error referencing a missing message ID after trimming
Symptoms: An error about a tool call referencing a message that no longer exists, after trimming has run.
Common Causes:
- A ToolMessage was removed, but the AIMessage that originally requested that tool call wasn't, leaving a dangling reference
Solution: Be careful about what you trim — removing a tool's result while keeping the message that requested it can leave the conversation in an inconsistent state for some models. Test trimming logic against realistic conversation shapes before relying on it in production.
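One way to test trimming logic against realistic conversation shapes is to check for requested tool calls that no longer have a result. The checker below is a hypothetical sketch that uses plain dicts as stand-ins for messages. The assumed shape (a "tool_calls" list on AI messages, a "tool_call_id" on tool results) mirrors the fields involved, but it is not tied to any library class.

```python
# Hypothetical checker using plain dicts as stand-ins for messages.
# Assumed shape: AI messages may carry "tool_calls" (a list of {"id": ...}),
# and tool results carry "tool_call_id".

def find_unanswered_tool_calls(messages: list[dict]) -> set[str]:
    """Return tool-call IDs that were requested but have no matching result."""
    requested = {
        call["id"]
        for m in messages if m["type"] == "ai"
        for call in m.get("tool_calls", [])
    }
    answered = {m["tool_call_id"] for m in messages if m["type"] == "tool"}
    return requested - answered


conversation = [
    {"type": "human", "content": "Which artist has the most tracks?"},
    {"type": "ai", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"type": "tool", "content": "[raw SQL result]", "tool_call_id": "call_1"},
    {"type": "ai", "content": "Iron Maiden."},
]

print(find_unanswered_tool_calls(conversation))   # nothing dangling yet

# Simulate trimming every tool result while keeping the requesting AI message
trimmed = [m for m in conversation if m["type"] != "tool"]
print(find_unanswered_tool_calls(trimmed))        # the request is now unanswered
```

If a check like this reports dangling IDs after trimming, one option is to remove the requesting AI message together with its tool result.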
Check Your Understanding
Quick Quiz
- What does trigger=("tokens", 100) actually control?
Answer: It sets the condition under which SummarizationMiddleware activates: once the conversation's token count exceeds 100, older messages get condensed into a summary. Below that threshold, nothing is summarized.
- Does removing a ToolMessage with trimming middleware affect the real action that tool performed?
Answer: No. Trimming only affects the agent's working conversation state. It has no effect on whatever real-world action or data the tool call originally touched.
- What's the general definition of "middleware" in this context?
Answer: Code that runs at specific, defined points in an agent's processing loop, without requiring you to rewrite the agent's core logic, to inspect or modify what happens at that point.
Hands-On Exercise
Challenge: Modify trim_tool_messages to only remove ToolMessage entries longer than 100 characters, leaving short tool results intact.
Solution:
from typing import Any
from langchain.agents import AgentState
from langchain.messages import RemoveMessage, ToolMessage
from langgraph.runtime import Runtime
from langchain.agents.middleware import before_agent
@before_agent
def trim_long_tool_messages(state: AgentState, runtime: Runtime) -> dict[str, Any] | None:
    """Remove only long tool result messages, keeping short ones intact."""
    messages = state["messages"]
    long_tool_messages = [
        m for m in messages
        if isinstance(m, ToolMessage) and len(m.content) > 100
    ]
    return {"messages": [RemoveMessage(id=m.id) for m in long_tool_messages]}
Explanation: Adding a length check to the existing filter condition is enough — the rest of the pattern (build a list of messages to remove, return them as RemoveMessage objects) stays exactly the same.
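One caveat about the length check: len(m.content) counts characters only when the content is a plain string. Message content can also be a list of content blocks, in which case len counts blocks rather than characters. The helper below is a hypothetical sketch for handling both cases. It assumes text blocks are either strings or dicts with a "text" key, and it ignores any other block type.

```python
def text_length(content) -> int:
    """Count characters of text, whether content is a string or a list of blocks."""
    if isinstance(content, str):
        return len(content)
    total = 0
    for block in content:
        if isinstance(block, str):
            total += len(block)
        elif isinstance(block, dict) and isinstance(block.get("text"), str):
            total += len(block["text"])
        # Other block types (assumed non-text) are skipped
    return total


print(text_length("short result"))
print(text_length([{"type": "text", "text": "x" * 150}]))
```

In the exercise solution, text_length(m.content) > 100 could stand in for len(m.content) > 100 if your tools return block-style content.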
Summary: Key Takeaways
- Middleware is code that runs at defined points in an agent's processing loop, without requiring changes to the agent's core logic
- SummarizationMiddleware automatically condenses older messages once a trigger threshold is crossed, while keep protects recent messages
- A custom @before_agent function with RemoveMessage can trim specific messages out of the conversation entirely
- Neither approach affects real-world data, only the agent's working conversation state
- Aria's conversations can now stay efficient even as they grow long, instead of accumulating indefinitely
Version Information
Tested with:
- Python: >=3.10, <4.0
- langchain: >=1.1.3 (latest stable as of writing: 1.3.4); SummarizationMiddleware and before_agent are both part of core langchain
- langgraph: >=1.0.3 (RemoveMessage and Runtime)
Known issues:
- None specific to this article's functionality at the time of writing.
What's Next?
You now understand middleware as a concept, and two ways to keep long conversations under control.
The natural next step is Human-in-the-Loop: Approve, Reject, Edit — Aria has been allowed to send emails on her own this whole time. That article covers building in a real approval step before anything irreversible actually happens.
References
- LangChain Academy: Introduction to LangChain (Python) — this section is inspired by and adapted from this course
- LangChain Docs: Middleware — official guide to the middleware system
- LangChain Docs: Context Engineering — official guide covering summarization and conversation management strategies
- langgraph on PyPI: latest version and release history