You Don't Know What Your Agent Will Do. But You Can Know What It Can Do.

LangChain is right that production reveals surprises. But observability alone is not enough. Enterprises need layers: boundaries, capability visibility, risk assessment, and runtime monitoring.

Production tells you what happened. Governance starts with what could happen.

LangChain's new post, "You don't know what your agent will do until it's in production," gets an important point right: agent behavior is harder to predict than traditional software behavior.

They identify two real causes:

  • Infinite input space: users can ask for the same thing in endlessly different ways.
  • Non-deterministic model behavior: small prompt or context changes can produce meaningfully different outputs and tool choices.

We agree with that. Production does reveal behavior you will not see in a staging environment.

But the phrase is also incomplete. There is a difference between:

  • what an agent might say
  • what an agent is structurally capable of doing
  • what an agent actually did in production

Those are three different governance problems. If you collapse them into one, you end up over-investing in post-hoc observability and under-investing in control.


What you can know before production

You may not know the exact sentence an agent will generate. You may not know every failure mode its users will induce. But you can know a lot before it takes a single production action:

  • what tools it has access to
  • what credentials, networks, data stores, or SaaS systems it can reach
  • whether it can take write actions or only make recommendations
  • whether it can delegate to other agents or compose new actions
  • what safeguards exist when it is wrong
  • what the credible blast radius looks like if it fails

That is the point of GRASP: make the risk posture of the agent legible before you learn about it the hard way.

This is why we push capability-first governance. For enterprises, the first question is not "what surprising thing did the agent do?" It is "what was this agent able to do in the first place?"


Layer 1: Hard boundaries still matter

Perimeters, sandboxing, scoped credentials, tool allowlists, approval gates, network restrictions, and proxy controls all matter. They are the first layer of defense, and teams should keep building them.
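To make the shape of these controls concrete, here is a minimal sketch of two of them: a tool allowlist and an approval gate for write actions. Everything here is illustrative — the tool names, the `approve` callback, and the dispatch style are assumptions, not a real enforcement layer.

```python
# Minimal sketch of two "hard boundary" controls: a tool allowlist
# and an approval gate for write actions. Tool names and the approve()
# callback are hypothetical.

class ToolDenied(Exception):
    pass

# Read-only tools: always allowed once on the list.
READ_TOOLS = {
    "search_docs": lambda q: f"results for {q!r}",
}
# Write tools: allowed only with explicit human approval.
WRITE_TOOLS = {
    "update_ticket": lambda t: f"updated {t}",
}

def guarded_call(tool, arg, approve=lambda tool, arg: False):
    """Dispatch a tool call through the allowlist and approval gate."""
    if tool in READ_TOOLS:                       # read path: allowlist only
        return READ_TOOLS[tool](arg)
    if tool in WRITE_TOOLS:                      # write path: needs approval
        if approve(tool, arg):
            return WRITE_TOOLS[tool](arg)
        raise ToolDenied(f"write tool {tool!r} was not approved")
    raise ToolDenied(f"{tool!r} is not on the allowlist")
```

In practice these gates belong at the credential, proxy, or network layer rather than only in application code; the sketch just shows the decision structure.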

But they are not enough.

They are a classic Swiss cheese strategy: useful slices, each with holes.

  • the sandbox is misconfigured
  • the "read-only" token can still trigger downstream effects
  • the approved assistant is used for an unapproved workflow
  • the local agent never touches the network control point you were relying on
  • the risk posture changes quietly when a new tool or permission is added

This is the same theme we argued in The Shadow is Cast by the Object: the hard problem is often not rogue AI. It is approved AI being used in ways nobody explicitly re-assessed.

Boundaries reduce exposure. They do not remove the need to understand exposure.


Layer 2: Observe capability, not just behavior

This is where ARIS sits.

LangSmith is focused on runtime observability: the traces, trajectories, outputs, clustering, and evaluations that help teams understand what agents did after they ran.

ARIS addresses a different question: what agents exist, what can they reach, how much agency do they have, what safeguards constrain them, and how much damage could they plausibly cause?

That is not post-hoc observability. It is capability observability.

ARIS discovers agents across code, CI/CD, cloud, and SaaS environments, then builds a GRASP profile so risk can be assessed systematically rather than anecdotally. In other words:

  • LangSmith helps you review production behavior.
  • ARIS helps you review production capability.

This middle layer matters because scalable governance cannot depend on reading every trace by hand. LangChain says humans can only review a limited number of traces per hour. Exactly. That is why enterprises need a way to triage which agents deserve the deepest scrutiny, rather than waiting for behavior to surface in logs.
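One way to make that triage concrete is to score each agent on its structural capabilities rather than its observed behavior, then review the highest-scoring agents first. The fields and weights below are invented for this sketch; they are not the actual GRASP profile or scoring scheme.

```python
# Illustrative capability-first triage: rank agents by what they can
# reach and do, not by what they have done. All fields and weights are
# assumptions made for this sketch, not the real GRASP model.
from dataclasses import dataclass, field

@dataclass
class CapabilityProfile:
    name: str
    can_write: bool                  # takes actions vs. only recommends
    can_delegate: bool               # can hand work to other agents
    reachable_systems: list = field(default_factory=list)
    has_approval_gate: bool = False  # safeguard when the agent is wrong

def risk_score(p: CapabilityProfile) -> int:
    score = len(p.reachable_systems)            # crude blast-radius proxy
    score += 3 if p.can_write else 0
    score += 2 if p.can_delegate else 0
    score -= 2 if p.has_approval_gate else 0    # safeguard lowers priority
    return max(score, 0)

agents = [
    CapabilityProfile("faq-bot", can_write=False, can_delegate=False,
                      reachable_systems=["kb"]),
    CapabilityProfile("ops-agent", can_write=True, can_delegate=True,
                      reachable_systems=["jira", "aws", "payroll"]),
]

# Review the riskiest agents first, regardless of what the logs show yet.
for p in sorted(agents, key=risk_score, reverse=True):
    print(p.name, risk_score(p))
```

The point of the sketch is the ordering, not the numbers: a read-only FAQ bot and a write-capable, multi-system ops agent deserve very different depths of review, and that ranking is knowable before either one serves a single production request.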


Layer 3: Runtime observability is still essential

None of this is an argument against LangSmith-style observability. Quite the opposite.

Once an agent is live, you still need to understand:

  • what users are actually asking it to do
  • where tool selection fails
  • which prompts drift
  • where safety or compliance issues emerge
  • which edge cases only appear under real traffic

That is all real, and LangChain is right to emphasize it.

The mistake is treating runtime observability as the whole answer.

If observability is your only line of defense, then the enterprise learns from production after the agent has already exercised the permissions it was given. Sometimes that is acceptable. In regulated or high-impact environments, often it is not.


The right model is layers

The strongest governance stack looks like this:

  1. Boundaries and perimeters to constrain what is allowed.
  2. Capability discovery and risk assessment to understand what each agent could do if those boundaries fail or drift.
  3. Runtime observability to learn what the agent actually does under real usage.

That is the nuance missing from the simpler "you don't know until production" framing.

You do not know everything until production. True.

But you absolutely can know enough to make better decisions before production:

  • whether the agent should run at all
  • where it should run
  • what it should be allowed to touch
  • which agents need tighter controls
  • which ones are safe enough to let move faster

That is how you turn governance from a blanket brake into targeted leverage.


Complementary, not competing

We do not see this as ARIS versus LangSmith.

They operate at different layers:

  • LangSmith is a strong dev and runtime observability layer for understanding agent quality and behavior.
  • ARIS is the governance and risk layer for discovering agents, assessing exposure, and informing where control should tighten.

An enterprise may want both. One helps teams improve agents. The other helps the organization decide which agents should be trusted, constrained, or escalated.


Conclusion

LangChain is right about the existence of production surprises.

Where we differ is the implied lesson. The answer is not just "observe better after the fact." The answer is to build layered governance:

  • hard boundaries because controls matter
  • capability visibility because boundaries are porous
  • runtime observability because real users always surprise you

You may not know exactly what your agent will do. But you can know what it can do, and for enterprise governance, that is where the work starts.
