It’s 2026, and thanks in no small part to Jaya Gupta and Ashu Garg of Foundation Capital, this is shaping up to be the year of the context graph.
Their argument that context graphs are AI's trillion-dollar opportunity landed because it names a gap every organisation can relate to. We have systems of record for data. We have increasingly capable agents acting across those systems. But we lack a durable, queryable record of how decisions are made—not just the outcome, but the reasoning, exceptions, approvals, and precedents that turn data into action.
Jaya and Ashu call this missing layer the context graph: a record of decision traces capturing not just what happened, but why it was allowed to happen.
Where I want to push this conversation further is here: capturing decision traces is necessary, but not sufficient. Once context graphs exist, the hard problem becomes operationalising them. Turning accumulated decision history into model-ready context under cost, latency, and reliability constraints is where systems that demo smoothly on slides will struggle in practice. Context, unmanaged, risks becoming the next form of enterprise technical debt.
The idea of decision traces may sound new in AI, but it has a familiar analogue in financial markets. In electronic trading, PnL explain reports reconstruct outcomes tick by tick: showing exactly how a trade was priced, what data and models were used, what algorithmic decisions were taken, and why money was made or lost in the seconds and minutes after execution. These reports show data lineage, but they also make decisions replayable, outcomes attributable, and behaviour auditable under extreme constraints of latency, cost, and precision.
Lineage alone is never enough. PnL explain became indispensable because it connected data, decisions, timing, and outcomes into a coherent narrative the business could trust. Startups building decision traces for AI systems would do well to borrow from this playbook when designing context graphs that need to operate in enterprise workflows.
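To make the analogy concrete, here is a minimal sketch of what a replayable decision trace might look like as a data structure. The field names and the example policy are illustrative assumptions, not a schema from the Foundation Capital piece or from any real PnL explain system; the point is that the record binds inputs, policy, action, rationale, and outcome together so decisions become attributable and queryable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionTrace:
    """One replayable decision record, modelled loosely on a PnL explain row."""
    decision_id: str
    timestamp: datetime
    inputs: dict            # snapshot of the data the decision was made against
    policy: str             # rule, model, or precedent that authorised it
    action: str             # what was actually done
    rationale: str          # why it was allowed to happen
    outcome: dict = field(default_factory=dict)  # attributed after the fact

def attribute(traces, predicate):
    """Query traces, e.g. every decision taken under a given policy."""
    return [t for t in traces if predicate(t)]

trace = DecisionTrace(
    decision_id="T-001",
    timestamp=datetime.now(timezone.utc),
    inputs={"mid_price": 101.25, "spread_bps": 2.0},
    policy="quote-within-2bps-of-mid",
    action="quoted 101.26",
    rationale="spread inside limit; no manual approval required",
)
trace.outcome = {"pnl": 0.01}
print(attribute([trace], lambda t: t.policy.startswith("quote"))[0].decision_id)
```

A context graph would hold many such records linked across workflows; the query function here stands in for the richer traversal a real graph store would provide.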
What caught my attention about context graphs isn’t just the idea of capturing decision traces; it’s what happens next. Once decision history becomes durable and queryable, context stops being a short-lived prompt input and becomes an accumulating substrate. Every new trace expands the pool of information an agent could draw on. At that point, the limiting factor is no longer access to context, but how effectively, efficiently, and predictably that context can be turned into model input.
Against that backdrop, I wanted to share a related thread of work we’ve been pursuing at STAC throughout 2025: a context engineering benchmark. It’s not a context graph in the way Gupta and Garg describe it, but I think it is increasingly likely to serve as a substrate for that world.
Where their work focuses on capturing decision lineage at the workflow level, our focus has been on a more foundational question: how effectively do different approaches to providing context work, and what do they cost?
For many real-world tasks, model choice matters less than context quality: what information you retrieve, how you structure it, how much you include, when you include it, and how consistently it can be delivered under operational constraints.
Over the past year, we’ve researched and benchmarked a range of context engineering techniques across different models and platforms. That includes variations in retrieval strategies, summarisation approaches, memory windows, prompt construction, and hybrid techniques combining structured and unstructured inputs.
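One of the simplest techniques in that space, budgeted context assembly, can be sketched in a few lines. This is an illustrative toy, not STAC's benchmark code: snippets are packed greedily in relevance order until a token budget is spent, and the whitespace token count is a crude stand-in for a model's real tokenizer.

```python
def assemble_context(snippets, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedy packing: take snippets in relevance order until the budget is spent.
    `snippets` is a list of (relevance_score, text) pairs. The default token
    counter is a whitespace proxy; a real system would use the model's tokenizer."""
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return "\n---\n".join(chosen), used

snippets = [
    (0.9, "Policy: exceptions over 10k require VP approval."),
    (0.7, "Precedent: 2024-03 exception approved for a strategic account."),
    (0.2, "General note on account onboarding procedures."),
]
context, used = assemble_context(snippets, budget_tokens=16)
print(used)  # → 15: the low-relevance snippet no longer fits
```

Even this toy exposes the tradeoff the benchmark measures: raising the budget admits more context and may lift quality, but every admitted token is paid for in latency and cost.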
The headline finding won’t surprise anyone working hands-on with these systems: better context almost always produces better outputs—more accurate responses, more grounded reasoning, and more graceful handling of edge cases.
What is less well understood is the shape of the tradeoff curve. Quality gains are rarely free. Richer context tends to drive up costs, whether through increased latency, higher token usage, more aggressive retrieval, or additional infrastructure to prepare and serve that context. Marginal improvements in output quality can come with disproportionate increases in cost or response time.
By systematically testing different context engineering approaches across multiple models and execution environments, we can start answering questions that enterprises now grapple with in production:
- Where does additional context meaningfully improve outcomes, and where does it plateau?
- Which techniques deliver the best quality-per-token or quality-per-millisecond?
- How sensitive are different models to context volume versus structure?
- What combinations of retrieval, compression, and prompting strike the most reliable balance between accuracy, cost, and latency?
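The second of those questions reduces to a ranking problem once measurements exist. The sketch below, with entirely made-up numbers and configuration names, shows the shape of the comparison: divide a quality metric by tokens consumed and rank configurations by efficiency rather than raw accuracy.

```python
def rank_by_efficiency(results):
    """Rank context configurations by accuracy per 1k tokens.
    `results` maps config name -> (accuracy, tokens_used, latency_ms).
    The figures below are illustrative, not benchmark data."""
    ranked = sorted(
        results.items(),
        key=lambda kv: kv[1][0] / (kv[1][1] / 1000),  # accuracy per 1k tokens
        reverse=True,
    )
    return [name for name, _ in ranked]

results = {
    "full-history":      (0.91, 12000, 2400),
    "retrieval-top5":    (0.88, 3000, 600),
    "summary+retrieval": (0.89, 4500, 900),
}
print(rank_by_efficiency(results)[0])  # → retrieval-top5
```

Note how the ranking flips relative to raw accuracy: the most accurate configuration is the least efficient, which is exactly the plateau-versus-cost question the benchmark is designed to surface. The same pattern applies with quality-per-millisecond as the key.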
These questions matter even more in a world of context graphs. If decision traces become first-class data, agents will increasingly operate over richer, deeper contextual substrates. Feeding that context into models efficiently and predictably will be critical. Without discipline, the cost of “knowing why” can quickly overwhelm the value of that knowledge.
Our view is that context engineering will become a core engineering competency rather than a prompt-tuning afterthought. Like any other engineering discipline, it benefits from measurement, comparability, and shared benchmarks. The context graph tells us what happened and why; a context engineering benchmark measures how effectively that accumulated knowledge can be turned back into action.
2026 may well be the year the context graph enters the enterprise mainstream. If it does, understanding the tradeoffs beneath it will matter just as much as the vision at the top.

