DeepSeek has started the year to much fanfare, releasing two papers within days of one another while rumours of a V4 model continue to circulate. Their previous research produced models with faster inference and strong reasoning capability. However, a year on from their “Giant Killing” moment, the gap between open- and closed-source LLM quality (or reasoning capability) has widened[1].
The two recent papers, “Engram” and “manifold-constrained Hyper-Connections” (mHC), are a step towards addressing this gap and, as usual, were designed with infrastructure at the forefront.
The Engram paper introduces “Conditional Memory,” an additional module that helps an LLM learn static ‘knowledge’ (such as facts and named entities) more easily while simultaneously enhancing its reasoning capabilities. Conditional Memory is implemented as a static lookup table, making it computationally inexpensive compared to attention modules; but it adds to the GPU “memory wall” problem, meaning a naïve implementation of this module would increase inference latency. However, with careful design, a system can perform the lookup ahead of time from cheaper host DRAM or even SSDs via PCIe, only marginally increasing inference latency.
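To make the prefetch idea concrete, here is a minimal sketch in Python. It assumes (as the lookup-table framing suggests) that the memory keys depend only on the input tokens, so the gather from host memory can be issued before the accelerator needs the result. The table layout, key scheme, and fusion step are all hypothetical illustrations, not DeepSeek’s actual design.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical conditional-memory table: one vector per key, held in
# host DRAM (or a memory-mapped file on SSD) instead of GPU HBM.
VOCAB, DIM = 10_000, 64
rng = np.random.default_rng(0)
memory_table = rng.standard_normal((VOCAB, DIM)).astype(np.float32)

pool = ThreadPoolExecutor(max_workers=1)

def prefetch(keys):
    # Keys are derived from the input tokens alone, so this gather can
    # start before (and overlap with) the attention layers' compute.
    return pool.submit(lambda: memory_table[keys])

def forward(hidden, fetched):
    # By the time the layer needs the vectors, the background gather is
    # (ideally) done: the lookup costs bandwidth, not critical-path time.
    mem = fetched.result()
    return hidden + mem  # toy fusion of retrieved memory into the stream

keys = np.array([3, 17, 42])          # known before the forward pass
fut = prefetch(keys)                   # issued ahead of time
hidden = np.zeros((3, DIM), dtype=np.float32)
out = forward(hidden, fut)
```

The point of the sketch is the ordering: `prefetch` is called as soon as the tokens are known, so the slow host-side gather hides behind useful compute instead of stalling it.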
Complementing this, the mHC paper refines “Hyper-Connections.” A hyper-connection can be seen as an extension of the residual stream, a fundamental deep-learning technique that stabilizes learning (via backpropagation) and provides a “highway” for signals/information to pass between a model’s layers. Their refinement, dubbed manifold-constrained, expands the residual stream width (4x), improving the flow of signals/information between layers without inflating the model’s overall size or harming training stability. The wider residual stream places additional demand on the inter-accelerator network when model weights are sharded across accelerators, increasing inference latency. However, advanced scheduling (DualPipe) can mask these latencies.
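A minimal sketch of the general hyper-connection idea, in plain NumPy: instead of one residual vector per token, the model carries N parallel streams that each layer reads from, writes to, and mixes. The read/write/mix weights here are hypothetical stand-ins, and the paper’s manifold constraint on the mixing matrix is omitted; the sketch only shows where the 4x activation traffic comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 8, 4  # hidden size; stream expansion factor (4x, per the paper)

def block(x):
    # Stand-in for an attention/MLP block: any function of a d-dim vector.
    return np.tanh(x)

# Hypothetical learned weights: how the block reads from the N streams,
# writes its output back, and how the streams mix with one another.
read_w  = rng.standard_normal(N) / N                       # streams -> input
write_w = rng.standard_normal(N)                           # output -> streams
mix_w   = np.eye(N) + 0.01 * rng.standard_normal((N, N))   # stream mixing

def hyper_connection_step(streams):
    # streams: (N, DIM) -- 4x the activation memory of a plain residual,
    # which is exactly the extra inter-accelerator traffic that sharded
    # inference must hide (e.g. via DualPipe-style scheduling).
    h = read_w @ streams                  # collapse streams to block input
    out = block(h)
    return mix_w @ streams + np.outer(write_w, out)

streams = np.tile(rng.standard_normal(DIM), (N, 1))  # init: replicated input
streams = hyper_connection_step(streams)
```

With N = 1, `read_w = write_w = mix_w = 1`, this degenerates to the familiar residual connection `x + block(x)`, which is why hyper-connections read as a widening of the residual stream rather than a new module.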
The two papers follow an “algorithm-system co-design” philosophy, where model architectures are explicitly engineered to maximize the utility of today’s infrastructure. Furthermore, the company has shown that model quality (potentially) scales with conditional-memory size and the number of hyper-connections. The wider industry is already obsessing over memory bandwidth and interconnects, evidenced by the push for HBM4 in next-gen Rubin GPUs and for faster inter-accelerator networking. However, these infrastructure advancements are largely driven by the needs of frontier US AI labs.
Crucially, these hardware advancements may not be readily accessible to DeepSeek given the current geopolitical landscape. This raises a fascinating question: is this a replay of the silicon-manufacturing story, but applied to model architecture? Perhaps not, but it does call into question the longevity of novel architectures in an evolving landscape that likely isn’t within DeepSeek’s sphere of influence. The anticipated V4 model, which is likely to incorporate these design elements, may prove otherwise. If successful and widely adopted, it could shift influence towards open-source model developers.
[1] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

