DeepSeek V4 Open Weights Drop as Independent Benchmark Verification Begins

DeepSeek's full V4 open weights are now available under Apache 2.0. Early third-party testing confirms the trillion-parameter MoE model runs on dual RTX 4090s in INT8 — but independent benchmark results are still trickling in, and not all of DeepSeek's claims are holding up cleanly.

AI Newspaper Today · 5 min read

The Weights Are Out. The Verdicts Are Coming.

Two days after DeepSeek's formal V4 announcement, the AI research community is doing what it does best: downloading a trillion-parameter model and putting it through its paces. The full open weights — released under Apache 2.0 — are now available, and early results from independent evaluators are beginning to surface. The picture is promising but complicated.

DeepSeek's internal benchmarks claimed 90% on HumanEval and over 80% on SWE-bench Verified. Those scores, if confirmed, would place V4 alongside the best proprietary models from Anthropic, OpenAI, and Google. But as the community has learned from previous high-profile releases, self-reported numbers do not always survive independent scrutiny.

What Early Testing Shows

Preliminary results from researchers running V4 on standardized evaluation harnesses paint a nuanced picture. The model's coding capabilities appear genuinely strong — HumanEval scores in the high 80s to low 90s are being reported across multiple independent runs, consistent with DeepSeek's claims. On SWE-bench, the numbers are harder to pin down due to the benchmark's complexity and variance across different evaluation setups.
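
For readers who want to see what such a run involves, the sketch below shows one common setup: generating completions against a locally served copy of the weights and scoring them with OpenAI's reference human-eval harness (github.com/openai/human-eval). The local endpoint and served model name are placeholder assumptions, not anything DeepSeek has published.

```python
# Minimal HumanEval run sketch: generate completions from a locally
# served model, then score with OpenAI's reference harness.
# The endpoint and model name below are hypothetical placeholders.
from human_eval.data import read_problems, write_jsonl
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # e.g. a local vLLM server

def complete(prompt: str) -> str:
    # Greedy decoding keeps runs reproducible across evaluators.
    resp = client.completions.create(
        model="deepseek-v4",  # hypothetical served-model name
        prompt=prompt,
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].text

problems = read_problems()
samples = [
    {"task_id": tid, "completion": complete(p["prompt"])}
    for tid, p in problems.items()
]
write_jsonl("v4_samples.jsonl", samples)
# Score with the harness CLI:
#   evaluate_functional_correctness v4_samples.jsonl
```

Much of the run-to-run variance evaluators report comes from decoding settings alone, which is why independent runs pin temperature to zero or report pass@k over many samples.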

The more interesting findings involve capabilities DeepSeek did not heavily market. V4's performance on long-context reasoning tasks — where the model must synthesize information across hundreds of thousands of tokens — appears to be a genuine step forward. The Engram conditional memory architecture seems to deliver on its promise of maintaining coherence across the full million-token context window, though edge cases and failure modes are still being catalogued.
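
A crude way to probe this yourself is a "needle in a haystack" test: bury one fact deep in a long filler context and ask the model to retrieve it. The sketch below, again against a hypothetical local endpoint, shows the shape of such a probe at roughly 200,000 tokens; serious evaluations sweep needle depths and context lengths up to the full window.

```python
# Toy long-context retrieval probe ("needle in a haystack").
# Endpoint and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

filler = "The sky was grey and the meeting ran long. " * 20000  # ~860k chars, roughly 200k tokens
needle = "The access code for the vault is 7413."
half = len(filler) // 2
haystack = filler[:half] + needle + filler[half:]  # bury the needle mid-context

resp = client.chat.completions.create(
    model="deepseek-v4",  # hypothetical served-model name
    messages=[{"role": "user",
               "content": haystack + "\n\nWhat is the access code for the vault?"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)  # a coherent model answers "7413"
```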

Where V4 stumbles, according to early reports, is on certain types of nuanced instruction following and tasks requiring careful calibration of confidence. The model occasionally exhibits the overconfidence that has characterized previous MoE architectures, asserting incorrect answers with high apparent certainty. This is not unique to DeepSeek — it is a known challenge with expert routing systems — but it tempers some of the benchmark enthusiasm.
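
Calibration, unlike vibes, is measurable. Expected calibration error (ECE) is the standard metric here: it compares a model's stated confidence with its actual accuracy, bin by bin. The sketch below computes it on illustrative numbers, not real V4 outputs.

```python
# Expected calibration error (ECE) on illustrative data -- not real
# V4 measurements. A well-calibrated model scores near zero.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between average stated confidence and actual accuracy,
            # weighted by the fraction of answers landing in this bin.
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example of the overconfidence pattern: "95% sure" but right 60% of the time.
print(expected_calibration_error([0.95] * 10, [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]))  # 0.35
```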

Running a Trillion Parameters on Consumer Hardware

One of V4's most consequential features is its accessibility at inference time. The MoE architecture activates only 37 billion parameters per token, and the quantized versions dramatically reduce hardware requirements. Independent testers have confirmed that the INT8 quantized model runs on dual RTX 4090s with 48GB of combined VRAM, delivering usable throughput for development and experimentation.

An INT4 quantized version fits on a single RTX 5090 with 32GB of VRAM. While quality degradation at INT4 is measurable — particularly on math-heavy and complex reasoning tasks — it remains sufficient for many practical applications including code generation, summarization, and conversational use.
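
In practice, loading a quantized checkpoint looks something like the sketch below, using Hugging Face transformers with bitsandbytes. The repository ID is a hypothetical placeholder; swapping load_in_4bit for load_in_8bit mirrors the dual-4090 setup, with device_map="auto" sharding layers across both cards.

```python
# Sketch of loading a 4-bit-quantized checkpoint on a single GPU.
# The repo ID is a hypothetical placeholder, not a confirmed release path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V4"  # hypothetical repo ID

quant = BitsAndBytesConfig(
    load_in_4bit=True,                       # use load_in_8bit=True for the INT8 setup
    bnb_4bit_compute_dtype=torch.bfloat16,   # store weights in 4-bit, compute in bf16
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",  # spreads layers across all available GPUs
)

inputs = tok("def quicksort(xs):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```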

This matters because it puts a frontier-class model into the hands of individual researchers, small startups, and university labs that cannot afford multi-node GPU clusters. The democratization effect is real: a model that two years ago would have required millions of dollars in infrastructure to run can now be accessed on hardware that costs under $4,000.

The Huawei Ascend Training Story Gets Scrutiny

DeepSeek's claim that V4 was trained entirely on Huawei Ascend 910B accelerators and Cambricon MLU chips — without any Nvidia hardware — continues to draw both admiration and skepticism. Hardware analysts note that the Ascend 910B's theoretical peak performance makes trillion-parameter training feasible, though significantly less efficient than it would be on comparable Nvidia H100 clusters.

The training cost question remains unresolved. Earlier reports suggested approximately $5.2 million, which would be extraordinary for a model of this scale. DeepSeek has not released a detailed technical report with training compute metrics, making independent cost estimation difficult. Some researchers have suggested the true cost may be higher, with the $5.2 million figure potentially reflecting only marginal compute costs and excluding infrastructure, personnel, and earlier experimental runs.

Regardless of the exact figure, the strategic implication is clear: China's domestic chip ecosystem can support frontier-scale training. For U.S. policymakers who viewed export controls as a throttle on Chinese AI development, V4 is an inconvenient data point.

API Pricing Puts Pressure on Proprietary Labs

DeepSeek is offering hosted API access at approximately $0.30 per million tokens — a price point that significantly undercuts comparable proprietary models. For context, frontier API pricing from major Western labs typically ranges from $3 to $15 per million tokens for their most capable models.

Even accounting for potential differences in quality and reliability, a 10x to 50x price differential creates real competitive pressure. Developers building AI-powered applications — particularly those in cost-sensitive markets or processing high token volumes — now have a credible open-source alternative at a fraction of the price.
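
The dollar impact is easy to make concrete. At the prices quoted above, a workload of one billion tokens per month works out as follows:

```python
# Back-of-the-envelope monthly cost at the prices quoted above, for a
# workload of 1B tokens/month. Figures are the article's, not an
# official rate card.
MONTHLY_TOKENS = 1_000_000_000

for name, usd_per_mtok in [
    ("DeepSeek V4 (hosted)", 0.30),
    ("Frontier API, low end", 3.00),
    ("Frontier API, high end", 15.00),
]:
    cost = MONTHLY_TOKENS / 1_000_000 * usd_per_mtok
    print(f"{name}: ${cost:,.0f}/month")
# -> $300 vs. $3,000 vs. $15,000: the 10x-to-50x spread in dollar terms.
```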

The downstream effects could be significant. If V4's quality holds up to sustained real-world use, it may accelerate the pricing pressure that has already been driving API costs down across the industry. OpenAI, Anthropic, and Google have all reduced prices in recent months; V4 gives them another reason to continue.

What the Next Weeks Will Reveal

The AI community's full assessment of V4 will take time. Large-scale benchmark suites from organizations like LMSYS, Scale AI, and Hugging Face's Open LLM Leaderboard are expected in the coming weeks. These evaluations will test V4 across a broader range of tasks than DeepSeek's internal benchmarks, including multilingual performance, safety alignment, and real-world application scenarios.

The multimodal capabilities that DeepSeek has teased — text, image, and video generation from a single model — remain largely untested by the community. If those features work as described, V4 would be the first open-source model to credibly compete across all major modalities.

For now, the release represents a genuine inflection point for open-source AI. Not because V4 is definitively better than proprietary alternatives — that remains to be established — but because it demonstrates that trillion-parameter open models are viable, accessible, and improving faster than many in the industry expected. The era of open-source AI playing catch-up may be ending.
