Introduction
The silent launch of GLM-5.1 has stirred waves in the tech community. This domestic large model has not only pushed AI programming capability from ‘minute-level’ to ‘8-hour-level’ tasks but has also surpassed GPT-5.4 and Claude Opus 4.6 on demanding benchmarks such as SWE-Bench Pro. Its 10% price increase signals a shift in domestic AI from price competition to value competition.

A Major Event Without a Launch
On the night of March 27, 2026, Zhiyu quietly opened early access to GLM-5.1, with no launch event, slide deck, or technical report. On April 8, GLM-5.1 was officially released. The absence of ceremony only amplified the ripple: developers integrated it into OpenRouter, tested it in Claude Code, and debated it on platforms like X and Weibo, making it one of the hottest topics in the domestic large model scene.
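For developers who want to try the model through OpenRouter, requests follow the OpenAI-compatible chat-completion shape. The sketch below only assembles the payload; the model slug `z-ai/glm-5.1` and the endpoint constant are assumptions for illustration, so check OpenRouter's model catalog for the real identifier before use.

```python
import json

# Assumed endpoint; sending a request additionally requires an API key header.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "z-ai/glm-5.1") -> dict:
    """Assemble an OpenAI-compatible chat-completion payload.

    The model slug here is a guess for illustration, not a confirmed ID.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("Refactor this function to be iterative.")
print(json.dumps(payload, ensure_ascii=False))
```

The same payload works against any OpenAI-compatible gateway; only the base URL and model string change.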
Core Breakthrough: From ‘Minute-Level’ to ‘8-Hour Level’
To understand the significance of GLM-5.1, one must grasp how AI programming capabilities have evolved. Over the past two years, the industry has competed on ‘single-turn intelligence’: who can generate higher-quality code, and who can conjure a more impressive interface from a single sentence? This era, dubbed ‘Vibe Coding’, treats AI as a smart assistant that generates code and then waits for feedback.
GLM-5 took its first step in February this year, pushing the capability boundary to ‘Agentic Engineering’: autonomously planning, executing, and testing a complete system engineering task within 30 minutes. GLM-5.1 has now extended that boundary to 8 hours.
This is not a metaphor but a number validated through benchmark testing:
- KernelBench Level 3: GLM-5.1 optimized independently for over 24 hours on 50 real machine-learning workloads, completing 655 iterations and improving vector-database query throughput to 6.9 times that of the initial version.
- Linux Desktop Build: Built a complete Linux desktop system from scratch in 8 hours.
- METR Rankings: alongside Claude Opus 4.6, GLM-5.1 is one of the few models globally with validated 8-hour continuous working capability, and the only open-source model to achieve it.
Its mode of operation has shifted from ‘generate code → wait for feedback’ to a complete loop of ‘experiment → analyze → optimize’. The model actively runs benchmarks, identifies bottlenecks, adjusts strategy, and improves over many iterations: less a tool waiting for commands, more an engineer who drives the work forward.
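The ‘experiment → analyze → optimize’ loop described above can be sketched as a simple hill-climbing skeleton: propose a change, measure it against a benchmark, and keep it only if the score improves. This is a toy illustration of the iteration pattern, not Zhiyu's actual agent logic; every name and the toy benchmark are invented.

```python
import random

def agentic_optimize(benchmark, mutate, config, iterations=100, seed=0):
    """Toy iterate-against-a-benchmark loop: each round experiments with a
    variation, analyzes its measured score, and keeps only improvements."""
    rng = random.Random(seed)
    best_score = benchmark(config)
    for _ in range(iterations):
        candidate = mutate(config, rng)   # experiment: try a variation
        score = benchmark(candidate)      # analyze: measure it
        if score > best_score:            # optimize: keep improvements
            config, best_score = candidate, score
    return config, best_score

# Toy objective: score peaks when batch_size approaches 64.
bench = lambda c: -abs(c["batch_size"] - 64)
mut = lambda c, rng: {"batch_size": c["batch_size"] + rng.choice([-8, -1, 1, 8])}
cfg, score = agentic_optimize(bench, mut, {"batch_size": 8}, iterations=200)
```

A real agent replaces the toy benchmark with an actual test or profiling run and the mutation with code edits, but the accept-only-improvements loop structure is the same.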
Benchmark Battle: Domestic Models Reach the Top
While scores are not everything, they are the clearest language. On the averaged scores of three representative code evaluation benchmarks—SWE-Bench Pro, Terminal-Bench 2.0, and NL2Repo—GLM-5.1 ranked third globally, first among domestic models, and first among open-source models.
Notably, in the SWE-Bench Pro single-item ranking, which requires models to locate and fix high-difficulty engineering bugs in real GitHub repositories, GLM-5.1 set a new global best score, surpassing GPT-5.4 and Claude Opus 4.6. This is the first time a domestic open-source model has reached the top position in a core programming ranking.
A year ago, GLM-5.0 scored 35.4 on SWE-Bench; GLM-5.1 now scores 45.3, a gain of nearly 10 points (roughly 28%). The gap with Claude Opus 4.6 has narrowed to less than 3 points.
Price Increase Signal: A Shift in Domestic AI Confidence
Another noteworthy detail of this release is that Zhiyu has raised prices by 10%. The price of cached tokens in coding scenarios has approached that of Claude Sonnet 4.6. This is a signal, perhaps even a turning point.
Just a year ago, domestic large model vendors competed by slashing prices by over 90% to attract users. Now, Zhiyu has chosen to raise prices, anchoring its performance premium to international benchmarks rather than relying on low prices to defend market share.
What does this mean? It indicates that domestic models have begun to gain confidence in their pricing power. A model that dares to raise prices must meet two prerequisites: performance that does not lag behind competitors and the ability to retain users. GLM-5.1 meets both.
This marks a genuine transition from ‘price competition’ to ‘value competition’.
In-Depth Evaluation: The Significance and Boundaries of This Breakthrough
What is the true significance?
The most important significance of GLM-5.1 is not that it surpassed a particular rival or topped a particular leaderboard, but that it defines a new evaluation dimension: not just ‘how smart it is’, but ‘how long it can work’.
In the past, we used benchmarks to measure how smart a model is in a single interaction. However, real engineering tasks are not single interactions; they involve continuous hours of decision-making, execution, debugging, and fixing. The breakthrough of GLM-5.1 in this dimension means AI is one step closer to truly ‘replacing junior engineers’. More directly: AI tools have begun to show the capability to undertake complete engineering projects.
For developers, this signifies a new workflow: no longer ‘let AI help me write this piece of code’, but ‘throw this task to AI and check the results tomorrow’.
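That ‘submit now, check tomorrow’ workflow is essentially an asynchronous job board between a developer and a long-running agent. The minimal shape below is hypothetical; `AgentJobBoard` and its methods are invented names for illustration, not any vendor's API.

```python
import uuid

class AgentJobBoard:
    """Toy fire-and-forget workflow: tasks are queued for a long-running
    agent, and results are picked up later by job id."""

    def __init__(self):
        self._jobs = {}

    def submit(self, task: str) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"task": task, "status": "queued", "result": None}
        return job_id

    def complete(self, job_id: str, result: str) -> None:
        # In practice the agent runtime would call this after hours of work.
        self._jobs[job_id].update(status="done", result=result)

    def check(self, job_id: str) -> dict:
        return self._jobs[job_id]

board = AgentJobBoard()
jid = board.submit("Fix the flaky integration tests and open a PR.")
# ... next morning ...
board.complete(jid, "12 tests fixed; PR opened.")
print(board.check(jid)["status"])  # done
```

A production version would persist jobs and stream intermediate logs, but the contract stays the same: hand over a task, come back for the result.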
Where are the boundaries and limitations?
Of course, it is essential to note a few points:
- There is still a gap between scores and real-world performance. Benchmarks like SWE-Bench are designed with clear metrics, while real projects often involve ambiguous requirements and implicit constraints. How GLM-5.1 performs on tasks without defined numerical metrics, such as Linux desktop builds, still requires more practical validation.
- “8 hours” is a milestone, not the endpoint. Zhiyu acknowledges that maintaining execution consistency after thousands of tool calls, escaping local optima earlier, and establishing self-evaluation mechanisms without numerical indicators are significant technical challenges that still need to be addressed.
- Price alignment is a double-edged sword. Raising prices signals confidence, but it also means actively giving up the ‘low-price advantage’. Before user stickiness is fully established, this demands stronger product capability to sustain.
- The computing power ecosystem remains a variable. Zhiyu announced it is urgently expanding using domestic chip WanKa clusters, which is an important strategic signal, but the actual carrying capacity and stability of domestic computing power still require time to validate.
The Direction of This Competition
The launch of GLM-5.1 is embedded in a larger narrative. In 2026, global AI competition has entered a new phase: it is no longer about ‘who can create a smarter model’ but ‘who can create a more capable agent’. From Anthropic’s Claude Opus series to OpenAI’s GPT-5.x, and China’s DeepSeek, Qwen, and GLM, the focus of competition has clearly shifted to autonomous execution capabilities.
Within this framework, GLM-5.1’s 8-hour continuous working capability is not an isolated technical number but a key proof of domestic AI’s position in the ‘Agent era’.
Moreover, GLM-5.1’s open-source release means global developers can build applications on it, iterate continuously, and contribute back to the community; this ecosystem effect is a form of long-term competitiveness.