ERNIE 5.1 officially launched: parameters reduced to one-third of 5.0, pre-training cost only 6%
動區 BlockTempo · 2026-05-09 08:23:48


Baidu's Ernie 5.1 large model is officially live. Users can experience it on the Ernie Bot official website, while enterprises and developers can call the API via the Qianfan platform. Ernie 5.1 is trained from the Ernie 5.0 released in January this year, with total parameters compressed to about one-third of 5.0, activated parameters at about half, and pre-training compute cost at only 6% of same-scale models. The core technology is the Once-for-All elastic training framework proposed by Baidu.

(Context: DeepSeek raised $7.35 billion in its first round, outpacing Alibaba: Liang Wenfeng paid 40% out of his own pocket, seeking capital with the "fewest conditions")

(Background: Anthropic to spend $200 billion on Google Cloud over five years; two AI startups consume half of the orders from the four major cloud giants)

- Baidu Ernie 5.1 is officially live, with total parameters compressed to one-third of 5.0 and activated parameters at about half.
- Pre-training compute cost is only 6% of same-scale models; the core technology is the Once-for-All elastic training framework.
- Ernie 5.0 produced a sub-model matrix through a single pre-training; 5.1 extracts the optimal structure from that matrix and directly inherits its knowledge.

Ernie 5.1's core selling point is the sharp compression of model size and training overhead relative to 5.0, and the 5.1 version has climbed to fourth place on the Arena search leaderboard.

The cost compression comes from the Once-for-All elastic training framework proposed by Baidu. Traditional methods pre-train each model scale separately, so every model size is an independent compute investment, and the more scale versions there are, the greater the waste from redundant training. Ernie 5.0 took a different approach: it performed only one pre-training run, using dynamic sampling to simultaneously optimize a large number of sub-models of different sizes, forming a "sub-model matrix." Ernie 5.1 is the optimal structure extracted from this matrix. It directly inherits all the knowledge 5.0 accumulated during pre-training, saving the compute that training from scratch would require.

One caveat on the "6% pre-training cost" figure: Baidu did not find a cheaper way to train a same-scale model from scratch; rather, Ernie 5.1 skipped from-scratch pre-training entirely. Its training cost went mainly into selecting the optimal structure from the 5.0 sub-model matrix, plus the subsequent fine-tuning and alignment phases. Compared with the industry practice of training each model scale independently, this "train once, produce many" architecture has a structural advantage in marginal cost.

This logic differs from the low-cost training route DeepSeek announced earlier this year. DeepSeek V3 emphasizes lowering the cost of a single training run through fewer GPUs and more efficient engineering; Baidu's Once-for-All instead expands the output of a single training run from "one model" to "an entire model family."
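To make the "train once, produce many" economics concrete, here is a back-of-the-envelope sketch in Python. All cost numbers are illustrative placeholders except the ~6% figure quoted in the article; Baidu has not published an actual cost breakdown.

```python
# Illustrative cost-structure comparison. Every number here is a made-up
# placeholder except the ~6% per-derivative figure quoted in the article.

FULL_PRETRAIN_COST = 100.0   # normalized cost of one from-scratch pre-training run
N_MODEL_SIZES = 5            # how many model scales a vendor wants to ship

# Traditional approach: every scale is pre-trained independently.
traditional_total = N_MODEL_SIZES * FULL_PRETRAIN_COST

# Once-for-All-style approach, as the article describes it: one supernet
# pre-training yields a sub-model matrix; each additional scale pays only
# for structure selection plus fine-tuning/alignment (~6% per the article).
SUPERNET_PRETRAIN_COST = 130.0                     # assumed: above a single run
EXTRACTION_COST_PER_MODEL = 0.06 * FULL_PRETRAIN_COST

ofa_total = SUPERNET_PRETRAIN_COST + N_MODEL_SIZES * EXTRACTION_COST_PER_MODEL

print(f"independent runs: {traditional_total:.0f}")   # 500 in this toy setup
print(f"once-for-all    : {ofa_total:.0f}")           # 160 in this toy setup
# The marginal cost of one more scale is 100 vs 6 here, which is the
# structural advantage the article attributes to the framework.
```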
What is the difference between Ernie 5.1 and Ernie 5.0?
Ernie 5.1 is trained from 5.0, with total parameters compressed to one-third of 5.0 and activated parameters at about half. It extracts the optimal structure from the 5.0 Once-for-All sub-model matrix, inheriting all of its knowledge while running faster and cheaper at inference time.

What is the Once-for-All elastic training framework?
A training method proposed by Baidu. It performs only one pre-training run, using dynamic sampling to simultaneously optimize sub-models of different sizes, forming a model matrix. New models are extracted from the matrix, saving the compute of training from scratch and significantly reducing marginal cost.
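Baidu has not released implementation details, but the weight-sharing idea behind Once-for-All-style training, known from the neural architecture search literature, can be sketched roughly as below. The layer shapes, width choices, and random sampling scheme are assumptions for illustration, not Baidu's actual method.

```python
import random
import torch
import torch.nn as nn

# Toy weight-sharing "supernet": each training step samples a randomly
# chosen sub-network (a width slice of every layer), so one pre-training
# run jointly optimizes many model sizes.

MAX_WIDTH = 256
WIDTH_CHOICES = [64, 128, 192, 256]   # axes of the "sub-model matrix" (assumed)

class ElasticLinear(nn.Module):
    """A linear layer whose active width can shrink at runtime."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, in_w, out_w):
        # Slice the shared weight matrix: smaller sub-models reuse the
        # top-left corner of the full parameter tensor.
        return x[:, :in_w] @ self.weight[:out_w, :in_w].T + self.bias[:out_w]

class SuperNet(nn.Module):
    def __init__(self, dim_in=32, dim_out=10):
        super().__init__()
        self.fc1 = ElasticLinear(dim_in, MAX_WIDTH)
        self.fc2 = ElasticLinear(MAX_WIDTH, dim_out)

    def forward(self, x, width):
        h = torch.relu(self.fc1(x, x.shape[1], width))
        return self.fc2(h, width, 10)

net = SuperNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    x = torch.randn(16, 32)
    y = torch.randint(0, 10, (16,))
    width = random.choice(WIDTH_CHOICES)   # "dynamic sampling" of a sub-model
    loss = loss_fn(net(x, width), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Extraction": score every candidate width on held-out data and keep the
# best trade-off; the chosen slice inherits the supernet's trained weights.
with torch.no_grad():
    x_val, y_val = torch.randn(256, 32), torch.randint(0, 10, (256,))
    scores = {w: loss_fn(net(x_val, w), y_val).item() for w in WIDTH_CHOICES}
print(scores)
```

The key property is that every sampled width trains slices of the same weight tensors, so "extracting" a sub-model at the end is just slicing, not retraining; that is what lets a derivative model inherit the parent's knowledge rather than paying for from-scratch pre-training.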
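Finally, since the article notes that developers can call the model through the Qianfan platform: Qianfan exposes an OpenAI-compatible v2 endpoint, so a call would plausibly look like the sketch below. The model identifier "ernie-5.1" is a guess for illustration; consult Baidu's Qianfan documentation for the real model ID and endpoint details.

```python
# Hypothetical sketch of calling the model via Qianfan's OpenAI-compatible
# v2 endpoint. The base URL follows Qianfan's published convention; the
# model ID "ernie-5.1" is an assumed name, not a confirmed identifier.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QIANFAN_API_KEY",              # issued in the Qianfan console
    base_url="https://qianfan.baidubce.com/v2",  # OpenAI-compatible API base
)

resp = client.chat.completions.create(
    model="ernie-5.1",                           # assumed identifier
    messages=[{"role": "user",
               "content": "Summarize the Once-for-All training idea."}],
)
print(resp.choices[0].message.content)
```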