動區 BlockTempo · 2026-04-23 05:46:25

Beating GPT-5.4 at 2 cents per query: Perplexity reveals post-training recipe for search agents

Perplexity has disclosed the post-training process behind its web-search Agent, revealing that models built on the open-source Qwen3.5 series outperform GPT-5.4 in search accuracy at a cost of only 2.0 cents per task, less than a quarter of GPT-5.4's.

(Context: Perplexity Personal Computer is live: letting AI take over the local Mac, available to Max users for a $200 monthly fee)

(Background: Can you use it without coding? Perplexity Computer lets AI deliver results and automate workflows for you)

Can open-source models beat closed-source flagships? With this technical report, Perplexity has given an answer the industry cannot ignore. The company, which started with AI search, has fully disclosed the post-training methodology for its web-search Agent. The entire training pipeline rests on two open-source models from Alibaba's Qwen series: Qwen3.5-122B-A10B and Qwen3.5-397B-A17B. That choice alone signals that Perplexity does not intend to pay for GPT or Claude as a backbone, but to start from open-source models and craft its own search capability.

Training proceeds in two stages. The first is supervised fine-tuning (SFT): the model is fed a large volume of reference answers to learn basic behavioral rules. Responses must follow instructions, stay in a consistent language, and use correct formatting. This stage does not aim for intelligence but for reliability, much like instilling good work habits in a new employee before training their judgment.

The second stage is reinforcement learning (RL) with the GRPO algorithm. The model repeatedly attempts real tasks and adjusts its strategy based on the quality of each outcome. GRPO's distinguishing trait is that it requires no separately trained "judge AI": it extracts a learning signal by comparing outputs from the same batch against one another, which makes training cheaper and easier to scale.

The RL training data comes in two streams. One is a multi-hop question bank synthesized by Perplexity: the model must search for a first fact, use that fact to search for the next, and repeat this two to four times to reach the final answer. Such questions train the model to treat search as a chain of logical steps rather than a one-off keyword query. The other stream is rubric-based dialogue data, which converts the good habits established during SFT, such as following the required format and keeping the language consistent, into quantifiable conditions for the RL stage, so the model does not discard basic discipline while chasing high scores.

The hardest problem in RL training is defining "good search behavior." If the scoring criteria are poorly designed, the model readily learns answers that sound fluent but are factually wrong; persuasiveness and accuracy are two different things, yet AI training signals often conflate them. Perplexity's remedy is called gated aggregation. The core rule: preference scores are computed only if the answer itself is correct. A model that answers incorrectly earns no bonus, no matter how polished its output looks. This gate puts factual accuracy ahead of every preference evaluation, ensuring the reward signal is always tied to whether the answer is right rather than whether the tone is pleasing.
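Perplexity does not publish code for this gate, but the described logic fits in a few lines of Python. The function name, signature, and bonus weighting below are illustrative assumptions; only the gating rule itself, that preference scores count only when the answer is correct, comes from the report.

```python
# Minimal sketch of the "gated aggregation" reward described above.
# All names and weights are hypothetical, not Perplexity's implementation.

def gated_reward(is_correct: bool, rubric_score: float) -> float:
    """Correctness gates everything: wrong answers earn zero reward.

    is_correct:   whether the final answer matched the reference fact.
    rubric_score: a 0..1 preference score for formatting, language
                  consistency, and other habits instilled during SFT.
    """
    if not is_correct:
        return 0.0               # no bonus, however fluent the output
    return 1.0 + rubric_score    # correctness first, preferences on top
```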
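These gated rewards then plug into GRPO's batch comparison. A sketch of the group-relative advantage follows, assuming the standard mean and standard-deviation normalization from the original GRPO formulation; the report confirms GRPO is used but does not print this step.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: judge each rollout against its own batch.

    No separately trained critic or "judge AI" is needed; the baseline
    is simply the mean reward of the rollouts sampled for the same
    query, normalized by their spread.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Eight rollouts of one query, scored by gated_reward above: correct
# rollouts land above the baseline, fluent-but-wrong ones below it.
print(grpo_advantages([0.0, 1.8, 0.0, 1.6, 0.0, 0.0, 1.7, 0.0]))
```

Because the baseline is the group itself, a polished but wrong answer is automatically pushed below its correct siblings, which is exactly the property the gate exists to guarantee.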
The logic of the efficiency penalty is also worth noting. The benchmark for "too many tool calls" is not a fixed number but the average number of calls used by other models in the same batch that answered correctly. In plain terms: if your peers got the answer right with three searches and you used seven, you are penalized for inefficiency even though you were correct.

Evaluation uses FRAMES, an industry-recognized multi-hop search benchmark whose questions are designed to require reasoning across multiple sources and steps. On it, the post-trained Qwen3.5-397B-SFT-RL reaches 57.3% accuracy with a single tool call, beating GPT-5.4 and Claude Sonnet 4.6 by roughly 5 percentage points each.

But accuracy is only the first layer of the story. The truly striking figures sit in the cost column. With the tool-call limit relaxed to four, accuracy stands at 73.9% for Qwen3.5-397B-SFT-RL, 67.8% for GPT-5.4, and 62.4% for Claude Sonnet 4.6. Holding the highest accuracy would already be competitive on its own; the cost per query, however, is 2.0 cents, less than a quarter of what GPT-5.4 costs.
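The batch-relative efficiency penalty described above lends itself to the same kind of sketch. The penalty weight and the function name here are assumptions; the report specifies only that the baseline is the average call count among rollouts in the same batch that answered correctly.

```python
def efficiency_penalty(tool_calls: int,
                       correct_peer_calls: list[int],
                       weight: float = 0.1) -> float:
    """Penalize only the calls beyond what correct peers needed.

    The budget is not a fixed constant: it is the mean number of tool
    calls among rollouts in the same batch that got the answer right.
    """
    if not correct_peer_calls:
        return 0.0   # no correct peers this batch, nothing to compare to
    baseline = sum(correct_peer_calls) / len(correct_peer_calls)
    return weight * max(0.0, tool_calls - baseline)

# If correct peers averaged 3 searches and this rollout used 7, it
# forfeits 0.1 * 4 = 0.4 reward even though its answer was right.
print(efficiency_penalty(7, [3, 3, 3]))   # -> 0.4
```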