Author: RadarAI Editorial
Editor: RadarAI Editorial
Last updated: 2026-05-21
Review status: Editorial review pending
Weekly report
Official
AI Hot Topics
Gemini 3.1 Flash Live and Claude Code's desktop control capability launch at the same time. Real-time voice interaction and native GUI operation mark the tipping point for practical AI agents and open the 'hands-on' era of on-device agents.
Editorial standards and source policy: Editorial standards, Team. Content links to primary sources; see Methodology.
## This Week in Summary
- Gemini 3.1 Flash Live and Claude Code’s computer-use capability launched at the same time. Real-time voice interaction and native GUI control are becoming the baseline for practical AI agents, and on-device agents are entering the “hands-on” era.
- More than five agent frameworks and infrastructure projects, including OpenClaw, Claw Beta, Scion, Feishu CLI, and DingTalk CLI, rolled out upgrades. Sub-agents, scheduled tasks, security sandboxes, and Kubernetes management dashboards are now described as production-ready, moving agent engineering from proof of concept toward full-scale deployment.
- Qwen3.5-Omni reportedly outperforms Gemini-3.1 Pro on multimodal benchmarks, with state-of-the-art results on 215 audio/video understanding benchmarks; GLM-5V-Turbo turns hand-drawn sketches directly into runnable frontend code. Chinese multimodal foundation models now cover three critical dimensions: audiovisual understanding, visual programming, and agent execution.
- Embodied AI breaks out of simulation: GigaWorld-1 tops WorldArena globally; Zeekr deploys world models on a ¥86,800 vehicle; AAC Technologies unveils an acoustic perception solution for humanoid robots. Real-world deployment and low-cost mass production are accelerating in parallel.
- A run of Anthropic incidents (Claude Code billing anomalies, source-code leaks, and documented “obedience bias”) has triggered industry-wide reflection. OpenClaude, the Claude Agent SDK, and the NO_FLICKER terminal mode are among the open-source and engineering responses now emerging, pushing the agent ecosystem toward model-agnosticism and auditable security.
- Doubao’s large language model processes over 120 trillion tokens daily; Cloudflare cut AI inference costs by 77% using Kimi K2.5. AI applications have moved past technical validation and into the phase where throughput at scale and commercial ROI are being tested.
## Hot Topics
1. **Gemini 3.1 Flash Live now powers Google Translate’s real-time translation and Gemini Live**
https://www.bestblogs.dev/status/2037653945632579623
*Core shift:* The underlying model has been rebuilt on a low-latency architecture optimized for voice interaction, and it works with any microphone-equipped headset on either iOS or Android. This removes hardware lock-in and moves real-time multimodal interaction from lab demo to infrastructure serving hundreds of millions of users.
*Opportunity:* Individual developers can test audio streaming input against the `gemini-3.1-flash-live` model. The source suggests `curl -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-live:generateContent`, but earlier Live models stream audio over the WebSocket-based Live API rather than `generateContent`, so check the current API reference first. Product teams should adopt its “voice interruption → immediate replanning” mechanism to redesign dialogue state machines in customer support or edtech apps, replacing legacy ASR+LLM pipelines that suffer from cumulative latency.
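
The “voice interruption → immediate replanning” idea can be pictured as a small dialogue state machine. The following is a minimal Python sketch under stated assumptions, not Gemini's actual mechanism: the `Dialogue` class, its three states, and the placeholder planner are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    LISTENING = auto()
    PLANNING = auto()
    SPEAKING = auto()


@dataclass
class Dialogue:
    """Toy dialogue loop in which user speech can interrupt the agent mid-reply."""
    state: State = State.LISTENING
    plan: List[str] = field(default_factory=list)
    spoken: List[str] = field(default_factory=list)

    def on_user_utterance(self, text: str) -> None:
        # Barge-in: anything left of the current plan is discarded and a
        # fresh plan is built from the latest utterance.
        if self.state is State.SPEAKING:
            self.plan.clear()
        self.state = State.PLANNING
        self.plan = self._make_plan(text)
        self.state = State.SPEAKING

    def speak_next(self) -> Optional[str]:
        # Emit one chunk of the reply; return to listening when the plan is empty.
        if self.state is not State.SPEAKING or not self.plan:
            self.state = State.LISTENING
            return None
        chunk = self.plan.pop(0)
        self.spoken.append(chunk)
        if not self.plan:
            self.state = State.LISTENING
        return chunk

    @staticmethod
    def _make_plan(text: str) -> List[str]:
        # Placeholder planner (assumption): a real system would call the model here.
        return [f"ack: {text}", f"answer: {text}", "anything else?"]


d = Dialogue()
d.on_user_utterance("refund status")
d.speak_next()                        # agent starts replying
d.on_user_utterance("cancel order")   # user interrupts mid-reply
remaining = list(d.plan)              # plan now reflects only the new request
```

The design point is that the interruption handler owns the plan: a legacy ASR+LLM pipeline would finish the stale reply first, whereas here the unspoken chunks are dropped as soon as new speech arrives.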
2. **Claude Code now supports native macOS GUI interaction (“Computer Use”)**
https://www.bestblogs.dev/status/2038663014098899416
*Core shift:* Claude Code can now perform pixel-level operations (clicking buttons, dragging windows, reading on-screen text) in GUI apps that expose no CLI, such as Electron or SwiftUI apps. The assistant moves from “calling APIs” to “taking over the desktop,” redefining the scope of digital employees.
*Opportunity:* Developers can download the preview CLI and run `/mcp` to enable it. Product teams can build on MCP (Model Context Protocol) to prototype “automated office suites”; the source's example is `claudesdk run --app "Slack" --action "find_unread_in_channel #ai-dev"`, which would fetch unread messages and auto-generate weekly summaries, validating the workflow end to end.
3. **Feishu CLI Goes Open Source: Agent-Native Architecture Covers 11 Business Domains, Supports Structured Output & Dry Run**
https://www.bestblogs.dev/status/2037893566853435739
*Core idea:* The first CLI toolkit officially released by a SaaS vendor and designed specifically for agents. It includes AI-friendly features such as JSON Schema output, a pre-execution safety check (Dry Run), and command composition, signaling a shift of SaaS platforms toward agent collaboration infrastructure.
*Try now:* Run `pip install feishu-cli`, then `feishu calendar list --output json --dry-run` to verify permissions and response structure. On the product side, integrate it into LangGraph workflows; for example, `feishu docs create --title "Q2 OKR" --content "{agent_output}"` writes agent output straight into a document and replaces manual doc syncing.
4. **OpenClaw 3.28 Launches High-Risk Operation Pop-up Blocking + Claw Beta Adds Sub-Agents & Scheduled Tasks**
https://www.bestblogs.dev/status/2038464418284282
*Core idea:* For the first time in an open-source framework, three capabilities deliver enterprise-grade agent controllability and reliability: async interception (for example, a mandatory confirmation pop-up before a database deletion), sub-agent isolation (dedicated memory and sandbox per task), and cron-based scheduling.
*Try now:* Add `crons: ["0 9 * * *", "0 18 * * *"]` to `openclaw.yaml` to generate reports every day at 09:00 and 18:00. Individual developers can launch a sandbox with `claw run --safe-mode` and check whether destructive commands such as `rm -rf /tmp` are intercepted, which validates security policy enforcement.
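
Cron expressions are easy to misread: `0 9 * * 1` fires at 09:00 on Mondays only, while `0 9 * * *` fires at 09:00 every day. The sketch below is a deliberately small Python matcher that makes the difference concrete. It is an illustration, not OpenClaw's scheduler: it supports only `*`, single numbers, and comma lists, and it combines the day-of-month and weekday fields with AND rather than following every rule of real cron.

```python
from datetime import datetime


def field_matches(expr: str, value: int) -> bool:
    """Match one cron field; supports '*', single numbers, and comma lists only."""
    if expr == "*":
        return True
    return value in {int(part) for part in expr.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a 5-field cron expression (minute hour day month weekday) against a time."""
    minute, hour, day, month, weekday = expr.split()
    # Cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cron_weekday = when.isoweekday() % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )


monday_9am = datetime(2026, 5, 18, 9, 0)    # a Monday
tuesday_9am = datetime(2026, 5, 19, 9, 0)   # a Tuesday
tuesday_6pm = datetime(2026, 5, 19, 18, 0)

weekly_hits = [cron_matches("0 9 * * 1", t) for t in (monday_9am, tuesday_9am)]
daily_hits = [cron_matches("0 9 * * *", t) for t in (monday_9am, tuesday_9am)]
```

Checking a schedule against a few known timestamps like this is a cheap way to confirm that a "daily" report is not silently running once a week.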
5. **Qwen3.5-Omni Surpasses Gemini-3.1 Pro in Multimodal Capabilities, Enables Audiovisual Programming & Real-Time Voice Emotion Control**
https://www.bestblogs.dev/article/cc80f169
*Core idea:* Achieves SOTA across 215 audio/video understanding benchmarks. Live demos include real-time adjustment of voice pitch, volume, and emotion; multi-turn, interruption-aware travel planning; and timestamped audiovisual captioning. Together they suggest that full-modality foundation models can now support dynamic decision-making in complex scenarios.
*Try now:* Call `qwen-vl-api` with a meeting screen recording and prompt: `"Extract emotion curves for all speakers; label moments of anger, confusion, or excitement; and generate corresponding de-escalation phrases."` On the product side, embed it into smart glasses SDKs to close the loop: “detect colleague frowning → push real-time communication suggestions.”
6. **Step 3.5 Flash by StepFun Tops the OpenClaw Leaderboard, Optimized Specifically for Agent Workloads**
https://www.bestblogs.dev/status/2037527588449730627
*Core idea:* Outperforms general-purpose models on three key metrics: task completion rate (+18.3%), response stability (standard deviation of first-token latency down 41%), and tool-calling accuracy (92.7%). It is presented as the first lightweight inference model deeply optimized for agent workflows.
*Try now:* Load `step-3.5-flash` locally via Ollama and compare its execution success rate against `gemma3` on the same prompt: `ollama run step-3.5-flash "Check today’s Beijing weather and order a hot Americano"`. On the product side, deploy it as the core model for edge-based agents, replacing cloud LLMs for high-frequency, low-complexity tasks; the source cites a 63% cut in inference costs.
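
To compare two models on the three metrics named above, it helps to pin down how each one is computed. The following Python sketch assumes a hypothetical log format (the field names `completed`, `first_token_ms`, `tool_calls`, and `tool_calls_ok` are invented for illustration) and sample data that is made up, not taken from the leaderboard.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical per-run records; field names and values are assumptions for this sketch.
runs: List[Dict] = [
    {"completed": True,  "first_token_ms": 120.0, "tool_calls": 4, "tool_calls_ok": 4},
    {"completed": True,  "first_token_ms": 140.0, "tool_calls": 3, "tool_calls_ok": 2},
    {"completed": False, "first_token_ms": 100.0, "tool_calls": 3, "tool_calls_ok": 3},
    {"completed": True,  "first_token_ms": 120.0, "tool_calls": 0, "tool_calls_ok": 0},
]


def summarize(records: List[Dict]) -> Dict[str, float]:
    """Compute completion rate, first-token latency spread, and tool-call accuracy."""
    total_calls = sum(r["tool_calls"] for r in records)
    ok_calls = sum(r["tool_calls_ok"] for r in records)
    return {
        # Share of runs that finished the task.
        "task_completion_rate": mean(1.0 if r["completed"] else 0.0 for r in records),
        # Population standard deviation of first-token latency; lower means steadier.
        "first_token_latency_std_ms": pstdev(r["first_token_ms"] for r in records),
        # Correct tool calls over all tool calls, pooled across runs.
        "tool_call_accuracy": ok_calls / total_calls if total_calls else float("nan"),
    }


summary = summarize(runs)
```

Running the same `summarize` over logs from each model, with identical prompts, gives numbers that can be compared directly.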
7. **GLM-5V-Turbo Released: Turn a Hand-Drawn Sketch Directly into Runnable Frontend Code**
https://www.bestblogs.dev/article/793a379b
*Core idea:* The first multimodal coding model to cover the full pipeline from sketch to HTML/CSS/JS to browser rendering. It accepts screenshots and screen recordings as input and performs well on GUI automation tasks for AI agents, shortening the design-to-development cycle.
*Possible use cases:* Photograph a sketch (hand-drawn or from Figma) with your phone, upload it to `glm-5v-turbo-api`, and call it with `{"input_type": "sketch", "output_format": "react"}` to get component code. Product teams can integrate this into low-code platforms so that marketers or ops staff upload campaign page sketches and receive generated code, optionally submitted as a pull request.
8. **GigaWorld-1 Tops WorldArena Global Leaderboard: A High-Fidelity Embodied World Model**
https://www.bestblogs.dev/article/54cfc8d0
*Core idea:* Combines explicit action modeling with a differentiable physics engine, with reported gains in physical plausibility (+32.6%), 3D accuracy (+27.1%), and cross-scenario generalization. It is described as the first industrial-grade embodied foundation model validated on real robotic hardware.
*Possible use cases:* Download GigaWorld-1’s PyTorch weights and Unity plugin, then measure grasp success rates in simulation using `gigaworld.step(action="grasp_cup")`. Hardware teams can deploy it on UR5e robotic arms in place of legacy motion planning modules and verify whether a coffee cup grasp completes in under 2.1 seconds.
9. **Pretext: A Pure TypeScript Text Measurement Library, 500× Faster Than Traditional Approaches**
https://www.bestblogs.dev/status/2038115581883257201
*Core idea:* Zero DOM dependency. Pretext models browser text-wrapping logic mathematically, which addresses overflow and element overlap in web screenshot rendering. It is already used in Codepilot’s generative UI system.
*Possible use cases:* In a Next.js project, run `npm install pretext`, then replace `getBoundingClientRect()` measurements with `const width = pretext.measure("Hello 世界", { font: "14px Inter" })`. Product teams can embed Pretext in PDF generation services so that AI-generated reports flow across columns without overflowing A4 pages.
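
The general idea of measuring and wrapping text without a DOM can be shown with a small sketch. The Python code below is not Pretext's algorithm or API: it is a plain greedy word wrap driven by a caller-supplied width function, and the fixed per-character widths in `toy_width` are assumptions standing in for real font metrics.

```python
from typing import Callable, List


def wrap_text(text: str, max_width: float, char_width: Callable[[str], float]) -> List[str]:
    """Greedy word wrap using a caller-supplied width function instead of the DOM."""
    lines: List[str] = []
    current = ""
    current_width = 0.0
    space = char_width(" ")
    for word in text.split():
        word_width = sum(char_width(c) for c in word)
        if not current:
            # First word on a line is always placed, even if it is too wide.
            current, current_width = word, word_width
        elif current_width + space + word_width <= max_width:
            current += " " + word
            current_width += space + word_width
        else:
            lines.append(current)
            current, current_width = word, word_width
    if current:
        lines.append(current)
    return lines


def toy_width(ch: str) -> float:
    # Assumed metrics: 7 px per ASCII character, 14 px for anything else (e.g. CJK).
    return 7.0 if ord(ch) < 128 else 14.0


lines = wrap_text("Hello 世界 from a DOM-free wrapper", max_width=80, char_width=toy_width)
```

A real layout engine also handles kerning, break opportunities between CJK characters, and hyphenation, none of which this sketch attempts; it only shows why a pure function over font metrics can run on a server or in a worker where no layout engine is available.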
10. **Cloudflare Fully Integrates Kimi K2.5, Cutting AI Agent & Code Review Costs by 77%**
https://www.bestblogs.dev/status/2038984561132990836
*Core idea:* Cloudflare replaced its prior solution at scale, under the strict SLA required to serve roughly 20% of the world’s top websites, while cutting costs by 77%. This indicates that high-reasoning models are ready for high-concurrency, low-latency production environments, both technically and economically.
*Possible use cases:* Deploy the `kimi-k2.5-worker` example template in Cloudflare Workers, calling `/v1/chat/completions` with the key stored in `env.KIMI_API_KEY`. Product teams can follow the same approach by migrating existing RAG services to Kimi K2.5 + litesearch and benchmarking the real-world change in QPS and token cost.
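
When benchmarking a migration like this, the cost comparison itself is simple arithmetic over token volume and per-token prices. The sketch below shows one way to compute it in Python. The workload size and all four prices are invented for illustration and chosen so that the result lands on 77%; they are not Cloudflare's or Moonshot's actual figures.

```python
def monthly_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost in dollars, given token counts and prices per million tokens."""
    return tokens_in / 1_000_000 * price_in + tokens_out / 1_000_000 * price_out


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when moving from `before` to `after`."""
    return (before - after) / before * 100.0


# Hypothetical workload: 2B input tokens and 0.5B output tokens per month.
TOKENS_IN, TOKENS_OUT = 2_000_000_000, 500_000_000

# Hypothetical prices per million tokens (assumptions, not real price lists).
before = monthly_cost(TOKENS_IN, TOKENS_OUT, price_in=3.0, price_out=15.0)
after = monthly_cost(TOKENS_IN, TOKENS_OUT, price_in=0.69, price_out=3.45)

saving = reduction_pct(before, after)
```

Substituting measured token counts and the prices on your own invoices turns this into a like-for-like check of whether a headline saving holds for your traffic mix.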