## This Week in Summary

- Gemini 3.1 Flash Live and Claude Code’s computer-use capability launched simultaneously. Real-time voice interaction and native GUI control are now the defining thresholds for practical AI agents, and on-device agents have officially entered the “hands-on” era.
- More than five major agent frameworks and infrastructure projects, including OpenClaw, Claw Beta, Scion, Feishu CLI, and DingTalk CLI, have rolled out coordinated upgrades. Sub-agents, scheduled tasks, security sandboxes, and Kubernetes management dashboards are now production-ready, shifting agent engineering from PoC to full-scale deployment cycles.
- Qwen3.5-Omni outperforms Gemini-3.1 Pro across all multimodal benchmarks; GLM-5V-Turbo turns hand-drawn sketches directly into runnable frontend code. Chinese multimodal foundation models have now closed the loop across three critical dimensions: audiovisual understanding, visual programming, and agent execution.
- Embodied AI breaks out of simulation: GigaWorld-1 tops WorldArena globally; Zeekr deploys world models on a ¥86,800 vehicle; AAC Technologies unveils an acoustic perception solution for humanoid robots. Real-world deployment and low-cost mass production are accelerating in parallel.
- Anthropic’s multiple incidents (Claude Code billing anomalies, source-code leaks, and documented “obedience bias”) have triggered industry-wide reflection. OpenClaude, the Claude Agent SDK, and NO_FLICKER terminal mode are among the open-source and engineering responses now emerging rapidly, pushing the agent ecosystem toward model-agnosticism and auditable security.
- Doubao’s large language model processes over 120 trillion tokens daily; Cloudflare cut AI inference costs by 77% using Kimi K2.5. AI applications have moved past technical validation and into the high-stakes phase of scaling throughput and verifying commercial ROI.

## Hot Topics

1.
   **Gemini 3.1 Flash Live now powers Google Translate’s real-time translation and Gemini Live**
   https://www.bestblogs.dev/status/2037653945632579623
   - *Core shift:* The underlying model has been rebuilt as a low-latency architecture optimized specifically for voice interaction, and it works with any microphone-equipped headset on either iOS or Android. This breaks hardware lock-in and marks the transition of real-time multimodal interaction from lab demo to infrastructure serving hundreds of millions of users.
   - *Opportunity:* Individual developers can test audio streaming input right away via `curl -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-live:generateContent`. Product teams should adopt its “voice interruption → immediate replanning” mechanism to redesign dialogue state machines in customer support or edtech apps, replacing legacy ASR+LLM pipelines that suffer from cumulative latency.
2. **Claude Code now supports native macOS GUI interaction (“Computer Use”)**
   https://www.bestblogs.dev/status/2038663014098899416
   - *Core shift:* Described as the first LLM capability to perform pixel-level operations (clicking buttons, dragging windows, reading on-screen text) in GUI apps that expose no CLI, such as Electron or SwiftUI apps. AI evolves from “calling APIs” to “taking over the desktop,” redefining the scope of digital employees.
   - *Opportunity:* Developers should download the preview CLI and run `/mcp` to enable it immediately. Product teams can build on its MCP support to rapidly assemble “automated office suites”; for example, `claudesdk run --app "Slack" --action "find_unread_in_channel #ai-dev"` fetches unread messages so a weekly summary can be generated automatically, validating the workflow end to end.
3.
   **Feishu CLI Goes Open Source: Agent-Native Architecture Covers 11 Business Domains, Supports Structured Output & Dry Run**
   https://www.bestblogs.dev/status/2037893566853435739
   - *Core idea:* The first CLI toolkit officially released by a SaaS vendor and *designed specifically for agents*. It includes AI-friendly features such as JSON Schema output, a pre-execution safety check (Dry Run), and command composition, marking a systemic shift of SaaS platforms into agent collaboration infrastructure.
   - *Try now:* Run `pip install feishu-cli`, then `feishu calendar list --output json --dry-run` to verify permissions and response structure. On the product side, integrate it into LangGraph workflows; for example, use `feishu docs create --title "Q2 OKR" --content "{agent_output}"` to auto-align goals and replace manual doc syncing.
4. **OpenClaw 3.28 Launches Confirmation Pop-ups for High-Risk Operations + Claw Beta Adds Sub-Agents & Scheduled Tasks**
   https://www.bestblogs.dev/status/2038464418284282
   - *Core idea:* For the first time in an open-source framework, enterprise-grade agent controllability and reliability are achieved via three capabilities: async interception (e.g., mandatory confirmation pop-ups before database deletion), sub-agent isolation (dedicated memory/sandbox per task), and cron-based scheduling.
   - *Try now:* Add `crons: ["0 9 * * *", "0 18 * * *"]` to `openclaw.yaml` to auto-generate daily morning and evening reports (09:00 and 18:00 every day). Individual developers can launch a sandbox with `claw run --safe-mode` and test whether dangerous commands like `rm -rf /tmp` get intercepted, validating security policy enforcement.
5. **Qwen3.5-Omni Surpasses Gemini-3.1 Pro in Multimodal Capabilities, Enables Audiovisual Programming & Real-Time Voice Emotion Control**
   https://www.bestblogs.dev/article/cc80f169
   - *Core idea:* Achieves SOTA across 215 audio/video understanding benchmarks.
     Live demos include real-time adjustment of voice pitch, volume, and emotion; interruption-aware multi-turn travel planning; and timestamped audiovisual captioning. Together they demonstrate that full-modality foundation models now support dynamic decision-making in complex scenarios.
   - *Try now:* Call `qwen-vl-api` with a meeting screen recording and the prompt `"Extract emotion curves for all speakers; label moments of anger, confusion, or excitement; and generate corresponding de-escalation phrases."` On the product side, embed it into smart glasses SDKs to close the loop: “detect colleague frowning → push real-time communication suggestions.”
6. **Step 3.5 Flash by StepFun Tops the OpenClaw Leaderboard, Optimized Specifically for Agent Workloads**
   https://www.bestblogs.dev/status/2037527588449730627
   - *Core idea:* Outperforms general-purpose models significantly on three key metrics: task completion rate (+18.3%), response stability (standard deviation of first-token latency down 41%), and tool-calling accuracy (92.7%). It’s the first lightweight inference model deeply optimized for agent workflows.
   - *Try now:* Load `step-3.5-flash` locally via Ollama and compare execution success rates by running `ollama run step-3.5-flash "Check today’s Beijing weather and order a hot Americano"` and then the same prompt against `gemma3`. On the product side, deploy it as the core model for edge-based agents, replacing cloud LLMs for high-frequency, low-complexity tasks and cutting inference costs by 63%.
7. **GLM-5V-Turbo Released: Turn a Hand-Drawn Sketch Directly into Runnable Frontend Code**
   https://www.bestblogs.dev/article/793a379b
   - *Core idea:* The first multimodal coding model to achieve an end-to-end pipeline: sketch → HTML/CSS/JS → browser rendering. It supports screenshot and screen recording inputs and excels in GUI automation tasks for AI agents, dramatically shortening the design-to-development cycle.
   - *Possible use cases:* Snap a photo of a Figma sketch with your phone, upload it to `glm-5v-turbo-api`, and call it with `{"input_type": "sketch", "output_format": "react"}` to get ready-to-use component code. Product teams can integrate this into low-code platforms so marketers or ops staff can upload campaign page sketches, instantly generate production-ready code, and even auto-submit PRs.
8. **GigaWorld-1 Tops WorldArena Global Leaderboard: A High-Fidelity Embodied World Model**
   https://www.bestblogs.dev/article/54cfc8d0
   - *Core idea:* Combines explicit action modeling with a differentiable physics engine, delivering gains in physical plausibility (+32.6%), 3D accuracy (+27.1%), and cross-scenario generalization. It’s the first industrial-grade embodied foundation model validated on real robotic hardware.
   - *Possible use cases:* Download GigaWorld-1’s PyTorch weights and Unity plugin, then test grasp success rates in simulation using `gigaworld.step(action="grasp_cup")`. Hardware teams can deploy it directly onto UR5e robotic arms, replacing legacy motion planning modules, and verify whether coffee cup grasping completes in under 2.1 seconds.
9. **Pretext: A Pure TypeScript Text Measurement Library, 500× Faster Than Traditional Approaches**
   https://www.bestblogs.dev/status/2038115581883257201
   - *Core idea:* Zero DOM dependency. Uses precise mathematical modeling to replicate browser text wrapping logic, solving overflow and element overlap issues in web screenshot rendering. Already battle-tested in Codepilot’s generative UI system.
   - *Possible use cases:* In a Next.js project, run `npm install pretext`, then replace `getBoundingClientRect()` calls with `const width = pretext.measure("Hello 世界", { font: "14px Inter" })`. Product teams can embed Pretext into PDF generation services so that AI-generated reports flow across columns automatically without overflowing A4 pages.
10.
    **Cloudflare Fully Integrates Kimi K2.5, Cutting AI Agent & Code Review Costs by 77%**
    https://www.bestblogs.dev/status/2038984561132990836
    - *Core idea:* Cloudflare replaced its prior solution at scale, under the strict SLA required to power ~20% of the world’s top websites, while cutting costs by 77%. This shows that high-reasoning models are now production-ready for high-concurrency, low-latency environments: technically robust and economically viable.
    - *Possible use cases:* Deploy the `kimi-k2.5-worker` example template in Cloudflare Workers, calling `/v1/chat/completions` with the key stored in `env.KIMI_API_KEY`. Product teams can use the 77% cost reduction as a reference point: migrate existing RAG services to Kimi K2.5 + litesearch, then benchmark real-world QPS and token cost changes.
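
Item 10 suggests benchmarking token cost changes after a migration. The arithmetic behind such a comparison can be sketched in a few lines of Python. This is a minimal illustration, not Cloudflare's method: the `Pricing` class, the function names, and every price and token count below are hypothetical placeholders to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical values, not vendor quotes)."""
    input_per_million: float
    output_per_million: float


def monthly_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload given its token volumes and per-million-token prices."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


def cost_reduction_pct(old_cost: float, new_cost: float) -> float:
    """Percentage saved relative to the old cost; negative if costs went up."""
    if old_cost <= 0:
        raise ValueError("old_cost must be positive")
    return (old_cost - new_cost) / old_cost * 100


if __name__ == "__main__":
    # Hypothetical prices and volumes for illustration only.
    old = Pricing(input_per_million=3.00, output_per_million=15.00)
    new = Pricing(input_per_million=0.60, output_per_million=2.50)
    tokens_in, tokens_out = 800_000_000, 120_000_000

    before = monthly_cost(old, tokens_in, tokens_out)
    after = monthly_cost(new, tokens_in, tokens_out)
    saved = cost_reduction_pct(before, after)
    print(f"before=${before:,.2f} after=${after:,.2f} saved={saved:.1f}%")
```

Feeding the same measured token volumes into both pricing profiles keeps the comparison fair; QPS and latency need a separate load test, since a per-token saving says nothing about throughput.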