Author: RadarAI Editorial
Editor: RadarAI Editorial
Last updated: 2026-05-21
Review status: Editorial review pending
Brief
Official
AI Updates
Open Source
Gemma 4 and LongCat-Next together mark a shift toward natively unified multimodal modeling in open-source AI. Real-time video calling for AI agents is maturing quickly, with frameworks such as OpenClaw and PikaStream now supporting live task execution [1][7][12]. Xiaomi has launched Token Plan, a unified billing system; Meituan has introduced the DiNA architecture to address bottlenecks in discrete multimodal modeling; and engineering practice is moving from RAG toward alternatives such as ChromaFs, a virtual file system [5][2][4].
Editorial standards and source policy: Editorial standards, Team. Content links to primary sources; see Methodology.
## 🔍 Core Insights
**Gemma 4** and **LongCat-Next** are the two headline releases, signaling that open-source multimodal models are moving to *natively unified modeling*. **AI agent video calling** is heading toward production deployment, with frameworks including OpenClaw and PikaStream already supporting real-time task execution [1][7][12]. Xiaomi has introduced the **Token Plan**, a unified billing system; Meituan has unveiled the **DiNA architecture**, which targets long-standing bottlenecks in discrete multimodal modeling; and engineering practice is shifting from traditional RAG toward infrastructure such as the **ChromaFs virtual file system** [5][2][4].
## 🚀 Key Updates
- **Gemma 4 Release: A Native Multimodal Open-Source Model under Apache 2.0** [1]: Google DeepMind launched the Gemma 4 series, which supports audio and video and is built on a highly optimized, non-standard Transformer architecture.
- **Meituan's LongCat-Next Achieves Unified Token Prediction Across Text, Images, and Speech** [2]: The model introduces the DiNA (Discrete Native Autoregressive) architecture, described as the first to enable native discrete autoregression across modalities and to raise the performance ceiling of discrete multimodal modeling.
- **OpenClaw AI Agent Integrates Natively with Google Meet for Real-Time Video Calls** [1]: The agent processes audio-video streams end to end and responds interactively, which the source presents as validation of a new path toward embodied AI agents.
- **Pika Labs Launches AI Agent Video Chat Capabilities Powered by PikaStream 1.0** [12]: The beta version supports joining meetings in real time, visual understanding, and dynamic task response.
- **Xiaomi's MiMo Large Model Adopts Token Plan Subscription Model** [5]: A unified credit-based billing system covers agent invocations across all modalities and is aimed at high-intensity development workflows.
- **Mintlify Releases ChromaFs: A Virtual File System to Replace Traditional RAG** [4]: ChromaFs is reported to reduce latency and cost for AI-powered document assistants while improving retrieval accuracy and contextual consistency.
- **HKU Open-Sources OpenHarness: A Lightweight, White-Box AI Agent Framework** [13]: The framework is compatible with the Claude Code ecosystem and prioritizes debuggability and resource efficiency.
- **Browser Use Cloud Introduces a Free Tier** [23]: The tier offers unlimited browser runtime, free proxy services, and persistent authentication, lowering the barrier to cloud-based AI agent experimentation.
## 🔗 Sources
[1] [AI News] Gemma 4: The Most Powerful Compact Multimodal Open-Source Model—Outperforming Gemma 3 Across All Benchmarks — https://www.bestblogs.dev/article/185810bc
[2] Meituan's LongCat-Next: A Native Multimodal Approach That Treats Images and Speech as Tokens for Prediction — https://www.bestblogs.dev/article/2f2a5b5e
[4] Mintlify's ChromaFs Virtual File System: Engineering Best Practices for Optimizing AI Document Assistants — https://www.bestblogs.dev/status/2039945867772268951
[5] Xiaomi's MiMo Large Model Launches Token Plan—A Single Subscription Covers All-Modal Agent Tasks — https://www.bestblogs.dev/article/d3837e08
[7] AI Agent Video Calling Is Becoming Mainstream — https://www.bestblogs.dev/status/2039923815329755196
[12] Pika Labs Introduces Real-Time Video Chat Functionality for AI Agents — https://www.bestblogs.dev/status/2039904088737947889