🤖 AI Weekly Recap (Week 46)

This week’s top AI news, breakthroughs, and game-changing updates

Good morning, AI enthusiast. AI just had another wild week. ElevenLabs unveiled Scribe v2 Realtime, delivering human-quality transcription in under 150 milliseconds across 90+ languages, while World Labs launched Marble, turning text, images, or videos into fully explorable, downloadable 3D environments.

Plus: The most important news and breakthroughs in AI this week.

World Labs released Marble, its first commercial world model product that turns text prompts, photos, videos, 3D layouts, or panoramas into editable, downloadable 3D environments. This breakthrough makes spatial intelligence accessible to creators.

→ Creates persistent 3D worlds that can be exported as Gaussian splats, meshes, or videos

→ Features Chisel, an experimental 3D editor that lets users block out spatial layouts and add text prompts to guide visual style

→ Already compatible with Vision Pro and Quest 3 VR headsets, with every generated world viewable in VR

→ Generates worlds in a broad variety of styles, including cartoon, science fiction, fantasy, anime, realistic, and retro low-poly

🧰 Who is This Useful For:

  • Game developers creating immersive environments rapidly

  • VFX artists working on film and video production

  • VR/AR creators building interactive experiences

  • Robotics researchers simulating training environments

Try it now → https://marble.worldlabs.ai

OpenAI unveiled GPT-5.1, describing it as "now warmer, more intelligent, and better at following your instructions". The update focuses on speed, conversational quality, and developer-friendly features.

→ Dynamically adapts thinking time based on task complexity, making it significantly faster and more token-efficient on simpler everyday tasks

→ Includes a "no reasoning" mode to respond faster on tasks that don't require deep thinking while maintaining frontier intelligence

→ Partners like Balyasny Asset Management report GPT-5.1 running 2-3x faster than GPT-5 while using about half as many tokens

→ Scored 76.3% on the SWE-bench Verified benchmark, a notable improvement over GPT-5's 72.8%
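The adaptive-thinking idea is easy to sketch in plain Python: route easy prompts to a tiny (or zero) reasoning budget and hard ones to a large one. Everything below, the heuristic, the cue words, the budget numbers, is invented for illustration; GPT-5.1's actual router is internal to the model.

```python
# Toy illustration of adaptive "thinking time": allocate a reasoning-token
# budget from a crude estimate of task complexity. The heuristic and the
# budget numbers are made up for illustration purposes only.

def estimate_complexity(prompt: str) -> str:
    """Rough proxy: long prompts or 'prove/derive/debug' cues count as hard."""
    hard_cues = ("prove", "derive", "step by step", "debug", "optimize")
    if len(prompt.split()) > 40 or any(cue in prompt.lower() for cue in hard_cues):
        return "hard"
    return "easy"

def thinking_budget(prompt: str) -> int:
    """Map estimated complexity to a reasoning-token budget (0 = no reasoning)."""
    return {"easy": 0, "hard": 2048}[estimate_complexity(prompt)]

print(thinking_budget("What's the capital of France?"))      # trivial: skip reasoning
print(thinking_budget("Prove that sqrt(2) is irrational."))  # allocate thinking tokens
```

The payoff is exactly the one OpenAI describes: simple everyday queries skip the expensive reasoning pass entirely, which is where the speed and token savings come from.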

🧰 Who is This Useful For:

  • Developers building AI agents and coding assistants

  • Companies seeking faster, more cost-effective AI solutions

  • Enterprises deploying conversational AI at scale

  • Teams requiring better instruction-following and steerability

Try it now → https://chatgpt.com/

Microsoft Research developed the Multi-modal Critical Thinking Agent, or MMCTAgent, for structured reasoning over long-form video and image data. The system brings structured, tool-based reasoning to complex visual information at scale.

→ Built on AutoGen with a Planner-Critic architecture that enables planning, reflection, and tool-based reasoning

→ Can analyze long-form video where context spans hours, far beyond the context limits of most models

→ Features modality-specific agents including ImageAgent and VideoAgent with specialized tools like get_relevant_query_frames() and object_detection_tool()

→ Modular extensibility allows developers to integrate domain-specific tools such as medical image analyzers or industrial inspection models
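The Planner-Critic pattern at the heart of MMCTAgent is simple to sketch: a planner proposes an answer (in the real system, by calling an LLM and vision tools), and a critic reflects on it and either approves or sends feedback back for another round. The snippet below is a generic, self-contained toy, its planner, critic, and question are stand-ins, not MMCTAgent's actual AutoGen agents.

```python
from typing import Optional, Tuple

# Minimal Planner-Critic loop: the planner proposes an answer, the critic
# checks it, and the loop repeats until approval or the retry budget runs
# out. All components here are toy stand-ins for real LLM/tool calls.

def planner(question: str, feedback: Optional[str]) -> str:
    # A real planner would invoke an LLM plus tools (e.g. a frame
    # retriever); this toy just sharpens its answer when criticized.
    return "a dog" if feedback is None else "a golden retriever"

def critic(question: str, answer: str) -> Tuple[bool, str]:
    # A real critic reflects on evidence; this one just demands specificity.
    if "retriever" in answer:
        return True, "answer is specific enough"
    return False, "too vague: name the breed"

def planner_critic_loop(question: str, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        answer = planner(question, feedback)
        ok, feedback = critic(question, answer)
        if ok:
            return answer
    return answer

print(planner_critic_loop("What animal appears in the video?"))
```

The reflection step is what makes the pattern work on hours-long video: the critic can keep sending the planner back to fetch more frames until the evidence actually supports the answer.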

🧰 Who is This Useful For:

  • Researchers analyzing large video archives and image collections

  • Agricultural and scientific teams conducting visual evaluations

  • Content moderators processing hours of video footage

  • Enterprise teams building multi-step visual reasoning systems

ElevenLabs released Scribe v2 Realtime, a new speech-to-text model with sub-150ms latency designed for enterprise-grade conversational AI agents. The release sets a new standard for voice AI, enabling truly natural real-time conversations.

→ Delivers industry-leading speed with ultra-low latency of about 150 milliseconds while maintaining high accuracy

→ Supports over 90 languages including 11 Indian languages such as Hindi, Tamil, Malayalam, Telugu, and Gujarati

→ Features "negative latency" prediction capability, predicting the next word and punctuation to further reduce perceived delay

→ Outperforms major competitors including Google's Gemini Flash 2.5, OpenAI's GPT-4o Mini, and Deepgram's Nova 3 in internal benchmarks
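The "negative latency" trick is conceptually simple: alongside the words already confirmed from audio, show a predicted next word so text appears before the audio that confirms it. The toy below uses a hand-written bigram table as its predictor, a pure illustration of the idea, nothing like the model ElevenLabs actually ships.

```python
# Toy "negative latency" display: committed words come from (simulated)
# audio; a predicted next word is shown speculatively in brackets, then
# replaced once the real word arrives. The bigram table is invented.

BIGRAMS = {"thank": "you", "good": "morning", "see": "you"}

def display(committed):
    """Render confirmed words plus a speculative next-word prediction."""
    guess = BIGRAMS.get(committed[-1]) if committed else None
    text = " ".join(committed)
    return f"{text} [{guess}?]" if guess else text

# Simulate words arriving one at a time from the audio stream
stream = ["good", "morning", "everyone"]
for i in range(1, len(stream) + 1):
    print(display(stream[:i]))
```

Because the guess is rendered before the matching audio is decoded, the *perceived* delay drops below the model's actual processing latency, hence "negative."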

🧰 Who is This Useful For:

  • Developers building real-time voice assistants and AI agents

  • Enterprises deploying customer support and sales voice systems

  • Meeting platforms requiring instant, accurate live transcription

  • Healthcare and compliance teams needing HIPAA-compliant transcription

Google DeepMind introduced SIMA 2, integrating Gemini models to evolve from an instruction-follower into an interactive gaming companion that can think, reason, and learn in 3D virtual worlds. This marks a significant step toward more generally capable embodied agents.

→ Doubled its predecessor's performance, achieving a 65% task completion rate compared to SIMA 1's 31%

→ Can now think about its goals, converse with users, and improve itself over time through self-directed learning

→ Uses Gemini to reason internally about abstract concepts and logical commands by understanding its environment and user intent

→ Successfully navigates and carries out instructions in procedurally generated environments it has never seen before

🧰 Who is This Useful For:

  • Robotics researchers developing general-purpose navigation systems

  • Game developers exploring AI-powered NPCs and assistants

  • AI researchers studying embodied intelligence and spatial reasoning

  • VR/AR developers building interactive virtual environments

Baidu unveiled ERNIE 5.0, a natively omni-modal model designed to jointly process and generate content across text, images, audio, and video. The Chinese AI giant claims it outperforms Western rivals on key visual understanding tasks.

→ Outperformed OpenAI's GPT-5-High and Google's Gemini 2.5 Pro on OCRBench, DocVQA, and ChartQA benchmarks

→ Open-source variant ERNIE-4.5-VL-28B-A3B-Thinking activates just 3 billion parameters during operation while maintaining 28 billion total parameters

→ Demonstrates strong results on instruction following, factual question answering, and mathematical reasoning

→ Released under the permissive Apache 2.0 license, signaling ambitions to compete internationally
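The 28B-total / 3B-active split is standard Mixture-of-Experts accounting: every expert counts toward total parameters, but only the shared layers plus the few experts the router fires for a given token actually run. The numbers below are clean placeholders chosen to reproduce the 28B/3B ratio, not ERNIE's real layer or expert counts.

```python
# Generic MoE parameter accounting: "total" counts every expert, "active"
# counts only shared layers plus the top-k experts the router selects per
# token. All sizes below (in billions) are illustrative placeholders.

def moe_params(shared: float, expert: float, num_experts: int, top_k: int):
    total = shared + num_experts * expert
    active = shared + top_k * expert
    return total, active

# e.g. 1B shared, 27 experts of 1B each, router picks 2 experts per token
total, active = moe_params(shared=1.0, expert=1.0, num_experts=27, top_k=2)
print(f"total = {total:.0f}B, active = {active:.0f}B")
```

This is why MoE models like ERNIE-4.5-VL-28B-A3B-Thinking can be far cheaper to serve than their headline parameter count suggests: per-token compute scales with active parameters, not total.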

🧰 Who is This Useful For:

  • Enterprises seeking efficient document and chart understanding

  • Developers looking for cost-effective multimodal alternatives

  • Financial analysts processing structured data at scale

  • Researchers exploring open-source multimodal reasoning models

Try it now → https://ernie.baidu.com/

Google introduced Deep Research to NotebookLM, automating complex online research and generating insightful reports. This transforms the AI note-taking tool into a comprehensive research assistant.

→ Browses hundreds of websites on your behalf, creating research plans and refining searches as it learns

→ Generates organized, source-grounded reports that can be added directly to your notebook along with all sources

→ Runs in the background while you continue working, allowing you to build a knowledge base without leaving your workflow

→ Now supports Google Sheets, Drive URLs, images, PDFs from Drive, and Microsoft Word documents

🧰 Who is This Useful For:

  • Researchers conducting multi-source investigations

  • Students building comprehensive knowledge bases for projects

  • Analysts synthesizing information across diverse sources

  • Writers and journalists gathering background research efficiently

Try it now → https://notebooklm.google.com

Thanks for making it to the end! I put my heart into every email I send, and I hope you're enjoying them. Let me know your thoughts so I can make the next one even better! See you tomorrow.

- Dr. Alvaro Cintas