Wan Streamer
June 24, 2026

Wan Streamer v0.1: End-to-end Real-time Interactive Foundation Models

A native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds on video in real time: 25 fps with ~200 ms model-side latency, all within a single Transformer.


Overview

Wan Streamer is a native-streaming, end-to-end interactive foundation model, designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming.

Wan Streamer framework
Overview of Wan Streamer. Language, audio, and video are modeled as both input and output within a single Transformer, coordinated by block-causal attention for incremental streaming generation.

To support natural audio-visual responsiveness, the entire stack is redesigned around streamability (causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling), enabling streaming units as short as 160 ms at 25 fps. Wan Streamer reaches roughly 200 ms model-side response latency, and about 550 ms total interaction latency once combined with 350 ms of bidirectional network latency, supporting sub-second duplex audio-visual communication.
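
The figures in this paragraph fit together with simple arithmetic. The sketch below works it out using only the values quoted in the text; the constant names are illustrative and do not come from the Wan Streamer codebase.

```python
# Values quoted in the text; the names are illustrative assumptions.
FPS = 25                   # video frame rate
UNIT_MS = 160              # shortest streaming unit
MODEL_LATENCY_MS = 200     # approximate model-side response latency
NETWORK_LATENCY_MS = 350   # bidirectional network latency assumed in the text

frame_ms = 1000 / FPS                  # duration of one video frame
frames_per_unit = UNIT_MS / frame_ms   # frames carried by one streaming unit
total_latency_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS

print(f"{frame_ms:.0f} ms per frame, {frames_per_unit:.0f} frames per unit")
print(f"total interaction latency = {total_latency_ms} ms")
```

At 25 fps each frame lasts 40 ms, so a 160 ms unit carries four frames, and the model-side and network latencies add up to the quoted 550 ms.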

How it compares

Real-time interaction systems split into two camps. Speech-only systems answer fast but produce no visible agent: there is no synchronized face, gaze, or motion. Audio-visual systems do render an avatar, but they are assembled from external ASR, language, TTS, and animation modules, which adds latency at every boundary, and most never report an end-to-end response number. Wan Streamer is the only model that delivers a synchronized audio + video response from one end-to-end Transformer, and does so well under a second.

Response latency (seconds; lower is better)

Where two values are given, the first is the model-side / first-response latency and the second is the total, including network and pipeline.

End-to-end interaction loop (perceive → respond)
Wan Streamer (speech + video): 0.2 · 0.55 s
GPT-4o Realtime (speech): 0.23 · ~0.8 s
Doubao Voice (speech): 0.7 · ~1.0 s
Gemini Live (speech): 1.2–3.6 s

Rendering stage only (external LLM / ASR / TTS excluded)
LPM 1.0: ~0.35 s
OmniForcing: ~0.7 s
Hallo-Live: 0.94 s
StreamAvatar: ~1.2 s

The two groups measure different things. The top group consists of full end-to-end interaction loops that perceive the user and produce a response; only Wan Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.

[Table: capability coverage. Columns: Perceives video, Outputs video, Full-duplex, End-to-end, Sub-1s response. Rows: Wan Streamer, Doubao Voice, GPT-4o Realtime, StreamAvatar, LPM 1.0. Legend: yes; ~ = partial / not disclosed; no.]

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, understanding and responding at the same time. Wan Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.

Agent demos

Each demo below is generated by the same model, with a different person, voice, and scene. Open a clip to watch a prerecorded face-to-face interaction, with the user-side video shown in the corner. Clips are unedited model outputs rather than a live online experience. This v0.1 runs at a preliminary 192p, a proof of concept for the end-to-end design; scaling to higher resolution should be straightforward and is left to future work.

Real-time recording

A screen recording of a real networked conversation: the local user stream is shown on the left, the AI agent responds on the right, and the text stream updates below during the session. The clip is compressed for the web while preserving as much of the recording's clarity as possible.

Real-time networked conversation recording

The full-duplex challenge

Human interaction with the world is fundamentally streaming and full-duplex. People do not finish perceiving, then reason in isolation, and only afterwards respond. They continuously watch, listen, speak, gesture, react, pause, and interrupt, with perception and expression overlapping at audio-visual timescales. A real-time interactive model needs the same pattern: it must continuously consume audio-visual observations, maintain a persistent world and dialogue state, decide when and how to respond, and express that response through synchronized language, speech, and video with very low latency.

Most existing systems are assembled as cascaded or asymmetric pipelines. Some perceive audio and video but respond only in text or speech; others generate audio-visual behavior but rely on external language, ASR, TTS, animation, or rendering modules, often using text as a hidden intermediate representation between separately trained components. Such pipelines introduce waiting time at every module boundary, accumulate recognition and synchronization errors, and make response timing, turn-taking, identity preservation, and long-horizon consistency hard to learn as one coherent behavior.

The core difficulty is that real-time audio-visual interaction is not simply the union of multimodal understanding and multimodal generation. It is intrinsically full-duplex: when the user speaks, the agent should still show visible listening behavior; when the agent responds, it should still perceive the user for interruption and adaptation. Incoming speech and video must immediately affect outgoing speech and motion, generated audio and visual states must be coupled before decoding rather than repaired afterward, and every emitted unit must become part of the interaction history. This makes streamability a modeling constraint, not a serving optimization: a system built on offline encoders, bidirectional decoders, or round-based dialogue cannot recover true low-latency full-duplex behavior by engineering alone.

A single Transformer

Wan Streamer is built around one streaming contract: every component operates causally, every newly observed unit is usable immediately, and every generated unit is emitted and committed back into the interaction history. Language, audio, and video, on both the input and output sides, form a single interleaved causal sequence processed by one Transformer. There is no external VAD, ASR, language, TTS, animation, or video-generation module; perception, reasoning, response planning, speech and visual generation, response timing, and turn-taking are optimized jointly within one persistent state.
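
One way to picture the single interleaved sequence is as a flat list of per-unit segments. The sketch below is only an illustration: the ordering of modalities inside a unit, and the choice to place user-side inputs before agent-side outputs, are assumptions, since the text does not specify the exact layout. The names `unit_segments` and `interleaved_sequence` are hypothetical.

```python
from itertools import chain

MODALITIES = ("video", "audio", "text")

def unit_segments(k):
    """Segments of streaming unit k: user-side inputs, then agent-side outputs.

    The within-unit ordering is an assumption made for illustration.
    """
    inputs = [("in", m, k) for m in MODALITIES]
    outputs = [("out", m, k) for m in MODALITIES]
    return inputs + outputs

def interleaved_sequence(num_units):
    """Flatten all units into one causal sequence, oldest unit first."""
    return list(chain.from_iterable(unit_segments(k) for k in range(num_units)))

seq = interleaved_sequence(2)
for segment in seq:
    print(segment)
```

Because outputs of unit k sit in the same sequence as inputs of unit k+1, anything the agent emits automatically becomes history for later units.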

The model treats interaction as a continuous causal stream in which user observations and agent responses jointly update the ongoing context. At each streaming unit, it encodes the currently available user observations and predicts the next response from the complete causal history across both sides of the interaction. The language response is a sequence of discrete tokens trained with next-token prediction; the audio and video responses live in continuous latent spaces and are generated jointly with conditional flow matching, conditioned on the same clean context so that speech, motion, appearance, and scene evolution are denoised as one coupled response.
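
To make the flow-matching step concrete, here is a minimal sketch of sampling by Euler integration of a velocity field from noise (t = 0) to a clean latent (t = 1). Everything specific is an assumption: the text does not state which solver, probability path, or step count Wan Streamer uses, and `toy_velocity` is a closed-form stand-in for the velocity that the Transformer would predict from the causal history. A single array holds the whole unit's latent so that all of its components are integrated together.

```python
import numpy as np

def euler_flow_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (clean latent)."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

rng = np.random.default_rng(0)
# Hypothetical joint audio+video latent for one streaming unit.
target = rng.normal(size=(2, 16))

def toy_velocity(x, t):
    # Straight-path velocity toward a known target; a stand-in for the
    # learned, history-conditioned velocity predicted by the model.
    return (target - x) / (1.0 - t)

noise = rng.normal(size=target.shape)
latent = euler_flow_sample(toy_velocity, noise)
print(np.allclose(latent, target))
```

With this toy field the final Euler step lands on the target, which makes the sketch easy to check; a learned velocity would only approximate the clean latent.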

To make this work, the whole stack is causal from the start: strictly causal audio and video VAEs for streaming latent coding, causal audio-visual encoders, causal audio and video decoders, and a temporally causal Transformer coordinated by block-causal attention. After denoising, the estimated clean latents are appended directly to the history as context for subsequent units, while the causal decoders render them into external audio and video.
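
Block-causal attention can be illustrated with a small mask. In the sketch below a token may attend to every token in its own block and in all earlier blocks, but never to later blocks. Treating one streaming unit as one block, and the block sizes used, are assumptions for illustration; the text does not give the actual block layout.

```python
import numpy as np

def block_causal_mask(block_sizes):
    """Return a boolean [T, T] mask; mask[i, j] is True when token i may attend to token j."""
    # Label each token with the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # A token sees its own block and every earlier block.
    return block_ids[:, None] >= block_ids[None, :]

# Three hypothetical blocks holding 2, 3, and 2 tokens.
mask = block_causal_mask([2, 3, 2])
print(mask.astype(int))
```

Unlike a token-level causal mask, tokens inside the same block see each other in both directions, which is what lets a whole unit be processed at once while the stream stays causal across units.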

Deployment: thinker–performer

Wan Streamer is trained as a single end-to-end model. For real-time deployment, that same model is split into a thinker–performer pipeline across two GPUs to maximize overlap and hardware utilization. After system prefill, the thinker broadcasts the initial KV cache to the performer, so both sides share the same full-history state and the behavior of the unified model is preserved exactly.

Thinker-performer overlap
Thinker–performer overlap during streaming inference. At unit k, the thinker encodes the current user observations, updates the KV cache, and decodes the previous unit's latents for immediate emission; the performer takes the new KV slice and runs only the flow-matching solver to produce the next audio-visual latents, returned at the following unit. Perception, decoding, communication, and denoising overlap across adjacent units.

The thinker hosts the causal audio/video encoders, the short token-causal Transformer pass for language prediction and state update, KV-cache construction, and the causal decoders that render the previous unit's latents into audio and video for immediate emission. The performer holds only the latent-generation path, running the flow-matching solver for the next audio-visual unit from the shared full-history KV context. Because the performer never runs decoders and the thinker never runs the expensive solver, decoding and generation never block each other.
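
The division of labor can be traced with a tiny sequential simulation. This is a sketch of the schedule only, not of real concurrency or GPU communication; the stage labels and the function name `simulate` are hypothetical, and latents are indexed by the unit in which the performer produced them.

```python
def simulate(num_units):
    """Log what each role works on during every streaming unit."""
    pending = None  # latents handed back by the performer, not yet decoded
    log = []
    for k in range(num_units):
        # Thinker: perceive the current unit and update shared state.
        thinker = [f"encode obs {k}", f"update KV {k}"]
        # Thinker: decode whatever the performer returned from the previous unit.
        if pending is not None:
            thinker.append(f"decode latents {pending}")
        # Performer: runs only the solver, using the new KV slice.
        performer = f"solve latents {k}"
        log.append((k, thinker, performer))
        pending = k  # returned to the thinker at the following unit
    return log

for k, thinker, performer in simulate(3):
    print(k, thinker, "|", performer)
```

From the second unit onward the thinker's decode of the previous unit's latents and the performer's solve for the current unit appear in the same row, which is the overlap the pipeline relies on.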

This schedule pipelines current-unit perception, previous-unit decoding, KV/latent communication, and next-unit denoising across adjacent units. Real-time throughput holds as long as the performer time plus communication fits inside one 160 ms streaming unit. Separately, the signal-to-signal path (encode → state update → latent generation → decode) accounts for the ~200 ms model-side latency, which is held in budget with CUDA-graph capture, compilation, and optimized kernels.
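
The throughput condition is a simple inequality: performer time plus communication time must not exceed one streaming unit. The sketch below encodes that check; the 160 ms unit comes from the text, while the timings passed in are hypothetical values, not measurements from the paper.

```python
UNIT_MS = 160.0  # streaming unit length quoted in the text

def slack_ms(performer_ms, comm_ms, unit_ms=UNIT_MS):
    """Time left in the unit after the performer's solve and communication."""
    return unit_ms - (performer_ms + comm_ms)

def keeps_real_time(performer_ms, comm_ms, unit_ms=UNIT_MS):
    """True when the performer's solve plus communication fits in one unit."""
    return slack_ms(performer_ms, comm_ms, unit_ms) >= 0

# Hypothetical timings for illustration only.
print(keeps_real_time(120.0, 15.0), slack_ms(120.0, 15.0))
print(keeps_real_time(150.0, 20.0), slack_ms(150.0, 20.0))
```

Note that this budget governs throughput only; the ~200 ms model-side latency is a separate quantity measured along the signal-to-signal path.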

Cite this work

@misc{wanstreamer2026,
  title         = {Wan Streamer v0.1: End-to-end Real-time Interactive Foundation Models},
  author        = {{Wan Team, Alibaba Group}},
  year          = {2026},
  month         = jun,
  eprint        = {2606.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.XXXXX},
  note          = {Submitted on 24 Jun 2026}
}