Experience the endless adventure of infinite anime life with AnimeGamer! 🤩
You can step into the shoes of Sosuke from "Ponyo on the Cliff by the Sea" and interact with a dynamic game world through open-ended language instructions. AnimeGamer generates consistent multi-turn game states, each consisting of a dynamic animation shot (i.e., a video clip) with contextual consistency (e.g., the purple car and the forest background) and updates to character states, including stamina, social, and entertainment values.
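As a rough sketch, one such game state might look like the data structure below; all class and field names are assumptions for illustration, not the project's actual code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a single multi-turn game state as described above:
# an animation shot (a short video clip) plus numeric character states.
# All names here are illustrative assumptions, not AnimeGamer's real API.

@dataclass
class CharacterState:
    stamina: int = 100        # spent on physical actions (e.g., flying, running)
    social: int = 0           # raised by interacting with other characters
    entertainment: int = 0    # raised by fun activities

@dataclass
class GameState:
    instruction: str                      # the player's open-ended language input
    animation_shot: bytes                 # decoded video clip for this turn
    character: CharacterState = field(default_factory=CharacterState)

# One turn of play: the instruction drives both the shot and the state update.
turn = GameState(
    instruction="Sosuke drives the purple car through the forest",
    animation_shot=b"",                   # placeholder for decoded video bytes
)
turn.character.stamina -= 10              # e.g., driving costs stamina
```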
With AnimeGamer, you can bring together beloved characters like Kiki from "Kiki's Delivery Service" and Pazu from "Castle in the Sky" to meet and interact in the anime world. Imagine Pazu mastering Kiki's broom-flying skills, creating unique and magical experiences. AnimeGamer generalizes to interactions between characters from different anime films and to novel character actions, opening up endless possibilities.
Recent advancements in image and video synthesis have opened up new possibilities in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as "infinite games" since they eliminate predetermined boundaries and fixed gameplay rules: players interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach to infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it generates only static images, failing to capture the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states. We introduce novel action-aware multimodal representations for animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer produces games with contextual consistency and satisfying dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience.
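To make the pipeline in the abstract concrete, here is a minimal sketch of the per-turn generation loop, assuming injected components for the MLLM, the video diffusion decoder, and the character-state predictor. Every class and method name below is an assumption for illustration, not AnimeGamer's actual interface:

```python
# Minimal sketch of the generation loop described above: an MLLM predicts
# action-aware multimodal shot representations conditioned on the history,
# and a video diffusion decoder renders each representation into a clip.
# All names here are illustrative assumptions, not AnimeGamer's real API.

class AnimeGamerLoopSketch:
    def __init__(self, mllm, video_decoder, state_predictor):
        self.mllm = mllm                      # multimodal LLM backbone
        self.video_decoder = video_decoder    # video diffusion model
        self.state_predictor = state_predictor  # predicts stamina/social/entertainment updates
        self.history = []                     # past animation-shot representations

    def step(self, instruction: str):
        # Condition on the full history of shot representations so the next
        # shot stays consistent with earlier scenes (e.g., the same purple
        # car and forest background recur across turns).
        shot_repr = self.mllm.predict_next_shot(self.history, instruction)
        # Decode the action-aware representation into a video clip.
        clip = self.video_decoder.decode(shot_repr)
        # Predict the updated numeric character states for this turn.
        char_state = self.state_predictor(self.history, instruction)
        self.history.append(shot_repr)
        return clip, char_state
```

The key design choice the abstract highlights is conditioning on the full history of shot representations rather than on text dialogue alone, which is what the prior LLM-based approach lacks and what keeps multi-turn gameplay visually consistent.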