






Shanghai (Gasgoo)- On February 18, Geely Auto Group and its tech ecosystem partner Stepfun announced the open-sourcing of two multimodal AI large models—the Step-Video-T2V for video generation and the Step-Audio for voice interaction.
The collaboration leveraged both companies' strengths in computing power, algorithms, and scenario-based training, significantly enhancing the AI models' performance. Stepfun stated that the initiative aims to share the latest advancements in multimodal large models with the global open-source community and to help advance the community's development.
Step-Video-T2V
With 30 billion parameters, the Step-Video-T2V can generate high-quality videos of up to 204 frames at 540p resolution, ensuring exceptional information density and consistency.
To comprehensively assess AI-generated video quality, Stepfun has also released an open-source benchmark dataset, the Step-Video-T2V-Eval. This dataset includes 128 real-world Chinese-language queries to evaluate video performance across 11 categories, such as motion, landscapes, animals, abstract concepts, surrealism, human figures, 3D animation, and cinematography.
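An evaluation set like this is essentially a list of categorized prompts. As a rough illustration only, the sketch below shows one plausible way such a benchmark could be organized and tallied; the field names, example prompts, and schema are assumptions for illustration, not the actual structure of Step-Video-T2V-Eval.

```python
# Hypothetical sketch of a text-to-video eval set in the spirit of
# Step-Video-T2V-Eval: Chinese-language prompts, each tagged with one
# of the evaluation categories. Schema and prompts are illustrative
# assumptions, not the dataset's real format.
from collections import Counter

benchmark = [
    {"id": 1, "category": "motion",     "prompt": "一名芭蕾舞者在舞台上旋转"},
    {"id": 2, "category": "animals",    "prompt": "一只熊猫在斜坡上滑滑板"},
    {"id": 3, "category": "landscapes", "prompt": "日出时分的山间云海"},
]

def category_counts(entries):
    """Tally how many prompts fall into each evaluation category."""
    return Counter(e["category"] for e in entries)

counts = category_counts(benchmark)
print(counts)  # e.g. Counter({'motion': 1, 'animals': 1, 'landscapes': 1})
```

In the full benchmark, the 128 queries would be distributed across the 11 categories, and a model's outputs would be scored per category to expose strengths and weaknesses.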
The company said the Step-Video-T2V outperforms existing open-source models in instruction adherence, motion smoothness, physical realism, and aesthetic appeal. The model excels in generating complex motion sequences, expressive human figures, visually imaginative scenes, bilingual text integration, and advanced cinematographic compositions.
The AI model's ability to accurately depict intricate movements is particularly noteworthy. Whether it's the grace of ballet, the intensity of karate, the speed of badminton, or the high-speed rotations of diving, the model demonstrates a deep understanding of physical space and motion dynamics. In one test case, it realistically portrayed the spatial relationships between a panda, a sloped surface, and a skateboard, producing physics-aware visuals—one of the most challenging aspects of AI video generation today.
Step-Audio
According to Stepfun, the Step-Audio is the industry's first product-grade open-source voice interaction model. It can generate speech with diverse emotions, dialects, languages, singing styles, and personalized expressions, enabling natural, high-quality conversations across various scenarios, including film, entertainment, social interactions, and gaming.
The company added that the Step-Audio has outperformed similar open-source models in five major industry-standard tests, including LLaMA Question and Web Questions. Its performance in the HSK-6 (Chinese Proficiency Test Level 6) evaluation highlights its deep understanding of the Chinese language, making it one of the most proficient open-source voice AI models for Chinese speakers.
Beyond language comprehension, Step-Audio also demonstrates high emotional intelligence, offering empathetic and thoughtful responses, much like a close friend providing guidance through life's challenges.
Additionally, it excels in rhythm and melody processing, allowing it to generate dynamic rap performances with a deep understanding of linguistic cadence and flow.
Recognizing the lack of comprehensive voice AI evaluation benchmarks, Stepfun has also introduced the StepEval-Audio-360, an open-source testing framework. This benchmark assesses voice AI models across nine key dimensions, including role-playing, logical reasoning, content generation, wordplay, creative abilities, and instruction-following.