






Shanghai (Gasgoo)- On February 18, Geely Auto Group and its tech ecosystem partner Stepfun announced the open-sourcing of two multimodal AI large models—the Step-Video-T2V for video generation and the Step-Audio for voice interaction.
The collaboration leveraged both companies' strengths in computing power, algorithms, and scenario-based training, significantly enhancing the AI models' performance. Stepfun stated that the initiative aims to share the latest advancements in multimodal large models with the global open-source community and to help advance the community's development.
Step-Video-T2V
With 30 billion parameters, the Step-Video-T2V can generate high-quality videos of up to 204 frames at 540p resolution, ensuring exceptional information density and consistency.
To comprehensively assess AI-generated video quality, Stepfun has also released an open-source benchmark dataset, the Step-Video-T2V-Eval. This dataset includes 128 real-world Chinese-language queries to evaluate video performance across 11 categories, such as motion, landscapes, animals, abstract concepts, surrealism, human figures, 3D animation, and cinematography.
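An evaluation set like this is essentially a list of categorized prompts. As a rough illustration only, the sketch below shows one plausible way such a benchmark could be organized and tallied; the field names, example prompts, and schema are assumptions for illustration, not the actual structure of Step-Video-T2V-Eval.

```python
# Hypothetical sketch of a text-to-video eval set in the spirit of
# Step-Video-T2V-Eval: Chinese-language prompts, each tagged with one
# of the evaluation categories. Schema and prompts are illustrative
# assumptions, not the dataset's real format.
from collections import Counter

benchmark = [
    {"id": 1, "category": "motion",     "prompt": "一名芭蕾舞者在舞台上旋转"},
    {"id": 2, "category": "animals",    "prompt": "一只熊猫在斜坡上滑滑板"},
    {"id": 3, "category": "landscapes", "prompt": "日出时分的山间云海"},
]

def category_counts(entries):
    """Tally how many prompts fall into each evaluation category."""
    return Counter(e["category"] for e in entries)

counts = category_counts(benchmark)
print(counts)  # e.g. Counter({'motion': 1, 'animals': 1, 'landscapes': 1})
```

In the full benchmark, the 128 queries would be distributed across the 11 categories, and a model's outputs would be scored per category to expose strengths and weaknesses.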
The company said the Step-Video-T2V outperforms existing open-source models in instruction adherence, motion smoothness, physical realism, and aesthetic appeal. The model excels in generating complex motion sequences, expressive human figures, visually imaginative scenes, bilingual text integration, and advanced cinematographic compositions.
The AI model's ability to accurately depict intricate movements is particularly noteworthy. Whether it's the grace of ballet, the intensity of karate, the speed of badminton, or the high-speed rotations of diving, the model demonstrates a deep understanding of physical space and motion dynamics. In one test case, it realistically portrayed the spatial relationships between a panda, a sloped surface, and a skateboard, producing physics-aware visuals—one of the most challenging aspects of AI video generation today.
Step-Audio
According to Stepfun, the Step-Audio is the industry's first product-grade open-source voice interaction model. It can generate speech with diverse emotions, dialects, languages, singing styles, and personalized expressions, enabling natural, high-quality conversations across various scenarios, including film, entertainment, social interactions, and gaming.
The company added that the Step-Audio has outperformed similar open-source models in five major industry-standard tests, including LLaMA Question and Web Questions. Its performance in the HSK-6 (Chinese Proficiency Test Level 6) evaluation highlights its deep understanding of the Chinese language, making it one of the most proficient open-source voice AI models for Chinese speakers.
Beyond language comprehension, Step-Audio also demonstrates high emotional intelligence, offering empathetic and thoughtful responses, much like a close friend providing guidance through life's challenges.
Additionally, it excels in rhythm and melody processing, allowing it to generate dynamic rap performances with a deep understanding of linguistic cadence and flow.
Recognizing the lack of comprehensive voice AI evaluation benchmarks, Stepfun has also introduced the StepEval-Audio-360, an open-source testing framework. This benchmark assesses voice AI models across nine key dimensions, including role-playing, logical reasoning, content generation, wordplay, creative abilities, and instruction-following.