In pursuit of speed, the MiniMax team has introduced a number of technical innovations.
Shanghai-based MiniMax recently launched its multimodal models at the Xuhui Riverfront. The company's founder, Dr. Yan Junjie, also spoke as a representative of entrepreneurs at the 2024 Pujiang Innovation Forum – Global Venture Capital Conference. The model-generated videos he played during his speech were impressive: whether a magical skit in the style of the Harry Potter films or a sci-fi clip of astronauts flying a spaceship through the universe, they gave the audience an experience comparable to that of OpenAI's Sora.
With limited computing power, how can domestic large models generate high-quality text, images, video, music and speech? Yan Junjie shared his views.
Multiple technical innovations in pursuit of speed
Yan Junjie graduated from the Institute of Automation of the Chinese Academy of Sciences, served as a vice president of SenseTime, and founded Xiyu Technology (MiniMax) at the end of 2021. In his view, there are currently three important directions for optimizing large AI models. The first is to keep reducing the model's error rate: most models still have a high error rate, performing impressively at times and unreliably at others, and this has become a major bottleneck that keeps models from handling complex tasks. The second is to achieve infinitely long inputs and outputs: humans have this ability, but a large model's computational demand grows with the square of the amount of input and output it processes and soon exceeds what the available computing power can bear, so this bottleneck has to be broken through innovation at the foundational level. The third is multimodality: the model should be able to generate text, sound, images, video and other modalities so that it can exchange all kinds of information with users.
Video generated by the MiniMax large model
“How do we overcome the technical difficulties in these three directions? We believe that, at the same level of capability, faster is better,” Yan Junjie said. “Of two models with similar performance, the one that trains and runs inference faster can use computing resources more efficiently to iterate over more data, and so ends up with stronger capabilities. This is a simple but easily overlooked principle.”
In pursuit of speed, the MiniMax team has made a number of technical innovations to its large models, and MoE (Mixture of Experts) is one of them. At a time when this architecture had not yet been accepted by most experts, the team committed to it and took the lead in China in achieving a breakthrough on the core MoE algorithmic route.
According to the company, the Mixture of Experts design is based on the idea of specialization: tasks are classified and then assigned to multiple “experts” to solve. The contrasting concept is the dense model, a structure that behaves like a “generalist”. Compared with a generalist, a group of experts can complete complex tasks more efficiently and professionally, and model capacity can be increased dramatically without a significant increase in computational cost, which makes large models with trillions of parameters possible. In the abab-text-6.5s large language model developed by Xiyu Technology, the MoE model is reported to be 3 to 5 times faster than a dense model. This model handles billions of interactions per day, and MoE plays a key role in that.
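The routing idea described above can be sketched in a few lines of plain Python. This is a minimal illustration of generic top-k expert routing, not MiniMax's implementation: the names `make_expert`, `moe_forward`, `gate_weights` and the sizes (8 experts, 2 active per token) are assumptions made up for the example.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a short list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(rng, dim):
    # Hypothetical expert: a small random dim x dim linear map.
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    def expert(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
    return expert

def moe_forward(x, gate_weights, experts, top_k=2):
    # The gate scores every expert, but only the top_k experts are evaluated.
    scores = [sum(g * xi for g, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen

rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
experts = [make_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
gate_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_forward(token, gate_weights, experts, top_k=2)
print(f"experts evaluated: {len(used)} of {NUM_EXPERTS} -> indices {used}")
```

In this sketch the model holds the parameters of all eight experts, yet each token pays for only two expert evaluations. That is the sense in which capacity can grow faster than per-token compute; a dense "generalist" of the same total size would run every parameter for every token.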
Linear attention is another technical innovation adopted by the MiniMax team. Through algorithmic optimization, it turns the quadratic relationship between input length and computational complexity in the traditional model architecture into a linear one, taking a key step toward “infinitely long inputs and outputs”.
Yan Junjie introduces the models and products developed by MiniMax.
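To make the quadratic-versus-linear distinction concrete, here is a small sketch of one well-known family of linear attention, the kernel feature-map formulation. The article does not say which variant MiniMax uses, so the feature map `phi` (elu(x) + 1) and the function names are assumptions for illustration only. The sketch compares two ways of computing the same result: scoring every query against every key, and summarizing the keys and values once and reusing that summary.

```python
import math

def phi(vec):
    # Positive feature map elu(x) + 1 (an assumed choice for this sketch).
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    # Every query is compared with every key: about n * n similarity scores.
    d_v = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        sims = [dot(fq, phi(k)) for k in K]
        z = sum(sims)
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / z for j in range(d_v)])
    return out

def linear_attention(Q, K, V):
    # Keys and values are summarized once into a d x d_v matrix and a d-vector,
    # so the work grows linearly with the sequence length n.
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            k_sum[i] += fk[i]
            for j in range(d_v):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        z = dot(fq, k_sum)
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / z for j in range(d_v)])
    return out

# Tiny made-up inputs: 3 positions, feature size 2.
Q = [[0.1, -0.4], [1.2, 0.3], [-0.7, 0.9]]
K = [[0.5, 0.5], [-1.0, 0.2], [0.3, -0.8]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for row_s, row_f in zip(slow, fast) for a, b in zip(row_s, row_f))
print(f"largest difference between the two computations: {max_diff:.2e}")
```

Both functions return the same values up to floating-point rounding. The difference is cost: the first computes a score for every query-key pair, while the second touches each position a fixed number of times. Note that this kernelized form is not numerically identical to standard softmax attention; it replaces the softmax similarity with a different one that permits the reordering.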
Inviting users to experience video and music AI creation
Supported by Mixture of Experts, linear attention and other technologies, the video model abab-video-1 offers a high compression rate, good adherence to text prompts, and native support for high-resolution, high-frame-rate video with a cinematic look. The music model abab-music-1 supports versatile end-to-end music generation: it can synthesize instrumental music, orchestral works and other forms of music, and can generate accompaniment and vocals at the same time. This is expected to greatly simplify music recording and creation and to let amateurs make music. Readers can visit the web version of “Hailuo AI” (www.hailuoai.com/video) to try creating videos and music.
Video generated by the MiniMax large model
Xiyu Technology has also updated its speech model abab-speech-1, which can generate synthesized speech in Mandarin, Cantonese, Japanese, Korean, Spanish and other languages, with highly human-like delivery and subtle, natural emotional variation.
According to Yan Junjie, MiniMax's large models currently have 3 billion interactions with end users every day, processing more than 3 trillion text tokens and generating 20 million images and 70,000 hours of speech daily.
Video generated by the MiniMax large model
The 3 billion daily interactions come from the company's own products, such as “Hailuo AI” and “Xingye”, as well as from partners on its open platform. For example, Kingsoft Office works with MiniMax so that, through chain of thought, WPS can show the large model's reasoning steps when generating document summaries and answering users' questions, which improves the transparency and credibility of its answers. The mobile office platform DingTalk has gained copy generation and format compliance capabilities through the partnership, improving user productivity. The online literature platform Yuewen has gained the ability to quickly grasp overall context, so that when producing audiobooks of long novels it can keep emotions consistent, accurately analyze characters' emotions and deliver stylized performances. The recruitment platform Zhilian Zhaopin uses vertical, full-volume industry data to fine-tune the model, significantly improving the accuracy of AI interview evaluation, job description information extraction and resume matching.
With the release of the video, music and speech models, Xiyu Technology now has a full set of multimodal large model products. Yan Junjie revealed that in the next few weeks the company will release the multimodal large model abab 7, which is benchmarked against GPT-4o in speed and quality and will then be put to the test by partners and end users.