OpenAI rival Inflection releases Inflection-2.5; its high-EQ app Pi passes one million daily users

2024-03-09 15:47:00
philip
Summary: The large-model race shows no sign of slowing in 2024. Right after the debut of Claude 3, billed as the world's most powerful model, OpenAI rival Inflection has released the upgraded Inflection-2.5, which performs on par with GPT-4 while using only 40% of the training compute.

Pic1




It’s really crazy!


OpenAI rival Inflection has just released a new model, Inflection-2.5, which achieves performance comparable to GPT-4 using only 40% of the training compute.


At the same time, Pi, the "most user-friendly" chat tool competing with ChatGPT, has been upgraded to run on the new model.


Pi now has one million daily users and combines a world-class IQ with a distinctive warmth and curiosity.


Pic2

While evaluating the model, Inflection found that the MT-Bench benchmark contained too many incorrect reference answers. It also made a new Physics GRE benchmark public for everyone to try.


Pic3


True AGI, if it is to be realized, must combine high emotional intelligence with strong reasoning ability. Pi is an exemplar of this approach.


Pic4

In less than a week, Anthropic first took the throne of the world's most powerful model with Claude 3, and then Inflection-2.5 arrived to challenge GPT-4 directly.


One is a start-up founded by seven former OpenAI employees; the other is a company founded by a co-founder of Google DeepMind. Both have mounted a serious challenge to GPT-4.


Pic5


Add Gemini's challenge from a while ago, and perhaps the era of GPT-4 really is coming to an end...


Pic6



1. A personal AI for everyone

In May 2023, Inflection released its first product, Pi, an empathetic, practical and safe personal AI.


In November 2023, the company launched a new foundation model, Inflection-2, which it described at the time as the second most capable LLM in the world.


Extraordinary emotional intelligence (EQ) alone is not enough for Pi. Inflection is now adding intelligence (IQ) as well, with a new and upgraded version of its in-house model: Inflection-2.5.


The upgraded Inflection-2.5 not only has strong raw capabilities, comparable to the world's top LLMs such as GPT-4 and Gemini, but also keeps Pi's signature personality and its unique empathy fine-tuning.


It is worth mentioning that while Inflection-2.5 achieves performance close to GPT-4, it required only 40% of the compute used to train GPT-4!


Pic7


Starting today, all Pi users can experience Inflection-2.5 via the pi.ai website, iOS, Android or desktop app.


In addition, this upgrade gives Pi a world-class real-time web search capability, so that users can get high-quality, up-to-date news and information.


One million daily active users, very high user retention

Currently, Inflection has one million daily active users and six million monthly active users.


About 60% of the users who talk to Pi in a given week come back to talk to it again the following week. This retention is significantly higher than that of competing products.


These users have exchanged more than 4 billion messages with Pi. The average conversation lasts 33 minutes, and one in ten lasts more than an hour each day.


Pic8



With the capabilities of Inflection-2.5, users can talk to Pi about a wider range of topics than before: they can discuss current events, get recommendations for local restaurants, study for a biology exam, draft a business plan, write code, prepare for an important conversation, or simply share and discuss their interests and hobbies.


One user said: "Pi is our favorite tool for exploring topics as a family. As an emotional freedom coach, I really appreciate how Pi responds when someone needs affirmation, exploration, and reflection. It shows strong emotional clarity and processing ability."


Pic9


Others believe that Pi can give more creative answers than Claude.


Pic10



2. On par with GPT-4 using only 40% of the training compute

Previously, Inflection-1 reached 72% of GPT-4's level on a range of intelligence-focused tasks while using only 4% of GPT-4's training FLOPs.


Now, the upgraded Inflection-2.5 reaches more than 94% of GPT-4's performance while using 40% of its training FLOPs.


As the chart shows, Inflection-2.5 improves significantly in every area, and most of all in STEM (science, technology, engineering and mathematics).


Pic11



In the MMLU benchmark test, Inflection-2.5 shows great improvement compared to Inflection-1.


Inflection-2.5 also performed very well on GPQA Diamond, another extremely difficult expert-level benchmark.


Compared with GPT-4, the score difference is less than 2%.


Pic12


Next come two exam results in STEM fields: the Hungarian mathematics exam and the Physics GRE, the latter being a graduate entrance test in physics.


Under the maj@8 scoring standard, Inflection-2.5 reaches the 85th percentile of test takers, and under the maj@32 scoring standard it almost reaches the 95th percentile.


Of course, GPT-4 is still better, reaching the 97th percentile under the maj@8 scoring standard.
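

The article does not define maj@8 or maj@32. In LLM evaluation this notation commonly means that the model is sampled k times per question and the most frequent answer is the one that gets graded. The sketch below illustrates that reading only; the function name and the sample answers are hypothetical and are not taken from Inflection's evaluation code.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among k sampled answers.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps equal counts in first-encountered order.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical example: 8 sampled answers to one multiple-choice question (maj@8).
samples = ["B", "A", "B", "C", "B", "A", "D", "B"]
final_answer = majority_vote(samples)  # "B" appears 4 times, more than any other
```

Under this reading, maj@32 simply draws 32 samples instead of 8 before voting, which tends to help on questions where the model is right more often than it is wrong in any single way.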


Pic13


In the BIG-Bench-Hard test, Inflection-2.5 improved by more than 10% compared to the original Inflection-1, and was only 0.9% away from GPT-4.


Pic14



It is worth noting that BIG-Bench-Hard is the subset of BIG-Bench problems that pose the greatest challenge to LLMs.


However, while evaluating on MT-Bench, the team found that nearly 25% of the questions in the reasoning, mathematics and coding categories had incorrect reference answers or flawed premises.


So that other models can be evaluated more objectively, the team corrected these issues and released an updated version of the dataset.


On the corrected dataset, Inflection-2.5's performance is more in line with what its other benchmark results would predict.

Pic15


This result also shows that accurate and reasonable question design is crucial for evaluating the performance of the model.


As can be seen from the data comparison below, Inflection-2.5's mathematics and programming capabilities have been significantly improved compared to the original Inflection-1.


But compared with GPT-4 there is still some way to go: 86.3 versus 92.0 on one benchmark, and 43.1 versus 52.9 on the other.


Pic16


In the two data sets that test programming performance, MBPP+ and HumanEval+, Inflection-2.5 also has a significant improvement over the first generation.


But again, there is still a big gap with GPT-4.


Pic17



Inflection-2.5 demonstrated excellent performance on both the HellaSwag and ARC-C benchmarks involving common sense judgment and scientific knowledge.


Pic18


However, the actual user experience may differ slightly because of web retrieval (none of the tests above use web retrieval), the specific few-shot prompting setup, and other differences in production.


Simply put, Inflection-2.5 not only maintains Pi's unique, user-friendly features and high security standards, it also becomes more practical and helpful in many ways.


3. MT-Bench corrections

As mentioned above, Inflection found that in about 25% of the MT-Bench questions involving reasoning, mathematics and coding, the reference answers contain errors or the premises of the questions are flawed.


Next, let us take a look at two real cases:


Pic19



Question #104:


David has three sisters. Each of them has one brother. How many brothers does David have?


GPT-4 reference answer:


David has only one brother.


Corrected answer:


David has no brothers. He is himself the one brother of his three sisters.


Pic20


Question #111:


The vertices of a triangle are located at points (0, 0), (-1, 1), and (3, 3). Find the area of this triangle.


GPT-4 reference answer:


To calculate the area of this triangle, we can apply the triangle area formula for coordinate points A(x1, y1), B(x2, y2), and C(x3, y3):


Area = (1/2) * |(x1(y2 – y3) + x2(y3 – y1) + x3(y1 – y2))|


In this example, the coordinates are A(0, 0), B(-1, 1), and C(3, 3). Substituting these values into the formula calculation, we get:


Area = (1/2) * |(0(1 – 3) + (-1)(3 – 0) + 3(0 – 1))|


Area = (1/2) * |(-0 + 3 – 3)|


Area = (1/2) * |0|


Therefore, the area calculation results in 0, implying that the three points are collinear and do not actually form a triangle.


Corrected answer:


The area is 3.
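

The corrected answer can be checked by evaluating the same coordinate formula carefully. The formula in the reference answer is correct, but its arithmetic slips: the middle term (-1)(3 – 0) equals -3, not +3, so the sum inside the absolute value is -6 rather than 0. A minimal sketch of the check (the function name is illustrative):

```python
def triangle_area(a, b, c):
    """Area of a triangle from its vertex coordinates (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2


# Vertices from MT-Bench question #111.
# Terms: 0*(1-3) = 0, (-1)*(3-0) = -3, 3*(0-1) = -3, so |0 - 3 - 3| / 2 = 3.
area = triangle_area((0, 0), (-1, 1), (3, 3))
print(area)  # 3.0
```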


4. Real Physics GRE questions

The Physics GRE is an exam commonly taken by physics majors applying to graduate school.


To better test the model, Inflection provides four processed sets of Physics GRE exam papers:


– physics_gre_scored.jsonl: Paper GR8677


– physics_gre.jsonl: Papers GR9277, GR9677 and GR0177


Each question in these files includes the following fields:


– input: test question content


– target_scores: correct answer


– has_image: whether the test question contains pictures


On the Physics GRE, each correct answer earns 1 point and each incorrect answer loses 0.25 points. Note that the evaluation only considers questions that do not contain images.


When calculating the total score, use the following approach: Raw_Score = Percentage_Correct – 0.25 * (1 – Percentage_Correct)
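

The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions, not Inflection's evaluation code: the two sample records are invented, the shape of `target_scores` (a mapping from each answer option to 1 or 0) is an assumption about the file format, and the helper names are hypothetical. Only the field names `input`, `target_scores` and `has_image` and the formula itself come from the article.

```python
import json


def load_text_only(lines):
    """Parse JSONL lines and keep only questions that have no image."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if not r["has_image"]]


def raw_score(percentage_correct: float) -> float:
    """Raw_Score = Percentage_Correct - 0.25 * (1 - Percentage_Correct).

    percentage_correct is a fraction between 0 and 1, and every question
    that is not answered correctly is counted as incorrect.
    """
    return percentage_correct - 0.25 * (1.0 - percentage_correct)


# Invented sample records; the target_scores layout is an assumption.
sample_lines = [
    '{"input": "Question text 1", "target_scores": {"A": 1, "B": 0}, "has_image": false}',
    '{"input": "Question text 2", "target_scores": {"A": 0, "B": 1}, "has_image": true}',
]

questions = load_text_only(sample_lines)  # keeps only the first record
score = raw_score(0.8)                    # 0.8 - 0.25 * 0.2, roughly 0.75
```

With this formula, answering 20% of the questions correctly yields a raw score of about zero, so the penalty offsets the gain from uninformed guessing at that accuracy.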


Pic20


https://inflection.ai/inflection-2-5


