According to a new analysis from Artificial Analysis, OpenAI’s flagship large language model for ChatGPT, GPT-4o, has substantially regressed in recent weeks, bringing the cutting-edge model’s performance roughly in line with that of the far smaller, and significantly less capable, GPT-4o-mini model.
This investigation comes less than 24 hours after the company announced an update to the GPT-4o model. “The model’s creative writing ability has leveled up: more natural, engaging, and tailored writing to improve relevance & readability,” OpenAI posted on X. “It’s also better at working with uploaded files, providing deeper insights & more thorough responses.” It is now unclear whether those assertions will stand up.
“We have completed running our independent evals on OpenAI’s GPT-4o released yesterday and are consistently measuring materially lower eval scores than the August release of GPT-4o,” Artificial Analysis announced in an X post on Thursday, noting that the model’s Artificial Analysis Quality Index decreased from 77 to 71.
Furthermore, GPT-4o’s performance on the GPQA Diamond benchmark dropped from 51% to 39%, while its score on the MATH benchmark fell from 78% to 69%.
At the same time, the researchers observed that the model’s response speed more than doubled, increasing from around 80 output tokens per second to approximately 180 tokens per second. “We have generally observed significantly faster speeds on launch day for OpenAI models (likely due to OpenAI provisioning capacity ahead of adoption), but previously have not seen a 2x speed difference,” the study’s authors said.
“Based on this data, we conclude that it is likely that OpenAI’s Nov 20th GPT-4o model is a smaller model than the August release,” the authors continued. “Given that OpenAI has not cut prices for the Nov 20th version, we recommend that developers do not shift workloads away from the August version without careful testing.”
GPT-4o was initially introduced in May 2024 as a successor to the GPT-3.5 and GPT-4 models. According to OpenAI, GPT-4o achieves cutting-edge benchmark performance across voice, language, and vision tasks, making it well suited for advanced applications such as real-time translation and conversational AI.