UC Berkeley researchers found that ChatGPT has not improved over time, and in fact, may have gotten worse.
Jose Antonio Lanz•
The decline was especially steep in the chatbot’s software coding abilities.
"For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June," the research found. These results were obtained by using the pure version of the models, meaning, no code interpreter plugins were involved.
To assess reasoning, the researchers leveraged visual prompts from the Abstract Reasoning Corpus (ARC) dataset. Even here, while not as steep, a decline was observable. “GPT-4 in June made mistakes on queries on which it was correct for in March” the study reads.
What could explain ChatGPT's apparent downgrade after just a few months? Researchers hypothesize it may be a side effect of optimizations being made by OpenAI, its creator.
One possibility cause is changes introduced to prevent ChatGPT from answering dangerous questions. This safety alignment could impair ChatGPT's usefulness for other tasks, though. The researchers found the model now tends to give verbose, indirect responses instead of clear answers.
"GPT-4 is getting worse over time, not better," said AI expert Santiago Valderrama on Twitter. Valderrama also raised the possibility that a "cheaper and faster" mixture of models may have replaced the original ChatGPT architecture.
“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he hypothesized, which he said could accelerate responses for users but reduce competency.
Another expert, Dr. Jm, Fan also shared his insights on a Twitter Thread.
“Unfortunately, more safety typically comes at the cost of less usefulness,” he wrote, saying he was trying to make sense of the results by linking them to the way OpenAI finetunes its models. “My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn't have time to fully recover the other capabilities that matter.”
Fan argues that other factors may have come into play, namely cost-cutting efforts, the introduction of warnings and disclaimers that may “dumb down” the model, and the lack of broader feedback from the community.
While more comprehensive testing is warranted, the findings align with users' expressed frustrations over declining coherence in ChatGPT's once eloquent outputs.
How can we prevent further deterioration? Some enthusiasts advocated for open-source models like Meta's LLaMA (which has just been updated) that enable community debugging. Continuous benchmarking to catch regressions early is crucial.
For now, ChatGPT fans may need to temper their expectations. The wild idea-generating machine many first encountered appears tamer—and perhaps less brilliant. But age-related decline appears to be inevitable, even for AI celebrities.