Stanford and Berkeley researchers raise concerns over ChatGPT's declining performance

Researchers from Stanford and Berkeley have released a new study reporting worrying trends in the "behavior" of OpenAI's large language models (LLMs), specifically GPT-3.5 and GPT-4.

According to the study, which has not yet been peer-reviewed, the responses of both GPT-3.5 and GPT-4 changed significantly over a few months, with accuracy declining on some tasks. In one test, the accuracy of GPT-4 in identifying prime numbers dropped from 97.6 percent in March 2023 to just 2.4 percent in June 2023.

Another concerning finding was that both GPT-3.5 and GPT-4 made more formatting errors in code generation in June than they did in March. This apparent degradation is consistent with user reports that ChatGPT, which is powered by these models, has been giving less accurate responses over time.

OpenAI's vice president of product, Peter Welinder, denied claims that the decrease in accuracy was intentional, asserting that each new version of the model is designed to be smarter than its predecessor. He suggested that the perceived decline stems from heavier use: the more people rely on the model, the more likely they are to notice issues they had not seen before.

However, the Stanford and Berkeley research challenges this explanation. While the study doesn't offer specific reasons for the decline, it does question OpenAI's claim of continuous improvement. The researchers note that the observed performance deviations raise questions about whether GPT-4 is actually getting stronger.

The study emphasizes the importance of understanding the trade-offs involved in updating a model, since improving some aspects can inadvertently reduce performance in others. ChatGPT has already been criticized for its inaccuracies, and a quick update may worsen the problem rather than solve it.