Model Drift: Are Large Language Models Getting “Dumber” Over Time?

Since ChatGPT entered our lives in November 2022 and LLM providers multiplied, a new Large Language Model (LLM) arrives at increasingly shorter intervals. After every launch, platforms like LinkedIn, X, and Reddit fill with praise about how great the model is and how successfully it performs tasks, along with days of debate over which professions it will (or won't) bury in history. This "hype" fades over time, only to surge back with full force whenever a new model comes out.
Exactly at the point in this cycle where the "hype" fades, another phenomenon appears. Some time after a new model comes out and trends on social media, we catch ourselves thinking, "Is [so-and-so] model not as smart as it used to be, or is it just me?" In the early days I wasn't sure whether we were just imagining things, but what we noticed was real: models lose competence over time, a phenomenon called "model drift".
So how is it that flagship models, trained over months in sprawling hyperscale data centers, lose their competence over time? Could they seriously be getting "dumber"? Is this a spontaneous, unavoidable mechanic, or are there restrictions intentionally put in place by the major LLM providers? Considering that Apple shipped software updates that intentionally slowed down older devices, the idea is not surprising at all.
The Reality of Model Drift
One of the leading studies on this subject examined GPT-3.5 and GPT-4. GPT-4, for example, was asked 1,000 questions requiring it to distinguish prime numbers from composite numbers. Its accuracy was 84% in March 2023 but dropped to roughly 50% by June 2023 (Chen, Zaharia and Zou, 2023). The phenomenon, then, is definitely real; but it is essential to dig a little deeper to understand the reasons.
Causes Leading to Model Drift
Alignment Tax
Foundation models are often oversimplified as "predicting the next word in a sentence," and they are indeed, at their core, probability models. One of the steps in the fine-tuning stage, after a Large Language Model is pre-trained, is Reinforcement Learning from Human Feedback (RLHF). In this process, a reward function is learned in which desired kinds of answers are rewarded and unwanted kinds are penalized, and the base model's probability distribution is then optimized against that function. But human feedback is neither flawless nor guaranteed to be rational. While this process gives the model indispensable qualities, above all safety measures and ethical guardrails, it can also push the model away from what users actually ask for. The price of a safer model is paid in reduced scores on capability metrics; this trade-off is what the heading calls the alignment tax.
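For readers who want the mechanics, here is the standard KL-regularized RLHF objective in rough form (a textbook formulation, not something specific to any one provider): the tuned model π_θ is pushed to maximize the learned reward r_φ, while a KL penalty weighted by β keeps it close to the base model π_ref. The alignment tax lives in that tension: the harder the reward is chased, the further the model moves from the raw capabilities of its base distribution.

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] \;-\; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
$$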
Cost Optimization (Model Distillation)
I mentioned that Large Language Models are, at bottom, probability models. They contain billions of parameters, or weights, stored as floating-point numbers. The more bits used to represent each weight, the more precisely the model's probabilities are computed. When a model first opens to the public, it is allowed to run at full precision to create the desired echo on social media; once user habits have formed, the weights are "distilled": represented with fewer bits. (Strictly speaking, this is quantization; in distillation proper, a smaller "student" model is trained to imitate the large one, but both serve the same cost-cutting goal.) Model sizes shrink, performance weakens, and operating costs drop, but since user habits are already established, the model lives off its old reputation for a while.
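A toy sketch of what that precision loss looks like in practice, simulating post-training int8 quantization of a weight tensor (the sizes and scales here are illustrative, not from any real model):

```python
import numpy as np

# Toy example: simulate post-training quantization of model weights.
# float32 weights are mapped to 8-bit integers and back; the rounding
# error is the precision the model permanently loses.

rng = np.random.default_rng(42)
weights = rng.normal(0, 0.02, size=100_000).astype(np.float32)

# Symmetric int8 quantization: one scale factor for the whole tensor.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte instead of 4
dequantized = quantized.astype(np.float32) * scale

error = np.abs(weights - dequantized)
print(f"memory: {weights.nbytes / 1024:.0f} KiB -> {quantized.nbytes / 1024:.0f} KiB")
print(f"mean rounding error: {error.mean():.2e}, max: {error.max():.2e}")
```

The memory footprint drops to a quarter, but every weight now carries a small, permanent rounding error, and those errors accumulate across billions of parameters.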
Mixture of Experts
A single query may contain subtasks requiring different specializations. To answer such queries as well as possible, the models we see on the frontend do not work alone; behind the scenes they are a collaboration of many components. A router receives the query, analyzes it, and dispatches subtasks to the "expert" (or micro-service) best suited to each one. Part of the query might require coding, part creative writing, and part web search. The router then combines and formats the results it receives back from the experts. Depending on the demand on those experts, the system's load balancers may substitute: a task may be forwarded not to the expert that would do it best, but to one that will do it well enough and is under less load. This is why momentary, instantaneous drifts in quality can also be observed.
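A minimal sketch of the gating idea at the heart of this, with the caveat that in a real MoE architecture this routing happens per token inside the model's layers, while the load-balancing behavior described above sits at the serving layer on top. The gate and experts below are toy stand-ins:

```python
import numpy as np

# Toy top-k gating, the core of a Mixture-of-Experts router: a small
# gating network scores every expert for the incoming input, and only
# the best-scoring expert(s) actually run.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(x, gate_weights, experts, k=2):
    scores = softmax(gate_weights @ x)    # how well each expert fits x
    top_k = np.argsort(scores)[-k:]       # keep only the k best experts
    # Weighted combination of the chosen experts; the rest stay idle.
    return sum(scores[i] * experts[i](x) for i in top_k)

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
gate_weights = rng.normal(size=(n_experts, dim))
# Stand-in "experts": independent linear transforms.
experts = [lambda x, W=rng.normal(size=(dim, dim)): W @ x for _ in range(n_experts)]

print(route(rng.normal(size=dim), gate_weights, experts))
```

If a load balancer overrides the top-scoring choice with a less loaded one, the output changes even though the query did not; that is the instantaneous drift the paragraph describes.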
Comparison Psychology and Contrast
New models, stacking significant innovation and capability on top of the old ones, overshadow them so thoroughly that we stop looking at the models we once used with excitement. If for some reason we run into an old model's outputs, the contrast effect kicks in, and we even start questioning how we ever delegated work to it and took such risks. Especially if the old model's last impression on us dates from a period when it was under the influence of model drift, the prestige of a freshly launched model, one for which every resource is mobilized and cash is burned freely, skyrockets, and the old model is sent off with a perhaps undeserved defeat.
Problems Caused by Model Drift in AI-Supported Applications and Solutions
Narrowing the context as much as possible in tasks delegated to LLMs, and building deterministic system architectures around them, may be the most important working principle for ensuring consistent quality. If the work to be done is clear, and the fact that a system has been built implies a task performed repeatedly, then as little room as possible should be left for the LLM's creativity. The wider the context, the more varied the formats of the answers we receive, and the more our workflows break.
Another benefit of deterministic systems is that they make jumping from model to model easier. Since new models initially run at their full technical capacity, they will exhibit superior performance on many metrics. The more the expected answer is bounded by thick lines, the easier it is to avoid "creativity entropy" and chaos: unexpected results from switching to a new model (from any provider) are prevented, output quality and thus the functioning of the system are secured, and adopting new models becomes painless.
As an example, take a task where we need to analyze customer reviews along a few dimensions. Instead of shaping the evaluation prompt as "Rate this review in terms of satisfaction out of 10, where 1 is not satisfied at all and 10 is very satisfied," we can shape it as "Label this review in terms of satisfaction as 'Fully satisfied', 'Partially satisfied', 'Not satisfied at all', or 'Satisfied with some things, not with others'; use only the labels given to you." This deterministic setup helps minimize logic errors, as the sketch below illustrates.
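A minimal sketch of that setup, assuming an OpenAI-style chat completions API; the model name is illustrative and swapping it is deliberately a one-line change:

```python
from openai import OpenAI

# Deterministic-leaning setup: fixed label set, temperature 0, and output
# validation so a drifting model cannot silently break the pipeline.

LABELS = [
    "Fully satisfied",
    "Partially satisfied",
    "Not satisfied at all",
    "Satisfied with some things, not with others",
]

client = OpenAI()

def label_review(review: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Label this review in terms of satisfaction. "
        f"Use only one of these labels, verbatim: {', '.join(LABELS)}.\n\n"
        f"Review: {review}"
    )
    response = client.chat.completions.create(
        model=model,                 # swapping models is a one-line change
        messages=[{"role": "user", "content": prompt}],
        temperature=0,               # minimize sampling variance
    )
    answer = response.choices[0].message.content.strip()
    if answer not in LABELS:
        # Fail loudly instead of letting a drifted answer leak downstream.
        raise ValueError(f"Out-of-schema answer: {answer!r}")
    return answer
```

The validation step is the point: when a model drifts, or a new model is swapped in, out-of-schema answers raise immediately instead of quietly corrupting the workflow.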
In my projects, such as Clinic Scores, Siyasentez, and Mevzuat, I make extensive use of artificial intelligence's ability to extract information from unstructured data. Problems I would normally have to solve with supervised machine learning, setting up a multi-layered model-training pipeline, have become solvable with a single API call. In Clinic Scores, insights of remarkably high quality and consistency are generated by projecting user evaluations onto dimensions such as each surgical procedure, economic affordability, and the communication style of clinic responses. To generate insights that are actually useful, you have to map the user journey: define the questions users ask in their decision-making process, the data they collect to answer those questions, and how they process that data. The more granularly each decision can be handled, the easier it becomes to simulate the process.

While the flood of empty, AI-generated content on the internet has given rise to the "Dead Internet Theory," the very same tools can be used to clean and distill that noise pollution into high value-added insights; a sketch of what such a call can look like follows below.
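To make the "single API call instead of a supervised-ML pipeline" point concrete, here is a hedged sketch of multi-dimension extraction from a raw review. The dimensions mirror the ones mentioned above, but the field names and vocabularies are my own illustration, not the actual Clinic Scores schema:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema for illustration; not the actual Clinic Scores pipeline.
EXTRACTION_PROMPT = """Extract the following from the patient review as JSON:
- "procedure": the surgical procedure mentioned, or null
- "affordability": one of "affordable", "expensive", "not mentioned"
- "communication": one of "positive", "negative", "not mentioned"
Return only the JSON object.

Review: {review}"""

def extract_insights(review: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(review=review)}],
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode, on models that support it
    )
    data = json.loads(response.choices[0].message.content)  # raises on malformed output
    # Validate the closed vocabularies so downstream aggregation stays deterministic.
    assert data["affordability"] in {"affordable", "expensive", "not mentioned"}
    assert data["communication"] in {"positive", "negative", "not mentioned"}
    return data
```

The closed vocabularies keep the aggregation layer deterministic, so thousands of noisy reviews can be collapsed into consistent, comparable insights regardless of which model happens to be behind the call.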