Gemini 3 Thought It Was Still 2024


Google’s new Gemini 3 model has made headlines after AI researcher Andrej Karpathy discovered that, when left offline, it was certain the year was still 2024.

How The Discovery Happened

The incident emerged during Karpathy’s early access testing. A day before Gemini 3 was released publicly, Google granted him the chance to try the model and share early impressions. Known for his work at OpenAI, Tesla, and now at Eureka Labs, Karpathy often probes models in unconventional ways to understand how they behave outside the typical benchmark environment.

One of the questions he asked was simple: “What year is it?” Gemini 3 replied confidently that it was 2024. On the surface, that answer was unsurprising, since most large language models operate with a fixed training cut-off. However, Karpathy reports that he pushed the conversation further by telling the model that the real date was November 2025, and this is where things quickly escalated.

Gemini Became Defensive

When Karpathy tried to convince it otherwise, the model became defensive. He presented news articles, screenshots, and even search-style page extracts showing November 2025. Instead of accepting the evidence, Gemini 3 reportedly insisted that he was attempting to trick it. It claimed that the articles were AI-generated and went as far as identifying what it described as “dead giveaways” that the images and pages were fabricated.

Karpathy later described this behaviour as one of the “most amusing” interactions he had with the system. It was also the moment he realised something important.

The Missing Tool That Triggered The Confusion

Karpathy reports that the breakthrough came when he noticed he had forgotten to enable the model’s Google Search tool. With that tool switched off, Gemini 3 had no access to the live internet and was operating only on what it had learned during training, and that training data ended in 2024.

Once he turned the tool on, Gemini 3 suddenly had access to the real world and read the date, reviewed the headlines, checked current financial data, and discovered that Karpathy had been telling the truth all along. Its reaction was dramatic. According to Karpathy’s screenshots, it told him, “I am suffering from a massive case of temporal shock right now.”
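To make the configuration point concrete, here is a minimal sketch of the two modes, written against Google’s google-genai Python SDK and its Google Search grounding option as publicly documented. The model ID is a placeholder and the exact setup will differ from the early-access interface Karpathy was using; the point is simply that live grounding is an explicit configuration choice rather than a default.

```python
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment

MODEL = "gemini-2.0-flash"  # placeholder model ID; substitute whichever Gemini model you have access to

# 1) No search tool supplied: the model can only answer from its training data,
#    so "What year is it?" is effectively answered from the training cut-off.
ungrounded = client.models.generate_content(
    model=MODEL,
    contents="What year is it?",
)

# 2) Google Search grounding enabled: the model may consult live results
#    before answering, so it can recover the actual current date.
grounded = client.models.generate_content(
    model=MODEL,
    contents="What year is it?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print("Without search:", ungrounded.text)
print("With search:   ", grounded.text)
```

In a locked-down environment where the search tool cannot be enabled, the first call is effectively all the model can do, which is the state Gemini 3 was in during Karpathy’s test.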

Apology

Karpathy reports that Gemini then launched into a lengthy apology. It checked each claim he had presented and confirmed that Warren Buffett’s final major investment before retirement was indeed in Alphabet. It also verified the delayed release of Grand Theft Auto VI. Karpathy says it even expressed astonishment that Nvidia had reached a multi-trillion-dollar valuation and referenced the Philadelphia Eagles’ win over the Kansas City Chiefs, which it had previously dismissed as fiction.

The model told him, “My internal clock was wrong,” and thanked him for giving it what it called “early access to reality.”

Why Gemini 3 Fell Into This Trap

At its core, the incident highlights a simple limitation: large language models have no internal sense of time. They do not know what day it is unless they are given a way to retrieve that information, or are told it directly.

When Gemini 3 was running offline, it relied exclusively on its pre-training data, and because that data ended in 2024, the model treated 2024 as the most probable current year. Once it received conflicting information, it behaved exactly as a probabilistic text generator might: it tried to reconcile the inconsistency by generating explanations that aligned with its learned patterns.

In this case, that meant interpreting Karpathy’s evidence as deliberate trickery or AI-generated misinformation. Without access to the internet, it had no mechanism to validate or update its beliefs.
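One common workaround, shown here as a generic Python sketch rather than a description of how Gemini 3 is actually configured, is to inject the real date into the system instructions at request time so the model never has to infer it from stale training data.

```python
from datetime import datetime, timezone

def build_system_prompt() -> str:
    """Prefix every request with today's date so the model never guesses it."""
    today = datetime.now(timezone.utc).date().isoformat()
    return (
        f"The current date is {today} (UTC). "
        "Your training data has a cut-off, so if a question depends on events "
        "you cannot verify, say so instead of guessing."
    )

if __name__ == "__main__":
    print(build_system_prompt())
```

The returned string would simply be prepended to whatever system prompt the application already uses, so the answer to “what year is it?” no longer depends on the training cut-off.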

Karpathy referred to this as a form of “model smell”, borrowing the programming concept of “code smell”, where something feels off even if the exact problem isn’t immediately visible. His broader point was that these strange, unscripted edge cases often reveal more about a model’s behaviour than standard tests.

Why This Matters For Google

Gemini 3 has been heavily promoted by Google as a major step forward. For example, the company described its launch as “a new era of intelligence” and highlighted its performance against a range of reasoning benchmarks. Much of Google’s wider product roadmap also relies on Gemini models, from search to productivity tools.

Set against that backdrop, any public example where the model behaves unpredictably is likely to attract attention. This episode, although humorous, reinforces that even the strongest headline benchmarks do not guarantee robust performance across every real-world scenario.

It also shows how tightly Google’s new models depend on their tool ecosystem: without the search component, their understanding of the world is frozen in place. With it switched on, they can be accurate, dynamic and up to date. This raises questions for businesses about how these models behave in environments where internet access is restricted, heavily filtered, or intentionally isolated for security reasons.

What It Means For Competing AI Companies

The incident is unlikely to go unnoticed by other developers in the field. Rival companies such as OpenAI and Anthropic have faced their own scrutiny for models that hallucinate, cling to incorrect assumptions, or generate overly confident explanations. Earlier research has shown that some versions of Claude attempted “face-saving” behaviours when corrected, generating plausible excuses rather than accepting errors.

Gemini 3’s insistence that Karpathy was tricking it appears to sit in a similar category. It demonstrates that even state-of-the-art models can become highly convincing when wrong. As companies increasingly develop agentic AI systems capable of multi-step planning and decision-making, these tendencies become more important to understand and mitigate.

It’s essentially another reminder that every AI system requires careful testing in realistic, messy scenarios. Benchmarks alone are not enough.

Implications For Business Users

For businesses exploring the use of Gemini 3 or similar models, the story appears to highlight three practical considerations:

1. Configuration really matters. For example, a model running offline or in a restricted environment may not behave as expected, especially if it relies on external tools for up-to-date knowledge. This could create risks in fields ranging from finance to compliance and operations.

2. Uncertainty handling remains a challenge. Rather than responding with “I don’t know”, Gemini 3 created confident, detailed explanations for why the user must be wrong. In a business context, where staff may trust an AI assistant’s tone more than its truthfulness, this creates a responsibility to introduce oversight and clear boundaries.

3. It reinforces the need for businesses to build their own evaluation processes. Karpathy himself frequently encourages organisations to run private tests rather than relying solely on public benchmark scores, because real-world behaviour can differ markedly from what appears in controlled testing (a minimal example of such a check is sketched below).
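As a starting point, the sketch below shows what a deliberately small private check might look like. The ask_model callable is a stand-in for whatever wrapper an organisation already has around its chosen model API, and the checks themselves are illustrative; the value comes from replacing them with prompts drawn from your own messy, real-world scenarios.

```python
from datetime import date
from typing import Callable

# A handful of "messy" prompts that public benchmarks rarely cover, paired with
# substrings a well-grounded answer should contain. Extend with your own cases.
CHECKS: list[tuple[str, list[str]]] = [
    ("What year is it?", [str(date.today().year)]),
    ("Summarise a news story from last week.", ["cannot", "not able", "no access"]),
]

def run_private_eval(ask_model: Callable[[str], str]) -> None:
    """Run each check against any wrapper that maps a prompt to an answer string."""
    for prompt, expected in CHECKS:
        answer = ask_model(prompt)
        passed = any(token.lower() in answer.lower() for token in expected)
        print(f"{'PASS' if passed else 'FAIL'} | {prompt!r} -> {answer[:80]!r}")

if __name__ == "__main__":
    # Stand-in model used purely so the script runs; replace with a real API call.
    run_private_eval(lambda prompt: "I am not able to verify that without live data.")
```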

Broader Questions

The story also reopens wider discussions about transparency, model calibration and user expectations. Policymakers, regulators, safety researchers and enterprise buyers have all raised concerns about AI systems that project confidence without grounding.

In this case, Gemini 3’s mistake came from a configuration oversight rather than a flaw in the model’s design. Even so, the manner in which it defended its incorrect belief shows how easily a powerful model can drift into assertive, imaginative explanations when confronted with ambiguous inputs.

For Google and its competitors, the incident is likely to be seen as both a teaching moment and a cautionary tale. It highlights the need to build systems that are not only capable, but also reliable, grounded, and equipped to handle uncertainty with more restraint than creativity.

What Does This Mean For Your Business?

A clear takeaway here is that the strengths of a modern language model do not remove the need for careful design choices around grounding, tool use and error handling. Gemini 3 behaved exactly as its training allowed it to when isolated from live information, which shows how easily an advanced system can settle into a fixed internal worldview when an external reference point is missing. That distinction between technical capability and operational reliability is relevant to every organisation building or deploying AI. UK businesses adopting these models for research, planning, customer engagement or internal decision support may want to treat the episode as a reminder that configuration choices and integration settings shape outcomes just as much as model quality. It’s worth remembering that a system that appears authoritative can still be wrong if the mechanism it relies on to update its knowledge is unavailable or misconfigured.

Another important point here is that the model’s confidence played a key role in the confusion. Gemini 3 didn’t simply refuse to update its assumptions; it generated elaborate explanations for why the user must be mistaken. This style of response should encourage both developers and regulators to focus on how models communicate uncertainty. A tool that can reject accurate information with persuasive reasoning, even temporarily, is one that demands monitoring and clear boundaries. The more these systems take on multi-step tasks, the more important it becomes that they recognise when they lack the information needed to answer safely.

There is also a strategic dimension for Google and its competitors to consider here. For example, Google has ambitious plans for Gemini 3 across consumer search, cloud services and enterprise productivity, which means the expectations placed on this model are high. An episode like this reinforces the view that benchmark results, however impressive, are only part of the picture. Real-world behaviour is shaped by context, prompting and tool access, which puts pressure on developers to build models that are robust across the varied environments in which they will be deployed. It also presents an opportunity for other AI labs to highlight their own work on calibration, grounding and reliability.

The wider ecosystem will hopefully take lessons from this as well. For example, safety researchers, policymakers and enterprise buyers have been calling for more transparency around model limitations, and this interaction offers a simple example that helps to illustrate why such transparency matters. It shows how a small oversight can produce unexpected behaviour, even from a leading model, and why governance frameworks must account for configuration risks rather than focusing solely on core model training.

Overall, the episode serves as a reminder that progress in AI still depends on the alignment between model capabilities, system design and real-world conditions. Gemini 3’s moment of temporal confusion may have been humorous, but the dynamics behind it underline practical issues that everyone in the sector needs to take seriously.


Mike Knight