AoE2 Goat Neural Network Exposes LLM Reasoning Myths

Picture this: a medieval strategy game, a scenario editor, some goats, and a working neural network. Not a metaphor. Not a tech-bro pitch deck slide. A Microsoft researcher actually did this, and the point of the whole stunt is one of the most useful ideas anyone in AI has put forward in years: stop assuming that large language models think the way humans do, just because they learned from human language.

## The Setup: Goats as Bits, Bridges as Logic

Adrian de Wynter, a researcher at Microsoft and the University of York, built a functioning neural network inside the map editor of Age of Empires II, according to The Decoder. The design is completely absurd on purpose. A goat standing on grass equals 0. A goat standing on a bridge equals 1. De Wynter constructs logic gates using the scenario editor's scripting tools, and ice ramps with waiting goats keep the calculations from getting scrambled. The finished mini-network consists of two XNOR gates and one AND gate, and it learns the logical AND function.

That is a real, working neural network. It runs on a 1999 real-time strategy game. The goats do not know this.

De Wynter goes further in the appendix, according to The Decoder: he demonstrates that, in theory, any computer could be replicated using an idealized version of the game, making Age of Empires II as computationally expressive as any substrate that can run an LLM. Which means, if you are willing to argue that an LLM is conscious or sentient because it processes language and produces human-sounding outputs, you have to extend that same argument to the goats. You probably do not want to do that.

## The Actual Argument: Anthropomorphism Is a Design Bug

The paper's thesis, as covered by 404 Media, is that "the point of the paper is to formally show that we anthropomorphise too readily." That is not a vibe; it is a methodological critique with direct consequences for how AI systems get built, tested, and trusted.
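
The goat network from the setup is small enough to write out in a few lines, which makes it easier to see how little machinery sits behind the word "learns." The sketch below is an illustration only, not de Wynter's construction: the wiring (each input passes through an XNOR with a one-bit weight, and an AND combines the results) and the "training" by exhaustive search are assumptions made here, and the names `network` and `train` are hypothetical.

```python
from itertools import product

# Encoding taken from the article: goat on grass = 0, goat on a bridge = 1.
GRASS, BRIDGE = 0, 1


def xnor(a: int, b: int) -> int:
    """Return 1 when both bits agree, 0 otherwise."""
    return 1 if a == b else 0


def and_gate(a: int, b: int) -> int:
    """Return 1 only when both bits are 1."""
    return a & b


def network(x1: int, x2: int, w1: int, w2: int) -> int:
    """Two XNOR gates feeding one AND gate (assumed wiring)."""
    return and_gate(xnor(x1, w1), xnor(x2, w2))


def train() -> tuple:
    """Assumed training rule: try every weight pair and keep the first
    one that reproduces the AND truth table on all four inputs."""
    inputs = list(product((GRASS, BRIDGE), repeat=2))
    for w1, w2 in product((GRASS, BRIDGE), repeat=2):
        if all(network(x1, x2, w1, w2) == (x1 & x2) for x1, x2 in inputs):
            return w1, w2
    raise ValueError("no weight pair reproduces AND")


weights = train()
print("learned weights:", weights)
for x1, x2 in product((GRASS, BRIDGE), repeat=2):
    print(x1, x2, "->", network(x1, x2, *weights))
```

Under these assumptions the only weight pair that works is the one where both weight goats stand on bridges, because an XNOR against 1 passes its input through unchanged. Nothing in that search looks like thinking, which is the point.
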
When researchers and product teams assume that an LLM reasons the way a human does because it was trained on human text, they design evaluations around that assumption. They ask models to explain their reasoning, treat fluent output as evidence of understanding, and mistake pattern-matching at scale for genuine inference. De Wynter's experiment is a formal reductio ad absurdum: the same logical properties that get attributed to LLMs as evidence of human-like cognition are present in a system made of medieval farm animals and palisade walls.

For anyone building with AI, this is not a reason to distrust every model output. It is a reason to design your tests and your trust calibration around what LLMs actually do, which is next-token prediction over learned statistical patterns, rather than what they appear to do, which is think. The distinction matters enormously when you are deciding whether to let an AI system handle consequential tasks unsupervised.

## What Builders and Learners Should Take From This

PC Gamer reported the headline framing directly from de Wynter's stated goal: to make people "stop assuming that LLMs behave like humans just because they were trained with natural language." That is actionable advice, not just academic chest-thumping. If you are learning to build with AI tools, the most durable skill you can develop right now is the habit of testing outputs against ground truth rather than against whether the response sounds confident and coherent. An LLM that explains its answer fluently is not necessarily correct; it is just very good at sounding like it is.

XDA Developers framed the project as proof that LLMs are not sentient, and that framing holds up. But the more constructive read is that sentience is the wrong question entirely. The useful question is: under what conditions does this system produce reliable outputs, and how do I verify them?
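
One way to make the habit of checking against ground truth concrete is a tiny scoring harness. This is a minimal sketch under stated assumptions: `accuracy`, `confident_stub`, and `CASES` are hypothetical names invented for illustration, and the stub stands in for a real model call, which would need its own client code. The harness scores exact matches against known answers and ignores how fluent or confident the response sounds.

```python
from typing import Callable, Iterable, Tuple


def accuracy(model: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of prompts whose answer exactly matches the known answer."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one labeled case")
    correct = sum(
        1 for prompt, expected in cases if model(prompt).strip() == expected
    )
    return correct / len(cases)


def confident_stub(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: always answers "4",
    no matter what was asked."""
    return "4"


# Hypothetical labeled cases: (prompt, ground-truth answer).
CASES = [
    ("2 + 2", "4"),
    ("3 + 3", "6"),
    ("10 - 6", "4"),
    ("5 + 5", "10"),
]

score = accuracy(confident_stub, CASES)
print(f"accuracy against ground truth: {score:.2f}")
```

The stub answers every prompt with the same assurance and is right on only half of these cases. Exact string matching is a deliberately crude choice; a real evaluation would likely need answer normalization and a much larger labeled set.
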
De Wynter's goat network cannot answer a customer support ticket or write a lesson plan, but it makes the underlying architecture legible in a way that a hundred explainer articles have failed to do. Sometimes the clearest proof is the most absurd one. That is 10 out of 10 methodology, zero out of 10 livestock welfare implications, and exactly the kind of research that should be required reading before anyone ships an AI feature.

Watch this space: as AI evaluation frameworks evolve, expect de Wynter's core argument, that substrate independence is the reason behavioral tests for sentience or human-like reasoning are fundamentally unreliable, to show up in how serious teams define "AI safety" and model auditing. The goats got there first.

## Sources

- Microsoft researcher builds a working neural network out of goats in Age of Empires II to critique AI science