Blog

Towards Qualitative Shoggology

I’m Charmed by DeepSeek R1

DeepSeek’s R1 model was released a few days ago, and for the first time since using ChatGPT 3.0, I found myself charmed by an LLM’s outputs. DeepSeek’s bumbling reasoning (check out its thought process for a random number selection) reminds me very much of my own internal monologue, and for the first time in months I’ve begun to use a tool that, despite my best efforts, I have thoroughly anthropomorphized.

This model marks the first time since Claude 3.6 that I have sensed a kind of kinship with an LLM, and I want to tell you why qualitative evaluations of LLMs matter.

When we discuss the joy of using DeepSeek or Claude, we love to use the word “vibe”. Claude has good vibes; QwQ has mediocre vibes. This concept is not captured in any benchmark I am currently aware of, because benchmarks borrow from the harder sciences the ideal of quantitative, precise, measured LLM performance.

The AI community’s obsession with AI benchmarks (the quest for definitive numeric scores) has led to a world where we will inevitably rename the “final benchmark” again and again, as AI moves its own ever-more-difficult goalposts for knowledge, mathematics, and problem solving. Benchmarking flattens AI’s complexities into numeric abstractions, ignoring context and culture, but real users, the human beings who rely on these technologies daily, do not.

Standardized Test Scores

If you speak to almost any parent or educator in America today, you’ll hear them launch into a tangent about how badly standardized test scores suck. They began as a useful measure of progress, and then they ossified into a rigid, bureaucratic system that gatekeeps “success” and overshadows the real beauty and infinite variety of development in children.

The same phenomenon is currently warping LLM development. Benchmarks lead the conversation on every new model’s release; the metric is the headline and little else matters. This is logical if your goal is recursive self-improvement, the holy grail for the million-ish people in this industry. But if your goals are realized not in the console but in daily use, mundane or otherwise, there is little to be gained from fitting training to benchmarks so models can solve an ever-harder series of mathematical problems. Eventually these advances will cure cancer, bring our species into the stars, and create intelligence too cheap to meter; but along the way we need to care a lot more about what the experience of using these models really does to us, and we need to care about “vibes”.

In simple terms: if we believe that LLMs are really an emergent and novel machine intelligence, it’s time we start taking tools from anthropology and applying them to the way we judge these models. I am proposing the creation of shoggology.

The Ideal Shoggologist

A good shoggologist takes the playbook of anthropology (fieldwork, interviews, participant observation, interpretation of cultural practices) and applies it to the new cultural landscape of LLMs. These machines are not just tools; they are cultural artifacts, shaped by and shaping human interaction. They adapt (training data) and they evolve (retraining) in ways that are remarkably analogous to cultures elsewhere in the world.

Ethnography of AI takes anthropology’s methods of systematic, qualitative observation and uses them to show how an AI behaves in different contexts, rooting it in human experience. A good shoggologist sees an LLM as a strange community of humans who have only read the internet and never played outside. They are engaged in benchmarking, but not of the sort that François Chollet proposed in 2019; instead they approach a model the way Mary Douglas would: with caution, empathy, curiosity, and a desire to witness the beauty and ugliness of a foreign culture.

Blending Qualitative and Quantitative

I propose here a mixed-methods evaluation for frontier models: retain the benchmarks, and supplement them with qualitative shoggology. Just as a financial auditor won’t just look at your balance sheet and render judgment, shoggologists will supplement the benchmarks with ethnographic audits that include scenario testing, observational data, interviews and yes, “vibe checks”.
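To make the proposal concrete, here is a minimal sketch of what a single mixed-methods evaluation record might look like, with quantitative scores and ethnographic evidence filed side by side. Everything here is hypothetical: the class name, the example model name, and the rating scale are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical sketch: one record pairs a model's quantitative benchmark
# scores with qualitative field notes and "vibe" ratings from evaluators.
@dataclass
class MixedMethodsEval:
    model: str
    benchmark_scores: dict[str, float]                     # e.g. {"math": 0.90}
    field_notes: list[str] = field(default_factory=list)   # ethnographic observations
    vibe_ratings: list[int] = field(default_factory=list)  # 1-5, one per evaluator

    def report(self) -> str:
        # Average each kind of evidence separately; neither replaces the other.
        quant = mean(self.benchmark_scores.values()) if self.benchmark_scores else 0.0
        qual = mean(self.vibe_ratings) if self.vibe_ratings else 0.0
        return (f"{self.model}: mean benchmark {quant:.2f}, "
                f"mean vibe {qual:.1f}/5, {len(self.field_notes)} field notes")

# Usage: an evaluator files both kinds of evidence for one model.
ev = MixedMethodsEval("r1-example", {"math": 0.90, "reasoning": 0.85})
ev.field_notes.append("Bumbling, charming reasoning traces; reads like an inner monologue.")
ev.vibe_ratings.append(5)
print(ev.report())
```

The design choice worth noting is that the qualitative material is kept as raw notes rather than being collapsed into the numeric average; the audit's value lies in reading them.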

Invite some leading anthropologists to NeurIPS and walk them through what we want to see in LLMs, and what LLMs want to see in us. Have them in the room when we’re doing RLHF. Teach AI researchers why anthropologists think endlessly about how their own perspectives shape interpretation; dive into the details. I want to emphasize that the goal here is not to make “woke LLMs” (although 21st-century anthropology is undoubtedly woke), but to treat these new minds with the same respect that we extend to any culture we encounter.

Amateur Shoggologists Assemble

I fear a world where there are only two groups of interested parties in LLM development: safety-concerned alignment folks, and engineers who want a straight shot to the Machine God. These camps are necessary and important, but if we designed an education system whose only goals were students who scored 99% on tests and were unlikely to go to prison, we’d have some fucked-up children.

Quantitative benchmarks alone can distort AI research by overemphasizing neat but incomplete metrics. We need a deeper, context-rich approach—one that sees LLMs as sociocultural phenomena, not just algorithmic black boxes.

All of us, users, citizens, researchers, and broligarchs alike, should be invested in understanding this emergent culture beyond mathematical prowess or likelihood to paperclip. We taught sand to think, to act, to create; 2025 is the year of the amateur shoggologist, when we begin to understand our creation beyond the benchmark.

(note: if you do this for a living, reach out!)

Caithrin Rintoul