Denormalization 101

2026/06/23

I have had exactly two conversations this week about denormalization, which is a sign that it is time to write a blog post. If you already know what denormalization is, you will likely find this blog post uninteresting. That is okay! It is meant more to be an artifact that I can point to when people who haven't been burnt by the fires of sad databases (or, perhaps, sad ETL pipelines) ask what I'm talking about.

The well-behaved schema

Imagine the data behind an art gallery. Galleries run calls — open invitations for submissions — and artists submit entries. The application that captures all this stores it the way Codd would want: tidy, normalized, every fact in exactly one place.

Gallery
PK	id	int
	name	string
	url	string
	state	string

Call
PK	id	int
FK	gallery_id	int
	title	string
	deadline	date

Entry
PK	id	int
FK	call_id	int
FK	artist_id	int
	title	string

Artist
PK	id	int
	name	string
	email	string
	state	string

Here's what a handful of rows actually look like — the keys (gallery_id, call_id, artist_id) are the threads you pull to stitch it back together:

id	name	url	state
1	Pace	pacegallery.com	NY
2	Gagosian	gagosian.com	NY
3	White Cube	whitecube.com	CA

id	gallery_id	title	deadline
11	1	Spring Open	2026-03-15
12	2	Summer Group	2026-06-01
13	3	Emerging Voices	2026-09-20

id	call_id	artist_id	title
101	11	5	Field No. 4
102	11	6	Marfa Light
103	12	7	Balloon Dog

id	name	email	state
5	Agnes Martin	agnes@example.com	NM
6	Donald Judd	donald@example.com	TX
7	Jeff Koons	jeff@example.com	NY

Every fact lives once. An artist's state is on the artist; a call's deadline is on the call. This is normalization, and for the system that has to keep the data correct — accept a submission, change an address, never contradict itself — it is exactly right.

I honestly struggle to talk about normalization, and I think even the concept of database design (or at least data modeling) has become a little less in vogue because best practice has just sort of been absorbed into the collective subconscious. Everyone knows how to model stuff a little bit better now than they did 20 years ago. Most databases look something like the above.

Asking questions

Sooner or later someone asks an actual question. Say whoever's running the gallery wants to know how much of their interest is local — concretely: how many entries come from an artist based in the same state as the gallery running the call?

That's one sentence. Against the normalized schema, it's four tables and three joins, because the two facts you need to compare — the artist's state and the gallery's state — live at opposite ends of the graph:

select count(*)
from entry e
join call    c on c.id = e.call_id
join gallery g on g.id = c.gallery_id
join artist  a on a.id = e.artist_id
where a.state = g.state;

You have to hold the whole shape in your head — entry to call to gallery on one side, entry to artist on the other — just to line up two state columns. And this is the toy version — real questions pile on: break it down by gallery, restrict to this year's calls, weight it by how many entries each artist submitted.

This, at least in an abstract sense, has two problems:

Meaningful insights come from combining disparate datasets, and normalization makes that definitionally more cumbersome.
Joins are slow.

Simply eat the other tables

The answer is denormalization!

You build the thing that makes it easy:

EntryView
PK	id	int
	title	string
	artist_name	string
	artist_email	string
	artist_state	string
	call_title	string
	deadline	date
	gallery_name	string
	gallery_url	string
	gallery_state	string

id	title	artist_name	artist_state	call_title	deadline	gallery_name	gallery_state
101	Field No. 4	Agnes Martin	NM	Spring Open	2026-03-15	Pace	NY
102	Marfa Light	Donald Judd	TX	Spring Open	2026-03-15	Pace	NY
103	Balloon Dog	Jeff Koons	NY	Summer Group	2026-06-01	Gagosian	NY

This is denormalization, and the warehouse crowd has a dozen names for the result — a wide table, a one-big-table model, the flat fact table at the center of a star schema, a feature table if you're about to train something on it. ETL, materialized views, whatever — the implementation is a detail; the point is that this pattern has a dozen names.

The artist's state and the gallery's state now sit in the same row. You've reintroduced, deliberately, the redundancy normalization warned you about — Spring Open and Pace repeat once per entry — because the join you used to pay on every query, you now pay once.

And the query becomes the question, almost verbatim:

select count(*)
from entry_view
where artist_state = gallery_state;

How lovely!

[points at butterfly] is this a cache?

So obviously there's a reason we don't do this for everything: we now have to figure out how to build the table. This is a rich subject for which I am much less qualified or interested to opine; if you're reading this blog for answers, the answer is something like "run a daily cron which makes the big ol' join manually and reconstitutes the table".

At this point, you have yourselves a cached table with twenty four hours of freshness. You have traded (presumably) a bit of operational toil for a bit of... something. What is that something?

Sometimes you're caching knowledge — you just want one clear, blessed definition of what an EntryView is, a shape everyone can point at and treat as the source of truth. If so, just through it in a view! Don't worry about materialization.
Other times you're caching reads — the join itself is the problem, too slow or too expensive to pay for on every query. That's the one that earns a materialized table and an ETL pipeline to keep it fresh, operational toil and all.

Both are valid reasons, but as someone in the business of reducing his exposure to moving parts I am morally obligated to make sure you know which camp you're in.

Postscript: I once again recommend a data greenhouse

I wrote about building a data greenhouse four months ago, and it's largely in service of this problem. Being able to trivially answer stuff like "how quickly do we solve bugs from US-based users vs. EU ones?" is wildly actionable.

Denormalization 101

The well-behaved schema

Asking questions

Simply eat the other tables

[points at butterfly] is this a cache?

Postscript: I once again recommend a data greenhouse

About the Author

Colophon

Denormalization 101

The well-behaved schema

Asking questions

Simply eat the other tables

[points at butterfly] is this a cache?

Postscript: I once again recommend a data greenhouse

About the Author

Keep in touch

Colophon