Compass Navigation Streamline Icon: https://streamlinehq.com
applied cartography

Denormalization 101

I have had exactly two conversations this week about denormalization, which is a sign that it is time to write a blog post. If you already know what denormalization is, you will likely find this blog post uninteresting. That is okay! It is meant more to be an artifact that I can point to when people who haven't been burnt by the fires of sad databases (or, perhaps, sad ETL pipelines) ask what I'm talking about.

The well-behaved schema

Imagine the data behind an art gallery. Galleries run calls — open invitations for submissions — and artists submit entries. The application that captures all this stores it the way Codd would want: tidy, normalized, every fact in exactly one place.

Gallery
PKidint
namestring
urlstring
statestring
Call
PKidint
FKgallery_idint
titlestring
deadlinedate
Entry
PKidint
FKcall_idint
FKartist_idint
titlestring
Artist
PKidint
namestring
emailstring
statestring

Here's what a handful of rows actually look like — the keys (gallery_id, call_id, artist_id) are the threads you pull to stitch it back together:

id name url state
1 Pace pacegallery.com NY
2 Gagosian gagosian.com NY
3 White Cube whitecube.com CA

Every fact lives once. An artist's state is on the artist; a call's deadline is on the call. This is normalization, and for the system that has to keep the data correct — accept a submission, change an address, never contradict itself — it is exactly right.

I honestly struggle to talk about normalization, and I think even the concept of database design (or at least data modeling) has become a little less in vogue because best practice has just sort of been absorbed into the collective subconscious. Everyone knows how to model stuff a little bit better now than they did 20 years ago. Most databases look something like the above.

Asking questions

Sooner or later someone asks an actual question. Say whoever's running the gallery wants to know how much of their interest is local — concretely: how many entries come from an artist based in the same state as the gallery running the call?

That's one sentence. Against the normalized schema, it's four tables and three joins, because the two facts you need to compare — the artist's state and the gallery's state — live at opposite ends of the graph:

select count(*)
from entry e
join call    c on c.id = e.call_id
join gallery g on g.id = c.gallery_id
join artist  a on a.id = e.artist_id
where a.state = g.state;

You have to hold the whole shape in your head — entry to call to gallery on one side, entry to artist on the other — just to line up two state columns. And this is the toy version — real questions pile on: break it down by gallery, restrict to this year's calls, weight it by how many entries each artist submitted.

This, at least in an abstract sense, has two problems:

  1. Meaningful insights come from combining disparate datasets, and normalization makes that definitionally more cumbersome.
  2. Joins are slow.

Simply eat the other tables

The answer is denormalization!

You build the thing that makes it easy:

EntryView
PKidint
titlestring
artist_namestring
artist_emailstring
artist_statestring
call_titlestring
deadlinedate
gallery_namestring
gallery_urlstring
gallery_statestring
id title artist_name artist_state call_title deadline gallery_name gallery_state
101 Field No. 4 Agnes Martin NM Spring Open 2026-03-15 Pace NY
102 Marfa Light Donald Judd TX Spring Open 2026-03-15 Pace NY
103 Balloon Dog Jeff Koons NY Summer Group 2026-06-01 Gagosian NY

This is denormalization, and the warehouse crowd has a dozen names for the result — a wide table, a one-big-table model, the flat fact table at the center of a star schema, a feature table if you're about to train something on it. ETL, materialized views, whatever — the implementation is a detail; the point is that this pattern has a dozen names.

The artist's state and the gallery's state now sit in the same row. You've reintroduced, deliberately, the redundancy normalization warned you about — Spring Open and Pace repeat once per entry — because the join you used to pay on every query, you now pay once.

And the query becomes the question, almost verbatim:

select count(*)
from entry_view
where artist_state = gallery_state;

How lovely!

[points at butterfly] is this a cache?

So obviously there's a reason we don't do this for everything: we now have to figure out how to build the table. This is a rich subject for which I am much less qualified or interested to opine; if you're reading this blog for answers, the answer is something like "run a daily cron which makes the big ol' join manually and reconstitutes the table".

At this point, you have yourselves a cached table with twenty four hours of freshness. You have traded (presumably) a bit of operational toil for a bit of... something. What is that something?

  • Sometimes you're caching knowledge — you just want one clear, blessed definition of what an EntryView is, a shape everyone can point at and treat as the source of truth. If so, just through it in a view! Don't worry about materialization.
  • Other times you're caching reads — the join itself is the problem, too slow or too expensive to pay for on every query. That's the one that earns a materialized table and an ETL pipeline to keep it fresh, operational toil and all.

Both are valid reasons, but as someone in the business of reducing his exposure to moving parts I am morally obligated to make sure you know which camp you're in.

Postscript: I once again recommend a data greenhouse

I wrote about building a data greenhouse four months ago, and it's largely in service of this problem. Being able to trivially answer stuff like "how quickly do we solve bugs from US-based users vs. EU ones?" is wildly actionable.


About the Author

I'm Justin Duke — a software engineer, writer, and founder. I currently work as the CEO of Buttondown, the best way to start and grow your newsletter, and as a partner at Third South Capital.

Colophon

You can view a markdown version of this post here.