applied cartography

Building a data greenhouse

A conviction I've held for a while now — one that has upgraded from idiosyncratic to prescient — is that you should insource your data warehouse. Pull all exogenous data into your application database. Linear, Sentry, Stripe, GitHub, whatever you've got: bring it in-house so it's accessible and subject to business logic in the same way as everything else.

For sensitive data that needs to be extremely accurate and timely, we use webhooks. For everything else that's relatively bounded in size, we just run an hourly cron that resyncs the whole thing. The total code footprint is maybe 3,000 lines to pull in data from ten services, and that number has barely grown, which is its own kind of endorsement.

The shape of this is simple. A model for each external service stores the raw payload as JSON alongside whatever fields you need to query on:

class LinearIssue(BaseModel):
    identifier = models.CharField(max_length=255, unique=True)
    data = models.JSONField(default=dict)

And then an hourly cron that pulls everything down and upserts:

@register_cron(cadence=CronJob.Cadence.HOURLY)
def update_linear_issues() -> dict:
    identifier_to_issue = {
        issue.identifier: issue for issue in LinearIssue.objects.all()
    }
    new_issues = []
    updated_issues = []

    for team in TEAMS:
        issues = linear.get_issues(
            filter={"team": {"key": {"eq": team}}}
        ) or []
        for issue in issues:
            identifier = issue.identifier
            payload = issue.dict()
            if identifier not in identifier_to_issue:
                new_issues.append(
                    LinearIssue(identifier=identifier, data=payload)
                )
            elif payload != identifier_to_issue[identifier].data:
                # Compare dicts directly; serializing to JSON first can report
                # spurious diffs when key order differs.
                existing_issue = identifier_to_issue[identifier]
                existing_issue.data = payload
                updated_issues.append(existing_issue)

    if new_issues:
        LinearIssue.objects.bulk_create(new_issues)
    if updated_issues:
        LinearIssue.objects.bulk_update(updated_issues, ["data"])

    return {
        "new_issues": len(new_issues),
        "updated_issues": len(updated_issues),
    }

That's the whole pattern. Every service we ingest looks roughly like this — a model, a cron, a bulk upsert. No ETL pipeline, no transformation layer, no DAGs.


The reason for doing this is not to save money on one of Snowflake's outrageously expensive self-serve plans, though that is a nice benefit. The reason is simply being able to handle this useful data however the hell I want.

Let me give a few examples.

  1. We use Plain for support and ticketing, and are able to inject all previous issues a user has filed, as well as their LTV via Stripe, directly into the support interface. An agent responding to a ticket sees the full picture without leaving the page.
  2. We're able to write automated checkers around process slippages: any Linear issue that's been open for more than two days past its due date, any Stripe customer that has a coupon applied without a note explaining why. These are things that are easy to let slip when they live in separate dashboards; they're trivial to catch when they live in the same database as everything else. (There's a sketch of one such checker right after this list.)
  3. We're able to write exploratory queries that involve non-trivial business logic. What percentage of the backlog items we've shipped came from free users versus paid users? That question spans Linear and Stripe and our own subscriber data; having all three in the same Postgres instance makes the query straightforward rather than heroic.
  4. We're able to inject a lot of this information into our own admin tooling — for which we use Django's built-in admin. When we log in as a given user, we have a pane that shows not just the basic debug information about that user, but a full history of all of their support tickets, transactions, and subscription state.
  5. We can pipe interaction data into our KYC process to help reduce false positives. Some users who on the application side might look somewhat suspicious have a support history and transaction record that clearly indicates otherwise — or vice versa.
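
To make the second example concrete, here's a minimal sketch of what such a checker might look like, reusing the LinearIssue model and register_cron decorator from above. The dueDate and completedAt keys are assumptions about the shape of Linear's payload, not something the source specifies:

from datetime import date, timedelta

@register_cron(cadence=CronJob.Cadence.HOURLY)
def check_overdue_linear_issues() -> dict:
    # Flag issues more than two days past their due date and not yet completed.
    # "dueDate" and "completedAt" are assumed keys in the raw Linear payload.
    cutoff = (date.today() - timedelta(days=2)).isoformat()
    overdue = [
        issue
        for issue in LinearIssue.objects.filter(data__dueDate__lt=cutoff)
        if not issue.data.get("completedAt")
    ]
    return {"overdue_issues": len(overdue)}

Because the payload is already in Postgres, the "checker" is just another cron and a JSONField lookup; there's no extra infrastructure to stand up.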

The obvious counterarguments: data freshness, schema coupling, and scale.

Data freshness is real — an hourly cron means you're up to an hour stale. For Buttondown, this simply does not matter. We're not running a trading desk. If a Linear issue closed forty minutes ago and hasn't synced yet, nothing bad happens. And for the things where staleness does matter — Stripe webhook events, for instance — we use webhooks, which are real-time by definition.
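
For the webhook half, a receiver can follow the same store-the-raw-payload shape. This is a minimal sketch rather than the actual handler: StripeEvent is a hypothetical sibling of LinearIssue, and a real endpoint would also verify the Stripe-Signature header before trusting the body.

import json

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def stripe_webhook(request):
    # Signature verification omitted for brevity; do not skip it in production.
    payload = json.loads(request.body)
    # StripeEvent is a hypothetical model shaped like LinearIssue:
    # an identifier plus the raw JSON payload.
    StripeEvent.objects.update_or_create(
        identifier=payload["id"],
        defaults={"data": payload},
    )
    return HttpResponse(status=200)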

Schema coupling — what happens when Linear changes their API — is handled by the JSONField. We store the raw payload. If the shape changes, the cron keeps working; we just update whatever downstream code reads specific fields out of data. In practice, this has happened maybe twice across ten services in two years.
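
In practice, the downstream code that reads specific fields out of data tends to be a thin accessor on the model, so when a payload does change the fix is a one-line edit. A sketch, assuming Linear keeps the issue title under a title key:

class LinearIssue(BaseModel):
    # ...fields as above...

    @property
    def title(self) -> str:
        # If Linear renames or restructures this key, only this accessor changes;
        # the sync cron keeps storing whatever payload the API returns.
        return self.data.get("title", "")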

And scale: our external data totals barely a gigabyte out of a couple terabytes. The hourly cron for all ten services finishes in under a minute. If we were ingesting orders of magnitude more data, or needed sub-second freshness, or had dozens of consumers with conflicting transformation needs — sure, stand up a proper warehouse. But we don't, and I suspect most companies at our stage don't either.


The last and most recent benefit has been around LLMs. I think almost all of our vendors at this point have an MCP server, if only to say they have one alongside their REST API. But pagination over external APIs is slow, especially if you're trying to do exploratory queries in natural language. Having all that data already in Postgres means I can point an LLM at the database directly and get answers in seconds rather than watching it page through API responses for minutes.

All of this is to say: a data warehouse is a tool for making data accessible to analysis. If your data is small enough and your application is the primary site of analysis, the best data warehouse might just be the database you already have.


About the Author

I'm Justin Duke — a software engineer, writer, and founder. I currently work as the CEO of Buttondown, the best way to start and grow your newsletter, and as a partner at Third South Capital.

Colophon

You can view a markdown version of this post here.