In Part I of this series on corporate data design, I went over a fairly old but still relevant debate between a top-down approach to data warehousing (the Inmon method) and a bottom-up one (the Kimball method). In short, the top-down method starts at the broadest view of the business and attempts to design an idealized end state for the data, while the bottom-up approach focuses on tactical business needs first, compiling their outputs into higher-level reporting more organically.

I pointed out something that many have recognized: this is a false dichotomy. There are other choices.

As evidence of this, I pointed out that the data marts in the Kimball approach are already somewhat polished compared to the realities of the data most businesses generate on the ground floor. If top-down and bottom-up really were the only two options, the latter strategy would actually start at the bottom, with that raw, messy data, rather than with already-shaped marts.

This brings me to the data vault method, due largely to Dan Linstedt. Let me say, here, that this approach might not be quite what you expect given my setup. The data vault still tackles data design top-down, in the sense that it aims to create, from the outset, a single repository of all of the organization’s data. Where it differs from the Inmon method is that a data vault does not seek to normalize and consolidate the data in the same way. Tactically, the Inmon method avoids redundancy and attempts to produce a coherent whole. Building a data vault, on the other hand, means loading all of the raw data, such that even records that contradict each other in interpretation end up stored in the same warehouse.

The structure of a data vault is also more like a network. Hubs represent the entities core to the business, like customers or deals. Satellites hang off those hubs, each providing further information about an entity from a particular domain-specific source of data. On top of these, links connect hubs to one another, so that relationships between core records, such as a customer being party to a deal, are captured without reshaping the underlying data.
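To make that shape a little more concrete, here’s a minimal sketch in Python. The class and field names are my own illustration, not Linstedt’s specification; in a real implementation these would be relational tables with hash keys, load timestamps and record sources, but the relationships between the pieces are the same.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Hub:
    """A core business entity, identified only by its business key."""
    hub_key: str          # surrogate or hash key
    business_key: str     # e.g. a customer number from the source system
    record_source: str    # which system supplied the key
    load_date: datetime

@dataclass
class Satellite:
    """Descriptive attributes for one hub (or link), from one source."""
    parent_key: str       # the hub or link this describes
    record_source: str    # e.g. "crm", "billing", "web_analytics"
    load_date: datetime
    attributes: dict = field(default_factory=dict)  # raw, source-specific payload

@dataclass
class Link:
    """A relationship between two or more hubs, e.g. customer <-> deal."""
    link_key: str
    hub_keys: tuple
    record_source: str
    load_date: datetime
```

Because every satellite keeps its own record source and load date, two systems can describe the same customer differently without either overwriting, or even having to agree with, the other.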

Tied at Cedar Lake, Algonquin Park, Ontario, Canada – Fujifilm X-Pro3 – 2022

Vault or lake?

This sounds a bit like a similar concept: the data lake. The data lake is similarly meant to be a centralized place where data is deposited, but most descriptions of a data lake come with even less structure. A data lake can consist of both structured and unstructured data. It typically lives in an interconnected environment, but some of the data might be in relational tables, while other parts are blobs, log files or anything else.

Now, the term ‘data lake’ seems at least as loosely defined as the other terms I’ve mentioned here. I’ve heard it used synonymously with ‘data warehouse’. I’ve heard it used to describe exactly the scenario of the ‘data vault’. I’ve heard it described as a place where data is simply pooled for narrower purposes, on an ad hoc basis.

However, most uses of the term seem to agree that the data all arrives in its native format. The data in a lake is raw, untransformed since it was obtained from its source. It’s a consolidation of the unfiltered, poured into one place. The liquid metaphor is meant to imply the fluidity of the contents, in contrast with a warehouse or a vault, which is more rigidly structured.

The data vault has some structure and is meant to be built as a centralized repository of everything, or at least everything significant to the business. You’re filling the vault with all of your valuable first-party data, and not necessarily for specific tactical outcomes.

Data lakes, however, really are more of a bottom-up approach. The water level rises over time (I’m going to milk this metaphor for all it’s worth), but a data lake is typically filling up with new sources of data, added as needed. The goal is to put everything being worked on in a central place, not to work on everything.

Ideals are hard, and expensive

I am sympathetic to the vision of a single source of truth: one place to go to acquire the data you need to answer questions as they come. I have worked on things like this in little start-up organizations and with huge enterprises. Extremely rarely does the vision really get fulfilled, at least in my experience.

It’s also expensive. One of the best arguments against the top-down approach to designing data products is the sheer cost. That cost is not simply the financial investment required to have teams working on connecting everything together, but also the missed opportunities that come from the time overhead. Getting to a functional data warehouse that is pulling in an organization-wide view, normalizing data sources and collecting a wide breadth of data takes a lot of time, and during much of it, there’s not much being delivered to the consumers of the data.

None of this means that this is not a worthy goal to strive for, but it is prudent to consider that most organizations will face challenges in getting all the ducks in a row. In large organizations, I have seen internal politics ensure that data stays siloed. In small ones, I have seen the ideal burn through limited resources before anything useful materializes. I have seen enormous stores of data produced and managed with impressive rigor, only to have the stakeholders who would benefit from them ignore them entirely; they don’t trust the data, sometimes because the tools they look at independently don’t match what the normalized view is telling them.

So, what would it look like to start in the middle? Some clues come from the data vault and data lake approaches.

Choosing a narrative voice

The data vault is, basically, a top-down approach. However, it differs from the methods Inmon describes in that it does not lean as heavily on normalization. The idealized data warehouse is fully self-consistent. The facts that live there are harmonious; even where they may differ, strictly speaking, from the individual sources they are drawn from, they tell a story with a single, omniscient narrator.

Unfortunately for the people who need the data—though perhaps fortunately for those who get paid to spend time figuring this out—perspective is important.

Down at Cedar Lake, Algonquin Park, Ontario, Canada – Sigma dp1 Quattro – 2022

It’s all relative

Actually, if we want to get picky about it, perspective is all there is. Consider relativity in physics. We know that there are many conditions that can produce differences in observation from different perspectives, especially where time is concerned. Travel fast enough, and you’ll disagree with an observer you’re moving relative to about the order in which some events happened. A core thing we’ve learned is that there is no observer-independent truth of the matter about that ordering. Both observers are correct, and the contradiction disappears if we nest their claims in the context of their frames of reference.

Context is paramount here, and where business matters are concerned, we don’t need all of the fuss about what is exactly true. What matters is what we are trying to do with the information, and whether the information helps us reliably achieve that. But that differs depending on the function of the data in the organization, and on that of the people working with it. A core insight of Linstedt’s methods, and a motivation for the data vault, is that a single source of truth is neither realistic nor truly honest about the meaning of the data. It would be more prudent, instead, to have a single source of facts: a place to get the data from across the org, along with some contextual information to understand what it carries.
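As a sketch of what I mean by a single source of facts, rather than of truth, imagine each value never travelling without its context. The field names and numbers below are invented purely for illustration; the point is only that two figures that disagree can sit side by side without contradicting each other, because each is qualified by where it came from and how it was defined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A value that never travels without its context."""
    metric: str
    value: float
    source: str       # which system reported it
    definition: str   # how that system defines the metric

# Two "conversions" figures that differ, but don't contradict:
facts = [
    Fact("conversions", 1240, "web_analytics", "on-site purchase events, client-side tagging"),
    Fact("conversions", 1078, "ad_platform", "attributed ad clicks within a 30-day window"),
]

for f in facts:
    print(f"{f.metric} = {f.value:.0f} ({f.source}: {f.definition})")
```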

In some cases, it’s helpful to have a single value representing a top-level KPI that people in different teams agree on, but not all contexts benefit from this, especially considering the cost of crafting that number and of dropping the conflicting data along the way.

Coming from the perspective of digital analytics, I’m quite familiar with the differences in data sets that, at first glance, are reporting the same thing. An ecommerce platform, Google Analytics, Adobe Analytics and Google Ads will report different numbers for things like conversions, users, sessions, etc. This is, in part, a product of the limitations and context of each, but it’s also a difference in definitions. What counts as a user, and whether a data platform can recognize a return visitor, and how long a session lasts between actions, and whether the data comes from a client-side or server-side implementation… these are all just a part of what causes reporting differences.
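As a toy illustration of how definitions alone move the numbers, the sketch below counts sessions from the same handful of timestamped hits under two different inactivity timeouts. The data and the timeout values are invented; real platforms layer identity, attribution and tagging differences on top of this.

```python
from datetime import datetime, timedelta

# One visitor's hit timestamps (invented data).
hits = [
    datetime(2024, 5, 1, 9, 0),
    datetime(2024, 5, 1, 9, 20),
    datetime(2024, 5, 1, 9, 55),
    datetime(2024, 5, 1, 11, 0),
]

def count_sessions(timestamps, timeout):
    """Count sessions, starting a new one whenever the gap between hits exceeds the timeout."""
    sessions = 1
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > timeout:
            sessions += 1
    return sessions

# The same hits, two different definitions of a "session":
print(count_sessions(hits, timedelta(minutes=30)))  # 3 sessions
print(count_sessions(hits, timedelta(minutes=60)))  # 2 sessions
```

Neither count is wrong; they are answers to two differently defined questions.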

I mentioned that wrapping conflicting values in the details of their context removes the apparent contradictions in what they’re reporting. Google is notably in the process of rolling out a change across Google Analytics and its ad platforms. GA has long had a configurable notion of “conversions”, but the ad platforms also define “conversions” based on their own tagging and attribution models. The problem is that, in aggregate, “conversions” were always different between reports from these different tools, and seeing integrated reporting from the ad platforms inside GA generated confusion. Google is renaming the GA version of conversions to “key events”, so that the meaning of the term “conversion” is shared with the incoming ad platform data, while customizable events that rely on GA-only definitions can still get the appropriate ROI reporting and custom attribution associated with them.

Much ink has already been spilled on this matter, with some objection to the change. Having been exposed to some of the product teams’ thinking on this, though, I do think there’s a more thoughtful aim behind the changes than is universally acknowledged. Ultimately, what data consumers need to understand is the important coupling of data with the context it comes from. The two cannot be separated without losing something important.

Certainty always has a cost

Of course, we believe there’s an objective truth of the matter, when it comes to these facts. There’s a finite number of people who bought a thing, and they have identities, and they spent finite amounts of money. Those “true” values aren’t really ambiguous. Instinctively, we want to capture and report on the most accurate and complete data we can.

The thing is, measurement can never be perfectly accurate and perfectly complete at the same time. Here we have another mirroring of the physical world. This claim about measurement is true at a fundamental level of physics: the more precisely you pin down one property of a particle, such as its position, the more uncertain its complementary property, its momentum, becomes.

In our usual contexts, we have enough aggregation of particles to make such problems disappear, but we have an analogous practical problem. We have finite resources, and finite time, and business needs are constantly present and frequently shifting. The more time you spend trying to produce the perfect consolidated record, the fewer records you can make available, and ultimately the less you know, even if you’re certain you know it.

I’m tempted to dive into a deeper epistemological problem that would challenge that certainty in any case, but that’s going to lead us down an unproductive rabbit hole. In a business context, we must focus on the practical. And speaking practically, we do ourselves a disservice trying to pull data away from the teams that generate and use it.

That’ll be my next question in this series: should we be pulling data away?

This series continues in Part III: The mesh that’s already there.


Cover Photograph: Costello Lake Through the Trees, Algonquin Park, Ontario, Canada – Ricoh GR III – 2020

I’m Head of Product at Napkyn, a provider of digital analytics and media solutions. I’m also a father and a photographer, and I have a background in philosophy. This is my site.