SeriousDataSolutions

What Can Data Catalogs Learn from Product Master Data Management?

Data catalogs are on fire in 2021. The number of entrants into the space is increasing, and there seems to be a tremendous demand for adoption.

There is a lot of variation in what is expected of a data catalog. Yet, even with such variation in expectations, surely there must be some fundamental, common idea of what a data catalog is meant to be. Well, we all know what data is, so it is the “catalog” part that we need to look at more closely. And the thesis of this article is that, if we do that, then we are going to find some eerie similarities with Master Data Management (MDM), particularly Product MDM.

“Catalog” is defined by the online Merriam-Webster dictionary as: a complete enumeration of items arranged systematically with descriptive details.

The items stored in a data catalog are usually termed “data assets” or just “assets”, for want of any better term. The items are not the data assets themselves but are the metadata about them.

Now, data assets are things. This is not the place to get into metaphysics, but the general reality we exist in principally consists of concepts, things and events.

In a database, these correspond to Reference Data, Master Data and Event Data, respectively. From a metaphysical viewpoint, “things” are bearers of properties — they have identities, attributes, relationships and can change over time. In the world of data management, they are usually entity types like Customer, Product, Financial Instrument and so on.

So, the conclusion seems to be that:

Data Assets are another Master Data entity type.

You could try to stretch this to say:

Metadata is Master Data.

However, some metadata could be Event Data, in which case this statement would be unjustified. But for the rest of this article, to keep things simple, let’s use the term “metadata” to mean just the information about data assets that is stored in a data catalog.

If metadata is master data, it is legitimate to ask if any lessons can be learned from MDM about how we should manage metadata. Where do we see catalogs in MDM? The answer is that they are closely connected to Product MDM. Product Catalogs are often (but not always) a component of an eCommerce site, along with functionality to, for example, purchase the goods being offered.

So, at this point, we have a good reason to ask if there are any useful insights that data catalogs can learn from Product MDM and from Product MDM catalogs in particular.

Product Life Cycle

Products have a life cycle. A notional example might be:

Ideation > Design > Prototype > Testing > Manufacturing > Discontinuation of Manufacturing > Discontinuation of Warranty and Support

These phases have to be reflected in the automated support that Product MDM provides. If our analogy holds, then a data asset, like a dataset or SQL query, should follow some kind of life cycle that the data catalog will help to manage.
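To make the analogy a little more concrete, here is a minimal sketch of what a life cycle for a data asset might look like. The stage names, the strictly sequential ordering, and the discoverability rule are all my own assumptions for illustration; they are not taken from any Product MDM system or data catalog product.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical stages for a data asset, loosely mirroring the product phases.
    PROPOSED = 1
    IN_DEVELOPMENT = 2
    IN_TESTING = 3
    PUBLISHED = 4
    DEPRECATED = 5
    RETIRED = 6


def advance(current: Stage) -> Stage:
    """Move an asset to the next stage (assumes stages are strictly sequential)."""
    if current is Stage.RETIRED:
        raise ValueError("a retired asset has no further stage")
    return Stage(current.value + 1)


def can_be_discovered(stage: Stage) -> bool:
    # Assumption: only published or deprecated assets show up in catalog search.
    return stage in (Stage.PUBLISHED, Stage.DEPRECATED)
```

A real catalog would probably allow stages to be skipped or revisited, so the sequential rule here is only a starting point.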

An even more profound conclusion from thinking about this life cycle would be that data is a product and should be treated like a product. (We will leave that one for another day.)

Product Taxonomies

Taxonomies are huge in Product MDM. Basically, they are the ways in which products are grouped together.

Taxonomies serve two fundamental purposes:

To help the enterprise govern, manage and report on its universe of products
To help customers explore the products they are interested in and find the products they want to buy

It seems that any data catalog is going to contain a diverse array of different data assets, or, perhaps, if we want to go further — data products. Therefore, taxonomies are going to have to be taken seriously for data catalogs, too. There is a large body of work on product taxonomies and it is very likely that we can learn a lot from them that can be applied to data catalogs.

Differences between Product MDM and Data Catalogs

So, we have at least a couple of areas from Product MDM that we may be able to take lessons from. However, we need to acknowledge that the analogy may eventually break down and that data catalogs are, in some ways, unique.

One difference is that Product MDM is about Product Types, not instances of products. That is, Product MDM is ultimately about the types of things that have to be managed, not the individual things themselves. A data catalog will, in large part, be about individual assets, like schemas, tables, columns, datasets, queries, and so on. That is a big conceptual difference that is likely to have implications.

Another example where there might be a difference is the use of Item Master teams in Product MDM. These are centralized teams that set up the Item Master Record — the key information for a product — in a Product MDM system. They make sure the data is complete and accurate and assign all the correct taxonomies. It is difficult to see how this methodology fits with the data democratization that a data catalog is intended to support.

Conclusion

There are definitely close parallels between a data catalog and Product MDM, and there are some great ideas we can glean from the technology and art of Product MDM. However, there are differences too, and we will still have to slowly think our way through the challenges that are unique to data catalogs in order to achieve the vision that has been set for them.

Junk Metadata and Data Catalogs

Here are a few fun facts about each of us:

The human genome consists of about 3 billion nucleotides (the basic information units of DNA).
The Human Genome Project found that within this there are about 20,000 functional genes — genes that encode the design for proteins.
But 20,000 genes represent only about 2% of nucleotides in the entire genome (each gene is only a few hundred nucleotides in length).

So, about 98% of our DNA seems to have no functional role in our bodies. Scientists like to call it “Non-coding DNA,” but the more popular name is “Junk DNA.”

Having 98% useless DNA would seem like the mother of all data quality problems. But this DNA is not being used to create proteins, so it does not really matter that we all have so much of it.

That said, it seems scientists are not happy to simply dismiss Junk DNA as having no role, and there is considerable controversy and speculation about why we have so much of it and what it might be doing.

Junk Metadata parallels Junk DNA

What does all this have to do with metadata and data catalogs? Well, data catalogs are a collection of information, just like the human genome is, and they are filled with metadata.

Now, there are many different kinds of metadata, but the type I want to focus on is technical metadata. Technical metadata is metadata that comes from something other than direct data entry by human beings. It includes database structures, data profiles, ETL metadata, inferred foreign keys, report structures, APIs and so on. Increasingly, this technical metadata is being collected and integrated automatically at vast scale in data catalogs.

At first glance, this might seem like a great and beneficial achievement. All that technical metadata in one place should be tremendously useful for use cases like data discovery, understanding the provenance of data, and finding the best source of data. And it is undeniable that the development of the capabilities to collect and integrate all this metadata has been a significant technical achievement.

However, there is an assumption here: we are assuming that all metadata is equally useful. This is similar to how all of the human genome was thought to be functional DNA before the Human Genome Project found that 98% was, in fact, Junk DNA. Luckily, we know our bodies work and are useful despite having so much Junk DNA, but what about data catalogs that are enormous reservoirs of technical metadata?

How Metadata can be Junk

This problem first came home to me when a client said he was afraid to allow business users access to a data catalog because they might do something like type “CUST” into the search bar and get back tens of thousands of results from a variety of technical components and services. He rightly feared the users would be horrified and give up, unable even to comprehend the types of technical objects the metadata has been harvested from.

So here we have a paradox. The more technical metadata that data catalogs contain, the more accurately and completely they hold a picture of the enterprise’s data assets. At the same time, the more unusable they are by business users, who are meant to be the principal beneficiaries of data catalogs. It seems we have created Junk Metadata: metadata that cannot usefully be consumed by business users.

What is junk metadata? It’s metadata that cannot be understood in business terms by business users.

“Junk” is a Business Viewpoint

Is this a fair conclusion? Going back to our Junk DNA parallel, we should remember that many scientists think that there must be a role for it and, in the future, it may be proven to have a use we currently do not understand.

Perhaps the same is true of Junk Metadata and in the future AI or ML may be used to derive business insights from it.

We can clarify this by defining Junk Metadata and its properties like this:

Metadata that cannot be understood in business terms by business users

That is, an item of Junk Metadata either:

(a) Has no business understandable content; or

(b) Is not related to sufficient other metadata objects that do have enough business understandable content for the user to infer a business understanding of the item
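The definition above can be sketched in code. This is only an illustration under my own assumptions: the `MetadataItem` structure, the idea that “business understandable content” is a non-empty description, and the `min_related` threshold for “sufficient” are all hypothetical. The sketch also reads (a) and (b) together, so an item counts as junk only when it has no business content of its own and too few described neighbours to infer one.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataItem:
    name: str
    # Assumption: an empty description means no business-understandable content.
    business_description: str = ""
    related: list["MetadataItem"] = field(default_factory=list)


def has_business_content(item: MetadataItem) -> bool:
    return bool(item.business_description.strip())


def is_junk(item: MetadataItem, min_related: int = 1) -> bool:
    """Junk if the item fails (a) itself and (b) through its related items."""
    if has_business_content(item):
        return False
    described = sum(1 for r in item.related if has_business_content(r))
    return described < min_related


table = MetadataItem("CUSTOMER", "People and firms that hold an account with us")
orphan_column = MetadataItem("CUST_TYP_CD")
linked_column = MetadataItem("CUST_ID", related=[table])
```

Here `orphan_column` would be classed as junk, while `linked_column` would not, because a business user could infer something about it from the described table it belongs to.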

A major point here is that it is the business user’s viewpoint that is being considered. What we are calling Junk Metadata may be very useful for IT users. However, data catalogs have promised us that they are going to be enterprise-wide, and they are going to democratize data for all users in the enterprise. Otherwise, they would just be another IT technical tool like a DBA workbench.

Is Junk Metadata real? I think it is to some extent. All metadata in a data catalog must be understandable in business terms to be even considered by business users. Even then it may have no business use. But I certainly do not want to imply that all technical metadata is Junk Metadata — just that some of it is. And, like Junk DNA, we cannot dismiss Junk Metadata completely, as there may be a way to figure out how to extract business value from it in the future.

Covid-19: Five Practical Ways Data Governance Can Help

We’ve all been affected by Covid-19, and I hope you and your loved ones are well. The health crisis continues to unfold, and it may be many months before it is under control. At the same time there has been a massive economic shock, whose ramifications are less easy to understand than the disease itself.

When it comes to data, I think we all understand more of how public health and epidemiology are driven by data, and how valuable data is. In fact, I heard Dr. Fauci of the US National Institute of Allergy and Infectious Diseases (NIAID) say (and I am paraphrasing him) that actual data trumps all the models you can have. But I would like to defer any discussion of the role of data in the health crisis until later, in part because it’s going to be controversial.

So while the health crisis is acutely important today, I would like to try to think about the economic crisis.

In all wide-scale crises we each have our role to play, and it is a very fair question to ask what Data Governance can do to help. Now in answering this question, we are not here to think about how Data Governance can profit from the situation, or even how Data Governance units can find ways to make themselves seem relevant so they can survive the economic crisis and then get back to business as usual after this is all over. Selfish thinking is not going to contribute anything to help the organizations we work for or the communities we live in.

To help frame how Data Governance can help, it is useful to reflect on how the need to move rapidly to work from home seems to have caught many enterprises by surprise, even though they had spent a lot of time on Disaster Recovery and Business Continuity. Luckily, the technical infrastructure was broadly available for rapid adoption. This shows that the focus of Data Governance in this crisis should be in the area of risk management. It is not the time for Data Governance to be focusing on increasing revenues and attaining enterprise goals. Improving process efficiency may be relevant, but it implies business process re-engineering, and this takes time. So risk management is a logical place for Data Governance to help.

Here are 5 practical ways that Data Governance can help today.

#1 Build a data competencies matrix for all staff working closely with data. What data does a staff person work with? What data processing do they do? What data are they knowledgeable about? What data-related skills do they have (tool skills, methodology skills)? We need to know this in case individual staff members are incapacitated, so we can more easily find people to plug the gaps.

#2 Get an understanding of Data Lineage at a high level. This is the dataset level. It is not at the detailed column-to-column level. What we need to understand is the enterprise’s Data Supply Chain, and if we understand this, we can predict what will be impacted if an area of the business has to shut down or operate in some kind of degraded mode.

#3 Develop and implement guidance for End User Computing (EUC). With so many people working at home there is a possibility of many more corporate data assets ending up on endpoints like personal devices. People need help in managing these well to prevent negative incidents. EUC is risky even when done on premise, and is even more risky with remote working. It takes time to develop policies, so practical guidance is a better option.

#4 Proactively help with data needs. People may be encountering data problems but do not know who to turn to. Data Governance can help facilitate solving these problems by reaching out to colleagues working remotely and asking if they are having data-related issues. If there are problems, Data Governance is better placed to help coordinate resolutions. IT is likely to be overwhelmed with technology concerns, and in any case is not well placed to deal with data issues.

#5 Provide Data Literacy training. This is not related to risk management. There are a lot of anecdotes that people are finding they are much more productive working at home. This means that they have more time available, which could be used to gain new skills. Roughly speaking, data literacy is the ability to understand the enterprise’s data assets and use tools and methodologies to work with them effectively. In this lockdown period we need to prepare for what happens when the economy reopens. It is unlikely to be the same and we need to get ahead of that now.
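As an illustration of #2, dataset-level lineage can be held as a simple graph and walked to see what is affected if one area shuts down. This is a rough sketch only: the dataset names and the `FEEDS` mapping are invented for the example, and a real Data Supply Chain would be harvested or documented rather than typed in by hand.

```python
from collections import deque

# Hypothetical dataset-level lineage: each key feeds the datasets listed against it.
FEEDS = {
    "orders_system": ["sales_mart"],
    "crm": ["sales_mart", "marketing_extract"],
    "sales_mart": ["finance_report"],
    "marketing_extract": [],
    "finance_report": [],
}


def impacted_by(source: str, feeds: dict[str, list[str]]) -> set[str]:
    """Return every dataset downstream of `source`, using a breadth-first walk."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for downstream in feeds.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen
```

Calling `impacted_by("crm", FEEDS)` lists the datasets that would be degraded if the CRM area could not operate, which is the kind of prediction the dataset-level view is meant to support.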

So these are 5 practical approaches that Data Governance can take today to support the enterprises we work for. Planning for the future, and a possibly very different world after the lockdown ends is something we will tackle in another episode.

Until we meet again, stay safe.

Download the slides from here

Data Accuracy: The Hard Truth

Data Accuracy is one of the so-called “dimensions” of Data Quality. The goal for these dimensions, and it is a noble one, is that we can measure each of them, and then there should be a uniform set of best practices that we can implement to cure any deficiencies found. Of course, these best practices will differ from dimension to dimension. But just how feasible is this for Data Accuracy?

In this video we start by looking at the definition of accuracy in general, and Wikipedia provides a commonly accepted definition, which is:

the degree of closeness of measurements of a quantity to that quantity’s true value [https://en.wikipedia.org/wiki/Accuracy_and_precision]

The word “true” is very important and needs to be understood so we can understand accuracy. Truth was defined by Aristotle in a way that highlights the relation between a representation and the reality that the representation is trying to represent. That seems to be very much connected with Data Accuracy.

We give the definition of Data Accuracy as:

the degree to which a data value actually represents what it purports to represent

One of the problems of the dimensions of Data Quality is that there are differences of opinion about how they should be defined. No doubt our definition can be improved, but we will take it as a basis for exploring how we should estimate Data Accuracy.

The next thing we have to consider is that 100% Data Accuracy is impossible to achieve for data based on observations. Two great minds in the area of Quality Control provide a basis for this understanding. Walter A. Shewhart pointed out that all systems of measurement introduce some kind of error. W. Edwards Deming, who was taught by Shewhart, went further and pointed out that there is “no true value of anything”.

So it seems we can never get 100% Data Accuracy. If this is the case, we will want to know how we can assess Data Accuracy, since we will want to know just how imperfect it is. And this is where we run into another hard truth, which is that we cannot do this just by considering the data alone. This is potentially a difficulty for data professionals. We are used to working with data and like to work with it. But to estimate Data Accuracy we will need to step outside the data, figure out a method to independently assess a sample of the population we are interested in for the data values in question, and compare that with the data under curation. It just cannot be done from entirely within the data itself. Of course, the way we assess Data Accuracy is likely to vary from data element to data element.
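The comparison step can be sketched as follows. This is a minimal illustration under assumptions of my own: the sample is random, each sampled record has been independently verified outside the data, a stored value counts as accurate only if it exactly equals the verified value, and the interval uses the simple normal approximation for a proportion, which is crude for small samples.

```python
import math


def estimate_accuracy(pairs, z: float = 1.96):
    """Estimate accuracy from (stored_value, independently_verified_value) pairs.

    Returns (accuracy, lower_bound, upper_bound) for roughly 95% confidence
    when z is 1.96.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one verified record")
    matches = sum(1 for stored, verified in pairs if stored == verified)
    p = matches / n
    # Normal-approximation margin of error for a sample proportion.
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

For many data elements exact equality will be too strict, and a tolerance or a matching rule suited to that element would be needed, which is one reason the method varies from element to element.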

There is one exception to this, which is where data is itself the managed reality. In these cases, the data is not based on observations of something outside itself. Anything dematerialized, like a bank account, is included in this category. In the database that actively manages a bank account, the data is the reality. However, if we look at bank accounts and try to, say, capture their balances, then we are making observations and we are back to normal data. We will explore this distinction more in the future.

So those are a couple of hard truths about Data Accuracy: for data based on observations, it is impossible to achieve it completely, and we cannot estimate it just by looking at data.

Download slides here

Data Security vs. Data Privacy

It is not that uncommon to hear someone speak of “Data Security and Privacy”, seemingly implying they are just two sides of the same coin. But is that really justified? Are we looking at some kind of common problem domain that these two disciplines cover, or are there a lot of techniques that are common to both?

In this video we start by looking at just security and privacy as concepts (nothing to do with data) and it is pretty easy to find examples where you can have a lot of security and very little privacy, and vice versa. So, it seems they are quite distinct concepts, and given that, we can suspect that Data Security and Data Privacy are different.

We then move to look first at the definition of Data Security, and after that the definition of Data Privacy. Data Security is essentially concerned with confidentiality, integrity, and availability (the so-called “CIA” triad). At least this is the traditional way in which Data Security has been considered. For Data Privacy we go out on a limb and provide our own definition, which is:

Keeping the obligations the enterprise has, and the promises it makes, about the way in which it uses data.

This definition is broad and covers not just personal information (PI) but also licensed data (data covered by contractual terms, like data purchased from data vendors). Of course, today the focus is on PI because of all the new Data Privacy laws, but the contractual aspects are still important. In any case, just the definitions indicate that Data Security and Data Privacy are quite different.

Next, we do a quick comparison of some of the detailed activities and concerns of the two disciplines. There are a lot of differences, although there are some shared concerns. But even here Data Security and Data Privacy play different roles. So, I think we can be quite justified in keeping Data Security and Data Privacy separate, and dealing with each in its own way. As for the shared concerns, there is probably little more than coordination of the two roles that is needed, rather than combining both to perform the same activities.

Download slides from here

Data Universes #2: The 9 Universes of Data

In the previous post on Data Universes we looked at how semantics contributes to their boundaries. This time we look at the roles of operational processes, which have a good deal more in terms of practical impact. We are still trying to answer the question that all data analysts and data scientists ask when they encounter a new dataset – “what’s in it?”.

The video goes through how operations can change the semantics that were originally intended in the design of the dataset, and how different populations of the same thing can end up in different databases. But even where this does not happen, we still have a possible 9 different kinds of universe that we need to think about in order to answer the “what’s in it?” question.

One thing I realized I should have pointed out in the video is that “Target Data Universe” has a dashed line around it to make the point that it is requirements-driven (or, if you like, “subjective”). It is not really objective like the other 8 universes.

How relevant is all of this? That depends on each dataset. Probably it’s more important in some than others. But at least with the 9 Data Universes we have a rough checklist that we can run through when we need to understand a new dataset to make sure there is not something significant that could impact the use of the data.

The best place to store the information about what’s in a dataset would be in a data catalog, although we would also need to have some kind of business glossary functionality to take care of the semantics. But it is only a data catalog – where information about the physical data assets is kept – that can be used to describe what the significance is of each of the 9 universes for our dataset. Let’s hope the data catalog vendors start to address this need soon.

Download slides from here.

Data Universes #1: The Outer Limits

What’s in a database or a major database table like Customer? One answer to this is the set of things that is, or could be, represented as records in the table – the data universe. Two factors combine to create the boundaries of any data universe: semantics and operational processes. In this video (link below) we look at the semantics. We will get to operations next time.

Why does any of this matter? Because anyone using the data needs to know what set of things are supposed to be represented in it, and what set of things are actually represented in it. A lot of mistakes have been made by trying to guess the answers to these two questions.

Thinking about semantics always means having to do due diligence with the philosophers, and they give us three useful concepts: Universe of Discourse; Intension; and Extension. We get into these in the video and try to relate them to what we see in databases.

One point in the video needs a bit of clarification. When discussing intension, I talk about “essential attributes”. These are the attributes that qualify the corresponding set of concrete objects – the extension. Without getting into a deep philosophical discussion, an essential attribute is one that specifies part of the very nature of a concrete instance. Most attributes, and therefore most columns in a database table, are non-essential. So, a Party model table, having very few columns that are essential attributes, might still have thousands of columns that are non-essential. The overall number of columns in a table is not part of what we are discussing.
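A small sketch may help relate these ideas to a table. It is purely illustrative and rests on my own assumptions: the `Party` record, the choice of `has_account` as the single essential attribute of a Customer, and the `newsletter_opt_in` column are all invented. The intension is modeled as a test on essential attributes, and the extension as the set of concrete records in the universe of discourse that pass it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    party_id: int
    has_account: bool                 # assumed essential attribute for "Customer"
    newsletter_opt_in: bool = False   # non-essential: no role in membership


def is_customer(party: Party) -> bool:
    """Intension (hypothetical): a Customer is a party that holds an account."""
    return party.has_account


def extension(universe_of_discourse):
    """Extension: the concrete parties that satisfy the intension."""
    return {p for p in universe_of_discourse if is_customer(p)}


parties = [Party(1, True), Party(2, False, True), Party(3, True, True)]
customer_ids = {p.party_id for p in extension(parties)}
```

However many non-essential columns are added to `Party`, the membership test is unchanged, which is the sense in which the overall number of columns is not part of the discussion.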

More to come in the next video on data universes.

Download Slides from Here.

5 Fatal Flaws in a Business Glossary Strategy

There is a lot to say about Business Glossaries so I have made a video (link below) that concentrates on what could be wrong or missing in a strategy for a Business Glossary. Having a strategy is really important, and an overall thrust in the argument I am trying to make is that it is unfair to expect a tool just to do everything by itself. Business Glossary tools do provide great capabilities, but there are success factors that have to be present for them to become adopted. And only after they are adopted can they bring real business benefit.

One thing I forgot to point out in the video was what I mean by a “technical term”, so let’s deal with that now. Unfortunately, today “technical term” often means terms used by IT. But I am using “technical term” in its more traditional sense, which is contrasted with “common term” as follows:

Common Term – a term that is expected to be understood by most if not all speakers of the language in question (English in our case).

Technical Term – a term that is specific to a particular domain of human endeavor and not in common use in the language. For a Business Glossary this means the jargon that the business uses, which could be industry-specific, enterprise-specific, or even department-specific.

Working with Business Glossaries can be a very rewarding experience for data people, even though there can be challenges. We will be doing more videos on them in the future, so stand by.

I am also making the slides used in the video available.

Download the slides from here