loader

Data Vault 2.0 Modelling in Data Warehousing

Data modelling is one of the most important steps in any analytical project, so it’s important to get it right!

Data models are built around business needs, but what if those business needs or processes change? The data modelling technique used should address the relationship between data from various sources, while also being flexible enough to manage future business process changes. 

A data model that can’t accommodate business process changes will require design changes, meaning extensive development effort, such as changing ETL and schemas to include the new table and hence increasing the cost of the project.

This blog provides an overview of data modelling and the techniques used for performing data analysis in the data warehousing and business intelligence space. Data modelling can be applied across a full information system, a database or subset of data sources, for this blog we are using a data warehousing lens. 

What is Data Modelling?

The Data Management Association (DAMA) defines Data Modelling as:

 “Data Modelling is a Process of discovering, analysing and scoping data requirements and then representing and communicating these data requirements in a precise form called data model“

In other words, analysing the data within one or more business areas to understand the relationship between these business areas, database objects and derive meaningful insights thereby enabling the stakeholders to make informed decisions.

Data Warehouse Modelling Approach

There are two different approaches used in Data Modelling as described below

KimballInmon
It follows the Bottom-Up ApproachIt follows the Top-Down Approach
Less Time to ImplementComplex Approach, hence takes a longer implementation time
Difficult MaintenanceEasy Maintenance
Low CostHigh Cost
Less Set up TimeLonger Set up Time
Data isn’t entirely integrated so the concept of ‘single source of truth’ is lostUnified source of truth for the entire business
It may not be able to handle future enterprise reporting requirementsCan handle enterprise-wide reporting requirements

                            Skillfield_Inmon_Data_Model                              

Both data warehouse approaches have their pros and cons. The approach you take will depend on the:

  • Reporting needs of the business – enterprise versus team reporting
  • Project Capacity – both time and resources. The Inmon approach will take longer to implement and require highly skilled data warehouse specialists.
  • Project Budget – the Inmon approach is far more costly.

Data Modelling Techniques

There are three main data modelling techniques used for data warehousing. 

  1. 3NF Data Modelling
  2. Dimensional Data Modelling
  3. Data Vault 2.0 Modelling

3NF Data Modelling

  • Primarily used in the Relational Databases
  • All attributes of an entity must depend on the key of the entity
  • In the example below, Customer Context attributes are placed in the same table along with the Primary Keys of other tables which happen to be the Foreign keys in this context
Customer_Entity
Customer_ID
Customer_Name
Customer_Address
Customer_ClassID (FK)
Date_ID (FK)
Customer_Phone

Dimensional Data Model

  • Based on Star and Snowflake schema
Entity NameDescription
Dimension TablesContains the Descriptive Attributes that are typically textual fields
Lookup TableExtension tables to dimension table when all the required information is unavailable in dimension table
Fact TablesDescribes the numbers or dollar value related to a dimension, this table will store the Primary Key of all dimension tables as foreign key column

Skillfield_ER Diagram

 

Problem Arising in 3NF and Dimensional Data Models

Consider a company doing sales of its products to a number of customers. Let’s say the company has a total of 5 products and each customer can buy one product.

So the ER diagram would look like this:

Customer to Product will be 1: 1 relationship 

Product to Customer will be 1: Many relationship

Skillfield_ERD

Now in few years time, the business scenario changes and as a part of revenue growth the company allows the customers to have multiple products under the same customer ID

Now the data model would like this:

Customer to Product will be 1: Many relationship 

Product to Customer will be 1: Many relationship

Thus it will become Many to Many relationship

Skillfield_ERD3

In this scenario, the introduction of a bridge table Customer_Product_Relationship is needed, thereby causing design changes resulting in extensive development effort (Change ETL, schemas) to include the new table and hence increasing the cost of the project.

This is where Data Vault Modelling comes into play, let’s discuss Data Vault in more detail

Data Vault 2.0 Modelling 

Data Vault 2.0 is a hybrid of 3NF and Dimensional (Star Schema) data models and is useful to overcome the drawbacks in the other models.

Data vault modelling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public domain modelling method.

If the relationship or bridge table described above always exists in the data model, that is, it is built into the model at the beginning, this will allow flexibility in business rules and save time and costs in the future. 

High Level Steps for Data Vault 2.0 Modelling

There are two main stages in Data Vault Modelling, Discovery and Modelling. 

  1. Discovery: In this stage we need to understand from the business users “What questions need to be Answered”?

Example: We need to know the Sales Amount for Asia Pacific, so the first phase of the Data Vault Model will include only the source tables needed to answer this question. Typically a business workshop is conducted to understand the keys and attributes important to the business processes.

 2. Modelling: This stage has 8 steps involved. We need to (1)      identify the core business concepts, (2) identify the hubs and (3)      establish the business keys & (4) hash keys for these hubs. Then (5)     model the hubs and (6) identify the relations between the hubs. Then finally (7) model the links and (8) identify context attributes from each hub.

  • Identify Core Business Concept (CBC); 
    • The Driver to identify the CBC is the business itself
    • We need to select a grain that the business is using e.g.customer
  • Identify Hubs: 
    • This is an easy step as the CBC identified in the above step is a hub table
    • Formed around the business key of this concept
    • This is established only once in the Data warehouse, once this CBC is identified
  • Establish Business Keys for Hubs: 
    • Note that the Primary Key is not always the Business key
    • Business stakeholders will be in the best position to answer how to track a unique customer
    • As the example below, the Business key can be a combination of customer_id and the source_system where it has been stored
Hub_CustomerDescription
H_customer_hashkeyThis will be the HASH Key of business Key (E.g.Customer_ID, Source_System)
Customer_IDUnique Identifier for a customer
Source_SystemIdentity the Source Application Customer Information is Stored (E.g. Retail or Wholesale Customer)
Date_TimeDateTime when the record was inserted in the table
  • Establish Hash Keys for Hubs: 
    • Business Keys can be Hash using the SQL functions like MD5 or func_sha1
    • This hash key is a unique key in the hub table thereby eliminating the risk of duplicate records
  • Model Hubs: 
    • Identify any other hub in context to the CBC
    • Another Example would be Product
Hub_ProductDescription
H_Product_hashkeyThis will be the HASH Key of business Key (E.g.Product_ID, Source_System)
Product_IDUnique Identifier for a customer
Source_SystemIdentity the Source Application Customer Information is Stored (E.g. Retail or Wholesale Product)
Date_TimeDateTime when the record was inserted in the table
  • Identify relations between Hubs: 
    • This will help us in modelling the link tables
    • As an example: 
      • Customer to Product will be 1: Many relationship 
      • Product to Customer will be 1: Many relationship
  • Model Links: 
    • Now you can connect the relationship between 2 hub table and design the link table
Link_Customer_ProductDescription
Link_customer_product_hashkeyHASH Key of primary Key of 2 hubs (E.g.Hub_customer and Hub_Product)
H_customer_hashkeyPrimary Key from Hub_Customer
H_Product_hashkeyPrimary Key from Hub_Product
Source_SystemSource System coming from each hub
Date_TimeDateTime when the record was inserted in the table
  • Identify Context Attributes from each hub: 
    • Hubs and links are the backbone of Data Vault
    • Now, we need to add the other attributes which are descriptive in nature to the data vault
    • For each hub or link, there should be as many satellite tables as the number of source tables
    • An example of Satellite table would be as below
Sat_CustomerDescription
Customer_hashkeyHASH Key of business Key (E.g.Customer_ID, Source_System)
Customer_hashdiffHASH Key of all attributes which we want to capture changes (E.g.Name, Add, Phone)
Customer_NameContext Attributes
Customer_AddressContext Attributes
Customer_PhoneContext Attributes
Source_SystemIdentity the Source Application Customer Information is Stored (E.g. Retail or Wholesale Customer)
Date_TimeDateTime when the record was inserted in the table

As an example, final Data Vault Model would look like as below:

Skillfield_Data_VaultLimitations of Data Vault 2.0 Modelling

Although Data Vault 2.0 Modelling overcomes many issues experienced in 3NF and Dimensional Star Schema models, it still has some limitations: 

  • End user reporting tools like Tableau and Power BI need tables available in a dimensional model (Dimensions and Facts) for reporting purposes
  • As a workaround Hubs and Satellites can be treated as Dimensions and Link table treated as Fact tables for consumption into these reporting tools
  • Data Vault is not query optimised and performance can be a problem sometimes

Benefits of Data Vault 2.0 Modelling

The benefits of Data Vault modelling are many: 

  • Delivery of the project will be easier as the Data Vault model can easily accommodate any business process changes
  • There can be only one source of truth for all the data coming into the data warehouse
  • It’s easier to maintain as there will always be INSERTs as no UPDATEs are allowed in DV. For example if a customer’s information needs to be updated, lets say the address. A new record will be inserted and the most recent record will be identified according to the “Date_Time” attribute. 

Conclusion

There are many advantages to Data Vault modelling as it is designed to be resilient to changes in the business environment, such as business rule changes and where the sources of data are being stored. There are also performance advantages, such as allowing parallel loading and being able to scale without major redesign and blow-outs to project timeframes and costs. 

Data modelling is one of the most important steps in any analytical project, so it’s important to get it right. I hope this blog has provided a useful overview of data modelling and the techniques used for performing data analysis in the data warehousing and business intelligence space.

About Skillfield

Skillfield is a data services and cyber security company, empowering organisations to excel in the digital era. 

If data platform challenges or data complexity issues are preventing your organisation from making data-driven decisions, contact us. Let’s talk about how we can help.

Author: Apoorv Chitre