Data modelling is one of the most important steps in any analytical project, so it’s important to get it right!
Data models are built around business needs, but what if those business needs or processes change? The data modelling technique used should address the relationship between data from various sources, while also being flexible enough to manage future business process changes.
A data model that can’t accommodate business process changes will require design changes, meaning extensive development effort, such as changing ETL and schemas to include the new table and hence increasing the cost of the project.
This blog provides an overview of data modelling and the techniques used for performing data analysis in the data warehousing and business intelligence space. Data modelling can be applied across a full information system, a database or subset of data sources, for this blog we are using a data warehousing lens.
What is Data Modelling?
The Data Management Association (DAMA) defines Data Modelling as:
“Data Modelling is a Process of discovering, analysing and scoping data requirements and then representing and communicating these data requirements in a precise form called data model“
In other words, analysing the data within one or more business areas to understand the relationship between these business areas, database objects and derive meaningful insights thereby enabling the stakeholders to make informed decisions.
Data Warehouse Modelling Approach
There are two different approaches used in Data Modelling as described below
Both data warehouse approaches have their pros and cons. The approach you take will depend on the:
- Reporting needs of the business – enterprise versus team reporting
- Project Capacity – both time and resources. The Inmon approach will take longer to implement and require highly skilled data warehouse specialists.
- Project Budget – the Inmon approach is far more costly.
Data Modelling Techniques
There are three main data modelling techniques used for data warehousing.
- 3NF Data Modelling
- Dimensional Data Modelling
- Data Vault 2.0 Modelling
3NF Data Modelling
- Primarily used in the Relational Databases
- All attributes of an entity must depend on the key of the entity
- In the example below, Customer Context attributes are placed in the same table along with the Primary Keys of other tables which happen to be the Foreign keys in this context
Dimensional Data Model
- Based on Star and Snowflake schema
Problem Arising in 3NF and Dimensional Data Models
Consider a company doing sales of its products to a number of customers. Let’s say the company has a total of 5 products and each customer can buy one product.
So the ER diagram would look like this:
Customer to Product will be 1: 1 relationship
Product to Customer will be 1: Many relationship
Now in few years time, the business scenario changes and as a part of revenue growth the company allows the customers to have multiple products under the same customer ID
Now the data model would like this:
Customer to Product will be 1: Many relationship
Product to Customer will be 1: Many relationship
Thus it will become Many to Many relationship
In this scenario, the introduction of a bridge table Customer_Product_Relationship is needed, thereby causing design changes resulting in extensive development effort (Change ETL, schemas) to include the new table and hence increasing the cost of the project.
This is where Data Vault Modelling comes into play, let’s discuss Data Vault in more detail
Data Vault 2.0 Modelling
Data Vault 2.0 is a hybrid of 3NF and Dimensional (Star Schema) data models and is useful to overcome the drawbacks in the other models.
Data vault modelling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public domain modelling method.
If the relationship or bridge table described above always exists in the data model, that is, it is built into the model at the beginning, this will allow flexibility in business rules and save time and costs in the future.
High Level Steps for Data Vault 2.0 Modelling
There are two main stages in Data Vault Modelling, Discovery and Modelling.
- Discovery: In this stage we need to understand from the business users “What questions need to be Answered”?
Example: We need to know the Sales Amount for Asia Pacific, so the first phase of the Data Vault Model will include only the source tables needed to answer this question. Typically a business workshop is conducted to understand the keys and attributes important to the business processes.
2. Modelling: This stage has 8 steps involved. We need to (1) identify the core business concepts, (2) identify the hubs and (3) establish the business keys & (4) hash keys for these hubs. Then (5) model the hubs and (6) identify the relations between the hubs. Then finally (7) model the links and (8) identify context attributes from each hub.
- Identify Core Business Concept (CBC);
- The Driver to identify the CBC is the business itself
- We need to select a grain that the business is using e.g.customer
- Identify Hubs:
- This is an easy step as the CBC identified in the above step is a hub table
- Formed around the business key of this concept
- This is established only once in the Data warehouse, once this CBC is identified
- Establish Business Keys for Hubs:
- Note that the Primary Key is not always the Business key
- Business stakeholders will be in the best position to answer how to track a unique customer
- As the example below, the Business key can be a combination of customer_id and the source_system where it has been stored
- Establish Hash Keys for Hubs:
- Business Keys can be Hash using the SQL functions like MD5 or func_sha1
- This hash key is a unique key in the hub table thereby eliminating the risk of duplicate records
- Model Hubs:
- Identify any other hub in context to the CBC
- Another Example would be Product
- Identify relations between Hubs:
- This will help us in modelling the link tables
- As an example:
- Customer to Product will be 1: Many relationship
- Product to Customer will be 1: Many relationship
- Model Links:
- Now you can connect the relationship between 2 hub table and design the link table
- Identify Context Attributes from each hub:
- Hubs and links are the backbone of Data Vault
- Now, we need to add the other attributes which are descriptive in nature to the data vault
- For each hub or link, there should be as many satellite tables as the number of source tables
- An example of Satellite table would be as below
Limitations of Data Vault 2.0 Modelling
Although Data Vault 2.0 Modelling overcomes many issues experienced in 3NF and Dimensional Star Schema models, it still has some limitations:
- End user reporting tools like Tableau and Power BI need tables available in a dimensional model (Dimensions and Facts) for reporting purposes
- As a workaround Hubs and Satellites can be treated as Dimensions and Link table treated as Fact tables for consumption into these reporting tools
- Data Vault is not query optimised and performance can be a problem sometimes
Benefits of Data Vault 2.0 Modelling
The benefits of Data Vault modelling are many:
- Delivery of the project will be easier as the Data Vault model can easily accommodate any business process changes
- There can be only one source of truth for all the data coming into the data warehouse
- It’s easier to maintain as there will always be INSERTs as no UPDATEs are allowed in DV. For example if a customer’s information needs to be updated, lets say the address. A new record will be inserted and the most recent record will be identified according to the “Date_Time” attribute.
Conclusion
There are many advantages to Data Vault modelling as it is designed to be resilient to changes in the business environment, such as business rule changes and where the sources of data are being stored. There are also performance advantages, such as allowing parallel loading and being able to scale without major redesign and blow-outs to project timeframes and costs.
Data modelling is one of the most important steps in any analytical project, so it’s important to get it right. I hope this blog has provided a useful overview of data modelling and the techniques used for performing data analysis in the data warehousing and business intelligence space.
About Skillfield
Skillfield is a data services and cyber security company, empowering organisations to excel in the digital era.
If data platform challenges or data complexity issues are preventing your organisation from making data-driven decisions, contact us. Let’s talk about how we can help.
Author: Apoorv Chitre