How to implement slowly changing dimension type 2 using azure data factory |SCD type 2
In the world of data warehousing, effective management of historical data is crucial for making informed business decisions. Slowly Changing Dimension (SCD) Type 2 is a widely used technique for handling changes in data over time. In this blog post, we will delve into the concept of SCD Type 2, its importance, and how it is implemented in data warehouses.
What is SCD Type 2?
SCD Type 2 is one of the most common techniques used to manage historical data in a data warehousing environment. It is particularly useful when dealing with slowly changing data, where some attributes of a dimension change over time, and you need to preserve historical records while still accommodating new information.
In SCD Type 2, each time a change occurs in a dimension’s attribute, a new record is created with a new surrogate key and a valid time period during which that version of the data was valid. This allows data analysts and business users to analyze historical data accurately and maintain a complete audit trail.
Why is SCD Type 2 Important?
1. Accurate Historical Analysis: By preserving historical data, SCD Type 2 enables organizations to perform accurate historical analysis. This is particularly valuable in industries where tracking changes over time is critical, such as finance, healthcare, and retail.
2. Auditing and Compliance: SCD Type 2 provides a comprehensive audit trail of all changes made to dimension data. This audit trail is essential for compliance purposes, especially in regulated industries.
3. Improved Decision-Making: Historical data is a valuable resource for making informed business decisions. SCD Type 2 ensures that decision-makers have access to a complete dataset to analyze trends and patterns.
4. Simplified Reporting: Reporting tools can easily query SCD Type 2 data, as it maintains a clear history of changes. This simplifies the process of generating reports and dashboards.
How to Implement SCD Type 2?
Implementing SCD Type 2 involves several key steps:
create table scd2(surrkey int identity(1,1),id int, name nvarchar(100),address nvarchar(100),isactive nvarchar(100))
insert into scd2 values (1,’john’,’chennai’,1)
insert into scd2 values(2,’antony’,’chennai’,1)
insert into scd2 values(3,’antony’,’chennai’,1)
Step 1: Add the source dataset (dataset should point to file which is located in your source layer).
Step 2: Add derived column resource and add column name as isactive and provide the value as 1.
Step 3: Configure your sink mappings as shown below
Step 4: Add SQL dataset as another source.
Step 5: Use select resource to rename columns from SQL table.
Step 6: Add lockup activity (It requires two sources so, first source will be your select activity and second source will be your source file).
Step 7: Output of lookup activity should be like shown below (For non-matching rows there should be null’s).
Step 8: Now let’s filter out the rows which has non-nulls in the source file columns.
Step 9: Select only the required columns that you are going to insert or update in SQL table.
Step 10: Add derived column and add isactive column to the table and the value should be 0.
Step 11: Add alter row resource and configure as shown below.
Step 12: Add sink and configure the sink as shown below.
Step 13: Finally Under settings section of your dataflow select sink2 as first and sink1 as second.
Step 14: After successfully running, your pipeline verify the data in your SQL table.
If you face any errors or need any help on azure data factory please feel free to reach out to me on LinkedIn : https://www.linkedin.com/in/madhu-narayana-varma-m-08216919a/