- 3 Minutes to read
- Print
- DarkLight
- PDF
Automatic Schema Drift Management
- 3 Minutes to read
- Print
- DarkLight
- PDF
Introduction
Schema drift refers to the gradual changes in the structure of data in a Source over time. These changes can have a significant impact on the integrity, and accuracy of data being replicated to a destination Target, such as a data warehouse.
Rivery provides automatd solutions for extracting, transforming, and loading data from various Sources into a Target. One of the key features is the ability to detect and handle schema changes in the Source.
Here's an Illustration of Schema Drift
Customers Table: Initial Structure
ID | Name | Address | Phone Number |
---|---|---|---|
1 | Jane | 500 7th Avenue | 5551235555 |
2 | John | 16 Great Chapel Street | 5554565556 |
Customers Table: After Schema Drift
ID | Name | Address | Phone Number | Social Media Handle | |
---|---|---|---|---|---|
1 | Jane | 500 7th Avenue | 555-123-5555 | Null | Null |
2 | John | 16 Great Chapel Street | 555-456-5556 | Null | Null |
3 | Bob | 14425 Falcon Head Blvd | (+1)555-789-5559 | bob.smith@rivery.com | @bobsmith |
The Customers database initially included four columns, as seen in this example: "ID," "Name," "Address," and "Phone Number." However, due to schema drift, the table now has two extra columns, "Email" and "Social Media Handle," and the "Phone Number" column's data type has been changed from integer to string.
Handling Schema Changes
When a schema change is detected in the Source, Rivery updates the schema in the destination Target to match the new structure. This is done without interrupting the flow of data and without requiring manual intervention in any scenario, whether it's through Standard extraction or CDC.
the process involves detecting any changes in the data source (Standard/CDC) by comparing it with the metadata. The metadata is essentially a record of the structure and attributes of the data source.
To ensure that the process is always up-to-date, Rivery will resync the metadata and the data source on each run, meaning it will compare the two and make any necessary updates. Once the metadata and the data source are in sync, Rivery will use the metadata as the base for the data processing.
In cases where the user makes changes to the data source, Rivery will detect these changes and update the metadata accordingly. This ensures that the metadata always reflects the most recent state of the data source, allowing for accurate processing and analysis.
Dropping a Column in the Source Table
When a column in the Source is dropped, but the corresponding column in the Target still exists. In this case, the dropped field in the target table will be populated with null values.
Renaming a Column in the Source Table
Any renaming of a column in the source table will be detected as soon as a user runs the River. This will result in a new column being added to the target table with the updated name, along with a new record. The previous table records will be duplicated, with null values in the newly added column.
The main objective of this approach is to ensure that the addition of the new column does not negatively impact any existing processes. By inserting new records into the new column and keeping old records as null values, Rivery is able to maintain the integrity of the original data while also incorporating new information into the dataset.
Adding a Column in the Source Table
If a new column is added to the source table, Rivery will detect this as soon as the user runs the River and add it to the target table.
Previous records that already existed in the target table before the new column was created will have null values for that column.
Changing Data Type of an Existing Column
If the data type of an existing column is changed in the source table, the target column will be set as the "wider" data type.
For example, if the field is changed from an integer to a string, the target data type will be changed to a string, as this data type can encapsulate integer values.
However, if the column has been changed from a string to a date, it will remain a string in the Target.
Sources Supporting Schema Drift
Schema drift is available for the following Sources:
- Salesforce
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Redshift
- MariaDB
- Teradata
- Vertica
- SAP S4/Hana
- MongoDB - In this instance, we're using the most recent 100 pages from each run to sync the schema.