Automatic Schema Drift Management
  • 3 Minutes to read
  • Dark
    Light
  • PDF

Automatic Schema Drift Management

  • Dark
    Light
  • PDF

Article Summary

Introduction

Schema drift refers to the gradual changes in the structure of data in a Source over time. These changes can have a significant impact on the integrity, and accuracy of data being replicated to a destination Target, such as a data warehouse.

Rivery provides automatd solutions for extracting, transforming, and loading data from various Sources into a Target. One of the key features is the ability to detect and handle schema changes in the Source.

Here's an Illustration of Schema Drift

Customers Table: Initial Structure

IDNameAddressPhone Number
1Jane500 7th Avenue5551235555
2John16 Great Chapel Street5554565556

Customers Table: After Schema Drift

IDNameAddressPhone NumberEmailSocial Media Handle
1Jane500 7th Avenue555-123-5555NullNull
2John16 Great Chapel Street555-456-5556NullNull
3Bob14425 Falcon Head Blvd(+1)555-789-5559bob.smith@rivery.com@bobsmith

The Customers database initially included four columns, as seen in this example: "ID," "Name," "Address," and "Phone Number." However, due to schema drift, the table now has two extra columns, "Email" and "Social Media Handle," and the "Phone Number" column's data type has been changed from integer to string.


Handling Schema Changes

When a schema change is detected in the Source, Rivery updates the schema in the destination Target to match the new structure. This is done without interrupting the flow of data and without requiring manual intervention in any scenario, whether it's through Standard extraction or CDC.

the process involves detecting any changes in the data source (Standard/CDC) by comparing it with the metadata. The metadata is essentially a record of the structure and attributes of the data source.

To ensure that the process is always up-to-date, Rivery will resync the metadata and the data source on each run, meaning it will compare the two and make any necessary updates. Once the metadata and the data source are in sync, Rivery will use the metadata as the base for the data processing.

In cases where the user makes changes to the data source, Rivery will detect these changes and update the metadata accordingly. This ensures that the metadata always reflects the most recent state of the data source, allowing for accurate processing and analysis.

Untitled 23.png

Dropping a Column in the Source Table

When a column in the Source is dropped, but the corresponding column in the Target still exists. In this case, the dropped field in the target table will be populated with null values.

Renaming a Column in the Source Table

Any renaming of a column in the source table will be detected as soon as a user runs the River. This will result in a new column being added to the target table with the updated name, along with a new record. The previous table records will be duplicated, with null values in the newly added column.

The main objective of this approach is to ensure that the addition of the new column does not negatively impact any existing processes. By inserting new records into the new column and keeping old records as null values, Rivery is able to maintain the integrity of the original data while also incorporating new information into the dataset.

Adding a Column in the Source Table

If a new column is added to the source table, Rivery will detect this as soon as the user runs the River and add it to the target table.
Previous records that already existed in the target table before the new column was created will have null values for that column.

Changing Data Type of an Existing Column

If the data type of an existing column is changed in the source table, the target column will be set as the "wider" data type.
For example, if the field is changed from an integer to a string, the target data type will be changed to a string, as this data type can encapsulate integer values.
However, if the column has been changed from a string to a date, it will remain a string in the Target.

Sources Supporting Schema Drift

Schema drift is available for the following Sources:

  • BigQuery
  • MariaDB
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • Salesforce
  • SAP S4/Hana
  • Teradata
  • Vertica
  • MongoDB - In this instance, we're using the most recent 100 pages from each run to sync the schema.

Was this article helpful?