- 4 Minutes to read
PostgreSQL Change Data Capture (CDC) Overview
- 4 Minutes to read
Rivery's Change Data Capture (CDC) extraction mode provides a real-time stream of any changes made to the databases and tables configured, eliminating the need to implement and maintain incremental fields or retrieve data via select queries. It also allows you to retrievedeleted rows and schema changes from the database.
How Does Change Data Capture Extraction Work?
TheChange Data Capture (CDC) mechanism is based on aWrite-Ahead Log (WAL) provided by the PostgreSQL database. The logical decoding feature of PostgreSQL is available in versions 10 and higher.
In PostgreSQL, the logical decoding mechanism creates a consistent and easy-to-understand format in one place. The logical decoding mechanism in PostgreSQL, on the other hand, is based on converting the data written to the WAL using an output plugin that converts the WAL transaction rows into a dedicated format. To be able to consume data from the database, this output plugin must be installed on the PostgreSQL server and the logical decoding configuration must be enabled. See the PostgreSQL configuration docs for more information on setting up a PostgreSQL database for Rivery CDC extraction.
Rivery uses the output plugin to continuously pull new rows in theWAL in order to pull data using the CDC architecture. In order to get historical data from the database, Rivery cannot rely on the entire log history already existing prior to setting up a river. This is because PostgreSQL normally keeps track of the WAL and purges it after a certain amount of time.
Rivery uses the Overwrite loading mode to create a full snapshot (or migration) of the chosen table(s) in order to align the data and metadata as it was on the first run. After the migration is complete, Rivery takes the existing WAL records and performs an Upsert-merge to the target table(s), while continuing to fetch new records from the log as they are created.
Rivery's PostgreSQL connector reads WAL records and generates change events in the FileZone files for row-level INSERT, UPDATE, and DELETE commands. Each file represents a set of database operations performed over a period of time. The data from the log is continuously streamed into the FileZone path determined in the river (with timeframes of no more than 15 minutes) and pushed into the target by the river's scheduled frequency. This method saves the data first in the FileZone, and then it can be pushed into the target DWH at any time.
FileZone is covered in further detail in the Target documentation.
Discrepancies in transaction records can arise when two users simultaneously execute identical transactions, causing conflicts in the timestamp field.
Recognizing this challenge, Rivery has implemented a "sequence" Change Data Capture (CDC) mechanism to tackle this issue.
Rivery has enhanced each emitted record from the database by incorporating two extra metadata fields: '__transaction_id' and '__transaction_order'.
The '__transaction_id' field serves as a unique identifier for each transaction, ensuring that no two transactions share the same identifier. This uniqueness allows for precise identification and differentiation between transactions, mitigating conflicts arising from identical timestamps.
Furthermore, the '__transaction_order' field denotes the order in which the transactions were emitted from the database. By incorporating this field, the sequencing of transactions can be accurately maintained, enabling downstream systems such as Apache Kafka or AWS Kinesis to process and order transactions correctly.
The inclusion of these metadata fields guarantees that the ordering of transactions is preserved throughout the River. As a result, smooth and accurate transaction flows can be achieved, resolving the discrepancies that previously arose from transactions with identical timestamps.
The additional fields are depicted in this table:
How to EnableExtraction?
- A brief reminder appears, encouraging you to check your connection and set up your Source and Target, which will happen next. Select 'Got It' to proceed.
- Choose Log Based Version from the Advanced Options drop-down menu.
- Set the name of the custom replication slot (optional).
- Set a custom publication name (optional).
- Turn the 'Enable Log' toggle to true at the bottom of the page.
Type Mapping in CDC
While extracting your data, we match PostgreSQL data types to Rivery data types.
The mapping ofPostgreSQL data types to Rivery-compatible types is shown in the table below:
|PostgreSQL Type||Rivery Type|
- The 'TOAST' file format is not supported.
- Any special character will be replaced with an underscore. As a result, if you wish to edit the table name, go to:
1. The 'Schema' tab
2. Select a Table
3. Click 'Table Settings'
4. Choose 'Edit' to change the table name manually.