MySQL Log-Based Overview
What is Log-Based Extraction?
How is Log-Based Extraction configured in MySQL?
The CDC mechanism for MySQL is based on its binary log (more commonly known as the binlog). As with most databases, logging must be explicitly enabled in MySQL to ensure Change Data Capture takes place. This feature comes built into MySQL Server and just needs to be turned on.
Enabling this log is a straightforward operation in MySQL: the following three parameters just need to be configured correctly on your MySQL database server:
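As a minimal sketch, assuming the standard binlog settings required for row-based CDC (log_bin, binlog_format, and expire_logs_days; server_id is also required whenever the binlog is enabled), a my.cnf fragment might look like:

```ini
[mysqld]
# Enable the binary log and give its files a base name
log_bin          = mysql-bin
# Row-based logging is required so each change event carries full row data
binlog_format    = ROW
# Retain binlog files for 7 days before they are purged
expire_logs_days = 7
# A unique server_id is mandatory when the binlog is enabled
server_id        = 1
```

Note that on MySQL 8.0, expire_logs_days is deprecated in favor of binlog_expire_logs_seconds (e.g., 604800 seconds for 7 days).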
Configuring expire_logs_days to the right number of days is crucial for continuously bringing data in from the log. The trade-off here is storage vs. retention of the data and recovery when failures happen: setting expire_logs_days to a very long time will reduce the available storage on the MySQL server or cluster, while setting it to a short time may purge the log before Rivery fetches the new data if something goes wrong in the connector.
Therefore, the recommended value for the expire_logs_days parameter when using Rivery is 7 (days). This recommendation balances the storage needs of the binlog against the retention Rivery requires to recover from a failure in fetching the log.
How does Log-Based Extraction work in Rivery?
The MySQL connector in Rivery reads the binlog records and produces change events for row-level INSERT, UPDATE, and DELETE commands into FileZone files. Each file represents a group of operations made in the database within a specific timeframe. The data from the log is streamed constantly (in timeframes of at most 15 minutes) into the FileZone path determined in the river, and is pushed into the target by the river's scheduled frequency. This approach means the data is saved, first and foremost, in the FileZone, and can then be pushed into the target DWH at any time.
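To make the time-windowed batching concrete, here is a simplified, hypothetical Python sketch (the event shape and grouping logic are illustrative, not Rivery's implementation) that groups row-level change events into 15-minute windows, mirroring how each FileZone file covers a bounded timeframe:

```python
from datetime import datetime
from collections import defaultdict

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)

def group_events(events):
    """Group change events (dicts with 'ts', 'op', 'row') by 15-minute window.

    Each resulting group corresponds to one FileZone file covering at
    most a 15-minute span of binlog activity.
    """
    files = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        files[window_start(event["ts"])].append(event)
    return dict(files)

events = [
    {"ts": datetime(2024, 1, 1, 10, 3),  "op": "INSERT", "row": {"id": 1}},
    {"ts": datetime(2024, 1, 1, 10, 14), "op": "UPDATE", "row": {"id": 1}},
    {"ts": datetime(2024, 1, 1, 10, 20), "op": "DELETE", "row": {"id": 1}},
]
grouped = group_events(events)
# The 10:00 window holds two events; the 10:15 window holds one
print({k.strftime("%H:%M"): len(v) for k, v in grouped.items()})
```

The separation between writing these windowed files and loading them into the target is what lets the data be pushed into the DWH on the river's own schedule.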
For more information regarding the FileZone configuration in Rivery, you can refer to the target configuration documentation.
Loading tables with log snapshot tables
Loading your data from a database to a target using CDC offers an additional capability: pulling the log snapshot table.
Consider the following example table:
To associate each log snapshot table with its source table, the same name is used with a "_LOG" suffix. E.g., DEPARTMENTS is the table itself passed to the target using CDC, and DEPARTMENTS_LOG is its respective log snapshot.
In the log snapshot table, each action performed on the table is appended, along with when the row was changed (in the __ts_ms field) and whether it was deleted (in the __deleted field).
The original table:
Where dept_no and dept_name are the fields in DBO.DEPARTMENTS, and update_time is the calculated expression in the target's mapping: current_timestamp().
The automatically added fields __deleted and __ts_ms are null for all records added prior to enabling CDC and migrating the rivers.
The log snapshot table (only loaded when Status = Streaming, as we describe at the bottom):
where one record was added after the river's migration:
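The relationship between a table and its log snapshot can be sketched in Python (the __ts_ms and __deleted field names come from the description above; the sample rows are hypothetical): replaying the appended log entries in timestamp order reconstructs the table's current state, treating __deleted = "true" as a removal:

```python
def replay_log(log_rows, key):
    """Reconstruct the current table state from a log snapshot table.

    log_rows: appended change records carrying the row's columns plus
              __ts_ms (change time, ms) and __deleted ("true"/"false").
    key:      name of the primary-key column.
    """
    state = {}
    for row in sorted(log_rows, key=lambda r: r["__ts_ms"]):
        if row["__deleted"] == "true":
            state.pop(row[key], None)   # a delete removes the row
        else:
            data = {c: v for c, v in row.items()
                    if c not in ("__ts_ms", "__deleted")}
            state[row[key]] = data      # an insert/update upserts it
    return state

# Hypothetical DEPARTMENTS_LOG entries
log = [
    {"dept_no": "d001", "dept_name": "Marketing",
     "__ts_ms": 1700000000000, "__deleted": "false"},
    {"dept_no": "d002", "dept_name": "Finance",
     "__ts_ms": 1700000001000, "__deleted": "false"},
    {"dept_no": "d002", "dept_name": "Finance",
     "__ts_ms": 1700000002000, "__deleted": "true"},
]
# d002 was deleted, so only d001 survives in the current state
print(replay_log(log, "dept_no"))
```

This also illustrates why the log snapshot is useful: the main table only shows the latest state, while the _LOG table preserves the full history of changes.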
These are the possible configurations when using CDC:
Each river starts with status = Waiting For Migration, in which everything currently present in the table is loaded.
After the first run, the status changes to Streaming, in which the log snapshot table is also loaded. The river's scheduling determines how frequently the logs are written to the FileZone; this option can be changed later in the river's "Settings and Schedule".
Version 1 will soon be deprecated, so please do not configure it.