Databases Overview
  • 5 Minutes to read
  • Dark
    Light
  • PDF

Databases Overview

  • Dark
    Light
  • PDF

Article summary

Overview

Rivery enables migrating data from legacy and cloud databases to one of the DWHs we support.
Our database connectors are built with correctness, safety, and security in mind. The term "correctness" refers to the fact that your destination should always be an exact copy of your database source. Rivery should never cause difficulties in your source database, and clients can use a custom file zone to ensure that data does not leave their environment during transit.

We have simple methods to obtain data from databases without needing access to Binary Logs, in addition to connecting with Binary Log capture tools. We can handle more customers and databases thanks to our customizable approach.

Compatibility of Versions

DatabaseModesVersions SupportedSupport for Managing Schema Drift
ElasticSearchStandard ExtractionScroll API Version 7.10
Indexes API Version 7.10
Mapping API Version 7.10
MariaDBStandard ExtractionVersions 10.0 and Above
System Versioning
Custom Query
Microsoft SQL ServerStandard ExtractionAll Versions
Change Data CaptureSp1 2016 and Above (Standard or Enterprise Editions)
Change TrackingAll Versions
MongoDBStandard Extraction
Versions 3.4 and Above
Change Streams Versions 4.0 and Above
MySQLStandard ExtractionAll Versions
Change Data CaptureMySQL 5.6 and Above, or MySQL 8.0 and Above
OracleStandard Extraction
Versions 11g and Above
Change Data Capture
PostgreSQLStandard ExtractionAll Versions
Write Ahead Logging (WAL)Versions 10 and Above
RedshiftStandard Extraction
All Versions
TeradataStandard Extraction
Version 15.10
VerticaStandard ExtractionAll Versions

Supported Features:

Parquet File Conversion

It is possible to convert CSV/JSON files to Parquet files when using Amazon S3 as the Target.

Consult our Quick Guide for an illustration of how to do it.

Please keep in mind that while the Quick Guide uses Amazon S3 as a Source, this is possible with any Database and Google Cloud Storage.

Incremental Behavior

Rivery works in millisecond increments, but users can store data in more granular microsecond or nanosecond increments.

When operating incremental runs based on timestamp, it may appear that the most recent record is missing.

Rivery reads three numbers after the dot when there are six, so '2022-08-14 10:26:52.895132' becomes '2022-08-14 10:26:52.895' and the last record is missing from the current run when fetching the data.

Please note that this record will be retrieved in the next run with updated data.

Change Data Capture (CDC) Metadata Fields

CDC is a functionality found in database management systems, allowing for the detection and recording of changes made to data within a database. There are 4 metadata fields commonly used to provide information about the changes made to records.

This section details the automatically appended fields (columns) to the table when employing CDC, including:

  • __DELETED
  • __TS_MS
  • __TRANSACTION_ORDER
  • __TRANSACTION_ID

__DELETED 

The __DELETED column serves to indicate rows that have been removed from the Source database. In cases of modification, the status is clarified as follows:

  • True: indicating that the actual row was deleted from the Source database.
  • False: denoting that the row still exists in the Target warehouse but has been updated.
  • Null: Indicating that the row remains unchanged in the Target warehouse without any updates.

When a row is deleted from the Source database, it is categorized as a "Hard Delete,"  indicating that the deletion is irreversible. Our CDC process records this event in the logs table, and the corresponding record appears as a "Soft Delete" in the Target warehouse. This signifies that the most recent version of the row still exists in your Target warehouse. The "Soft Delete" approach in database table management maintains a record of the deleted data, providing the possibility of data recovery if needed, as the deleted rows persist in the Target warehouse.

Please Note:

A Soft Delete retains the record in the table without physically removing it, leading to its continued presence in your Target warehouse and inclusion in backups unless additional steps are taken to exclude it.

__TS_MS

The TS_MS column, short for the TIMESTAMP Milliseconds column, is designed to convey integer values or null entries. Depending on the circumstance, the timestamps within this column are displayed in integer or null. This column serves the purpose of documenting the timing of changes when employing the CDC method.

___TRANSACTION_ORDER

This column indicates the sequence in which the transaction was emitted from the database. It derives its significance from the transaction order within the database log or replication stream. Ensuring that __Transaction_Order is both unique for each transaction and accurately mirrors the actual commit order is crucial.

___TRANSACTION_ID

This serves as a distinct identifier for each transaction generated by the database. The nature of this identifier depends on the database system in use and may be derived from factors such as log position, GTID, or LSN. For instance, in MySQL, the __Transaction_ID field might be assigned the GTID of the transaction, while in PostgreSQL, it could be configured to reflect the LSN.

Here is an example:

Add a Calculated Column

One of the key features of Rivery is the ability to add calculated columns using expressions. This powerful functionality allows the user to customize their output table beyond the limits of the raw metadata extracted from the Source by including expressions, mathematical or string operations, or simple functions to enrich and tailor the output to their use-case.

For more detailed information, please refer to our Targets document.

Limitations of Kafka in Change Data Capture (CDC)

Our CDC implementation uses Kafka for file screening, and it comes with a limitation:

Topics (representing tables and columns) in Kafka cannot be named using characters like underscores (_) or commas (,).


Extended Execution Time for Large Tables 

Rivery extends the execution time for handling large datasets. This feature allows Rivers to efficiently process and load extensive data, automatically extending the processing time up to 48 hours if necessary, ensuring successful data loading.

Rivery has an automatic mechanism that adjusts the execution time based on the size of the table or the number of rows. For large tables, Rivery will automatically switch to long-duration mode to complete the process without manual intervention.

User Control

You can adjust the default execution time in the River's settings tab to customize the timeout limit. If a custom timeout value is set, the process will terminate after the selected time is exceeded.

Please Note:
Rivery will automatically process large tables and reports for up to 48 hours, unless the user modifies the default timeout setting.


image.png


Factors Influencing Storage Efficiency and Performance in Databases

The size and storage efficiency of both tables in a database or data source are influenced by factors such as data types, compression, indexes, the number of records or tables, and the configuration of storage, including storage zones. Optimizing these factors is crucial for efficient storage, improved performance, and scalability in a database environment.

In addition to tables, storage in a database is also influenced by similar factors:

  • Data Types: Different data types stored in tables occupy varying amounts of storage space.
  • Table Size: The size of individual tables depends on the amount of data stored within them.
  • Number of Tables: More tables stored in the database lead to increased storage requirements.
  • Compression: Compression techniques applied to tables can reduce their storage size.
  • Indexes: Similar to tables, indexes created on tables for efficient data retrieval also add to storage requirements.
  • Storage Configuration: The configuration of storage, including factors like block size, file system type, and storage zones, impacts storage efficiency, performance, and scalability.


Connection

Click the sections below for details on how to connect to our supported databases:

ElasticSearch

MariaDB

Microsoft SQL Server

MongoDB

MySQL

Oracle

PostgreSQL

Redshift

SAP S4/Hana

Teradata

Vertica

Note: To view the connectivity options for our databases, please consult the connectivity option documentation.


Was this article helpful?