Upsert-Merge Method Overview
  • 2 Minutes to read
  • Dark
    Light
  • PDF

Upsert-Merge Method Overview

  • Dark
    Light
  • PDF

Article summary

Overview

Rivery's Upsert-Merge method ensures data consistency by synchronizing records between sources and target systems. It checks if a record exists in the target table using a specified key. If found, the record is updated; otherwise, a new record is inserted. This process keeps data up-to-date across multiple systems.

Supported Sources

Rivery's Upsert-Merge method is compatible with:

  • Amazon Athena

  • Amazon Redshift

  • Azure SQL

  • Azure Synapse Analytics

  • Databricks SQL

  • Firebolt

  • Google BigQuery

  • PostgreSQL RDS/Aurora

  • Snowflake

  • Treasure Data

Please Note: For Snowflake, Azure SQL, and Databricks, multiple Merge methods are available, each with distinct performance implications.

Merge vs. Switch Merge

Merge Method

The Merge method applies updates and inserts directly into the target table using a MERGE INTO statement. This method is efficient for incremental updates where only a subset of records change.

Merge Method Flowchart:

Key Characteristics:

  • Directly updates existing records and inserts new ones.

  • No intermediate table replacement.

  • Best for smaller, frequent updates.

Disadvantage:

  • Key Duplication Failure: If the increment contains duplicate keys, the Merge method will fail due to conflicts in the target table. This can occur when multiple source records share the same key but have differing values, leading to an ambiguous update scenario.

Sample Merge Query:

MERGE INTO target_table AS tgt
USING source_table AS src
ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.column1 = src.column1, tgt.column2 = src.column2
WHEN NOT MATCHED THEN
    INSERT (id, column1, column2) VALUES (src.id, src.column1, src.column2);

Switch Merge Method

The Switch Merge method creates a temporary table to store the merged data, then replaces the target table in an atomic swap. This approach minimizes downtime and is ideal for large transformations or schema changes.

Switch-Merge Method Flowchart:

Key Characteristics:

  • Uses a temporary table to hold merged data before replacement.

  • Ensures atomic replacement for consistency.

  • Reduces downtime and is efficient for large-scale data changes.

Sample Switch Merge Query:

CREATE OR REPLACE TABLE temp_table AS
SELECT * FROM source_table;
ALTER TABLE target_table SWAP WITH temp_table;

Merge vs. Switch Merge: Key Differences

Feature

Merge

Switch Merge

Operation

Updates/inserts specific rows

Replaces the entire table

Best For

Insure non dups

Full table refresh or transformed Large datasets.

Why Rivery Uses Temporary Tables in Both Methods

Although the Merge method appears to update the target table directly, Rivery creates a temporary table first to ensure data integrity. Both methods use a temporary table:

  • Merge: Changes are applied to a temporary table before being merged into the target, without replacing the table.

  • Switch Merge: The temporary table fully replaces the target table in a single step.

Automatic Failover to Switch Merge

Rivery may switch from Merge to Switch Merge in cases such as:

  • Schema changes (e.g., new columns in the source table)

  • Primary key conflicts that could cause inconsistencies

  • Performance constraints or timeouts

Delete-Insert Method (Snowflake-Specific)

In Snowflake, an alternative upsert method involves deleting outdated rows before inserting new ones. This is useful when a full data refresh is needed.

Delete-Insert Method Flowchart:

Sample Delete-Insert Query:

DELETE FROM schema.metadata trgt_
USING schema.tmp_metadata src_
WHERE trgt_._id = src_._id;

INSERT INTO schema.metadata (id, name, email)
SELECT id, name, email FROM schema.tmp_metadata;

Best Practices: Avoid Running Multiple Rivers on the Same Target Table

Running multiple Rivers to the same target table can cause:

  • Data inconsistencies due to overlapping updates.

  • Performance issues from excessive resource usage.

  • Locking conflicts, leading to transaction failures.

  • Unclear data lineage, making debugging difficult.


Was this article helpful?