Snowflake Upsert-Merge Loading Mode Options

Rivery offers three distinct options when the Upsert-Merge loading mode is selected for a Snowflake target.

The Upsert-Merge option is used to update existing data in a table that may have changed in the source since the last run, as well as to insert new records. While all three Upsert-Merge options produce the same result, they differ in their backend process and have different performance implications.

Switch-Merge

This is the default and generally the fastest method of updating rows in a target table. The backend process isolates the "new data" by querying the target table for rows whose primary keys match the incoming source data, and constructs a temporary table containing that incoming "new data" together with the "old data" (rows in the target whose primary keys are not being updated). Rivery then "switches" the existing target table with the temporarily constructed one, producing a target table in which the rows with matching primary keys have been updated.


Example Backend Query for this process:

CREATE TABLE <tmp_target_table> 
COPY GRANTS
AS
SELECT col1, col2, col3, col4
FROM (
  SELECT trgt.col1 as col1, 
        null as col2 /* In case the field was dropped from the mapping */, 
        trgt.col3 as col3, 
        trgt.col4 as col4
  FROM <trgt_table> as trgt
  WHERE NOT EXISTS (
    select 1 from <tmp_src_table> src_ where hash(trgt.key) = src_.gk /* gk created in advance */
  )
  UNION ALL
  SELECT src.col1 as col1, 
        src.col2 as col2, 
        null as col3 /* In case the field doesn't exist in the target */, 
        src.col4 as col4
  FROM <tmp_src_table> src
);
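
The "switch" step itself is not shown in the query above. Conceptually, it amounts to swapping the freshly built temporary table into place. The following is a minimal sketch of that step, assuming Snowflake's ALTER TABLE ... SWAP WITH syntax (the table names are the same placeholders used above; this is not necessarily Rivery's exact statement):

/* Swap the temporary table into place, then discard the old data */
ALTER TABLE <trgt_table> SWAP WITH <tmp_target_table>;
DROP TABLE <tmp_target_table>; /* after the swap, this name holds the old table's data */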


Delete-Insert

This option creates two statements in the backend: a DELETE statement that removes from the target table every row whose key matches a key in the incoming data, and an INSERT statement that loads the incoming data into the target table. Both of these statements are deployed within a single transaction.

Delete-Insert does not preserve duplicate keys. That is, any rows in the new data whose keys match keys in the destination table will automatically overwrite the corresponding rows in the destination table. The advantages of Delete-Insert are that a unique primary key is not required, and metadata changes such as renamed columns or data type changes will not result in an error.

Example Backend Query for this process:

DELETE from WAREHOUSE.SCHEMA.metadata trgt_
    USING WAREHOUSE.SCHEMA.tmp_metadata_b8fc00 src_
    where trgt_._adset_id = src_._adset_id;

INSERT INTO WAREHOUSE.SCHEMA.metadata(
    _adset_id, _campaign_name, _account_id, _campaign_id, _date_stop, _date_start,
    _ad_id, _adset_name, _ad_name, _account_name, _reach, _currency,
    _impressions, _actions)
    SELECT _adset_id::STRING as _adset_id,
           _campaign_name::STRING as _campaign_name,
           _account_id::STRING as _account_id,
           _campaign_id::STRING as _campaign_id,
           _date_stop::STRING as _date_stop,
           _date_start::STRING as _date_start,
           _ad_id::STRING as _ad_id,
           _adset_name::STRING as _adset_name,
           _ad_name::STRING as _ad_name,
           _account_name::STRING as _account_name,
           _reach::STRING as _reach,
           _currency::STRING as _currency,
           _impressions::STRING as _impressions,
           _actions::STRING as _actions
           from WAREHOUSE.SCHEMA.tmp_metadata_b8fc00;
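
As noted above, Rivery deploys the DELETE and INSERT statements within a single transaction, so the target table is never left in a half-updated state. The following is a minimal sketch of that wrapping, assuming Snowflake's BEGIN TRANSACTION / COMMIT syntax and abbreviating the column list from the example above:

BEGIN TRANSACTION;

    /* Remove target rows whose key appears in the incoming batch */
    DELETE from WAREHOUSE.SCHEMA.metadata trgt_
        USING WAREHOUSE.SCHEMA.tmp_metadata_b8fc00 src_
        where trgt_._adset_id = src_._adset_id;

    /* Re-insert the full incoming batch (the INSERT ... SELECT shown above) */
    INSERT INTO WAREHOUSE.SCHEMA.metadata(_adset_id, _campaign_name /* ...remaining columns... */)
        SELECT _adset_id::STRING, _campaign_name::STRING /* ...remaining columns... */
        FROM WAREHOUSE.SCHEMA.tmp_metadata_b8fc00;

COMMIT;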


Merge

The Merge loading mode option utilizes the Snowflake-native MERGE command to update rows that already exist in the target table, and inserts new rows that do not already exist. This loading method maintains the table history within Snowflake.

Merge requires that the source keys loaded into the target table are unique, or true primary keys. If there is any duplication in the source or target data, the command will fail and highlight the duplicate row in the error message.

Example Backend Query for this process:

MERGE into WAREHOUSE.SCHEMA.RIVER_EXECUTIONS_TM trgt_                    
    USING WAREHOUSE.SCHEMA.tmp_RIVER_EXECUTIONS_TM_8665e1 src_                    
    ON trgt_.run_id = src_.run_id                    
    WHEN MATCHED THEN                      
        update set trgt_.river_id = src_.river_id,
            trgt_.task_id = src_.task_id,
            trgt_.run_id = src_.run_id,
            trgt_.status = src_.status,
            trgt_.current_files = src_.current_files,
            trgt_.total_size = src_.total_size,
            trgt_.total_files = src_.total_files,
            trgt_.row_update_date = src_.row_update_date,
            trgt_.run_date = src_.run_date,
            trgt_.run_end_date = src_.run_end_date,
            trgt_.account = src_.account,
            trgt_.error_description = src_.error_description,
            trgt_.task_ordinal = src_.task_ordinal,
            trgt_.last_task_activity_id = src_.last_task_activity_id,
            trgt_.datasource_type = src_.datasource_type,
            trgt_.is_hidden = src_.is_hidden,
            trgt_.logic_steps = src_.logic_steps,
            trgt_.units = src_.units,
            trgt_.is_cancel = src_.is_cancel,
            trgt_.target_name = src_.target_name,
            trgt_.source_name = src_.source_name,
            trgt_.scheduler_id = src_.scheduler_id,
            trgt_.is_multi = src_.is_multi,
            trgt_.total_rows = src_.total_rows,
            trgt_.total_tables = src_.total_tables,
            trgt_.env_id = src_.env_id                    
    WHEN NOT MATCHED THEN                      
        INSERT (
            river_id,
            task_id,
            run_id,
            status,
            current_files,
            total_size,
            total_files,
            row_update_date,
            run_date,
            run_end_date,
            account,
            error_description,
            task_ordinal,
            last_task_activity_id,
            datasource_type,
            is_hidden,
            logic_steps,
            units,
            is_cancel,
            target_name,
            source_name,
            scheduler_id,
            is_multi,
            total_rows,
            total_tables,
            env_id
            ) 
        values (
            src_.river_id,
            src_.task_id,
            src_.run_id,
            src_.status,
            src_.current_files,
            src_.total_size,
            src_.total_files,
            src_.row_update_date,
            src_.run_date,
            src_.run_end_date,
            src_.account,
            src_.error_description,
            src_.task_ordinal,
            src_.last_task_activity_id,
            src_.datasource_type,
            src_.is_hidden,
            src_.logic_steps,
            src_.units,
            src_.is_cancel,
            src_.target_name,
            src_.source_name,
            src_.scheduler_id,
            src_.is_multi,
            src_.total_rows,
            src_.total_tables,
            src_.env_id
            )
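
If a Merge run fails on a duplicate-key error, a quick check along the following lines can surface the offending rows before re-running. This is a sketch only; the staging table and key column are taken from the example above:

/* Find keys that appear more than once in the staged source data */
SELECT run_id, COUNT(*) AS occurrences
FROM WAREHOUSE.SCHEMA.tmp_RIVER_EXECUTIONS_TM_8665e1
GROUP BY run_id
HAVING COUNT(*) > 1;
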

Loading Option Best Practices

While all three options update the rows in the destination table with new data, there are some considerations to take into account when selecting which one to use.

If the use case does not require the preservation of Snowflake's "time-travel" feature, and your dataset is very large and without a cluster key set, sticking with the default Switch-Merge is your best option. Switch-Merge is optimized for updating very large datasets.
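
Because Switch-Merge replaces the target table with a newly built one, Time Travel queries against the table's pre-load state will generally no longer work after the switch. For illustration, a query of the following form (assuming the metadata table from the earlier example and a one-hour offset) depends on the table's history being retained:

/* Time Travel query against the table's state one hour ago; after a table switch,
   that history belongs to the old table object and is no longer reachable under this name */
SELECT *
FROM WAREHOUSE.SCHEMA.metadata AT (OFFSET => -3600);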

If the use case requires preservation of Snowflake's "time-travel", and your dataset is small to medium in size, or large with cluster keys set, either Delete-Insert or Merge can be used, as both preserve table history.

If the use case depends on the uniqueness of a key in the source, Merge is the recommended option. The Merge option is most often seen with relational database sources, as these tend to have defined and maintained primary keys in the source.

If the use case dictates that the metadata and columns may vary often with each load (new columns added to the table, changes in data types, changes in column mapping, etc.), Delete-Insert is the recommended loading mode due to its ability to handle such changes and deal with duplication. In addition, many API sources output datasets without a true primary key; in this case, Delete-Insert is advantageous because key uniqueness is not required.


Fallbacks and Considerations

Because Rivery needs to manage the table schema (metadata) in addition to performing the merge, and in order to avoid deadlocks and ensure full recovery of the target table in case of error, Rivery automatically falls back to the Switch-Merge method whenever a change is detected in the table schema, for that specific run only.
For example, even when the loading mode is set to Merge, if a schema change is detected, Rivery will perform the Switch-Merge mechanism so that the schema is updated correctly. Subsequent runs will return to the merge method defined in the river.





