Debugging Data Discrepancies

Discrepancies with API sources

In our experience, discrepancy issues in aggregated API sources are solved fastest by comparing exactly apples-to-apples from source to target. Some metrics in these sources are ‘uniques’ (distinct counts): sliced at different levels of detail, they produce totals that cannot be meaningfully summed and compared from system to system.
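A small illustration of why unique-count metrics resist aggregation (the data here is hypothetical): summing daily uniques over-counts anyone active on more than one day, so daily totals will not reconcile with a weekly unique count.

```python
# 'Unique users' counted per day vs. across the whole week.
daily_users = {
    "Mon": {"alice", "bob"},
    "Tue": {"alice", "carol"},
    "Wed": {"bob", "dave"},
}

sum_of_daily_uniques = sum(len(users) for users in daily_users.values())
weekly_uniques = len(set().union(*daily_users.values()))

print(sum_of_daily_uniques)  # 6
print(weekly_uniques)        # 4 -- alice and bob are double-counted above
```

This is why the granularity of the comparison matters more for these sources than for simple additive metrics like clicks or cost.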

Debugging Steps

  1. Confirm an apples-to-apples comparison. Check that the report you are comparing against in the source system is at the same granularity as the configuration set in the river. For example, if the discrepancy is in Google Analytics data and the source report shows totals by week while the river has a time resolution of ‘daily’, the two views are not slicing the same dimensions, which is likely the cause of the perceived discrepancy.
  2. Narrow down the search. For example, select a single campaign on a single day and compare those metrics to the source. If that matches, slowly work your way up to higher levels of aggregation to find where the data no longer matches.
  3. Check your loading mode (set in the Target tab). The river may be unexpectedly overwriting or appending data. If the loading mode is set to upsert-merge, confirm that the keys set in the river are the ones that define the intended level of detail of the dataset.
  4. If all else fails, create a support ticket by clicking the 'Support' button in the Rivery console. We're happy to help!
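The narrowing-down in step 2 can be sketched as a key-by-key diff between the two systems. The dictionaries below stand in for query results pulled from the source API and the target warehouse; the campaign names and dates are hypothetical.

```python
# Metric totals keyed by (campaign, day) from each system.
source_rows = {
    ("camp_a", "2024-01-01"): 120,
    ("camp_a", "2024-01-02"): 95,
    ("camp_b", "2024-01-01"): 40,
}
target_rows = {
    ("camp_a", "2024-01-01"): 120,
    ("camp_a", "2024-01-02"): 90,   # the mismatch to pinpoint
    ("camp_b", "2024-01-01"): 40,
}

def find_mismatches(source, target):
    """Return (key, source_value, target_value) for every key that differs."""
    keys = set(source) | set(target)
    return [
        (k, source.get(k), target.get(k))
        for k in sorted(keys)
        if source.get(k) != target.get(k)
    ]

for key, src, tgt in find_mismatches(source_rows, target_rows):
    print(f"{key}: source={src} target={tgt}")
```

Starting at the finest granularity and widening the key (day only, then campaign only) shows exactly which slice introduces the discrepancy.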

Discrepancies with Database Sources

More often than not, discrepancies here come down to the configuration of the river itself. Several parts of a database multi-table river configuration can cause confusion or contribute to data discrepancies:

Extraction method - when set to ‘Incremental’, Rivery requires an incremental field, either a timestamp or a running number. A timestamp field should update in the source whenever a change to a record is made. A running-number field should produce a new value for every single change. We have seen past cases where this was NOT true in the source database, so the incremental run did not pick up every change and the resulting behavior was not as expected. Whichever option is chosen, the incremental field should never contain NULL values in the source data.
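A minimal sketch of why NULLs in the incremental field are dangerous, using a timestamp filter. The field name `updated_at` and the row shape are illustrative assumptions, not Rivery internals.

```python
from datetime import datetime

# Rows in the source table; one has a NULL incremental value.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 30)},
    {"id": 3, "updated_at": None},  # NULL: can never pass the filter
]

last_run = datetime(2024, 1, 1, 12, 0)  # high-water mark from the previous run

new_or_changed = [
    r for r in rows
    if r["updated_at"] is not None and r["updated_at"] > last_run
]
skipped = [r for r in rows if r["updated_at"] is None]

print([r["id"] for r in new_or_changed])  # [2]
print([r["id"] for r in skipped])         # [3] -- silently missed every run
```

A row whose incremental value is NULL fails every `> last_run` comparison, so it is silently excluded from every incremental load.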

Loading mode - this causes the most confusion when set to upsert-merge. In our multi-table rivers we pre-detect the primary keys, but some tables may not have primary keys defined in the source database, so the user must select their own in the column mapping. No field selected as a primary key should contain NULL values in the source.
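A sketch of the kind of check worth running on user-selected merge keys before an upsert-merge load: any NULL in a key column, or any duplicate key tuple, will make the merge behave unexpectedly. The column names here are hypothetical.

```python
# Sample rows and the columns chosen as the upsert-merge key.
rows = [
    {"order_id": 1, "line": 1, "amount": 10.0},
    {"order_id": 1, "line": 2, "amount": 5.0},
    {"order_id": 2, "line": None, "amount": 7.5},  # NULL in a key column
]
key_columns = ["order_id", "line"]

def validate_keys(rows, key_columns):
    """Flag NULLs and duplicates in the chosen key columns."""
    problems = []
    seen = set()
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if any(v is None for v in key):
            problems.append(("null_key", key))
        elif key in seen:
            problems.append(("duplicate_key", key))
        seen.add(key)
    return problems

print(validate_keys(rows, key_columns))  # [('null_key', (2, None))]
```

If this check reports duplicates, the selected keys do not actually define the level of detail of the dataset, and the merge will collapse rows that should remain distinct.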

Schedule frequency - sometimes a discrepancy is simply ‘we haven’t loaded the latest data yet’ because the schedule is set to something like ‘daily’. The source database has new records, but the river has not yet run to load them.
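A quick freshness check along these lines can rule schedule lag in or out before deeper debugging. The timestamps and the `schedule_interval` value are illustrative assumptions.

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)          # river runs daily
now = datetime(2024, 1, 2, 18, 0)
newest_in_target = datetime(2024, 1, 2, 6, 0)  # newest record loaded so far

lag = now - newest_in_target
if lag <= schedule_interval:
    print("Within normal schedule lag -- wait for the next run.")
else:
    print("Data is staler than the schedule allows -- investigate.")
```

If the lag fits within the schedule interval, the missing records will likely appear after the next scheduled run.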
