Source to Target River
- Updated on 28 Jun 2022
A Source to Target River allows data from a source connector to be ingested into a cloud target. We'll go over the methods for creating, scheduling, and monitoring a Source to Target River in this tutorial.
To create a Source to Target River, click Create and then select Create Source to Target River.
The Source step is where you select the data source you would like to ingest data from and fill out the relevant parameters to further customize the ingestion.
You are first given a list of supported Rivery connectors to choose from as your source.
Regardless of which data source you choose, you'll first have to choose or define your connection. If you've already made a connection to the source in the 'Connections' tab, you should see this connection available in the dropdown. If you haven't yet made a connection, you can create one on the fly by selecting 'New Connection.'
Depending on which data source you choose, the Source tab will provide different options for customizing and shaping the data to be pulled into your target.
Using the Predefined Reports option, you can select multiple predefined reports, each with a predefined set of fields, simultaneously. These predefined reports provide a simple, easy configuration of your source data structure with a few clicks of a button.
For example, when connecting to HubSpot, you'll be able to choose your schema from a list of predefined reports built according to best practices.
For each data source, you can view documentation of the predefined reports describing all reports, fields, and schema definitions. For example, HubSpot Predefined Reports.
If you are looking for custom reports and advanced configuration, you can use the Custom Report option to create a single custom report by selecting a report type and fields.
When connecting to an API connector such as Facebook Ads, you'll be presented with parameters and options that reflect what the API offers. In Facebook Ads, for instance, you are first given the option to select which Facebook Ad report you would like to pull into your target.
When connecting to a Database source, the Source tab will look a bit different. Instead of pulling predefined reports and selecting accounts, campaigns, date ranges, and so on, you are prompted to select how you would like to pull data from your source database.
Selecting Time Periods
Depending on what type of source connector you are using, you will have different options for the range of data you wish to pull.
In most API connections, there will be an option to select a time period to pull data from:
In the above dropdown you'll find preset time period options, as well as a custom date range option.
A note about the custom date range option: on the first run with a custom date range, the river will pull data from the indicated Start Date to the indicated End Date (if the End Date is left empty, it will pull data from the Start Date up to the current time). The next time the river runs, the Start Date will be updated to reflect the date/time of the latest run, and the End Date will be cleared. Thus, each river load picks up where the previous load left off.
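For example (with illustrative dates): a first run configured with Start Date 2020/01/01 and End Date 2020/01/10 pulls the data between those two dates; on the next run, the Start Date becomes the date/time of that previous run and the End Date remains empty, so the river pulls everything from there up to the current time.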
'Last Days Back' option
In some source connections, there is a 'Last Days Back' option. This means that in addition to the time period specified, Rivery will add X number of retroactive days to the data pull. For example, if the river time period is set to 'Yesterday' and 'Last Days Back' = 1, then the river will pull both yesterday's and the prior day's data.
In our RDBMS sources (MySQL, SQL Server, Oracle, PostgreSQL), as well as some others (Salesforce and NetSuite, to name a couple), there is an option to set the Extract Method to load the data incrementally.
This means that instead of extracting all of the data in a given table or report, you can identify a field that will serve as your increment, most commonly a timestamp or an ID field of some sort. For example, if you set your river to load incrementally based on a field in your table called DATE_LAST_MODIFIED, then each time the river runs, Rivery will check the largest DATE_LAST_MODIFIED value from the previous run, and only records with a greater DATE_LAST_MODIFIED will be sent to the target data warehouse.
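Conceptually, the extraction behaves like the query sketched below. This is only an illustration of the incremental filter, not the SQL Rivery actually generates; the table name and bind parameters are hypothetical.

```sql
-- Illustrative sketch of an incremental pull (not Rivery's generated SQL).
-- :start_value is the largest DATE_LAST_MODIFIED recorded on the previous run;
-- :end_value is the upper bound of the current run's window.
SELECT *
FROM my_source_table
WHERE DATE_LAST_MODIFIED >= :start_value  -- Start Date is inclusive
  AND DATE_LAST_MODIFIED <  :end_value;   -- End Date defaults to exclusive
```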
It is important to note that the "Start Date" for the incremental field is inclusive, while a set "End Date" defaults to exclusive, meaning that for the following scenario:
Start Date = 2020/01/10 00:00:00,
End Date = 2020/01/11 03:00:00,
The query in the backend would have the following range:
start_date >= 2020/01/10 00:00:00
end_date < 2020/01/11 03:00:00
There is an option to specify that the end value should be included in these increments:
which would then change the above query range to look like:
start_date >= 2020/01/10 00:00:00
end_date <= 2020/01/11 03:00:00
The incremental loading method can save time and RPUs: since only the delta between runs is processed, the data volume will presumably be lower than when extracting all of the data on every run.
Once you have configured the Source tab to your requirements, move on to the Target tab.
In the Target tab, you will first be prompted to select your cloud target. This is where the data coming from the source will land.
Currently, Rivery supports the following cloud data warehouses and cloud file storage as targets:
- Google BigQuery
- Azure Synapse
- Amazon Redshift
- Databricks SQL
- Azure Blob Storage
Regardless of what target you choose, you will need to set the connection. Again, you can either choose an existing connection or create one on the fly.
If your target is a cloud data warehouse, the Target tab will look something like this:
You'll have to define a database, schema, and target table. This is where the data pulled from your selected source will land. Rivery will automatically detect available databases and schemas for you to choose from.
Next, choose the loading mode you require:
- Overwrite: Replace the target table completely with the data from this run. If the target table does not yet exist, it will be created automatically.
- Upsert-Merge: Only add new or changed records to the target table based on a defined key in the Column Mapping.
- Append Only: Append the data from this load onto the existing target table.
If you choose the Upsert-Merge functionality, you can define your merge key in the Column Mapping (the next tab).
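Conceptually, Upsert-Merge behaves like a SQL MERGE statement keyed on the logical key you define in the Column Mapping. The sketch below is illustrative only; the table and column names are hypothetical, and the statement Rivery generates for your target warehouse will differ.

```sql
-- Illustrative sketch of Upsert-Merge semantics (not Rivery's generated SQL).
-- "id" stands in for the logical key defined in the Column Mapping; if more
-- than one field is marked as a key, the ON clause matches on their combination.
MERGE INTO target_table AS t
USING staging_table AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value, date_last_modified = s.date_last_modified
WHEN NOT MATCHED THEN
  INSERT (id, value, date_last_modified)
  VALUES (s.id, s.value, s.date_last_modified);
```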
Upsert-Merge also has the ability to Filter on Logical Keys. This allows for deduplication within the source records (existing target records will not be deduplicated). It can be done by adding an expression (in the target database's syntax) to the 'Filter Order Expression' field. This acts as an ORDER BY clause in SQL, with the river only keeping the first record per key.
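A hedged sketch of that deduplication, expressed with a window function; again, the names are hypothetical and this may not match the implementation exactly:

```sql
-- Illustrative only: keep one record per logical key, ordered by the
-- 'Filter Order Expression' (here: date_last_modified DESC, so the latest wins).
SELECT *
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (
           PARTITION BY id                    -- the logical key
           ORDER BY date_last_modified DESC   -- the Filter Order Expression
         ) AS rn
  FROM staging_table AS s
) AS deduped
WHERE rn = 1;
```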
If your target is cloud storage, the parameters will look like this:
You'll need to identify the desired bucket and file path for the data to land. The default is the bucket used to set up the connection.
Auto-detection of the table schema can be found under the Schema tab, which also allows extra customization and updates to the metadata of the target table before the data is loaded.
To auto-detect the schema outlined in the Source tab you configured, select 'Auto-Mapping.' The corresponding fields and their data types will populate.
If Upsert-Merge was selected as your loading mode in the Target tab, you can define the primary key used to determine when data from the source should be merged. You can do this by clicking the key symbol next to the desired key field so that it turns yellow. If you mark more than one field as the key, they will act as a combination key, meaning that the combination of the field values will act as the primary key in the upsert-merge.
In addition to setting a primary key, you can edit the target field name, data type, and mode of any field. This enables renaming fields and casting data types during the load process itself.
The 'Cluster Key' option can be used to define a partition field.
The 'Expression' option allows for SQL syntax to be used to either define a new field or alter the definition of an existing field. The SQL syntax to be used is that of the target data warehouse you have selected.
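For instance, expressions along the lines of the following could be used (illustrative fragments only; the field names are hypothetical, and the exact functions available depend on your target warehouse):

```sql
-- Example mapping expressions, written in the target warehouse's syntax:
UPPER(country_code)                    -- alter an existing field's definition
CAST(amount AS DECIMAL(18, 2))         -- cast a field to another data type
CONCAT(first_name, ' ', last_name)     -- define a new derived field
```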
Schedule & Notifications
Once you've configured your river's Column Mapping step, you can either execute it on demand (by clicking Run at the bottom of the screen) or schedule it.
Click 'Schedule Me' under the Settings tab to access the scheduling options:
To set more custom scheduling frequencies, you can use a Cron expression.
You can enter your Cron expression under the Custom tab:
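For example, in standard five-field Cron syntax, `0 */6 * * *` would run the river at minute 0 of every sixth hour, and `0 6 * * 1` every Monday at 06:00 (assuming the scheduler accepts standard Cron expressions).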
If you would like to be notified upon a failure, a warning, or when a run-time threshold is reached, simply toggle the relevant notification on and enter your email address.
Once the river is running, click into the Activity tab to monitor the progress. For rivers pulling multiple entities at once, you can group the run statuses by individual run or by target table.
If an error is encountered, you can drill down into the error message of the river.
By drilling into an individual run, you can see the time it took to run, as well as the RPU amount the execution used.
Download Activity Logs
View our Activity Logs document.
Sub Rivers
Sub Rivers allow for building sub-processes into a single river. These sub-processes can inherit settings from the main river or have separate definitions.
To enable Sub Rivers, navigate to the Settings & Schedule tab and click the 'Enable Sub Rivers' toggle. Sub Rivers are currently available for the following sources:
- Google Ads
- Google Ad Manager
- Google Analytics
- YouTube Reporting
- Bing Ads
- Facebook Social
- Facebook Ads
- Twitter Ads
- Linkedin Social
- LinkedIn Ads
- Snapchat Marketing
- Instagram Social
- Active Campaigns
- REST (via Actions)
Once Sub Rivers are enabled, a new tab called 'Sub Rivers' will appear in the river.
To create the first Sub River, click 'New Sub River'. This will add a Sub River called 'Sub River 1' that can be renamed as desired.
For example, selecting 'Source Connection' from the parameter list will add this parameter to the Sub River list and allow for each Sub River to utilize a different Rivery source connection in the data pull.
Each parameter gives the option to either inherit from the main river or set a distinct value.
In addition, Sub Rivers can each have their own schedule (or inherit from the main river). Click the clock icon to the right of the parameter dropdown to set schedules at the Sub River level.