MongoDB Best Practice For Pulling Large Amounts Of Data

In historical data migration or large-collection scenarios, you may need to pull a vast amount of data from MongoDB into any of the target data warehouses available in the Rivery console.

This guide walks you through the best practices for connecting your MongoDB to Rivery in an efficient way.

Source Configuration

Extract Method:
The suggested best practice for pulling data from your collection is to use the “Incremental” extract mode. This mode pulls data by an incremental value that acts as an index, fetching only the data that has changed since the last run. The Incremental extract method offers the following additional options (an illustrative query follows this list):

  • Incremental Attributes:
    Set the name of the incremental attribute to run by. e.g. UpdateTime.
  • Incremental Type:
    Timestamp | Epoch. Choose the type of incremental attribute.
  • Date Range:
    • In Timestamp, choose your date increment range to run by.
    • In Epoch, choose the value increment range to run by.
    • If no End Date / Value is entered, the default is the last date/value in the table. The increment will be managed automatically by Rivery.
  • Include End Value:
    Determines whether the increment process includes the end value of the run.
  • Interval Chunks:
    Defines the chunks the data will be pulled by. Split the data by minutes, hours, days, months, or years. It is recommended to split your data into chunks when a large amount of data is returned, to avoid a call timeout on the source side.
  • Max Rows In Batch:
    When the data in the file zone is parsed from BSON to a JSON file, it is best practice to set a maximum number of rows per batch (the number of rows read in a single batch before moving on to the next one). To determine an appropriate batch size, follow this principle:

mongo_best_1.png

After every 50% reduction, test the river and check whether it runs successfully. Follow this process until you reach a size that enables the river to run as expected.

  • Filter (Optional):
    Filter the data that will be pulled. The filter format is MongoDB Extended JSON. e.g. {"account": {"$oid": "1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}.

  • Limit (Optional):
    Limit the rows to pull.

  • Auto Mapping: Rivery will determine your table / query metadata. You can choose the set of columns you want to fetch from the source, add a new column that Rivery did not detect, or leave the mapping as is. If you do not change the Auto Mapping, Rivery will fetch all of the columns in the table.
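
For illustration only, the mongosh sketch below shows the kind of chunked, filtered range query that an incremental pull on an UpdateTime attribute conceptually boils down to. The collection name (orders), field names, bounds, and limit are hypothetical; Rivery builds the actual queries for you based on the settings above.

// Illustrative mongosh sketch of one daily chunk of an incremental pull on UpdateTime,
// combined with the optional Filter and Limit described above (all names are hypothetical).
db.orders.find({
  UpdateTime: {
    $gte: ISODate("2024-01-01T00:00:00Z"),  // chunk start
    $lt:  ISODate("2024-01-02T00:00:00Z")   // chunk end (exclusive unless "Include End Value" is set)
  },
  datasource: "google",
  is_deleted: { $ne: true }
}).limit(100000)

// With an Epoch incremental type, the bounds are numeric instead, e.g.:
// db.orders.find({ UpdateTime: { $gte: 1704067200, $lt: 1704153600 } })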

Schema Configuration

  • Converting Entire Key Data As a STRING (Optional):
    In some cases, the data is not in the form the target expects, for example key names that start with numbers, or flexible and inconsistent object data. A possible way to overcome this is to convert such attributes to a STRING format by setting their data type in the mapping section to STRING.
    An example of this is the following key:value combination -
    {
      "name": "myname",
      "id": 1,
      "sites": {
        "www.google.com": {"count": 3},
        "www.yahoo.com": {"count": 4},
        "www.amazon.com": {"count": 4}
      }
    }
    The mapping, in this case, would have the following structure:

mongo_best_2.png

As seen above, the sites key has inconsistent, changing values, and the values themselves (the individual sites) are being mapped as columns. To deal with this situation, define the sites data type as STRING in the mapping and parse this column using an open JSON function on the target's side:

mongo_best_3.png

Note:

The conversion applies to every value under that key.
Arrays and objects will be converted to JSON strings.

Enhancing Data Preservation and Mapping

This applies only when employing the Legacy River mode and selecting Snowflake as the Target.

You can preserve the raw data of the collection when Snowflake is the target platform. When dealing with JSON objects or arrays that feature schemaless keys, which may change per row or include special characters or blanks, Snowflake users can leverage its built-in capability to query complex JSON structures using the Semi-Structured Query Mechanism. In such scenarios, it is advised to reclassify the entire column as a VARIANT column and then extract the necessary columns within Snowflake's console or through a Logic River using any of Snowflake's "Open JSON" functions.

To achieve this, activate the "Preserve Raw Data" toggle (located in the Target tab under file settings). This will automatically introduce an Array into the column mapping, which defaults to loading as a variant column. This approach helps maintain the row's granularity.

image.png

Please Note:
The "Preserve Raw Data" toggle is specifically available in the Legacy River mode; it's not supported in Change Streams.

The mapping outcome appears as shown below:
mongo_best_4.png

Additionally, you have the option to utilize the "expression" column mapping feature to extract a specific column from the raw JSON data. This extracted column can function as a key for the Upsert merge loading mode:

mongo_best_practice_key.png

Indexing Incremented Fields for Performance in MongoDB

When working with MongoDB and incremental fields, it's crucial to ensure that the incremented field is properly indexed. Failing to do so can lead to significant performance issues, especially when pulling large amounts of data. MongoDB does not automatically index "datetime" fields, so manual intervention is necessary for optimal performance.

Why Index Incremented Fields?

Indexes in MongoDB significantly speed up query performance by allowing the database to locate specific documents more efficiently. When you query by an incremental field that is not indexed, MongoDB has to scan the entire collection to find the relevant documents. This process can be time-consuming and resource-intensive, especially as the collection size grows.
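
As a quick way to see this in practice, you can run explain from the MongoDB shell against a range query on the incremental attribute; a winning plan whose stage is COLLSCAN (rather than IXSCAN) means the whole collection is being scanned. The collection and field names below are hypothetical:

// Inspect how MongoDB executes a range query on the incremental attribute
// (collection and field names are hypothetical).
db.orders.find({
  UpdateTime: { $gte: ISODate("2024-01-01T00:00:00Z") }
}).explain("executionStats").queryPlanner.winningPlan
// A "COLLSCAN" stage here indicates a full collection scan; "IXSCAN" indicates an index is used.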

Best Practices for Indexing Incremented Fields

  • Identify Incremented Fields: Determine which fields in your MongoDB documents are frequently incremented or used in queries involving increments.

  • Create Indexes: For each identified field, create an index using the createIndex method in MongoDB. Specify the field to be indexed and the desired sort order (e.g., ascending, descending); see the sketch after this list.

  • Regular Maintenance: Periodically check and optimize indexes to ensure they remain effective as data grows and usage patterns change. Use tools like MongoDB's explain method to analyze query performance and index usage.
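
As referenced in the list above, here is a minimal mongosh sketch of creating an ascending index on a hypothetical UpdateTime field and confirming the result:

// Create an ascending index on the incremental attribute
// (collection name, field name, and index name are hypothetical).
db.orders.createIndex({ UpdateTime: 1 }, { name: "updatetime_asc" })

// List existing indexes to confirm it was created.
db.orders.getIndexes()

// Re-running the explain() call shown earlier should now report an IXSCAN stage
// instead of COLLSCAN for range queries on UpdateTime.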

