Databricks SQL as a Target
  • 4 Minutes to read
  • Dark
    Light
  • PDF

Databricks SQL as a Target

  • Dark
    Light
  • PDF

Overview

Databricks SQL is a managed service of Databricks for processing and transforming data upon datalake. Databricks SQL is based on Databricks' Delta Lake, an open source solution for building, managing and processing data using Lakehouse architecture.

This article will discuss on how to configure and what are the options on Databricks SQL as a target in a Data Source To Target River.

Requirements

  1. A valid Databricks account or workspace, please create one here.
  2. A valid Databricks SQL workspace. If you don't have any Databricks SQL workspace in your Databricks account, please check out the next article.
  3. A valid Databricks SQL connection. In order to configure one, please check out Setting Up a Connection for Databricks.

Databricks SQL as a target configurations

Databases and tables

Databricks SQL is a DWH based on databases.tables hierarchy. However, using Rivery, there's no need to create any table in advanced and then populate it. Rivery manages, creates (if not exists), populate and make the needed changes for your tables automatically.

Create new database

It is recommended to create new database on Databricks SQL that will dedicated to Rivery. In order to make it:

  1. On Databricks SQL console, go to Queries -> +New Query
    image.png
  2. On the query screen, paste and run the next SQL command:
CREATE DATABASE IF NOT EXISTS `rivery`
COMMENT 'Rivery Tables Database';

Managed Tables

Databricks SQL provides and recommend the ability to use managed tables by Databricks SQL engine. Rivery lays mostly on this ability, and manages, push and define table's data and metadata upon managed tables, under a managed database. Rivery creates and uses managed tables as a default.

External Tables

Databricks SQL also provides the ability to create, manage and populate external tables (as the "old" fashioned tables in SparkSQL). You may use an external location prefix for creating a Delta table under external location. In order to do that, you can check the Use External Location box and set the externalLocation as you like.
The location will be set as under the /{externalLocation}/{schema}/{table} path automatically.
image.png
For further reading about external delta tables, click here

Loading Mode

Learn more about Rivery loading to DWH process here.

Rivery provides 3 types of loading mode in Databricks SQL as a target.

Overwrite mode will create a whole new table in Databricks SQL database,

Append - This will use INSERT INTO the Target Table (matching on columns) and append any new records, keeping all existing records. If any change of metadata preformed during the loading, Rivery will CLONE the target table, preform any metadata changes above the new cloned table, and then clone the table back. This method provides maximum durability and avoid locks on currently used tables.

Upsert Merge - Preforms a merge of the new data with the target table's data by defined keys. Rivery uses in Databricks the Switch-Merge mode by default.

How does Rivery handle data loads and table metadata?

Rivery manages the tables metadata updates alongside with handling the data loading.
Here is the complete scheme of the loading process in Rivery database:
Databricks Mapping Columns

Creating the target database if not exists

Rivery checks and makes sure the database on the process exists in Databricks SQL workspace. Therefore, if the database doesn't exist, Rivery creates the database for you.

Your personal access token should include a sufficient permissions to create any new database in the system. If it isn't, it may fail the source to target process. Please ask your databricks account manager or admin to provid you a token that have a sufficient CREATE DATABASE permissions in Databricks SQL.

Rivery creates the database using the next clause:

CREATE DATABASE IF NOT EXISTS `<database>`
COMMENT 'Created By Rivery On 'YYYY-MM-DDTHH:mm:SS';

Default Database Name
If no database provided on the process, the default database will be default.

List Databases Created By Rivery
In order to make sure the database is created by Rivery, run the next query:

DESCRIBE DATABASE EXTENDED <database name>;

in the results, you should have the next description:

|----------------------------|---------------------------------|
|database_description_item   | database_description_value      |
|----------------------------|---------------------------------|
|Database Name               |            <database_name>      |
|Description                 | Created by Rivery at YYYY-MM-DD |
|----------------------------|---------------------------------|

Staging Table Loading

  1. Creating a table with tmp_ prefix with the correct metadata.
  2. Loading the data to the tmp_ table using COPY command. If the source data format is csv - creates the tmp_ using the fields in the mapping (typed-converted). If json - create a one-columed table with string data type.
  3. Flattening the data (if json type in the source) using Json Extract Functions and casting with cast([column] as [dataType]) as [alias] syntax.
  4. Auto-scaling Integer columns to the right Integer data type based on their data length (SMALLINT->INTEGER->BIGINT->NUMERIC).
  5. Dropping the old tmp_ table with the json_data field (if exists).

Load from STG Table to Target Table

In this stage - all of the following steps are performed in one transaction (Based on Postgres ability to run DDL and DML inside one transaction):

  1. Defining and updating any metadata changes on the target table, excluding dropping columns.
  2. If the target table does not exist - create it using CREATE AS SELECT clause.
  3. If append loading mode - clone the target table aside, and Add/Drop/Alter columns in the clone table.
  4. Load the data to the target table using the loading mode (as described above).
  5. Use DEEP CLONE to clone the tmp_ table into target table.
  6. Drop the tmp_ tables in the process.

Was this article helpful?