Setting Up Connection for Databricks SQL
  • 23 Jan 2022
  • 3 Minutes to read

Overview

This guide shows you how to set up the credentials and configurations required to use Databricks SQL with Rivery.

Recommendation

The recommended route for creating a Databricks SQL connection in Rivery is to use the Databricks Partner Connect guide, which automatically creates a valid Databricks SQL connection in Rivery.

What is Databricks SQL?

Databricks SQL is a managed Databricks service for processing and transforming data on top of a data lake. Databricks SQL is based on Databricks' Delta Lake, an open-source solution for building, managing, and processing data using a Lakehouse architecture.

Requirements

Before using this guide, make sure you have a valid Databricks workspace and account.
The requirements for setting up Databricks SQL as a target in Rivery are:

  1. A valid Databricks account or workspace; you can create one here.
  2. A valid Databricks SQL workspace. If you don't have a Databricks SQL workspace in your Databricks account, please check out the next article.

If you or your organization already has a Databricks account and workspace, you can connect or create a user for Rivery in your account as described below.

SQL Endpoints

To run Databricks SQL as a target, Rivery needs an existing or new SQL Endpoint in Databricks to run against. You can check which endpoints already exist through the REST API, as sketched below.
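
A minimal sketch of listing existing endpoints, assuming the legacy 2.0 SQL Endpoints REST path and a personal access token (newer workspaces expose the same objects as warehouses under /api/2.0/sql/warehouses):

# List the SQL endpoints that already exist in the workspace
# (assumes the legacy 2.0 SQL Endpoints API).
curl -s -X GET \
  -H "Authorization: Bearer <personal-access-token>" \
  https://<databricks-instance>/api/2.0/sql/endpoints/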

Create a New SQL Endpoint

  1. Log in to your Databricks workspace account.

  2. Go to the SQL console.

  3. Click Endpoints, and then + New SQL Endpoint at the top right.

  4. In the modal that opens, name your endpoint (for example, "RiverySQLEndpoint"), choose the right Cluster Size, and set Auto Stop to at least 60 minutes of inactivity. Click Create. The same endpoint can also be created via the REST API, as sketched after this list.
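
A minimal sketch of the equivalent REST call, assuming the legacy 2.0 SQL Endpoints API; the name, cluster size, and auto-stop values below are illustrative:

# Create a SQL endpoint (illustrative values; assumes the legacy
# 2.0 SQL Endpoints API).
curl -s -X POST \
  -H "Authorization: Bearer <personal-access-token>" \
  https://<databricks-instance>/api/2.0/sql/endpoints/ \
  -d '{
    "name": "RiverySQLEndpoint",
    "cluster_size": "Small",
    "auto_stop_mins": 60
  }'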

Configure Data Access on SQL Endpoint

  1. After the SQL Endpoint is created, on the Endpoints screen, click Settings -> SQL Admin Console.

  2. In the SQL Admin Console, go to the SQL Endpoint Settings tab.


  3. In the Data Access Configuration textbox, paste the following configuration:
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl.disable.cache true
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
  4. Click Save.

  5. Under SQL Configuration Parameters, set the following configuration:

ANSI_MODE false
  6. Click Save.
  7. If you use an Instance Profile, choose the one you want to use. (A way to read these settings back via the API is sketched after this list.)
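
To double-check that the settings were saved, the workspace-level SQL configuration can be read back over the REST API. This is a sketch under the assumption that your workspace exposes the legacy 2.0 SQL configuration path (newer workspaces use /api/2.0/sql/config/warehouses):

# Read back the data access configuration and SQL configuration
# parameters set in the SQL Admin Console (assumed legacy 2.0 path).
curl -s -X GET \
  -H "Authorization: Bearer <personal-access-token>" \
  https://<databricks-instance>/api/2.0/sql/config/endpoints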

Get the SQL Endpoint Credentials

Rivery requires credentials for each SQL Endpoint configured in the Databricks SQL console. Each connection in Rivery typically represents a single SQL Endpoint.

  1. In the Databricks SQL Console, go to Endpoints and click the endpoint you created in the steps above.

  2. Go to Connection Details and copy the following parameters:

    • HTTP Path
    • Port
    • Server Hostname

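The same details are available over the REST API: each endpoint object carries an odbc_params section with the hostname, port, and HTTP path. A minimal sketch, assuming the legacy 2.0 SQL Endpoints API, where <endpoint-id> is a placeholder for the ID shown in the endpoint's URL:

# Fetch a single endpoint; odbc_params holds the Server Hostname,
# Port, and HTTP Path that Rivery needs.
curl -s -X GET \
  -H "Authorization: Bearer <personal-access-token>" \
  https://<databricks-instance>/api/2.0/sql/endpoints/<endpoint-id>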

Create New Personal Access Token

Connecting to Databricks SQL also requires a Personal Access Token created with the user connected to your Databricks SQL console. To create a personal access token:

  1. In the Databricks SQL console, go to Settings -> User Settings.
  2. Go to the Personal Access Tokens tab and click the + Generate New Token button.


  3. In the modal that opens, name your token (for example, "Rivery") and set the expiration Lifetime long enough for consistent, reliable runs. A token can also be generated via the API, as sketched after this step.
    In our example, we've set the expiration lifetime to 1825 days (5 years).
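
A minimal sketch using the Databricks Token API (the 2.0 token/create endpoint); the lifetime mirrors the 1825-day example above, expressed in seconds:

# Create a personal access token valid for 1825 days (157,680,000
# seconds). The token_value field in the response is shown only
# once, so copy it somewhere safe.
curl -s -X POST \
  -H "Authorization: Bearer <existing-personal-access-token>" \
  https://<databricks-instance>/api/2.0/token/create \
  -d '{
    "lifetime_seconds": 157680000,
    "comment": "Rivery"
  }'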

(Optional) Open Rivery IPs in your Databricks workspace

If an IP restriction is enforced in your Databricks workspace, you must allow Rivery's IPs for any operation from Rivery to run successfully.
To allow Rivery's IPs:

  1. See Databricks' documentation for IP access lists to get the complete set of IP access list operations available in Databricks.
  2. Send the following POST request to your Databricks workspace API:
curl -X POST -n \
  https://<databricks-instance>/api/2.0/ip-access-lists \
  -d '{
    "label": "Rivery",
    "list_type": "ALLOW",
    "ip_addresses": [
        "52.14.86.20/32",
        "13.58.140.165/32",
        "52.14.192.86/32",
        "34.254.56.182/32"
      ]
    }'
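
Note that IP access lists are enforced only after the feature is enabled for the workspace. Per Databricks' IP access list documentation, the feature is toggled through the workspace-conf API:

# Enable the IP access list feature for the workspace (required
# before ALLOW/BLOCK lists take effect).
curl -X PATCH -n \
  https://<databricks-instance>/api/2.0/workspace-conf \
  -d '{ "enableIpAccessLists": "true" }'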

Create a New Databricks SQL Connection in Rivery

  1. Log in to your Rivery account.

  2. Go to Connections, and click the + Create New Connection button.

  3. Search for Databricks SQL and click on it.

  4. Name the connection and set the Description.

  5. Under Server Hostname, paste the server hostname you copied from the endpoint above.

  6. Set the Port (the default is 443).

  7. Paste the HTTP Path from the SQL Endpoint connection details above.

  8. Scroll down and paste the Personal Access Token you've just created.

  9. Test the connection (a quick command-line sanity check is sketched after this list).

  10. Click Save.

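Before testing in Rivery, you can sanity-check the hostname and token from the command line: any authenticated REST call should return HTTP 200 if the credentials are valid. A sketch, assuming the legacy 2.0 SQL Endpoints path:

# A valid Server Hostname and Personal Access Token should print 200.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer <personal-access-token>" \
  https://<server-hostname>/api/2.0/sql/endpoints/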

Next Steps

  • After creating a new Databricks SQL connection in Rivery, you can create a new Source to Target River that pushes data into your Databricks SQL.
  • Read about Databricks SQL Target configurations.
