Setting Up Connection for Databricks
Overview
This guide shows how to set up the credentials and configuration required to use Databricks with Rivery.
The recommended route for creating a Databricks connection in Rivery is to use the Databricks Partner Connect guide, which automatically creates a valid Databricks connection in Rivery.
What is Databricks?
Databricks is a managed service for processing and transforming data on a data lake. It is built on Delta Lake, Databricks' open-source solution for building, managing, and processing data using a Lakehouse architecture.
Requirements
Before using this guide, please make sure you have a valid Databricks account and workspace.
The requirements for setting up Databricks as a target in Rivery are:
- A valid Databricks account. If you don't have one, please create one here.
- A valid Databricks workspace. If you don't have a workspace in your Databricks account, please check out the next article.
If you or your organization already has a Databricks account and workspace, you can connect or create a user for Rivery in your account as described below.
SQL Endpoints
To use Databricks as a target, Rivery must run on an existing or new SQL Endpoint in Databricks.
Create a New SQL Endpoint
- Log in to your Databricks workspace account.
- Go to the SQL console.
- Click Endpoints, then + New SQL Endpoint at the top right.
- In the modal that opens, name your endpoint (for example, "RiverySQLEndpoint"), choose the appropriate Cluster Size, and set Auto Stop to at least 60 minutes of inactivity. Click Create.
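If you prefer to script this step, the endpoint can also be created through Databricks' REST API. The following is a minimal sketch, not a definitive recipe: it assumes the (legacy) SQL Endpoints API at /api/2.0/sql/endpoints/, a personal access token with sufficient permissions, and a Small cluster size; adjust to your workspace.
# Create a SQL endpoint named "RiverySQLEndpoint" (cluster size and auto-stop values are assumptions; adjust as needed)
curl -X POST https://<databricks-instance>/api/2.0/sql/endpoints/ \
  -H "Authorization: Bearer <personal-access-token>" \
  -d '{
    "name": "RiverySQLEndpoint",
    "cluster_size": "Small",
    "auto_stop_mins": 60
  }'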
Configure Data Access on SQL Endpoint
- After the SQL Endpoint is created, on the Endpoints screen, click Settings -> SQL Admin Console.
- In the SQL Admin Console, go to the SQL Endpoint Settings tab.
- In the Data Access Configuration textbox, paste the following configuration:
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl.disable.cache true
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
- Click Save.
- Under SQL Configuration Parameters, set the following configuration:
ANSI_MODE false
- Click Save.
- If you're using an Instance Profile, choose the one you want to use.
Get the SQL Endpoint Credentials
Rivery requires credentials for each SQL Endpoint configured in the Databricks console; each connection in Rivery typically represents a single SQL Endpoint.
- In the Databricks console, go to Endpoints and click the endpoint you created in the steps above.
- Go to Connection Details and copy the following parameters:
- HTTP Path
- Port
- Server Hostname
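These connection details can also be fetched programmatically. A sketch, assuming the same (legacy) SQL Endpoints API; <endpoint-id> is the endpoint's ID from its URL, and the odbc_params object in the response carries the hostname, port, and path:
# Fetch a single endpoint's details; look for "odbc_params" in the JSON response
curl -X GET https://<databricks-instance>/api/2.0/sql/endpoints/<endpoint-id> \
  -H "Authorization: Bearer <personal-access-token>"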
Create New Personal Access Token
Connecting to Databricks also requires creating a Personal Access Token for the user connected to your Databricks console. To create a personal access token:
- In the Databricks console, go to Settings -> User Settings.
- Go to the Personal Access Tokens tab and click the + Generate New Token button.
- In the modal that opens, name your token (for example, "Rivery") and set the expiration Lifetime long enough for consistent and reliable runs.
In our example, we've set the expiration lifetime to 1825 days (5 years).
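The token can also be generated with the Databricks Token API instead of the UI. A minimal sketch, assuming the 5-year lifetime from the example above (1825 days = 157,680,000 seconds); note that the token value is returned only once, in the create response:
# Create a personal access token labeled "Rivery" with a 5-year lifetime
curl -X POST -n https://<databricks-instance>/api/2.0/token/create \
  -d '{
    "comment": "Rivery",
    "lifetime_seconds": 157680000
  }'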
(Optional) Open Rivery IPs in your Databricks workspace
If an IP restriction is enforced in your Databricks workspace, you must allow Rivery's IPs for any operation from Rivery to run successfully.
To allow Rivery's IPs:
- See Databricks' documentation on IP access lists for the complete set of IP access list operations available in Databricks.
- Send the following POST request to your Databricks workspace API:
curl -X POST -n \
https://<databricks-instance>/api/2.0/ip-access-lists \
-d '{
"label": "Rivery",
"list_type": "ALLOW",
"ip_addresses": [
"52.14.86.20/32",
"13.58.140.165/32",
"52.14.192.86/32",
"34.254.56.182/32"
]
}'
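Note that IP access lists take effect only when the feature is enabled for the workspace. If the request above is rejected, a sketch based on Databricks' workspace-conf API can enable it first, and a GET request verifies the resulting lists:
# Enable the IP access list feature for the workspace (assumes admin permissions)
curl -X PATCH -n https://<databricks-instance>/api/2.0/workspace-conf \
  -d '{"enableIpAccessLists": "true"}'

# Verify the configured IP access lists
curl -X GET -n https://<databricks-instance>/api/2.0/ip-access-lists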
Create new Databricks Connection in Rivery
- Log in to your Rivery account.
- Go to Connections and click the + Create New Connection button.
- Search for Databricks and click it.
- Name the connection and set a Description.
- Under Server Hostname, paste the server hostname you copied from the endpoint above.
- Set the Port (443 by default).
- Paste the HTTP Path from the SQL Endpoint connection details above.
- Scroll down and paste the Personal Access Token you just created.
- Test the connection.
- Click Save.
Next Steps
- After creating a new Databricks connection in Rivery, you can create a Source to Target River that pushes data into your Databricks.
- Read about the Databricks Target configurations.