Configure Custom FileZone for Databricks SQL



FileZone Types
  • S3 (IAM role connection type only)

Creating a custom FileZone in Rivery lets you manage your data in your own S3 service, in your own AWS account, as a staging area prior to loading data into Databricks SQL.

Your data is retained for at least 24 hours in your S3 bucket. You can also use the FileZone bucket or its objects as a base for other Hadoop or Spark operations via Amazon EMR, or for your other services.

Before you use this guide, please make sure you’ve signed up for AWS and you have a console admin user.

If you don’t have one of these prerequisites, you can start here.

Rivery needs an S3 bucket to serve as a FileZone before your data is loaded into Databricks SQL.

Note: You can find the up-to-date documentation of S3 operations and getting started here.

Create an S3 Bucket

  1. Go to S3 Management in the AWS Console.

  2. Click on Create Bucket.

  3. Give the bucket a name, and choose the same region your Databricks SQL workspace is in (in most cases, US East (N. Virginia)).
    Use the S3 wizard defaults by reviewing and following the wizard screens, and click on Create Bucket.
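If you prefer the AWS CLI to the console wizard, the same bucket can be created from the command line. This is a sketch; the bucket name and region below are placeholders, not values supplied by Rivery.

```shell
# Create the FileZone bucket (bucket name and region are placeholders).
# Buckets in us-east-1 omit the location constraint.
aws s3api create-bucket \
  --bucket my-rivery-filezone \
  --region us-east-1

# For any other region, add the location constraint, e.g.:
# aws s3api create-bucket \
#   --bucket my-rivery-filezone \
#   --region eu-west-1 \
#   --create-bucket-configuration LocationConstraint=eu-west-1
```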


Configure custom FileZone for Databricks in Rivery

Rivery uploads your source data to an Amazon S3 bucket and then pushes that data to Databricks SQL. Databricks SQL uses COPY INTO with an Assume Role mechanism on AWS. Therefore, you need to create a role in AWS that has permissions on the relevant bucket and grants the Rivery AWS account permission to access the bucket. Creating an AWS role is mandatory in this case in order to connect Databricks SQL with a custom FileZone.
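For the manual setup, the role and its bucket permissions can be created with the AWS CLI along these lines. This is a sketch: RIVERY_ACCOUNT_ID and EXTERNAL_ID are placeholders you must replace with the real values shown in the Rivery connection screen, and the bucket, role, and policy names are example names.

```shell
# Trust policy: lets the Rivery AWS account assume this role.
# RIVERY_ACCOUNT_ID and EXTERNAL_ID are placeholders - copy the
# actual values from the Rivery connection wizard.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::RIVERY_ACCOUNT_ID:root" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "EXTERNAL_ID" } }
  }]
}
EOF

aws iam create-role \
  --role-name rivery-filezone-role \
  --assume-role-policy-document file://trust-policy.json

# Permissions policy: read/write access to the FileZone bucket.
cat > s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
      "s3:ListBucket", "s3:GetBucketLocation"
    ],
    "Resource": [
      "arn:aws:s3:::my-rivery-filezone",
      "arn:aws:s3:::my-rivery-filezone/*"
    ]
  }]
}
EOF

aws iam put-role-policy \
  --role-name rivery-filezone-role \
  --policy-name rivery-filezone-access \
  --policy-document file://s3-policy.json
```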

  1. Open your Databricks SQL Connection, by going to Connections->Create New Connection, and choose Databricks SQL.

  2. In the connection, check the Custom File Zone checkbox.

  3. Choose an existing custom FileZone connection, or create a new one.

  4. Choose the region your bucket is configured in.

  5. Under the Credential Type choose one of IAM Role - Automatic or IAM Role - Manual.

  6. Follow the instructions for creating the IAM role for Rivery.

  7. Name your S3 File Zone Connection and Save.

  8. Now you can test your connection.

  9. After saving, choose the default bucket for your FileZone area.
    Use the bucket you created above.

  10. Save the Databricks connection.
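As an optional sanity check before testing the connection in Rivery, you can confirm from your own AWS admin user that the bucket and the role exist (the bucket, role, and policy names below are the same example placeholders used earlier, not values defined by Rivery):

```shell
# Confirm the FileZone bucket is reachable.
aws s3 ls s3://my-rivery-filezone

# Confirm the role and its inline bucket policy exist.
aws iam get-role --role-name rivery-filezone-role
aws iam get-role-policy --role-name rivery-filezone-role \
  --policy-name rivery-filezone-access
```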
