Amazon Athena as a Target

This is a step-by-step tutorial for setting up Amazon Athena as a Target.

Amazon Athena as a Target is currently in Beta.

Overview

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Prerequisites

Create a Bucket

A bucket is an object container. To store data for Amazon Athena, you must first create a bucket in Amazon S3, specifying a bucket name and an AWS Region. Then you upload your data as objects to that bucket. Each object has a key (or key name) that serves as the object's unique identifier within the bucket.
Let's begin by logging into AWS and searching for Buckets:


Add a Policy

A bucket policy is a resource-based policy that allows you to grant access permissions to your bucket and the objects contained within it.
Now that you've created a bucket, let's create a policy to grant the necessary permissions:


Here's the policy's code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RiveryManageFZBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketCORS",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Sid": "RiveryManageFZObjects",
            "Effect": "Allow",
            "Action": [
                "s3:ReplicateObject",
                "s3:PutObject",
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:PutObjectVersionAcl",
                "s3:PutObjectAcl",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<athena_query_results_bucket_name>/*",
                "arn:aws:s3:::<athena_query_results_bucket_name>",
                "arn:aws:s3:::<athena_data_bucket_name>/*",
                "arn:aws:s3:::<athena_data_bucket_name>"
            ]
        },
        {
            "Sid": "AthenaAccess",
            "Effect": "Allow",
            "Action": [
                "athena:*",
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "RiveryHeadBucketsAndGetLists",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}
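Before pasting the policy above into the AWS console, the two angle-bracket placeholders must be replaced with your actual bucket names. A minimal sketch of doing that substitution programmatically (the bucket names and the trimmed-down template below are hypothetical examples, not the full policy):

```python
import json

# Trimmed-down, hypothetical version of the policy template above;
# in practice you would use the full policy from this article.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RiveryManageFZObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::<athena_query_results_bucket_name>/*",
                "arn:aws:s3:::<athena_data_bucket_name>/*"
            ]
        }
    ]
}
"""

def render_policy(template: str, results_bucket: str, data_bucket: str) -> dict:
    """Replace the angle-bracket placeholders and validate the result as JSON."""
    filled = (template
              .replace("<athena_query_results_bucket_name>", results_bucket)
              .replace("<athena_data_bucket_name>", data_bucket))
    return json.loads(filled)  # raises ValueError if the JSON is malformed

# Example bucket names -- replace with your own.
policy = render_policy(POLICY_TEMPLATE, "my-athena-results", "my-athena-data")
print(policy["Statement"][0]["Resource"][0])
```

Validating with `json.loads` catches copy-paste mistakes (such as an unbalanced brace) before the console rejects the policy.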

Rivery User in AWS

Now, in order to connect to Amazon Athena from the Rivery console (described in the following section), you must first create a Rivery user in AWS:


Create a Workgroup in Amazon Athena

Follow the Amazon Athena User Guide to create a Workgroup.

Please Note:
Make sure the Override client-side settings box is not checked.



Connection Procedure

Enter the Rivery Console, select Connections from the main menu on the left, and look up 'Athena'.

AWS Keys

  1. Type in the Connection Name.
  2. From the drop-down menu, choose your Region.
  3. Enter Workgroup.
  4. Select AWS Keys credentials type.
  5. Enter your AWS Access Key ID and Secret Access Key.
  6. Use the Test Connection function to verify that the connection works.
    If the connection succeeded, you can now use this connection in Rivery.


IAM Role - Automatic

  1. Type in the Connection Name.
  2. From the drop-down menu, choose your Region.
  3. Enter Workgroup.
  4. Select IAM Role - Automatic credentials type.
  5. To initiate the AWS CloudFormation Stack, click the Launch Stack button.
  6. Replace the External ID in the Parameters section with the one you were given in the Rivery console.


  7. Check 'I acknowledge that AWS CloudFormation may create IAM resources' in the Review tab, then click Create.
  8. Copy the value of 'RiveryAssumeRoleArn' from the Outputs tab in the stack.
  9. Paste it into the Role ARN field in the Rivery console.
  10. Use the Test Connection function to verify that the connection works.
    If the connection succeeded, you can now use this connection in Rivery.

IAM Role - Manual

  1. Type in the Connection Name.
  2. From the drop-down menu, choose your Region.
  3. Enter Workgroup.
  4. Select IAM Role - Manual credentials type.
  5. Open the AWS IAM console.
  6. Click Policies on the side menu, and select Create Policy.
    a. Navigate to the JSON tab.
    b. Copy the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RiveryManageFZBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketCORS",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Sid": "RiveryManageFZObjects",
            "Effect": "Allow",
            "Action": [
                "s3:ReplicateObject",
                "s3:PutObject",
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:PutObjectVersionAcl",
                "s3:PutObjectAcl",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<athena_query_results_bucket_name>/*",
                "arn:aws:s3:::<athena_query_results_bucket_name>",
                "arn:aws:s3:::<athena_data_bucket_name>/*",
                "arn:aws:s3:::<athena_data_bucket_name>"
            ]
        },
        {
            "Sid": "AthenaAccess",
            "Effect": "Allow",
            "Action": [
                "athena:*",
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "RiveryHeadBucketsAndGetLists",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}

c. Paste the policy into the JSON editor, then click Review Policy.

  7. Name the policy 'Rivery-Athena-Policy' and click Create Policy.
  8. Click Roles on the side menu, and select Create Role.
  9. Select Another AWS Account and change the Account ID to the one you were given in the Rivery console.
  10. Check Require External ID, and set the External ID to the one you were given in the Rivery console.


  11. Click Next.
  12. Attach the 'Rivery-Athena-Policy' in the Attach Policy form.
  13. Set 'Rivery-Athena-Role' as the role name.
  14. Copy the Role ARN from the role's window and paste it into the Role ARN field in the Rivery console.
  15. Use the Test Connection function to verify that the connection works.
    If the connection succeeded, you can now use this connection in Rivery.
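Selecting Another AWS Account and Require External ID produces a role trust policy. A minimal sketch of that trust policy, built in Python (the account ID and external ID below are hypothetical placeholders; use the values shown in the Rivery console):

```python
import json

# Hypothetical values -- substitute the Account ID and External ID
# you were given in the Rivery console.
RIVERY_ACCOUNT_ID = "123456789012"
EXTERNAL_ID = "my-external-id"

def build_trust_policy(account_id: str, external_id: str) -> dict:
    """Trust policy equivalent to 'Another AWS account' + 'Require external ID'."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }

print(json.dumps(build_trust_policy(RIVERY_ACCOUNT_ID, EXTERNAL_ID), indent=4))
```

The external ID condition ensures that only a caller presenting the ID from your Rivery console can assume the role.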


Configuration Processes

Source to Target River

  1. Now that you've established a Connection, enter your Connection Name.
  2. Select your Target Connection from the drop-down list.
  3. Input the File Name and the FileZone path from your S3 storage.


  4. Enter the Bucket Name.
  5. Click the curved arrow next to Schema on the right side of the row. After the refresh is complete, click the row and choose the Schema where the data will be stored.
  6. Enter the Table Name.
  7. Set the Loading Mode.

Loading Modes:
Upsert Merge - Based on the key columns you've selected, the system will match rows and decide whether to replace matching rows, keep unmatched rows, or add new ones.
This mode is recommended for continuous merge runs because it does not retain duplicates and creates an index for better performance. Upsert Merge includes two merge techniques.

Append Only - This will append any new records while keeping all of the current records, using INSERT INTO the Target Table (matching on columns).

Overwrite - This will TRUNCATE the old records and INSERT the new records in the target table.
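The row-level semantics of the three loading modes can be sketched on in-memory rows. This is an illustration of the behavior described above, not Rivery's implementation:

```python
# Illustrative sketch of the three loading modes on lists of row dicts.

def upsert_merge(target, incoming, key):
    """Replace matching rows, keep unmatched rows, add new ones."""
    merged = {row[key]: row for row in target}
    merged.update({row[key]: row for row in incoming})
    return list(merged.values())

def append_only(target, incoming):
    """INSERT INTO semantics: keep all current records, append the new ones."""
    return target + incoming

def overwrite(target, incoming):
    """TRUNCATE then INSERT: only the new records remain."""
    return incoming

target = [{"id": 1, "v": "old"}, {"id": 2, "v": "old"}]
incoming = [{"id": 2, "v": "new"}, {"id": 3, "v": "new"}]

# Upsert Merge: id 1 kept, id 2 replaced, id 3 added.
print(upsert_merge(target, incoming, "id"))
```

Note how Upsert Merge avoids the duplicate `id 2` row that Append Only would produce, which is why it is the recommended mode for continuous merge runs.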


  8. Set the period partition timeframe.


  9. Choose the type of file you're using.


  10. Set the period partition timeframe for a FileZone folder.
    Note: You can instruct Rivery to divide the data according to the data insertion day, the Day/Hour, or the Day/Hour/Minute. This means that Rivery will produce data files from your sources under folders that correspond to the correct partition you've selected.
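The folder layout described in the note above can be sketched as follows. The folder-name formats here are illustrative assumptions, not Rivery's exact layout:

```python
from datetime import datetime

# Hypothetical folder formats for each partition granularity.
PARTITION_FORMATS = {
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d/%H",
    "minute": "%Y-%m-%d/%H/%M",
}

def partition_folder(prefix: str, inserted_at: datetime, granularity: str) -> str:
    """Build the FileZone folder path for a data file inserted at a given time."""
    return f"{prefix}/{inserted_at.strftime(PARTITION_FORMATS[granularity])}"

ts = datetime(2024, 5, 1, 13, 45)
print(partition_folder("filezone/orders", ts, "day"))     # -> filezone/orders/2024-05-01
print(partition_folder("filezone/orders", ts, "hour"))    # -> filezone/orders/2024-05-01/13
print(partition_folder("filezone/orders", ts, "minute"))  # -> filezone/orders/2024-05-01/13/45
```

A finer granularity means more, smaller folders; choose the one that matches how often your sources produce data.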

Logic River

  1. Now that you've established a Connection, enter your Connection Name.
  2. In the Source section, type your SQL query or SQL script.

Please Note:

  • The SQL script is not encrypted in any way. Please avoid using credentials in the script.
  • The SQL script runs according to user configurations. Users are responsible for any changes to the table, schema, or data that occur due to the script.


  3. In the Target section, select Table or Variable.
  • If you selected 'Variable', select the variable's type and then, from the drop-down list, choose the variable you want to use as a Target.
  • If you selected 'Table', move on to the next step.
  4. Click the Bucket Name section and pick the bucket you want to load into from the drop-down list.
  5. Enter the FileZone path from your S3 storage.


  6. Click the Schema section and select the schema you want from the drop-down list.
  7. Type in your Table Name.
  8. Set the Loading Mode.

Loading Modes:
Upsert Merge - Based on the key columns you've selected, the system will match rows and decide whether to replace matching rows, keep unmatched rows, or add new ones.
This mode is recommended for continuous merge runs because it does not retain duplicates and creates an index for better performance. Upsert Merge includes two merge techniques.

Append Only - This will append any new records while keeping all of the current records, using INSERT INTO the Target Table (matching on columns).

Overwrite - This will TRUNCATE the old records and INSERT the new records in the target table.


  9. Set the period Partition Type and Granularity for a bucket.

Please Note:
A new column with the selected granularity value will be added to the target table in Athena. Here's an example:
Assume you want to partition the data in your mapping with a granularity of one day, based on the "order_date" column, which is of type DATE. After a successful River run on the selected Target table, you should see in the Athena console a new column named "order_date_day" holding the day value from "order_date". The granularity of this partitioned column will be just the day.
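The derived partition column from this example can be sketched as a small helper. The naming convention follows the "order_date_day" example above; the helper itself is illustrative, not Rivery's code:

```python
from datetime import date

def partition_column(column_name: str, value: date, granularity: str = "day"):
    """Derive the added partition column's name and value for a DATE column."""
    name = f"{column_name}_{granularity}"
    if granularity == "day":
        return name, value.isoformat()
    raise ValueError(f"unsupported granularity in this sketch: {granularity}")

print(partition_column("order_date", date(2024, 5, 1)))
# -> ('order_date_day', '2024-05-01')
```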

  10. Now simply click the Column Mapping button.
  11. Configure the Columns to your preferences as they will appear in your Target.


Note:

  • To add a new field to your Target, select +Add Field.
  • If Bucket is selected next to a particular field, you can specify the number of buckets for that field by clicking Save and entering a Number of Buckets in the newly created section of the Target form.
    The range of defined buckets is 2 to 100.
  • Only the Date and Timestamp fields can be partitioned, and the partition checkbox must be selected in order to do so.
  12. Click Save.
  13. You can now select Run to run your River.

Type Mapping

We match Amazon Athena data types to Rivery data types while extracting your data. If a data type is not supported, we will map it as a String type.

The following table shows the mapping of Amazon Athena data types to Rivery-compatible types:

Amazon Athena Type    Rivery Type
STRING                string
TIMESTAMP             timestamp
DATETIME              date
BOOLEAN               boolean
BIT                   int
FLOAT                 float
INTEGER               int
INT                   int
BIGINT                bigint
SMALLINT              smallint
DATE                  date
TIME                  timestamp
RECORD                string
ARRAY                 string
VARCHAR               string
TEXT                  string
VARIANT               string
NVARCHAR(MAX)         string
NVARCHAR              string
TINYINT               tinyint
DATETIME2             timestamp
DATETIMEOFFSET        timestamp
REAL                  float
SMALLDATETIME         timestamp
UNIQUEIDENTIFIER      string
BINARY                binary
NUMERIC               float
CHAR                  char
DECIMAL               decimal
NCHAR                 string
VARBINARY             string
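The mapping rule, including the fall-back to String for unsupported types, can be sketched as a lookup. The dictionary below is a subset of the table above for brevity:

```python
# Subset of the type-mapping table above; unrecognized types fall back to string.
ATHENA_TO_RIVERY = {
    "STRING": "string", "TIMESTAMP": "timestamp", "DATETIME": "date",
    "BOOLEAN": "boolean", "BIT": "int", "FLOAT": "float", "INTEGER": "int",
    "INT": "int", "BIGINT": "bigint", "SMALLINT": "smallint", "DATE": "date",
    "TIME": "timestamp", "TINYINT": "tinyint", "REAL": "float",
    "NUMERIC": "float", "CHAR": "char", "DECIMAL": "decimal",
    "BINARY": "binary",
}

def rivery_type(athena_type: str) -> str:
    """Map an Athena type to its Rivery type; unsupported types become string."""
    return ATHENA_TO_RIVERY.get(athena_type.upper(), "string")

print(rivery_type("BIGINT"))  # -> bigint
print(rivery_type("STRUCT"))  # -> string (unsupported, falls back)
```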
