Implementing a Table Create & View Update in Athena using AWS Lambda
2022 February 01

I have a DynamoDB table that is automatically exported as JSON to compute some reports. I wanted to automate the table creation and load steps. Amazon Athena can query raw files stored on S3, which makes it a good fit when a full database would be too expensive to run because the reports are only needed a small percentage of the time, or when a full database simply isn't required. It's billed by the amount of data scanned, which makes it relatively cheap for my use case.
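
For context, the export itself can be kicked off with boto3's point-in-time export API. A minimal sketch follows; the table ARN, bucket, and prefix are placeholders, not values from my setup:

import boto3

def start_dynamodb_export(table_arn, bucket, prefix):
    # Minimal sketch: the table must have point-in-time recovery enabled,
    # and the ARN, bucket, and prefix here are placeholders.
    dynamodb = boto3.client('dynamodb')
    response = dynamodb.export_table_to_point_in_time(
        TableArn=table_arn,
        S3Bucket=bucket,
        S3Prefix=prefix,
        ExportFormat='DYNAMODB_JSON',  # gzipped JSON, one Item per line
    )
    # The export runs asynchronously and the data lands under
    # <prefix>/AWSDynamoDB/<export id>/data/ when it finishes.
    return response['ExportDescription']['ExportArn']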

I initially used AWS Glue to create the underlying export table schema. I then copied the schema definition it generated and edited it into a format string that takes some parameters.

def get_create_table_ddl(exportName, s3path):
    # DDL copied from the Glue-generated schema, parameterized by the
    # export table name and the S3 path of the DynamoDB export.
    CREATE_TABLE_DDL = f"""
        CREATE EXTERNAL TABLE `sampledb`.`{exportName}`(
            `item` struct<the_url:struct<s:string>,as_of_when:struct<s:string>,user_hashes:struct<ss:array<string>>,user_id_hashes:struct<ss:array<string>>,accumulated_count:struct<n:string>,today_count:struct<n:string>,last_hit_at:struct<s:string>> COMMENT 'from deserializer')
        ROW FORMAT SERDE
            'org.openx.data.jsonserde.JsonSerDe'
        WITH SERDEPROPERTIES (
            'paths'='Item')
        STORED AS INPUTFORMAT
            'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT
            'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
        LOCATION
            '{s3path}/data/'
        TBLPROPERTIES (
            'classification'='json',
            'compressionType'='gzip',
            'typeOfData'='file')
    """
    return CREATE_TABLE_DDL

One might think that it's possible to SQL inject this, and one would be right if there were any user-accessible inputs, but all inputs come from trusted upstream sources, like S3 events and information derived from the DynamoDB export metadata.

The next step is to use start_query_execution from boto3 to send the various queries, and get_query_execution to check on the status of the execution. It's important to note that start_query_execution is an asynchronous API, so we need to poll to determine when it has completed. Generally a DDL query is relatively quick (typically 1 to 2 seconds), so it's okay to poll inside a Lambda function.

We are going to create two objects: one is the actual export table over the raw items, and the other is a view with some cleaned-up results.
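
The view isn't generated by Glue, so its DDL is built the same way as the table DDL and issued through the same code path. Here is a rough sketch of what it can look like; the view name hit_counts_latest and the exact columns pulled out of the item struct are illustrative, not my exact query:

def get_create_view_ddl(exportName):
    # Sketch of the view DDL. The view name and selected columns are
    # illustrative. Athena views use Presto SQL, so struct fields are
    # reached with dot notation and numeric strings are cast explicitly.
    CREATE_VIEW_DDL = f"""
        CREATE OR REPLACE VIEW "sampledb"."hit_counts_latest" AS
        SELECT
            item.the_url.s AS the_url,
            item.as_of_when.s AS as_of_when,
            CAST(item.accumulated_count.n AS bigint) AS accumulated_count,
            CAST(item.today_count.n AS bigint) AS today_count,
            cardinality(item.user_hashes.ss) AS unique_users,
            item.last_hit_at.s AS last_hit_at
        FROM "sampledb"."{exportName}"
    """
    return CREATE_VIEW_DDL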

import time

import boto3

def issue_athena_query(sql, athena):
    # start_query_execution is asynchronous; it only returns a
    # QueryExecutionId, not the result of the query itself.
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={
            'Database': 'sampledb',
            'Catalog': 'AWSDataCatalog'
        },
        ResultConfiguration={
            'OutputLocation': '${YOUR S3 BUCKET PATH}',
        },
        WorkGroup='primary'
    )
    # Give Athena a moment before the caller starts polling.
    time.sleep(0.4)
    return response

def create_athena_table_for_export(exportName, s3path):
    athena = boto3.client('athena')
    ddl = get_create_table_ddl(exportName, s3path)
    response = issue_athena_query(ddl, athena)
    execId = response['QueryExecutionId']
    response = athena.get_query_execution(
        QueryExecutionId=execId
    )
    print('export table', response)

    # Poll until the query leaves the QUEUED/RUNNING states.
    while response['QueryExecution']['Status']['State'] in ('QUEUED', 'RUNNING'):
        response = athena.get_query_execution(
            QueryExecutionId=execId
        )
        print("checking status of export table, ", response)
        time.sleep(0.1)

    # Fail loudly if the DDL did not succeed.
    if response['QueryExecution']['Status']['State'] == 'FAILED':
        raise ValueError(response)

The above code issues a single Athena query and checks the status of the execution with a short wait between polls. It raises an exception so that the Lambda function fails and the error shows up in CloudWatch, where various alarms can fire. You'll also notice the lack of a proper Lambda handler, because this is tailored to my use case. Yours will probably look a little different from mine, as I use S3 events from the DynamoDB export to trigger the view updates.
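
A rough sketch of what such an S3-triggered handler could look like is below. The event parsing, the derived names, and the final view refresh are assumptions about the standard export layout rather than an exact drop-in:

def trigger_view_update_handler(event, context):
    # Sketch only: assumes the S3 notification fires on objects under the
    # standard DynamoDB export layout <prefix>/AWSDynamoDB/<export id>/...
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        prefix, _, rest = key.partition('/AWSDynamoDB/')
        export_id = rest.split('/')[0]
        # Derive a table name and the export root from the object key.
        export_name = 'export_' + export_id.replace('-', '_')
        s3path = f's3://{bucket}/{prefix}/AWSDynamoDB/{export_id}'
        create_athena_table_for_export(export_name, s3path)
        # Re-point the view at the freshly created export table.
        athena = boto3.client('athena')
        issue_athena_query(get_create_view_ddl(export_name), athena)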

The next step is to configure the CloudFormation for creating the Lambda function and granting the proper permissions. The permissions for Athena are a bit weird, as we need to touch a lot of different services. We need to allow GETs and PUTs on the results location in S3, and a couple of Glue actions must also be named. I like to declare these permissions inline in my CloudFormation instead of using a full-blown policy object.

TriggerViewUpdateFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: "./code.zip"
      Runtime: python3.8
      Handler: hit_count.trigger_view_update_handler
      Environment:
        Variables:
          AWS_EXECUTION_ENV: 'TRUE'
      Policies:
      - Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - 's3:GetBucketLocation'
            Resource:
              - 'arn:aws:s3:::*'
          - Effect: Allow
            Action:
              - 's3:PutObject'
              - 's3:GetObject'
              - 'glue:GetTable'
              - 'glue:GetDatabase'
              - 'glue:GetDatabases'
              - 'glue:CreateTable'
              - 'glue:UpdateTable'
              - 'athena:StartQueryExecution'
              - 'athena:GetWorkGroup'
              - 'athena:GetQueryExecution'
            Resource:
              - 'arn:aws:s3:::${S3_OBJECT_PATH}/*'
              - 'arn:aws:glue:us-east-1:${YOUR_AWS_ACCOUNT_ID}:catalog'
              - 'arn:aws:glue:us-east-1:${YOUR_AWS_ACCOUNT_ID}:database/sampledb'
              - 'arn:aws:glue:us-east-1:${YOUR_AWS_ACCOUNT_ID}:table/sampledb/${UNDERLYING_EXPORT_PREFIX}*'
              - 'arn:aws:glue:us-east-1:${YOUR_AWS_ACCOUNT_ID}:table/sampledb/${YOUR_VIEW_NAME}'
              - ${ARN_OF_PRIMARY_OR_PREFERRED_ATHENA_WORKGROUP}

All permissions were determined through trial and error to get it to work. You'll notice that the ARNs have been masked to prevent leaking sensitive information, and there are a couple of wildcards. The masked values are things like AWS account IDs and where you want the Athena results to live. It's important to note that if the underlying export tables need to be protected, the IAM policy statements should be split into create and update portions. Although if your view does not already exist, you may need to include the create permission in the update portion.

The ARN of an Athena workgroup that will process the queries must be specified for athena:StartQueryExecution, otherwise the start_query_execution calls will not work. The region part of the ARN must be a wildcard. The s3:GetBucketLocation permission is required for Athena to determine what region a given bucket lives in and how it wants to fetch the required files. That permission needs to live in a separate IAM statement because it's allowed to operate on any S3 resource, and might need to. To lock it down, specify a specific S3 bucket instead of the wildcard.

I hope you found this technical tutorial helpful. These are mostly my notes on how to make this work. You might need to go through additional steps depending on whether the account has various permissions and restrictions configured.

UPDATE Feb 9, 2022: Not all required permissions were listed originally. The post has been updated with the additional permissions and details on the required resources.

