Getting started with AWS Data Wrangler and Athena

# Install from PyPI (run this in a shell, not inside Python)
pip install awswrangler

import awswrangler as wr

# Quick sanity check: print the identity the library resolves via STS
name = wr.sts.get_current_identity_name()
arn = wr.sts.get_current_identity_arn()
print(name)
print(arn)

# Python lib path for a Glue Python shell job

If you are using a Glue Python shell job, make sure you add this Python library path every time you create a new job. The latest wheel release can be found on the awswrangler GitHub releases page.

s3://aws-glue-xxxxxxxxxxxx-us-east-1/glue_wheels/awswrangler-1.8.1-py3-none-any.whl
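
If you create the job programmatically rather than through the console, a minimal sketch with boto3 might look like the following; the job name, IAM role ARN, and script location here are placeholders you would replace with your own, and "--extra-py-files" is the argument that corresponds to the console's Python library path field.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="wrangler-python-shell-job",                      # placeholder job name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",   # placeholder IAM role
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/job.py", # placeholder script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Same setting as the "Python library path" field in the Glue console
        "--extra-py-files": "s3://aws-glue-xxxxxxxxxxxx-us-east-1/glue_wheels/awswrangler-1.8.1-py3-none-any.whl",
    },
    MaxCapacity=0.0625,  # smallest capacity allowed for Python shell jobs
)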

# AWS Data Wrangler: read data from Athena

import awswrangler as wr
import pandas as pd

# List the databases available in the Glue catalog, if needed:
# databases = wr.catalog.databases()
# print(databases)

# Run a query on Athena and get the result back as a pandas DataFrame
df = wr.athena.read_sql_query("""
SELECT *
FROM table_name limit 100
""", database="test_database")

# Reshape from wide to long, keeping col1, col2 and col3 as identifier columns
df = df.melt(id_vars=['col1', 'col2', 'col3'])

# AWS Data Wrangler: write data to Athena as a table

Using Data Wrangler you can read data of any type (CSV, Parquet, an Athena query result, etc.) from anywhere (locally or inside Glue) as a pandas DataFrame, then write it back to S3 as an object and register it as an Athena table in one step. For instance, the following snippet does exactly that:

res = wr.s3.to_parquet(
    df=df,
    path="s3://xxxxxxx/aws-wrangler-test",  # S3 path where you want to dump the data
    dataset=True,
    database="test_database",       # your database name
    table="table_test_wrangler",    # your new table name
    mode="overwrite",
    description="testing data",
)
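
Once the write finishes, the new table is queryable straight away. For example, you can read it back with the same read_sql_query call used earlier (same database and table names as in the snippet above):

check = wr.athena.read_sql_query(
    "SELECT * FROM table_test_wrangler LIMIT 5",
    database="test_database",
)
print(check.head())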

BONUS: Athena cache

When calling read_sql_query, instead of always executing the query, Data Wrangler can check whether the same query has recently been run and reuse the result of that previous execution. This is disabled by default and can be enabled by passing a max_cache_seconds value greater than 0.

df = wr.athena.read_sql_query(query, database="test_database", max_cache_seconds=900)

