AWS Athena Top 10 Performance Tuning Tips — Athena Engine 2 (2021)

Dheeraj Inampudi
Jun 18, 2021 · 9 min read

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Athena engine version 2 is based on Presto 0.217. More details about all the improvements are given in the AWS release notes.

I don’t want to get into the details, as the intent is NOT to introduce Athena but to talk about the Athena engine version 2 release, along with tips and tricks for adapting to the changes by following AWS best practices.

We at propellor.ai use Athena for various use cases, such as querying CloudWatch logs, CloudTrail logs, VPC Flow Logs, etc., partitioned and stored in S3 buckets. This article follows in the footsteps of the 2017 article by Mert Hocanin (one of the best articles on Athena) and tries to improve on top of it. I am writing it with the same title and style, but with the intention of bringing an Athena engine version 2 flavour to it.

So, let’s get started.

Top 10 Tips are as follows:

  1. Partition your data
  2. Bucket your data
  3. Use Compression
  4. Optimize file sizes
  5. Optimize columnar data store generation
  6. Optimize ORDER BY
  7. Optimize joins
  8. Optimize GROUP BY
  9. Use approximate functions
  10. Only include the columns that you need

1. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns.
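
To make partitioning concrete, here is a minimal sketch of a date-partitioned table and a query that prunes partitions. The table, bucket, and column names are hypothetical.

```sql
-- Hypothetical table partitioned by date (dt); names are for illustration.
CREATE EXTERNAL TABLE access_logs (
  request_id string,
  user_id    string,
  status     int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/access_logs/';

-- Filtering on the partition column lets Athena scan only the S3 prefixes
-- of the matching partitions instead of the whole table.
SELECT status, count(*) AS hits
FROM access_logs
WHERE dt BETWEEN '2021-06-01' AND '2021-06-07'
GROUP BY status;
```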

Improvement 1: Schema evolution — Support has been added for schema evolution for data in Parquet format. Schema evolution allows you to keep track of changes and lets the Athena query engine adapt to new schema changes instantaneously. Check out the following version 0 and version 1 Parquet Glue table screenshots from the AWS Wrangler official documentation.

Schema Version 0 on Glue Catalog
Schema Version 1 on Glue Catalog

Improvement 2: Tables — Tables that have unsupported partition types can no longer be created. This matters because when a junior developer created a derived table with unsupported partition types, engine version 1 used to allow it, which caused issues when querying those tables. We developed a mechanism to mitigate that, but it is a nice touch to see Athena adding this check directly to the engine.

Improvement 3: Window functions — Improved performance of queries containing window functions with identical PARTITION BY and ORDER BY clauses.

Improvement 4: Window functions — Improved performance of certain window functions (for example, LAG) that have similar specifications.

2. Bucket your data

Improvement 1: Writing to S3 — Improved performance for writing to bucketed tables when the data being written is already partitioned appropriately (for example, when the output is from a bucketed join).
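
For reference, bucketing in Athena is typically set up through CTAS. Below is a minimal sketch, with hypothetical table and column names.

```sql
-- Hypothetical CTAS that buckets output rows by user_id into 16 buckets.
CREATE TABLE access_logs_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://my-example-bucket/access_logs_bucketed/',
  bucketed_by  = ARRAY['user_id'],
  bucket_count = 16
) AS
SELECT request_id, user_id, status
FROM access_logs;
```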

Although AWS claims it is a performance improvement, I believe it is a bug fix, as stated below in the official Presto release statement. We often use partition_by rather than bucketing and bucketed joins, so for now there is no way for us to test it. We will perform our own benchmarking test and update this article later.

Hive Bucketed Table Fixes (Presto 0.254.1)
For queries over Hive bucketed tables, Presto will attempt to limit scans to the buckets that could possibly contain rows that match the WHERE clause. Unfortunately, the algorithm we were using to select the buckets was not correct, and sometimes we would either select the wrong files or fail to select any files. We have aligned the algorithm with Hive and now the optimization works as expected.

We have also improved the algorithm for detecting tables that are not properly bucketed. It is common for tables to declare bucketing in the Hive metadata, but not actually be bucketed in HDFS. When Presto detects this case, it falls back to a full scan of the partition. Not only does this change make bucketing safer, but it makes it easier to migrate a table to use bucketing without rewriting all of the data.

3. Use Compression

Compressing your data can speed up your queries significantly. There are no improvements in this area, although we believe an improvement to gzip decompression would be an added advantage.

Here is a table from the 2017 article that compares the algorithms:

Common compression formats and their attributes

Athena engine version 1 can cause records to be silently dropped when the BZIP2 format is used. For this reason, use of the BZIP2 format in Athena engine version 1 is not recommended. Engine version 2 no longer has this issue.

We use and recommend Apache Parquet with Snappy compression for most of our use cases. Make sure you select the right option. Always remember:

A high compression ratio DOES NOT mean it is a best practice; it just means less data scanned (so less cost) but MORE decompression time.
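
As an illustration, here is a minimal CTAS sketch that writes Snappy-compressed Parquet; the table names and S3 location are hypothetical.

```sql
-- Hypothetical CTAS converting a raw table to Snappy-compressed Parquet.
CREATE TABLE access_logs_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/access_logs_parquet/'
) AS
SELECT *
FROM access_logs_raw;
```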

4. Optimize file sizes

As explained in the original article, if your objects are too small (<128 MB), the Athena engine might spend additional time on the overhead of

  • opening Amazon S3 objects,
  • listing prefix directories,
  • getting object metadata,
  • setting up data transfer,
  • reading file headers,
  • reading compression dictionaries, and so on.

On the other hand, if your file is not splittable and the files are too large, the query processing waits until a single reader has completed reading the entire file. That can reduce parallelism.

Sufficiently large files mean better performance.

You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost. You can follow the “Aggregate files based on a pattern” example in the S3DistCp documentation to perform this operation.
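
If you would rather stay inside Athena than run S3DistCp on an EMR cluster, a CTAS statement can also rewrite many small files into fewer, larger ones. The sketch below shows that alternative with hypothetical names; it is not the S3DistCp approach itself.

```sql
-- Hypothetical compaction: re-materialize a small-file dataset so Athena
-- writes it back out as a smaller number of larger Parquet files.
CREATE TABLE access_logs_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-example-bucket/access_logs_compacted/'
) AS
SELECT *
FROM access_logs_small_files;
```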

5. Optimize columnar data store generation

Unnesting struct data types with Parquet and ORC

Improvement 1: Nested schema — Support has been added for reading nested schemas, which reduces cost.

Improvement 2: Predicate pushdown — Query planning performance

In simple language, predicate pushdown (predicate filtering) is a technique used to skip reading unnecessary data at the storage level. It is part of query planning, as it decreases the amount of data the Athena engine scans from S3 at query time.

Both the Parquet and ORC formats support predicate pushdown. Operations like casting and filtering are the most commonly used predicate operations.

The following predicate pushdowns are brought in with engine version 2. Since we often use them in our Glue jobs when we partition the data and store it in S3, I am planning to press on this topic in detail in a separate article. For now, these are the performance improvements added in the new release:

  • Predicate evaluations,
  • Predicate pushdown support for casting,
  • Predicate inference and pushdown
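
To make predicate pushdown concrete, the query below uses a filter that the Parquet/ORC reader can evaluate against row-group statistics and skip non-matching data entirely; table and column names are hypothetical.

```sql
-- Parquet/ORC store min/max statistics per row group, so the reader can
-- skip row groups where no value can possibly satisfy the predicate.
SELECT request_id, status
FROM access_logs
WHERE status = 500            -- pushed down to the Parquet reader
  AND dt = '2021-06-18';      -- partition pruning on top of that
```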

Query tuning

Athena uses Presto under the covers. Understanding how Presto works provides insight into how you can optimize queries when running them.

This section details the following best practices:

  6. Optimize ORDER BY
  7. Optimize joins
  8. Optimize GROUP BY
  9. Use approximate functions
  10. Only include the columns that you need

6. Optimize ORDER BY

Improvement 1: Default behaviour — ORDER BY operations are now distributed by default. Previously, we had to enable distributed sorting explicitly.

Improvement 2: Window functions — Improved performance of queries containing window functions with identical PARTITION BY and ORDER BY clauses.

Improvement 3: GROUPING SETS — Added support for ORDER BY clauses in aggregations for queries that use GROUPING SETS.

Improvement 4: ORDER BY and LIMIT — Improved plans, performance, and memory usage for queries involving ORDER BY and LIMIT to avoid unnecessary data exchanges.
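
As a small illustration of the ORDER BY plus LIMIT point, a top-N query like the sketch below (hypothetical names) lets the engine keep only the top rows per worker instead of fully sorting and exchanging the whole result.

```sql
-- Top-N query: with LIMIT, each worker keeps only its local top 10 rows,
-- which are then merged, avoiding a full distributed sort.
SELECT user_id, count(*) AS hits
FROM access_logs
GROUP BY user_id
ORDER BY hits DESC
LIMIT 10;
```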

7. Optimize joins

When you join two tables, specify the larger table on the left side of the join and the smaller table on the right side. Presto distributes the table on the right to worker nodes and then streams the table on the left to do the join. If the table on the right is smaller, less memory is used and the query runs faster. (Repeating this from the 2017 article, as it is one of the most important aspects of joining.)
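
A minimal sketch of that ordering, with hypothetical tables:

```sql
-- Large fact table on the left (streamed), small dimension table on the
-- right (distributed to workers as the build side of the join).
SELECT l.request_id, u.plan_name
FROM access_logs l        -- large table
JOIN users u              -- small table
  ON l.user_id = u.user_id;
```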

Improvement 1: JOIN and AGGREGATE operations — The performance of JOIN and AGGREGATE operations has been enhanced. Honestly, we haven't observed any significant improvement yet.

Improvement 2: Structural types — Improved performance of queries that process structural types and contain scans, joins, aggregations, or table writes.

Improvement 3: Planning performance — Improved planning performance for queries that join multiple tables with a large number of columns.

Join performance

Improvement 4: Joins with map columns — Improved the performance of joins and aggregations that include map columns.

Improvement 5: Joins with solely non-equality conditions — Improved the performance of joins with only non-equality conditions by using a nested loop join instead of a hash join.

Improvement 6: Outer joins — The join distribution type is now automatically selected for queries involving outer joins.

Improvement 7: Range over a function joins — Improved performance of joins where the condition is a range over a function (for example, a JOIN b ON b.x < f(a.x) AND b.x > g(a.x)).

Improvement 8: JOIN ... USING — JOIN ... USING now conforms to standard SQL semantics. Previously, JOIN ... USING required qualifying the table name in columns, and the column from both tables would be present in the output. Table qualifications are now invalid, and the column is present only once in the output.
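
A quick sketch of the new semantics, with hypothetical tables: the join column is written unqualified and appears only once in the output.

```sql
-- Under standard SQL semantics, user_id must not be qualified with a
-- table name and is returned once, not once per input table.
SELECT user_id, l.status, u.plan_name
FROM access_logs l
JOIN users u USING (user_id);
```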

8. Optimize GROUP BY

Improvement 1: Complex grouping — Added support for complex grouping operations.

Complex grouping operations are often equivalent to a UNION ALL of simple GROUP BY expressions.

Improvement 2: GROUPING SETS, CUBE, ROLLUP — Improved performance for queries involving GROUPING SETS, CUBE, or ROLLUP, which you can use to aggregate multiple sets of columns in a single query.

Improvement 3: grouping() — Improved performance for queries involving grouping().

The grouping operation returns a bit set converted to decimal, indicating which columns are present in a grouping. A bit is set to 0 if the corresponding column is included in the grouping and to 1 otherwise.
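
A minimal GROUPING SETS sketch with hypothetical names, using grouping() to tell the aggregation levels apart:

```sql
-- One pass produces three aggregation levels; grouping(dt, status)
-- returns a bit set (as decimal): 0 = column grouped, 1 = aggregated away.
SELECT dt, status, count(*) AS hits,
       grouping(dt, status) AS grouping_id
FROM access_logs
GROUP BY GROUPING SETS ((dt, status), (dt), ());
```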

9. Use approximate functions

Improvement 1: approx_distinct() — The approx_distinct() function now supports the following types: INTEGER, SMALLINT, TINYINT, DECIMAL, REAL, DATE, TIMESTAMP, TIMESTAMP WITH TIME ZONE, TIME, TIME WITH TIME ZONE, IPADDRESS, and CHAR.

Note: approx_distinct() DOES NOT give an exact number, but it can be used for EDA. Be careful when you use approximate functions.
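
A side-by-side sketch (hypothetical table) of the approximate and exact versions:

```sql
-- approx_distinct() has a default standard error of about 2.3% but uses
-- far less memory than the exact COUNT(DISTINCT ...).
SELECT approx_distinct(user_id) AS approx_users,  -- fast, approximate
       count(DISTINCT user_id)  AS exact_users    -- exact, more expensive
FROM access_logs;
```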

Improvement 2: avg(), sum() — The avg() and sum() aggregate functions now support the INTERVAL data type.

Improvement 3: lpad(), rpad() — The lpad() and rpad() functions now work on VARBINARY inputs. We have never used them before.

Improvement 4: min(), max() — The min() and max() aggregation functions now allow unknown input types at query analysis time so that you can use the functions with NULL literals.

Improvement 5: regexp_replace() — A variant of the regexp_replace() function has been added that can execute a lambda expression for each replacement.

Improvement 6: sequence() — Added DATE variants for the sequence() function, including variant with an implicit one-day step increment.

All of these aggregate functions ignore null values and return null for no input rows or when all values are null. For example, sum() returns null rather than zero and avg() does not include null values in the count. The coalesce function can be used to convert null into zero.
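
For example (hypothetical table), wrapping the aggregate in coalesce() turns a NULL result into zero:

```sql
-- sum() returns NULL when there are no input rows or all values are NULL;
-- coalesce() converts that NULL into 0.
SELECT coalesce(sum(bytes_sent), 0) AS total_bytes
FROM access_logs
WHERE dt = '2021-06-18';
```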

10. Only include the columns that you need

When running your queries, limit the final SELECT statement to only the columns that you need instead of selecting all columns.
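
A minimal sketch with hypothetical names: with a columnar format such as Parquet, only the listed columns are read from S3.

```sql
-- Prefer explicit columns over SELECT *: with columnar storage, Athena
-- reads only request_id and status from S3.
SELECT request_id, status
FROM access_logs
WHERE dt = '2021-06-18';
```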

Bonus 1 — Partition Projection

Partition values and locations are calculated from configuration rather than read from a repository like the AWS Glue Data Catalog.

Use-cases:

  • Queries against a highly partitioned table do not complete as quickly as you would like.
  • You regularly add partitions to tables as new date or time partitions are created in your data. With partition projection, you configure relative date ranges that can be used as new data arrives.
  • You have highly partitioned data in Amazon S3. The data is impractical to model in your AWS Glue Data Catalog or Hive metastore, and your queries read only small parts of it.

https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
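
Following the documentation linked above, here is a minimal partition projection sketch; the bucket, table, and date range are hypothetical.

```sql
-- Hypothetical table whose dt partitions are projected from configuration
-- instead of being registered in the Glue Data Catalog.
CREATE EXTERNAL TABLE projected_logs (
  request_id string,
  status     int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled'          = 'true',
  'projection.dt.type'          = 'date',
  'projection.dt.range'         = '2021/01/01,NOW',
  'projection.dt.format'        = 'yyyy/MM/dd',
  'projection.dt.interval'      = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template'   = 's3://my-example-bucket/logs/${dt}'
);
```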

Bonus Tips:

  • Parameterized queries — Parameterized queries are not supported in Athena engine version 1 but are supported in Athena engine version 2.
  • Querying Parquet columns with complex data types — When you query columns with complex data types (array, map, struct) and use Parquet to store the data, Athena engine version 1 reads an entire row of data instead of selectively reading only the specified columns. This issue does not occur in Athena engine version 2.
  • Here is the email response from the AWS team to my feedback on missing Athena engine 2 features like SHOW STATS from Presto 0.217: “Providing a list of non-supported Presto functions sounds like a good idea, so I’ve passed your request on to the service team to see if we can make that happen.” I hope they list them in their next documentation release.
