How we accidentally burned $40,000 with a recursive Lambda pattern

Dheeraj Inampudi · Published in Level Up Coding · Nov 14, 2022


Background

A couple of years ago, I was a consultant for an airline IoT company, where I was responsible for building the analytics architecture, including the complete data engineering pipeline and analytics dashboards.

More than 3,000 IoT devices send about 200 GB/day into the ETL pipelines. The data is staged temporarily in a SQL database and dumped into S3 as CSV files every four hours. With input from their head of engineering, I settled on the following plan of action.

Setting the context (edited): This incident happened in 2019, and I was not the AWS admin. A few points for the record:

  • We asked their AWS admin to configure CloudWatch alarms before the project commenced, which they never did
  • I managed a team of four engineers, one of whom wrote the function
  • It was a temporary function meant to run for only two weeks, which is why it was written on Lambda
  • Serverless services scale on demand, and reserved concurrency and provisioned concurrency were still new concepts in 2019
  • Too many factors in this equation were beyond my control; the purpose of this article is to make developers aware of recursive patterns so that they proceed with caution

As a consultant, I proposed the following four major changes to the existing architecture:

  1. Transform the entire historical dataset into hour-level Date-Partitioned, Snappy-Compressed Parquet (DPSCP) format
  2. Keep the existing pipelines dumping data to S3 in CSV, but use our tried-and-true scripts to convert each file into DPSCP over the two-week migration period (this is where the problem occurred)
  3. Build a new pipeline that writes incoming data directly in DPSCP
  4. Use Athena queries to power the final BI dashboard; a sample table definition follows this list (note: the real-time dashboard is a different use case)
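
To make step 1 concrete, here is a minimal sketch of an Athena table over that layout. The table name, columns, bucket, and paths are illustrative assumptions, not the client's actual schema:

```python
# Hypothetical Athena DDL for the DPSCP layout (schema and S3 paths are
# illustrative, not the client's actual setup).
import boto3

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS iot_telemetry (
    device_id   string,
    recorded_at timestamp,
    payload     string
)
PARTITIONED BY (dt string)  -- hour-level partitions, e.g. dt=2019-06-01-13
STORED AS PARQUET
LOCATION 's3://analytics-bucket/Processed/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://analytics-bucket/athena-results/"},
)
```

Hour-level partitioning means Athena only scans the partitions a dashboard query actually touches, which is what keeps the pay-as-you-go query cost low.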

Success criteria: everything in working order, with all accessible data queryable at minimal pay-as-you-go query cost, paying only for data storage on S3.

Problem & Cause

During migration (Step 2), we created a two-week temporary Lambda that listens for S3 bucket events, transforms the new CSV files into Parquet, and publishes the results back to S3. Because the source and destination shared the same bucket and prefix, the function was triggered millions of times.

Ideally, the Lambda should have converted about 24 CSV files per day to Parquet for two weeks. But we kept the CSV files after processing (for validation) and triggered on every S3 object-created event, so each new Parquet file the function wrote fired the S3 event again, and Lambda repeated the process, as sketched below.
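
Here is a minimal sketch of how such a handler recurses. The bucket layout and the pandas-based conversion are assumptions for illustration; the essential bug is the PUT back into the same bucket the trigger watches:

```python
import io
import urllib.parse

import boto3
import pandas as pd  # assumes pandas + pyarrow are in the deployment package

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Convert the incoming CSV to Snappy-compressed Parquet.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_csv(io.BytesIO(body))
        out = io.BytesIO()
        df.to_parquet(out, compression="snappy")

        # BUG: the output goes to the SAME bucket the notification watches,
        # and the notification has no prefix/suffix filter -- so every object
        # created here generates another event and another invocation.
        s3.put_object(Bucket=bucket,
                      Key=key.replace(".csv", ".parquet"),
                      Body=out.getvalue())
```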

Recursive Pattern

The infinite loop caused Lambda to scale out to all available concurrency, while S3 kept writing objects and generating new events for Lambda.

Cost breakdown

By the end of the roughly two-day incident, the loop had run for about 40 hours and cost approximately $40,000, mostly from Lambda compute charges and S3 PUT requests.
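
The exact bill depends on memory size, invocation duration, and how many objects each invocation writes. As a rough illustration only, with hypothetical inputs and us-east-1 list prices from that era, a loop like this compounds into tens of thousands of dollars well within 40 hours:

```python
# Illustrative arithmetic only -- the workload inputs below are hypothetical,
# not our real numbers. Unit prices are us-east-1 list prices (circa 2019).
LAMBDA_GB_SECOND = 0.0000166667    # USD per GB-second
LAMBDA_PER_REQUEST = 0.20 / 1e6    # USD per invocation
S3_PUT = 0.005 / 1000              # USD per PUT request

concurrency = 3000        # assumed average concurrent executions
duration_s = 2            # assumed seconds per invocation
memory_gb = 3.0           # assumed memory allocation
puts_per_invocation = 10  # assumed objects written per invocation
hours = 40

invocations = concurrency / duration_s * hours * 3600
compute = invocations * duration_s * memory_gb * LAMBDA_GB_SECOND
requests = invocations * LAMBDA_PER_REQUEST
puts = invocations * puts_per_invocation * S3_PUT

print(f"invocations: {invocations:,.0f}")           # 216,000,000
print(f"total: ${compute + requests + puts:,.0f}")  # roughly $32,000 here
```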


Prevention

  • Use a positive trigger: For instance, an S3 object trigger can rely on a naming convention or metadata tag that only matches unprocessed objects. This stops objects written by the Lambda function from invoking the same function again (see the sketches after this list).
  • Use reserved concurrency: When a function’s reserved concurrency is set to a low limit, the function can’t scale past it. This doesn’t stop the recursion, but it caps the resources consumed and acts as a safety net, which is especially helpful during development and testing.
  • Use CloudWatch monitoring and alarming: By setting an alarm on a function’s concurrency metric, you can get alerted when concurrency suddenly spikes and take corrective action.
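
A minimal sketch of the positive-trigger guard, assuming a hypothetical Original/ input prefix:

```python
# Only raw CSVs under the input prefix are processed; everything else --
# including the Parquet files this function writes -- is ignored.
def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        if not (key.startswith("Original/") and key.endswith(".csv")):
            continue  # object was written by us (or isn't ours) -- skip it
        convert_to_parquet(key)  # hypothetical conversion step
```

And the concurrency cap plus alarm, using real boto3 calls but a placeholder function name, threshold, and SNS topic:

```python
import boto3

# Cap the function so a runaway loop can't consume all account concurrency.
boto3.client("lambda").put_function_concurrency(
    FunctionName="csv-to-parquet",
    ReservedConcurrentExecutions=10,
)

# Alarm if the function's concurrency spikes past the expected ceiling.
boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="csv-to-parquet-concurrency-spike",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Dimensions=[{"Name": "FunctionName", "Value": "csv-to-parquet"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```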

Similar patterns show up in other situations and services, as AWS’s own articles on the topic describe.

How We Fixed It

When setting up event notifications on an S3 bucket, you can also filter by object key using a prefix or suffix. We broke the recursion by using two different S3 prefixes, ‘Original’ and ‘Processed’, and the suffix ‘.csv’.
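
In code, the fix looks roughly like the following (bucket name and function ARN are placeholders). The notification fires only for objects under Original/ that end in .csv, so Parquet files written under Processed/ never re-trigger the function:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="analytics-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:csv-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "Original/"},
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    },
)
```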

I later found an article on the AWS blog that explains this pattern in detail.
