Carbon software developer Jordan Hoggart recently shared some insights at PrestoCon about how Carbon improved the performance of its real-time dashboards by swapping from Athena – Amazon’s serverless Presto option – to Ahana Cloud, running queries on our own managed cluster. You can see the recording and slides from the talk here. This article gives a summary of the key points.

The background

At the base of Carbon’s real-time, first-party data platform is our analytics component, which combines a range of behavioural, contextual and revenue data. This data is displayed in a dashboard as a series of charts, graphs and breakdowns that give a visual representation of the most important actionable data. Whilst we pre-calculate as much of the information as possible, users can apply different filters to drill deeper into the data, which makes querying critical.

AWS Athena – the good and the bad

For the past two years Athena has been our Presto provider of choice. Athena is Amazon’s serverless query engine, which uses Presto under the hood. The serverless nature of the service made it very easy to get going, as all of our data already lived in S3, and Athena was more than capable of running any query we threw at it. The biggest benefit by far was that we only paid for what we used, specifically data scanned – no need to worry about over- or under-provisioning a cluster.

As more data came in, Athena started to show signs of struggle. Anyone who has pushed Athena too far will recall the dreaded “Query exhausted resources at this scale factor” error – the message you get when a query hits its memory limit. Amazon’s advice here is to write better queries that don’t hit the limit. While we can all agree it’s decent advice, it doesn’t help once your data reaches the tipping point where scaling is necessary. This effectively gave us a hard cap on the compute power we had available, and in the long run it was only going to cause bigger headaches.

Another big drawback is the multi-tenant nature of the service. Athena is a resource that is shared between all AWS accounts. Each account has restrictions applied in the form of a concurrency limit – the docs say 20, but in some cases we’ve seen queries enter a “queued” state with as few as 2 running. Along with this, each account only gets a single queue, which means it’s possible for any random query to block a production dashboard query from running.
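To make the single-queue problem concrete, here is a toy simulation. It is a sketch only and not a description of how Athena schedules queries internally: it assumes one first-in-first-out queue, a fixed number of execution slots, and made-up query names and durations. The concurrency of 2 mirrors the low end we observed.

```python
import heapq


def simulate_single_queue(queries, concurrency):
    """Simulate one FIFO queue feeding `concurrency` execution slots.

    `queries` is a list of (name, duration_seconds) in submission order,
    all submitted at t=0. Returns {name: (start_time, finish_time)}.
    This is a hypothetical model for illustration only.
    """
    slots = [0.0] * concurrency  # time at which each slot next becomes free
    heapq.heapify(slots)
    schedule = {}
    for name, duration in queries:
        start = heapq.heappop(slots)  # earliest free slot; no priorities
        finish = start + duration
        schedule[name] = (start, finish)
        heapq.heappush(slots, finish)
    return schedule


# Two long ad-hoc queries are submitted just before a short dashboard query.
adhoc = [("adhoc-1", 300.0), ("adhoc-2", 300.0)]
dashboard = [("dashboard", 5.0)]

shared = simulate_single_queue(adhoc + dashboard, concurrency=2)
alone = simulate_single_queue(dashboard, concurrency=2)

print("shared queue:", shared["dashboard"])
print("on its own:  ", alone["dashboard"])
```

In this model the five-second dashboard query cannot start until one of the ad-hoc queries finishes, even though it would have run immediately on its own.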

It’s worth saying that Athena really shines for running low-importance ad-hoc queries. We still use it for quick data science checks that are fine to sit in a queue for a bit. It’s also great for infrequent jobs such as gathering metrics: we have a system that runs a bunch of quick queries and sends the results to Grafana, and it costs next to nothing. Athena is a fantastic tool, but in the wrong use case it can really leave your hands tied, and we needed something to meet the demand of our more intensive queries. Enter Ahana Cloud.

Ahana Cloud

Ahana Cloud is a service that takes care of the annoying parts of getting a cluster going. It uses CloudFormation templates to set up the compute plane and launches Presto clusters with configuration defined through either a web UI or an API. This kept the operational complexity down while giving us the ability to scale the cluster to what we needed. The cluster gets deployed in AWS and has connectors for Glue, so we didn’t have to shift data about and could get straight to testing.

One of the biggest concerns was that we had no idea how powerful Athena actually was. We had a rough idea of the scale of query that would cause it to fail, but couldn’t really tell what sort of node configuration it was using behind the scenes. So the first thing we did was set up a bunch of different cluster configurations to try to get performance similar to Athena’s.

Ahana and the numbers

From the dashboard we picked the seven most important queries for benchmarking, covering overview, brand, date, location, site, interest categorisation and demographic data. Even on a relatively small cluster we started to see query times that were as good as, if not better than, Athena’s. The categorisation and demographic breakdowns were pulling in the most data, around 4 billion rows, so they were always going to be the toughest. Even then it was good to see that we could get close. There was always the fear that we would have to provision a cripplingly large cluster to get the performance we wanted. Although the cost was definitely going to be more than Athena, this showed it was viable and would give us the stability we were after.

Figure: run times for 5 standard queries across our dashboard using Ahana clusters vs Athena.

Figure: run times for the 2 most demanding standard queries across our dashboard using Ahana clusters vs Athena.

Cluster Name    Worker config     $/hr*
Athena          ?                 ?
A               3 x c5.2xlarge    1.02
B               5 x c5.2xlarge    1.70
C               10 x c5.xlarge    1.70
D               20 x c5.large     1.70
E               10 x c5.2xlarge   3.40
The above is the running cost per hour for each cluster, along with the worker node configuration (Number_of_nodes x Node_name). Athena is a question mark since we don’t know the actual cluster under the hood.
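The hourly figures follow directly from node count multiplied by a per-node price. The sketch below reproduces them. The per-node prices are assumptions back-calculated from the table itself (for example 1.02 / 3 = 0.34 for a c5.2xlarge), not quoted AWS list prices, and they cover worker nodes only.

```python
from decimal import Decimal

# Assumed hourly price per worker node in USD, implied by the table above
# (e.g. 1.02 / 3 for c5.2xlarge). Not authoritative AWS pricing.
ASSUMED_NODE_PRICE = {
    "c5.large": Decimal("0.085"),
    "c5.xlarge": Decimal("0.17"),
    "c5.2xlarge": Decimal("0.34"),
}

# Cluster name -> (number of worker nodes, instance type), as in the table.
CLUSTERS = {
    "A": (3, "c5.2xlarge"),
    "B": (5, "c5.2xlarge"),
    "C": (10, "c5.xlarge"),
    "D": (20, "c5.large"),
    "E": (10, "c5.2xlarge"),
}


def hourly_cost(nodes, instance_type):
    """Worker cost per hour: node count times the assumed per-node price."""
    return nodes * ASSUMED_NODE_PRICE[instance_type]


for name, (nodes, instance_type) in CLUSTERS.items():
    cost = hourly_cost(nodes, instance_type)
    print(f"{name}: {nodes} x {instance_type} = ${cost:.2f}/hr")
```

Clusters B, C and D cost the same per hour under these assumed prices, so comparing them isolates the effect of node shape (a few large workers vs many small ones) from the effect of spend.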

Having our own cluster also let us go a bit further with optimization. Before, we couldn’t make use of session parameters, which meant we couldn’t do things like influence the join strategy. After a bit of parameter tweaking, swapping from the Parquet to the ORC data format and using a Hive metastore, we found a setup that actually gave a decent speed boost to the queries. Across all queries it gave around a 90% speedup.
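As an illustration of what session parameters look like in practice, the sketch below renders Presto SET SESSION statements from a dictionary. The helper name is hypothetical, and the two properties shown (join_distribution_type and join_reordering_strategy) are standard Presto session properties chosen as examples of influencing the join strategy. They are not necessarily the ones we tuned, and the values are not recommendations.

```python
def set_session_statements(properties):
    """Render one Presto `SET SESSION name = value` statement per property.

    Hypothetical helper for illustration: strings are single-quoted with
    embedded quotes doubled; booleans and numbers are emitted as literals.
    """
    statements = []
    for name, value in properties.items():
        if isinstance(value, bool):
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"
        statements.append(f"SET SESSION {name} = {literal}")
    return statements


# Example values only; the right settings depend on table sizes and memory.
example_properties = {
    "join_distribution_type": "BROADCAST",
    "join_reordering_strategy": "AUTOMATIC",
}

for statement in set_session_statements(example_properties):
    print(statement)
```

The statements would be run on the same session as the dashboard query, before the query itself, so that the planner picks them up.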

          Overview  Brand  Categorisation  Date  Demographic  Location  Site
Before    7         11     183             8     121          8         6
After     5         3      122             5     98           3         4
The above shows the run time (in seconds) before and after applying some of the optimisations from having our own clusters

Key Takeaways

Overall, Presto makes it easy to query data without shifting it around, and there are some great options to get going with various connectors. Whilst Athena is great for running low-volume workloads, it is held back by its single query queue and low compute ceiling, both of which can limit query speed and scale. When seeking something that allows for more scale, Ahana Cloud provides a good next step, with more control over your own cluster as well as more predictable performance.