At Carbon we enable our clients to unlock their large audiences through analytics and segmentation for improved activation and monetisation. This creates the need to query the data of hundreds of millions of user profiles on demand. We need to do this quickly, and we need to capture a lot of this data to be used in reporting and further querying.
When we first looked at this problem several big data solutions came to mind, including the offerings of the big cloud providers, Redshift and BigQuery. These didn’t quite work for us as whilst they are super fast at returning aggregated data, they aren’t fast to write out large amounts of rows from the result of a query.
After a lot of work and benchmarking we settled on the (new at the time) Athena product from AWS. Athena is essentially a managed version of Apache Presto (though AWS had made some noticeable improvements). The big sell of this product is that it’s fully managed and you pay per query (based on data scanned). It should scale under the hood and the advertised query limits of 25 at a time were more than sufficient at the time and we had room to grow with limit increases. Athena was also able to perform well with the large scale writes we wanted it to do.
Query exhausted resources at this scale factor
After some work Athena was in production and it was great for a time… but then we started to see problems. Firstly less queries were running sequentially than expected, (more like 5 than 25). And then the infamous “Query exhausted resources at this scale factor” errors started to flow in. This got worse with time, both due to us running larger and more complex queries and (we believe) changes behind the scenes at AWS to how resources are allocated.
The problem is partly that Athena makes you optimise for data scanned. You pack as much data in as you can through compression and formats like parquet and avoid joins, especially multiple. Presto as a technology requires everything be held in memory, which means if you load in more than the resource you have, the query fails. This means clients that don’t have the data they need to power a campaign, and a developer needs to investigate and mitigate the issue.
We knew this solution was becoming unworkable and it’s easy to jump to the reason being a bad product. I don’t believe this is the case, we still use it alot for other things and it’s fantastic. Cheap, very easy to setup and fast. I do think there are major lessons to learn here though, in particular;
We want to scale the query system both horizontally and vertically, e.g. we want to allow queries to run faster and we want to run more at a time. Athena gives us no power to do either, if we need more resources we can’t do anything about it.
We needed a replacement fast, that wouldn’t fall over and gave us full control over scalability. Naturally we thought we could just roll out our own presto, but it’s here we learned the differences between Presto and Athena are there, and one of the big advantages of Athena is we didn’t need to implement a query / queue / result / error system. And some differences between Presto and Athena were causing issues that stopped this being a quick fix. As we already had a lot of experience in the team with Spark and (around this time we had started to work with Databricks more closely) we made a few quick prototypes to de-risk the solution and then started a POC to replace the system with Spark.
Athena / Presto Vs Spark
Presto and Spark have a lot of overlap but there are a few key differences. They can both run queries over very large datasets, both are pretty fast and both use clusters of machines. However Presto is more limited in the types of operations you can do as it’s more similar in use to a SQL database, but you use files on disk vs inserting into an indexed database. Spark is much more general purpose and can be used for queries, but also things like ETL and machine learning.
The key difference for us is the way the solutions handle shuffling. Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design . The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing .
Spark is not millisecond fast but that wasn’t a requirement, queries can take minutes, they just need to not fail, and not be expensive. The solution we designed uses a small always on spark cluster (a total of 16 cores and 120GB memory) that processes queries throughout the day. At the same cost point as Athena, queries are running much faster and we have had no failures due to volume. We can increase the size of the cluster to increase query speed and add more clusters to increase parallelism. Scaling both ways requires just ~2-3 minutes using Databricks allowing us to react very quickly and in different ways.
Managed services are great for getting things up and working fast, and saving you time and money in maintaining the system. But there is a huge difference between a service like Athena which gives you no ability to scale yourself, and RDS / EMR / RedShift which allows you to pay to have more resources at any time. Don’t hold yourself hostage, ensure you can control your scale. When the fires are burning, having a lever you can pull is essential.
 This is generally true but there are many exceptions. The most notable we have found is if doing large aggregations that aren’t sumable. E.g. if you do a collect_set or collect_list operation in a group by. If you exceed memory here you will spill to disk and it can cause significant slowdowns.