In a recent engagement with a client, we had the opportunity to do a deep dive into AWS-managed Elasticsearch with the goal of evaluating its performance for storing and aggregating time series data. While AWS offers Elasticsearch as a “turn-key” managed service, we found that our use case required a good deal of nuanced setup and tuning. This post details the path we took to deliver an AWS-managed Elasticsearch solution, the challenges we faced, and how we resolved them.

More of an Art than a Science

One of the first things we discovered about Elasticsearch was how strongly the optimum cluster configuration depends on the target data model and query strategy. While some settings can be changed after the cluster is set up, others, like index sharding, cannot. For this reason, many teams iterate on their cluster configuration, testing, re-configuring, and re-testing until they find the optimum setup. Since we needed to shorten this process, our approach was to benchmark the performance of different setups using representative workloads and queries.

Lay of the Land

For our client’s use case, real-world sensor data for tens of thousands of devices needed to be ingested in real time, with each device reporting data from a variety of sensors. Queries against the cluster came primarily from two sources:

(1) users requesting historical records from the UI at irregular intervals, and (2) recurring 10-minute batch jobs that calculated various KPIs across groups of devices’ historical records.

Queries for use case 1 could be of any geometry, e.g. a single sensor for a group of devices or a group of sensors for a single device. Queries for use case 2 were predominantly for a single sensor across multiple groups of devices. Since some of the KPI calculations were year-to-date (YTD), we needed to support at least a one-year retention period, and because of the cost of storing this volume of data, anything older than one year needed to be discarded.
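To make these geometries concrete, the two shapes of query might look like the search bodies below; the field names (device_id, timestamp, and sensor readings nested under a values object, matching the mapping described later in this post) are illustrative rather than the client’s actual schema.

# Illustrative search bodies for the two query geometries; all field
# names and device IDs are hypothetical.

# Geometry 1: one sensor across a group of devices.
single_sensor_many_devices = {
    "_source": ["device_id", "timestamp", "values.temperature"],
    "query": {
        "bool": {
            "filter": [
                {"terms": {"device_id": [101, 102, 103]}},
                {"range": {"timestamp": {"gte": "now-30d/d", "lt": "now/d"}}},
            ]
        }
    },
}

# Geometry 2: a group of sensors for a single device.
many_sensors_single_device = {
    "_source": ["device_id", "timestamp", "values.temperature", "values.humidity"],
    "query": {
        "bool": {
            "filter": [
                {"term": {"device_id": 101}},
                {"range": {"timestamp": {"gte": "now-30d/d", "lt": "now/d"}}},
            ]
        }
    },
}

Example Query Geometries (Illustrative)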

Hardware

The first cluster we set up used AWS’s general-purpose instances. While it was usable for our initial POC, we quickly found it sub-optimal for doing anything with production-like data. This got us thinking: could a worthwhile performance improvement be achieved by selecting different machine types and configurations? We rolled up our sleeves and got to work. Of the options AWS provides, the most obvious place to start was the machine class and storage configuration.

For each configuration, we performed load testing with various query geometries and sampled the “took” parameter of the response as our performance indicator. The testing matrix we constructed included vCPU-equivalent memory-optimized r-series and storage-optimized i-series instances, with the general-purpose m-series as a control. For our use case, we found that both the r and i types provided equivalent, albeit modest (<25%), performance improvements over m. Surprisingly, we did not see any performance difference between 1K and 15K provisioned IOPS, which suggested that our workload was not constrained by disk contention.
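As a rough sketch of the harness, each test issued a representative query and recorded the cluster-reported “took” value; the endpoint, index alias, and query below are placeholders, and the real tests swept many more geometries.

import statistics

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical AWS Elasticsearch endpoint

query = {
    "size": 1000,
    "query": {
        "bool": {
            "filter": [
                {"terms": {"device_id": [101, 102, 103]}},
                {"range": {"timestamp": {"gte": "now-7d/d", "lt": "now/d"}}},
            ]
        }
    },
}

took_samples = []
for _ in range(50):
    resp = requests.post(f"{ES_URL}/our-index-/_search", json=query, timeout=30)
    resp.raise_for_status()
    # "took" is the server-side execution time in milliseconds, excluding
    # network transfer back to the client.
    took_samples.append(resp.json()["took"])

print(f"median took: {statistics.median(took_samples)} ms")

Sampling the “took” Parameter (Illustrative Sketch)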

Digging Deeper

One of the features we found helpful for analyzing query performance was Elasticsearch’s “profile” property. When enabled for a particular query, Elasticsearch responds with query diagnostics, akin to the results returned by SQL Server’s query profiler. For queries returning large numbers of documents, we found that response serialization represented around 25% of the total execution time on the cluster. By subtracting the “took” parameter from the request duration observed at the client, we also found that 25% of the request time seen by the client was incurred by response transmission. Since we were seeing these numbers from clients within the same VPC, our only recourse was to set Elasticsearch’s “filter_path” parameter to strip unused properties from the response.

We also found that AWS did not support API compression for Elasticsearch. While compression has little impact on small responses, large responses containing many documents of the same type benefit significantly. This sparked an investigation into Elasticsearch’s alternate response encodings. Initially, this seemed promising; however, we ultimately felt uncomfortable relying on it due to an open issue requesting the removal of binary protocols from Elasticsearch.
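For reference, the “profile” flag and the “filter_path” parameter both attach to an ordinary search request; the sketch below, with a placeholder endpoint and field names, shows a profiled diagnostic query next to a production-style query trimmed with filter_path.

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint
query = {"size": 500, "query": {"term": {"device_id": 101}}}

# 1. Diagnose: "profile": true returns per-shard execution breakdowns
#    alongside the normal hits.
profiled = requests.post(
    f"{ES_URL}/our-index-/_search", json={**query, "profile": True}
).json()
print(profiled["profile"]["shards"][0]["searches"][0]["query"])

# 2. Slim down: filter_path strips every response property not listed,
#    shrinking large responses before they are serialized and transmitted.
slim = requests.post(
    f"{ES_URL}/our-index-/_search",
    params={"filter_path": "took,hits.hits._source"},
    json=query,
).json()
print(slim["took"], len(slim["hits"]["hits"]))

Query Profiling and Response Filtering (Illustrative Sketch)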

In the end, the solution we adopted was two-fold. First, for the 10-minute batch jobs, we moved as many of the aggregations as possible from Spark to Elasticsearch aggregations and stored the results in an additional index when necessary. Second, for the user-initiated queries, we implemented a stateless asynchronous API using SQS and S3.
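As a hedged sketch of the first part, a KPI such as a per-device daily average for one sensor can be expressed as a nested bucket aggregation so that only the buckets, not the raw documents, leave the cluster; the endpoint, field, and sensor names are again placeholders.

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint

# Nested bucket aggregation: group by device, then by day, then average a
# single sensor reading. Only the aggregated buckets are returned.
agg_query = {
    "size": 0,  # suppress raw hits; we only want the aggregation results
    "query": {"range": {"timestamp": {"gte": "now-7d/d"}}},
    "aggs": {
        "by_device": {
            "terms": {"field": "device_id", "size": 10000},
            "aggs": {
                "by_day": {
                    "date_histogram": {"field": "timestamp", "interval": "1d"},
                    "aggs": {"avg_temp": {"avg": {"field": "values.temperature"}}},
                }
            },
        }
    },
}

resp = requests.post(f"{ES_URL}/our-index-/_search", json=agg_query).json()
for device in resp["aggregations"]["by_device"]["buckets"]:
    daily = [day["avg_temp"]["value"] for day in device["by_day"]["buckets"]]
    print(device["key"], daily)

Pushing KPI Aggregations to the Cluster (Illustrative Sketch)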

Fragmentation

When it comes to data stores, the largest and hardest-to-correct mistakes often arise from a misalignment between application design patterns and the implementation of the underlying system. In that regard, one of the key characteristics of Elasticsearch is that documents are immutable: updating or deleting documents inherently fragments the underlying index. This affects both the size of the index on disk and the performance of queries against it.

While Elasticsearch is far from the only datastore where updates cause fragmentation, we were unable to find a viable way to reduce fragmentation once it had occurred. It is possible to defragment an index by issuing a force-merge, but force merging behaves poorly on read-write indices and, at the size of our datasets, proved prohibitively costly to run in production. With this in mind, our solution ultimately relied on insert-only writes and on grouping documents into indices such that documents outside our retention window could be removed without affecting the fragmentation or performance of the remaining indices.
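For completeness, the force merge we decided against is a single API call per index, shown below against a hypothetical monthly index and endpoint.

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint

# Force merge rewrites the index's segments down to the requested count,
# reclaiming the space held by updated and deleted documents. It is I/O
# heavy, which is why it proved too costly on our large read-write indices.
resp = requests.post(
    f"{ES_URL}/our-index-2019-06/_forcemerge",
    params={"max_num_segments": 1},
)
print(resp.json())

Force Merge (Illustrative Sketch)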

Indexing

Since the optimal number of data nodes and the index sharding are dictated by the index structure, we began the next part of the investigation by determining an appropriate indexing strategy. The factors considered were support for the one-year retention period, staying within a target of 30 GB per shard, and parallel execution of queries. Based on an index we created with sample data, we estimated around 27 GB per month with one replica.

With that in mind, we decided on per-month indices with one shard and one replica. For this, we used an index template from which new alias-tagged indices were created as needed.

{"index_patterns": ["our-index-*"],"aliases": {"our-index-": 
{}},"settings": {"index": {"number_of_shards": 
"1","number_of_replicas": "1"}}}

Index Template Configuration (Partial)
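Registering the template is a one-time call, sketched below with a hypothetical endpoint and template name using the _template API; any index subsequently created with a name matching our-index-* inherits the shard count, replica count, and alias.

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint

template = {
    "index_patterns": ["our-index-*"],
    "aliases": {"our-index-": {}},
    "settings": {"index": {"number_of_shards": "1", "number_of_replicas": "1"}},
}

# Register (or overwrite) the template; it only affects indices created afterwards.
requests.put(f"{ES_URL}/_template/our-index", json=template).raise_for_status()

# Creating the next monthly index picks up the template's settings and alias.
requests.put(f"{ES_URL}/our-index-2019-07")

Registering the Index Template (Illustrative Sketch)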

With this strategy, the retention period was implemented by simply dropping expired indices. This proved to be an excellent approach: it supported the retention period without introducing fragmentation, kept shards within our target size, and provided sufficient parallelism, since a query against the index alias ran against each underlying index in parallel. While queries targeting a single month did not benefit from this parallelism, we found that overall performance on our demo clusters was maximized with this setup.
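Under the monthly naming convention we are using for illustration (our-index-YYYY-MM), expiring old data then reduces to deleting a single index.

from datetime import datetime, timedelta

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint

# Delete the monthly index that has aged out of the one-year retention window.
expired_month = (datetime.utcnow() - timedelta(days=366)).strftime("%Y-%m")
resp = requests.delete(f"{ES_URL}/our-index-{expired_month}")
print(resp.status_code, resp.json())

Dropping an Expired Index (Illustrative Sketch)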

Mappings

Our client expressed concern about the size of the data on disk, so we needed to get creative with our mappings. One idea the client brought to the table was to group signals together by device and timestamp. Three concerns immediately came to mind:

  1. The ingestion pipeline sent values to our Elasticsearch writer one at a time. If we could not reliably group all of the related values into a single write operation, it would result in significant fragmentation.

  2. The signal names were dynamic (new signals could be added or removed over time), and it was not obvious how we could store them efficiently within this structure.

  3. The structure would disproportionately favor queries that requested data for a group of sensors over queries for data from single sensors.

As usual, it came down to testing. With production-like workloads, we found that no more than 1 in 5,000 records required updating, and that Lucene’s automatic merging tended to fill in the gaps. As for signal naming, we settled on dynamic templates in which new values were assigned a common data type. This worked out well, since unused properties were effectively expunged when expired indices were dropped. Regarding query performance, we did find that queries for groups of sensors performed better than queries for individual sensors; however, the reduced overall number of documents allowed this structure to scale better than any of the other document structures we tested.

"mappings": {"our-mapping": {"dynamic_templates": [{"our-mappings": 
{"match_mapping_type": "long","mapping": {"type": 
"float"}}}],"properties": {"device_id": {"type": "integer"},"timestamp 
": {"type": "date"},"values ": {"properties": {}}}}}

Index Mapping Configuration (Partial)
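A single grouped document under this mapping might look like the sketch below (the signal names under values are hypothetical); new signals are mapped dynamically as they appear, and the dynamic template above coerces integer-typed readings to float so a signal’s type stays consistent.

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint

# One document per device per timestamp, with every sensor reading for that
# moment grouped under "values".
doc = {
    "device_id": 101,
    "timestamp": "2019-07-14T12:00:00Z",
    "values": {"temperature": 21.4, "humidity": 48.0, "vibration": 3},
}

# Index into the current monthly index using the "our-mapping" type from the
# mapping above (assuming a 6.x-era cluster where mapping types are still
# part of the URL).
resp = requests.post(f"{ES_URL}/our-index-2019-07/our-mapping", json=doc)
print(resp.json()["result"])  # "created"

Grouped Document Example (Illustrative Sketch)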

Scaling

Since the client’s goal was to support a growing user base, the solution needed to scale. From our other investigations, we had noticed a roughly linear degradation in response time as concurrent requests were added. To test whether scaling the cluster addressed this, we built a testing matrix covering one to three replicas on both a 12-data-node and a 24-data-node cluster, and for each combination we varied the number of concurrent requests between one and four. The results were encouraging, showing a linear increase in the cluster’s capacity to handle concurrent requests as nodes were added and reaffirming our belief that the solution would perform for the client.
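The concurrency dimension of that matrix can be sketched with a simple thread-based client, as below; the endpoint and query are placeholders, and the replica and node counts are cluster-side settings changed between runs.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

ES_URL = "https://our-es-domain.example.com"  # hypothetical endpoint
query = {"size": 1000, "query": {"term": {"device_id": 101}}}

def timed_search(_):
    start = time.monotonic()
    resp = requests.post(f"{ES_URL}/our-index-/_search", json=query, timeout=60)
    resp.raise_for_status()
    # Return both the client-observed latency and the cluster-reported "took".
    return time.monotonic() - start, resp.json()["took"] / 1000.0

for concurrency in (1, 2, 3, 4):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_search, range(concurrency * 10)))
    client_avg = sum(r[0] for r in results) / len(results)
    took_avg = sum(r[1] for r in results) / len(results)
    print(f"{concurrency} concurrent: client {client_avg:.2f}s, cluster {took_avg:.2f}s")

Concurrency Load Test (Illustrative Sketch)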

Conclusion

Throughout our process, we found ourselves coming back to the same core concepts. While we encountered them in the public cloud, they apply, in principle, to any Elasticsearch setup. Distilled, these concepts were:

  1. Avoid document updates wherever possible

  2. Logically partition data into separate indices

  3. Use index templates to automate index creation and aliasing

  4. Calculate aggregates on the cluster whenever possible

  5. Use nested bucket aggregations where applicable

  6. Keep overall response sizes small by filtering out unused properties

  7. Benchmark multiple index configurations when determining sharding