outcoldman
outcoldman Denis Gladkikh

Elasticsearch: How to avoid index throttling, deep dive in segments merging

elasticsearch, lucene, segments, and databases

This blog post is written based on source code of Elasticsearch 5.5.0 and Lucene 6.6.

If you are managing Elasticsearch cluster it is very important to understand what are the segments in the index, why and when they are getting merged, and what is the right configuration.

If your Elasticsearch cluster is fairly big, default configuration might not work for you. Not sure why the documentation for the Merge Policy is gone from Index Modules, but you can find it in source code (MergePolicyConfig.java). At the bottom you can find really important note about the max_merged_segment, which is set to 5gb by default.

Note, this can mean that for large shards that holds many gigabytes of data, the default of max_merged_segment (5gb) can cause for many segments to be in an index, and causing searches to be slower.

So what is the many gigabytes of data? Let’s try to answer on this question.

First I would highly recommend you to look on Visualizing Lucene’s segment merges by Michael McCandless. If you are not familiar with Lucene you should also look into Elasticsearch from the Bottom Up. The third video in first link presents TieredMergePolicy. It is the merge policy you should be most interested in. All the other policies were deprecated in Elasticsearch 1.6 and removed in Elasticsearch version 2.0. In the article mentioned above you will find very good explanation on how TieredMergePolicy works.

What helped me more, when I looked in the source code of Lucene implementation of method TieredMergePolicy.findMerges. That and looking on default configuration of Elasticsearch helped me to understand what to expect.

Back to the article mentioned above

TieredMergePolicy first computes the allowed “budget” of how many segments should be in the index, by counting how many steps the “perfect logarithmic staircase” would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.

Let’s look on how that is implemented

First we get a collection segments infosSorted, sorted in descending order by size.

Collections.sort(infosSorted, new SegmentByteSizeDescending(writer));

Block calculates total size of the index (sum of all segments) and size of the smallest segment

long totIndexBytes = 0;
long minSegmentBytes = Long.MAX_VALUE;
for(SegmentCommitInfo info : infosSorted) {
  final long segBytes = size(info, writer);
  // ... skipped ... //

  minSegmentBytes = Math.min(segBytes, minSegmentBytes);
  // Accum total byte size
  totIndexBytes += segBytes;
}

Now we have two variables, totIndexBytes is the size of all indices and minSegmentBytes is the minimum segment size.

Next we exclude all segments larger than max_merged_segment/2.0 (with default value 5gb it is 2.5gb)

int tooBigCount = 0;
while (tooBigCount < infosSorted.size()) {
  long segBytes = size(infosSorted.get(tooBigCount), writer);
  if (segBytes < maxMergedSegmentBytes/2.0) {
    break;
  }
  totIndexBytes -= segBytes;
  tooBigCount++;
}

Seem like this loop can be easily combined with previous one. PR#219.

Using value of floor_segment (default is2mb) if the smallest segment is less than that

minSegmentBytes = floorSize(minSegmentBytes);

Next block is very important, it calculates number or allowed segments. The number which triggers the merge, when the number of segments is more than that

long levelSize = minSegmentBytes;
long bytesLeft = totIndexBytes;
double allowedSegCount = 0;
while(true) {
  final double segCountLevel = bytesLeft / (double) levelSize;
  if (segCountLevel < segsPerTier) {
    allowedSegCount += Math.ceil(segCountLevel);
    break;
  }
  allowedSegCount += segsPerTier;
  bytesLeft -= segsPerTier * levelSize;
  levelSize *= maxMergeAtOnce;
}
int allowedSegCountInt = (int) allowedSegCount;

Taking default configuration with floor_segment equal to 2mb and assuming that you have segment with size lower or equal to floor_segment we can estimate the allowedSegCountInt to be around 40 segments and our perfect logarithmic staircase should look like

lucene perfect logarithmic staircase

Some observations from above:

Only if number of eligible segments more than allowedSegCountInt, lucene proceeds with finding candidates for merges

if (eligible.size() > allowedSegCountInt) {
    // ...
}

Where eligible is the list of all segments with size less than max_merged_segment/2.0 and has not been included in any merges yet.

The code under this if statement trying to find the possible combination of segments to include in merges which will bring eligible.size() under allowedSegCountInt.

The algorithm is simple, it starts from the largest segment and trying to find N segments (where N < max_merge_at_once), which in merge will result segment with size less than max_merged_segment. This is a reason, why actually the perfect logarithmic staircase should not happen, because this merge policy does not look for how to merge together smallest segment, but actually trying to find how to merge the largest segments first.

This is an example

first merge

Some observations from this code:

Now it is much clear for me how segments are getting merged in Lucene. Let’s come back to the Elasticsearch and look on index throttling. Index throttling happens in EngineMergeScheduler.beforeMerge. It happens when TieredMergePolicy.findMerges returns more merges than the value of index.merge.scheduler.max_merge_count. The value of max_merge_count is defined in MergeSchedulerConfig. By default this value is set to index.merge.scheduler.max_thread_count + 5, where max_thread_count = Math.min(4, numberOfProcessors / 2). This value can be changed dynamically as any other index setting (but for some reason it is not documented)

curl -XPUT http://localhost:9200/*/_settings -d'{
  "index.merge.scheduler.max_merge_count": 100
}'

If you are planning to play with the configuration for merge policy I would highly recommend you to change this to higher value than default, that will help you to avoid index throttling.

What else can cause index throttling? It really depends. First, very obvious is enabled throttling for merges indices.store.throttle.type or very low value of indices.store.throttle.max_bytes_per_sec. Seems like in Elasticsearch 6.0 these settings will be removed, so merges will never be throttled. To find what else can cause index throttling I would recommend to look on current sizes of segments, use the algorithm from the TieredMergePolicy, predict what kind of merges will be scheduled. That should help you to answer the question. Can be that you have too many small segments or maybe too many large segments.

Have feedback or questions? Looking for consultation?

My expertise: MongoDB, ElasticSearch, Splunk, and other databases. Docker, Kubernetes. Logging, Metrics. Performance, memory leaks.

Send me an email to public@denis.gladkikh.email.

The content on this site represents my own personal opinions and thoughts at the time of posting.

Content licensed under the Creative Commons CC BY 4.0.