Elasticsearch reindexing without downtime
Introduction
I'm a simple guy. I see an Uptime Percentage Chart and I want it to stay at 100%. I could probably spend all day on Chaos Engineering and have a lot of fun randomly destroying infrastructure components and observing the high-level SLA. For me, there's nothing better than the satisfaction of a well-designed and reliable infrastructure. :)
Situation overview
Unfortunately, this blog post is not about the beautiful world of Chaos Engineering, but about zero-downtime reindexing of Elasticsearch indices with a new field mapping (a new Index Template). I think it's still a pretty interesting topic, because our goal was to guarantee a zero-downtime migration for both indexing and searching documents. While preparing for the operation I read a few blog posts about it, but all of them relied on an Index Alias for both indexing and searching. A description of that approach: https://www.elastic.co/blog/changing-mapping-with-zero-downtime
I couldn't use that solution. In our project, we use an Index Alias (people, configured at the Index Template level) for search operations, but the application that sends documents to Elasticsearch has a hardcoded index name (with month and year; for the purposes of this article, let's say it's "people-072021"). Moreover, I couldn't reconfigure and restart that application at the time. For context, we had around 3-5 Index Document requests per second. Search mattered just as much: users would not be able to use the platform if ES stopped returning data for search requests.
Since I mentioned Index Templates above, here is the "timeline" of what happened:
- The initial index template was created.
- The first documents started to arrive in the index for the current month (people-072021).
- After a few days, we improved the index template with a new mapping.
An index template is only applied when an index is created, so the existing people-072021 index kept the old mapping. A lot of fun ahead, right? ;)
Our way of handling that
Here's a step-by-step list of the actions we took to handle the remapping without downtime:
1. Make sure that the Index Template is correct (mappings). All indices matching the name pattern specified in the Index Template will get the Index Alias people.
2. Add an Ingest Pipeline. In our case, it redirects all incoming documents to another destination index, called people-tmp. This step only creates the Ingest Pipeline; it is not "active" yet.
3. Configure the previously added Ingest Pipeline as the default pipeline for the people-072021 index in its settings. That immediately starts redirecting documents to people-tmp, as specified in the Ingest Pipeline definition. Also, the people-tmp index (as long as it matches the naming pattern of the Index Template) will already have the fixed mapping, since it's a new index created from the Index Template.
4. Add a new, empty index: people-fixed.
5. Remove its Index Alias before reindexing documents. That prevents search queries from returning duplicated results (since documents will exist in both indices, the old one and the new one).
6. Reindex documents from people-072021 into people-fixed. That already fixes the field mapping, since people-fixed is a new index created from the Index Template.
7. Add the people Index Alias to the people-fixed index and remove it from people-072021, so that search queries don't return duplicated documents until we delete the people-072021 index.
8. Delete the people-072021 index and recreate it (or the next Index Document request will recreate it). All incoming documents will be indexed into people-072021 again. Deleting the index also removes its Ingest Pipeline setting.
9. Reindex/copy all documents from people-tmp and people-fixed back to people-072021. That's the weak part: I couldn't find a better way to handle it. During the Reindex operation, search queries may return duplicated results, so I tried to minimize the impact as much as possible by playing around with Index Aliases. If you have a better idea of how to handle that, please let me know!
10. Remove the people-tmp and people-fixed indices.
UPDATE, regarding step 9 (copying documents back to people-072021): a few colleagues suggested a potential solution to that problem. To reindex without duplicated search results, I could write a script that reads a document, writes it to the new index (people-072021), checks that it is available for search, and finally deletes the same document from the origin index.
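A minimal sketch of what such a script could look like, in Python with only the standard library. It is an illustration under assumptions, not something taken from the original migration: the names http_transport and move_document are made up for this example, the copy is written through the _create endpoint with refresh=wait_for so that the call returns only once the document is visible to search, and the list of document IDs to move (for example from a scroll over the origin index) is left to the caller.

```python
import json
import urllib.request
from typing import Callable, Optional
from urllib.parse import quote

# A transport takes (method, path, body) and returns the decoded JSON response.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_transport(base_url: str) -> Transport:
    """Build a transport that talks to Elasticsearch over plain HTTP."""
    def send(method: str, path: str, body: Optional[dict] = None) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return send


def move_document(es: Transport, doc_id: str, src: str, dst: str) -> bool:
    """Copy one document from src to dst, confirm it is searchable, then delete it from src."""
    safe_id = quote(doc_id, safe="")
    doc = es("GET", f"/{src}/_doc/{safe_id}", None)
    # _create fails if the ID already exists in dst, so nothing is overwritten.
    # refresh=wait_for blocks until the new document is visible to search.
    es("PUT", f"/{dst}/_create/{safe_id}?refresh=wait_for", doc["_source"])
    found = es("POST", f"/{dst}/_count", {"query": {"ids": {"values": [doc_id]}}})
    if found["count"] != 1:
        return False  # keep the original if the copy is not searchable
    es("DELETE", f"/{src}/_doc/{safe_id}?refresh=wait_for", None)
    return True
```

With a local cluster this could be called as move_document(http_transport("http://localhost:9200"), "well-known-id", "people-fixed", "people-072021"). A document still exists in both indices for a short moment between the write and the delete, so the duplicate window shrinks to a single document at a time rather than disappearing completely.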
The job is done.
Example resources to replicate the situation
If you're interested, below is a list of requests and shell scripts to reproduce the same situation locally (e.g. with Elasticsearch and Kibana running in Docker).
Background shell scripts
ingestion.sh
- This script uses ali, a load testing tool available on GitHub. You can replace it with curl. :)
#!/usr/bin/env zsh
ES_URL=http://localhost:9200
ES_INDEX_NAME=people-072021
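# Send 6 requests/sec for 3 minutes. $(date) is expanded once by the shell, so every document gets the same timestamp.
ali --rate=6 --duration=3m -m=POST --header="Content-Type: application/json" --body='{"name":"John","age":"50","timestamp":"'$(date +'%s')'"}' ${ES_URL}/${ES_INDEX_NAME}/_doc/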
search.sh
- This script dumps responses to the search.log file.
#!/usr/bin/env zsh
ES_URL=http://localhost:9200
ES_INDEX_NAME=people
echo "" > search.log
while true; do
  curl -XGET -L -H "Content-Type: application/json" ${ES_URL}/${ES_INDEX_NAME}/_count -d '{ "query": {"match": {"_id": "well-known-id"}}}' >> search.log
  echo "" >> search.log
  sleep 0.1
done
If the curl response contains "count":1, the document was found. :)
Example:
{"count":1,"_shards":{"total":2,"successful":2,"skipped":0,"failed":0}}
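Reading search.log by eye gets tedious after a few minutes of polling, so here is a small, hedged sketch of how the log could be summarized in Python. It assumes the file has the shape search.sh produces (one JSON response per line, separated by blank lines); the function name summarize_counts and the four category labels are made up for this example.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_counts(lines: Iterable[str]) -> Counter:
    """Classify each logged _count response as found, missing, duplicated, or error."""
    summary = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # search.sh writes blank separator lines
        try:
            count = json.loads(line)["count"]
        except (ValueError, KeyError, TypeError):
            summary["error"] += 1  # not JSON, or a response without "count"
            continue
        if count == 1:
            summary["found"] += 1
        elif count == 0:
            summary["missing"] += 1
        else:
            summary["duplicated"] += 1
    return summary
```

Calling summarize_counts(open("search.log")) after the migration gives a quick verdict: anything under "missing" or "error" means a search did not find the document, and anything under "duplicated" means the alias pointed at two copies of it at that moment.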
Elasticsearch requests
# Add basic index template with wrong mapping
PUT _index_template/people_template
{
  "index_patterns": ["people*"],
  "template": {
    "mappings": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "name": {
          "type": "keyword"
        },
        "age": {
          "type": "text"
        },
        "timestamp": {
          "type": "date",
          "format": "epoch_second"
        }
      }
    },
    "aliases": {
      "people": {}
    }
  }
}
# Add a doc with the ID well-known-id. The background search script looks it up by document ID to confirm that search keeps working.
POST people-072021/_doc/well-known-id
{"name":"John","age":"50","timestamp":"1626175541"}
###################
# Start search in background
# Start ingestion in background
###################
# Fix mapping in index template
PUT _index_template/people_template
{
  "index_patterns": ["people*"],
  "template": {
    "mappings": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "name": {
          "type": "keyword"
        },
        "age": {
          "type": "short"
        },
        "timestamp": {
          "type": "date",
          "format": "epoch_second"
        }
      }
    },
    "aliases": {
      "people": {}
    }
  }
}
# Add ingest pipeline that will redirect documents to tmp index
PUT _ingest/pipeline/tmp-redirect-pipeline
{
  "description": "That pipeline will redirect documents to tmp index",
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "people-tmp"
      }
    }
  ]
}
# Configure tmp-redirect-pipeline as default for broken index - people-072021. New docs should be indexed in people-tmp index from now on.
PUT people-072021/_settings
{
  "index": {
    "default_pipeline": "tmp-redirect-pipeline"
  }
}
# Add new index where we will reindex docs
PUT people-fixed
# Remove its alias to prevent search operations on it
POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "people-fixed",
        "alias": "people"
      }
    }
  ]
}
# Reindex broken index into new one - that will fix field mapping
POST _reindex
{
  "source": {
    "index": "people-072021"
  },
  "dest": {
    "index": "people-fixed"
  }
}
# Add people alias to fixed index and remove from people-072021
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "people-fixed",
        "alias": "people"
      }
    },
    {
      "remove": {
        "index": "people-072021",
        "alias": "people"
      }
    }
  ]
}
# CHECK: Should show wrong mapping
GET people-072021/_mapping/field/age
# Delete people-072021 index
DELETE people-072021
# Copy all documents back to the original index. With "op_type": "create" and "conflicts": "proceed", documents that already exist in people-072021 are skipped instead of updated. Remove both keys if you want to overwrite people-072021 with the docs from -tmp and -fixed.
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "people-fixed,people-tmp"
  },
  "dest": {
    "op_type": "create",
    "index": "people-072021"
  }
}
# CHECK: Should show correct mapping
GET people-072021/_mapping/field/age
DELETE people-tmp
DELETE people-fixed
###################
# DONE!
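If the same exercise has to be repeated for another monthly index, the requests above can be generated from the index names instead of being edited by hand. The following Python sketch only builds the list of requests in order; sending them (and pausing for the mapping checks) is left to whatever HTTP client is at hand. The function name build_plan and its parameters are hypothetical, and the template update itself is assumed to have been done already.

```python
from typing import List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[dict]]


def build_plan(broken: str, alias: str, tmp: str, fixed: str, pipeline: str) -> List[Request]:
    """Return the migration requests in the order they should be sent."""
    return [
        # Redirect incoming documents to the temporary index.
        ("PUT", f"/_ingest/pipeline/{pipeline}", {
            "description": "Redirect documents to the temporary index",
            "processors": [{"set": {"field": "_index", "value": tmp}}],
        }),
        ("PUT", f"/{broken}/_settings", {"index": {"default_pipeline": pipeline}}),
        # Create the fixed index and hide it from searches while it fills up.
        ("PUT", f"/{fixed}", None),
        ("POST", "/_aliases", {"actions": [
            {"remove": {"index": fixed, "alias": alias}},
        ]}),
        ("POST", "/_reindex", {"source": {"index": broken}, "dest": {"index": fixed}}),
        # Swap the alias from the broken index to the fixed one in a single call.
        ("POST", "/_aliases", {"actions": [
            {"add": {"index": fixed, "alias": alias}},
            {"remove": {"index": broken, "alias": alias}},
        ]}),
        # Drop the broken index; it is recreated from the corrected template.
        ("DELETE", f"/{broken}", None),
        # Copy everything back without overwriting documents that arrived meanwhile.
        ("POST", "/_reindex", {
            "conflicts": "proceed",
            "source": {"index": f"{fixed},{tmp}"},
            "dest": {"op_type": "create", "index": broken},
        }),
        ("DELETE", f"/{tmp}", None),
        ("DELETE", f"/{fixed}", None),
    ]
```

For the example in this post the call would be build_plan("people-072021", "people", "people-tmp", "people-fixed", "tmp-redirect-pipeline"), which yields ten requests matching the sequence listed above.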