Elasticsearch has become one of the first choices for adding full-text search to applications or analyzing their logs. At my company, we also use Elasticsearch in some applications as a secondary database for full-text search, and we run it on AWS through Elasticsearch Service to reduce admin overhead. A few days ago, however, our Elasticsearch cluster ran into a problem, and we spent half a day recovering it. To avoid this kind of problem in the future, I decided to learn in more depth what Elasticsearch is and how to use it the right way. This story is the result of several days of researching this search engine. I hope it helps you understand the essential parts and have some fun working with Elasticsearch.
1. What is Elasticsearch?
“Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured” (from Elastic’s own introduction). Elasticsearch is a distributed system with a cluster-and-node architecture built around Apache Lucene (an open-source search library written in Java) as its core component. Because of this design, high availability comes naturally, and the cluster can scale by adding or removing nodes. Elasticsearch can handle petabytes of data and is used mainly for two purposes: full-text search and analysis of data or logs.
2. How Elasticsearch stores data
Data goes into and comes out of Elasticsearch as JSON documents, much like in MongoDB. So basically, Elasticsearch is a NoSQL database, and it doesn’t natively support relational-style joins or relationships.
In the real world, Elasticsearch is used mainly as a secondary database that stores only the data required for full-text search or analysis. That data comes from a primary database such as Cassandra, MongoDB, or DynamoDB, typically through tools such as streaming pipelines.
Besides that, Elasticsearch treats each field of a JSON document differently depending on its field type (the Elasticsearch data type, not the JSON data type). The field type determines how the data is pre-processed and how it is stored, either to improve performance when retrieving documents or to make the field full-text searchable. For example, if the field type is Text, Elasticsearch pre-processes the raw data with an Analyzer (I will explain this in detail later) before saving the processed data to an inverted index. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. This makes the field full-text searchable when clients run a query. In another case, if the field type is Geo Point or Geo Shape, Elasticsearch stores the data in a BKD tree, so clients can retrieve it very quickly.
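To make the inverted index idea concrete, here is a minimal sketch in Python. It is not how Lucene implements it: the analyze function is a toy stand-in for a real Analyzer (it only lowercases and splits on non-alphanumeric characters), and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, keyed by document ID.
docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene powers Elasticsearch search",
}
inverted_index = build_inverted_index(docs)
print(inverted_index["search"])  # [1, 2]
print(inverted_index["lucene"])  # [2]
```

Looking up a word is now a single dictionary access instead of a scan over every document, which is the reason this structure suits full-text search.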
3. Elasticsearch core components
Cluster
Simply put, an Elasticsearch cluster is a set of Elasticsearch nodes (running instances, whether on physical devices or virtual machines) that have the same “cluster.name” and work together over a network to share their data and workload. In this way, the cluster provides indexing and search capabilities to clients that want to run queries on Elasticsearch.
Node
A node is a running instance of Elasticsearch that belongs to a cluster. Nodes are where Elasticsearch performs indexing and search operations when a user interacts with the cluster. Elasticsearch documents are distributed across nodes, and each node knows where a given document lives, so it can forward a user request directly to the node that holds the data.
Based on function, Elasticsearch has several kinds of nodes. The three most important are the master node, master-eligible nodes, and data nodes.
Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.
The master node is in charge of cluster-wide management and configuration actions such as adding or removing nodes and creating, updating, or deleting indices. A cluster has only one master node at a time. If the master node fails, the master-eligible nodes in the cluster elect a new master from among themselves.
A master-eligible node is a node that can be elected as the new master node when the current master node fails.
In a cluster with only one node, that node is both the master node and a data node.
Index
You can think of an index in Elasticsearch as a database in the world of relational databases. To add data to Elasticsearch, we need to create an index. In reality, an index is just a logical namespace; the data is actually divided and stored across many shards. All data-related operations such as CRUD are performed on shards rather than on the index itself; the index acts as a front that hides this complexity.
Shard
Simply put, a shard is a single Lucene index. It stores data and can perform any data-related operation. A shard is either a primary shard or a replica shard. Every document in an index belongs to a single primary shard. A replica shard is simply a copy of a primary shard. It provides redundancy and helps protect data when something goes wrong with a primary shard. Replica shards also improve read performance because they can serve read requests just like a primary shard, whereas write requests are always applied to the primary shard first and then copied to its replicas.
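The read and write behavior of primary and replica shards can be sketched with a toy model. This is an illustration only: ToyShardGroup is a hypothetical class, shards are plain dictionaries, and the round-robin read rotation is an assumption made for simplicity rather than Elasticsearch's actual copy-selection logic.

```python
import itertools


class ToyShardGroup:
    """One primary shard plus its replica copies, modeled as plain dicts."""

    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        # Reads rotate over every copy: the primary and all replicas.
        self._copies = itertools.cycle([self.primary] + self.replicas)

    def write(self, doc_id, doc):
        """Apply the write on the primary first, then copy it to each replica."""
        self.primary[doc_id] = doc
        for replica in self.replicas:
            replica[doc_id] = doc

    def read(self, doc_id):
        """Serve the read from whichever copy is next in the rotation."""
        return next(self._copies).get(doc_id)


group = ToyShardGroup(num_replicas=2)
group.write("1", {"title": "hello"})
# Three reads hit three different copies and all return the same document.
results = [group.read("1") for _ in range(3)]
print(results)
```

Adding replicas in this model adds more copies that can answer reads, while every write still passes through the single primary.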
When creating an index, you should specify the number of primary shards. This number is fixed once the index is created, but you can change the number of replica shards later by updating the index settings.
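One way to see why the number of primary shards is fixed is to look at how a document is assigned to a shard. In simplified form, Elasticsearch hashes a routing value (by default the document ID) and takes the result modulo the number of primary shards. The sketch below assumes that simplified formula; pick_shard is a hypothetical helper, the document IDs are made up, and CRC32 stands in for the hash function Elasticsearch really uses.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Return the primary shard number for a routing value (toy hash, not the real one)."""
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# Hypothetical document IDs used as routing values.
doc_ids = ["user-1", "user-2", "user-3", "user-4"]
placement_5 = {d: pick_shard(d, 5) for d in doc_ids}
placement_6 = {d: pick_shard(d, 6) for d in doc_ids}
print(placement_5)
print(placement_6)
```

Because the shard number depends on the shard count, changing the count would generally send the same document ID to a different shard, and existing documents could no longer be found where the formula says they should be.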
Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. You can think of it as a table schema in the world of relational databases. For example, if a field is defined as Text in the mapping, its data is processed by an Analyzer (which makes it full-text searchable) before it is stored.
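As an illustration, the body of a create-index request that sets both the shard counts and a mapping might look like the following, written here as a Python dictionary and printed as JSON. The field names and the shard and replica counts are made up for this example; the body would be sent with a PUT request to the name of the new index.

```python
import json

# Hypothetical index body: field names and counts are chosen for illustration.
index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards, fixed after creation
        "number_of_replicas": 1,  # replica copies per primary, can be changed later
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},          # analyzed, full-text searchable
            "status": {"type": "keyword"},      # stored as an exact value
            "created_at": {"type": "date"},
            "location": {"type": "geo_point"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Here the title field is analyzed and full-text searchable, while status is a keyword field that is matched only as an exact value.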
4. Summary
These are some of the essential parts that I have learned about Elasticsearch, and I hope they help you understand how Elasticsearch works beneath the surface. Elasticsearch has many other components and terms to learn. I couldn’t cover them all, so I recommend referring to the Elastic documentation for more information.