Elasticsearch is a powerful and versatile search and analytics engine known for its scalability, speed, and ease of use. Its architecture has several key components and features that make it suitable for a wide range of applications:
- Cluster and Node Architecture:
- Cluster: A collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name.
- Node: A single server that is part of the cluster, stores data, and participates in the cluster’s indexing and search capabilities.
- Index and Document Model:
- Index: An index in Elasticsearch is roughly analogous to a ‘database’ in a relational system. It is where documents are stored and indexed; an index can be created explicitly or automatically when the first document is indexed into it.
- Document: A document is the basic unit of information that can be indexed. Elasticsearch stores complex real-world entities as structured JSON documents, each identified by a unique ID within its index. (Mapping types have been deprecated and removed in recent Elasticsearch versions, so documents no longer carry a separate type.)
- Sharding and Replication:
- Sharding: Elasticsearch provides horizontal scalability through the concept of ‘shards’. An index can potentially store a large amount of data that can exceed the hardware limits of a single node. To solve this, Elasticsearch provides the ability to subdivide the index into multiple pieces called shards.
- Replication: To ensure high availability and data durability, Elasticsearch allows an index to be replicated, which means each shard of the index can have one or more copies known as replica shards.
- Distributed Nature:
- Elasticsearch automatically distributes data and query load across all the nodes in the cluster. It also manages the reorganization of data in the cluster as nodes are added or removed.
- Near Real-Time (NRT) Search:
- Elasticsearch is a near real-time search platform. This means there is a slight latency (usually one second) from the time a document is indexed until it becomes searchable.
- RESTful API:
- Elasticsearch exposes a RESTful API: all interactions are JSON over HTTP, which makes it easy to use and to integrate with other applications (the request sketch after this list shows the API, Query DSL, and aggregations in action).
- Search and Query DSL:
- Elasticsearch provides powerful search capabilities. It supports structured, unstructured, geographical, and metric searches. The Query DSL (Domain Specific Language) is very flexible and allows for the construction of complex queries.
- Analysis and Text Processing:
- Elasticsearch performs extensive text analysis and processing, which enables advanced full-text search capabilities. This includes language analyzers that can break text into tokens, filter them, and convert them into terms stored in an inverted index for fast search.
- Aggregations:
- Elasticsearch supports a variety of aggregations that allow you to summarize, calculate, and analyze your data in real time. This is crucial for analytics and visualization purposes.
- Elastic Stack Integration:
- Elasticsearch is often used as a component of the Elastic Stack (formerly known as ELK Stack), which includes Elasticsearch, Logstash (for data processing and ingestion), and Kibana (for data visualization).
- Scalability and Performance:
- Designed for scalability, Elasticsearch performs well in distributed environments. It is capable of handling petabytes of data and can run on hardware ranging from cloud instances to dedicated servers.
- Security and Access Control:
- Advanced features for security, like role-based access control, SSL/TLS encryption, and audit logging are available, ensuring data protection and compliance with various standards.
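As a concrete illustration of the RESTful API, the Query DSL, and aggregations described above, here is a minimal sketch using Python's `requests` library. It assumes a local single-node cluster at `http://localhost:9200` with security disabled; the `products` index, its fields, and the document IDs are hypothetical.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster, security disabled

# Index a couple of JSON documents (PUT /{index}/_doc/{id}).
# The 'products' index and its fields are made up for illustration.
for doc_id, doc in enumerate([
    {"name": "espresso machine", "category": "kitchen", "price": 199.0},
    {"name": "coffee grinder",   "category": "kitchen", "price": 59.0},
], start=1):
    requests.put(f"{ES}/products/_doc/{doc_id}", json=doc).raise_for_status()

# Force a refresh so the documents are immediately searchable
# (normally this happens automatically within about a second).
requests.post(f"{ES}/products/_refresh")

# Query DSL: a full-text match query combined with a terms aggregation.
body = {
    "query": {"match": {"name": "coffee"}},
    "aggs": {"by_category": {"terms": {"field": "category.keyword"}}},
}
resp = requests.post(f"{ES}/products/_search", json=body).json()
print(resp["hits"]["total"]["value"])
print(resp["aggregations"]["by_category"]["buckets"])
```

The `category.keyword` sub-field relies on Elasticsearch's default dynamic mapping, which indexes string fields both as analyzed text and as an exact-match keyword.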
Data Ingestion and Storage
In Elasticsearch, data ingestion, storage, and replication involve several key steps and components.
- Data Ingestion:
- Ingestion Methods: Data can be ingested into Elasticsearch through various methods, including Elasticsearch’s RESTful API, Logstash (for log and event data), Beats (lightweight data shippers), or custom tooling (the sketch after this list shows a bulk request).
- Data Formats: Data is typically ingested in JSON format. However, tools like Logstash can process and transform various formats before they are sent to Elasticsearch.
- Data Indexing:
- Document Indexing: Once the data reaches Elasticsearch, it is treated as a document. Each document is indexed and stored in a specified index. If the index does not exist, it is created automatically.
- Analysis and Tokenization: During indexing, the document’s text fields are analyzed and broken down into tokens (terms) to facilitate efficient searching. This process includes lowercasing, removing stop words, stemming, etc.
- Distributed Architecture:
- Sharding: An index in Elasticsearch is divided into shards, which are essentially smaller, manageable pieces of the index. Sharding enables Elasticsearch to horizontally scale as data grows and distribute data across multiple nodes.
- Primary and Replica Shards: Each index has primary shards and replica shards. Primary shards are where the data is first written. Replica shards are copies of primary shards, providing redundancy and increased read capacity.
- Data Routing and Storage:
- Storage on Nodes: Each shard is stored on a node in the Elasticsearch cluster. When a document is indexed, it is routed to a particular shard based on a hash of its ID.
- Document Storage: Internally, Elasticsearch uses Lucene to store data in an inverted index for fast text searches. The data is saved on disk in segments.
- Data Replication:
- Synchronization: When a document is indexed or updated in a primary shard, the same operation is automatically replicated to the corresponding replica shards.
- Fault Tolerance: Replica shards are distributed across different nodes. This ensures that in case of a node failure, a replica shard from another node can be promoted to a primary shard, ensuring data availability and reliability.
- Write Consistency: Before acknowledging an indexing operation, the primary shard replicates it to its in-sync replica copies; the wait_for_active_shards setting additionally controls how many shard copies must be active before the operation is attempted.
- Near Real-Time (NRT) Search:
- Refresh Interval: Elasticsearch periodically refreshes shards to make newly indexed documents available for search. This interval is by default set to 1 second, providing NRT search capability.
- Cluster Health and Balancing:
- Cluster Management: Elasticsearch continuously monitors cluster health and rebalances data within the cluster by moving shards between nodes as needed to ensure uniform data distribution and to handle node additions or failures.
- Scalability and Performance:
- Horizontal Scaling: Elasticsearch scales horizontally, meaning you can add more nodes to the cluster to distribute the load and data.
- Load Balancing: Read and write requests are automatically load balanced across the nodes in the cluster, with considerations for shard location and node health.
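To make the ingestion and analysis steps concrete, the sketch below bulk-indexes a couple of documents through the `_bulk` endpoint and then runs the same text through the `_analyze` API to show the tokens that would be stored in the inverted index. Same assumptions as the earlier sketch: a local cluster at `http://localhost:9200` with security disabled, and a hypothetical `logs` index.

```python
import json
import requests

ES = "http://localhost:9200"  # assumed local cluster, security disabled

# _bulk expects newline-delimited JSON: an action line followed by a source line.
docs = [
    {"message": "Disk usage exceeded 90% on host-42"},
    {"message": "Disk usage back to normal on host-42"},
]
lines = []
for i, doc in enumerate(docs, start=1):
    lines.append(json.dumps({"index": {"_index": "logs", "_id": str(i)}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"  # trailing newline is required

resp = requests.post(
    f"{ES}/_bulk",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
).json()
print("bulk errors:", resp["errors"])

# _analyze shows how a text value is tokenized before being written
# to the inverted index (lowercasing, splitting on word boundaries, ...).
analysis = requests.post(
    f"{ES}/_analyze",
    json={"analyzer": "standard", "text": "Disk usage exceeded 90% on host-42"},
).json()
print([t["token"] for t in analysis["tokens"]])
```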
Near Real-Time (NRT) Search in ES
Elasticsearch achieves Near Real-Time (NRT) search through a combination of techniques and architectural choices, primarily focusing on how data is indexed and made available for search. Here’s a technical breakdown:
- Inverted Index:
- At its core, Elasticsearch uses an inverted index, which is particularly efficient for quick text search. An inverted index lists all unique words that appear in any document and identifies all documents each word occurs in.
- When a document is indexed in Elasticsearch, it is converted into a set of tokens which are then stored in the inverted index.
- Lucene Index:
- Elasticsearch is built on top of Apache Lucene, a high-performance text search engine library. Lucene maintains an index for each field in a document. These indexes are then used in executing search queries.
- The Lucene index is divided into segments, which are immutable. When new data is indexed, Lucene writes it to an in-memory buffer. Periodically, this buffer is flushed to create new segments.
- Refresh Interval:
- The refresh interval in Elasticsearch is a key factor in its NRT capabilities. The refresh process makes newly indexed documents available for search.
- By default, this interval is 1 second, so newly indexed documents typically become searchable within about a second (the sketch after this list demonstrates the effect of an explicit refresh).
- The refresh interval can be configured based on the requirements of the application using Elasticsearch.
- Transaction Log (Translog):
- When a document is indexed, the operation is first written to a transaction log (translog). This ensures durability and provides a way to recover operations that have not yet been flushed to a Lucene segment in case of a crash.
- After a crash, Elasticsearch can recover data from this log, ensuring no data loss.
- Doc Values and Field Data Cache:
- For aggregations and sorting, Elasticsearch uses Doc Values, which are a columnar data representation format where each column contains the values of a single field of the documents.
- The field data cache supports aggregations and sorting on text fields. It is built on the fly in heap memory and can be memory intensive, which is why fielddata is disabled on text fields by default and must be enabled explicitly in the mapping.
- Segment Merging:
- Over time, Elasticsearch merges smaller segments into larger ones in the background. This process is resource-intensive but necessary for maintaining search performance.
- Merging helps to keep the index size under control and improves query performance by reducing the number of segments that must be searched.
- Distributed Architecture:
- Elasticsearch’s distributed nature allows for horizontal scaling. This means that as data and query volume grows, more nodes can be added to distribute the load effectively.
- Optimization for Read and Write:
- Elasticsearch optimizes for both read and write operations. Write operations are fast because of the use of in-memory buffers and the eventual flushing of these buffers to create new segments.
- Read operations are fast because of the efficient data structures like the inverted index and the use of caches.
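The refresh behaviour described above can be observed directly. The sketch below (same assumed local cluster; the `nrt-demo` index and `note` field are hypothetical) indexes a document, searches before and after an explicit `_refresh`, and then relaxes the per-index `refresh_interval`.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster, security disabled

def hit_count(index: str, word: str) -> int:
    """Return how many documents match a simple full-text query."""
    body = {"query": {"match": {"note": word}}}
    resp = requests.post(f"{ES}/{index}/_search", json=body).json()
    return resp["hits"]["total"]["value"]

# Index a document without waiting for a refresh.
requests.post(f"{ES}/nrt-demo/_doc", json={"note": "freshly indexed"})

# Immediately after indexing, the document may not be visible yet:
# it only becomes searchable after the next refresh (default: ~1s).
print("before refresh:", hit_count("nrt-demo", "freshly"))

# Force a refresh; now the document is searchable.
requests.post(f"{ES}/nrt-demo/_refresh")
print("after refresh:", hit_count("nrt-demo", "freshly"))

# The interval can be tuned per index, e.g. relaxed for heavy ingest workloads.
requests.put(
    f"{ES}/nrt-demo/_settings",
    json={"index": {"refresh_interval": "30s"}},
)
```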
Sharding and Replication in ES
Sharding and replication are two fundamental aspects of Elasticsearch’s architecture, enabling it to handle large datasets and maintain high availability and resilience. Here’s how they work:
Sharding
- Purpose of Sharding:
- Horizontal Scaling: Sharding allows Elasticsearch to distribute data across multiple nodes, enabling horizontal scaling. This means large indices that cannot fit into the storage of a single node can be broken down and stored across several nodes.
- How Sharding Works:
- Index Division: When an index is created in Elasticsearch, it is divided into a fixed number of primary shards. This number is set at index creation time and cannot be changed afterwards without reindexing (or using the shrink/split APIs); see the sketch after this list.
- Shard Placement: Each shard is a fully functional and independent “index” that can be hosted on any node in the cluster.
- Document Routing: When a document is indexed, Elasticsearch uses a hash of the document’s ID to determine which shard the document will be stored in. This ensures a uniform distribution of documents across shards.
- Types of Shards:
- Primary Shards: The original shards where data is first stored. The number of primary shards is defined when an index is created.
- Replica Shards: Copies of primary shards. They provide redundancy and increased data availability. Replica shards can serve read requests, thereby increasing read throughput.
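To make the shard settings concrete, the sketch below creates a hypothetical `events` index with three primary shards and one replica, indexes a few documents, and lists shard placement with the `_cat/shards` API (same assumed local cluster as before).

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster, security disabled

# The number of primary shards is fixed at creation time;
# the number of replicas can be changed later.
requests.put(
    f"{ES}/events",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
).raise_for_status()

# Documents are routed to a shard by hashing their ID, so they spread
# roughly evenly across the three primary shards.
for i in range(10):
    requests.put(f"{ES}/events/_doc/{i}", json={"seq": i})

# _cat/shards shows each shard (p = primary, r = replica), its state,
# its document count, and the node it lives on.
print(requests.get(f"{ES}/_cat/shards/events?v").text)
```

On a single-node development cluster the replica copies have nowhere to go, so `_cat/shards` will show them as UNASSIGNED and the index health will be yellow; on a multi-node cluster they are placed on nodes other than their primaries.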
Replication
- Purpose of Replication:
- High Availability and Fault Tolerance: Replication ensures that in the event of hardware failure or node loss, no data is lost and the search and indexing operations can continue without interruption.
- Increased Read Throughput: Replica shards can serve read requests (search and retrieve operations), thus increasing the read capacity of the system.
- How Replication Works:
- Automatic Replica Creation: When an index is created, Elasticsearch creates not only the primary shards but also replica shards according to the number_of_replicas setting (one replica per primary by default).
- Dynamic Replica Allocation: Replica shards are distributed across different nodes than their corresponding primary shards. Elasticsearch dynamically manages the placement of replica shards to ensure even distribution across the cluster.
- Synchronization: When a document is indexed on a primary shard, the operation is forwarded to all of its in-sync replica shards, and the primary waits for those copies to acknowledge before the write itself is acknowledged, keeping replicas synchronized with the primary.
- Failover Process: If a node holding a primary shard fails, one of the replica shards (from another node) is promoted to a primary shard. Elasticsearch then creates a new replica shard on a different node to maintain the desired level of replication.
- Consistency and Performance:
- Write Consistency: A write is acknowledged only after it has been applied on the primary shard and replicated to the in-sync replica copies; the wait_for_active_shards setting controls how many shard copies must be active before the operation proceeds (see the sketch after this list).
- Read Performance: Since replica shards can handle read requests, they can be used to balance the read load, improving overall read performance.
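The replication behaviour above maps onto a few concrete APIs. The sketch below (same assumed cluster, reusing the hypothetical `events` index from the previous example) changes the replica count, checks cluster health, and uses `wait_for_active_shards` on a write.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster, security disabled

# The replica count can be changed at any time; the primary count cannot.
requests.put(
    f"{ES}/events/_settings",
    json={"index": {"number_of_replicas": 2}},
)

# Cluster health reflects replica allocation: 'green' when all shards
# (primaries and replicas) are assigned, 'yellow' when replicas are not.
health = requests.get(f"{ES}/_cluster/health/events").json()
print(health["status"], health["unassigned_shards"])

# wait_for_active_shards makes the write wait until that many shard copies
# (primary + replicas) are active before the operation is attempted.
resp = requests.put(
    f"{ES}/events/_doc/1001?wait_for_active_shards=2&timeout=5s",
    json={"seq": 1001},
)
print(resp.status_code)
```

On a single-node cluster the final request would time out, because only the primary copy can ever be active; once at least one replica is allocated on another node, the write proceeds as soon as both copies are active.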