Elasticsearch / OpenSearch

Elasticsearch is a search engine built on top of Apache Lucene.

It is:

A NoSQL document store (stores JSON documents)

Designed for full-text search, filtering, and analytics

Scalable and distributed

It is accessible through a RESTful web service interface (JSON over HTTP)

You store documents, not rows like in SQL.

🧱 Key Concepts

🖥️ Node

A Node is a single running instance of Elasticsearch.

A single physical or virtual server can host multiple nodes, depending on the system’s resources like RAM, storage, and CPU.

🧮 Cluster

A Cluster is a collection of one or more nodes that together hold your data and provide distributed search and indexing capabilities.

Why Clusters?

🔁 Scalability — Add more nodes to increase capacity.

🛡 Fault Tolerance — If one node fails, others continue operating.

📦 Core Terms in Elasticsearch

Concept	Description
Index	Like a table in SQL. A collection of documents
Document	A single JSON object — like a row in SQL. Every document has a unique ID (UID)
Field	A key-value pair in a document
Mapping	Like a schema: defines field types
Query	How you search documents

🧩 Shard

An Index can grow large, so Elasticsearch splits it into smaller pieces called shards.

Each shard is a self-contained index and can reside on any node.

Shards enable distributed storage and parallel processing.

Types of Shards:

🔹 Primary Shard — The original piece of the index.

🔸 Replica Shard — A copy of the primary shard used for redundancy and load balancing.

Why Shards?

⚡ Scalability — Distribute data across nodes.

⚙️ Performance — Indexing and search operations run in parallel.

Example:

If an index has 5 primary shards and your cluster has 5 nodes, each node can host one shard, balancing the load evenly.

♻️ Replica — The Backup Copies

To protect against data loss and improve search performance, Elasticsearch uses replica shards, which are copies of primary shards.

✅ Ensures high availability — if a node or shard fails, the replica takes over.

🚀 Boosts search performance — queries can hit either primary or replica shards.

Key Points:

You can configure the number of replicas per index.

A replica is never stored on the same node as its corresponding primary shard — to avoid a single point of failure.
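The placement rule above can be illustrated with a toy allocator. This is only a sketch: the function name allocate_shards and the round-robin strategy are assumptions made for illustration, not how Elasticsearch's real allocator decides. A real cluster would also leave a replica unassigned when there is no free node, rather than raise an error as this sketch does.

```python
from collections import defaultdict


def allocate_shards(num_primaries, num_replicas, nodes):
    """Toy round-robin allocation: copies of the same shard never share a node."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("need at least num_replicas + 1 nodes to place every copy")
    layout = defaultdict(list)  # node name -> list of shard labels
    for shard in range(num_primaries):
        # Copy 0 is the primary; copies 1..num_replicas are its replicas.
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "P" if copy == 0 else "R"
            layout[node].append(f"{role}{shard}")
    return dict(layout)


layout = allocate_shards(num_primaries=3, num_replicas=1,
                         nodes=["node-1", "node-2", "node-3"])
for node, shards in sorted(layout.items()):
    print(node, shards)
```

Each node ends up with one primary and one replica of a different shard, so losing any single node still leaves a full copy of every shard in the cluster.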

🛠️ Getting Started

Step 1: Install and Run Elasticsearch

🐳 With Docker (Easiest)


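docker run -d --name elasticsearch \
  -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.0

Test it (the xpack.security.enabled=false flag above turns off TLS and authentication, which 8.x images enable by default; use it for local experiments only):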


curl http://localhost:9200
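The same check can be made from Python with only the standard library. This is a minimal sketch that assumes a node reachable over plain HTTP on localhost:9200 without authentication; the helper names build_request and fetch_json are hypothetical.

```python
import json
import urllib.request


def build_request(path="/", host="localhost", port=9200):
    """Build (but do not send) a GET request for the local node."""
    return urllib.request.Request(f"http://{host}:{port}{path}", method="GET")


def fetch_json(request, timeout=5):
    """Send the request and decode the JSON body; needs a running node."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_request("/")
print(request.get_method(), request.full_url)
```

With a node running, fetch_json(build_request("/")) returns the same JSON that curl prints, including the version number under the "version" key.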

Step 2: Create an Index and Mapping

🏗️ Step 1: Creating an Index

An index in Elasticsearch is like a table in SQL — it stores a collection of JSON documents.

You can create an index with default settings like this:


PUT /library

Or, to include custom settings (like number of shards and replicas):


PUT /library
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

🧬 Step 2: Define a Mapping

Mappings in Elasticsearch define the structure of your documents — similar to a schema in a relational database. You define field types such as text, keyword, date, integer, etc.

Here’s a sample mapping for a book document:


PUT /library/_mapping
{
  "properties": {
    "title": {
      "type": "text"
    },
    "author": {
      "type": "keyword"
    },
    "published_date": {
      "type": "date"
    },
    "pages": {
      "type": "integer"
    },
    "available": {
      "type": "boolean"
    }
  }
}

✅ Alternatively, create the index and its mapping in one request (this only works if the library index does not exist yet):


PUT /library
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "author": { "type": "keyword" },
      "published_date": { "type": "date" },
      "pages": { "type": "integer" },
      "available": { "type": "boolean" }
    }
  }
}
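If you generate these request bodies from code, a small helper keeps settings and mappings together. The sketch below only assembles the JSON for PUT /library; the function name index_body and its parameters are assumptions for illustration, and sending the request is left to whichever HTTP client you use.

```python
import json


def index_body(properties, shards=1, replicas=1):
    """Assemble the JSON body for PUT /<index> from field-name -> type pairs."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {name: {"type": ftype} for name, ftype in properties.items()}
        },
    }


body = index_body(
    {
        "title": "text",
        "author": "keyword",
        "published_date": "date",
        "pages": "integer",
        "available": "boolean",
    },
    shards=3,
)
print(json.dumps(body, indent=2))
```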

Step 3: Add and Update Documents

📄 Adding Document

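Once the index and mapping are ready, you can start inserting data. The examples below write to a products index; if it does not exist yet, Elasticsearch creates it on the first write and infers the field types (dynamic mapping).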


POST /products/_doc/1
{
  "name": "Wireless Mouse",
  "price": 25.99,
  "stock": 50,
  "category": "electronics"
}

📌 Auto-ID Example:


POST /products/_doc
{
  "name": "Gaming Keyboard",
  "price": 59.99,
  "stock": 30,
  "category": "electronics"
}

✏️ Updating Documents in Elasticsearch

🔁 1. Full Replacement (Same ID)

If you index a document with an ID that already exists, the new body replaces the existing document.


PUT /library/_doc/1
{
  "title": "Elasticsearch Essentials - Updated",
  "author": "Abhishek Tiwari",
  "published_date": "2023-08-01",
  "pages": 340,
  "available": false
}

Note: This replaces the entire document. If you omit a field, it gets deleted!

🛠️ 2. Partial Update (Only Specific Fields)

Use the _update API to modify only certain fields. Only the fields inside the doc object change; the rest of the document is preserved:


POST /library/_update/1
{
  "doc": {
    "available": true,
    "pages": 350
  }
}
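The difference between the two update styles can be modelled with plain dictionaries. This is a sketch of the semantics only, using an in-memory dict as a stand-in for the index and top-level fields only; the function names are hypothetical.

```python
def index_document(store, doc_id, document):
    """Full replacement, like PUT /<index>/_doc/<id>: the old body is discarded."""
    store[doc_id] = dict(document)


def partial_update(store, doc_id, doc):
    """Partial update, like POST /<index>/_update/<id> with a "doc" object."""
    if doc_id not in store:
        raise KeyError(f"document {doc_id} does not exist")
    # Fields in `doc` overwrite existing ones; all other fields are kept.
    store[doc_id] = {**store[doc_id], **doc}


store = {}
index_document(store, "1", {"title": "Elasticsearch Essentials", "pages": 340, "available": False})

partial_update(store, "1", {"available": True, "pages": 350})
print(store["1"])  # title survives; pages and available change

index_document(store, "1", {"title": "Elasticsearch Essentials - Updated"})
print(store["1"])  # pages and available are gone after the full replacement
```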

🔍 Step 4: Search Documents (Query DSL)

🔍 What Is Analyzed Search in Elasticsearch?

Analyzed search refers to processing both the document content and the search query through an analyzer before they are stored or compared. This is the default behavior when using the text data type in Elasticsearch.

🔧 What Is an Analyzer?

An analyzer is a combination of:

Tokenizer: Breaks text into individual terms (tokens).

Filters: Modify the tokens (e.g., lowercase, remove stop words, stemming).


Input: "The Quick Brown Foxes"
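Default (standard) analyzer → Tokens: ["the", "quick", "brown", "foxes"]

The standard analyzer lowercases but does not stem or remove stop words by default; "foxes" becomes "fox" only if a stemming filter is configured.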
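To make the tokenizer-plus-filters idea concrete, here is a rough Python approximation. It is an assumption-laden sketch: the real standard analyzer follows Unicode text segmentation rules, whereas this version simply splits on non-word characters, lowercases, and optionally drops stop words.

```python
import re


def analyze(text, stopwords=frozenset()):
    """Rough stand-in for an analyzer: tokenize, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", text)           # tokenizer: split into words
    tokens = [token.lower() for token in tokens]  # filter: lowercase
    return [token for token in tokens if token not in stopwords]  # filter: stop words


print(analyze("The Quick Brown Foxes"))
print(analyze("The Quick Brown Foxes", stopwords={"the"}))
```

Without a stemming filter the plural survives, so the first call keeps "foxes" rather than reducing it to "fox".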

🔄 When Does Analyzing Happen?

At index time (when the document is stored): the value in a text field is analyzed into tokens.

At search time (when you query with match, multi_match, etc.): the query string is also analyzed.

✅ Benefits of Analyzed Search

Feature	Description
Case-insensitive	`"Quick"` matches `"quick"`
Flexible	Can match individual words from a phrase: searching `"brown fox"` finds documents containing either word
Supports stemming	`"running"` can match `"run"` if stemmer is enabled
Language-aware	Can handle language-specific rules

🆚 Analyzed vs Non-Analyzed

Feature	Analyzed (`text`)	Not Analyzed (`keyword`)
Tokenization	✅ Yes	❌ No
Query type used	`match`, `match_phrase`, etc.	`term`, `terms`
Case sensitivity	❌ Case-insensitive (usually)	✅ Case-sensitive (unless lowercased manually)
Sorting, aggregations	❌ Not supported directly	✅ Supported

⚡ Basic structure:


GET /products/_search
{
  "query": {
    "match": {
      "name": "keyboard"
    }
  }
}

This is using the Query DSL (Domain-Specific Language) — JSON-based syntax for querying.

What is Query DSL?

It has the format:


{
  "query": {
    "match" | "term" | "bool" | "range" | ...
  }
}

📌 Query Examples

🔸 Example: Match Query

Document:


{
  "description": "The quick brown fox jumps"
}

✅ Match query (Analyzed Search)


{
  "match": {
    "description": "Quick Fox"
  }
}

Both query and document are analyzed.

Tokens compared: "quick", "fox"

Match ✅

❌ Term query (No Analysis)


{
  "term": {
    "description": "Quick Fox"
  }
}

The query string is not analyzed.

The exact string "Quick Fox" is compared against the indexed tokens ("the", "quick", "brown", "fox", "jumps"), and none of them equals it → No match ❌
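The contrast between the two queries can be reproduced in a few lines of Python. This is a simplified model, assuming a text field analyzed by splitting on non-word characters and lowercasing; the functions match and term are illustrative stand-ins for the real queries.

```python
import re


def analyze(text):
    """Simplified analysis: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def match(field_value, query_text):
    """match: analyze the query, succeed if any query token is indexed (default OR)."""
    indexed = set(analyze(field_value))
    return any(token in indexed for token in analyze(query_text))


def term(field_value, query_text):
    """term on a text field: the raw query string must equal one indexed token."""
    return query_text in set(analyze(field_value))


description = "The quick brown fox jumps"
print(match(description, "Quick Fox"))  # query is analyzed into "quick", "fox"
print(term(description, "Quick Fox"))   # raw string is not among the tokens
print(term(description, "quick"))       # a single lowercase token is
```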

🔸 Example: Range Query


{
  "query": {
    "range": {
      "price": {
        "gte": 20,
        "lte": 60
      }
    }
  }
}

🔸 Example: Bool Query (AND, OR, NOT)


{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "mouse" } },
        { "term": { "category": "electronics" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}

must Clause: Ensures that documents match both the name and category conditions. These matches contribute to the relevance score, allowing Elasticsearch to rank the results based on how well they match these criteria.

filter Clause: Applies a price constraint, filtering out documents where the price is greater than 50. This condition does not affect the relevance score, ensuring that scoring is based solely on the must conditions.

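should clause: works like OR. Matching documents score higher; when the bool query has no must or filter clause, at least one should clause has to match.

must_not clause: works like NOT. Matching documents are excluded, and the clause does not affect the relevance score.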
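When queries are built programmatically, a small helper can keep the four clause types straight. The sketch below only assembles the request body; the function name bool_query and its parameter names are assumptions for illustration.

```python
import json


def bool_query(must=None, filters=None, should=None, must_not=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {"must": must, "filter": filters, "should": should, "must_not": must_not}
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}


body = bool_query(
    must=[{"match": {"name": "mouse"}}, {"term": {"category": "electronics"}}],
    filters=[{"range": {"price": {"lte": 50}}}],
)
print(json.dumps(body, indent=2))
```

Putting the price condition under filters rather than must keeps it out of the relevance score, which matches the behaviour described above.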

🎯 Summary of What You Learned

Operation	Example
Create Index	`PUT /products`
Add Document	`POST /products/_doc/1`
Get Document	`GET /products/_doc/1`
Search	`GET /products/_search`
Match Query	`match: { name: "mouse" }`
Exact Match	`term: { category: "electronics" }`
Range	`range: { price: { gte: 20 }}`

Mapping

1. Defining Mappings at Index Creation

When creating a new index, wrap your mapping under the mappings key:


PUT /my_index
{
  "mappings": {
    "properties": {
      "<field1>": { <field1_definition> },
      "<field2>": { <field2_definition> },
      ...
    }
  }
}

properties: the container for field definitions.

<fieldN>: each field’s name; each must specify at least a type (e.g., text, keyword, date, integer)

Example


PUT /products
{
  "mappings": {
    "properties": {
      "name":       { "type": "text" },
      "description":{ 
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "price":      { "type": "float" },
      "brand":      { "type": "keyword" }
    }
  }
}

This creates a products index where:

description is a full‐text text field, with a sub‐field description.keyword for exact matches.

2. Updating Mappings on an Existing Index

Use the Put Mapping API to add new fields or multi‐fields:


PUT /products/_mapping
{
  "properties": {
    "<new_field>": { <definition> }
  }
}

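💡 You cannot change the type of an existing field; you can only add new properties or multi-fields.

Correct way: create a new index with the proper mapping and copy the data over with the Reindex API (POST /_reindex), or delete and recreate the index if the data can be reloaded from its source.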
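When the existing data must be kept, a common alternative to deleting the index outright is to copy it into a new index through the Reindex API. The sketch below only lists the requests in order; the index name products_v2 and the helper reindex_plan are hypothetical, and the final delete should only run after the copy has been verified.

```python
import json


def reindex_plan(old_index, new_index, new_mappings):
    """Return the (method, path, body) requests for moving to a corrected mapping."""
    return [
        # 1. Create the new index with the corrected mapping.
        ("PUT", f"/{new_index}", {"mappings": new_mappings}),
        # 2. Copy every document from the old index into the new one.
        ("POST", "/_reindex", {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 3. Remove the old index once the copy has been checked.
        ("DELETE", f"/{old_index}", None),
    ]


plan = reindex_plan("products", "products_v2", {"properties": {"price": {"type": "float"}}})
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```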

3. Key Components of a Mapping Definition

3.1. Field Types (type)

Each field must declare a type. Common types include:

text: analyzed string for full‐text search

keyword: not analyzed; good for aggregations, sorting, exact match

Numeric types: integer, long, float, double

date: date formats

boolean, geo_point, nested, etc.

3.2. Analyzers and index Settings

analyzer: specify custom analysis chain (tokenizer, filters) for text fields

index: set to false if you want to store the field but not index it for search

3.3. Multi-Fields (fields)

Allow indexing the same data in different ways. Example: full‐text and exact match:


"description": {
  "type": "text",
  "fields": {
    "raw": { "type": "keyword" }
  }
}

3.4. _source and Metadata Fields

_source: controls how the original JSON document is stored/retrieved; can disable or apply include/exclude filters

Meta-fields: _id, _type (deprecated in 7.x, removed in 8.x), _all (removed), etc.

3.5. Dynamic Mappings (dynamic, dynamic_templates)

dynamic: true (default)—automatically add new fields; false—ignore new fields; strict—reject documents with unmapped fields

dynamic_templates: pattern-based rules to apply custom mappings to matching field names.

4. Full Example


PUT /articles
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "author": {
        "type": "keyword"
      },
      "publish_date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "comments": {
        "type": "nested",
        "properties": {
          "user":    { "type": "keyword" },
          "message": { "type": "text" },
          "date":    { "type": "date" }
        }
      }
    },
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

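Here:

new unmapped fields cause errors (dynamic: "strict")

because strict rejects unmapped fields before any dynamic mapping happens, the strings_as_keywords template never fires in this index as written; it would map new string fields as keyword only where dynamic is set to true (at the top level or on an inner object)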
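The three dynamic settings from section 3.5 can be modelled roughly as follows. This is a toy model under stated assumptions: it handles top-level fields only, tracks the mapping as a set of field names, and the function apply_dynamic is hypothetical.

```python
def apply_dynamic(mapped_fields, document, dynamic="true"):
    """Return (indexed_fields, updated_mapping) for one document, or raise under strict."""
    mapping = set(mapped_fields)
    unknown = [name for name in document if name not in mapping]
    if dynamic == "strict" and unknown:
        raise ValueError(f"unmapped fields rejected: {unknown}")
    if dynamic == "true":
        mapping.update(unknown)  # new fields are added to the mapping
    # Under "false", unknown fields stay in _source but are not indexed.
    indexed = {name: value for name, value in document.items() if name in mapping}
    return indexed, mapping


document = {"title": "Hello", "tags": "news"}
for mode in ("true", "false", "strict"):
    try:
        indexed, mapping = apply_dynamic({"title"}, document, dynamic=mode)
        print(mode, sorted(indexed), sorted(mapping))
    except ValueError as error:
        print(mode, "->", error)
```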

🎯 Elasticsearch Search Queries

1. 🔸 Match Query — Analyzed Full-Text Search

Use this when you want to search text fields that are analyzed (tokenized and normalized).

Example:


{
  "query": {
    "match": {
      "description": "quick fox"
    }
  }
}

Query terms and document fields are both analyzed.

Matches documents containing tokens like "quick" and "fox" in the description field.

2. 🔸 Term Query — Exact Match (No Analysis)

Use this for exact matches on keyword or non-analyzed fields.

Example:


{
  "query": {
    "term": {
      "category.keyword": "electronics"
    }
  }
}

Searches for exact term "electronics" in the category.keyword field.

No text analysis; case sensitive and exact match.

3. 🔸 Range Query — Numeric or Date Ranges

Filter documents with values between specified boundaries.

Example:


{
  "query": {
    "range": {
      "price": {
        "gte": 20,
        "lte": 60
      }
    }
  }
}

Finds documents where price is between 20 and 60 (inclusive).

4. 🔸 Bool Query — Combine Queries with AND, OR, NOT

Compose complex queries using must (AND), should (OR), and must_not (NOT).

Example:


{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "mouse" } },
        { "term": { "category.keyword": "electronics" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 50 } } }
      ],
      "must_not": [
        { "term": { "brand": "brandX" } }
      ]
    }
  }
}

Documents must have "mouse" in the name AND category "electronics".

Price must be less than or equal to 50.

Excludes documents from "brandX".

5. 🔸 Fuzzy Query — Search with Typos or Approximate Matches

Great for user input with spelling mistakes or typos.

Example:


{
  "query": {
    "fuzzy": {
      "name": {
        "value": "wirless",
        "fuzziness": "AUTO"
      }
    }
  }
}

Matches similar terms like "wireless" or "wiresless".
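Fuzzy matching is based on edit distance. The sketch below uses the classic Levenshtein distance and the default AUTO thresholds (terms of 1 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two). One simplification to note: Elasticsearch by default also counts a swap of two adjacent characters as a single edit, which plain Levenshtein does not.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edits allowed by "fuzziness": "AUTO" with the default length thresholds."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


query = "wirless"
candidates = ["wireless", "wiresless", "wired", "mouse"]
allowed = auto_fuzziness(query)
print([term for term in candidates if edit_distance(query, term) <= allowed])
```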

6. 🔸 Prefix Query — Autocomplete with Prefix Matching

Search documents where a field starts with a given prefix.

Example:


{
  "query": {
    "prefix": {
      "name": "blu"
    }
  }
}

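Matches documents whose name field contains a term starting with "blu", such as "bluetooth" or "blue". The prefix itself is not analyzed, so write it in lowercase when querying text fields.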

7. 🔸 Completion Suggester — Efficient Autocomplete Suggestions

Designed for fast autocomplete and suggestion features.

Setup:


PUT /products
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion"
      }
    }
  }
}

Indexing Document:


POST /products/_doc/1
{
  "name": "Wireless Mouse",
  "suggest": {
    "input": ["wireless mouse", "mouse", "computer accessory"]
  }
}

Query:


POST /products/_search
{
  "suggest": {
    "product-suggest": {
      "prefix": "wire",
      "completion": {
        "field": "suggest"
      }
    }
  }
}

Returns autocomplete suggestions as user types "wire".

8. 🔸 Nested Query — Querying Arrays of Objects

If your documents contain nested objects (arrays of JSON objects), use nested queries to query fields inside the same nested object.

Document Example:


{
  "name": "Gaming Laptop",
  "features": [
    { "name": "RAM", "value": "16GB" },
    { "name": "GPU", "value": "NVIDIA" }
  ]
}

Mapping:


PUT /electronics
{
  "mappings": {
    "properties": {
      "features": {
        "type": "nested"
      }
    }
  }
}

Query:


POST /electronics/_search
{
  "query": {
    "nested": {
      "path": "features",
      "query": {
        "bool": {
          "must": [
            { "match": { "features.name": "GPU" } },
            { "match": { "features.value": "NVIDIA" } }
          ]
        }
      }
    }
  }
}

Matches documents where the same nested object has "GPU" as name and "NVIDIA" as value.
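Why the nested type matters becomes clearer when you compare it with how a plain array of objects is indexed. In the sketch below, flattened_match imitates the default object behaviour, where names and values end up in two unrelated lists, and nested_match imitates the nested query. Both functions are illustrative stand-ins that compare exact strings and ignore text analysis.

```python
def flattened_match(features, name, value):
    """Plain object arrays: names and values are indexed as two unrelated lists."""
    names = {feature["name"] for feature in features}
    values = {feature["value"] for feature in features}
    return name in names and value in values


def nested_match(features, name, value):
    """nested: both conditions must hold inside the same object."""
    return any(f["name"] == name and f["value"] == value for f in features)


features = [
    {"name": "RAM", "value": "16GB"},
    {"name": "GPU", "value": "NVIDIA"},
]
print(flattened_match(features, "RAM", "NVIDIA"))  # cross-object false positive
print(nested_match(features, "RAM", "NVIDIA"))     # no single object has both
print(nested_match(features, "GPU", "NVIDIA"))     # same object has both
```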

📊 Elasticsearch Aggregations

Aggregations are Elasticsearch’s way to summarize and analyze your data. They work like group by, count, sum, avg, and other analytics in SQL.

Basic Request Structure:


GET /index_name/_search
{
  "query": {
    // your query here (e.g. match, bool, range)
  },
  "aggs": {
    // your aggregations here
  }
}

Common Aggregation Types (Clauses)

1. terms — Group by field values (like GROUP BY in SQL)

Groups documents by unique values of a field and returns counts per group.


"aggs": {
  "by_category": {
    "terms": {
      "field": "category.keyword",
      "size": 10
    }
  }
}

Returns the top 10 categories and how many documents each has.

Use .keyword for exact term aggregation on text fields.

2. avg — Average of numeric field

Calculates average of a numeric field.


"aggs": {
  "average_price": {
    "avg": {
      "field": "price"
    }
  }
}

3. sum — Sum of numeric field

Calculates the total sum.


"aggs": {
  "total_sales": {
    "sum": {
      "field": "sales"
    }
  }
}

4. min and max — Minimum and maximum values


"aggs": {
  "min_price": {
    "min": { "field": "price" }
  },
  "max_price": {
    "max": { "field": "price" }
  }
}

5. stats — Summary stats (count, min, max, avg, sum)


"aggs": {
  "price_stats": {
    "stats": {
      "field": "price"
    }
  }
}

6. date_histogram — Group by date intervals

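Great for time-series data: groups documents into time buckets. Use calendar_interval for calendar-aware units such as month (whose length varies) or fixed_interval for fixed-length units such as 30d.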


"aggs": {
  "sales_over_time": {
    "date_histogram": {
      "field": "order_date",
      "calendar_interval": "month"
    }
  }
}
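Conceptually, a monthly date_histogram counts documents per calendar month. The Python sketch below shows that bucketing on a handful of made-up dates; the function monthly_buckets and the output field names are hypothetical and do not mirror the exact response format.

```python
from collections import Counter
from datetime import date


def monthly_buckets(order_dates):
    """Count documents per calendar month, keyed by the first day of the month."""
    counts = Counter(d.replace(day=1) for d in order_dates)
    return [
        {"month": month.isoformat(), "doc_count": count}
        for month, count in sorted(counts.items())
    ]


orders = [date(2024, 1, 5), date(2024, 1, 28), date(2024, 2, 14)]
for bucket in monthly_buckets(orders):
    print(bucket)
```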

7. filter — Apply a filter inside aggregations

Narrows the set of documents that a sub-aggregation sees, without changing the search hits returned by the query.


"aggs": {
  "electronics_sales": {
    "filter": {
      "term": { "category.keyword": "electronics" }
    },
    "aggs": {
      "avg_price": { "avg": { "field": "price" } }
    }
  }
}

Nested Aggregations

You can nest aggregations inside others for deeper insights.

Example: Top categories → average price per category


"aggs": {
  "by_category": {
    "terms": {
      "field": "category.keyword"
    },
    "aggs": {
      "average_price": {
        "avg": {
          "field": "price"
        }
      }
    }
  }
}

Full Example: Search with Aggregations


GET /products/_search
{
  "query": {
    "range": {
      "price": { "gte": 20 }
    }
  },
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 5
      },
      "aggs": {
        "average_price": {
          "avg": { "field": "price" }
        }
      }
    },
    "price_stats": {
      "stats": { "field": "price" }
    }
  }
}

Filters products priced 20 or above.

Groups by top 5 categories and calculates average price per category.

Returns overall price statistics.
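To see what the terms-plus-avg part of this request computes, here is the same logic over an in-memory list. It is a sketch only: the sample products are made up, the function terms_with_avg is hypothetical, and the bucket layout is a simplified imitation of the response.

```python
from collections import defaultdict
from statistics import mean


def terms_with_avg(docs, group_field, value_field, size=5, min_value=None):
    """Group docs by a field, count each group, and average a numeric field."""
    groups = defaultdict(list)
    for doc in docs:
        if min_value is not None and doc[value_field] < min_value:
            continue  # plays the role of the range query
        groups[doc[group_field]].append(doc[value_field])
    buckets = [
        {"key": key, "doc_count": len(values), "average_price": {"value": mean(values)}}
        for key, values in groups.items()
    ]
    # Like the terms aggregation, return the largest groups first.
    buckets.sort(key=lambda bucket: bucket["doc_count"], reverse=True)
    return buckets[:size]


products = [
    {"category": "electronics", "price": 30.0},
    {"category": "electronics", "price": 50.0},
    {"category": "books", "price": 20.0},
    {"category": "toys", "price": 10.0},  # excluded by the price filter
]
for bucket in terms_with_avg(products, "category", "price", size=5, min_value=20):
    print(bucket)
```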

🧠 Final Thoughts

Aggregation Type	Purpose	Basic Syntax Example
terms	Group by unique values	`"terms": { "field": "category.keyword" }`
avg	Average of numeric field	`"avg": { "field": "price" }`
sum	Sum of numeric field	`"sum": { "field": "sales" }`
min / max	Minimum and maximum values	`"min": { "field": "price" }`, `"max": {...}`
stats	Count, min, max, avg, sum summary	`"stats": { "field": "price" }`
date_histogram	Group by date intervals	`"date_histogram": { "field": "order_date", "calendar_interval": "month" }`
filter	Filter aggregation scope	`"filter": { "term": { "category.keyword": "x" } }`