DECEMBER 14, 2022

Elasticsearch Demystified - Part 2

Aditya Chouhan

This is part two of the Elasticsearch series. If you have not read the first part, please check it here. The first part covered the basics of Elasticsearch (ES), the index (its underlying data structure), and the different features of ES. In this second part, we will dive deeper into some practical aspects. We will see how to index data and the various ways of querying or searching data in Elasticsearch. This blog has the following sub-topics, designed to explain CRUD operations in Elasticsearch.

Getting Started with Elasticsearch
Creating and Inserting Data into Index in Elasticsearch
Querying/Searching Data in Elasticsearch
Relevance Scoring in Searching
Deleting Data/Indexes in Elasticsearch

Getting Started with Elasticsearch

There are various ways of setting up Elasticsearch. You can either set it up locally (self-managed) or opt for a managed service. All leading cloud providers (AWS, Azure, and GCP) offer Elasticsearch as a managed service. In addition to these external cloud providers, there is also Elastic Cloud, which is owned and managed by the creators of Elasticsearch. You can compare these options and, based on your requirements, select one of them for an enterprise-level implementation.

In addition to this, one can also set up Elasticsearch locally. It offers installation packages for all popular operating systems, including Windows, Linux, and macOS. You can also run Elasticsearch in Docker containers after downloading the required images from the Elastic Docker Registry.

Please refer to this official tutorial on how to set up Elasticsearch. This tutorial covers everything you need, right from downloading to installing and configuring it locally.

When we run it locally, Elasticsearch runs on port 9200 by default and Kibana runs on port 5601 (more about Kibana in the third part of this series). Elasticsearch is a JSON document-based database and is REST API compatible, which means you can use standard HTTP methods (GET, PUT, POST, and DELETE) to perform operations on it.
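As a minimal sketch of what "REST API compatible" means in practice, the Python snippet below builds (but does not send) an HTTP request for a local node. The build_request helper and the index name my-index-0001 are illustrative assumptions; only the default port 9200 comes from the text above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # default local port mentioned above


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for a local Elasticsearch node."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{ES_URL}/{path.lstrip('/')}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index name, used only for illustration.
req = build_request("GET", "/my-index-0001/_search", {"query": {"match_all": {}}})
print(req.get_method(), req.full_url)
# To send it to a running node: urllib.request.urlopen(req)
```

Depending on the version and security configuration of your installation, the node may require HTTPS and credentials, so treat the plain http URL as an assumption for a local test setup.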

Creating and Inserting Data into Index in Elasticsearch

Once you have it running on your local system, you create a structure where you can insert the data. As mentioned in my previous blog, Elasticsearch is a document database, and it stores data in the form of JSON documents. Similar to tables in RDBMS, we have Indices here. You can either use a direct HTTP PUT REST call to create an index or you can log in to a query editor (Kibana Developer Console) and manually run the PUT command to create one. When creating an index, you need to specify the following:

Index Name- Required, string type
Index settings- Optional, object type
Index Aliases- Optional, object type
Index Column mappings- Optional, object type

As you can see, only the index Name field is mandatory, and the rest of the things are optional here.

Settings are used to specify the configuration you want for your index, such as the number of primary shards and replicas.

Now, if you do not specify this settings object, then it will create an index with the default configuration.
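A settings object might look like the following sketch. The index name and the shard and replica counts are placeholder values chosen for illustration, not recommendations.

```python
import json

# Body for: PUT /my-index-0001   (index name is a placeholder)
create_index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards, set when the index is created
        "number_of_replicas": 2,  # replica copies of each primary shard
    }
}

print(json.dumps(create_index_body, indent=2))
```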

Aliases are used to specify secondary names for an index or a group of indexes. Mappings are used to specify the property or column names, their data types, and other mapping parameters.
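To make aliases and mappings concrete, here is a hedged sketch of a create-index body that uses both. The index name (blogs), the alias (articles), and all field names are assumptions made up for this example.

```python
import json

# Body for: PUT /blogs   (index, alias, and field names are placeholders)
create_index_body = {
    "aliases": {"articles": {}},  # secondary name that resolves to this index
    "mappings": {
        "properties": {
            "title": {"type": "text"},         # analyzed for full-text search
            "author": {"type": "keyword"},     # matched as an exact value
            "publish_date": {"type": "date"},
            "views": {"type": "integer"},
        }
    },
}

print(json.dumps(create_index_body, indent=2))
```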

Now, "mappings" is an optional property. You should use it only when you are certain about the format of the JSON documents you want to store in an index. You can skip it and instead index a sample JSON document, similar to the SELECT INTO clause in SQL, which creates a temporary RDBMS table from the records being inserted. This is like the "design first" vs. the "data first" approach. If you have a sample JSON document, you can index it directly, and Elasticsearch will automatically extract all column names and data types from it (this is known as dynamic mapping).
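The "data first" approach can be sketched as follows, assuming automatic index creation has not been disabled on the cluster. The index name and the document are invented for illustration.

```python
import json

# Body for: POST /blogs/_doc   (index name and document are placeholders)
sample_document = {
    "title": "Elasticsearch Demystified",
    "views": 120,
    "publish_date": "2022-12-14",
}

# If the "blogs" index does not exist yet, Elasticsearch creates it and
# infers a mapping for each field from this first document.
# The inferred mapping can then be inspected with: GET /blogs/_mapping
print(json.dumps(sample_document))
```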

The "PUT Index" API mentioned above creates the index itself; documents are then inserted with the Index API, one document per call. The Bulk API is another way: it allows you to perform multiple index, update, and delete operations, even across multiple indexes, in a single API call. We also have "ingest pipelines", which let you transform the data before it is inserted into an index. A pipeline consists of a series of processors, and each processor applies one transformation to the data.
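The sketch below shows how a Bulk API body and an ingest pipeline definition could be assembled. The to_bulk_ndjson helper, the index names, and the pipeline id clean-titles are hypothetical; the trim and lowercase processors are just two of the built-in processors, picked as examples.

```python
import json


def to_bulk_ndjson(actions):
    """Serialize (action, source) pairs into the newline-delimited Bulk API body."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Body for: POST /_bulk   (index names and documents are placeholders)
bulk_body = to_bulk_ndjson([
    ({"index": {"_index": "blogs", "_id": "1"}}, {"title": "Part 1"}),
    ({"index": {"_index": "authors", "_id": "7"}}, {"name": "Aditya"}),
    ({"delete": {"_index": "blogs", "_id": "99"}}, None),
])

# Body for: PUT /_ingest/pipeline/clean-titles   (pipeline id is a placeholder)
pipeline_body = {
    "description": "Normalize titles before indexing",
    "processors": [
        {"trim": {"field": "title"}},       # strip leading/trailing whitespace
        {"lowercase": {"field": "title"}},  # convert the title to lowercase
    ],
}

print(bulk_body)
```

A document can be routed through such a pipeline by adding the pipeline query parameter (for example ?pipeline=clean-titles) to the index or bulk request.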

Querying/Searching Data in Elasticsearch

Querying/searching the data is one of the USPs of this data technology. Slow search results are one of the biggest turn-offs for customers on any website. On a few websites, I have seen search results take several seconds to load, which is not a good user experience in an enterprise application.

There are multiple factors that contribute to slow search: poor data modeling, indexing, hosting, and incorrect selection of databases, to name a few. You can optimize the data modeling, scale up your hosting infrastructure (both vertically and horizontally), rewrite your queries, and implement best practices like indexing and defragmentation to improve your database performance, but sometimes even all these practices are not sufficient.

Traditional databases might not perform well with large volumes of data, but Elasticsearch returns search results (both keyword and full-text) very quickly, typically within milliseconds, even with millions of records. In case you are wondering why data querying is so fast here, it is mainly because of the underlying data structure and the way the data is stored. Please check the first part of this series for more information on this.

There are various ways of querying data in the Elasticsearch world: Query DSL, EQL (Event Query Language), and Elasticsearch SQL, to name a few. In this blog, we are going to discuss only Query DSL, as it is the most popular way of retrieving data.

ES provides a full JSON-based domain-specific language (DSL) for querying. Query DSL uses the Search API, and you specify your search criteria in the form of a JSON request body. The Search API returns all the documents from the index that match the search criteria. For example, a match query on the user.id field, sent to the "my-index-0001" index, returns all JSON documents in that index where user.id matches "kimchy".

There are two contexts in Query DSL: query context and filter context. Before we dig into the difference between the two, let's touch on an important concept in Elasticsearch: scoring.
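The search described above can be sketched as the following request body. The choice of a match clause is an assumption based on the wording "matches"; a term clause would be the exact-value alternative.

```python
import json

# Body for: GET /my-index-0001/_search
search_body = {
    "query": {
        "match": {"user.id": "kimchy"}  # field and value taken from the text above
    }
}

print(json.dumps(search_body, indent=2))
```

Matching documents are returned in the hits section of the response, each with its relevance score in the _score field.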

Relevance Scoring in Searching

When it comes to a good user experience, relevant search results are as important as fast ones. The biggest problem with many database technologies is that we often must choose between the two: either fast searching or relevant search results, because all the explicit filtering needed for relevance makes the search slow. Elasticsearch aims to give you both: fast and relevant search results.

Elasticsearch uses a scoring system to rank the results; the score is based on how well each document matches the fields in the input query. Behind the scenes, Elasticsearch uses the practical scoring function of Lucene (the foundation of Elasticsearch) to implement this scoring and relevance-based searching. It calculates a score for every fetched document, and this score is then used to sort the documents, thus giving us relevant search results. Now, one may argue that similar results can be achieved by adding an ORDER BY ... DESC clause to a SQL query, but remember that here we are talking about quick relevant results, not just relevant results.

Take the example of this search button on our Encora Insights website.

Here, we would like the search to favor matches in the title of a blog over matches in its content or body. We can achieve this with the "boosting" feature of Elasticsearch. Fields in an index can be boosted to raise their contribution to the relevance score, so that the most relevant documents are fetched first. Boosting can be applied either while indexing the document or while writing a search query; query-time boosting is generally the more flexible option, since the boost can be changed without reindexing. In our example, we can keep the boost of the title field higher than the boost of the blog content.
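Query-time boosting for the blog search example could look like the sketch below. The field names title and content, the search text, and the boost factor of 3 are all assumptions made for illustration.

```python
import json

# Body for: GET /blogs/_search   (index and field names are placeholders)
search_body = {
    "query": {
        "multi_match": {
            "query": "elasticsearch",
            # "title^3" multiplies the score contribution of title matches by 3
            "fields": ["title^3", "content"],
        }
    }
}

print(json.dumps(search_body, indent=2))
```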

Now you know what relevance scoring in Elasticsearch is and how it is used. Let's go back to the two contexts for querying data: query context and filter context. Based on your problem statement, you can specify any combination of the two while querying the data. A query clause answers "how well does the document match this clause?", whereas a filter clause answers "does the document match this clause?". Fields mentioned in a filter clause simply filter the results; they do not affect the scoring of matching documents. Elasticsearch can also cache frequently used filters, so filter context is generally faster than query context.

In a search request, the "query" keyword indicates query context, whereas a "filter" clause inside a bool query indicates filter context. For example, "term" and "range" clauses placed under "filter" narrow down the matching documents, but they do not affect the scoring of the final matching documents.

A quick comparison between the two:

Query context- Answers how well a document matches; calculates a relevance score.
Filter context- Answers whether a document matches; does not affect the score, and results can be cached.
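A request that combines both contexts might look like this sketch. The field names and values (title, content, status, publish_date) are placeholders chosen for illustration.

```python
import json

# Body for: GET /_search   (field names and values are placeholders)
search_body = {
    "query": {                      # query context: clauses contribute to _score
        "bool": {
            "must": [
                {"match": {"title": "Search"}},
                {"match": {"content": "Elasticsearch"}},
            ],
            "filter": [             # filter context: yes/no, no effect on _score
                {"term": {"status": "published"}},
                {"range": {"publish_date": {"gte": "2015-01-01"}}},
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```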

Clauses for Searching: Elasticsearch offers various clauses for querying/ searching data-

match_all- Returns all the records, each with a relevance score of 1, since no search criteria are specified.

{
  "query": {
    "match_all": {}
  }
}

match- Returns all documents where the given field matches the given keyword.

{
  "query": {
    "match": {"field": "keyword"}
  }
}

exists- Returns all documents where field "ABC" exists.

{
  "query": {
    "exists": {"field": "ABC"}
  }
}

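must- Applies a logical "and" across two or more conditions; every clause must match. The field names and keywords here are placeholders.

{
  "query": {
    "bool": {
      "must": [
        {"match": {"field1": "keyword1"}},
        {"match": {"field2": "keyword2"}}
      ]
    }
  }
}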

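should: "Nice to have" conditions can be specified using should; a matching should clause raises the document's score. "must" and "must_not" take precedence over should.

multi_match: Runs the same search text against two or more fields, effectively an "or" between the fields.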

match_phrase: All the terms must appear in the field, and they must have the same order as the input value.

range: gte, lte, gt, lt- To specify greater than, less than conditions in the search criteria.
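The remaining clauses can be sketched in the same way. All index, field, and value names below are placeholders invented for this example.

```python
import json

# All field names and values below are placeholders.
queries = {
    "should": {"bool": {
        "must": [{"match": {"title": "elasticsearch"}}],
        "should": [{"match": {"tags": "tutorial"}}],   # optional, raises the score
        "must_not": [{"term": {"status": "draft"}}],   # excludes matching documents
    }},
    "multi_match": {"multi_match": {"query": "kibana", "fields": ["title", "content"]}},
    "match_phrase": {"match_phrase": {"content": "relevance scoring"}},
    "range": {"range": {"views": {"gte": 100, "lt": 1000}}},
}

for name, query in queries.items():
    # Each entry would be sent as the body of GET /<index>/_search
    print(name, json.dumps({"query": query}))
```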

Deleting Data/Indexes in Elasticsearch

You can use the Delete Index API to delete one or multiple indexes. This is like dropping a table in an RDBMS. Deleting an index removes all its data, metadata, and shards, but it does not delete associated Kibana components like dashboards and reports.

If you want to delete specific documents from an index, you can use the "Delete by Query" API (the _delete_by_query endpoint of the index), which deletes all the documents that match the query in the request body.

Here you can use the same query syntax as in the Search API. You can specify match_all if you want to delete all the data from an index.
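Both delete operations can be sketched as follows. The index name and the match criteria are reused from the earlier search example and are placeholders only.

```python
import json

INDEX = "my-index-0001"  # placeholder index name

# Body for: POST /my-index-0001/_delete_by_query
delete_matching = {"query": {"match": {"user.id": "kimchy"}}}

# Same endpoint: removes every document but keeps the (now empty) index
delete_all_docs = {"query": {"match_all": {}}}

# Deleting the whole index needs no body: DELETE /my-index-0001
delete_index_call = ("DELETE", f"/{INDEX}")

print("POST", f"/{INDEX}/_delete_by_query", json.dumps(delete_matching))
print(*delete_index_call)
```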

In this part of the series, we saw how to perform CRUD operations in Elasticsearch using the various available REST APIs and HTTP methods (PUT, POST, GET, DELETE, _delete_by_query, etc.). You can run all these commands directly in the Kibana developer console, make HTTP REST calls using curl or Postman, or invoke these calls from any programming framework. Almost all popular programming languages and frameworks, including .NET, Java, Python, and Ruby, have a package for Elasticsearch integration.

We also explored the "relevance scoring" concept in Elasticsearch which enables ES to provide ranked/ score-based search results. In the next and final part of this series, we will cover what Kibana is, how to use it, and how to visualize data in it. We will also see a case study on the enterprise-level implementation of Elasticsearch.

Stay tuned, and happy learning!

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

https://oneture.com/blog/improving-relevance-using-elasticsearch

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
