Simple Search Service Using SpringBoot and ElasticSearch-1

Satyam Lal · Groww Engineering · 6 min read · Jun 4, 2020

SpringBoot and ElasticSearch

Problem Statement and Use-Cases:

In this post, we shall learn how to create a simple search service using SpringBoot and ElasticSearch. We shall learn how to ingest bulky data efficiently into an ElasticSearch index, how to define analyzers and custom mappings for an index, and how to use the Java High Level REST Client (JHLC) for all communication between SpringBoot and the ElasticSearch server. We shall also learn about bool queries in ElasticSearch and how to use fuzzy queries to optimize the search and get better results.

Why ElasticSearch over SOLR?

  1. ElasticSearch is more lightweight and easier to install than SOLR.
  2. Our dataset contains JSON documents, which are natively supported and stored in ElasticSearch indices.

Introduction:

For our sample POC, we shall build a simple search service by ingesting employee data into an ElasticSearch index, sending various types of queries to it, and seeing how well the results match our query string.

Working Project:

The working project can be found here:

Pre-Requisites:

  1. IntelliJ IDEA installed and configured.
  2. JDK 1.8.
  3. ElasticSearch installed and running.
  4. Basic knowledge of SpringBoot, Spring Data JPA, and ElasticSearch.
  5. Postman installed.

Dataset:

The dataset that we shall be using for our service is downloaded from:

Head to the above-mentioned link. There we can see two datasets, one called “Employee100K” and another called “Employee50K”, containing 100K and 50K data points respectively. Download either one; for our example, we shall be using the Employee50K dataset.

Preprocessing:

Before starting with the configurations, we need to carry out some pre-processing on the dataset we just downloaded. Right now, our “Employee50K.json” file looks something like this:

Top 5 Documents taken from the dataset

Let’s clean this file down to just the employee JSONs by removing the unnecessary bulk-action lines. So go ahead, open up “Find & Replace”, and do the following (a sample of the cleaned result is shown after the steps):

Step 1: Replace all occurrences of {"index":{"_index":"companydatabase","_type":"employees"}} with blank lines.

Step 2: Remove all blank lines from the file.
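
After these two steps, every remaining line of the file should be one standalone employee document, roughly of this shape (the field values below are illustrative, not copied from the actual file):

{"FirstName":"JANE","LastName":"DOE","Designation":"Software Engineer","MaritalStatus":"Single","Interests":"Reading"}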

Mappings:

ES stores data as “JSON documents”. Schema in ElasticSearch can be thought of as a “mapping” that describes the various fields in the JSON documents along with their data types. In other words, we use mappings to declare which fields a JSON document contains and what datatype each one has. Read more about mappings here:

Analyzers:

In order to understand how ES uses the analysis phase to effectively match the required documents, let us first understand how ES indexes a document.

When documents are inserted into an ES index, ElasticSearch by default applies the Standard Analyzer to every field whose datatype is defined as “text” in the mappings (unless custom analysis is explicitly defined). The analysis phase itself consists of 3 internal phases:

1) Character Filters: A character filter can add, remove, or replace characters in the input text given to it. The most common use case is removing HTML tags from input text. For example, the “html_strip” character filter converts “My <b> name-is </b> Satyam” to “My name-is Satyam”.

2) Tokenizer: The transformed text from the character filter is given as input to the tokenizer, which transforms the text into a series of tokens. For example, the standard tokenizer, which is the default tokenizer used by ES, provides grammar-based tokenization and works well for most use cases; it transforms the text “My name-is Satyam” to [“My”, “name”, “is”, “Satyam”].

3) Token Filter: The token filter takes individual tokens from the tokenizer phase and can modify, add, or remove them. For example, to facilitate a better search, we can add synonyms for a token, convert tokens to lower case, remove stop words, and so on. A lowercase filter transforms the tokens [“My”, “name”, “is”, “Satyam”] to [“my”, “name”, “is”, “satyam”].

After the analysis phase is over at index time, all these modified tokens are stored in a data structure called the “inverted index” (sometimes called a reverse index), which records which tokens appear in which documents.

This Inverted Index is used to identify the most relevant documents when a query string is sent.

When a query string is sent at search time, the same analyzer that was used at index time is applied to the query string to transform it into tokens, and a look-up is performed in the inverted index to find the matching documents in the main index. Of course, we can also define custom mappings and different analyzers for index-time and search-time analysis.
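
To see the analysis phase in action, you can send text through ElasticSearch’s _analyze API. A quick sketch using the built-in standard analyzer (which combines the standard tokenizer with a lowercase token filter):

POST http://127.0.0.1:9200/_analyze
{
  "analyzer": "standard",
  "text": "My name-is Satyam"
}

The response lists the tokens [“my”, “name”, “is”, “satyam”], matching the pipeline described above.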

For our dataset, we make it searchable on the following fields:

1) First Name.

2) Last Name.

3) Designation (defined as a keyword, so no analysis).

4) Marital Status (defined as a keyword, so no analysis).

5) Interests.

The custom mappings and custom analyzers that I have used are as follows:

Let us call our index “employee_index”.
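
The author’s exact definition is in the linked project; a minimal sketch consistent with the fields above would look like this, assuming ElasticSearch 7.x (no mapping types) and a hypothetical custom analyzer named lowercase_analyzer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "FirstName":     { "type": "text", "analyzer": "lowercase_analyzer" },
      "LastName":      { "type": "text", "analyzer": "lowercase_analyzer" },
      "Designation":   { "type": "keyword" },
      "MaritalStatus": { "type": "keyword" },
      "Interests":     { "type": "text", "analyzer": "lowercase_analyzer" }
    }
  }
}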

Go ahead, copy this entire definition into the request body in Postman, and send a PUT request to the ES server at the URL: http://127.0.0.1:9200/employee_index

An Index with the required mappings should be created at the server.
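
To confirm, send a GET request to the same URL; ElasticSearch will respond with the index’s settings and mappings:

GET http://127.0.0.1:9200/employee_index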

Note: Install the Chrome extension “elastic-head” to get a better view of the indexes on the server.

Gain more insights about different types of analyzers here.

Creating a Sample SpringBoot Project:

Head to the above-mentioned link and generate a sample Spring project. For uniformity, use the following snapshot for reference:

Configurations:

Before we start writing the controllers (APIs), it is important to correctly import all the Maven dependencies and configure ElasticSearch with SpringBoot. To do so, carry out the following steps:

Step 1: Import the project ‘search-service-es’ into IntelliJ IDEA.

Step 2: Head to the ‘pom.xml’ file, clear all its existing contents, and add the following dependencies:
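
The original listing is not reproduced here; a minimal sketch would include the SpringBoot web starter and the high-level REST client. The version 7.6.2 below is an assumption; pick the one matching your ElasticSearch installation:

<dependencies>
    <!-- SpringBoot web starter for the REST controllers -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Java High Level REST Client for ElasticSearch -->
    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.6.2</version>
    </dependency>
</dependencies>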

Note: It is wise to re-import all Maven dependencies before continuing.

Step 3: Add the following line to the application.properties file under the resources package:

server.port=8181

By default, SpringBoot uses an embedded Apache Tomcat server running on port 8080, but just for reference we have explicitly configured it to run on port 8181.

Java Rest Client :

The Java REST Client is the official ElasticSearch client used by Java applications to connect and communicate with ES. We shall be using the Java High Level REST Client (JHLC) for this purpose. JHLC exposes API-specific methods that transfer objects between source and destination as requests and responses. In simpler words, we can use JHLC to connect to the ElasticSearch server and then perform various operations on ElasticSearch indices, such as insert, update, and bulk ingestion. As you have probably noticed, we have already included the JHLC dependency in the pom.xml file.

Now create a Java class called “ESConfig” where we shall tell JHLC which host and port to use for all ES communication. By default, ES runs on port 9200. Add the following lines inside the ESConfig class:
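
A minimal sketch of such a configuration class, assuming ElasticSearch runs locally at 127.0.0.1:9200 (adjust the host and port to match your setup):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ESConfig {

    // Expose the high-level client as a Spring bean so it can be injected
    // wherever we need to talk to ElasticSearch; destroyMethod = "close"
    // releases the underlying HTTP connections on application shutdown.
    @Bean(destroyMethod = "close")
    public RestHighLevelClient restHighLevelClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
    }
}

Any service or controller can now autowire the RestHighLevelClient bean and use it for the index, search, and bulk operations covered in Part 2.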

Further Notes

In this part of the blog, we have learnt how to carry out the initial setup and configuration, and hopefully gotten our basics clear on all the terms and technologies that we shall use to implement the search application. To actually implement the search application and write the APIs, do visit Part 2 of the blog.

If you found this story helpful, please click the 👏 button and share it to help others find it! Feel free to leave a comment below.
