When we look for an answer on Stack Overflow, we type a question into the search box, and the site suggests related questions. Our problem statement is much the same.
Given a question, can we find similar questions in a repository of questions and answers that we have?
The main objective is to return similar questions within a short amount of time. Search engines typically aim to return results in less than 500 milliseconds, so speed is extremely important when designing this kind of engine. The ordering of the related questions is also very important.
We want to have high precision and high recall. The computational and server costs should be low.
We use the StackSample dataset from Kaggle.
The dataset can be found at the link below.
StackSample: 10% of Stack Overflow Q&A
Text from 10% of Stack Overflow questions and answers on programming topics
Using Elasticsearch
Elasticsearch gives us a built-in implementation of an inverted index. It also provides default scoring based on TF-IDF-style schemes and gives us the flexibility to build our own scoring function. It is distributed and works in near real time, which keeps latency low.
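To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It only illustrates the data structure and is not how Elasticsearch implements it internally; the function names and the sample questions are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample questions, keyed by id.
docs = {
    1: "how to sort a list in python",
    2: "sort a dictionary by value in python",
    3: "reverse a string in java",
}
index = build_inverted_index(docs)
print(lookup(index, "sort python"))
```

Because the lookup goes from token to documents, a query only touches the documents that share at least one word with it, which is what makes keyword search fast on large collections.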
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. For text, one of the most common fixed-length representations is the bag-of-words, but this method discards a lot of information, such as the ordering and semantics of the words. Word embeddings instead represent words in an N-dimensional vector space so that semantically similar words (e.g. "king" and "monarch") or semantically related words (e.g. "bird" and "fly") lie close together, depending on the training method (using words as context or using documents as context).
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.
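The definition above can be sketched in a few lines of plain Python. The function name is our own, and a real pipeline would more likely use NumPy or let Elasticsearch compute the score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), and it ignores the lengths of the vectors.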
We need to install Docker and then run Elasticsearch inside a Docker container.
Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels.
To install Docker, see the link below.
Install Docker Engine on CentOS
To install Elasticsearch, run these commands.
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker image ls
docker run -m 6G -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name myelastic docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker ps
docker stats
docker exec -it myelastic bash
Inside the container, install Python and the required packages.
yum -y update
yum install -y python3
yum install -y vim
yum -y install wget
yum clean all
pip3.6 install --upgrade pip
pip3.6 --version
pip3.6 install elasticsearch
pip3.6 install pandas
pip3.6 install --upgrade --no-cache-dir tensorflow
pip3.6 install --upgrade tensorflow-hub
First, we need to convert the sentences to tensors (vectors). The code keeps the model loaded in memory.
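A minimal sketch of this step, assuming the Universal Sentence Encoder has already been extracted to a local directory. The helper names are ours, not taken from the project's scripts. `hub.load` returns a callable that maps a list of strings to one vector per string (512 dimensions for USE version 4).

```python
def load_encoder(model_path):
    """Load a sentence encoder SavedModel from disk (needs tensorflow-hub)."""
    # Imported lazily so the rest of the module works without TensorFlow.
    import tensorflow_hub as hub
    return hub.load(model_path)


def embed_sentences(encoder, sentences):
    """Call the encoder on a list of strings and return plain lists of floats."""
    vectors = encoder(sentences)
    if hasattr(vectors, "numpy"):   # TensorFlow eager tensor -> NumPy array
        vectors = vectors.numpy()
    if hasattr(vectors, "tolist"):  # NumPy array -> nested Python lists
        vectors = vectors.tolist()
    return [[float(x) for x in row] for row in vectors]
```

Loading the model once at start-up and reusing the `encoder` object for every request is what keeps the model in memory; for example, `encoder = load_encoder("/path/to/USE4")` (a placeholder path) followed by `embed_sentences(encoder, ["how to sort a list"])`. Plain Python lists are returned because they can be sent to Elasticsearch as JSON.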
The downloaded data is a compressed archive on the host machine, so we need to copy it into the Docker container. To do so, run the command below; the part before the colon is the container name.
docker cp /universal-sentence-encoder_4.tar.gz elasticsearch:/usr/share/elasticsearch/searchqa/data
Elasticsearch Indexing
Suppose we read the questions in the data folder one after the other. For each question we call our model, which returns a vector, and we insert both the question text and the vector into Elasticsearch. This whole process is known as indexing. The code for that is as follows.
The file indexES.py takes all of our questions, computes the vector for each title, and inserts the title and the vector into Elasticsearch, in an index we have created called questions-index.
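The indexing step can be sketched as follows. This is not the original indexES.py; the field names `title` and `title_vector` and both function names are assumptions made for illustration. The `es` argument is expected to be an `Elasticsearch` client from the elasticsearch Python package (7.x), and `embed` is any function that turns a title into a list of floats.

```python
def create_index(es, index_name, dims=512):
    """Create an index with a text field and a dense_vector field."""
    mapping = {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                # dims must match the size of the vectors the model returns.
                "title_vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }
    es.indices.create(index=index_name, body=mapping)
    return mapping


def index_questions(es, index_name, questions, embed):
    """Index (id, title) pairs together with their embedding vectors."""
    count = 0
    for question_id, title in questions:
        doc = {"title": title, "title_vector": embed(title)}
        es.index(index=index_name, id=question_id, body=doc)
        count += 1
    return count
```

For a large number of questions, sending documents one at a time is slow; the client's bulk helpers would be the usual next step, but the one-by-one loop keeps the sketch close to the description above.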
To check whether the index has been created, use this command.
curl -X GET "localhost:9200/questions-index/_stats?pretty"
We can also search by id.
When we create an id for each question, it looks something like this.
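Looking up a document by its id could be done from Python roughly like this. The function name is ours; `es.get` is the client call for fetching a single document, and the stored fields come back under the `_source` key of the response.

```python
def get_question(es, index_name, question_id):
    """Fetch one indexed document by id and return its stored fields."""
    response = es.get(index=index_name, id=question_id)
    return response["_source"]
```

If each question is indexed under its Stack Overflow question id, this lookup returns the title and vector stored for that question.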
The script top200KQues.py reads the data from Question.csv and creates a new file called top200KQuesData, which contains just the id and title of each question. The code looks something like this.
We need to download the USE4 (Universal Sentence Encoder, version 4) model to disk so that we can load it whenever we need it.
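A sketch of such a script using only the standard library is shown below. It is not the original top200KQues.py. It assumes the CSV has `Id` and `Title` columns, that the limit of 200,000 rows is what the file name refers to, and that the source file is Latin-1 encoded; adjust these if your copy of the data differs.

```python
import csv


def extract_id_and_title(src_path, dst_path, limit=200_000, encoding="latin-1"):
    """Copy the Id and Title columns of the first `limit` rows to a new CSV."""
    written = 0
    with open(src_path, newline="", encoding=encoding) as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)  # assumes a header row with Id and Title
        writer = csv.writer(dst)
        writer.writerow(["Id", "Title"])
        for row in reader:
            if written >= limit:
                break
            writer.writerow([row["Id"], row["Title"]])
            written += 1
    return written
```

Reading row by row keeps memory use low, which matters because the question bodies make the source file large.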
To search for similar questions we can use searchES.py. It uses cosine similarity to fetch the questions most similar to the query.
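One way to express this search is a `script_score` query that calls Elasticsearch's built-in `cosineSimilarity` function on a `dense_vector` field. The sketch below is not the original searchES.py: the field names and function names are assumptions, and the script uses the field-name form of `cosineSimilarity` found in the Elasticsearch 7.6+ documentation.

```python
def build_similarity_query(query_vector, vector_field="title_vector", size=10):
    """Build a search body that ranks all documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 shifts the score to be non-negative, since
                    # Elasticsearch does not accept negative scores.
                    "source": "cosineSimilarity(params.query_vector, '%s') + 1.0"
                              % vector_field,
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
        "_source": ["title"],
    }


def search_similar(es, index_name, query_vector, size=10):
    """Run the query and return (title, score) pairs, best match first."""
    body = build_similarity_query(query_vector, size=size)
    response = es.search(index=index_name, body=body)
    return [(hit["_source"]["title"], hit["_score"])
            for hit in response["hits"]["hits"]]
```

The query vector comes from running the same sentence encoder on the user's question, so that the query and the indexed titles live in the same vector space.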
We can create a Flask API and use it to search for similar questions. Before doing so, we need to install some packages.
To run the app in Flask, use searchES_FlaskAPI.py. The code looks something like this.
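A small sketch of such an API follows. It is not the original searchES_FlaskAPI.py; the `/search` route, the `q` parameter, and the function names are assumptions. The request handling is kept in a plain function so it can be exercised without a running server, and `search_fn` stands for whatever function embeds the query and searches Elasticsearch.

```python
def handle_search(query, search_fn):
    """Validate the query string and return (json_payload, http_status)."""
    query = (query or "").strip()
    if not query:
        return {"error": "missing query parameter 'q'"}, 400
    return {"query": query, "results": search_fn(query)}, 200


def create_app(search_fn):
    """Wrap handle_search in a Flask app exposing GET /search?q=..."""
    # Imported lazily so handle_search can be used without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/search")
    def search():
        payload, status = handle_search(request.args.get("q"), search_fn)
        return jsonify(payload), status

    return app
```

With a real search function in hand, `create_app(my_search).run(host="0.0.0.0", port=5000)` would start the development server, where `my_search` is a placeholder name.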
The final output of the search looks something like this.
The time taken to search for similar questions looks something like this.
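Since the goal is to stay under roughly 500 milliseconds per search, it helps to measure each call. The helper below is a generic sketch with a name of our choosing; it wraps any function and reports the elapsed wall-clock time in milliseconds.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


result, elapsed_ms = timed(sum, [1, 2, 3])
print("result=%s took %.3f ms" % (result, elapsed_ms))
```

Elasticsearch also reports its own server-side time in the `took` field of every search response, so comparing the two numbers shows how much time goes into embedding the query and network overhead.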
We can also visualize the metrics as graphs using Prometheus. The output looks something like this.
We can use Grafana to visualize memory consumption and other metrics.
Thanks, everyone, for your patience. I hope you have enjoyed this use case. I would also like to thank my friend ADITYA GUPTA for being my teammate and helping me complete this project.