Quora Question Pair Similarity Problem

13 min readMay 20, 2021

1. Business Problem

1.1 Description

Quora is a place to gain and share knowledge — about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world. Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Problem Statement

Identify which questions asked on Quora are duplicates of questions that have already been asked.
This could be useful to instantly provide answers to questions that have already been answered.
We are tasked with predicting whether a pair of questions are duplicates or not.

1.2 Sources/Useful Links

- Source : https://www.kaggle.com/c/quora-question-pairs

1.3 Real world/Business Objectives and Constraints

The cost of a mis-classification can be very high because say if there are 2 questions which are actually not duplicates but if we declare them to be duplicates then all the answers which are there for question 1 will be referred to as answers to question 2 which is extremely dangerous or costly.
You would want a probability of a pair of questions to be duplicates so that you can choose any threshold of choice.
No strict latency concerns.
Interpretability is partially important.

2. Machine Learning Problem

2.1 Data

2.1.1 Data Overview

- Data will be in a file Train.csv

- Train.csv contains 5 columns : qid1, qid2, question1, question2, is_duplicate

- Size of Train.csv — 60MB

Number of rows in Train.csv = 404,290

2.1.2 Example Data point

“id”,”qid1",”qid2",”question1",”question2",”is_duplicate”

“0”,”1",”2",”What is the step by step guide to invest in share market in india?”,”What is the step by step guide to invest in share market?”,”0"

“1”,”3",”4",”What is the story of Kohinoor (Koh-i-Noor) Diamond?”,”What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?”,”0"

“7”,”15",”16",”How can I be a good geologist?”,”What should I do to be a great geologist?”,”1"

“11”,”23",”24",”How do I read and find my YouTube comments?”,”How can I see all my Youtube comments?”,”1"

2.2 Mapping the real world problem to an ML problem

2.2.1 Type of Machine Leaning Problem

It is a binary classification problem, for a given pair of questions we need to predict if they are duplicate or not.

2.2.2 Performance Metric

Source: https://www.kaggle.com/c/quora-question-pairs#evaluation

Metric(s):

* log-loss : https://www.kaggle.com/wiki/LogarithmicLoss

* Binary Confusion Matrix

2.3 Train and Test Construction

We build train and test by randomly splitting in the ratio of 70:30 or 80:20 whatever we choose as we have sufficient points to work with. A better way of splitting the data would have been time based splitting as the types of questions change over time.But we have not been the given time stamps.

Exploratory Data Analysis

In this section we will do the analyse the data to get sense of what`s happening in our data.

Each of the features has 404290 non-null values except ‘question1’ and ‘question2’ which have 1 and 2 null objects respectively. We will process these rows a differently. So this is the high level view of the data.

What is the distribution of final class labels?

The number of duplicates (similar or 1) questions is 149263 and the number of non-duplicate (non-similar or 0) questions is 255027 which constitute 36.92% and 63.08% of the data respectively. Only about 37% of the question pairs are similar!

Number of unique questions is 537933
Number of unique questions that appear more than ones is 111780 which is equal to 20.78 % of all the unique questions.
Max number of times a single question is repeated: 157
We have zero duplicate questions.

What is the number of occurrences for each question?

Note that the y-axis is the logarithmic axis, 10 power 0 is 1, and so on… Notice there is one question repeating around 160 times.

This looks like an exponential distribution.

It is clear from the above graph that most of the questions appear less than 40 times. We have very few outliers that happen to appear more than 60 times and an extreme case of a question that appeared 157 times.

As far as null values are concerned we will just replace them with an empty space.

Feature Extraction Before Data Cleaning

Let us begin with feature engineering. We will extract 11 possible features, some of these features may be useful and some may not at all. After constructing the below features, we will have 17 columns in total including the output feature is_duplicate. Consider below dummy data for illustration:

Let`s analyse the some features-

Word share: The feature ‘word_share’ has quite a lot of predictive power, as it is good at separating the duplicate questions from the non-duplicate ones. The violin plot on the left shows that duplicate questions share more common words than non-duplicates hence word_share is higher for duplicates. The distributions for normalized word_share have some overlap on the far right-hand side, i.e., there are quite a lot of questions with high word similarity.
Interestingly, it seems very good at identifying questions that are definitely different but is not so great at finding questions that are definitely duplicates.

The distributions for normalized word_share have some overlap on the far right-hand side, i.e., there are quite a lot of questions with high word similarity
The average word share and Common no. of words of qid1 and qid2 is more when they are duplicate(Similar)

This means that this feature has some value in separating the classes.

Word_Common: In the below plot, the distributions of the word_Common feature in similar and non-similar questions are highly overlapping. Hence, this feature is not that useful to us.

The distributions of the word_Common feature in similar and non-similar questions are highly overlapping. Hence this feature cannot be used for classification. Or in other words we can say that it has a very less value of predictive power.

Preprocessing of Text

Before we go into complex feature engineering ,we need to clean up the data. Some of the steps of preprocessing includes-

Removing html tags
Removing punctuation.
Performing stemming , process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

Removing Stop words. Some examples of stop words are: “a,” “and,” “but,” “how,” “or,” and “what.”
Expanding contractions such as replacing “`ll” to “ will”, “n`t” to “ not”,”$” to “ dollar “ etc.

Following are the 21 new features that we will extract:

Analysis of the features extracted above

The number of data points or questions in class 1 (duplicate pairs): 298526
The number of data points or questions in class 0 (non-duplicate pairs): 510054

After removing stop words from the questions:

The total number of words in duplicate pair questions: 16109861
The total number of words in non-duplicate pair questions: 33192952

We will first create word clouds. Word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.

Here we can clearly see that words such as “donald ” , “trump” , “best” etc have a bigger size which implies that they have a large frequency in duplicate question pairs

In non duplicate questions pairs we see words like “not”, “India”, “will” etc.One this to note is that thee word ‘best’ has a substantial frequency even in non duplicate pairs, but here its frequency is quite less as its image has a smaller size.

These are the pair plots of few of the advanced features. One this we can observe is that almost all the plots have partial overlapping. Hence we can conclude that these features can provide partial separability. They all provide some predictive power.

Similarly features token_sort_ratio and fuzz_ratio also provides some separability as their PDFs have partial overlapping.

T-SNE (t-distribution Stochastic Neighbourhood Embedding) for Visualization

T-SNE is the most used dimension reduction technique. It is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.

In the above 2-D graph, notice that there are few regions with maximum data points of only one class 0 (non-duplicates) or 1(duplicates) overlapped but there are few regions where you can quickly separate class region 0 or 1. Hence, this gives us a hint that the multiple dimensional features that we designed certainly have some value in differencing the two classes not perfectly but in some manner!

Featurization using TF-IDF weighted word2vec

We have noticed that some words occur more frequently in duplicate pairs of questions i.e., class 1 while some words occur more often in non-duplicate pairs of questions i.e., class 0 (refer to word clouds section).

In this method, we will calculate the TF-IDF value of each word then sum the multiplication of the TF-IDF value with the corresponding word vector and divide this sum by the sum of the TF-IDF values.

Note that instead of word2vec, we will be using Glove (Global vectors).

Both word2vec and Glove enable us to represent a word in the form of a vector (often called embedding). They are the two most popular algorithms for word embeddings that bring out the semantic similarity of words that capture different facets of a word’s meaning. Word2vec embeddings are based on training a shallow feedforward neural network while glove embeddings are learned based on matrix factorization techniques.
Word2Vec algorithms treat each word equally because of their goal to compute word embeddings. The distinction becomes important when one needs to work with sentences or document embeddings; not all words equally represent the meaning of a particular sentence. And here different weighting strategies are applied, TF-IDF is one of those successful strategies.
At times, it does improve the quality of inference, so the combination is worth a shot.

Train and Test Split

The train-test split procedure is used to estimate machine learning algorithms performance when used to make predictions on data not used to train the model.

Think of temporal splitting or time-based splitting… assume your complete data is sorted by time and then you split the oldest data as train and the newest data as a test. Temporal splitting makes sense here as if the question asked is ‘Who is the current prime minister?’ then merging answers of the same question asked a few years back will give useless answers or results. As timestamp related feature is not available in the dataset so we cannot use the temporal splitting technique. Hence we build the train and test sets by randomly splitting in the ratio of 70:30 or 80:20 or whatever we choose as we have sufficient observations to work with!

Machine Learning Models

There is a total of 796 features in the dataset. The number of features in question1 w2v data frame and the number of features in question2 w2v data frame is the same i.e., 384. There is a total of 794 training features including the basic features extracted before data cleaning, the advanced NLP and fuzzy features extracted after data pre-processing, Tf-Idf vectors for question 1, and Tf-Idf vectors for question-2.

Here, we will split our data into train-test in the ratio of 70:30. The number of data points in train data: 283003 and the number of data points in test data: 121287. Let us call this test data, validation data because we already have test data that is completely unseen. You can download the test dataset from Kaggle.

What is log-loss?

Logarithmic Loss is the most important classification metric based on probabilities. It’s hard to interpret raw log-loss values, but log-loss is still a good metric for comparing models. Log loss is a metric that can lie between [0, infinity). For any given problem, a lower log-loss value means better predictions.

Log Loss is a slight twist on something called the Likelihood Function. In fact, Log Loss is -1 * the log of the likelihood function. So, we will start by understanding the likelihood function.

The likelihood function answers the question “How likely did the model think the actually observed set of outcomes was.” If that sounds confusing, an example should help.

For example, a model predicts probabilities [0.8, 0.4, 0.1] for three houses. The first two houses were sold, and the last one was not sold. So the actual outcomes could be represented numerically as [1, 1, 0].

Let’s step through these predictions one at a time to iteratively calculate the likelihood function.

The first house sold, and the model said that was 80% likely. So, the likelihood function after looking at one prediction is 0.8.

The second house sold, and the model said that was 40% likely. There is a rule of probability that the probability of multiple independent events is the product of their individual probabilities. So, we get the combined likelihood from the first two predictions by multiplying their associated probabilities. That is 0.8 * 0.4, which happens to be 0.32.

Now we get to our third prediction. That home did not sell. The model said it was 10% likely to sell. That means it was 90% likely to not sell. So, the observed outcome of not selling was 90% likely according to the model. So, we multiply the previous result of 0.32 by 0.9.

We could step through all of our predictions. Each time we’d find the probability associated with the outcome that actually occurred, and we’d multiply that by the previous result. That’s the likelihood!

Baseline Random Model

A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any predictions for your problems, such as classification accuracy or loss. The random prediction algorithm predicts a random outcome as observed in the training data. It means that the random model predicts labels 0 or 1 randomly. The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem. More…

The log loss on validation data using Random Model is 0.89. A lot of scope to reduce loss!

I have tried Logistic Regression as the data is fairly high dimensional and Linear SVM also. I have used SGDclassifier with ‘log’ loss which is actually a Logistic regression, and SGDclassifier with ‘hinge’ loss which is actually a Linear SVM and performed hyperparameter tuning on alpha(parameter of SGD classifier). But both the models didn’t gave promisible accuracy and hence log loss.

XGBoost Model

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. The performance of XGBoost is no joke — it’s become the go-to library for winning many Kaggle competitions. More…

After fitting XGBoost Model on the training data, following are the results:

The log loss on train data is 0.69 and on validation data is also approx. 0.69. Log loss reduced much better but again much scope for improvement!
After applying some hyperparameter tunning, at the end achieved the log loss for training data is 0.345 and for validation is 0.357 which is a significant improvement.
Also both the losses are close shows that model is neither over-fitting not under-fitting.

The true positives (17081) and true negatives(7600) are in large numbers which shows larger number of correct classifications. Check here for evaluation metrics used for classification problems and their interpretations.

The precision for class 0 is 83% and for class 1 is 80%.
Also the recall for class 0 is 90% and for class 1 is 69%.
Both are the significant improvements over above two linear models, logistic regression and linear SVM.

Now the task for you to apply same pre-processing steps for test data and apply above trained model for the predictions!

_______________________________________________________________

Thanks for reading ❤.