Natural Language Processing (NLP) Introduction:
NLP stands for Natural Language Processing, which helps machines understand and analyse natural languages. It is an automated process for extracting the required information from text data by applying machine learning algorithms.
When applying for roles that involve Natural Language Processing, applicants are often unsure what kinds of questions the interviewer might ask. Apart from learning the basics of NLP, it is important to prepare specifically for interviews. Check out this list of frequently asked NLP interview questions, with answers and explanations.
NLP Interview Questions and Answers with Explanations
1. Which of the following techniques can be used for keyword normalization, the process of converting a keyword into its base form?
c. Cosine Similarity
Lemmatization reduces a word to its base form, e.g. playing -> play, eating -> eat, etc.
Other options are meant for different purposes.
2. Which of the following techniques can be used to compute the distance between two word vectors?
b. Euclidean distance
c. Cosine Similarity
Answer: b) and c)
The distance between two word vectors can be computed using cosine similarity or Euclidean distance. Cosine similarity is the cosine of the angle between the two vectors. A value close to 1 indicates that the words are similar, and a value close to 0 indicates that they are not.
E.g. the cosine similarity between the words “Football” and “Cricket” will be closer to 1 than the cosine similarity between “Football” and “New Delhi”.
Python code to implement a cosine similarity function would look like this:
import numpy as np
import wikipedia
from sklearn.feature_extraction.text import CountVectorizer

def cosine_similarity(x, y):
    # Cosine of the angle between two count vectors
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

q1 = wikipedia.page('Strawberry')
q2 = wikipedia.page('Pineapple')
q3 = wikipedia.page('Google')
q4 = wikipedia.page('Microsoft')
cv = CountVectorizer()
# One row of word counts per page: 0 = Strawberry, 1 = Pineapple, 2 = Google, 3 = Microsoft
X = cv.fit_transform([q1.content, q2.content, q3.content, q4.content]).toarray()
print("Strawberry Pineapple Cosine Similarity", cosine_similarity(X[0], X[1]))
print("Strawberry Google Cosine Similarity", cosine_similarity(X[0], X[2]))
print("Pineapple Google Cosine Similarity", cosine_similarity(X[1], X[2]))
print("Google Microsoft Cosine Similarity", cosine_similarity(X[2], X[3]))
print("Pineapple Microsoft Cosine Similarity", cosine_similarity(X[1], X[3]))
Strawberry Pineapple Cosine Similarity 0.8899200413701714
Strawberry Google Cosine Similarity 0.7730935582847817
Pineapple Google Cosine Similarity 0.789610214147025
Google Microsoft Cosine Similarity 0.8110888282851575
Document similarity is usually measured by how semantically close the content (or words) of the documents are to each other. When they are close, the similarity index is close to 1; otherwise it is near 0.
The Euclidean distance between two points is the length of the straight line segment connecting them. It is usually computed using the Pythagorean theorem.
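To make the contrast between the two measures concrete, here is a minimal sketch using only the Python standard library. The vectors `u`, `v` and `w` are hypothetical 2-D values chosen for illustration; they are not real word embeddings.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between points x and y (Pythagorean theorem)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical vectors, for illustration only.
u = [3.0, 4.0]
v = [6.0, 8.0]    # same direction as u, twice the length
w = [4.0, -3.0]   # perpendicular to u

print(cosine_similarity(u, v))   # 1.0: same direction
print(euclidean_distance(u, v))  # 5.0: yet the points are far apart
print(cosine_similarity(u, w))   # 0.0: perpendicular vectors
```

Note that `u` and `v` have a cosine similarity of 1 even though their Euclidean distance is 5: cosine similarity ignores vector length, while Euclidean distance does not.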
3. What are the possible features of a text corpus?
a. Count of the word in a document
b. Vector notation of the word
c. Part of Speech Tag
d. Basic Dependency Grammar
e. All of the above
All of the above can be used as features of the text corpus.
4. You created a document term matrix on the input data of 20K documents for a Machine learning model. Which of the following can be used to reduce the dimensions of data?
- Keyword Normalization
- Latent Semantic Indexing
- Latent Dirichlet Allocation
a. only 1
b. 2, 3
c. 1, 3
d. 1, 2, 3
5. Which of the following text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection, and object detection?
a. Part of speech tagging
b. Skip Gram and N-Gram extraction
c. Continuous Bag of Words
d. Dependency Parsing and Constituency Parsing
6. Dissimilarity between words expressed using cosine similarity will have values significantly higher than 0.5
7. Which of the following are keyword normalization techniques?
b. Part of Speech
c. Named entity recognition
Answer: a) and d)
Part of Speech (POS) tagging and Named Entity Recognition (NER) are not keyword normalization techniques. Named Entity Recognition helps you extract entities such as Organization, Time, Date, City, etc. from a given sentence, whereas Part of Speech tagging helps you identify nouns, verbs, pronouns, adjectives, etc. among the tokens of a given sentence.
8. Which of the below are NLP use cases?
a. Detecting objects from an image
b. Facial Recognition
c. Speech Biometric
d. Text Summarization
a) and b) are Computer Vision use cases, and c) is a Speech use case.
Only d) Text Summarization is an NLP use case.
9. In a corpus of N documents, one randomly chosen document contains a total of T terms and the term “hello” appears K times.
What is the correct value for the product of TF (term frequency) and IDF (inverse-document-frequency), if the term “hello” appears in approximately one-third of the total documents?
a. KT * Log(3)
b. T * Log(3) / K
c. K * Log(3) / T
d. Log(3) / KT
The formula for TF is K/T.
The formula for IDF is log(total docs / number of docs containing “hello”)
= log(1 / (⅓))
= log(3)
Hence the correct choice is c) K * Log(3) / T.
10. Which algorithm decreases the weight of commonly used words and increases the weight of words that are rarely used in a collection of documents?
a. Term Frequency (TF)
b. Inverse Document Frequency (IDF)
d. Latent Dirichlet Allocation (LDA)
11. The process of removing words like “and”, “is”, “a”, “an”, “the” from a sentence is called
c. Stop word
d. All of the above
In stop word removal, all the stop words such as a, an, the, etc. are removed. One can also define custom stop words for removal.
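Removing such words can be sketched in a few lines of plain Python. The `STOP_WORDS` set below is a small illustrative list taken from the question, not a standard one; NLP libraries ship much longer lists, and a custom list can be passed in the same way.

```python
# Illustrative stop word list (an assumption for this sketch).
STOP_WORDS = {"and", "is", "a", "an", "the"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the lowercase words of `sentence` that are not stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]

print(remove_stop_words("The cat is on a mat and the dog is outside"))
# ['cat', 'on', 'mat', 'dog', 'outside']
```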
12. The process of converting a sentence or paragraph into tokens is referred to as Stemming
The statement describes the process of tokenization and not stemming, hence it is False.
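The difference between the two steps can be seen in a small sketch. The `crude_stem` function below is a deliberately naive suffix stripper written for illustration; it is an assumption of this example and not the Porter stemmer or any library's algorithm.

```python
import re

def tokenize(text):
    """Tokenization: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def crude_stem(token):
    """Stemming: strip a few common suffixes (naive, for illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("He enjoys playing and walked home.")
print(tokens)
# ['he', 'enjoys', 'playing', 'and', 'walked', 'home']
print([crude_stem(t) for t in tokens])
# ['he', 'enjoy', 'play', 'and', 'walk', 'home']
```

Tokenization decides where one token ends and the next begins; stemming then operates on each token separately.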
13. Tokens are converted into numbers before giving to any Neural Network
In NLP, all words are converted into a number before feeding to a Neural Network.
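One simple way to do this conversion is a vocabulary lookup, sketched below. The first-appearance numbering scheme is an assumption of this example; real pipelines typically use a fixed, pre-built vocabulary and then map each ID to an embedding vector.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```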
14. Identify the odd one out
b. scikit learn
All of the options are libraries used for NLP except BERT, which is a word embedding model.
15. TF-IDF helps you to establish
a. most frequently occurring word in the document
b. most important word in the document
TF-IDF helps to establish how important a particular word is in the context of the document corpus. TF-IDF takes into account the number of times the word appears in the document, offset by the number of documents in the corpus that contain the word.
- TF is the frequency of the term divided by the total number of terms in the document.
- IDF is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.
- TF-IDF is then the product of the two values TF and IDF.
Suppose that we have term count tables of a corpus consisting of only two documents, as listed here
| Term | Document 1 Frequency | Document 2 Frequency |
The calculation of tf–idf for the term “this” is performed as follows:
tf(“this”, d1) = 1/5 = 0.2
tf(“this”, d2) = 1/7 = 0.14
idf(“this”, D) = log(2/2) = 0
tfidf(“this”, d1, D) = 0.2 * 0 = 0
tfidf(“this”, d2, D) = 0.14 * 0 = 0
tf(“example”, d1) = 0/5 = 0
tf(“example”, d2) = 3/7 = 0.43
idf(“example”, D) = log(2/1) = 0.301
tfidf(“example”, d1, D) = tf(“example”, d1) * idf(“example”, D) = 0 * 0.301 = 0
tfidf(“example”, d2, D) = tf(“example”, d2) * idf(“example”, D) = 0.43 * 0.301 = 0.129
In its raw frequency form, TF is just the number of times the word “this” occurs in each document. In each document, the word “this” appears once; but as document 2 has more words, its relative frequency is smaller.
The IDF of a term is constant across the corpus, and accounts for the ratio of documents that include the word “this”. In this case, we have a corpus of two documents and both of them include the word “this”. So TF–IDF is zero for the word “this”, which implies that the word is not very informative as it appears in all documents.
The word “example” is more interesting – it occurs three times, but only in the second document.
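The arithmetic above can be reproduced with a short sketch. The function names are illustrative, and the base-10 logarithm is an assumption inferred from the value log(2/1) = 0.301 used in the worked example; other references use the natural logarithm.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term divided by terms in the document."""
    return term_count / total_terms

def idf(num_docs, num_docs_with_term):
    """Inverse document frequency, using a base-10 logarithm (assumed)."""
    return math.log10(num_docs / num_docs_with_term)

def tf_idf(term_count, total_terms, num_docs, num_docs_with_term):
    """Product of TF and IDF."""
    return tf(term_count, total_terms) * idf(num_docs, num_docs_with_term)

# "this": once in document 1 (5 terms), present in 2 of 2 documents.
print(tf_idf(1, 5, 2, 2))            # 0.0
# "example": 3 times in document 2 (7 terms), present in 1 of 2 documents.
print(round(tf_idf(3, 7, 2, 1), 3))  # 0.129
```

The same functions cover question 9: with K occurrences among T terms and the term present in one-third of the documents, the product is (K/T) * log(3).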
16. The process of identifying entities such as people and organizations in a given sentence or paragraph is called
c. Stop word removal
d. Named entity recognition
17. Which one of the following is not a pre-processing technique
a. Stemming and Lemmatization
b. converting to lowercase
c. removing punctuations
d. removal of stop words
e. Sentiment analysis
Sentiment Analysis is not a pre-processing technique. It is done after pre-processing and is an NLP use case. All the other listed techniques are used as part of text pre-processing.
18. In text mining, converting text into tokens and then converting them into integer or floating-point vectors can be done using
c. Bag of Words
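CountVectorizer does the above, while the others are not applicable.
from sklearn.feature_extraction.text import CountVectorizer
text = ["Rahul is an avid writer, he enjoys studying understanding and presenting. He loves to play"]
vectorizer = CountVectorizer()
# fit() learns the vocabulary; transform() fails on an unfitted vectorizer
vectorizer.fit(text)
vector = vectorizer.transform(text)
print(vector.toarray())
[[1 1 1 1 2 1 1 1 1 1 1 1 1 1]]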
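To show what the count vector represents, here is a dependency-free sketch of the same bag-of-words idea. It is not scikit-learn's implementation; it only imitates the documented defaults of lowercasing the text, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (sorted vocabulary, one count vector per document)."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

text = ["Rahul is an avid writer, he enjoys studying understanding and presenting. He loves to play"]
vocabulary, vectors = bag_of_words(text)
print(vocabulary)
print(vectors)  # [[1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
```

The single 2 in the vector belongs to "he", which occurs twice once the text is lowercased.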
The second section of the interview questions covers advanced NLP techniques such as Word2Vec and GloVe word embeddings, and advanced models such as GPT, ELMo, BERT and XLNET, with questions and explanations.
19. Words represented as vectors are called Neural Word Embeddings
Word2Vec, GloVe based models build word embedding vectors that are multidimensional.
20. Context modeling is supported by which one of the following word embeddings?
a. Word2Vec
b. GloVe
c. BERT
d. All of the above
Only BERT (Bidirectional Encoder Representations from Transformers) supports context modelling, where the preceding and following context is taken into consideration. Word2Vec and GloVe only provide word embeddings, and the surrounding context is not considered.
21. Bidirectional context is supported by which of the following embeddings?
d. All the above
Only BERT provides a bidirectional context. The BERT model uses both the preceding and the following text to arrive at the context. Word2Vec and GloVe are word embeddings; they do not provide any context.
22. Which one of the following word embeddings can be custom trained for a specific subject?
d. All the above
BERT allows transfer learning on existing pre-trained models and hence can be custom trained for a specific subject. With Word2Vec and GloVe, existing word embeddings can be used, but no transfer learning on text is possible.
23. Word embeddings capture multiple dimensions of data and are represented as vectors
24. Word embedding vectors help establish distance between two tokens
One can use cosine similarity to establish the distance between two vectors represented through word embeddings.
25. Language biases are introduced by the historical data used to train word embeddings. Which one of the following is not an example of bias?
a. New Delhi is to India, Beijing is to China
b. Man is to Computer, Woman is to Homemaker
Statement b) is a bias as it buckets Woman into Homemaker, whereas statement a) is not a biased statement.
26. Which of the following will be a better choice to address NLP use cases such as semantic similarity, reading comprehension, and common sense reasoning?
b. OpenAI’s GPT
OpenAI’s GPT is able to learn complex patterns in data by using the Transformer model’s attention mechanism, and hence is better suited for complex use cases such as semantic similarity, reading comprehension, and common sense reasoning.
27. Transformer architecture was first introduced with
c. OpenAI’s GPT
ULMFit has an LSTM-based language modeling architecture. OpenAI’s GPT replaced this with a Transformer architecture.
28. Which of the following architectures can be trained faster and needs less training data?
a. LSTM based Language Modelling
b. Transformer architecture
Transformer architectures were supported from GPT onwards; they are faster to train and need less training data.
29. Same word can have multiple word embeddings possible with ____________?
ELMo word embeddings support multiple embeddings for the same word. This helps when the same word is used in different contexts, so the embedding captures the context rather than just the meaning of the word, unlike GloVe and Word2Vec. NLTK is not a word embedding.
30. For a given token, its input representation is the sum of the token, segment and position embeddings
BERT uses token, segment and position embeddings.
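The summation can be illustrated with toy numbers. The 4-dimensional vectors below are hypothetical values chosen for this sketch; in the real model the three embeddings are learned and much larger.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]

# Hypothetical embeddings for a single token (illustration only).
token_embedding = [0.5, -1.0, 0.25, 2.0]     # which word piece this is
segment_embedding = [0.5, 0.5, 0.5, 0.5]     # which sentence (A or B) it belongs to
position_embedding = [0.0, 1.0, -0.25, 0.5]  # where it sits in the sequence

input_representation = add_vectors(token_embedding, segment_embedding, position_embedding)
print(input_representation)  # [1.0, 0.5, 0.5, 3.0]
```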
31. Trains two independent LSTM language models (left to right and right to left) and shallowly concatenates them
ELMo tries to train two independent LSTM language models (left to right and right to left) and concatenates the results to produce word embedding.
32. Uses a unidirectional language model for producing word embeddings
GPT is a unidirectional model: word embeddings are produced by training on information flowing from left to right. ELMo is bidirectional but shallow. Word2Vec provides simple word embeddings.
33. In this architecture, the relationship between all words in a sentence is modelled irrespective of their position. Which architecture is this?
a. OpenAI GPT
BERT Transformer architecture models the relationship between each word and all other words in the sentence to generate attention scores. These attention scores are later used as weights for a weighted average of all words’ representations which is fed into a fully-connected network to generate a new representation.
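The score-then-weighted-average step described above can be sketched as follows. This is a simplified illustration, not BERT's implementation: it assumes the queries, keys and values are the input vectors themselves, whereas a real Transformer applies learned projections, uses multiple attention heads, and passes the result through a feed-forward network. The input vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (new representations, attention weights) for a list of word vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence, regardless of position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all words' vectors.
        output = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                  for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-D word vectors.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one word attends to every other word, including itself.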
34. List 10 use cases to be solved using NLP techniques
- Sentiment Analysis
- Language Translation (English to German, Chinese to English, etc.)
- Document Summarization
- Question Answering
- Sentence Completion
- Attribute extraction (Key information extraction from the documents)
- Chatbot interactions
- Topic classification
- Intent extraction
- Grammar or Sentence correction
- Image captioning
- Document Ranking
- Natural Language inference
35. Transformer model pays attention to the most important word in Sentence
Ans: a) Attention mechanisms in the Transformer model are used to model the relationship between all words and also provide weights to the most important word.
36. Which NLP model gives the best accuracy amongst the following
Ans: b) XLNET
XLNET has given the best accuracy amongst all the models. It has outperformed BERT on 20 tasks and achieves state-of-the-art results on 18 tasks, including sentiment analysis, question answering, natural language inference, etc.
37. Permutation language modelling is a feature of
XLNET provides permutation-based language modelling, which is a key difference from BERT. In permutation language modelling, tokens are predicted in a random order rather than sequentially. The order of prediction is not necessarily left to right and can be right to left. The original order of the words is not changed, but the prediction order can be random.
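The idea of a random prediction order over an unchanged sentence can be sketched as below. This only enumerates which tokens would be visible at each prediction step for one sampled order; it does not train or run a model, and the function name and `seed` parameter are assumptions made for this illustration.

```python
import random

def permutation_prediction_steps(tokens, seed=0):
    """Return (order, steps) for one randomly sampled prediction order.

    Each step is (target position, target token, visible context), where the
    context holds the (position, token) pairs predicted earlier in the order.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # the prediction order is random
    steps = []
    for step, position in enumerate(order):
        # Tokens keep their original positions; only the order of prediction changes.
        visible = sorted(order[:step])
        context = [(p, tokens[p]) for p in visible]
        steps.append((position, tokens[position], context))
    return order, steps

tokens = ["New", "York", "is", "a", "city"]
order, steps = permutation_prediction_steps(tokens)
print("prediction order:", order)
for position, token, context in steps:
    print(f"predict position {position} ({token!r}) given {context}")
```

Depending on the sampled order, a token may be predicted from words to its right, words to its left, or both.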
The conceptual difference between BERT and XLNET can be seen from the following diagram.
38. Transformer XL uses relative positional embedding
Instead of embedding having to represent the absolute position of a word, Transformer XL uses an embedding to encode the relative distance between the words. This embedding is used to compute the attention score between any 2 words that could be separated by n words before or after.
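A small sketch shows what "relative distance" means here. The sign convention (column index minus row index) and the function name are assumptions of this example; the point is only that the same offset recurs at different absolute positions, so one embedding per offset can be shared.

```python
def relative_positions(length):
    """matrix[i][j] is the offset of word j relative to word i."""
    return [[j - i for j in range(length)] for i in range(length)]

for row in relative_positions(4):
    print(row)
# [0, 1, 2, 3]
# [-1, 0, 1, 2]
# [-2, -1, 0, 1]
# [-3, -2, -1, 0]
```

The offset between words 0 and 1 is the same as the offset between words 2 and 3, so both pairs would look up the same relative embedding when their attention score is computed.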
There, you have it – all the probable questions for your NLP interview. Now go, give it your best shot.