Data Science is a comparatively new concept in the tech world, and it could be overwhelming for professionals to seek career and interview advice while applying for jobs in this domain. Also, there is a need to acquire a vast range of skills before setting out to prepare for data science interview. Interviewers seek practical knowledge on the data science basics and its industry-applications along with a good knowledge of tools and processes. Here is a list of 15 most common data science interview questions that might be asked during a job interview. Read along.
1. How is Data Science different from Big Data and Data Analytics?
Ans. Data Science utilizes algorithms and tools to draw meaningful and commercially useful insights from raw data. It involves tasks like data modelling, data cleansing, analysis, pre-processing etc.
Big Data is the enormous set of structured, semi-structured, and unstructured data in its raw form generated through various channels.
And finally, Data Analytics provides operational insights into complex business scenarios. It also helps in predicting upcoming opportunities and threats for an organization to exploit.
2. What is the use of Statistics in Data Science?
Ans. Statistics provides tools and methods to identify patterns and structures in data to provide a deeper insight into it. Serves a great role in data acquisition, exploration, analysis, and validation. It plays a really powerful role in Data Science.
3. What is the importance of Data Cleansing?
Ans. As the name suggests, data cleansing is a process of removing or updating the information that is incorrect, incomplete, duplicated, irrelevant, or formatted improperly. It is very important to improve the quality of data and hence the accuracy and productivity of the processes and organization as a whole.
4. What is a Linear Regression?
The linear regression equation is a one-degree equation of the form Y = mX + C and is used when the response variable is continuous in nature for example height, weight, and the number of hours. It can be a simple linear regression if it involves continuous dependent variable with one independent variable and a multiple linear regression if it has multiple independent variables.
5. What is logistic regression?
Ans. When it comes to logistic regression, the outcome, also called the dependent variable has a limited number of possible values and is categorical in nature. For example, yes/no or true/false etc. The equation for this method is of the form Y = eX + e – X
6. Explain Normal Distribution
Ans. Normal Distribution is also called the Gaussian Distribution. It has the following characteristics:
a. The mean, median, and mode of the distribution coincide
b. The distribution has a bell-shaped curve
c. The total area under the curve is 1
d. Exactly half of the values are to the right of the centre, and the other half to the left of the centre
7. Mention some drawbacks of the linear model
Ans. Here a few drawbacks of the linear model:
a. The assumption regarding the linearity of the errors
b. It is not usable for binary outcomes or count outcome
c. It can’t solve certain overfitting problems
8. Which one would you choose for text analysis, R or Python?
Ans. Python would be a better choice for text analysis as it has the Pandas library to facilitate easy to use data structures and high-performance data analysis tools. However, depending on the complexity of data one could use either which suits best.
9. What steps do you follow while making a decision tree?
Ans. The steps involved in making a decision tree are:
a. Pick up the complete data set as input
b. Identify a split that would maximize the separation of the classes
c. Apply this split to input data
d. Re-apply steps ‘a’ and ‘b’ to the data that has been divided
e. Stop when a stopping criterion is met
f. Clean up the tree by pruning
10. What is Cross-Validation?
Ans. It is a model validation technique to asses how the outcomes of a statistical analysis will infer to an independent data set. It is majorly used where prediction is the goal and one needs to estimate the performance accuracy of a predictive model in practice.
The goal here is to define a data-set for testing a model in its training phase and limit overfitting and underfitting issues. The validation and the training set is to be drawn from the same distribution yo avoid making things worse.
11. Mention the types of biases that occur during sampling?
Ans. The three types of biases that occur during sampling are:
a. Self-Selection Bias
b. Under coverage bias
c. Survivorship Bias
12. Explain the Law of Large Numbers
Ans. The ‘Law of Large Numbers’ states that if an experiment is repeated independently a large number of times, the average of the individual results is close to the expected value. It also states that the sample variance and standard deviation also converge towards the expected value.
13. What is the importance of A/B testing
Ans. The goal of A/B testing is to pick the best variant among two hypotheses, the use cases of this kind of testing could be a web page or application responsiveness, landing page redesign, banner testing, marketing campaign performance etc.
The first step is to confirm a conversion goal, and then statistical analysis is used to understand which alternative performs better for the given conversion goal.
14. What are over-fitting and under-fitting?
Ans. In the case of over-fitting, a statistical model fails to depict the underlying relationship and describes the random error and or noise instead. It occurs when the model is extremely complex with too many parameters as compared to the number of observations. An overfit model will have poor predictive performance because it overreacts to minor fluctuations in the training data.
In the case of underfitting, the machine learning algorithm or the statistical model fails to capture the underlying trend in the data. It occurs when trying to fit a linear model to non-linear data. It also has poor predictive performance.
15. Explain Eigenvectors and Eigenvalues
Ans. Eigenvectors depict the direction in which a linear transformation moves and acts by compressing, flipping, or stretching. They are used to understand linear transformations and are generally calculated for a correlation or covariance matrix.
The eigenvalue is the strength of the transformation in the direction of the eigenvector.
Stay tuned to this page for more such information on interview questions and career assistance. If you are not confident enough yet and want to prepare more to grab your dream job in the field of Data Science, upskill with Great Learning’s PG program in Data Science Engineering, and learn all about Data Science along with great career support.