Top 50 Data Science Interview Questions and Answers for 2017


Data Science and Analytics (DSA) jobs are in great demand across the globe. Harvard Business Review has called data scientist “the sexiest job of the 21st century”. According to IBM, demand for data scientists and analysts is projected to grow by 28% by 2020. The following figure shows how important DSA skills are today, and how important they are likely to remain through 2020.

[Figure: DSA skills]

Given the strong demand for Data Science and Analytics jobs, DSA interviews deserve careful preparation. To help you prepare for your next interview, our subject experts have put together a guide to the trickiest interview questions and answers, so that you can hold your own in the interview.

1. What is Data Science?

Data Science is a mix of statistics, machine learning, and Artificial Intelligence that is used to extract knowledge or insights from data available in different forms (structured or unstructured), often in order to predict future trends.

[Figure: Data Science]

2. How is Data Science different from Big Data and Data Analytics?

This is a question that often confuses newcomers.

Data Science is a blend of techniques for preparing, cleansing, and analyzing both structured and unstructured data, whereas Big Data refers to massive volumes of data that cannot be processed using traditional systems. Data Analytics, in turn, is the practice of applying algorithmic or mechanical processes to data in order to draw insights.

3. Python or R. Which one would you prefer for Data Science?

Python and R are both well-known programming languages for data science and analytics. However, the best answer to this question would be Python. The reason is that R is mostly found in data science environments, whereas Python is used across many areas, such as data analytics and web development.

4. Define the steps involved in an analytics project.

An analytics project will include the following steps:

  • Defining the problem
  • Exploring the data
  • Preparing the data
  • Modelling
  • Data validation
  • Implementation and tracking

5. What is meant by Data Cleansing?

Data Cleansing, also known as Data Cleaning, is a technique for identifying and removing anomalies and inconsistencies from the data in order to improve its quality.

6. Explain univariate, bivariate, and multivariate analysis.

These are data analysis methodologies that differ in the number of variables involved.

  • Univariate Analysis: the analysis involves only one variable, so no relationship between variables is examined. For example, a pie chart of market sales by geographical area.
  • Bivariate Analysis: the analysis involves two variables and the relationship between them. For example, analyzing the sales volume against the amount spent.
  • Multivariate Analysis: the analysis involves more than two variables, to understand the effect of the variables on the response.

7. What do you understand by the term Normal Distribution?

In general, data can be distributed in many ways: skewed to the left, skewed to the right, or scattered with no clear pattern. However, under certain conditions the data is distributed around a central value without any bias to the left or to the right. The values of the random variable then form a symmetrical bell-shaped curve, which is known as the normal distribution.
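To make the bell shape concrete, here is a minimal sketch using Python's standard library. The helper name share_within is our own and is only for illustration; it computes how much of a standard normal distribution lies within k standard deviations of the mean.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

Because the curve is symmetrical, exactly half of the probability lies on each side of the mean.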

8. What is Linear Regression?

Linear regression is a widely used statistical technique for modelling the linear relationship between a dependent variable (y) and one or more independent variables (x).

[Figure: Linear Regression]
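For the simplest case of one independent variable, the best-fit line can be computed directly with ordinary least squares. The sketch below is illustrative only: the function name fit_line and the sample data are our own assumptions, and real projects would normally use a library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```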

9. Explain Logistic Regression.

Logistic Regression, often called the ‘Logit Model’, is a statistical technique used to analyze datasets and predict a binary outcome. For example, it can predict whether a specific product will succeed in the market or not. The outcome here is binary, i.e. either 0 or 1 (fail or success). The dataset would include all the operational costs incurred for that product, e.g. manufacturing costs, marketing costs, etc.
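As a rough sketch of how a logit model turns inputs into a binary prediction, the code below passes a weighted sum of the costs through the sigmoid function. The weights and bias are made-up numbers for illustration, not fitted coefficients, and the function name predict_success is our own.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_success(manufacturing_cost, marketing_cost,
                    weights=(-0.8, 0.5), bias=0.2):
    """Return (probability of success, predicted class 0 or 1).

    The default weights and bias are hypothetical; in practice they are
    learned from historical data.
    """
    z = bias + weights[0] * manufacturing_cost + weights[1] * marketing_cost
    probability = sigmoid(z)
    return probability, int(probability >= 0.5)


probability, label = predict_success(manufacturing_cost=1.0, marketing_cost=3.0)
print(f"probability of success: {probability:.3f}, predicted class: {label}")
```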

10. Define Interpolation and Extrapolation.

Interpolation and extrapolation are two common estimation techniques in statistical analysis.

  • Interpolation: the technique of estimating a value that falls between two known values in a sequence of values.
  • Extrapolation: the technique of estimating a value beyond the range of the known values by extending the trend they show.
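The difference is easy to see with a straight line through two known points. In this small sketch (the function name linear_estimate and the data points are our own), the same formula interpolates when x lies between the known points and extrapolates when it lies outside them.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points: (2, 20) and (4, 40).
interpolated = linear_estimate(2, 20, 4, 40, 3)  # x = 3 lies between 2 and 4
extrapolated = linear_estimate(2, 20, 4, 40, 6)  # x = 6 lies beyond the known range

print(interpolated, extrapolated)
```

Extrapolated values are generally less reliable, because nothing guarantees that the trend continues outside the observed range.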

11. What is a Cluster? Define Clustering.

A cluster is a collection of objects that belong to the same class, whereas clustering is the process of grouping a set of abstract objects into classes of similar objects.

12. What is meant by k-means clustering? How can you select ‘k’ for k-means?

k-means clustering is an unsupervised learning algorithm used to segment data into a fixed number of clusters, k. It groups the data points according to their similarity. The number of clusters, k, is specified in advance; each cluster has a center, and every data point is assigned to the cluster whose center is nearest. The initial k cluster centers are typically chosen at random. A common way to select k is the elbow method: run k-means for a range of k values, plot the within-cluster sum of squares against k, and choose the value at which the improvement levels off.
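The assign-then-update loop at the heart of k-means can be sketched in a few lines. This is a simplified one-dimensional version with made-up data; the function name k_means_1d and its parameters are our own, and production code would normally rely on a library implementation.

```python
import random


def k_means_1d(values, k, seed=0, max_iter=100):
    """Cluster 1-D values into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # pick k data points at random as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # nothing moved, so the algorithm has converged
            break
        centers = new_centers
    return centers, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(values, k=2)
print(sorted(centers))
```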

13. Define Power Analysis.

Power analysis is a key aspect of experimental design. It allows the user to determine the sample size needed to detect an effect of a given size with a given level of confidence.
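As an illustration, the sketch below estimates the sample size per group for comparing two group means, using the standard normal approximation. The function name, the default significance level of 0.05, and the default power of 0.80 are our assumptions; an exact calculation based on the t-distribution gives a slightly larger number.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    effect_size is the standardized difference between the group means
    (difference divided by the common standard deviation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(sample_size_per_group(0.5))  # medium effect
print(sample_size_per_group(0.8))  # a larger effect needs fewer observations
```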

14. What does P-value indicate about the statistical data?

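The p-value indicates the strength of the evidence against the null hypothesis in a statistical hypothesis test. It is a number between 0 and 1, and it is commonly interpreted against a significance level of 0.05:

  • P-value > 0.05 represents weak evidence against the null hypothesis, so you fail to reject it.
  • P-value < 0.05 represents strong evidence against the null hypothesis, so you reject it.
  • P-value close to 0.05 is marginal and could go either way.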

15. What is Gradient Descent?

  • Gradient descent is an optimization algorithm used to find the parameter values (coefficients) of a function that minimize a cost function.
  • Gradient descent is most useful when the parameters are hard to calculate analytically and must instead be searched for by an optimization algorithm.
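Here is a minimal sketch of the idea on a toy cost function, cost(w) = (w - 3)^2, whose minimum is at w = 3. The function name, learning rate, and number of steps are illustrative choices of our own.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the cost."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# The derivative of (w - 3)^2 with respect to w is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(round(w_min, 6))
```

If the learning rate is too large the steps can overshoot the minimum, and if it is too small convergence becomes slow.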

16. What is the goal of A/B testing?

A/B testing, also known as split testing, is a statistical hypothesis testing approach for randomized experiments with two variants, A and B. The goal of A/B testing is to identify which changes to a web page improve an outcome of interest.

Example: Tracking the click-through rate for a banner advertisement.
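For the click-through example, one common way to compare the two variants is a two-proportion z-test. The sketch below uses made-up click and view counts, and the function name ab_test is our own.

```python
import math
from statistics import NormalDist


def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test; returns (rate_a, rate_b, z, p_value)."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click-through rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return rate_a, rate_b, z, p_value


# Hypothetical data: banner A got 200 clicks, banner B got 260, each from 10,000 views.
rate_a, rate_b, z, p_value = ab_test(200, 10_000, 260, 10_000)
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
```

A p-value below the chosen significance level (0.05 is typical) suggests the difference in click-through rate is unlikely to be due to chance alone.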

17. What is Machine Learning (ML)?

Machine learning is a branch of Artificial Intelligence (AI) that enables machines and software applications to become more accurate at deriving actionable insights by learning patterns in data. The working principle of ML is to build algorithms that accept input data and use statistical analysis to predict an output value within a specified range.

18. What are the different types of Machine Learning?

There are four types of learning:

  • Supervised Learning
  • Unsupervised Learning
  • Semi-supervised Learning
  • Reinforcement Learning

19. State the difference between Supervised Learning and Unsupervised Learning.

Supervised learning is the most common technique used today. The algorithm learns from training data that contains input variables together with their corresponding responses. In contrast, unsupervised learning is a technique in which the algorithm learns from patterns in the input variables alone, without any corresponding responses.

20. What is Reinforcement Learning?

Reinforcement learning, a type of ML and a branch of AI, is a learning approach based on rewards and punishments. A reinforcement learning agent learns by interacting with its environment and observing the outcomes of those interactions.

21. What is a Decision Tree?

A decision tree is a supervised learning algorithm that is used largely for classification problems. It gives a graphical representation of all the possible solutions to a decision based on the specified conditions.

22. What do you understand by the term ‘Boosting’?

Boosting refers to a group of algorithms that transform weak learners into a strong learner. To understand the definition in more detail, consider the following example.

Example: How would you decide whether a shirt is a bad product or not? You might start with simple rules such as the following:

  • The fabric of the shirt is hard. It is a bad product.
  • Buttons are missing. It is a bad product.
  • Loosely stitched at different areas. It is a bad product.
  • Multiple tears on the shirt. It is a bad product.
  • The shirt fits you comfortably. It is not a bad product.

Each of the rules above is useful to some extent for classifying a shirt as a bad product, but none of them is strong enough on its own. Such rules are therefore known as weak learners. To convert these weak learners into a strong learner, we combine the predictions of all the learners using methods like:

  • Average/weighted average
  • Taking the prediction with the most votes

For example, if 3 of the five learners above vote bad and 2 vote not bad, then we consider the shirt a bad product.
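The combining step can be sketched as follows. Note that this illustrates only how votes are merged; boosting algorithms such as AdaBoost additionally train the learners one after another and derive the weights from each learner's errors. The function names, the sample shirt, and the weights are made up for illustration.

```python
def majority_vote(votes):
    """Return True (bad product) if more than half of the learners vote bad."""
    return sum(votes) > len(votes) / 2


def weighted_vote(votes, weights):
    """Return True (bad product) if the 'bad' votes carry more than half the total weight."""
    bad_weight = sum(w for vote, w in zip(votes, weights) if vote)
    return bad_weight > sum(weights) / 2


# One vote per rule: hard fabric, missing buttons, loose stitching, tears, poor fit.
# True means that the rule classifies the shirt as a bad product.
votes = [True, True, True, False, False]

print(majority_vote(votes))                   # 3 of 5 learners say "bad"
print(weighted_vote(votes, [1, 1, 1, 1, 3]))  # hypothetical weights that favor the fit rule
```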

23. Explain CART and CHAID.

CART stands for Classification and Regression Trees. It is an umbrella term that refers to:

  • Classification tree analysis, where the predicted result is the class to which the data belongs.
  • Regression tree analysis, where the predicted result is a real number.

CHAID stands for Chi-square Automatic Interaction Detection. It is an analysis technique used to determine the relationship between variables. It builds a predictive model that helps in discovering how variables best combine to explain the outcome in the given dependent variable.

24. What is a Random Forest?

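Random forest is a supervised learning algorithm, used for classification and regression, that creates a forest with a defined number of decision trees. It is an ensemble of tree predictors in which every tree depends on a random vector sampled independently and with the same distribution for all the trees in the forest. The predictions of the trees are combined by voting or averaging, so random forest is a bagging method rather than a boosting one.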

25. Explain the Naïve Bayes algorithm.

It is a Machine Learning (ML) algorithm used for classification problems. The algorithm is based on Bayes' theorem, and is largely used for text classification involving high-dimensional training data sets.

The algorithm assumes that the occurrence of one feature has no effect on the occurrence of any other feature; in other words, the features are independent. This assumption is why it is called ‘Naïve’. ‘Bayes’ refers to the statistician and philosopher Thomas Bayes.

26. What is the use of Seaborn?

Seaborn is a Python data visualization library based on Matplotlib, which offers a high-level interface for drawing attractive statistical plots and graphics. It is a great resource for distribution plots and regression plots, but its real strength is visualizing several features simultaneously.

27. What does NLP stand for?

NLP stands for Natural Language Processing, a field concerned with helping computers understand human speech and language. With the help of NLP, computers can communicate directly with humans in their own language.

28. Name a few libraries in Python used for data analysis and scientific computations.

Python has many libraries, but the following are the most widely used for data analysis:

  • NumPy
  • SciPy
  • Pandas
  • scikit-learn
  • Matplotlib
  • Seaborn

29. How many Data Structures does R have?

The R language has two types of data structures: homogeneous and heterogeneous.

  • Homogeneous: all elements are of the same type (vectors, matrices, and arrays)
  • Heterogeneous: elements can be of different types (lists and data frames)

30. What is meant by k-Nearest Neighbor?

k-Nearest Neighbor (kNN) is one of the most important, yet simplest, classification algorithms in Machine Learning (ML). It belongs to supervised learning and is used in pattern recognition, intrusion detection, and data mining. As the name suggests, the prediction for a given input depends on the k most similar inputs (its nearest neighbors) in the training data.
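A bare-bones sketch of the idea follows. The training points, labels, and the function name knn_predict are made up for illustration; the prediction is the most common label among the k closest training points.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of query from its k nearest labelled neighbours."""
    # Sort the training points by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = Counter(label for _, label in neighbours)
    return labels.most_common(1)[0][0]


# Hypothetical training data: (features, label) pairs.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]

print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 8)))
```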

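31. How is kNN different from k-means clustering?

Many candidates are misled by the ‘k’ that appears in both terms, but the two are fundamentally different. kNN is a supervised classification algorithm that needs labelled training data, and its k is the number of neighbors consulted for a prediction. k-means is an unsupervised clustering algorithm that works on unlabelled data, and its k is the number of clusters to form.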

32. State the difference between Covariance and Correlation.

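Covariance and Correlation are two mathematical concepts that are widely used in statistics. Covariance measures how two variables vary together, and its value depends on the units of the variables. Correlation is the covariance divided by the product of the two standard deviations, so it is unit-free and always lies between -1 and 1.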

[Figure: Covariance and Correlation]
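A quick way to see the difference is to change the units of one variable. In this sketch (sample data and function names are our own), converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.5, 1.6, 1.7, 1.8]           # made-up heights in metres
heights_cm = [h * 100 for h in heights_m]  # the same heights in centimetres
weights_kg = [50.0, 60.0, 70.0, 80.0]      # made-up weights

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```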

33. What do you understand by Type I and Type II Errors?

Type I Error:

  • It is the incorrect rejection of a ‘true null hypothesis’ (a false positive).
  • The probability of a Type I error is denoted by alpha (α).

Type II Error:

  • It is the failure to reject a ‘false null hypothesis’ (a false negative).
  • The probability of a Type II error is denoted by beta (β).

34. Define the Lifecycle of a Data Science Project.

Every data science project follows a standard workflow, so that the multiple teams working in an organization stay in sync and uncertainties are avoided. A typical lifecycle of a data science project involves the following phases:

  1. Data Acquisition
  2. Data Preparation
  3. Hypothesis and Modelling
  4. Evaluation and Interpretation
  5. Deployment
  6. Operations
  7. Optimization

35. Why is Resampling carried out?

Resampling is carried out for any one of the following reasons:

  • Estimating the accuracy of sample statistics by using subsets of the available data, or by drawing randomly with replacement from a set of data points.
  • Validating models by using random subsets of the data.
  • Exchanging labels on data points while performing significance tests.
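The first use case, drawing with replacement, is known as the bootstrap. The sketch below estimates how much a sample mean would vary from sample to sample; the data, the function name, and the number of resamples are illustrative assumptions.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of n_resamples samples drawn with replacement from data."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    ]


data = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 17.0, 13.0]  # made-up observations
means = bootstrap_means(data)

# The spread of the resampled means approximates the standard error of the mean.
std_error = statistics.stdev(means)
print(f"sample mean: {statistics.fmean(data):.2f}, bootstrap standard error: {std_error:.2f}")
```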

36. What is the memory limit in R?

  • The memory limit for the 32-bit systems is 3 GB.
  • The memory limit for the 64-bit systems is 8 TB.

37. Name the tools that can be used to find bugs in Python.

Pylint and PyChecker are two well-known tools used for identifying bugs in Python.

  • Pylint: it checks whether a module meets the coding standards.
  • PyChecker: it is a static analysis tool used to identify bugs in the source code.

38. What is meant by PEP8?

PEP8 is the style guide for Python code. It helps programmers write readable code that is accessible to other people.

39. Explain how you will convert a factor variable to numeric in the R language.

To convert a factor variable to numeric, the as.numeric() function is used. However, the variable must first be converted to character, because calling as.numeric() directly on a factor returns the internal integer codes of the factor levels instead of the original values.


X1 = as.numeric(as.character(X))

40. What are Recommender Systems?

A recommender system is an information filtering technology that uses algorithms such as collaborative filtering and content-based filtering to offer products and services that are likely to be of interest to the user. Recommender systems gather past information about the user, such as personal profile, hobbies, purchasing behavior, reviews, and comments, and compare it with reference characteristics to recommend a product or service.

Today, recommender systems are widely used across movies, music, articles, e-commerce websites, social tags, etc.

41. What is Collaborative Filtering?

Collaborative filtering is a method of predicting the interests and patterns of a user by gathering information from various data sources, viewpoints, and many other users.

42. What is Content-based Filtering?

Content-based filtering, also known as cognitive filtering, is a technique that recommends products or services by comparing the content of the products or services with the user profile.

43. What is Data Mining?

Data mining is an analytic process used for data exploration (usually of large amounts of data, often called ‘Big Data’) to locate patterns and relationships between the variables, and then to apply the patterns found to new subsets of data.

44. State the key concepts in Data Mining.

The following concepts underpin data mining and its growing popularity as a business information management tool:

  • Data Pre-processing
  • Data Visualization
  • Machine learning
  • Model validation and testing
  • Deployment in production

45. What is Text Mining?

While data mining is all about locating patterns and relationships in the data, text mining is concerned with the analysis of text documents (usually multiple documents). It extracts key concepts, phrases, etc. and prepares the processed text for later analysis with numeric data mining techniques.

46. Explain OLAP.

OLAP stands for ‘On-Line Analytical Processing’. It is a technique that enables users of multi-dimensional databases to generate comparative or descriptive summaries and run other analytical queries online.

47. Explain Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation is a topic modelling approach used to classify documents based on their subject matter. LDA represents each document as a mixture of topics, each of which has its own probability distribution over likely words.

48. What do you understand by Multidimensional Scaling?

Multidimensional scaling is an analytical technique used to detect meaningful underlying dimensions, which enable the researcher to explain the observed similarities or dissimilarities between the examined objects.

49. Explain Neural Networks.

Dr. Robert Hecht-Nielsen has defined the neural network as a “computing system formed by a collection of simple and highly interconnected processing elements, which processes the information based on their dynamic state response to the external inputs.”

50. State a few applications of Neural networks.

Neural networks are applicable in virtually any situation in which a relationship exists between the predictor variables and the predicted variables. A few of the areas where neural networks can be applied are:

  • Detection of medical phenomena
  • Stock market prediction
  • Credit assignment
  • Monitoring the condition of machinery
  • Engine management

Wrapping up

As the market for analytics jobs is growing at a fast pace, there is a huge demand for trained and certified Analysts and Data Experts. So, leave no stone unturned: get trained by subject matter experts and accelerate your career with the extra edge delivered by the Kovid Academy.

2017-12-06T04:52:57+00:00 | September 4th, 2017 | Categories: Questions
