10 Questions Only Top Data Scientists Can Answer

Do you want to be a data scientist but don’t know what the interview will be like? I gathered these questions from doing and studying data science interviews. Here are 10 questions you may see in a data science interview, with answers! They are all multiple choice, and each question can have more than one correct answer.

For each question, choose all the correct answers. The answer key refers to the choices by letter: A is the first choice, B is the second, and so on. The average data scientist can get 6.5 correct, but the average interviewee can only get 3! How many can you get right?

  1. Movie Recommendation systems are an example of:
    1. Classification
    2. Clustering
    3. Reinforcement Learning
    4. Regression
  2. Sentiment Analysis is an example of:
    1. Regression
    2. Classification
    3. Clustering
    4. Reinforcement Learning
  3. Which of the following can act as possible termination conditions in K-Means?
    1. Reaching x iterations
    2. Assignments of observations to clusters stop changing between iterations.
    3. Centroids do not change between successive iterations.
    4. Terminate when Residual Sum of Squares falls below a threshold.
  4. Which of the following clustering algorithms suffers from the problem of convergence at local optima?
    1. K-Means clustering algorithm
    2. Agglomerative clustering algorithm
    3. Expectation-Maximization clustering algorithm
    4. Diverse clustering algorithm
  5. How can Clustering (Unsupervised Learning) be used to improve the accuracy of a Linear Regression model (Supervised Learning)?
    1. Creating different models for different cluster groups
    2. Creating an input feature for cluster ids as an ordinal variable.
    3. Creating an input feature for cluster centroids as a discrete variable.
    4. Creating an input feature for cluster size as a discrete variable.
  6. Which of the following types of datasets are not suited for K-Means Clustering?
    1. Datasets with many outliers
    2. Datasets with different densities
    3. Datasets with round shapes
    4. Datasets with non-convex shapes
  7. What is true about K-Means Clustering?
    1. K-Means is sensitive to cluster center initializations
    2. Bad initialization can lead to poor convergence speed
    3. Bad initialization can lead to bad overall clustering
    4. K-Means is only suitable for linearly separable datasets
  8. Which of the following should be applied to get the best results for K-Means Clustering?
    1. Run algorithm for different centroid initializations
    2. Adjust the number of iterations
    3. Test to find the optimal number of clusters
    4. Check the RSS after each time the algorithm runs and keep track
  9. List the answer choices in the correct order to perform K-Means Clustering. Leave out any steps that are unnecessary.
    1. Assign each data point to the nearest cluster centroid
    2. Specify the number of clusters
    3. Re-assign each point to nearest cluster centroids
    4. Assign cluster centroids randomly
    5. Re-compute cluster centroids
  10. Which of the following metrics can we use for finding dissimilarity between two clusters in hierarchical clustering?
    1. Single-link
    2. Complete-link
    3. Average-link
    4. Triforce-link

Correct Answers 

  1. B, C
    1. You might be able to argue that movie recommendation systems use classification, but they are most commonly examples of clustering or reinforcement learning. They are not examples of regression.
  2. A, B, D
    1. Sentiment analysis is one of my favorite NLP tasks. It uses regression, classification, and reinforcement learning, but not clustering. Topic detection, on the other hand, does use clustering.
  3. A, B, C, D
    1. These are all valid ways of ending your K-Means algorithm.
  4. A, C
    1. K-Means and EM can both converge to local minima. Both start from an initial guess (often random) and then iteratively improve an objective that is not convex, so where they end up depends on where they start. Meanwhile, agglomerative and diverse clustering algorithms don’t have this problem.
  5. A, B, C, D
    1. You can use all of these clustering techniques to help you improve your regression model. They add to the knowledge of the shape of the data.
  6. A, B, D
    1. K-Means clustering is fine on data with round shapes. Datasets with many outliers throw off your centroids, and so do datasets with different densities. Since K-Means naturally lends itself to rounder datasets, non-convex data is a challenge. For example, assigning each point to the nearest mean won’t work on crescent-shaped data.
  7. A, B, C
    1. All of these are true except the last one. K-Means can be used on datasets that aren’t linearly separable. It lends itself best to circular clusters, but the data does not have to be linearly separable, just grouped in chunks.
  8. A, B, C, D
    1. These are all standard methods applied to K-Means.
  9. B, D, A, E, C
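    1. This is the right order of steps for doing a K-Means clustering: specify the number of clusters, assign cluster centroids randomly, assign each data point to the nearest centroid, re-compute the centroids, and re-assign each point to the nearest centroid. The last two steps repeat until a termination condition is met.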
  10. A, B, C
    1. Triforce-link is a joke. The other linkage methods are all real and valid.
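
If you want to see how the answers to questions 3, 7, 8, and 9 fit together, here is a minimal sketch of K-Means in Python with NumPy. It is not taken from any particular library. The function names `kmeans` and `best_of_restarts`, their parameters, and the toy data are all assumptions I made up for illustration. The loop follows the step order from question 9, stops on three of the termination conditions from question 3 (an RSS threshold is left out to keep it short), and reruns from several random starts and keeps the lowest RSS, as in questions 7 and 8.

```python
import numpy as np


def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array.

    Hypothetical helper for illustration; returns (labels, centroids, rss).
    """
    rng = np.random.default_rng(seed)
    # The number of clusters k is given; pick k distinct points as random initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None

    for _ in range(max_iter):  # termination: reaching a fixed number of iterations
        # Assign (or re-assign) each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Termination: assignments stop changing between iterations.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Re-compute each centroid as the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Termination: centroids (almost) stop moving between iterations.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Residual sum of squares: squared distance of each point to its centroid.
    rss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, rss


def best_of_restarts(X, k, n_restarts=10):
    """Run K-Means from several random starts and keep the lowest-RSS run."""
    runs = [kmeans(X, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


# Toy data (an assumption for this demo): three round blobs in 2D.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + data_rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

labels, centroids, rss = best_of_restarts(X, k=3)
print("RSS of the best run:", rss)
```

Tracking the RSS across restarts is what lets you compare runs. A single run can land in a local minimum, so the run with the lowest RSS is the one worth keeping.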

Yujian Tang
