Most-Asked Data Science Interview Questions with Answers.
Q1: What's the difference between supervised and unsupervised learning?
Ans: Supervised learning involves labeled data for training,
while unsupervised learning finds patterns in unlabeled data.
Q2: Explain the bias-variance trade-off.
Ans: Bias is error from overly simple assumptions, causing
underfitting; variance is error from excessive model complexity, causing
overfitting. Balancing the two optimizes generalization.
Q3: How do decision trees work?
Ans: Decision trees split data based on features to classify
or predict outcomes; nodes represent decisions, leaves represent outcomes.
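As a minimal sketch of that splitting idea, the hypothetical one-feature "stump" below picks the threshold that minimizes weighted Gini impurity; real libraries (e.g. scikit-learn) apply this recursively over many features.

```python
# Hypothetical sketch: a one-feature decision "stump" chosen by Gini
# impurity. Data below is illustrative.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return the threshold that minimizes weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # splits cleanly between the classes -> 3
```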
Q4: What's regularization in machine learning?
Ans: Regularization prevents overfitting by adding penalties
to model complexity during training, helping generalize to new data.
Q5: Describe the steps of the CRISP-DM process.
Ans: CRISP-DM (Cross-Industry Standard Process for Data
Mining) involves Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, and Deployment.
Q6: How do you handle missing data in a dataset?
Ans: Options include removing affected rows or columns, imputing
values (mean, median, mode), or using model-based techniques such as
regression or k-nearest neighbors.
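The simplest of these options, mean imputation, can be sketched with plain Python lists (pandas' `fillna` applies the same idea to DataFrames); the column values here are illustrative:

```python
# Hypothetical sketch: mean imputation of missing values (None) in one
# column, assuming a simple list-of-numbers representation.

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # -> [1.0, 3.0, 3.0, 3.0, 5.0]
```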
Q7: What's a p-value in statistics?
Ans: The p-value is the probability of observing results at least as
extreme as the data, assuming the null hypothesis is true; lower values
indicate stronger evidence against the null.
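One way to make this concrete is a permutation test, sketched below under illustrative assumptions (the two sample groups, the seed, and the number of permutations are all made up for the example):

```python
# Hypothetical sketch: a permutation test estimating a p-value for the
# difference in group means by shuffling group labels.

import random

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed:  # as extreme as the observed difference
            count += 1
    return count / n_perm

# well-separated groups give a small p-value
print(permutation_p_value([5, 6, 7, 8], [1, 2, 3, 4]))
```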
Q8: Explain the concept of A/B testing.
Ans: A/B testing compares two variants (A and B) of a page, feature,
or model to determine which performs better, using statistical tests to
ensure the observed difference is not due to chance.
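A common statistical check for an A/B experiment on conversion rates is the two-proportion z-test, sketched below with only the standard library (`math.erf` for the normal CDF); the conversion counts are made-up illustrative numbers:

```python
# Hypothetical sketch: two-sided two-proportion z-test for an A/B test.

import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 12% vs 15% conversion on 1000 users each
print(two_proportion_p_value(120, 1000, 150, 1000))
```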
Q9: What's the curse of dimensionality?
Ans: It refers to challenges faced when dealing with
high-dimensional data; increased dimensions can lead to sparsity and increased
computational requirements.
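The sparsity effect can be shown with a one-line calculation: in d dimensions, the fraction of a unit hypercube's volume in the central 90%-per-axis region is 0.9 ** d, so nearly all volume ends up near the boundary as d grows. The dimensions chosen below are illustrative.

```python
# Illustrative calculation: volume fraction of the central 90% region
# of a unit hypercube shrinks exponentially with dimension d.

for d in [1, 10, 100]:
    print(d, 0.9 ** d)
```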
Q10: How does k-means clustering work?
Ans: K-means groups data into 'k' clusters based on
similarity, minimizing the sum of squared distances between data points and
their respective cluster centers.
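The assign-then-update loop can be sketched on 1-D data; the fixed initial centers below are an illustrative simplification (real implementations such as scikit-learn's KMeans use smarter initialization like k-means++):

```python
# Hypothetical sketch: k-means on 1-D data with fixed initial centers
# for reproducibility.

def kmeans_1d(points, centers, n_iter=10):
    for _ in range(n_iter):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[0.0, 5.0]))
# -> [2.0, 11.0]
```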
Q11: Describe the ROC curve and AUC.
Ans: The ROC curve plots the true positive rate against the false
positive rate across classification thresholds; AUC (Area Under the
Curve) summarizes this trade-off as a single performance number.
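AUC has a useful probabilistic reading: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the Mann-Whitney formulation). The sketch below computes it that way; the labels and scores are illustrative:

```python
# Hypothetical sketch: AUC via pairwise comparisons of positive and
# negative scores (ties count as half a win).

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```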
Q12: What's gradient descent?
Ans: Gradient descent is an optimization algorithm that iteratively
adjusts model parameters in the direction of the negative gradient to
minimize the loss function.
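The update rule can be sketched on a toy loss f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the learning rate and iteration count below are illustrative choices:

```python
# Hypothetical sketch: gradient descent minimizing f(x) = (x - 3)^2.

def gradient_descent(x=0.0, lr=0.1, n_steps=100):
    for _ in range(n_steps):
        grad = 2 * (x - 3)   # derivative of the loss at x
        x -= lr * grad       # step against the gradient
    return x

print(gradient_descent())  # converges to the minimum at x = 3
```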
Q13: Explain the term "one-hot encoding."
Ans: One-hot encoding converts a categorical variable into binary
columns, one per category, with a 1 marking the row's category and 0
elsewhere.
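A plain-Python sketch of the idea (pandas' `get_dummies` and scikit-learn's `OneHotEncoder` do this at scale); the color values are illustrative:

```python
# Hypothetical sketch: one-hot encoding a list of categories.

def one_hot(values):
    categories = sorted(set(values))  # one column per category
    return [[1 if v == c else 0 for c in categories] for v in values]

# columns are sorted: ["green", "red"]
print(one_hot(["red", "green", "red"]))  # -> [[0, 1], [1, 0], [0, 1]]
```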
Q14: What's the purpose of a validation set?
Ans: The validation set assesses model performance during
training, helping to prevent overfitting and tune hyperparameters.
Q15: How do you address class imbalance in a dataset?
Ans: Techniques include oversampling the minority class (e.g., SMOTE,
Synthetic Minority Over-sampling Technique), undersampling the majority
class, and using class weights or imbalance-aware evaluation metrics.
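The simplest of these, random oversampling, is sketched below (SMOTE goes further by synthesizing new minority points by interpolation rather than duplicating existing ones); the rows, labels, and seed are illustrative:

```python
# Hypothetical sketch: random oversampling of the binary minority class
# until both classes have equal counts.

import random

def oversample(rows, labels, minority, seed=0):
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    deficit = labels.count(1 - minority) - labels.count(minority)
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return rows + extra, labels + [minority] * deficit

rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
_, balanced = oversample(rows, labels, minority=1)
print(balanced.count(0), balanced.count(1))  # -> 4 4
```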
Q16: Define bias and variance.
Ans: Bias is error due to overly simplistic assumptions;
variance is error due to a model's sensitivity to small fluctuations in
the training data.
Q17: What's the difference between L1 and L2 regularization?
Ans: L1 regularization penalizes the sum of absolute coefficient
values, driving some coefficients to zero (feature selection), while L2
regularization penalizes the sum of squared coefficients, shrinking them
toward smaller values.
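The two penalty terms that get added to the loss function can be written directly; the coefficients and the regularization strength `lam` below are illustrative:

```python
# Hypothetical sketch: the L1 (lasso) and L2 (ridge) penalty terms.

def l1_penalty(coefs, lam=1.0):
    return lam * sum(abs(w) for w in coefs)

def l2_penalty(coefs, lam=1.0):
    return lam * sum(w * w for w in coefs)

coefs = [0.5, -2.0, 0.0]
print(l1_penalty(coefs))  # -> 2.5
print(l2_penalty(coefs))  # -> 4.25
```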
Q18: How does cross-validation work?
Ans: Cross-validation splits data into subsets for training
and validation, iteratively evaluating model performance to ensure
generalization.
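The fold bookkeeping can be sketched with plain index arithmetic (scikit-learn's `KFold` offers the same idea with optional shuffling); the dataset size and fold count below are illustrative:

```python
# Hypothetical sketch: generating k-fold cross-validation index splits.

def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for each of k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

for train, val in k_fold_indices(6, 3):
    print(train, val)
```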
Q19: What is the purpose of a confusion matrix?
Ans: A confusion matrix visualizes true positive, true
negative, false positive, and false negative counts, aiding in model
evaluation.
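Counting the four cells for a binary classifier is a short exercise; the labels below are illustrative:

```python
# Hypothetical sketch: the four confusion-matrix cells for binary labels.

def confusion_matrix(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

print(confusion_matrix([1, 0, 1, 0], [1, 1, 0, 0]))
# -> {'TP': 1, 'TN': 1, 'FP': 1, 'FN': 1}
```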
Q20: How would you handle outliers in a dataset?
Ans: Options include removing outliers, transforming data, or
using robust statistical techniques that are less affected by outliers.
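A common detection rule before choosing any of those options is the 1.5 * IQR rule, sketched below; note the quartiles here use a simple nearest-rank method, so results may differ slightly from numpy's interpolated percentiles, and the data is illustrative:

```python
# Hypothetical sketch: flagging outliers with the 1.5 * IQR rule.

def iqr_outliers(values):
    s = sorted(values)
    q1 = s[len(s) // 4]          # nearest-rank first quartile
    q3 = s[(3 * len(s)) // 4]    # nearest-rank third quartile
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # flags the extreme point -> [95]
```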