Data Science Interview Questions and Answers Set 3

21. How can outlier values be treated?

Outliers can be identified with univariate analysis or another graphical analysis method. If there are only a few outliers, they can be assessed individually; if there are many, the values can be substituted with either the 99th or the 1st percentile value. Not all extreme values are outliers. The most common ways to treat outliers are:

• Change the value to bring it within a range.
• Remove the value.
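
Both treatments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (percentile, cap_outliers, drop_outliers) are hypothetical, the percentile uses simple linear interpolation, and the 1st/99th percentile bounds follow the answer above.

```python
def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an already sorted list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def cap_outliers(values, low_pct=1, high_pct=99):
    """Treatment 1: bring values outside the percentile bounds back to the bounds."""
    ordered = sorted(values)
    low = percentile(ordered, low_pct)
    high = percentile(ordered, high_pct)
    return [min(max(v, low), high) for v in values]


def drop_outliers(values, low_pct=1, high_pct=99):
    """Treatment 2: remove values outside the percentile bounds."""
    ordered = sorted(values)
    low = percentile(ordered, low_pct)
    high = percentile(ordered, high_pct)
    return [v for v in values if low <= v <= high]


# Made-up data: 1..100 plus one extreme value.
values = list(range(1, 101)) + [10_000]
print(max(cap_outliers(values)))
print(len(drop_outliers(values)))
```

Capping keeps the number of rows unchanged, while dropping shrinks the data set, so the choice depends on how much data you can afford to lose.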

22. How can you assess a good logistic model?


There are various methods to assess the results of a logistic regression analysis:

• Using a classification matrix (confusion matrix) to look at the true positives, true negatives, false positives and false negatives.

• Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.

• Lift, which helps assess the logistic model by comparing it with random selection.
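
The three measures above can be sketched as follows. The function names, the 0.5 threshold, the choice to count tied pairs as half concordant, and the sample scores are all assumptions made for illustration, not a standard API.

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Count true/false positives and negatives at a given probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts


def concordance(actual, predicted_prob):
    """Share of (event, non-event) pairs where the event has the higher score.

    Ties are counted as half a concordant pair (an assumption of this sketch).
    """
    events = [p for y, p in zip(actual, predicted_prob) if y == 1]
    non_events = [p for y, p in zip(actual, predicted_prob) if y == 0]
    pairs = len(events) * len(non_events)
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / pairs


def lift_at(actual, predicted_prob, fraction=0.1):
    """Event rate among the top-scored fraction divided by the overall event rate."""
    ranked = sorted(zip(predicted_prob, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    if overall_rate == 0:
        raise ValueError("no events in the data")
    return top_rate / overall_rate


# Made-up labels and model scores.
actual = [1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(confusion_counts(actual, scores))
print(concordance(actual, scores))
print(lift_at(actual, scores, fraction=0.25))
```

A lift above 1 means the top-scored records contain more events than a random selection of the same size would.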

23. What are various steps involved in an analytics project?

• Understand the business problem

• Explore the data and become familiar with it.

• Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.

• After data preparation, run the model, analyze the result and tweak the approach. This step is repeated until the best possible outcome is achieved.

• Validate the model using a new data set.

• Start implementing the model and track the results to analyze the performance of the model over time.

24. How can you iterate over a list and also retrieve element indices at the same time?

This can be done using the built-in enumerate function, which takes any sequence, such as a list, and pairs every element with its index, so that each iteration yields an (index, element) tuple.
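
A short illustration; the list contents are made up for the example:

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, starting at 0 by default.
for index, fruit in enumerate(fruits):
    print(index, fruit)

# The optional start argument shifts the first index.
indexed_from_one = list(enumerate(fruits, start=1))
print(indexed_from_one)
```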

25. During analysis, how do you treat missing values?

First identify the variables with missing values, then determine the extent of the missing values in each. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. There are various factors to be considered when answering this question:

• Understand the problem statement and the data before giving the answer. A default value can be assigned, which can be the mean, minimum or maximum value. Getting into the data is important.

• If it is a categorical variable, the missing value is assigned a default value.

• If the data follows a normal distribution, substitute the mean value.

• Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
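
The points above can be sketched as a single helper. The function name treat_missing, the use of None to mark a missing value, and using the most frequent category as the "default value" for categorical data are assumptions of this sketch; the 80% cut-off comes from the answer above.

```python
from statistics import mean, median, mode


def treat_missing(values, strategy="median", max_missing_share=0.8):
    """Fill missing entries (None) in one variable, or return None to signal 'drop it'.

    strategy: "mean" or "median" for numeric data, "mode" for categorical data.
    """
    if not values:
        return []
    observed = [v for v in values if v is not None]
    missing_share = (len(values) - len(observed)) / len(values)
    # Too little information left: drop the variable instead of imputing.
    if not observed or missing_share >= max_missing_share:
        return None
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [fill if v is None else v for v in values]


print(treat_missing([1, None, 3, 100], strategy="median"))
print(treat_missing(["a", None, "a", "b"], strategy="mode"))
print(treat_missing([None, None, None, None, 1]))
```

The median is used as the default here because it is less sensitive to extreme values than the mean; for roughly normal data the two give similar results.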

26. Can you use machine learning for time series analysis?

Yes, it can be used, but it depends on the application.

27. Write a function that takes in two sorted lists and outputs a sorted list that is their union.

The first solution that will come to mind is to merge the two lists and sort them afterwards.

Python code-

def return_union(list_a, list_b):
    return sorted(list_a + list_b)

R code-
return_union <- function(list_a, list_b) {
  list_c <- list(c(unlist(list_a), unlist(list_b)))
  return(list(list_c[[1]][order(list_c[[1]])]))
}

Generally, the tricky part of the question is that you are not allowed to use any sorting or ordering function. In that case you will have to write your own logic to answer the question and impress your interviewer.

Python code-

def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0  # index into list_b
    k = 0  # index into list_a
    for i in range(len1 + len2):
        if k == len1:
            # list_a is exhausted: append the rest of list_b
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            # list_b is exhausted: append the rest of list_a
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list

A similar function can be written in R by following the same steps.

return_union <- function(list_a, list_b) {
  # Initializing length variables
  len_a <- length(list_a)
  len_b <- length(list_b)
  len <- len_a + len_b

  # Initializing counter variables (j indexes list_a, k indexes list_b)
  j <- 1
  k <- 1

  # Creating an empty list which has length equal to the sum of both lists
  list_c <- vector("list", len)

  # Here goes our for loop
  for (i in seq_len(len)) {
    if (j > len_a) {
      # list_a is exhausted: copy the rest of list_b
      list_c[i:len] <- list_b[k:len_b]
      break
    } else if (k > len_b) {
      # list_b is exhausted: copy the rest of list_a
      list_c[i:len] <- list_a[j:len_a]
      break
    } else if (list_a[[j]] <= list_b[[k]]) {
      list_c[[i]] <- list_a[[j]]
      j <- j + 1
    } else if (list_a[[j]] > list_b[[k]]) {
      list_c[[i]] <- list_b[[k]]
      k <- k + 1
    }
  }
  return(list(unlist(list_c)))
}

28. What is Machine Learning?

The simplest way to answer this question is: we give the data and an equation to the machine, and ask the machine to look at the data and identify the coefficient values in the equation.

For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
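
To make the y = mx + c example concrete, here is a minimal sketch that learns m and c by ordinary least squares. The function name fit_line and the sample points are assumptions for illustration; in practice a library routine would normally be used.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Made-up points that lie exactly on y = 2x + 1.
m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)
```

The machine is never told that the slope is 2 and the intercept is 1; it recovers both from the data alone.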

29. How will you define the number of clusters in a clustering algorithm?

Though the clustering algorithm is not specified, this question is mostly asked in reference to K-Means clustering, where “K” defines the number of clusters. The objective of clustering is to group similar entities in such a way that the entities within a group are similar to each other but the groups are different from each other.

For example, the following image shows three different groups.

The within-cluster sum of squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you will get the plot shown below. The graph is generally known as the Elbow Curve.

The red circled point in the above graph, i.e. Number of Clusters = 6, is the point after which you don’t see any significant decrease in WSS. This point is known as the bending point and is taken as K in K-Means.

This is the most widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
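
The WSS values behind an elbow curve can be computed with a small sketch like the one below. It is a toy one-dimensional K-Means with a deterministic start, written only for illustration; the function names and the data are assumptions, and the toy data has three groups, so its elbow is at K = 3 rather than the K = 6 of the graph discussed above.

```python
def kmeans_1d(values, k, iterations=50):
    """Tiny K-Means on 1-D data; returns (centers, wss)."""
    ordered = sorted(values)
    n = len(ordered)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic start: k points spread evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squares around the final centers.
    wss = sum((v - centers[i]) ** 2 for i, members in enumerate(clusters) for v in members)
    return centers, wss


def wss_curve(values, max_k):
    """WSS for K = 1..max_k, i.e. the points of the elbow curve."""
    return [(k, kmeans_1d(values, k)[1]) for k in range(1, max_k + 1)]


# Made-up data with three visible groups around 1, 5 and 9.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.9, 9.0, 9.1]
for k, wss in wss_curve(data, 4):
    print(k, round(wss, 3))
```

On this data the WSS falls sharply up to K = 3 and barely changes afterwards, which is the bend the elbow method looks for.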

30. Is it possible to perform logistic regression with Microsoft Excel?

It is possible to perform logistic regression with Microsoft Excel. There are two ways to do it using Excel.

• One is to use the add-ins that many websites provide.

• The second is to use the fundamentals of logistic regression and Excel’s computational power to build a logistic regression model.

But when this question is asked in an interview, the interviewer is not looking for the name of an add-in but rather for a method that uses the base Excel functionality.

Let’s use sample data to learn about logistic regression using Excel. (The example assumes that you are familiar with the basic concepts of logistic regression.)

The data shown above consists of three variables, where X1 and X2 are independent variables and Y is the class variable. We have kept only two categories for the purpose of a binary logistic regression classifier.

Next we have to create a logit function using independent variables, i.e.

Logit = L = β0 + β1*X1 + β2*X2
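
As a rough sketch of what this logit is used for, the snippet below computes L for one row and converts it to a probability with the standard logistic (sigmoid) function. The coefficient values and the input row are hypothetical, chosen only to make the arithmetic easy to follow; they are not fitted from the sample data.

```python
import math


def logit(x1, x2, b0, b1, b2):
    """L = b0 + b1*X1 + b2*X2, where b0, b1, b2 stand for the coefficients β0, β1, β2."""
    return b0 + b1 * x1 + b2 * x2


def probability(l):
    """Logistic (sigmoid) function: turns a logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-l))


# Hypothetical coefficients and one hypothetical row (X1 = 1, X2 = 2).
l = logit(1, 2, b0=0.5, b1=1.0, b2=-0.75)
print(l, probability(l))
```

In a spreadsheet, the same two formulas would be applied row by row, with the coefficients kept in separate cells so they can be adjusted.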