Data Science Interview Questions and Answers Set 8

71. What is the procedure to check the cumulative frequency distribution of any categorical variable?

The cumulative frequency distribution of a categorical variable can be checked in R by building a frequency table with table() and passing it to the cumsum() function.

Example –

gender = factor(c("f","m","m","f","m","f"))

y = table(gender)
cumsum(y)

Output of the above R code:

f m
3 6

72. What will be the result of multiplying two vectors in R having different lengths?

The multiplication is performed, and the result is displayed along with a warning such as "longer object length is not a multiple of shorter object length". Suppose there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3); then a * b gives 2 6 6 with the warning. R multiplies the vectors element by element and recycles the shorter one: once b runs out of elements it starts again from its first element, so the last element of a (3) is multiplied by the first element of b (2).

73. The R programming language has several packages for data science, each meant to solve a specific problem. How do you decide which one to use?

The CRAN package repository has more than 6,000 packages, so a data scientist needs to follow a well-defined process and set of criteria to select the right one for a specific task. Before searching CRAN, list all the requirements and issues the package has to address, so that candidate packages can be judged against them.

The best way to answer this question is to look for an R package that follows good software development principles and practices. For example, you might look at the quality of its documentation and unit tests. The next step is to check how the package is used in practice and read the reviews posted by other users. It is important to know whether other data scientists or data analysts have been able to solve a problem similar to yours with it. When in doubt about choosing a particular R package, ask for feedback from R community members or colleagues to make sure you are making the right choice.

74. How can you merge two data frames in R language?

Data frames in R can be combined manually using cbind() (side by side, by columns) or rbind() (stacked, by rows), or joined on common key columns using the merge() function.

75. Explain the usage of which() function in R language.

The which() function returns the positions of the elements in a logical vector that are TRUE. In the example below, we find the row number at which the maximum value of variable v1 is recorded.

mydata <- data.frame(v1 = c(2, 4, 12, 3, 6))
which(mydata$v1 == max(mydata$v1))

It returns 3, because 12 is the maximum value and it sits in the 3rd row of v1.

76. How will you convert a factor variable to numeric in R language?

A factor variable can be converted to numeric using the as.numeric() function in R. However, the variable first needs to be converted to character, because calling as.numeric() directly on a factor does not return the original values; it returns the internal integer codes of the factor levels.

X <- factor(c(4, 5, 6, 6, 4))
X1 <- as.numeric(as.character(X))  # 4 5 6 6 4

77. Name a few libraries in Python used for Data Analysis and Scientific computations.

NumPy, SciPy, pandas, scikit-learn, Matplotlib, Seaborn

78. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?

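Matplotlib is the core Python plotting library, but it needs a lot of fine-tuning to make plots look polished. Seaborn, which is built on top of Matplotlib, helps data scientists create statistical plots that are both informative and visually appealing with less code. The answer to this question depends on the plotting requirements: Matplotlib when fine-grained control is needed, Seaborn when attractive statistical plots are needed quickly.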

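79. Which method in pandas.tools.plotting is used to create a scatter plot matrix?

scatter_matrix. Note that in current pandas releases the function is available as pandas.plotting.scatter_matrix; the older pandas.tools.plotting module is no longer provided.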
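As a minimal sketch of how this might be used, the snippet below assumes a pandas version where the function is importable from pandas.plotting, and that Matplotlib is installed. The data frame and its column names (height, weight, age) are made-up illustrative values, not data from any real source.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display

import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical example data: three numeric columns.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "age": [23, 35, 31, 46, 52],
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))

# The result is a grid of Matplotlib axes, one row and one column per variable.
print(axes.shape)
```

Passing diagonal="kde" instead would draw density estimates on the diagonal.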

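80. How can you check if a data set or time series is random?

To check whether a data set is random, use a lag plot, which plots each value against the value a fixed number of steps (the lag) before it. If the lag plot shows no structure, the data are likely random; a visible pattern, such as points clustering along a line, indicates that successive values depend on each other.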
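The idea behind the lag plot can also be checked numerically. The sketch below uses only the Python standard library: it builds the point pairs a lag plot would draw and computes their correlation. The helper names lag_pairs and lag_correlation are hypothetical, introduced here for illustration; for an actual plot, pandas offers pandas.plotting.lag_plot. A correlation near zero only rules out linear structure, so it complements the visual check and does not replace it.

```python
import random
from statistics import mean


def lag_pairs(series, lag=1):
    """Return the (y[t], y[t + lag]) points that a lag plot would draw."""
    return list(zip(series[:-lag], series[lag:]))


def lag_correlation(series, lag=1):
    """Pearson correlation of the lag-plot points (near 0: no linear structure)."""
    pairs = lag_pairs(series, lag)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Independent Gaussian noise: successive values are unrelated.
noise = [rng.gauss(0, 1) for _ in range(500)]

# Random walk (running sum of the noise): each value depends on the previous one.
walk = []
total = 0.0
for step in noise:
    total += step
    walk.append(total)

noise_corr = lag_correlation(noise)
walk_corr = lag_correlation(walk)
print("noise lag-1 correlation:", round(noise_corr, 3))
print("walk lag-1 correlation:", round(walk_corr, 3))
```

The noise series should give a correlation close to zero, while the random walk should give a value close to one, matching the structureless and strongly linear lag plots the two series would produce.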