ML Model Overview

Introduction to ML Models.

Performing machine learning involves creating a model, which is trained on training data and can then process additional data to make predictions. A machine learning model is a mathematical representation of a real-world process. To generate a machine learning model, you need to provide training data to a machine learning algorithm to learn from.

The model finds patterns in the training data, compares those patterns with the input test data, and gives the output (predictions). The algorithm together with the patterns learned from the training data is called a model. There are several algorithms that can be chosen based on the need.

Data handling

Preparing data files before applying machine learning algorithms takes a great deal of time. Data handling refers to data cleaning and processing. It means handling missing values in the dataset: rows containing missing values can be dropped, or the missing values can be filled with the column average or another suitable method.

Next is dropping duplicate values in the dataset: rows with the same values in all columns can be dropped. Other steps include binning data (i.e. data bucketing, classifying data based on a label value) and detecting and removing outliers, since outliers can distort the true nature of the dataset and affect the output.
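The cleaning steps above (missing values, duplicates, binning, outlier removal) can be sketched with pandas; the toy DataFrame, column names, and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy dataset: one missing value and one obvious outlier (age 500)
df = pd.DataFrame({
    "age":    [25, 30, 28, 30, 500],
    "income": [40, 55, 48, 55, np.nan],
})

# Missing values: either drop the affected rows, or fill with the column mean
dropped = df.dropna()
filled = df.fillna(df.mean())

# Duplicates: rows identical in every column can be dropped
deduped = filled.drop_duplicates().copy()

# Binning: bucket a continuous column under a label value
deduped["age_group"] = pd.cut(deduped["age"],
                              bins=[0, 30, 60, 600],
                              labels=["young", "middle", "senior"])

# Outliers: the IQR rule flags values far outside the quartile range
q1, q3 = deduped["age"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = deduped[deduped["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

Here the duplicate (30, 55) row is dropped, the missing income is replaced by the mean of the remaining incomes, and the age of 500 falls outside the 1.5 × IQR fences and is removed.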

Data Preprocessing

Data preprocessing is an integral step in Machine Learning as the quality of data and the useful information that can be derived from it directly affects the ability of our model to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model.

In a dataset there are almost always a few null values. It does not matter whether the task is regression or classification: a model cannot process data with null values. They can be removed with a drop function or handled by imputation. Standardization transforms the values so that their mean is 0 and their standard deviation is 1.
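Standardization is a one-line transform; a minimal NumPy sketch with an invented feature column:

```python
import numpy as np

# A hypothetical feature column (values invented for illustration)
x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# z now has mean 0 and standard deviation 1
print(z.mean(), z.std())
```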

Handling categorical variables: categorical variables are variables that are discrete rather than continuous. Multicollinearity occurs in a dataset when features are strongly dependent on each other.
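One common way to handle a categorical variable is one-hot encoding; a small pandas sketch (the colour column is invented for illustration). Dropping one dummy column avoids the perfect multicollinearity among the dummies known as the dummy-variable trap:

```python
import pandas as pd

# A hypothetical discrete variable
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encode; drop_first removes one dummy column so the remaining
# dummies are not perfectly collinear
encoded = pd.get_dummies(df, columns=["colour"], drop_first=True)
print(encoded.columns.tolist())   # ['colour_green', 'colour_red']
```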


Types of ML models:

Models are essentially the algorithms used in the machine learning process. There are many algorithms for each type of machine learning. Overall, models fall into three types: supervised learning models, unsupervised learning models, and reinforcement learning models. Each addresses different problem types, and models are built according to the need: the algorithm is trained with the training data to produce the output, which may be based on regression, classification, association, or clustering, and a model lies in one of these types.

Supervised and Unsupervised

In supervised learning the algorithm learns from labelled data, i.e. there is a set of labelled training examples for the dataset. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new input data. The problem types of supervised learning are classification and regression. The output is known in supervised learning. Popular supervised learning algorithms are:

Logistic regression:

Logistic regression is used for predicting an output that is binary. Although it is called regression, it performs classification: based on the regression it classifies the dependent variable into one of the two classes.
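A minimal scikit-learn sketch (the one-feature dataset is invented for illustration; the label is 1 when the feature exceeds 5):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary dataset
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# The regression output (a probability) is thresholded into a class
print(model.predict([[2.5], [7.5]]))   # [0 1]
```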

Support vector machine:

The Support Vector Machine is used for both regression and classification. It is based on the concept of decision planes that define decision boundaries. It performs classification by finding the hyperplane that maximizes the margin between the two classes, with the help of support vectors.
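A scikit-learn sketch with a linear kernel on two well-separated, invented clusters; the support vectors are the points closest to the maximum-margin hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D dataset with two separable classes
X = np.array([[1, 1], [2, 1], [1, 2], [7, 7], [8, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel finds the maximum-margin hyperplane
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)            # points defining the margin
print(clf.predict([[2, 2], [8, 8]]))   # [0 1]
```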

K nearest neighbor:

The K-NN algorithm is one of the simplest classification algorithms. Given data points that are separated into several classes, it predicts the class of a new sample point by classifying new cases based on a similarity measure.
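A scikit-learn sketch: each new point is classified by a majority vote among its k nearest neighbours (the 1-D points are invented for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D points in two classes
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# k = 3: majority vote among the three nearest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2.5], [10.5]]))   # [0 1]
```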

Decision tree classification:

A decision tree builds classification or regression models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
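A scikit-learn sketch on an invented one-feature dataset; because a single threshold separates the classes, the learned tree needs only one split:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: the label depends on whether the feature exceeds 5
X = [[1], [3], [4], [6], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier()
tree.fit(X, y)

print(tree.predict([[2], [7]]))   # [0 1]
print(tree.get_depth())           # 1: one split separates the classes
```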

Unsupervised learning

An unsupervised learning algorithm learns from unlabeled data; it identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. The problem types of unsupervised learning are association and clustering. Unsupervised learning finds the hidden structures in the data. Popular unsupervised algorithms are:

K-means

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

The centroids of the K clusters, which can be used to label new data

Labels for the training data (each data point is assigned to a single cluster)
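Both outputs appear directly in a scikit-learn sketch (the two obvious groups of 2-D points are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data with two obvious groups
X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

print(km.cluster_centers_)    # the centroids of the K clusters
print(km.labels_)             # each training point assigned to one cluster
print(km.predict([[0, 0]]))   # the centroids can label new data
```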

C-means

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method is frequently used in pattern recognition.
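Libraries such as scikit-fuzzy implement FCM, but the update rules can also be sketched directly in NumPy. This is a minimal sketch assuming fuzzifier m = 2 and Euclidean distance, on invented 1-D data with one ambiguous point:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Minimal FCM sketch: every point gets a degree of membership
    in each of the c clusters rather than a single hard label."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)       # memberships sum to 1 per point
    for _ in range(iters):
        um = u ** m
        # Centers are membership-weighted means of the data
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)         # avoid division by zero
        # Membership update: closer centers get higher membership
        inv = dist ** (-2.0 / (m - 1))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

# Hypothetical data: two groups plus a point midway between them
X = np.array([[1.0], [1.2], [5.0], [8.8], [9.0]])
centers, u = fuzzy_c_means(X, c=2)
# The middle point (5.0) belongs partially to both clusters
print(u[2])
```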

How to test your Data?

Testing your data means partitioning the dataset so that it can be used to train and evaluate the model. The dataset is partitioned into training and test data.

The test data must meet the following two conditions:

  •   Is large enough to yield statistically meaningful results.
  •   Is representative of the data set as a whole. In other words, don’t pick a test set with different characteristics than the training set.

The larger the training dataset, the more the model can learn. The dataset can be partitioned in the ratio 80:20. If, after training on the training set, the model gives surprisingly high accuracy on the test set, check the test data: you may find that many examples in the test set are duplicates of examples in the training set. In that case you have inadvertently trained on some of your test data, and you are no longer accurately measuring how well the model generalizes to new data.
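An 80:20 partition, plus a check for the duplicate-leakage problem just described, can be sketched with scikit-learn (the 100-sample dataset is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 100 distinct samples
X = np.arange(100).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# 80:20 split; shuffling keeps the test set representative of the whole
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))   # 80 20

# Leakage guard: the two partitions must not share examples
overlap = set(X_train.ravel()) & set(X_test.ravel())
print(len(overlap))                # 0
```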

Cross-validation techniques:

Random Subsampling

Random subsampling is based on randomly splitting the data into subsets, where the size of the subsets is defined by the user. The random partitioning of the data can be repeated arbitrarily often.
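scikit-learn's ShuffleSplit is one way to sketch this: the user picks the subset size, and the random split is repeated as often as desired (ten hypothetical samples here):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten hypothetical samples; the 70/30 split is repeated three times
X = np.arange(10)
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

sizes = [(len(train), len(test)) for train, test in ss.split(X)]
print(sizes)   # [(7, 3), (7, 3), (7, 3)]
```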

K-Fold Cross-Validation

In k-fold cross-validation, the data is divided into k subsets. The holdout method is then performed k times (the holdout method means partitioning the dataset into train and test sets and using the test set to estimate the model's accuracy, though some error is induced), such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set.
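A scikit-learn sketch with six hypothetical samples and k = 3: each fold of two samples serves as the test set exactly once, while the other four samples form the training set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples, k = 3
X = np.arange(6)
kf = KFold(n_splits=3)

folds = [(list(train), list(test)) for train, test in kf.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```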

Leave-one-out Cross-Validation:

Leave-one-out cross-validation works as follows. Parameter optimization is performed (automatically) on 9 of the 10 data sets, and the performance of the tuned algorithm is then tested on the 10th data set. So, in this step, the 10th data set is the test set and the other nine are the training data for optimizing the free parameters of your algorithm. Now, repeat the process 10 times, each time leaving out a different data set to use as the single test case. You then have test performance for all 10 data sets. That is how leave-one-out cross-validation works.
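In scikit-learn, LeaveOneOut applies the same scheme at the level of individual samples: each sample is held out once as the single test case. A sketch on five hypothetical samples:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Five hypothetical samples
X = np.arange(5)

loo = LeaveOneOut()
n_rounds = 0
for train_idx, test_idx in loo.split(X):
    assert len(test_idx) == 1   # a single held-out case per round
    n_rounds += 1
print(n_rounds)   # 5: one round per sample
```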