The three major building blocks of a Machine Learning system are the model, the parameters, and the learner.
- The model is the system that makes predictions
- The parameters are the factors the model considers to make predictions
- The learner adjusts the parameters, and in turn the model, to align the predictions with the actual results
Let us build on the beer and wine example from above to understand how machine learning works. A machine learning model here has to predict whether a drink is beer or wine. The parameters selected are the colour of the drink and the alcohol percentage. The first step is:
Learning from the training set
This involves taking a sample data set of several drinks for which the colour and alcohol percentage are specified. Now, we have to define a description of each classification, that is, wine and beer, in terms of the values of the parameters for each type. The model can then use this description to decide whether a new drink is wine or beer.
You can represent the values of the parameters, ‘colour’ and ‘alcohol percentage’, as ‘x’ and ‘y’ respectively. Then (x, y) defines the parameters of each drink in the training data. This set of data is called the training set. These values, when plotted on a graph, suggest a hypothesis in the form of a line, a rectangle, or a polynomial that best fits the desired results.
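To make this concrete, here is a minimal sketch of a training set of (x, y) pairs and one very simple hypothesis, a nearest-centroid rule. All colour and alcohol values below are made-up illustrative numbers, not real measurements.

```python
# Each drink in the training set is a ((colour, alcohol %), label) pair.
# The numbers here are invented purely for illustration.
training_set = [
    ((0.9, 13.0), "wine"),  # x = colour, y = alcohol percentage
    ((0.8, 12.0), "wine"),
    ((0.2, 5.0), "beer"),
    ((0.3, 4.5), "beer"),
]

def centroid(points):
    """Average (x, y) position of a list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# One simple "hypothesis": summarise each class by its centre point.
centroids = {}
for label in ("wine", "beer"):
    pts = [p for p, lab in training_set if lab == label]
    centroids[label] = centroid(pts)

def classify(drink):
    """Predict the label whose centroid is closest to the new drink."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(centroids, key=lambda lab: dist2(drink, centroids[lab]))

print(classify((0.85, 12.5)))  # a dark, strong drink lands near the wine cluster
```

A real system would fit a richer decision boundary, but the shape of the workflow is the same: describe each class in terms of the parameter values, then use that description to classify new drinks.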
The second step is to measure error
Once the model is trained on a defined training set, it needs to be checked for discrepancies and errors. We use a fresh set of data to accomplish this task. The outcome of this test would be one of these four:
- True Positive: When the model predicts a condition when it is present
- True Negative: When the model does not predict a condition when it is absent
- False Positive: When the model predicts a condition when it is absent
- False Negative: When the model does not predict a condition when it is present
The sum of false positives (FP) and false negatives (FN) is the total error in the model.
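The four outcomes above can be tallied directly from a test set. The actual and predicted labels below are invented for illustration, with ‘wine’ treated as the positive class:

```python
# Illustrative test-set labels; "wine" is the positive class here.
actual    = ["wine", "beer", "wine", "beer", "wine"]
predicted = ["wine", "wine", "wine", "beer", "beer"]

# Count each of the four outcomes by comparing actual vs predicted labels.
tp = sum(a == "wine" and p == "wine" for a, p in zip(actual, predicted))
tn = sum(a == "beer" and p == "beer" for a, p in zip(actual, predicted))
fp = sum(a == "beer" and p == "wine" for a, p in zip(actual, predicted))
fn = sum(a == "wine" and p == "beer" for a, p in zip(actual, predicted))

total_error = fp + fn
print(tp, tn, fp, fn, total_error)  # 2 1 1 1 2
```

Here one beer was wrongly called wine (FP) and one wine was wrongly called beer (FN), so the total error is 2 out of 5 predictions.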
For the sake of simplicity, we have considered only two parameters here, the colour and the alcohol percentage. But in reality, you will have to consider hundreds of parameters and a broad set of learning data to solve a machine learning problem.
The hypothesis created will then have many more errors because of noise. Noise consists of unwanted anomalies that disguise the underlying relationship in the data set and weaken the learning process. Noise can arise for various reasons:
- Large training data set
- Errors in input data
- Data labelling errors
- Unobservable attributes that might affect the classification but are not considered in the training set due to lack of data
You can accept a certain degree of training error due to noise to keep the hypothesis as simple as possible.
Testing and Generalisation
While it is possible for an algorithm or hypothesis to fit a training set well, it might fail when applied to another set of data outside of the training set. It is therefore essential to figure out whether the algorithm is fit for new data, and testing it on a set of new data is the way to judge this. Generalisation refers to how well the model predicts outcomes for a new set of data.
When we fit a hypothesis algorithm for maximum possible simplicity, it might have less error on the training data but a more significant error while processing new data. We call this underfitting. On the other hand, if the hypothesis is too complicated to accommodate the best fit to the training result, it might not generalise well. This is the case of overfitting. In either case, the results are fed back to train the model further.
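The overfitting extreme can be caricatured in a few lines: a hypothesis that simply memorises the training set gets zero training error yet tells us nothing about new drinks, while a simpler threshold rule generalises. The data points and the 4% threshold below are assumptions for the sake of the sketch:

```python
# Illustrative data: x = alcohol percentage only, for simplicity.
train = [(1.0, "beer"), (2.0, "beer"), (6.0, "wine"), (7.0, "wine")]
test  = [(1.5, "beer"), (6.5, "wine")]

# Overfitting caricature: memorise the training set exactly.
memo = dict(train)
def memorise(x):
    # Perfect on training data, but falls back to a blind guess on anything new.
    return memo.get(x, "beer")

# A simpler hypothesis: a single threshold on alcohol percentage.
def threshold(x):
    return "wine" if x > 4.0 else "beer"

train_err_memo = sum(memorise(x) != y for x, y in train)   # 0: fits training perfectly
test_err_memo  = sum(memorise(x) != y for x, y in test)    # errors appear on new data
test_err_thresh = sum(threshold(x) != y for x, y in test)  # the simple rule generalises
print(train_err_memo, test_err_memo, test_err_thresh)
```

The memorising hypothesis scores 0 errors on the training set but misclassifies unseen drinks, which is exactly the gap between training error and generalisation that testing is meant to expose.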
The typical output of a classification algorithm
The typical output of a classification algorithm can take two forms:
Discrete classifiers. A binary output (YES or NO, 1 or 0) that indicates whether the algorithm has classified the input instance as positive or negative. The algorithm simply says that an application is ‘high potential’ or it is not. This is helpful when no human intervention is expected in the decision-making process, for example if the company has no upper or lower limit on the number of applications considered ‘high potential’.
Probabilistic classifiers. A probabilistic output (a number between 0 and 1) that shows the likelihood that the input falls into the positive class. For example, the algorithm might indicate that an application has a 0.68 probability of being high potential. This is helpful when human intervention is expected in the decision-making process, for example if the company has a limit on the number of applications that can be considered ‘high potential’. The probabilistic output becomes a binary output as soon as a human defines a ‘cutoff’ to determine which instances fall into the positive class.
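Applying a cutoff is a one-line step. In this sketch the application names, probabilities, and the 0.5 cutoff are all assumed values for illustration:

```python
# Illustrative probabilistic outputs for three hypothetical applications.
scores = {"app_1": 0.68, "app_2": 0.31, "app_3": 0.90}

CUTOFF = 0.5  # the human-chosen threshold that turns probabilities into decisions

# Everything at or above the cutoff becomes a positive (binary) classification.
decisions = {
    app: ("high potential" if p >= CUTOFF else "not high potential")
    for app, p in scores.items()
}
print(decisions)
```

Raising or lowering `CUTOFF` trades false positives against false negatives, which is why the choice is usually left to a human who knows how many ‘high potential’ slots the company actually has.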