Cost Function and Loss Function in Machine Learning

As machine learning has grown, so have the concepts associated with it. The driving force behind optimization in machine learning is the feedback from a function internal to the algorithm, called the cost function.

Cost Function

It is a function that measures how well a Machine Learning model performs on a given set of data. The Cost Function quantifies the difference between predicted and actual values and expresses it as a single real number. Cost Functions can be constructed in a variety of ways depending on the problem.

Cost functions are used to estimate how poorly a model performs. Simply put, a cost function is a measure of how wrong the model is in estimating the relationship between X and y. This is usually expressed as a difference or distance between the predicted and actual values.

The term ‘loss’ in machine learning refers to the difference between the predicted and actual value. The “Loss Function” is a function that quantifies this loss as a single real number during the training phase. Loss functions are used in supervised learning algorithms that rely on optimization techniques.

Regression, logistic regression, and other algorithms are examples of this type. The terms “cost function” and “loss function” are often used interchangeably. The goal for a Cost Function is one of the following:

  • Minimum– When the returned value is to be minimised, it is referred to as a cost, loss, or error. The aim is to find the model parameter values for which the Cost Function returns the smallest possible number.
  • Maximum– When the returned value is to be maximised, it is referred to as a reward. The aim is to find the model parameter values for which the returned number is as large as possible.


Take the following scenario: you’re trying to solve a classification problem, that is, you’re trying to sort data into categories. Assume the data describes the weight and height of two different types of fish, which are represented in the scatter plot below by red and blue dots.

To classify the fish into these two groups, you’ll need to use these two attribute values. A scatter plot shows the distribution of the two species of fish. Several candidate solutions to this classification problem are also shown below, alongside the scatter plot:

[Figure: scatter plots of the weights and heights of two species of fish, with candidate classifier lines]

Although all three classifiers have a high degree of accuracy, the third solution is the best since it does not misclassify any points and divides the two groups cleanly. If you look closely, you’ll notice that the line in the top right image misclassifies one red point, whereas the line in the bottom left image misclassifies one blue point.

The line in the bottom right image classifies all of the points correctly. It lies almost exactly midway between the two groups, not closer to either of them, which is why it classifies all the points correctly. This is where the notion of a cost function comes into play: the cost function helps us find the best solution.

The cost function changes depending on the algorithm. For fitting a straight line, the cost function is the sum of squared errors, but it varies from algorithm to algorithm. To minimise the sum of squared errors, we differentiate it with respect to the parameters ‘m’ and ‘c’ and set the derivatives to zero. Solving the resulting linear equations gives the optimal ‘m’ and ‘c’. In most cases, the cost function is something to be minimised.
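
To make the straight-line case concrete, here is a minimal Python sketch of the closed-form solution obtained by setting both derivatives of the sum of squared errors to zero. The function name `fit_line` and the sample data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Setting d(SSE)/dm = 0 and d(SSE)/dc = 0 gives two linear equations
    # in m and c; their solution is the ratio below.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0
```

Because the data here lies exactly on a line, the recovered ‘m’ and ‘c’ reproduce it and the sum of squared errors is zero.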

Minimising or maximising a function is straightforward: differentiate the function with respect to the parameter and set the derivative equal to 0. Then check the second derivative at that point:

  • For minimization – the second derivative should be greater than 0.
  • For maximization – the second derivative should be less than 0.
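
As a worked illustration of the derivative test, consider a hypothetical one-parameter cost function. The function and the symbol θ are assumptions chosen only for this example.

```latex
\begin{align*}
f(\theta)   &= (\theta - 3)^2 + 1 \\
f'(\theta)  &= 2(\theta - 3) = 0 \;\Rightarrow\; \theta = 3 \\
f''(\theta) &= 2 > 0 \;\Rightarrow\; \theta = 3 \text{ is a minimum, with } f(3) = 1
\end{align*}
```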

Applying the Cost Function

The Cost Function has many different formulations, but for this example we will use the Cost Function for Linear Regression with a single variable.

[Figure: cost function formula for Linear Regression with a single variable]


  • m: The number of training examples.
  • Σ: The summation over all training examples.
  • i: The index of the training example and its corresponding output.
  • h: The hypothesis of our Linear Regression Model.

Once calculated, the Cost Function returns a value that corresponds to our Model’s error. The goal is to keep minimising the Cost Function: when we reduce the Cost Function, we reduce the error and, as a result, our Model’s performance improves.

But how can we make the Cost Function as small as possible?
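
The following Python sketch shows one common way to compute such a cost for single-variable linear regression. It assumes the widely used form that averages the squared errors and divides by 2m, and a hypothesis of the form h(x) = theta0 + theta1 * x; the names `hypothesis`, `cost`, `theta0` and `theta1` are illustrative and not taken from the original figure.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h(x) = theta0 + theta1 * x (assumed form)."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Sum of squared errors over m examples, divided by 2m (assumed convention)."""
    m = len(xs)
    squared_errors = sum(
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return squared_errors / (2 * m)


# Hypothetical training data lying on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect = cost(0.0, 2.0, xs, ys)  # parameters that fit the data exactly
worse = cost(0.0, 1.0, xs, ys)    # parameters that underestimate every point
print(perfect, worse)
```

The better-fitting parameters give the lower cost, which is the behaviour the minimisation relies on.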

The Gradient Descent Algorithm is the most common method for minimising the Cost Function.

Gradient Descent

Gradient Descent is an algorithm for minimising the cost function, that is, the model’s error. It is used to find the parameter values at which your model’s error is smallest.

Gradient Descent may be thought of as the path you must follow to make the fewest mistakes. The error in your model varies across parameter values, and you want the quickest way to decrease it in order to avoid wasting resources.

Gradient Descent is analogous to a ball rolling down a slope: the ball rolls to the bottom of the hill. That position can be taken as the point where the error is smallest because, for a bowl-shaped (convex) cost function such as the squared error, the error decreases to a single lowest point before increasing again.

Gradient descent evaluates the error in your model for the current parameter values and then adjusts those values in the direction that reduces the error. This is repeated, and the error becomes smaller and smaller over time. Eventually you arrive at the parameter values with the least error, and the cost function is optimised.
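
The repeated update can be sketched in Python as follows. This is a minimal illustration for single-variable linear regression with a squared-error cost divided by 2m; the learning rate, step count, starting values and data are all assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.1, steps=2000):
    """Fit y = theta0 + theta1 * x by repeatedly stepping against the gradient."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    history = []  # cost before each update, to watch the error shrink
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Partial derivatives of the cost with respect to each parameter
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Move a small step downhill
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1, history


# Hypothetical data lying on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1, history = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))
print(history[0], history[-1])
```

If the learning rate is too large the steps can overshoot and the cost can grow instead of shrink, so in practice the value is tuned for the problem.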


Types of Cost function

There are many cost functions in machine learning and each has its use cases depending on whether it is a regression problem or classification problem.

  1. Regression cost Function

Regression models are used to forecast a continuous variable, such as an employee’s pay, the cost of a car, the likelihood of obtaining a loan, and so on. The cost function used for a regression problem is called a “Regression Cost Function”. These cost functions are computed from a distance-based error, as follows:

Error = y - y’

Where y is the actual output and y’ is the predicted output.

The most widely used regression cost function squares this error before averaging it, which is why it is also known as the squared error function. Because it is simple and works well, it is the most often used cost function for linear regression.

  2. Binary Classification cost Functions

The cost functions used in classification problems are not the same as the cost functions used in regression problems. Cross-entropy loss is a popular classification loss function.

Under the maximum likelihood inference paradigm, it is the preferable loss function mathematically. It is the loss function that should be assessed first, and only altered if there is a compelling reason to do so. For predicting class 1, cross-entropy computes a score that summarises the average difference between the actual and predicted probability distributions. The score is minimised, and a perfect cross-entropy value is 0.

By minimising the overlap between distributions of the soft output for each class, this cost function seeks to reduce the likelihood of classification mistakes. To estimate the distributions from the training data set, the non-parametric Parzen window technique with Gaussian kernels is employed. 

The cost function was implemented in a GRBF neural network and evaluated in a motion detection application using low-resolution infrared pictures, demonstrating certain improvements over the traditional mean squared error cost function as well as the support vector machine, a reference binary classifier.

  3. Multi-class Classification cost Functions

Multi-class classification problems are predictive modelling problems in which examples are assigned to one of more than two classes.

The problem is frequently framed as predicting an integer value, with each class given a different integer value ranging from 0 to (number of classes – 1). Predicting the probability of an example belonging to each known class is a common way to solve the problem.

Categorical cross-entropy, which is nothing more than the mean of the cross-entropy across all N training examples, is used to calculate the classification error for the whole model. For multi-class classification problems, the default loss function is cross-entropy.

It is designed for use with multi-class classification in this scenario, where the target values are in the set {0, 1, 2, …, n} and each class is given a distinct integer value.

Under the maximum likelihood inference paradigm, it is the preferable loss function mathematically. It is the loss function that should be assessed first, and only altered if there is a compelling reason to do so.

For all classes in the problem, cross-entropy will produce a score that summarises the average difference between the actual and predicted probability distributions. The score is minimised, and a perfect cross-entropy value is 0.

Loss Functions

A loss function takes a theoretical proposition to a practical one. Building a highly accurate predictor requires constant iteration of the problem through questioning, modeling the problem with the chosen approach and testing.

The only criterion by which a statistical model is scrutinized is its performance – how accurate the model’s decisions are. This calls for a way to measure how far a particular iteration of the model is from the actual values. This is where loss functions come into play.

Loss functions measure how far an estimated value is from its true value. A loss function maps decisions to their associated costs. Loss functions are not fixed; they change depending on the task at hand and the goal to be met.

Loss functions for regression

Regression involves predicting a specific value that is continuous in nature. Estimating the price of a house or predicting stock prices are examples of regression because one works towards building a model that would predict a real-valued quantity.

Let’s take a look at some loss functions which can be used for regression problems and try to draw comparisons among them.

Mean Absolute Error (MAE)

Mean Absolute Error (also called L1 loss) is one of the simplest yet most robust loss functions used for regression models.

Regression problems may have variables that are not strictly Gaussian in nature due to the presence of outliers (values that are very different from the rest of the data). Mean Absolute Error would be an ideal option in such cases because it does not take into account the direction of the outliers (unrealistically high positive or negative values).

As the name suggests, MAE takes the average of the absolute differences between the actual and the predicted values. For a data point with actual value yi and predicted value ŷi, n being the total number of data points in the dataset, the mean absolute error is defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$


Mean Squared Error (MSE)

Mean Squared Error (also called L2 loss) is almost every data scientist’s preference when it comes to loss functions for regression. This is because most variables can be modeled with a Gaussian distribution.

Mean Squared Error is the average of the squared differences between the actual and the predicted values. For a data point Yi and its predicted value Ŷi, where n is the total number of data points in the dataset, the mean squared error is defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$
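
To compare the two losses side by side, here is a minimal Python sketch that computes both on the same predictions. The function names and the sample values, including the one badly missed prediction, are assumptions for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values; the last prediction misses by 10
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 19.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(mae)  # 3.0
print(mse)  # 25.5
```

The single large miss contributes 10 to the MAE sum but 100 to the MSE sum, which shows how squaring magnifies the influence of outliers.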


Mean Bias Error (MBE)

Mean Bias Error is used to calculate the average bias in the model. Bias, in a nutshell, is overestimating or underestimating a parameter. Corrective measures can be taken to reduce the bias after evaluating the model using MBE.

Mean Bias Error takes the actual difference between the target and the predicted value, and not the absolute difference. One has to be cautious as the positive and the negative errors could cancel each other out, which is why it is one of the lesser-used loss functions.

The formula of Mean Bias Error is:

$$\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)$$

Where yi is the true value, ŷi is the predicted value and ‘n’ is the total number of data points in the dataset.
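
The cancellation effect is easy to see in a small Python sketch. It assumes the sign convention of true value minus predicted value; the function names and sample values are illustrative.

```python
def mean_bias_error(y_true, y_pred):
    """Average signed difference (true minus predicted); errors can cancel."""
    return sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions that miss by +2, -2, +3 and -3
y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [12.0, 8.0, 13.0, 7.0]

mbe = mean_bias_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mbe)  # 0.0
print(mae)  # 2.5
```

Every prediction here is wrong, yet the MBE is zero because the over- and under-estimates cancel, while the MAE still reports the average size of the misses.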

Mean Squared Logarithmic Error (MSLE)

Sometimes, one may not want to penalize the model too much for predicting unscaled quantities directly. Relaxing the penalty on huge differences can be done with the help of Mean Squared Logarithmic Error.

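Mean Squared Logarithmic Error is calculated in the same way as Mean Squared Error, except that the natural logarithm of both the actual and the predicted values is taken before computing the squared difference. In the common convention, 1 is added to each value first so that the logarithm is defined at zero:

$$\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2$$
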
Where yi is the true value, ŷi is the predicted value and ‘n’ is the total number of data points in the dataset.
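
A minimal Python sketch of this loss follows. It assumes the common convention of adding 1 inside the logarithm, so that zero values are allowed, and uses hypothetical sample values to show how the penalty on a large absolute difference is relaxed.

```python
import math


def mean_squared_log_error(y_true, y_pred):
    """Mean of squared differences between log(1 + true) and log(1 + predicted)."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Plain mean squared error, shown for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical large-valued target predicted at twice its size
y_true = [1000.0]
y_pred = [2000.0]

msle = mean_squared_log_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(msle, mse)
```

The squared error grows with the raw difference of 1000, while the logarithmic version depends mainly on the ratio between the two values and stays small.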

Huber Loss

A comparison between L1 and L2 loss yields the following results:

  1. L1 loss is more robust than its counterpart.

On taking a closer look at the formulas, one can observe that if the difference between the predicted and the actual value is high, L2 loss magnifies the effect when compared to L1. Since L2 succumbs to outliers, L1 loss function is the more robust loss function.

  2. L1 loss is less stable than L2 loss.

Since L1 loss deals with the difference in distances, a small horizontal change can lead to the regression line jumping a large amount. Such an effect taking place across multiple iterations would lead to a significant change in the slope between iterations.

On the other hand, MSE ensures the regression line moves only slightly for a small adjustment in the data point.

Huber Loss combines the robustness of L1 with the stability of L2, essentially the best of L1 and L2 losses. For huge errors, it is linear and for small errors, it is quadratic in nature.

Huber Loss is characterized by the parameter delta (𝛿). For a prediction f(x) of the data point y, with the characterizing parameter 𝛿, Huber Loss is formulated as:

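$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}\,(y - f(x))^2 & \text{if } |y - f(x)| \le \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\,\delta^2 & \text{otherwise} \end{cases}$$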
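
The piecewise behaviour can be sketched in Python as below, using the standard definition of Huber loss: quadratic when the absolute error is at most delta, linear beyond it. The function name and the default `delta=1.0` are assumptions for illustration.

```python
def huber_loss(y, prediction, delta=1.0):
    """Huber loss for a single data point y and its prediction."""
    error = abs(y - prediction)
    if error <= delta:
        # Small errors: quadratic, like L2 loss
        return 0.5 * error ** 2
    # Large errors: linear, like L1 loss
    return delta * error - 0.5 * delta ** 2


small = huber_loss(0.0, 0.5)  # error below delta
large = huber_loss(0.0, 3.0)  # error above delta
print(small)  # 0.125
print(large)  # 2.5
```

The two branches meet at an error of exactly delta, so the loss has no jump where it switches from quadratic to linear.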

Loss functions for classification

Classification problems involve predicting a discrete class output. It involves dividing the dataset into different and unique classes based on different parameters so that a new and unseen record can be put into one of the classes.

An email can be classified as spam or not spam, and a person’s dietary preferences can be put in one of three categories – vegetarian, non-vegetarian and vegan. Let’s take a look at loss functions that can be used for classification problems.

Binary Cross Entropy Loss

This is the most common loss function used for classification problems that have two classes. The word “entropy”, seemingly out-of-place, has a statistical interpretation.

Entropy is the measure of randomness in the information being processed, and cross entropy is a measure of the difference of the randomness between two random variables.

As the predicted probability diverges from the actual label, the cross-entropy loss increases. Going by this, predicting a probability of 0.011 when the actual observation label is 1 would result in a high loss value. In an ideal situation, a “perfect” model would have a log loss of 0. Looking at the loss function makes this even clearer:

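$$J(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log\big(h_\theta(x_i)\big) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big) \right]$$

Where yi is the true label, hθ(xi) is the probability predicted by the hypothesis, and n is the number of data points.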

Since binary classification means the labels take the value 0 or 1, if yi = 0 the yi term vanishes, and if yi = 1 the (1-yi) term becomes 0, so only one term contributes for each data point.
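
A minimal Python sketch of binary cross-entropy follows. The clipping constant `eps`, used to avoid taking the logarithm of 0, and the sample probabilities are assumptions for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        # Only one of the two terms is non-zero for each label
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Confident and correct predictions give a small loss
confident_right = binary_cross_entropy([1, 0], [0.99, 0.01])
# Predicting 0.011 when the label is 1 gives a large loss
confident_wrong = binary_cross_entropy([1], [0.011])
print(confident_right, confident_wrong)
```

The second call reproduces the situation described above, where a probability of 0.011 for a true label of 1 is penalised heavily.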

Pretty clever, isn’t it?

Categorical Cross Entropy Loss

Categorical Cross Entropy loss is essentially Binary Cross Entropy Loss expanded to multiple classes. One requirement when categorical cross entropy loss function is used is that the labels should be one-hot encoded.

This way, only one element will be non-zero, as the other elements in the vector are multiplied by zero. This property is extended to an activation function called softmax.
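
The effect of one-hot encoding can be seen in a short Python sketch. It assumes raw class scores are first turned into probabilities with softmax; the function names, the clipping constant `eps` and the sample scores are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy for one example with a one-hot encoded label."""
    # Terms for the zero entries of the label vanish; only the true class remains
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical scores for three classes; the true class is the first one
scores = [2.0, 1.0, 0.1]
label = [1, 0, 0]

probs = softmax(scores)
loss = categorical_cross_entropy(label, probs)
print(probs)
print(loss)
```

Because the label is one-hot, the loss reduces to the negative logarithm of the probability assigned to the true class.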

Hinge Loss

Another commonly used loss function for classification is the hinge loss. Hinge loss was primarily developed for support vector machines, for calculating the maximum margin from the hyperplane to the classes.

Loss functions penalize wrong predictions and do not penalize right ones. So, the score of the target label should be greater than the score of each incorrect label by a margin of at least one.

This margin is the maximum margin from the hyperplane to the data points, which is why hinge loss is preferred for SVMs. The following image clears the air on what a hyperplane and the maximum margin are:


The mathematical formulation of hinge loss is as follows:

$$L_i = \sum_{j \ne y_i} \max\left(0,\; s_j - s_{y_i} + 1\right)$$

Where syi is the score of the true class and sj is the score of each other class j.

Hinge Loss is also extended to Squared Hinge Loss Error and Categorical Hinge Loss Error.
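
A minimal Python sketch of hinge loss is given below, assuming the multi-class SVM form in which the true-class score should exceed every other class score by the margin. The function name, the `margin` parameter and the sample scores are assumptions for illustration.

```python
def multiclass_hinge_loss(scores, true_index, margin=1.0):
    """Sum of margin violations of the incorrect classes for one example."""
    true_score = scores[true_index]
    return sum(
        max(0.0, s - true_score + margin)
        for j, s in enumerate(scores)
        if j != true_index
    )


# Hypothetical class scores; the true class is index 0 in both cases
violated = multiclass_hinge_loss([3.2, 5.1, -1.7], true_index=0)
satisfied = multiclass_hinge_loss([3.2, 1.0, -1.7], true_index=0)
print(violated, satisfied)
```

In the first case an incorrect class outscores the true class, so the loss is positive; in the second the true class wins by more than the margin and the loss is zero.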

Kullback Leibler Divergence Loss (KL Loss)

Kullback Leibler Divergence Loss is a measure of how a distribution varies from a reference distribution (or a baseline distribution). A Kullback Leibler Divergence Loss of zero means that both the probability distributions are identical.

The amount of information lost in the predicted distribution is used as a measure. The KL Divergence of a distribution P(x) from Q(x) is given by:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}$$
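
For discrete distributions, the divergence can be sketched in Python as follows. The sketch assumes both distributions are given as lists of probabilities over the same outcomes and that Q is non-zero wherever P is non-zero; the function name and sample distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL divergence of discrete distribution p from q, in nats."""
    # Outcomes with p_i = 0 contribute nothing to the sum
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over two outcomes
p = [0.5, 0.5]
q = [0.9, 0.1]

same = kl_divergence(p, p)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(same)  # 0.0
print(kl_pq, kl_qp)
```

Identical distributions give a divergence of zero, and swapping the two arguments generally changes the result, so the measure is not symmetric.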