Perceptrons as linear classifiers — 01

Vishal Jain
6 min read · Feb 7, 2021

Motivation

The general task of any model is to take in some known quantities and predict some unknown quantity we care about.

It might be predicting the price of a house given its location and number of bedrooms, or predicting the probability of passing a test given previous mock test scores. The general idea is the same: take in some things we know, and figure out something we don't.

So how do we go about building these models? Let's take the following problem: we want to predict whether a plant will grow, given two input features, the time spent in the sun and the amount of water received. If we were to go out, gather a bunch of data and then plot it, it might look something like this:

The points represent all our data, with the red points being all the plants that didn’t grow and the green points being all the plants that did end up growing. We now want to use our data to find out if the black point represents a plant that will grow in the future.

We can make out by eye that there seems to be a dividing line separating the plants that grew from the ones that didn't! This line will classify our black data point and tell us, based on our data, that it should grow.

This dashed pink line is, in fact, our model!

Introducing the perceptron

So how do we represent these classifying lines? We can do this by using what's called a perceptron. A perceptron is just a graph-like diagram which represents a linear equation (our dashed pink line).

So, to start giving this all some mathematical structure, let's denote the input features by the variable x and call our output y. A general perceptron with n input features looks like this:

We're going to break this down into three steps: the inputs, the summation, and finally the activation.

The inputs

The variable x represents our input features; in our previous case, these would be the time spent in the sun and the amount of water given to the plants. Each of the n input features is multiplied by a corresponding weight (denoted by the w variables). A weight is just a number which determines how important the corresponding feature is in determining the output. Going back to our example, if it turned out that we lived on a different planet where sunlight wasn't necessary, the weight for that input feature (time in the sun) would be 0. Visually, it would mean our data looked like this:

You can see that it doesn't matter how much time is spent in the sun; the only thing that determines whether a plant will grow is the amount of water received. In this case, the time-in-sun weight would be 0. You can imagine the opposite case, where plants don't need any water and the outcome depends only on time in the sun; that would correspond to a water weight of 0.
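To make that concrete with some made-up numbers (purely illustrative, not taken from any real dataset): each feature contributes weight × feature to the perceptron's sum, which we'll define properly below. If the time-in-sun weight is 0, then 0 × (hours of sun) = 0 no matter how sunny it is, so sunlight simply drops out of the calculation, while a water weight of, say, 2 means every extra unit of water adds 2 to the sum.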

You might notice that not all our input features are denoted by the variable x: we seem to have an input feature ‘1’ that's multiplied by a ‘b’.

This is known as the bias term. True to its name, it acts as a sort of offset: its purpose is to bias the output. Consider the example above, where humanity has succeeded in its endeavour of interplanetary travel and we live in a world where plants don't need the sun to grow. Increasing the bias term would have the following effect:

Left: small bias term. Right: larger bias term

A model with a larger bias term would come from a dataset where plants need more water to grow (pic on the right).

The summation and activation

Now that we've introduced all the terms on the left-hand side and their corresponding impact on the model, let's see how these terms are actually combined to generate a prediction. This is where we will now place our attention.

Given a set of input features {x1, x2, …, xn}, weights {w1, w2, …, wn} and a corresponding bias term b, the perceptron multiplies each input feature (x) by its corresponding weight (w), sums the results, and finally adds the bias at the end. In our humble two-feature case, this step would look like:
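In other words, writing x1 for the time in the sun and x2 for the amount of water, the summation step computes y = w1·x1 + w2·x2 + b.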

This output, y, is then passed through what's called an activation function. An activation function is a simple function which takes the output of the previous step, y, as input and performs some sort of normalisation. This normalised value is the final output of the perceptron.

In our case, we want a step function as our activation. This is because we want to classify all points above our line as ‘will grow’ and all points below the line as ‘won't grow’. This looks like the following:
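A common convention (assumed here; the exact threshold is a choice) is that the step function outputs 1 whenever its input is at or above zero and 0 otherwise: step(y) = 1 if y ≥ 0, else 0.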

So, to predict whether a plant will grow given a proposed time in the sun and amount of water, we first compute the output of the summation step, y, and then pass it through our activation function (the step function). If the final normalised output is 1, we say the plant will grow; if it is 0, we say the plant will not grow. This is our model!
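To make the whole pipeline concrete, here is a minimal sketch in Python. The parameter values (w_sun, w_water, b) and the feature values below are made up purely for illustration; they are not fitted to any real data.

```python
def step(y):
    """Step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if y >= 0 else 0

def perceptron(sun_hours, water_litres, w_sun, w_water, b):
    """Two-feature perceptron: weighted sum followed by a step activation."""
    y = w_sun * sun_hours + w_water * water_litres + b  # the summation step
    return step(y)                                      # the activation step

# Illustrative, hand-picked parameters (not learned from data): more sun and
# more water both push the sum up, and the negative bias means a plant needs
# a certain amount of both before the perceptron predicts it will grow.
w_sun, w_water, b = 0.5, 1.0, -4.0

print(perceptron(6, 3, w_sun, w_water, b))  # 0.5*6 + 1.0*3 - 4.0 =  2.0 -> 1 (grows)
print(perceptron(1, 1, w_sun, w_water, b))  # 0.5*1 + 1.0*1 - 4.0 = -2.5 -> 0 (doesn't grow)
```

How to find good values for the weights and the bias automatically, rather than guessing them by hand, is exactly one of the questions raised in the next steps below.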

Summary

So far in this post, all that's been discussed is the general idea behind a perceptron, the basic building block of vanilla neural networks.

We’ve talked about the motivation behind building models. We’ve seen how, in the case where the data can be separated into two groups by a line (in general, an (n-1)-dimensional surface, where n is the number of input features), the underlying model can be represented by a perceptron.

Next steps

So far, we’ve only talked about cases where you can separate the data into two classes in a linear fashion.

source: brilliant

What happens in the case above, when our dataset isn’t easily separable into two groups by a simple line?

How do we actually find the best weights and biases for our model?

What happens if we have 3 or more classes we want to separate?

How can we modify the perceptron to work on regression problems?

Comment your answers :)
