Image recognition
In this article, we will explain Convolutional Neural Networks (CNNs) and why they are particularly useful for image recognition.
What is a convolution?
In mathematics, convolution means taking one function and applying another to it, which produces a third function.
If you take a photo and apply a filter to the image, you have just applied a convolution.
Neural Network
Our brains are a massive neural network: it is relatively easy and quick for them to recognize objects within images. An artificial neural network can be compared to this: it consists of neurons, each of which has a mathematical activation function.
Neural networks consist of layers. The input layer holds the input data: the information coming in. The output layer holds the outputs. Between the input and output layers are the hidden layers; there can be one or many, and each hidden layer may contain a different number of neurons.
Neural networks are trained with data. The first step is the feed forward: an output is produced by passing the input through the network with its initial weights. This output is compared with the correct value to calculate the network's error using a loss function. The optimizer then adjusts the network's weights slightly, based on its algorithm; this adjusting of the weights is called backpropagation. As we repeat this process, the predicted outputs fit the measured output values better and better. It is important to note that without any training, the neural network has little intelligence.
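To make this concrete, here is a minimal sketch of that training loop in Python, assuming a single neuron with a sigmoid activation, a mean-squared-error loss and plain gradient descent. The data, learning rate and number of epochs are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy inputs (4 samples, 2 features) and the "correct" target outputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 1))   # initial weights
bias = 0.0
learning_rate = 0.5

for epoch in range(1000):
    # Feed forward: produce an output with the current weights.
    output = sigmoid(X @ weights + bias)
    # Loss function: compare the output with the correct values.
    error = output - y
    loss = np.mean(error ** 2)
    # Backpropagation: compute gradients and adjust the weights slightly.
    grad = error * output * (1.0 - output)
    weights -= learning_rate * (X.T @ grad) / len(X)
    bias -= learning_rate * grad.mean()

print(loss)   # the loss shrinks as the fit improves
```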
Convolutional Neural Network (CNN)
We have now discussed convolution and neural networks. Neural networks take a bunch of data, pass it through neurons, and fit the network by optimizing the weights.
The purpose of the convolution is to extract useful features from the image, so that only the relevant information is passed on to the rest of the neural network.
This results in a more efficient algorithm. CNNs are particularly useful for image data: they automatically detect important features in the image, regardless of where those features are located within the image.
A CNN applies a filter to the original image. This filter is called a kernel, and it is chosen so that the resulting image presents its information in a more efficient manner.
If we had a 9 x 9 pixel image matrix, we would convolve the original image into a new image matrix before passing it to the rest of the neural network.
Let’s take the example below. The image includes lots of pixels, each with a different shade of color. There are two main objects: a blue sky and a red building. In fact, a large portion of the area is covered by different shades of blue that our brain simply associates with the sky, without analyzing every single pixel value. It discards most of this information, and that makes sense. We know there is a building in the image, and it is characterized by the edges in the image. The black rooftop generates an edge against the blue sky. What happens here? There is a sudden change of pixel intensity, where the constant blue pixels suddenly change to black pixels. If we applied a filter that highlighted these edges over the less valuable image regions, the neural network would become more efficient to train.
Kernel
A filter is easy to understand: we all know how to use a filter on a mobile phone to adjust the lighting of an image taken in poor lighting. More precisely, this filter is called a kernel.
A kernel is a matrix that we use to transform the image pixels into new ones. There is no single "correct" kernel, but different types of kernels. One common kernel is the Sobel-Feldman operator, used to detect edges in an image.
Suppose we have an original image of 9 x 9 pixels and a kernel of size 3 x 3 pixels. If we apply this kernel, it produces a new image. What is the size of the resulting image? It depends on the way we apply the kernel.
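As a rough sketch, the following Python code slides a 3 x 3 Sobel kernel over an invented 9 x 9 image with no padding and a stride of one, giving a 7 x 7 result. Like most CNN libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

# Invented image: left half dark "sky", right half bright "building",
# so there is a sharp vertical edge between columns 4 and 5.
image = np.zeros((9, 9))
image[:, 5:] = 1.0

# Horizontal Sobel kernel, which responds to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

k = sobel_x.shape[0]
out_size = image.shape[0] - k + 1        # 9 - 3 + 1 = 7
result = np.zeros((out_size, out_size))

# Slide the kernel over the image one step at a time.
for i in range(out_size):
    for j in range(out_size):
        region = image[i:i + k, j:j + k]
        result[i, j] = np.sum(region * sobel_x)

print(result.shape)   # (7, 7): the edge shows up as large values
```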
Max pooling
Max pooling refers to taking the maximum value from the region the kernel covers at each step. The first step starts at the top-left corner, where the kernel covers a portion of the original image equal to its own size. The maximum value from this portion is sent to the resulting image. If our kernel size is 3 x 3 pixels, those nine pixels result in a single pixel, holding the maximum value, in the output image.
The kernel is then moved one step to the right and the process above is repeated. Once it has reached the right border, the kernel returns to the left side, moves one step down, and the entire process is repeated.
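A minimal max-pooling sketch in Python could look like the following; the 9 x 9 input values are invented, and the stride parameter is explained in the next section.

```python
import numpy as np

def max_pool(image, window=3, stride=1):
    # Take the maximum value inside each window as it slides over the image.
    n = image.shape[0]
    out_size = (n - window) // stride + 1
    pooled = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            region = image[i * stride:i * stride + window,
                           j * stride:j * stride + window]
            pooled[i, j] = region.max()
    return pooled

image = np.arange(81, dtype=float).reshape(9, 9)
print(max_pool(image).shape)   # (7, 7) with a 3 x 3 window and stride 1
```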
Stride
Above, we performed max pooling assuming the stride is one. However, we can perform max pooling with other stride values. For example, if we select stride = 2, then at each step the kernel is moved two positions to the right instead of just one.
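As a quick check, for a square input with no padding the output size follows (n - window) // stride + 1; a small worked example:

```python
# Output size of a 9 x 9 image with a 3 x 3 window for different strides,
# assuming no padding.
n, window = 9, 3
for stride in (1, 2, 3):
    print(stride, (n - window) // stride + 1)   # 1 -> 7, 2 -> 4, 3 -> 3
```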
Zero padding
The kernel transforms the image data differently at the borders than in the inner part of the image. The kernel starts from the corner and reads the corner pixel only once, while the pixels in the inner part are processed multiple times. Zero padding adds extra pixels around the borders: e.g. a 9 x 9 pixel image is first converted to an 11 x 11 pixel image before the kernel is applied.
Zero padding is applied before max pooling. It can help keep important information at the borders, and it is sometimes useful because it lets the resulting image stay the same size as the original image.
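For illustration, zero padding can be done with NumPy's pad function; here one layer of zeros turns a 9 x 9 image into an 11 x 11 image before the kernel is applied.

```python
import numpy as np

image = np.ones((9, 9))

# Add one layer of zeros around every border of the image.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)

print(padded.shape)   # (11, 11)
print(padded[0, :5])  # the new border row is all zeros
```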
Normalization
Normalization refers to bringing the variables used in our models into the same range of numbers, regardless of which variable is being used. If variable ranges differ significantly, we should first normalize them before applying them to the same machine learning model. For example, in a specific dataset income might range from 10,000 USD to 100,000 USD, while age only differs between 20 and 75 years.
Normalization can be applied in image recognition, too. Each image consists of a matrix of pixels, each of which has a particular shade of color. For example, a grayscale image has 256 different shades of grey. Thus, we can normalize the image by dividing its pixel values by the maximum value, 255, so that every pixel falls between 0 and 1.
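A quick sketch of this in Python, with invented pixel values:

```python
import numpy as np

# Grayscale pixel values in the range 0-255.
image = np.array([[0, 64, 128],
                  [192, 255, 32],
                  [16, 8, 240]], dtype=np.float32)

# Divide by the maximum value so every pixel falls between 0 and 1.
normalized = image / 255.0
print(normalized.min(), normalized.max())   # 0.0 1.0
```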
Color channels
Our image so far has been grayscale, where each pixel's shade can be one of 256 values. However, real-life pictures include many more colors. With color images, we can handle this by using multiple matrices. For example, we could represent our image colors as RGB values, meaning we would get three matrices, each containing its own range of color shades for Red, Green and Blue. In this case, we would have three color channels.
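For illustration, an RGB image stored as a (height, width, 3) array can be split into its three channel matrices like this; the pixel values here are random.

```python
import numpy as np

rng = np.random.default_rng(0)
# A random 9 x 9 RGB image: three color channels per pixel.
image = rng.integers(0, 256, size=(9, 9, 3))

red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]
print(red.shape, green.shape, blue.shape)   # three 9 x 9 matrices
```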
Flattening
Flattening refers to converting the data into one dimension, as an input to the next layer. So far our data is still a matrix, where each pixel represents the color intensity. With an RGB image we would in fact have three different matrices, each holding the data for a specific color channel.
For example, if our images are 7 x 7 pixels after max pooling and have three color channels (RGB), then the input vector would have 7 * 7 * 3 = 147 elements.
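A small sketch of this flattening step in Python:

```python
import numpy as np

# A 7 x 7 image with three color channels, e.g. the output of max pooling.
pooled = np.zeros((7, 7, 3))

# Flatten it into a one-dimensional input vector of 7 * 7 * 3 = 147 values.
flattened = pooled.flatten()
print(flattened.shape)   # (147,)
```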
Fully connected layer
Let’s summarize the steps performed so far. We started with an original image. We added zero padding so that we do not lose any data at the image borders. We then applied the kernel to obtain a convolved image that highlights the edges. This output image was finally flattened into an input vector for the neural network.
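As a closing sketch, the whole pipeline can be expressed in Keras; the input size (9 x 9 grayscale) and the number of output classes (10) are assumptions made here for illustration, not values from the article.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Convolution with zero padding ('same') so border information is kept.
    layers.Conv2D(8, (3, 3), activation='relu', padding='same',
                  input_shape=(9, 9, 1)),
    # Max pooling with a 3 x 3 window.
    layers.MaxPooling2D(pool_size=(3, 3)),
    # Flatten the pooled feature maps into a one-dimensional input vector.
    layers.Flatten(),
    # Fully connected layer producing the class predictions.
    layers.Dense(10, activation='softmax'),
])

model.summary()
```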