A simple way to understand and implement object detection from scratch with a pure CNN

Using MNIST as the dataset to build object detection from scratch.

Published in tree.rocks · Dec 27, 2021

This article is a tutorial on building a deep learning object detection model from scratch.
I will explain every idea and every piece of code along the way.
We will use only a Convolutional Neural Network (CNN) to locate and recognize digits, the way an object detector does.

I won’t provide any pretrained model or weights files in this article.
You need to train the model yourself ( recognizing MNIST handwritten digits is the example ).

Before getting started, you need some background knowledge of TensorFlow and Convolutional Neural Networks (CNNs).

BTW, there is no Intersection over Union (IoU) or any complex math in the code :)

The idea

If you have ever used a digital single-lens reflex camera (DSLR), you may have noticed that the viewfinder is interesting.

As you can see, the points are called “focus points.”
Now let’s think about doing the same thing with a CNN.

The object exists probability

Think of the CNN as the lens of a camera and the focus points as the “object exists probability” points.

Let’s look at traditional image classification first.
It uses a CNN to classify the image.

Traditional image classification with CNN

Assume our input image is 64x64 pixels. Somewhere in the network there is a layer (green text in the figure above) whose output is an 8x8 pixel image ( we don’t care about channels here ).
That’s what we want: like the DSLR focus points ( probability points ), it tells us which “pixel” has detected an object.

Typically an RGB image has three channels ( red, green, blue ); here we output a single channel as the probability.
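To make the focus-point analogy a bit more concrete, here is a minimal sketch of which output “pixel” is responsible for a given input pixel. It assumes each cell of the 8x8 output simply covers an 8x8 block of the 64x64 input, which is a simplification of what a CNN really sees. The helper name `grid_cell` is made up for this illustration.

```python
def grid_cell(x, y, image_size=64, grid_size=8):
    """Return (row, col) of the output cell that covers input pixel (x, y)."""
    cell = image_size // grid_size  # 8 input pixels per output cell
    return y // cell, x // cell


print(grid_cell(0, 0))    # (0, 0)
print(grid_cell(37, 20))  # (2, 4)
print(grid_cell(63, 63))  # (7, 7)
```

So an object around pixel (37, 20) should light up the probability point at row 2, column 4.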

About the bounding box and the classification.

As shown above, we can use a one-channel image as the probability of the object’s existence.

What about the bounding box? As we know, a standard bounding box has at least four numbers ( x1, y1, x2, y2; some people use x1, y1, width, height instead, it doesn’t matter ).

Let’s do the same thing as for the object-existence probability: we use four channels to represent x1, y1, x2, y2.

We now have five channels:

[probability, x1, y1, x2, y2]

Classification works the same way.
Assume we have three classes ( cat, butterfly, flower ).

Each class is a channel

To summarize, we will have eight channels
( 1 probability + 4 bounding box + 3 classes ):

[probability, x1, y1, x2, y2, cat, butterfly, flower]
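As a rough illustration of this layout, the sketch below builds an 8x8 target with eight channels and marks one object in it. The names `CLASSES`, `empty_target` and `mark_object` are hypothetical, and storing the box as fractions of the image size (0 to 1) is an assumption made for the example, not something fixed by the idea itself.

```python
import numpy as np

CLASSES = ["cat", "butterfly", "flower"]


def empty_target(grid_size=8):
    # Channels: [probability, x1, y1, x2, y2, cat, butterfly, flower]
    return np.zeros((grid_size, grid_size, 5 + len(CLASSES)), dtype=np.float32)


def mark_object(target, row, col, box, class_name):
    """Write one object into the cell (row, col) of the target."""
    target[row, col, 0] = 1.0                                # object exists here
    target[row, col, 1:5] = box                              # x1, y1, x2, y2
    target[row, col, 5 + CLASSES.index(class_name)] = 1.0    # one channel per class
    return target


target = mark_object(empty_target(), row=2, col=4,
                     box=(0.5, 0.25, 0.75, 0.5), class_name="butterfly")
print(target.shape)  # (8, 8, 8)
```

Every other cell stays all zeros, which means “no object here”.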

The visualization of channels is like this.

Implement Object Detection with TensorFlow ( using CNN )

Now we will implement deep learning object detection with TensorFlow.
But first things first: we have to prepare the dataset.

Let’s import all the modules we need first.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import cv2
import matplotlib.pyplot as plt

Create Datasets

The idea is simple: we create many 128x128 px images (RGB) and randomly place handwritten digits (from MNIST) on them.

Our output will be an 8x8 px image with 15 channels ( I call it “mask” in the code ),
which is [probability, x1, y1, x2, y2, cls_0, cls_1, cls_2 … ]

We will focus on object detection (the Make Model section), so I won’t explain this section’s code too much.

You should get images like this:
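For readers who want something to experiment with, here is a minimal NumPy-only sketch of how one image and its mask could be built. It is not the `make_data` used in this article: the helper `make_sample`, the rule that the cell containing the box centre owns the digit, and the box coordinates divided by the image size are all assumptions for illustration. Random arrays stand in for real MNIST crops.

```python
import numpy as np

IMAGE_SIZE, GRID_SIZE, NUM_CLASSES = 128, 8, 10
CELL = IMAGE_SIZE // GRID_SIZE  # 16 input pixels per mask cell


def make_sample(digits, labels, rng):
    """Paste digit crops on a blank canvas and build the matching mask.

    digits: array of shape (n, h, w) with values in [0, 1]
    labels: n integer class ids in [0, NUM_CLASSES)
    """
    image = np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)
    mask = np.zeros((GRID_SIZE, GRID_SIZE, 5 + NUM_CLASSES), dtype=np.float32)
    for digit, label in zip(digits, labels):
        h, w = digit.shape
        x1 = int(rng.integers(0, IMAGE_SIZE - w + 1))
        y1 = int(rng.integers(0, IMAGE_SIZE - h + 1))
        x2, y2 = x1 + w, y1 + h
        # Paste the grayscale digit into all three colour channels.
        image[y1:y2, x1:x2] = np.maximum(image[y1:y2, x1:x2], digit[..., None])
        # Assumption: the cell containing the box centre owns the digit.
        row, col = ((y1 + y2) // 2) // CELL, ((x1 + x2) // 2) // CELL
        mask[row, col] = 0.0  # a later digit overwrites an earlier one in the same cell
        mask[row, col, 0] = 1.0
        mask[row, col, 1:5] = np.array([x1, y1, x2, y2]) / IMAGE_SIZE
        mask[row, col, 5 + int(label)] = 1.0
    return image, mask


rng = np.random.default_rng(0)
digits = rng.random((3, 28, 28)).astype(np.float32)  # stand-ins for MNIST digits
image, mask = make_sample(digits, [3, 7, 1], rng)
print(image.shape, mask.shape)  # (128, 128, 3) (8, 8, 15)
```

With real data you would pass MNIST images scaled to 0..1 and their labels instead of the random arrays.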

Make Model

As we mentioned before, this is a Convolutional Neural Network (CNN).
Therefore, the bottom of the model is nothing special.
( We use ReLU as the activation function )

Here is the code of the model; I will explain it.

Let’s look at the bottom of the model first.

fig: bottom model, from tf.keras.utils.plot_model

And the code:

x = x_input = layers.Input(shape=(128, 128, 3))

x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.MaxPool2D()(x)
x = layers.BatchNormalization()(x)  # size: 64x64

x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.BatchNormalization()(x)  # size: 64x64

x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.MaxPool2D()(x)
x = layers.BatchNormalization()(x)  # size: 32x32

x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.MaxPool2D()(x)
x = layers.BatchNormalization()(x)  # size: 16x16

x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.MaxPool2D()(x)
x = layers.BatchNormalization()(x)  # size: 8x8

Here is where it differs from a traditional CNN.
We need:

the probability that an object exists ( one channel, x_prob )
the bounding box position ( four channels, x_boxes )
the classification of the digit ( ten channels, x_cls )

So here is the code:

x_prob = layers.Conv2D(1, kernel_size=3, padding='same', activation='sigmoid', name='x_prob')(x)
x_boxes = layers.Conv2D(4, kernel_size=3, padding='same', name='x_boxes')(x)
x_cls = layers.Conv2D(10, kernel_size=3, padding='same', activation='sigmoid', name='x_cls')(x)

It looks like this:

And here are the key points:

we use sigmoid for the probability channel output, so its range stays within 0~1
the bounding box channels, like linear regression, don’t use any activation function
the classification channels also use sigmoid to make probability predictions

The gate of the output.

This code is super important.

We don’t want to output, or back-propagate through, boxes and classes for objects that don’t exist.
So we add something like a “gate” for the bounding box channels and the classification channels.

When the probability is 0.5 or less, the gate is 0; otherwise it is 1.

gate = tf.where(x_prob > 0.5, tf.ones_like(x_prob), tf.zeros_like(x_prob))
x_boxes = x_boxes * gate
x_cls = x_cls * gate

This code outputs zero and stops the gradient in low-probability areas.
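If you want to see that effect for yourself, here is a small standalone check with `tf.GradientTape`. The tensors are toy stand-ins (two cells, one box value each) rather than the real model outputs, and the squared-sum loss is just an assumption to have something to differentiate.

```python
import tensorflow as tf

x_prob = tf.constant([[0.9, 0.2]])   # one confident cell, one unconfident cell
x_boxes = tf.Variable([[3.0, 5.0]])  # toy stand-in for one box value per cell

with tf.GradientTape() as tape:
    gate = tf.where(x_prob > 0.5, tf.ones_like(x_prob), tf.zeros_like(x_prob))
    gated = x_boxes * gate
    loss = tf.reduce_sum(gated ** 2)  # toy loss, only here to produce a gradient

grad = tape.gradient(loss, x_boxes)
print(gated.numpy())  # [[3. 0.]]
print(grad.numpy())   # [[6. 0.]]
```

The second cell is below 0.5, so both its output and its gradient are zero: the box value there gets no update from this loss.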

Then combine the outputs and the model.

x = layers.Concatenate()([x_prob, x_boxes, x_cls])
model = tf.keras.models.Model(x_input, x)
model.summary()

Loss functions

Code first, then the explanation.

The `loss_funcs` is the sum of three functions ( loss_p, loss_bb, loss_cls ),
and as their names suggest:
loss_p is the probability loss
loss_bb is the bounding box loss
loss_cls is the classification loss.

Here we use `tf.gather` to select channels by index.

loss_p and loss_cls

Because they are probabilities, we use `binary_crossentropy` as the loss function.

loss_bb

We use mean_squared_error (MSE) as the loss function. Smooth L1 can give better results, but for simplicity MSE is good enough here.
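A minimal sketch of what such a loss could look like, following the description above. The channel index lists (`IDX_P`, `IDX_BB`, `IDX_CLS`) and the unweighted sum are assumptions based on the 15-channel layout [probability, x1, y1, x2, y2, cls_0 … cls_9]; treat it as an illustration rather than the exact code used for the results in this article.

```python
import numpy as np
import tensorflow as tf

# Assumed channel layout: [probability, x1, y1, x2, y2, cls_0 ... cls_9]
IDX_P = [0]
IDX_BB = [1, 2, 3, 4]
IDX_CLS = list(range(5, 15))


def loss_p(y_true, y_pred):
    # Binary cross-entropy on the "object exists" channel.
    return tf.keras.losses.binary_crossentropy(
        tf.gather(y_true, IDX_P, axis=-1), tf.gather(y_pred, IDX_P, axis=-1))


def loss_bb(y_true, y_pred):
    # Mean squared error on the four box channels.
    diff = tf.gather(y_true, IDX_BB, axis=-1) - tf.gather(y_pred, IDX_BB, axis=-1)
    return tf.reduce_mean(tf.square(diff), axis=-1)


def loss_cls(y_true, y_pred):
    # Binary cross-entropy on the ten class channels.
    return tf.keras.losses.binary_crossentropy(
        tf.gather(y_true, IDX_CLS, axis=-1), tf.gather(y_pred, IDX_CLS, axis=-1))


def loss_funcs(y_true, y_pred):
    return loss_p(y_true, y_pred) + loss_bb(y_true, y_pred) + loss_cls(y_true, y_pred)


# Quick sanity check on a toy mask: a perfect prediction should score lower.
mask = np.zeros((1, 8, 8, 15), dtype=np.float32)
mask[0, 2, 4, [0, 1, 2, 3, 4, 8]] = [1.0, 0.2, 0.3, 0.4, 0.5, 1.0]
y_true = tf.constant(mask)
perfect = float(tf.reduce_mean(loss_funcs(y_true, y_true)))
wrong = float(tf.reduce_mean(loss_funcs(y_true, tf.zeros_like(y_true))))
print(perfect < wrong)  # True
```

A function like this would typically be passed to `model.compile` as the `loss` argument before calling `model.fit`.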

Preview model prediction

Let’s use a function for the preview:

def preview(numbers=None, threshold=0.1):
    X, y = make_data(size=1)      # one random image and its mask
    y = model.predict(X)          # replace the mask with the model's prediction
    show_predict(X[0], y[0], threshold=threshold)

preview()

The prediction of the untrained model doesn’t look like anything helpful.

Now we prepare the training dataset:

batch_size = 32
X_train, y_train = make_data(size=batch_size * 400)

And train for 30 epochs:

model.fit(X_train, y_train, batch_size=batch_size, epochs=30, shuffle=True)

The loss converges nicely. ( I am using my RTX 3060 12GB )
Call preview again.

As you can see, the result is pretty good, but there are some red labels.
Red means low probability; we can raise the threshold to remove them.

preview(threshold=0.7)
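To show what the threshold does to the raw output, here is a minimal sketch that turns one 8x8x15 prediction into a list of detections. `decode` is a hypothetical helper written for this illustration, not the `show_predict` function used above, and it assumes the channel layout [probability, x1, y1, x2, y2, cls_0 … cls_9].

```python
import numpy as np


def decode(y, threshold=0.5):
    """Turn one (8, 8, 15) output into a list of (probability, box, digit)."""
    detections = []
    rows, cols = np.where(y[..., 0] > threshold)  # cells that pass the threshold
    for row, col in zip(rows, cols):
        cell = y[row, col]
        digit = int(np.argmax(cell[5:]))          # most likely class channel
        detections.append((float(cell[0]), cell[1:5].tolist(), digit))
    return detections


# Toy prediction: one confident cell and one weak cell.
y = np.zeros((8, 8, 15), dtype=np.float32)
y[2, 4, 0], y[2, 4, 1:5], y[2, 4, 5 + 7] = 0.9, (0.5, 0.2, 0.7, 0.4), 0.8
y[5, 1, 0], y[5, 1, 5 + 3] = 0.3, 0.6

print(len(decode(y, threshold=0.1)))  # 2
print(len(decode(y, threshold=0.7)))  # 1
```

Raising the threshold from 0.1 to 0.7 drops the weak cell and keeps only the confident one.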

Conclusion

For this demo, I used a simple model architecture and dataset.
So the accuracy may not be great, but it shows the concept of how object detection works.

If you want more accuracy, or detection of objects of different sizes, you can try different bottom layers like Conv2DTranspose / Dropout / ResNet / … to get a better result.

That’s all :D