Get a Heatmap from a CNN (Convolutional Neural Network), AKA CAM

Published in

tree.rocks

4 min read · Jul 25, 2021

Convolutional Neural Networks (CNNs) are incredible. If you want to know how one sees the world (an image), there is a way to visualize it.
The idea is to take the weights of the last dense layer and multiply them with the output of the final convolution layer. This needs a Global Average Pooling (GAP) layer between the two to work.

Choose a model

In this tutorial, we are using Keras with TensorFlow and ResNet50.

ResNet50 has a Global Average Pooling (GAP) layer (explained later), which makes it a good fit for this demonstration.

How the heatmap works

A heatmap from a CNN is also known as Class Activation Mapping (CAM). The idea is to collect each output channel of the last convolution layer (as an image) and combine them all into a single image. (The code is shown step by step later.)
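Written as a formula, in notation adapted from the CAM paper listed in the references (Learning Deep Features for Discriminative Localization), the combination is a weighted sum over channels. The symbols are not used elsewhere in this article and are introduced only for illustration: f_k(x, y) is the activation of channel k of the last convolution layer at position (x, y), w_k^c is the dense-layer weight connecting channel k to class c, b_c is the bias for class c, and Z is the number of spatial positions.

```latex
% Heatmap for class c: weighted sum of the feature maps
M_c(x, y) = \sum_{k} w_k^{c} \, f_k(x, y)

% Class score before softmax, with average pooling over Z positions
S_c = \sum_{k} w_k^{c} \, \frac{1}{Z} \sum_{x, y} f_k(x, y) + b_c
    = \frac{1}{Z} \sum_{x, y} M_c(x, y) + b_c
```

The second line shows why the pooling layer matters: because averaging is linear, the class score is just the spatial average of the heatmap plus a bias, so each heatmap position shows how much it contributes to the score.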

Here is how Global Average Pooling (GAP) and Global Max Pooling (GMP) work
(which one applies depends on the model, but the idea is the same).

In some models, after feature extraction, a flatten layer feeds the features into the fully connected part of the network to predict the result. But this step is like dropping the image dimensions, and some information along with them.

In contrast, Global Average Pooling (GAP) or Global Max Pooling (GMP) works here. Each CNN channel (feature image) is reduced to a single value, so the dense layer has one weight per channel for each class. The image dimension info stays intact in the feature maps, and the weights tell us which channels are more crucial for predicting the result.
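A minimal NumPy sketch of the difference, assuming a random array as a stand-in for the last convolution output. The shape (7, 7, 2048) matches what ResNet50 produces for a 224x224 input; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical feature maps: height 7, width 7, 2048 channels.
rng = np.random.default_rng(0)
features = rng.random((7, 7, 2048)).astype(np.float32)

# Flatten: every position of every channel becomes a separate input,
# so the dense layer no longer has a single weight per channel.
flat = features.reshape(-1)

# Global Average Pooling: one value per channel (mean over height and width).
gap = features.mean(axis=(0, 1))

# Global Max Pooling: one value per channel (max over height and width).
gmp = features.max(axis=(0, 1))

print(flat.shape, gap.shape, gmp.shape)  # (100352,) (2048,) (2048,)
```

With pooling, the dense layer sees 2048 inputs instead of 100352, and each input corresponds to exactly one feature map.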

Example

Let’s start with ResNet50 in Keras.

from tensorflow.keras.applications import ResNet50
res_model = ResNet50()
res_model.summary()

As you can see in the model summary (figure above):

the red: we will use this layer for “transfer learning.”
the green: Global Average Pooling (GAP). It is crucial for this work.

Next, import the libraries and load the image for later use.

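import cv2
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import zoom
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

# OpenCV loads BGR, so convert to RGB (the image is assumed to be 224x224).
img = cv2.imread('./test_cat.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Add a batch dimension and apply the ResNet50 preprocessing.
X = np.expand_dims(img, axis=0).astype(np.float32)
X = preprocess_input(X)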

We import zoom from scipy.ndimage to resize the heatmap, because the feature maps produced by the CNN are smaller than the original image.
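As a small sketch of that resize step, assuming a 7x7 map and a 224x224 target (the sizes used with ResNet50 later in this article), the snippet below upscales a dummy array. The array contents are made up; only the shapes matter here.

```python
import numpy as np
from scipy.ndimage import zoom

# Dummy 7x7 "feature map" standing in for one CNN output channel.
small = np.arange(49, dtype=np.float32).reshape(7, 7)

# Zoom factor per axis: 224 / 7 = 32.
scale = 224 / 7
large = zoom(small, zoom=(scale, scale))

print(small.shape, "->", large.shape)  # (7, 7) -> (224, 224)
```

zoom interpolates with a cubic spline by default (order=3); passing order=1 or order=0 gives linear or nearest-neighbour results if a smoother or blockier overlay is preferred.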

Transfer Learning

Now extract the layers we will use.
P.S.: You can train your own model from scratch, but it will take a long time, and the feature extraction may also need a lot of tuning.

from tensorflow.keras.models import Model
# Expose both the last convolution output and the final prediction.
conv_output = res_model.get_layer("conv5_block3_out").output
pred_output = res_model.get_layer("predictions").output
model = Model(res_model.input, outputs=[conv_output, pred_output])

Here we have two outputs (the red part in the figure, as mentioned above):

The first is the convolution network output
The second is the predicted result

Now run the prediction:

conv, pred = model.predict(X)
decode_predictions(pred)

The result looks like this. It’s pretty good:

[[('n02123159', 'tiger_cat', 0.7185241),
  ('n02123045', 'tabby', 0.1784818),
  ('n02124075', 'Egyptian_cat', 0.034279127),
  ('n03958227', 'plastic_bag', 0.006443105),
  ('n03793489', 'mouse', 0.004671723)]]

The Output

Now, let’s look at some of the CNN output channels.

scale = 224 / 7
plt.figure(figsize=(16, 16))
for i in range(36):
    plt.subplot(6, 6, i + 1)
    plt.imshow(img)
    plt.imshow(zoom(conv[0, :,:,i], zoom=(scale, scale)), cmap='jet', alpha=0.3)

We draw the original image first ( plt.imshow(img) ) so that each feature map can be compared against it.
(If you don’t, you will get a result like this.)

Combine in one shot

Here is the crucial part. We use the index of the predicted class (target) to pick the matching weights from the dense layer.
Then we multiply each feature map by its weight and sum the results (a dot product).

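# Index of the top predicted class, then that class's column of dense weights.
target = np.argmax(pred, axis=1).squeeze()
w, b = model.get_layer("predictions").weights
weights = w[:, target].numpy()
# (7, 7, 2048) @ (2048,) -> (7, 7)
heatmap = conv.squeeze() @ weights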
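To see what the dot product above does, here is a minimal NumPy sketch with random stand-in data (the real conv and weights come from the model, and the names here are illustrative). It checks two things: the `@` operator gives the same result as summing the weighted feature maps one by one, and the spatial mean of the heatmap equals the dense layer applied to the pooled features (without the bias).

```python
import numpy as np

rng = np.random.default_rng(0)
conv = rng.random((7, 7, 2048))   # stand-in for conv.squeeze()
weights = rng.normal(size=2048)   # stand-in for w[:, target]

# Dot product over the channel axis: (7, 7, 2048) @ (2048,) -> (7, 7)
heatmap = conv @ weights

# The same thing written as an explicit weighted sum of feature maps.
explicit = np.zeros((7, 7))
for k in range(conv.shape[-1]):
    explicit += weights[k] * conv[:, :, k]

# Global Average Pooling followed by the dense weights (bias omitted).
gap = conv.mean(axis=(0, 1))
score_without_bias = gap @ weights

print(np.allclose(heatmap, explicit))                     # True
print(np.isclose(heatmap.mean(), score_without_bias))     # True
```

The second check is the reason the heatmap is meaningful: because averaging is linear, each heatmap cell is that position's share of the class score before softmax.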

Then show the heatmap on top of the original image.

scale = 224 / 7
plt.figure(figsize=(12, 12))
plt.imshow(img)
plt.imshow(zoom(heatmap, zoom=(scale, scale)), cmap='jet', alpha=0.5)

That’s the result we want.

Reference:

Deep Residual Learning for Image Recognition — https://arxiv.org/abs/1512.03385
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization — https://arxiv.org/abs/1610.02391
Network In Network — https://arxiv.org/abs/1312.4400
Learning Deep Features for Discriminative Localization — https://arxiv.org/abs/1512.04150