Efficient Implementation of MobileNet and YOLO Object Detection Algorithms for Image Annotation

The objective is to implement classification and localization algorithms that achieve high classification and labelling accuracy while training with as little data and time as possible. This post walks through our solution.

The efficiency of a model depends on several parameters: the architecture of the model, the number of weight parameters in the model, the number of images the network has been trained on, and the computational power available to test the model in real time. The third parameter cannot be controlled, which leaves us dependent on the first two. Transfer learning works best in this scenario: the pre-trained weights are fine-tuned on our dataset, so low error and reliable accuracy can be obtained even with little data.


For image classification, we use a Keras model; its layer-by-layer summary can be printed with model.summary(). The model's hyperparameters are chosen to extract as much information as possible from as little data as possible. To this end we include batch normalization layers, which normalize each layer's activations over the mini-batch; this stabilizes training and has a mild regularizing effect that helps the model generalize.
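As a rough illustration of what a batch normalization layer computes during training, here is a minimal pure-Python sketch for a single feature. The function name, the sample activations, and the default values of gamma, beta, and eps are assumptions made for this example; they are not taken from the model used in this post.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance over the batch.
    variance = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

# Hypothetical activations of a single feature for a batch of four images.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

After normalization the batch has roughly zero mean and unit variance; gamma and beta are the parameters a real layer would learn.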


We train the MobileNet model on our dataset, which is taken from the HackerEarth deep learning challenge on classifying animals. We choose 10 random classes from the dataset, vary the number of images per class and the size of the fully connected layers, and report the results.
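The subset selection described above could be sketched as follows. This is only an illustration: the helper sample_subset, the toy class and file names, and the fixed seed are hypothetical, and the real dataset would be read from disk instead.

```python
import random

def sample_subset(images_by_class, num_classes=10, max_per_class=50, seed=0):
    """Pick random classes, then cap the number of images kept per class."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    chosen = rng.sample(sorted(images_by_class), num_classes)
    subset = {}
    for name in chosen:
        files = images_by_class[name]
        subset[name] = rng.sample(files, min(max_per_class, len(files)))
    return subset

# Toy stand-in for the dataset: 30 classes with 80 file names each.
images_by_class = {
    f"class_{i:02d}": [f"class_{i:02d}/img_{j:03d}.jpg" for j in range(80)]
    for i in range(30)
}
subset_50 = sample_subset(images_by_class, max_per_class=50)
subset_30 = sample_subset(images_by_class, max_per_class=30)
```

Using the same seed for both calls keeps the chosen classes identical, so only the number of images per class changes between experiments.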



The models were run for 15 epochs on an Intel i7 processor.

Model 1: MobileNet, larger FC layers, 1000 steps/epoch, training time 18 mins/epoch, dataset 50 images per class on average, accuracy 82.2%
Model 2: MobileNet, smaller FC layers, 500 steps/epoch, training time 12 mins/epoch, dataset 50 images per class on average, accuracy 82.47%
Model 3: MobileNet, smaller FC layers, 500 steps/epoch, training time 11 mins/epoch, dataset 30 images per class, accuracy 76%

Object Detection:

There are a few methods that pose detection as a regression problem. Two of the most popular ones are YOLO and SSD. These detectors are also called single shot detectors. Let’s have a look at them:

You Only Look Once (YOLO)
YOLO divides each image into an S x S grid, and each grid cell predicts N bounding boxes together with a confidence score for each box. The confidence reflects how accurate the bounding box is and whether it actually contains an object (regardless of class). YOLO also predicts a classification score for each box for every class seen in training. Combining the two scores gives the probability of each class being present in a predicted box.
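A small sketch of how the two scores combine: the box confidence is multiplied by each class score to give a class-specific score for that box. The function name and the numbers are made up for illustration.

```python
def class_scores(box_confidence, class_probs):
    """Multiply the box confidence by each class score for that box."""
    return {name: box_confidence * p for name, p in class_probs.items()}

# Hypothetical prediction for one box: 80% confident that it holds an object.
scores = class_scores(0.8, {"dog": 0.7, "cat": 0.2, "bird": 0.1})
best_class = max(scores, key=scores.get)
```

Here the box would be labelled with best_class, and its combined score is what gets compared against a detection threshold.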

In total, S x S x N boxes are predicted. However, most of these boxes have low confidence scores, so setting a threshold of, say, 30% confidence removes most of them.
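To make the counting and thresholding concrete, here is a minimal sketch. The grid size of 7 and the 2 boxes per cell are illustrative values only, and the prediction dictionaries are hypothetical stand-ins for real network output.

```python
def total_boxes(grid_size, boxes_per_cell):
    """Number of boxes predicted for an S x S grid with N boxes per cell."""
    return grid_size * grid_size * boxes_per_cell

def filter_by_confidence(boxes, threshold=0.3):
    """Keep only the boxes whose confidence reaches the threshold."""
    return [box for box in boxes if box["confidence"] >= threshold]

# Hypothetical predictions; a real model would output S * S * N of these.
predictions = [
    {"cell": (0, 0), "confidence": 0.05},
    {"cell": (3, 4), "confidence": 0.82},
    {"cell": (3, 5), "confidence": 0.31},
    {"cell": (6, 6), "confidence": 0.12},
]
kept = filter_by_confidence(predictions)
```

With these sample values, total_boxes(7, 2) gives 98 candidate boxes, and the filter keeps only the two predictions at or above 30% confidence.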

YOLO predicts only one class per grid cell, so small objects that share a cell are often missed.

Single Shot Detectors

SSD runs a convolutional network on the input image only once and calculates a feature map. A small 3×3 convolutional kernel is then run over this feature map to predict the bounding boxes and classification probabilities. Like Faster R-CNN, SSD uses anchor boxes at various aspect ratios and learns the offsets from these anchors rather than the boxes themselves. To handle scale, SSD predicts bounding boxes after multiple convolutional layers. Since each of these layers operates at a different scale, the network is able to detect objects of various sizes.
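Learning offsets rather than boxes can be sketched as below. This assumes the center-size parameterization commonly used with anchor boxes (shift the center in proportion to the anchor size, scale width and height exponentially); the variance factors that some implementations apply are left out, and the function name and sample values are hypothetical.

```python
import math

def decode_box(anchor, offsets):
    """Turn predicted offsets into a box, relative to one anchor box.

    anchor:  (center_x, center_y, width, height)
    offsets: (t_x, t_y, t_w, t_h) as predicted by the network
    """
    a_cx, a_cy, a_w, a_h = anchor
    t_x, t_y, t_w, t_h = offsets
    cx = a_cx + t_x * a_w      # shift the center by a fraction of the anchor width
    cy = a_cy + t_y * a_h      # shift the center by a fraction of the anchor height
    w = a_w * math.exp(t_w)    # scale the width
    h = a_h * math.exp(t_h)    # scale the height
    return (cx, cy, w, h)

anchor = (0.5, 0.5, 0.2, 0.4)
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))
shifted = decode_box(anchor, (0.5, -0.25, 0.0, 0.0))
```

Zero offsets reproduce the anchor itself, which is why the network only has to learn small corrections.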

We compared two models, first YOLO (darknet) and then SSD, on accuracy and speed. Since our inputs are still images rather than video, FPS is not used to differentiate the models. SSDs can also be seen as a balance between the Faster R-CNN model and the YOLO model. Let’s see what the experiment tells us.

The SSD model is implemented using OpenCV’s dnn module, with the help of Adrian Rosebrock.


The YOLO pre-trained weights were downloaded from the author’s website, where we chose the YOLOv3 model. Since it is a darknet model, it expects bounding-box annotations in a different format from the one in our dataset. Hence we first convert the bounding boxes from the VOC format to the darknet format using existing conversion code. Then we train the network after changing the config file.
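The VOC-to-darknet conversion mentioned above amounts to turning pixel corner coordinates into a normalized center and size. The sketch below is not the conversion code we used; the function names and sample numbers are hypothetical, and it assumes VOC boxes are given as (xmin, ymin, xmax, ymax) in pixels.

```python
def voc_to_darknet(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corners to normalized (x_center, y_center, width, height)."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height

def darknet_line(class_id, box):
    """Format one annotation line: class id followed by the four box values."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in box)

# Hypothetical box in a 400 x 500 pixel image, labelled as class 0.
box = voc_to_darknet(100, 50, 300, 250, img_w=400, img_h=500)
line = darknet_line(0, box)
```

Each image gets one such line per object, written to a text file alongside the image.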


SSD: IoU = 0.74, mAP = 0.83, time/epoch = 12 minutes
YOLO: IoU = 0.69, mAP = 0.85, time/epoch = 11 minutes
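For reference, IoU (intersection over union) measures how well a predicted box overlaps the ground-truth box. Below is a minimal sketch for a single pair of boxes, assuming the (xmin, ymin, xmax, ymax) corner format; it is an illustration of the metric, not the evaluation script used for the numbers above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1 means a perfect match and 0 means no overlap; a reported figure such as 0.74 is typically this value averaged over the matched detections.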

Output Images

[Image: SSD used for vehicle detection]
[Image: Output with YOLOv3 pre-trained weights]


Overall, the problem is one of trading off speed against accuracy. Our solution is to use two different models, each suited to a different kind of image and setup.

The trade-off between speed and accuracy also depends on the computational power available. YOLO suits applications that need high-speed output and can tolerate somewhat lower accuracy, whereas SSDs provide higher accuracy while remaining fast, at the cost of more computation time.

Hence, choose SSDs when a capable processor is available; otherwise, YOLO is the go-to for microprocessor-based computation.

Efficient Implementation of MobileNet and YOLO Object Detection Algorithms for Image Annotation was originally published in Hacker Noon on Medium.
