Train Cat & Dog Neural Network

11 min read · Apr 3, 2020

To find the original page for Youtube tutorials and ML models in this project, please check this URL.

Recognizing different objects is easy for human beings, but what about computers? Computers are good at tasks that involve massive amounts of calculation, but they are not intelligent enough to think like humans. To train machines to think more like humans, neural networks (NNs) were introduced. The idea dates back to the mid-20th century, but it only became widely practical in the early 21st century, and nowadays many APIs have been developed and opened to neural network developers.

In this tutorial, I assume you have a basic understanding of NN architecture and training procedures. I am using tensorflow.keras as the API to train my NN model in Python on Windows 10. There are also some hardware and software prerequisites for your computer if you want to train an NN model by following this tutorial.

Distinguishing cat and dog images is a classic topic for machine learning (ML) beginners. Training an NN needs a large amount of data, and for this topic we can download a free training set from Kaggle that contains a lot of pictures.

There are a lot of articles and YouTube tutorials about cat and dog NN training. However, I have to say that many of the procedures and explanations are inaccurate and sometimes even misleading. For example, I have seen a dozen times that people train their “deep” models for over 1000 epochs and show off their high accuracy. For such a simple binary image classification task, training for over 1000 epochs will almost certainly result in overfitting, and the so-called high accuracy is actually training accuracy, which says nothing about whether the trained model is good or not. More importantly, in my own search I have not seen any such model on the internet that achieves more than 90% accuracy. That is to say, whenever you test such a model with 10 pictures of either a cat or a dog, on average at least 1 picture will be misclassified, which cannot satisfy demanding applications.
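To put the “1 in 10” figure in perspective, here is a small arithmetic sketch. It assumes every picture is classified independently with the same accuracy, which is a simplification, and the function name error_stats is made up for this illustration.

```python
def error_stats(accuracy, n_images):
    """Return (expected number of errors, probability of at least one error)."""
    expected_errors = n_images * (1.0 - accuracy)
    p_at_least_one = 1.0 - accuracy ** n_images
    return expected_errors, p_at_least_one


expected, p_any = error_stats(accuracy=0.90, n_images=10)
print("expected errors: {:.2f}".format(expected))
print("P(at least one error): {:.3f}".format(p_any))
```

Under these assumptions, a 90% accurate model makes one mistake per ten pictures on average, and has roughly a 65% chance of making at least one mistake in any given batch of ten.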

In this tutorial series, I will introduce 3 levels of training an NN model: the exploring level, the intermediate level and the high level. At the exploring level, we don’t yet know what kind of model architecture suits our purpose. Hence, we will try different “shallow” architectures and filter out those with low testing accuracy. At the intermediate level, we will go deeper with certain architectures. At the high level, we will use the pre-built application models from APIs (e.g. Keras) to train our model; these are usually extremely complex and need a huge amount of memory, but with proper tuning the accuracy can reach a new ‘career high’ that our self-made NN can never achieve.

1. Get Training Data Script

In order to train our NN model, we need to generate our training data in advance. After downloading the cat & dog images from Kaggle, we transform all images into matrix format. Since a colour image is formed by pixels with RGB values, its 3rd dimension has 3 channels. For the exploring level, we will use a single gray-scale channel instead in order to reduce training complexity.
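The difference between the two formats can be seen on a synthetic image. This is only a sketch: the random array stands in for a real photo, the channel order is assumed to be R, G, B (OpenCV itself loads colour images as B, G, R), and the weights are the common BT.601 luma weights rather than anything taken from this tutorial's scripts.

```python
import numpy as np

# Synthetic 80x80 colour image: height x width x 3 channels, values 0-255.
rng = np.random.default_rng(0)
color_img = rng.integers(0, 256, size=(80, 80, 3), dtype=np.uint8)

# Collapse the 3 channels into 1 with a weighted sum (assumed R, G, B order).
weights = np.array([0.299, 0.587, 0.114])
gray_img = (color_img.astype(np.float64) @ weights).round().astype(np.uint8)

print(color_img.shape)  # (80, 80, 3)
print(gray_img.shape)   # (80, 80)

# Add a trailing channel axis to obtain the (80, 80, 1) format used for training.
gray_img = gray_img[..., np.newaxis]
print(gray_img.shape)   # (80, 80, 1)
```

The gray-scale version holds one third of the values of the colour version, which is where the reduction in training complexity comes from.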

1.1 Import library

If you do not have the cv2 package installed, try ‘pip install opencv-python’ on Windows or ‘sudo pip install opencv-python’ on Linux.

import random
import cv2
import os
import pickle
import numpy as np

1.2 Define function ‘get_training_data()’

Transform each image into a gray-scale matrix of size 80 x 80 (reshaped to (80, 80, 1) when the data is saved) and append it to a list together with its label.

# transform images into matrix format
def get_training_data():
    global DATADIR
    global CATEGORIES
    global training_data

    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)  # label is the index in CATEGORIES
        for img in os.listdir(path):
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # get gray-scale image
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
                training_data.append([new_array, class_num])
            except Exception:  # unreadable or corrupted images are skipped
                pass

1.3 Save data

Save the image data to pickle files that the training scripts can load.

if __name__ == "__main__":
    DATADIR = 'C:\\python\\machine learning\\tf\\PetImages' # change to the directory where you store your images
    CATEGORIES = ['Dog', 'Cat']
    training_data = []
    IMG_SIZE = 80   # change to the image size you wish to compress to. Keep it small to reduce training complexity
    get_training_data()
    random.shuffle(training_data)   # shuffle our training data
    sample_num = len(training_data)

    # check the cat and dog distribution in the testing split (last 20%). Labels are 0 and 1, so a mean
    # close to 0.5 means the two classes are balanced
    test_labels = []
    for sample in training_data[int(sample_num * 0.8):]:
        test_labels.append(sample[1])
    print('mean label of testing split = {}'.format(np.mean(test_labels)))

    # initialize training, validation and testing data. If need validation data, then try ratio of 70:15:15 for
    # training, validation and testing data sample respectively. Otherwise, try 80:20 for training and testing
    # data sample
    X_train = []
    y_train = []
    # X_val = []
    # y_val = []
    X_test = []
    y_test = []

    # append matrix and label into training data
    for feature, label in training_data[:int(sample_num * 0.8)]:
        X_train.append(feature)
        y_train.append(label)

    # for feature, label in training_data[int(sample_num * 0.7):int(sample_num * 0.85)]:
    #     X_val.append(feature)
    #     y_val.append(label)

    for feature, label in training_data[int(sample_num * 0.8):]:
        X_test.append(feature)
        y_test.append(label)

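    # add a channel axis since Conv2D expects input of shape (samples, height, width, channels)
    X_train = np.array(X_train).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
    X_test = np.array(X_test).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
    # convert labels to NumPy arrays so that model.fit() treats them as one target array
    y_train = np.array(y_train)
    y_test = np.array(y_test)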

    # generate training and testing data
    pickle_out = open("X_train.pickle", "wb")
    pickle.dump(X_train, pickle_out)
    pickle_out.close()

    pickle_out = open("y_train.pickle", "wb")
    pickle.dump(y_train, pickle_out)
    pickle_out.close()

    # pickle_out = open("X_val.pickle", "wb")
    # pickle.dump(X_val, pickle_out)
    # pickle_out.close()

    # pickle_out = open("y_val.pickle", "wb")
    # pickle.dump(y_val, pickle_out)
    # pickle_out.close()

    pickle_out = open("X_test.pickle", "wb")
    pickle.dump(X_test, pickle_out)
    pickle_out.close()

    pickle_out = open("y_test.pickle", "wb")
    pickle.dump(y_test, pickle_out)
    pickle_out.close()
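The comments in the script mention a 70:15:15 split as an alternative to 80:20. A minimal sketch of such a three-way split is shown below; split_dataset and toy_data are hypothetical names introduced only for this illustration, and the toy list stands in for the real training_data.

```python
import random


def split_dataset(samples, ratios=(0.7, 0.15, 0.15), seed=None):
    """Shuffle a copy of `samples` and cut it into train/val/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * ratios[0])
    val_end = int(n * (ratios[0] + ratios[1]))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Toy stand-in for training_data: [feature, label] pairs.
toy_data = [[i, i % 2] for i in range(100)]
train, val, test = split_dataset(toy_data, seed=42)
print(len(train), len(val), len(test))
```

Passing a fixed seed makes the split reproducible, which helps when comparing architectures across runs.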

2. Exploring Level Script

When we are not sure what kind of architecture suits our task, we can try different NN architectures and filter out the models with low validation accuracy. We normally feed the training data into convolutional layers first and then into dense layers. Since this binary task is not that complicated, we can try models with 1–3 convolutional layers and 0–2 dense layers, with 32, 64 or 128 neurons in each layer. Choosing 2 ** n neurons is mostly a convention: computer memory is organized in binary, so sizes such as 32 are often said to align better with memory than, say, 30, although in practice the difference is usually small.

After each convolutional layer we usually apply max pooling, which keeps the maximum value in each window in order to extract the strongest features of the picture (matrix). In addition, since our NN is a binary classifier, we use the sigmoid activation function in the output layer and relu in all other layers: the slope of relu is 1 whenever its input is greater than 0, so it does not saturate during training.

When we see overfitting, i.e. validation accuracy is much lower than training accuracy (say, validation accuracy is 86% and training accuracy is 96% after a certain number of epochs), we add dropout layers (ratio between 0 and 1) to reduce the gap between validation and training accuracy. However, a larger dropout ratio lowers training accuracy, which slows down training and might leave training accuracy below 90% if the NN architecture is too simple to be trained well. Hence, trying different dropout ratios is crucial to the result.
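The building blocks mentioned above can be illustrated on a tiny matrix with NumPy. This is a sketch for intuition only, not the Keras implementation; the helper names and the 4x4 example values are made up.

```python
import numpy as np


def relu(x):
    """Pass positive values through unchanged, clamp negatives to 0."""
    return np.maximum(0.0, x)


def sigmoid(x):
    """Squash any real value into (0, 1), usable as a class probability."""
    return 1.0 / (1.0 + np.exp(-x))


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)

print(max_pool_2x2(fm))                  # rows [4, 2] and [2, 8]
print(relu(np.array([-2.0, 0.0, 3.0])))  # negatives become 0
print(sigmoid(np.array([0.0])))          # 0.5 for an input of 0
```

Each 2x2 window of the 4x4 input is reduced to its largest value, so the pooled map is half as wide and half as tall.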

2.1 Import Library

Import libraries that we are going to use.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ReduceLROnPlateau, EarlyStopping
import tensorflow.keras.backend as K
import time
import pickle

2.2 Define layer Variables and import training data

Define the dense layer counts, convolutional layer counts and layer sizes, then load the training and testing data.

dense_layers = [0, 1, 2]
conv_layers = [1, 2, 3]
layer_sizes = [32, 64, 128]

# load the prepared (shuffled) training data
X_train = pickle.load(open('X_train.pickle', 'rb'))
y_train = pickle.load(open('y_train.pickle', 'rb'))

# validation data is optional if using validation_split instead of validation_data when fitting data into model
# X_val = pickle.load(open('X_val.pickle', 'rb'))
# y_val = pickle.load(open('y_val.pickle', 'rb'))

X_test = pickle.load(open('X_test.pickle', 'rb'))
y_test = pickle.load(open('y_test.pickle', 'rb'))

2.3 Construct NN architecture

Construct our NN architectures with 3 nested for loops. Note that the code uses tf.Session, which belongs to the TensorFlow 1.x API.

for conv_layer in conv_layers:
    for layer_size in layer_sizes:
            for dense_layer in dense_layers:
                with tf.Session() as sess:
                    # set the name for the NN model
                    NAME = '{}-conv-{}-nodes-{}-dense-{}'.format(conv_layer, layer_size, dense_layer, int(time.time()))
                    # set up callbacks
                    callbacks = [TensorBoard(log_dir='Exploring Level/{}'.format(NAME))]

                    model = Sequential()    # Initialize model as a sequential model

                    # Feed training data into a convolutional layer first
                    model.add(Conv2D(layer_size, (5, 5), input_shape=X_train.shape[1:]))  # Specify input shape, otherwise model cannot be saved 
                    model.add(Activation('relu'))
                    model.add(MaxPooling2D(pool_size=(2, 2)))
                    model.add(Dropout(rate=0.25))

                    # convolutional layer loop
                    for i in range(conv_layer-1):
                        model.add(Conv2D(layer_size, (5, 5)))
                        model.add(Activation('relu'))
                        model.add(MaxPooling2D(pool_size=(2, 2)))
                        model.add(Dropout(rate=0.25))

                    model.add(Flatten())    # Flatten 3D training data into 1D before sent into dense layer

                    # dense layer loop
                    for i in range(dense_layer):
                        model.add(Dense(layer_size))
                        model.add(Activation('relu'))
                        model.add(Dropout(rate=0.25))

                    model.add(Dense(1))
                    model.add(Activation('sigmoid'))    # Use sigmoid activation function in output layer for binary training

                    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
                    model.fit(X_train, y_train, batch_size=64, validation_split=0.3, epochs=10, callbacks=callbacks)
                    model.summary()
                    model.save('{}.model'.format(NAME))  # save model
                    test_loss, test_acc = model.evaluate(x=X_test, y=y_test)    # evaluate model with testing data

                    # store testing accuracy result into text file for easy reference
                    print('{}:  testing accuracy = {}'.format(NAME, test_acc))
                    with open('record.txt', 'a+') as f:
                        f.write('{}:  testing loss = {};  testing accuracy = {}'.format(NAME, test_loss, test_acc))
                        f.write('\n')
                    del model   # delete model in order to save memory
                K.clear_session()   # clear tf session in order to save memory
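It can help to check how quickly the feature maps shrink as conv/pool blocks are stacked. The sketch below assumes the Keras defaults used in the script above (padding='valid', stride 1 for Conv2D, and non-overlapping 2x2 pooling); conv_pool_output_size is a hypothetical helper, not part of Keras.

```python
def conv_pool_output_size(input_size, n_conv_layers, kernel=5, pool=2):
    """Spatial size after repeated (valid conv -> max pool) blocks."""
    size = input_size
    for _ in range(n_conv_layers):
        size = size - kernel + 1   # Conv2D, padding='valid', stride 1
        size = size // pool        # MaxPooling2D floors the result
        if size < 1:
            raise ValueError("input too small for this many conv/pool blocks")
    return size


for n in (1, 2, 3):
    side = conv_pool_output_size(80, n)
    print("{} conv block(s): {}x{} feature maps".format(n, side, side))
```

For an 80x80 input the side length goes 80 → 38 → 17 → 6, so a much deeper stack of 5x5 convolutions with pooling would need a larger IMG_SIZE or a different padding setting.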

2.4 Result

Below are 2 pictures of the results for our 3*3*3 = 27 cases. The left one is training accuracy and the right one is validation accuracy. We can see that the architecture with 3 conv layers and 0 dense layers is the best in this case, as it has the highest validation accuracy. Of course, you can also judge the models’ performance by comparing their loss values, but validation accuracy is always my preference. Model 3–128–0 has an accuracy of 85.49% and 3–32–0 has an accuracy of 83.91%. Hence, we are going to use the 3 conv 0 dense architecture for our intermediate level. In addition, we print out the testing accuracy after each model has finished training so that it can be compared with the validation accuracy. If the two values differ a lot, for example if testing accuracy is much lower than validation accuracy, we should try other NN architectures, since the model does not generalize to our ‘real world’ samples.
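Since every run appends one line to record.txt, the runs can also be ranked by testing accuracy with a few lines of Python. This is a sketch that assumes the exact line format written by the training script; rank_records is a hypothetical helper and the sample lines are made-up values for illustration, not results from this tutorial.

```python
import re

# Matches lines in the format written by f.write(...) in the training script.
RECORD_PATTERN = re.compile(
    r"^(?P<name>.+?):\s+testing loss = (?P<loss>[-+.\deE]+);"
    r"\s+testing accuracy = (?P<acc>[-+.\deE]+)\s*$"
)


def rank_records(lines):
    """Return (name, loss, accuracy) tuples sorted by accuracy, best first."""
    records = []
    for line in lines:
        match = RECORD_PATTERN.match(line.strip())
        if match:
            records.append((match.group("name"),
                            float(match.group("loss")),
                            float(match.group("acc"))))
    return sorted(records, key=lambda r: r[2], reverse=True)


# Made-up sample lines, for illustration only.
sample_lines = [
    "1-conv-32-nodes-0-dense-1585900000:  testing loss = 0.61;  testing accuracy = 0.70",
    "3-conv-64-nodes-0-dense-1585900500:  testing loss = 0.42;  testing accuracy = 0.81",
    "2-conv-32-nodes-1-dense-1585901000:  testing loss = 0.55;  testing accuracy = 0.74",
]
for name, loss, acc in rank_records(sample_lines):
    print("{:<40} loss={:.2f} acc={:.2f}".format(name, loss, acc))
```

To rank real runs, the opened file can be passed directly, e.g. rank_records(open('record.txt')).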

3. Intermediate Level Script

In this level, we first try the architecture with 3 convolutional layers and 0 dense layers, with either 64 or 128 neurons in every layer (the layer_sizes list in the code below). Apart from the code shown below, everything remains the same as in our exploring level. The differences from the exploring level are that we add more callback functions and introduce batch normalization. Since we would like to train our model for more epochs in order to find its ‘local minimum’ without overfitting, we set the maximum number of training epochs to 300 and use ReduceLROnPlateau with a factor of 0.2 and a patience of 5: the learning rate (lr) is multiplied by 0.2 whenever the validation loss has not improved for 5 consecutive epochs.
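The plateau-based learning rate reduction can be simulated without TensorFlow. The sketch below is a simplified model of the behaviour (it ignores options such as cooldown, min_delta and min_lr); the defaults mirror the ReduceLROnPlateau arguments in this section's code, the initial rate of 0.001 is the Keras default for Adam, and the loss values are made up.

```python
def simulate_reduce_lr(val_losses, initial_lr=1e-3, factor=0.2, patience=5):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:        # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:                  # no improvement: count the stalled epoch
            wait += 1
            if wait >= patience:
                lr *= factor   # shrink the learning rate and start counting again
                wait = 0
        history.append(lr)
    return history


# Made-up validation losses: two improving epochs, then a plateau.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
for epoch, lr in enumerate(simulate_reduce_lr(losses)):
    print("epoch {}: lr = {:.4f}".format(epoch, lr))
```

In this toy run the rate stays at 0.001 until the fifth stalled epoch in a row, at which point it drops to 0.0002.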

The code below is just one simple case; you can also try 32, 64 and 128 neurons in successive convolutional layers. Moreover, you can try adding another 2 dense layers with 512 and 1024 neurons respectively. One thing to remember: a more complex architecture has a higher chance of overfitting. Hence, we should apply different techniques to our model, for example increasing the dropout ratio, increasing the number of training samples, using the early stopping callback, etc. Last but not least, you can try RGB data instead of grayscale data to train the model.

In this level, what we need to do is tune the hyper-parameters of certain architectures. We can tune the learning rate reduction factor, dropout ratio, convolution kernel size, max pooling size, number of neurons in each layer, activation function, etc. However, there will always be a limit for a self-designed model architecture. In this case, my best attempt reached 90% accuracy with 3 conv layers and 2 dense layers. For some other problems, for instance the MNIST handwritten digit recognition problem, a self-designed architecture can easily reach over 99%, which is excellent. In our case, we should try ‘professional architectures’ to improve further, which will be introduced in the high level part later on. The reason I don’t recommend trying professional models at the first stage is that the cat & dog image recognition problem is not that complex, and a simpler model should be enough to reach high accuracy. If the best validation/testing accuracy does not satisfy you, you should then try professional models as well.

dense_layers = [0]
conv_layers = [3]
layer_sizes = [64, 128]

for conv_layer in conv_layers:
    for layer_size in layer_sizes:
            for dense_layer in dense_layers:
                with tf.Session() as sess:
                    # set the name for the NN model
                    NAME = '{}-conv-{}-nodes-{}-dense-{}'.format(conv_layer, layer_size, dense_layer, int(time.time()))
                    # set up callbacks
                    callbacks = [TensorBoard(log_dir='Intermediate Level/{}'.format(NAME)),
                                 ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5),
                                 # accuracy should increase, so stop when it no longer reaches a new maximum
                                 EarlyStopping(monitor='val_acc', mode='max', patience=15)]

                    model = Sequential()    # Initialize model as a sequential model

                    # Feed training data into a convolutional layer first
                    model.add(Conv2D(layer_size, (5, 5), input_shape=X_train.shape[1:]))  # Specify input shape, otherwise model cannot be saved 
                    model.add(BatchNormalization())
                    model.add(Activation('relu'))
                    model.add(MaxPooling2D(pool_size=(2, 2)))
                    model.add(Dropout(rate=0.25))

                    # convolutional layer loop
                    for i in range(conv_layer-1):
                        model.add(Conv2D(layer_size, (5, 5)))
                        model.add(BatchNormalization())
                        model.add(Activation('relu'))
                        model.add(MaxPooling2D(pool_size=(2, 2)))
                        model.add(Dropout(rate=0.25))

                    model.add(Flatten())    # Flatten 3D training data into 1D before sent into dense layer

                    # dense layer loop
                    for i in range(dense_layer):
                        model.add(Dense(layer_size))
                        model.add(BatchNormalization())
                        model.add(Activation('relu'))
                        model.add(Dropout(rate=0.25))

                    model.add(Dense(1))
                    model.add(Activation('sigmoid'))    # Use sigmoid activation function in output layer for binary training

                    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
                    model.fit(X_train, y_train, batch_size=64, validation_split=0.3, epochs=300, callbacks=callbacks)
                    model.summary()
                    model.save('{}.model'.format(NAME))  # save model
                    test_loss, test_acc = model.evaluate(x=X_test, y=y_test)    # evaluate model with testing data

                    # store testing accuracy result into text file for easy reference
                    print('{}:  testing accuracy = {}'.format(NAME, test_acc))
                    with open('record.txt', 'a+') as f:
                        f.write('{}:  testing loss = {};  testing accuracy = {}'.format(NAME, test_loss, test_acc))
                        f.write('\n')
                    del model   # delete model in order to save memory
                K.clear_session()   # clear tf session in order to save memory
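The mode argument of an early stopping callback matters: for a metric such as accuracy, higher is better, so the monitor should look for a new maximum. The sketch below is a simplified, TensorFlow-free model of patience-based early stopping; early_stopping_epoch is a hypothetical helper and the accuracy values are made up.

```python
def early_stopping_epoch(values, patience, mode):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified: stop once `patience` consecutive epochs fail to beat the best value.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best = None
    wait = 0
    for epoch, value in enumerate(values):
        improved = (best is None
                    or (mode == "max" and value > best)
                    or (mode == "min" and value < best))
        if improved:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies that rise steadily and then flatten out.
val_acc = [0.60, 0.70, 0.75, 0.80, 0.82, 0.82, 0.82, 0.82]
print(early_stopping_epoch(val_acc, patience=3, mode="max"))  # stops after the plateau
print(early_stopping_epoch(val_acc, patience=3, mode="min"))  # stops while accuracy is still rising
```

With mode 'min', every increase in accuracy is counted as "no improvement", so training is cut short exactly when the model is getting better.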

4. High Level Script

In this level, we use VGG16, a well-known NN architecture for multi-class image recognition, to train our model. However, VGG16 uses softmax as the activation function with 1000 neurons in its output layer. Hence, we need to replace VGG16’s output layer with either 1 neuron with a sigmoid activation function or 2 neurons with a softmax activation function, which are equivalent for binary classification. VGG16 takes RGB input, so we need to prepare our training, validation and testing data as colour images in (224, 224, 3) format, which is what the input layer expects (for example by regenerating the pickle files with IMG_SIZE = 224 and colour images). Moreover, the VGG16 model imported from Keras is of class ‘tensorflow.python.keras.engine.training.Model’ while we want our model to be of class ‘tensorflow.python.keras.engine.sequential.Sequential’. Therefore, we append the layers of the VGG16 application model to our own sequential model.
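The relationship between the two output-layer options can be checked numerically: a 2-neuron softmax gives the same class probability as a single sigmoid neuron whose input is the difference of the two logits. The logit values below are arbitrary and the helper functions are written out by hand for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    shifted = [z - max(logits) for z in logits]   # subtract max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Two arbitrary logits for class 0 and class 1.
z0, z1 = 0.3, 1.7
p_softmax = softmax([z0, z1])[1]   # probability of class 1 from a 2-neuron softmax
p_sigmoid = sigmoid(z1 - z0)       # same probability from a single sigmoid neuron
print(p_softmax, p_sigmoid)
```

The two layers are therefore interchangeable in what they can express, although the softmax version carries twice as many output weights.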

vgg16_model = tf.keras.applications.vgg16.VGG16()
model = Sequential()
for layer in vgg16_model.layers[:-1]:   # append all layers except for the output layer from VGG16 into new model
    model.add(layer)
# model.summary()

for layer in model.layers:
    layer.trainable = False

model.add(Dense(1, activation='sigmoid'))
# model.summary()

X_train = pickle.load(open('X_train.pickle', 'rb'))
y_train = pickle.load(open('y_train.pickle', 'rb'))
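# the validation pickles exist only if the commented-out validation code in the data script was enabled
X_val = pickle.load(open('X_val.pickle', 'rb'))
y_val = pickle.load(open('y_val.pickle', 'rb'))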
X_test = pickle.load(open('X_test.pickle', 'rb'))
y_test = pickle.load(open('y_test.pickle', 'rb'))

NAME = 'VGG16-' + str(int(time.time()))
callbacks = [TensorBoard(log_dir='log_VGG16/{}'.format(NAME)),
             ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=4),
             EarlyStopping(monitor='val_acc', patience=7)]  # keep patience small here as training size is much larger

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, validation_data=(X_val, y_val), epochs=100, callbacks=callbacks)

model.save('{}.model'.format(NAME))
test_loss, test_acc = model.evaluate(x=X_test, y=y_test)
print('{}:  testing accuracy = {}'.format(NAME, test_acc))
with open('record.txt', 'a+') as f:
    f.write('{}:  testing loss = {};  testing accuracy = {}'.format(NAME, test_loss, test_acc))
    f.write('\n')
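Freezing the copied layers leaves very little to train. The arithmetic below is a sketch that assumes the standard VGG16 layout (13 convolutional layers with 3x3 kernels, then two 4096-unit dense layers on a 7x7x512 feature map); the helper names are made up and nothing here calls TensorFlow.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of a Dense layer."""
    return (in_units + 1) * out_units


# (input channels, output channels) of the 13 conv layers in the standard VGG16 layout.
VGG16_CONVS = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

conv_total = sum(conv_params(i, o) for i, o in VGG16_CONVS)
fc_total = dense_params(7 * 7 * 512, 4096) + dense_params(4096, 4096)  # fc1 + fc2

frozen = conv_total + fc_total      # layers copied from VGG16 and frozen
trainable = dense_params(4096, 1)   # the new Dense(1) output layer

print("frozen parameters:   ", frozen)
print("trainable parameters:", trainable)
```

Only the 4,097 parameters of the new output layer are updated during training, while the roughly 134 million copied parameters keep their pre-trained values. Calling model.summary() on the real model is the way to confirm these numbers.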

Originally published at https://www.nianliblog.com.

Written by Nian Li