This paper addresses the difficult problem of character recognition in natural scenes, which is complicated by factors such as variable lighting, background clutter, skewed viewing angles, and inconsistent resolution. A convolutional neural network is implemented on the deep learning framework PyTorch. Starting from the classic LeNet-5 network, we modify the input layer to accept three-channel images, change the pooling method to max pooling to reduce the number of parameters, and replace the activation function with the faster-converging Rectified Linear Unit (ReLU). Cross-entropy loss is used instead of minimum mean squared error to mitigate slow learning. Furthermore, we employ the gradient-descent optimization algorithm RMSprop and L2 regularization to improve accuracy, speed up convergence, and suppress over-fitting. Experimental results show that our model achieves an accuracy of 92.32% after training for 6h17min on the Street View House Numbers (SVHN) dataset, effectively improving on the performance of LeNet-5.

The traditional approach to classifying house numbers in natural scene images relies on manually extracted features[

For these traditional methods, performance hinges on a good classifier, and the features fed to the classifier are mostly hand-designed (such as SIFT, SURF, and HOG); such hand-crafted features are easy to interpret. However, in the face of complex backgrounds, varying fonts, and assorted deformations, extracting more general features by hand is troublesome and difficult[

The Convolutional Neural Network (CNN) is a multi-layer supervised learning neural network. Although training requires far more data than traditional methods do, a CNN can automatically learn the target features from that data without human intervention. This overcomes the shortcomings of manually designed features, which are time-consuming and labor-intensive to produce, generalize poorly, and demand substantial domain experience from the designer. It is precisely because of these advantages that a large number of researchers have begun to apply CNNs to character recognition problems.

In response to this situation, we implemented a LeNet-5-based neural network on the deep learning framework PyTorch and achieved an accuracy of 92.32% on the SVHN dataset after 6h17min of training.

The network used in this experiment is modified from LeNet-5, as shown in Figure

Network structure

The pooling operation in the original LeNet-5 differs considerably from the pooling layers in common use today, so we replace it directly with a max-pooling layer, which also reduces the number of trainable parameters; this helps control the scale of the network and speeds up training. As for the activation function, the original LeNet-5 uses Sigmoid or TanH; here we use the Rectified Linear Unit (ReLU), which converges faster without a significant impact on the model's generalization accuracy. LeNet-5's loss function is the minimum mean squared error:
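In its standard averaged form, consistent with the variable definitions that follow, this loss can be written as:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}
```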

Where n is the number of samples, ŷ_{i} is the predicted value for the i-th sample, and y_{i} is the label of the i-th sample. Under back-propagation with gradient descent, the mean squared error tends to learn slowly when a neuron's output saturates near '1', because the gradient becomes very small. We therefore use the cross-entropy loss function here:
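A standard multi-class form, writing y_{i,c} for the one-hot label and ŷ_{i,c} for the predicted probability of class c (this expanded notation is ours):

```latex
L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{10} y_{i,c}\,\log \hat{y}_{i,c}
```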

In addition to the above improvements, we introduce and compare four optimization algorithms: SGD (with momentum), Adam, Adamax, and RMSprop.

The package ‘torch.optim’ in PyTorch encapsulates a large number of optimization algorithms, which are often referred to as optimizers. In Figure

OPTIMIZER PARAMETER SETTING

Optimizer | Parameter |
---|---|
SGD | lr=0.001, |
Adam | lr=0.001, |
Adamax | lr=0.002, |
RMSprop | lr=0.001, |
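Among these, RMSprop is the optimizer we ultimately adopt. Its update rule can be sketched in plain Python (a minimal scalar version for illustration; the decay rate `rho` and stabilizer `eps` are common defaults, not settings reported in this paper):

```python
import math

def rmsprop_step(theta, grad, s, lr=0.001, rho=0.99, eps=1e-8):
    """One RMSprop update for a single scalar parameter.

    s accumulates an exponential moving average of squared gradients,
    so the effective step size shrinks along high-gradient directions.
    """
    s = rho * s + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (math.sqrt(s) + eps)
    return theta, s

# Minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 1.0.
theta, s = 1.0, 0.0
for _ in range(2000):
    theta, s = rmsprop_step(theta, 2.0 * theta, s)
```

Because the gradient is normalized by the running root-mean-square, the step size stays near `lr` regardless of the raw gradient magnitude, which is what speeds up convergence in practice.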

Optimizer effect

It can be observed from Figure

OPTIMIZERS TRAINING RESULTS

Optimizer | Top Accuracy/% |
---|---|
 | 87.350184 |
 | 89.090000 |
 | 88.955900 |
 | 88.676000 |

To suppress possible over-fitting, one countermeasure is to introduce regularization. We first apply L2 regularization and add three indicators, training-set accuracy, training-set loss, and test-set loss, to enrich the evaluation of the experimental results. The experimental design is shown in Table

It can be seen from Table

RESULT OF DIFFERENT WEIGHT DECAY

weight_decay | Train Acc/% (epoch) | Test Acc/% (epoch) |
---|---|---|
0.01 | 89.01538 (85) | 87.14659 (85) |
0.005 | 91.59397 (57) | 88.95590 (57) |
0.0025 | 94.51470 (88) | 90.01229 (88) |
0.001 | 97.21119 (90) | 89.70498 (24) |
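The `weight_decay` values above correspond to adding an L2 penalty term to the loss. A minimal pure-Python sketch (the helper names are illustrative, not the paper's code):

```python
def l2_penalty(weights, weight_decay):
    """L2 regularization term: (weight_decay / 2) * sum of squared weights."""
    return 0.5 * weight_decay * sum(w * w for w in weights)

def grad_with_l2(grad, w, weight_decay):
    """Gradient of (loss + L2 penalty) w.r.t. one weight:
    the penalty adds weight_decay * w, shrinking large weights toward zero."""
    return grad + weight_decay * w

weights = [3.0, -2.0]
penalty = l2_penalty(weights, 0.005)  # 0.5 * 0.005 * (9 + 4) = 0.0325
```

Larger `weight_decay` pulls weights toward zero more strongly, which explains the trade-off in the table: heavier decay lowers training accuracy but can improve generalization, up to a point.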

Training Accuracy of different regularization

Training Loss of different regularization

Test Accuracy of different regularization

Test Loss of different regularization

Currently, for the task of street-view house number recognition, the most widely used public dataset is SVHN. SVHN is a real-world image dataset intended for developing machine learning and object detection algorithms with minimal need for data preprocessing and format conversion. The dataset has ten classes of labels, each representing one digit: the label of digit "1" is 1, and so on, up to the label of "9" being 9, while "0" is labeled 10. Overall, SVHN contains three subsets: a training set, a test set, and an extra (extended) set. The data are released in two formats according to recognition difficulty: full-resolution images with character-level bounding boxes that contain the entire house number and a small amount of wall background (Figure
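The label convention described above (digit "0" stored as class 10) can be normalized with a one-line mapping; `svhn_label_to_digit` is an illustrative helper name, not part of the dataset API:

```python
def svhn_label_to_digit(label):
    """Map an SVHN class label (1..10) to the digit it represents.

    Labels 1..9 are the digits 1..9; label 10 represents digit 0,
    so reducing modulo 10 recovers the digit."""
    return label % 10

digits = [svhn_label_to_digit(c) for c in (1, 9, 10)]  # [1, 9, 0]
```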

SVHN-Complete house number

SVHN-Part number

Augmenting the dataset is also an effective means of improving model accuracy. The size of each subset of the SVHN dataset is shown in Table

AUGMENTATION RESULT

Subset category | Number of samples |
---|---|
Training set | 73257 |
Extra set | 531131 |
Test set | 26032 |
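Merging the extra set into the training set yields the enlarged training set used in the augmentation experiment; the arithmetic can be checked directly:

```python
train, extra, test = 73257, 531131, 26032
augmented_train = train + extra  # 604388 samples after merging the extra set
```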

Example of train set

Example of extra set

Example of test set

Category distribution of SVHN

The effect of 90-epoch training before and after introducing the extra set is shown in Table

RESULT AFTER DATA AUGMENTATION

Train samples | Test samples | Best test accuracy/% | Training time |
---|---|---|---|
73257 | 26032 | 90.01229 | 1h24min |
604388 | 26032 | 92.32483 | 6h17min |

Training of the model after adding data augmentation

The convolutional neural network applied to the SVHN dataset improves the classic LeNet-5 network as follows: (1) the input layer is modified to accept three-channel images; (2) the pooling, activation, and loss functions are replaced with the more commonly used max pooling, ReLU, and cross-entropy; (3) the gradient-descent optimization algorithm RMSprop is introduced; (4) L2 regularization is applied. The seven-layer convolutional neural network implemented in this paper processes color pictures directly without complicated preprocessing, which improves the versatility of the model, speeds up training, and effectively avoids over-fitting. In the end, both the training speed and the prediction accuracy exceed the experimental results of Ma Miao et al.'s improved LeNet-5. After expanding the dataset, we trained for up to 170 epochs with no obvious improvement in test accuracy. Therefore, future improvements should address the architecture itself; deepening the network to obtain richer features is one promising direction.