Case Study

Removing Rain from Images with Densely-Connected Convolutional Networks

by Trevor McInroe

14 minutes

computer-vision
deep-learning

Summary

Rain and other weather-based impediments can cause performance degradation in downstream computer vision models. Mitigating these effects is an important area of research, as the decisions made by various autonomous systems within vehicles depend on the performance of these downstream models. Ideally, we could improve the information quality within rainy images such that model performance is not affected. In this work, we explore the ability of deep convolutional encoder-decoders to remove rain from input images. We evaluate our deraining network's (DRN) representational capacity, its ability to recover statistical information contained within images, and its effect on downstream object detection and image classification models. In all three tests, our DRN proves capable of recovering important information in images, allowing for faster learning and improved performance metrics for downstream models.

For a PDF version of this paper, see here.

Introduction

Weather impediments, such as rain, can cause significant loss of high-frequency information in images, such as edges and textures. This loss of precision can result in the visual phenomena of blurriness and obfuscation. Whereas humans have adapted to perform complex tasks in rain, such as driving a motor vehicle, even small amounts of visual noise can cause immediate performance degradation in deep learning models [1, 2, 3]. This degradation can be of significant concern. As an extreme example, rain may cause an autonomous vehicle to misidentify a stop sign, resulting in a potentially unsafe situation.

To alleviate the impact of rain on our downstream computer vision tasks, we have developed a system that is capable of removing rain from images. To accomplish this, we have trained a densely-connected convolutional encoder-decoder to take, as input, a rainy image and to output a version of that image without rain. In our tests, we show that this deraining network (DRN) can both improve the integrity of information within input images as well as improve the performance of downstream modeling tasks.

Problem Formulation

In most of the modern research literature, it is argued that a rainy image I is composed of a layer of rain R and a base layer B of the image that is behind the rain:

I = B + R        (1)

We can extend the above assumption with some simple algebra to see that B can be recovered through subtraction: B = I − R. Given this formula, it becomes clear that our goal is to train some system that can produce R from a given I. However, this task formulation presents a significant data challenge. In the general supervised learning framework, we have some input data X that directly pairs with some output data Y. The purpose of a neural network θ is to learn a mapping between the input and output data, θ : X → Y. In our application, this can be framed as θ learning to map a set of rainy images X to a set of non-rainy versions of those images Y. Of course, practically speaking, capturing such image pairs is only possible within very specific, controlled environments that may not reflect real-world conditions.
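To make the formulation concrete, the minimal sketch below recovers a derained estimate by subtracting a predicted rain layer from the input, assuming images are stored as floating-point arrays in [0, 1]; the predict_rain function is a hypothetical stand-in for a trained model.

```python
# Minimal sketch of the additive rain model in Equation (1). Images are assumed
# to be float arrays in [0, 1]; `predict_rain` is a hypothetical trained model.
import numpy as np

def derain(rainy: np.ndarray, predict_rain) -> np.ndarray:
    """Recover an estimate of the base layer B = I - R from a rainy image I."""
    rain_layer = predict_rain(rainy)        # estimate of R, same shape as the input
    base_layer = rainy - rain_layer         # B = I - R
    return np.clip(base_layer, 0.0, 1.0)    # keep pixel values in a valid range
```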

Instead, most research conducted in the arena of image deraining has relied on algorithmic methods that can generate synthetic rain [4, 5]. Doing so allows us to take any image and generate its rainy pair. While this approach solves the problem of data generation for training our models, we must ensure that θ can generalize to images that contain real rain. To differentiate between the two, we created two datasets. One, Dsynthetic, contains paired images generated using algorithmic methods from various studies such as [6]. The other, Dreal, contains images of real rain and corresponding information about said images such as class labels and bounding boxes for common road objects such as "car", "bus", and "pedestrian" [7]. In addition, we used the CURE-TSR [8, 9, 10] dataset, which contains images of various street signs across a multitude of weather conditions of varying intensities. Dsynthetic was used to train the DRN and to understand statistical information recovery, Dreal was used to evaluate the training behavior of downstream object detection models, and CURE-TSR was used to observe the impact of the deraining process on classification tasks.

Network Implementation

"Traditional" computer vision tasks involve models that take high dimensional images and produce low dimensional outputs. For example, a standard image classification model might produce a probability distribution over classes. However, for Equation (1) fo funtcion, our system must produce an output that is the same size as the input image. Most modern research of this kind deals with information bottlenecking via encoder-decoder architectures [11, 12]. By forcing the incoming data through a slim area of the network, the model is incentivized to learn important underlying features of the input data.

In addition to the encoder-decoder structure, our network mainly exploits two key aspects of deep neural networks for computer vision: residual learning via skip connections [13] and densely-connected convolutional layers [14]. Using skip connections, especially in deep networks, "smooths" the loss landscape, thereby leading to more stable and convergent training [15, 16]. Also, using the one-to-many connections found in densely-connected convolutions allows for increased parameter efficiency, improved gradient flow during backpropagation, as well as an inherent regularization effect that reduces overfitting on small training datasets [14]. For a depiction of a densely-connected convolutional block, see Figure 1, below.

Figure 1: Depiction of a densely-connected convolutional block from [14].

Within the network, the hidden feature map x_i produced by the i-th convolution H_i is constructed from the concatenation of all previously produced feature maps:

x_i = H_i([x_0, x_1, …, x_{i−1}])        (2)

This concatenation operation is not feasible in traditional convolutional architectures, where standard pooling operations shrink the size of feature maps between convolutions. To overcome this, we use these dense operations within isolated blocks of the network and insert pooling operations between the blocks. Each block in the encoder is followed by a MaxPooling operation [17] and each block in the decoder is followed by a MaxUnPooling operation [12] that ultimately upsamples the feature maps back to the resolution of the original input. See Figure 2 for a depiction of the DRN.

Figure 2: Truncated depiction of the DRN and its Densely Connected Convolutional Blocks (DCCB).
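The sketch below illustrates, in PyTorch, how a densely-connected block implementing Equation (2) could be written and how MaxPooling and MaxUnPooling operations could sit between blocks; the channel and layer counts are illustrative and do not reflect the DRN's actual configuration.

```python
# A PyTorch sketch of a densely-connected convolutional block (Equation (2))
# and the pooling operations placed between blocks. Channel and layer counts
# are illustrative only.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each convolution H_i sees the concatenation of all earlier feature maps.
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # x_i = H_i([x_0, x_1, ..., x_{i-1}])
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# In the encoder, each block is followed by MaxPooling; in the decoder, each
# block is followed by MaxUnPooling that reuses the stored pooling indices.
pool = nn.MaxPool2d(kernel_size=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2)

x = torch.randn(1, 16, 64, 64)
encoded = DenseBlock(16, growth_rate=8, num_layers=3)(x)  # 16 + 3 * 8 = 40 channels
down, indices = pool(encoded)                             # 40 x 32 x 32
up = unpool(down, indices)                                # back to 40 x 64 x 64
```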

For our objective function, we use the ℓ1 pixelwise loss between the derained image, obtained by subtracting the DRN's output from a rainy image I, and its non-rainy pair I_base, computed across all color channels:

ℒ(θ) = ‖(I − θ(I)) − I_base‖_1        (3)
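Assuming the DRN predicts the rain layer R, so the derained estimate is I − θ(I), Equation (3) could be computed as in the following sketch.

```python
# A sketch of the pixelwise l1 objective in Equation (3). The DRN (`drn`) is
# assumed to output the rain layer, so the derained estimate is I - drn(I).
import torch
import torch.nn.functional as F

def derain_loss(drn: torch.nn.Module,
                rainy: torch.Tensor,   # I, shape (N, 3, H, W)
                clean: torch.Tensor    # I_base, shape (N, 3, H, W)
                ) -> torch.Tensor:
    predicted_rain = drn(rainy)         # estimate of R
    derained = rainy - predicted_rain   # estimate of B = I - R
    # Mean absolute error over every pixel and color channel.
    return F.l1_loss(derained, clean)
```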

Evaluation

In this study, we aimed to understand a few key capabilities of the DRN: (a) the representational capacity of the network architecture, (b) the ability of the network to recover the underlying statistical nature of the non-rainy versions of images, and (c) the effect of the deraining process on downstream models. For all three evaluations, we trained the DRN on Dsynthetic as described in the previous sections.

For (a), we visually evaluated the derained versions of images across varying intensities of simulated rain. Doing so allowed us to understand the level of noise that the DRN was capable of capturing. Observing Figure 3, below, we note that, even under extreme circumstances of obfuscation, the DRN is able to produce an output that allows us to recover a significant amount of fine detail. This suggests that the network architecture is capable of recovering information even in heavy rain conditions.

Figure 3: Same image with various levels of synthetic rain passed through the DRN. Input image (left column), rain texture map produced by the DRN (middle column), recovered image (right column).

For (b), we measured the comparative closeness of a non-rainy image with its rainy and derained counterparts by using Structural Similarity Index Measure (SSIM) [18], Universal Quality Index (UQI) [19], and Peak Signal to Noise Ratio (PSNR) [20] averaged across all images in Dsynthetic. These metrics capture statistical signatures such as luminance, contrast, structure, and color. Observing Table 1, we see that the DRN can recover a significant amount of the statistical information that is lost when rain is added to the images.

Table 1: Image similarity metrics between a rainy or DRN-cleaned image and the original image. For all metrics, higher is better. An SSIM or UQI of 1 signifies identical images. The maximum value for PSNR depends on image characteristics.
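As a rough illustration of how such metrics can be gathered, the sketch below computes PSNR directly from its definition and SSIM via scikit-image (assuming a recent version that provides the channel_axis argument); UQI is omitted since it is less commonly available in standard libraries.

```python
# Sketch of two of the similarity metrics from Table 1, assuming float images
# in [0, 1]. PSNR is computed from its definition; SSIM uses scikit-image.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in decibels."""
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# Example usage: compare rainy and derained versions against the clean original.
# ssim_rainy    = ssim(clean, rainy,    channel_axis=-1, data_range=1.0)
# ssim_derained = ssim(clean, derained, channel_axis=-1, data_range=1.0)
# psnr_derained = psnr(clean, derained)
```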

For (c), we performed two evaluations using two different types of image-based models, both meant to discover how the DRN could potentially help downstream tasks. The first evaluation observed the impact of deraining on the training of an object detection model, specifically looking at learning speed and convergence level. The second evaluation observed the impact of deraining on the accuracy of an already-trained classification model.

For detection, we trained two YOLOv4 [21] networks from scratch using Dreal. One network was trained directly on the real-rain images and the other was trained on the derained versions of these images. To track the learning progress of these models, we measured mean average precision at a 50% intersection-over-union threshold (mAP@0.5) on the entire training set after each epoch. Each model was trained five times and the five runs were averaged to produce the final result. Observing Figure 4, we note that the models trained on the derained images, on average, performed at least as well throughout training as the models trained on the rainy images, and converged to a solution with better metrics.

Figure 4: mAP@0.5 on the training dataset throughout learning. Higher is better.
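For reference, the mAP@0.5 metric counts a detection as correct only when its predicted bounding box overlaps a ground-truth box of the same class with an intersection over union of at least 0.5; a minimal IoU computation is sketched below for axis-aligned boxes given as (x1, y1, x2, y2).

```python
# Sketch of the intersection-over-union (IoU) computation that underlies
# mAP@0.5: a detection is a true positive only if its IoU with a ground-truth
# box of the same class is at least 0.5. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b) -> float:
    # Coordinates of the overlapping region (empty if the boxes are disjoint).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```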

For classification, we used a subset of the CURE-TSR dataset that is focused on rain in real environments. For simplicity, we grouped the dataset's weather intensity levels into low, medium, and high. For this task, we first trained a convolutional network on CURE-TSR images with no weather obfuscation. To gather evaluation metrics, we measured the classification accuracy of this model on both the rainy images as well as the derained versions of these images. Observing Table 2, we note that, in almost every case, the classification model performed better on the derained version of images.

Table 2: Classification performance on the challenging CURE-TSR dataset. Higher is better and a value of 1 indicates perfect performance.
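A simplified version of this evaluation loop is sketched below, assuming trained PyTorch modules classifier and drn and a data loader that yields (rainy image, label) batches; it illustrates the procedure rather than reproducing our exact evaluation code.

```python
# Sketch of the classification evaluation behind Table 2. `classifier` and
# `drn` are assumed to be trained PyTorch modules; `loader` yields
# (rainy_images, labels) batches.
import torch

@torch.no_grad()
def accuracy(classifier, drn, loader, derain: bool) -> float:
    correct, total = 0, 0
    for rainy, labels in loader:
        images = rainy - drn(rainy) if derain else rainy  # optionally remove rain first
        predictions = classifier(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total

# accuracy(..., derain=False) scores the rainy images;
# accuracy(..., derain=True) scores their derained counterparts.
```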

Discussion and Future Work

While the results obtained in this study show promise, there are certain limitations that deserve recognition. For one, the evaluations performed in this work were small scale, which brings into question their generalization potential. Ideally, we would perform these tests with much larger and more diverse datasets. Toyota is a global brand, and the visual characteristics of street signs, vehicles, and surroundings vary significantly across geography. For the DRN to be of global help, future training and evaluation should include images from all countries where Toyota products are deployed.

In addition, this study is limited to a single form of weather-based obfuscation. In a real-world deployment, vision models would be subject to a range of impediments such as fog, hail, and darkness. Evaluations (a) and (b) suggest that the DRN's network architecture is capable of improving images in various weather conditions, but this should be explicitly tested.

Finally, the tensors that are produced in the internals of the DRN are large, which may render the current architecture infeasible to run on compute-limited devices. It is not currently clear whether the size of the DRN is a major determining factor in its capabilities. From this, two avenues should be investigated. First, we should experiment with shrinking the DRN's architecture, potentially through network pruning [22]. Second, we should evaluate the potential for reducing the model's compute requirements through quantization, which is an active area of academic research [23, 24].
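As one example of the first avenue, PyTorch's built-in pruning utilities can zero out low-magnitude convolution weights, as sketched below; the 30% sparsity level is purely illustrative.

```python
# Sketch of magnitude-based pruning with PyTorch's built-in utilities, one
# possible way to shrink the DRN for compute-limited devices.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_layers(model: nn.Module, amount: float = 0.3) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # Zero out the smallest-magnitude weights in each convolution.
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruned weights permanent
    return model
```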

References

[1] S. Roy, S. Hossain, M. A. H. Akhand, and K. Murase, “A robust system for noisy image classification combining denoising autoencoder and convolutional neural network,” International Journal of Advanced Computer Science and Applications (IJACSA), vol. 9, 2018.

[2] G. Costa, W. Contato, T. Nazaré, J. Batista Neto, and M. Ponti, “An empirical study on the effects of different types of noise in image classification tasks,” arXiv:1609.02781 [cs.CV], 2016.

[3] C. Qiu, S. Zhang, C. Wang, Z. Yu, H. Zheng, and B. Zheng, “Improving transfer learning and squeeze-and-excitation networks for small-scale fine-grained fish image classification,” IEEE Access, vol. 6, pp. 78503–78512, 2018.

[4] Q. Yang, M. Yu, Y. Xu, and S. Cen, “Single image rain removal based on deep learning and symmetry transform,” Symmetry, vol. 12, p. 224, 2020.

[5] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown, “Rain streak removal using layer priors,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2736–2744.

[6] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 1685–1694.

[7] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao, “Single image deraining: A comprehensive benchmark analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3838–3847.

[8] D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib, “CURE-TSR: Challenging unreal and real environments for traffic sign recognition,” in Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for Intelligent Transportation Systems, Long Beach, CA, USA, 2017.

[9] D. Temel and G. AlRegib, “Traffic signs in the wild: Highlights from the IEEE Video and Image Processing Cup 2017 Student Competition,” IEEE Signal Processing Magazine, vol. 35, pp. 154–161, 2018.

[10] D. Temel, M.-H. Chen, T. Alshawi, and G. AlRegib, “Challenging environments for traffic sign detection: Reliability assessment under inclement conditions,” arXiv:1902.06857 [cs.CV], 2019.

[11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Munich, Germany: Springer International Publishing, 2015, pp. 234–241.

[12] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.

[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.

[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 2261–2269.

[15] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018, pp. 6389–6399.

[16] K. Kawaguchi and Y. Bengio, “Depth with nonlinearity creates no bad local minima in ResNets,” Neural Networks, vol. 118, pp. 167–174, 2019.

[17] M. Riesenhuber and T. Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.

[18] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[19] Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.

[20] I. Sazanita Isa, S. Noraini Sulaiman, M. Mestapha, and S. Darus, “Evaluating denoising performances of fundamental filters for T2-weighted MRI images,” in 19th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, Singapore, 2015, pp. 760–768.

[21] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv:2004.10934 [cs.CV], 2020.

[22] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, and S. Han, “APQ: Joint search for network architecture, pruning and quantization policy,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2075–2084.

[23] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, “BRECQ: Pushing the limit of post-training quantization by block reconstruction,” in International Conference on Learning Representations, Vienna, Austria, 2021.

[24] H. Yu, T. Wen, G. Cheng, J. Sun, Q. Han, and J. Shi, “Low-bit quantization needs good distribution,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 2909–2918.
