Accident detection from Live Cameras using Deep Learning


Video cameras have become a valuable tool for monitoring and regulating traffic in urban areas as a result of technological advances. They allow traffic flow to be analysed and monitored throughout a city. However, the number of cameras required for these tasks has grown dramatically over time, making manual supervision infeasible without automation techniques, since the number of experts needed to watch every feed grows accordingly. Deep learning algorithms have demonstrated superior performance in a wide range of tasks, particularly image understanding and analysis. Convolutional layers exploit the spatial relationships present in the input data, something dense neural networks cannot do given the size of the data. Applying convolutions to input data with a large number of features also helps avoid the curse of dimensionality, among other benefits.

Several types of video can be used to detect an accident: cameras inside the vehicle (first-person view), cameras at city intersections, and cameras on highways, among others. One line of work proposes an unsupervised model for detecting accidents in videos recorded with first-person vision. Vehicles are detected and located in the scene by computing a bounding box for each one. The vehicle's future position is then forecast from estimated attributes to check whether the boxes collide. With self-occlusion of the objects of interest in the video, this method can raise false alarms. Another approach, using the same kind of first-person footage, detects traffic accidents in three steps: vehicle detection, tracking, and attribute extraction.
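The collision check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the axis-aligned `(x1, y1, x2, y2)` box format, and the constant-velocity forecast are all assumptions made for the example.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def predict_box(box, velocity, dt=1.0):
    """Forecast a box's future position under an estimated (vx, vy) velocity.

    A constant-velocity motion model is assumed purely for illustration.
    """
    vx, vy = velocity
    x1, y1, x2, y2 = box
    return (x1 + vx * dt, y1 + vy * dt, x2 + vx * dt, y2 + vy * dt)
```

A collision alarm would then be raised when the forecast boxes of two tracked vehicles overlap, which is also where self-occlusion (one vehicle passing in front of another) can produce false positives.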

A video-based approach has also been developed for detecting traffic accidents. Its dataset contains 324 training examples covering six types of collision, with around 53 instances per category. More recently, a deep learning-based automated anomaly detection technique for pedestrian walkways (DLADT-PW) was described, which seeks to recognise and classify the various anomalies found on pedestrian paths. Its analysis, however, relies solely on per-frame information and ignores the inherent temporal component of video.


Convolution-based architectures are the most essential techniques for the visual analysis of images [68]. For image classification they offer a huge advance over typical artificial neural networks. Convolutional layers do not solve every difficulty, however: one of their weaknesses is that they are not effective at extracting temporal information from data. Recurrent neural networks were developed to exploit the temporal characteristics of the data, whereas convolutional layers were designed to exploit its spatial characteristics. Convolutional layers transform spatial information into a more abstract representation, and because they are so efficient at reducing the dimensionality of the input, these architectures are widely used as automatic extractors of image features. In a video, however, spatial information is not everything.
To address the traffic accident detection problem, the first component of the architecture is designed as an automatic image feature extractor that processes each frame of the video segment. This new representation of the data is then fed into an experimentally designed recurrent neural network to extract temporal information. Finally, a dense artificial neural network block performs the binary classification that identifies an accident.

To extract as much information as possible from video data, it is vital to capture both its temporal and spatial aspects, so a model capable of extracting both kinds of characteristics is required. A neural network architecture based on ConvLSTM layers is therefore presented, taking as input the feature vector produced by the modified InceptionV4 architecture. Finally, the model must decide whether or not the video segment contains a traffic collision; so that the final model generalises well, a dense artificial neural network block with regularisation techniques is used.


The proposed solution for the Delhi Police, currently at the proof-of-concept (POC) stage, is based on the extraction of visual and temporal features. In the first step of the model, the InceptionV4 architecture is truncated: all of the Inception cells (convolutional layers) are retained, while the multilayer perceptron at the end of the architecture is removed, so that this part of the model serves only to extract visual features.
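The truncation described above can be sketched in Keras. Note that `tf.keras.applications` does not ship InceptionV4, so InceptionV3 is used here as a stand-in backbone; `include_top=False` is what removes the final classifier head, leaving only the convolutional Inception cells. The input size is an illustrative assumption.

```python
import tensorflow as tf

def build_feature_extractor(h=149, w=149):
    # include_top=False drops the classification head, keeping only the
    # convolutional Inception cells, as in the truncation described above.
    # weights=None builds an untrained network; in practice pretrained
    # weights would likely be used.
    return tf.keras.applications.InceptionV3(
        include_top=False, weights=None, input_shape=(h, w, 3))
```

The resulting model maps each frame to a spatial feature map rather than a class prediction, which is exactly what the later temporal stage consumes.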
Recurrent neural networks are used to extract temporal features. The proposed architecture for this step consists of two ConvLSTM layers, which use the convolution operation to extract temporal information from data with more than one spatial dimension. A BatchNormalization layer is applied between them, and the various hyper-parameters are tuned.
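A minimal sketch of this temporal stage in Keras follows: two `ConvLSTM2D` layers with `BatchNormalization` between them, as described. The filter counts, kernel sizes, and input shape are illustrative assumptions, not values reported for the actual system.

```python
from tensorflow.keras import layers, models

def build_temporal_block(frames=16, h=3, w=3, channels=2048):
    # Input: a sequence of per-frame feature maps from the truncated backbone.
    return models.Sequential([
        layers.Input(shape=(frames, h, w, channels)),
        # First ConvLSTM keeps the time axis so the second can consume it.
        layers.ConvLSTM2D(64, kernel_size=3, padding="same",
                          return_sequences=True),
        layers.BatchNormalization(),
        # Second ConvLSTM collapses the time axis into a single feature map.
        layers.ConvLSTM2D(32, kernel_size=3, padding="same"),
    ])
```

The output is a single spatial feature map summarising the segment's motion, ready for the dense classification block.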
A densely connected block provides the final stage of the accident detection procedure. The proposed network consists of two hidden layers of 400 and 100 neurons, both using the hyperbolic tangent activation function, followed by a single-neuron output layer with the sigmoid activation function to perform the binary classification (accident or non-accident). Dropout regularisation with a rate of 0.3 is applied.
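The neuron counts (400, 100, 1), the activations (tanh, tanh, sigmoid), and the dropout rate of 0.3 come from the description above; the placement of the dropout layers and the input dimension are assumptions made for this sketch.

```python
from tensorflow.keras import layers, models

def build_classifier(input_dim=288):  # input_dim is an illustrative assumption
    return models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(400, activation="tanh"),
        layers.Dropout(0.3),            # dropout placement assumed
        layers.Dense(100, activation="tanh"),
        layers.Dropout(0.3),
        # Single sigmoid neuron: probability of "accident" vs "non-accident".
        layers.Dense(1, activation="sigmoid"),
    ])
```

A sigmoid output paired with a binary cross-entropy loss is the standard choice for this kind of two-class decision.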


Because the similarity values between segments produced by the different frame-selection techniques differ only negligibly, the technique that best represents a temporal segment of a traffic accident is the one that discards no data: computational cost, processing time, and detection accuracy all improve when frame selection is not conditioned on a metric. Artificial vision has made significant progress in understanding video scenes, and artificial neural networks are among the most effective strategies. To extract as much information as possible from the input data, many of these models use architectures built from convolutional and recurrent layers. The proposed method is based on this type of architecture and performs well at detecting traffic accidents in videos, achieving an F1 score of 0.98 and an accuracy of 98%.
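For reference, the two reported metrics are computed as follows for a binary classifier. This is the standard definition, not code from the system itself.

```python
import numpy as np

def f1_and_accuracy(y_true, y_pred):
    """Binary F1 score and accuracy from 0/1 labels and 0/1 predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return f1, accuracy
```

Because F1 combines precision and recall, a 0.98 F1 alongside 98% accuracy indicates the model is not simply exploiting class imbalance.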
The proposed model performs well at detecting traffic accidents in video. However, the conditions under which it operates are constrained by the scarcity of such datasets in the scientific community. Owing to the small number of cases available, the solution is limited to automobile crashes, omitting motorbikes, bicycles, and pedestrians. The model also makes mistakes on accident segments with poor illumination (such as videos recorded at night), low resolution, or occlusion (low-quality video cameras and viewpoints).