\section{YOLOv3 Against YOLOv3\textendash Tiny}\label{sec:yolo-against-yolo-tiny}

The goal is to determine how both networks perform on traffic lights.
For this purpose, both networks are trained on the Cityscapes dataset without transfer learning.
The implementation is completed by a routine that converts the Cityscapes dataset, which comes as single images, into a file format that TensorFlow can read faster, so-called tfrecords.
Tfrecords are a list of entities consisting of data and labels.
The input data is unified, and each image is stored together with its corresponding labels from a separate folder in one entity.
The label for an image is, in this case, a list of bounding boxes.
A bounding box is defined by its upper-left and its lower-right coordinate.
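The corner representation can be illustrated with a short sketch; `polygon_to_bbox` is a hypothetical helper (not the thesis code) that reduces a label polygon, as Cityscapes provides them, to the upper-left and lower-right corner pair stored in the tfrecords:

```python
def polygon_to_bbox(points):
    """Reduce a label polygon to the (x1, y1, x2, y2) corner pair:
    upper-left corner (min x, min y), lower-right corner (max x, max y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```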

The introduction of the metric \ac{mAP} allows measuring precision and is, therefore, one of the benchmarks for overall performance.\myref{equ:mAP}
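\ac{mAP} builds on the overlap between predicted and ground-truth boxes; the following is a minimal sketch of that intersection-over-union computation on the corner representation (an illustration, not the evaluation code used here):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    i.e. upper-left and lower-right corner."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```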

The results show a significant difference in speed.
YOLOv3-Tiny, with about one-tenth of the weights, achieves about twice the \ac{FPS}\footnote{The system used for the test has an \ac{HDD} built in. Larger differences can be expected when using an SSD.}.
Due to the smaller capacity of the network, the \ac{mAP} is lower.
That manifests itself in dropped detections and instability of the bounding boxes in the video: the center moves around the actual center, and the size alternates between too large and too small in successive frames.

YOLOv3 also has localization problems.
Although the misplaced boxes are reduced when using the larger network, the detected size is still unstable and still deviates strongly from the target.

It is to be expected that YOLOv3-Tiny achieves worse results than YOLOv3.
Whether the higher frame rate and the lower computational effort are sufficient compensation will have to be shown by further tests.


\section{Varying the Image Input Size}\label{sec:varying-the-image-input-size}

Traffic lights have the particular characteristic of being extremely small.
As already explained in detail \myref{subsec:driveu-traffic-light-dataset-(dtld)}, a traffic light at a distance of 100 meters with an image width of 2048 pixels is only about 1.7 pixels wide.
Especially YOLOv3-Tiny has problems with this because it lacks the feature pyramid layer for small objects.
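The 1.7-pixel figure follows from the projection formula used for the distance estimation later in this chapter, assuming the 100 degree camera opening angle referred to there:

```latex
m = \frac{a\,w}{2d\cdot\tan\left(\frac{\theta}{2}\right)}
  = \frac{0.2\,\mathrm{m}\cdot 2048\,\mathrm{px}}{2\cdot 100\,\mathrm{m}\cdot\tan\left(50^{\circ}\right)}
  \approx 1.7\,\mathrm{px}
```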

Since the architecture of YOLOv3 consists only of convolutional layers, the time needed for recognition increases linearly with the image size.

When downsampling the image to a smaller resolution, it is mainly the width that becomes a problem because there is a risk that the traffic light is no longer discernible in the image.
The images of both datasets, Cityscapes and DriveU, have a resolution of 1024x2048 and are twice as wide as they are high\myref{fig:reshape_original}.
Downsampling to a square image would therefore affect the width more than the height\myref{fig:reshape_square}.
Scaling to a rectangular image helps to counteract the problem\myref{fig:reshape_rect}.
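The effect can be sketched with a hypothetical helper that derives the network input shape from a target height while preserving the 1:2 aspect ratio (an illustration of the reasoning, not the thesis implementation):

```python
def network_input_shape(img_h, img_w, target_h):
    """Scale to the target height and keep the aspect ratio,
    so the width is not squeezed more than the height."""
    scale = target_h / img_h
    return target_h, int(round(img_w * scale))
```

For a 1024x2048 frame and a target height of 416 this yields a 416x832 input, whereas a square reshape to 416x416 would halve the horizontal resolution again.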


% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=\linewidth]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Original image} %Bildunterschrift
\label{fig:reshape_original} %fig:ID
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=0.5\linewidth,height=2.45cm]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Square reshape} %Bildunterschrift
\label{fig:reshape_square} %fig:ID
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=0.66\linewidth]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Same area, more width} %Bildunterschrift
\label{fig:reshape_rect} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}

The previous Python implementation of YOLO can theoretically handle different image sizes, but bugs occur in the recognition of the bounding boxes during training.
These are fixed, and additionally, the possibility is implemented to adjust the width independently of the height in order to train YOLOv3 on rectangular images.
In principle, a trained YOLOv3 network can be applied to different image sizes~\myref{subsubsec:flexible-image-size}, but detection quality is poor if traffic lights then appear with an unknown aspect ratio or an unknown size.

At this point, one can take advantage of the fact that traffic lights are usually located in the upper half of the image.
If the network tests only the upper half of the image, the speed doubles almost without a loss of precision.
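The restriction to the upper half can be sketched as a simple crop on a row-major image (illustrative only; box coordinates then refer to the cropped image):

```python
def upper_half(image_rows):
    """Keep only the upper half of the image, where traffic lights
    are usually located; halves the number of pixels to process."""
    return image_rows[: len(image_rows) // 2]
```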

\section{Make Use of the Pre-Trained Darknet-53}\label{sec:make-use-of-the-pre-trained-darknet-53}

The pre-trained weights of YOLOv3 and the preceding Darknet are trained on the COCO dataset~\cite{yoloWebsite}, which contains over 120,000 images~\cite{coco}.
It is recommended to use the already trained Darknet and to train only the YOLO part with one's own data.

For this purpose, the trained weights for the Darknet are loaded and frozen so that they cannot be trained further.
The YOLO part is still initialized randomly and trained on the custom datasets.
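A framework-agnostic sketch of the split: variables are partitioned by name so that only the YOLO head is handed to the optimizer. The `darknet53/` prefix is an assumed naming convention, not necessarily the one used in the thesis code:

```python
def split_trainable(var_names):
    """Partition variable names into the frozen Darknet-53 backbone
    and the still trainable YOLO head (prefix is illustrative)."""
    frozen = [v for v in var_names if v.startswith("darknet53/")]
    trainable = [v for v in var_names if not v.startswith("darknet53/")]
    return frozen, trainable
```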

Despite a slight decrease in \ac{mAP}, the detections are more stable and reliable.


\section{Recurrent Neural Network}\label{sec:recurrent-neural-network}

Videos, such as those created while traveling in an autonomous car, contain not only a spatial component but also a temporal one.
YOLO, however, only considers the spatial component, and thus, in a video, all images are processed individually.
YOLOv3 needs to be extended by a temporal component in order to make use of the knowledge from the previous steps.

Neither dataset includes annotated videos, so the implementation must be based on the existing annotated photos.
There are two approaches to the implementation.

\subsection{Heat Map}\label{subsec:heat-map}

The concept of a heat map is based on the assumption that near the position where a traffic light was detected in the previous frame, there must also be a traffic light in the current frame, only shifted by a few pixels.

Instead of an image with the dimensions 3x416x832 as input tensor, a heat map is added alongside the color channels, resulting in an input tensor of 4x416x832.
This additional layer contains the results of the previous steps.
The heat map is initially 0 at every position and not activated anywhere.
After the first frame, each bounding box and its area are added to the heat map.
Now these areas are a bit warmer.
This repeats every frame, whereby the old values cool down a bit before the new ones are added, so that areas that have not been activated for a long time return to 0.
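The cool-down-then-add cycle can be sketched as follows; the cooling and warming factors are illustrative placeholders for the hyper-parameters discussed below:

```python
def update_heatmap(heatmap, boxes, cool=0.5, warm=0.5):
    """One heat-map step: cool down the old activations, then add
    heat inside every detected bounding box (x1, y1, x2, y2)."""
    h = [[v * cool for v in row] for row in heatmap]
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                h[y][x] = min(1.0, h[y][x] + warm)
    return h
```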

% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{1.0\textwidth}
\includegraphics[width=\linewidth]{../data/heatmap.jpg}
\caption[Heatmap Proposal]{\textbullet~First Row: First iteration; Initial heat map not activated; Add bounding boxes to heat map \\ \textbullet~Second Row: Further iteration; Pass last heat map; Cool down; Add bounding boxes to heat map} %Bildunterschrift
\label{fig:heatmap} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}

Annotated images can be used to achieve supervised (teacher) learning.
To generate the training heat maps, the labels are applied to the heat map with some translation towards the center of the image and some blur.
That emulates the result of the previous step.

This approach has the advantage that it is comparatively easy to implement and extends several frames back into the past.

Currently, the network is only used to detect traffic lights.
A great strength of YOLO is recognizing several classes at the same time without additional resources.
To get a network with multiple classes working, each new class needs its own heat map.
The input tensor then reaches a high depth, and the three color layers lose their influence.

Darknet would also have to be trained again, because the input tensor of the pre-trained weights has a fixed depth of three.
Since the use of the pre-trained weights gives the network high stability, they should be used.

During the implementation, many hyper-parameters accumulate: the cooling rate of the old values after each step, the warming rate of the newly added values, and the amount of blur and center shift added during training.
The correct choice of these hyper-parameters is fragile and makes goal-oriented training difficult.

The implementation of the prototype is aborted in favor of another method.

\subsection{Recurrent Layers}\label{subsec:recurrent-layers}

The use of recurrent layers is more challenging.
The output boxes of YOLOv3 from the previous step are combined with those from the current step and passed through LSTM cells to produce the new output boxes.

% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.50\textwidth}
\includegraphics[width=\linewidth]{../data/lstm.pdf}
\caption{Schematic of the Recurrent Extension Made to YOLOv3} %Bildunterschrift
\label{fig:lstm} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}

For supervised training, the labels must be slightly randomized and then transformed into the domain of the network's raw output.
The transformation requires reverse-engineering the last layer of the network and calculating inverse functions for it.

The use of LSTM cells makes it possible to parse a video without additional manual calculations, such as creating a heat map.

This approach works best when annotated training videos are already available.
That makes it possible not only to emulate the last frame but to actually provide, for example, the last 16 frames as input.
For this purpose, recorded ROS bags can be used in which traffic lights were annotated with classical image processing methods.

The marginal improvements indicate that one frame in the past is not sufficient.

\section{Further Improvements}\label{sec:further-improvements}

The following improvements do not necessarily serve to increase precision, but they do show the limitations of YOLOv3.

\subsection{Detection of the Traffic Light Signal}\label{subsec:detection-of-the-traffic-light-signal}

Up to now, all traffic lights have been combined in one class.
The DriveU dataset differentiates the traffic light signal into individual classes depending on its state.

When creating the tfrecords for DriveU, a parameter for this differentiation has already been extracted.
With this parameter, the dataset is divided into classes and trained.
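A sketch of the class split; the state names and indices below are purely illustrative, not the actual DriveU label encoding:

```python
# Illustrative mapping only -- the real DriveU label encoding differs.
STATE_TO_CLASS = {"red": 0, "yellow": 1, "green": 2, "off": 3}

def state_class(state):
    """Map the extracted state parameter to a training class index."""
    return STATE_TO_CLASS[state]
```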

The result is not very satisfying, and this approach will not be pursued further.
Especially for small, distant traffic lights, the recognition is imprecise.
The detection of small objects has to be improved first.

\subsection{Distance Estimation}\label{subsec:distance-estimation}

The closer a traffic light is, the larger it appears in the image.
This circumstance can be used for distance estimation.
Based on this, a braking behavior can be determined.
In an earlier chapter \myref{equ:distance_measure}, a formula was introduced while arguing that a traffic light at a certain distance is only tiny in the captured picture.
After rearranging, the distance of the traffic light can be determined from the camera opening angle, the traffic light width (20cm), and the proportion of the traffic light within the image relative to the image width (for example, 2px of 832px).

{\begin{figure}
\def\arraystretch{1.2}
\begin{equation}
\begin{array}{rcl}
a&:=&\textup{\q{Traffic light width (mostly $20cm$)}} \\
m&:=&\textup{\q{Measured width of bounding box in px}} \\
m&=&\frac{a}{p(d)}\cdot w \\
p(d)&:=&\textup{\q{Plane width in a distance $d$}} \\
p(d)&=&2d\cdot\tan(\frac{\theta}{2}) \\
w&:=&\textup{\q{Width of the captured image (e.g. 2048px)}} \\
\theta&:=&\textup{\q{Camera aperture}} \\
d&:=&\textup{\q{Distance of the traffic light}} \\
m&=&\frac{aw}{2d\cdot\tan(\frac{\theta}{2})} \\
d&=&\underbrace{\frac{aw}{2\cdot\tan(\frac{\theta}{2})}}_{const.}\cdot \frac{1}{m}
\end{array}\label{eq:distance_measure}
\end{equation}
\caption{Calculating the Distance of a Traffic Light}
\end{figure}
}
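The rearranged formula translates directly into code; a minimal sketch whose defaults are the values used in this chapter (2048px image width, a 100 degree opening angle, and the 20cm standard traffic light width):

```python
import math

def traffic_light_distance(m_px, width_px=2048, aperture_deg=100.0, light_m=0.2):
    """d = a*w / (2*tan(theta/2) * m): distance in meters from the
    measured bounding-box width m in pixels."""
    theta = math.radians(aperture_deg)
    return light_m * width_px / (2.0 * math.tan(theta / 2.0) * m_px)
```

The inverse proportionality is visible directly: halving the measured pixel width doubles the estimated distance.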

The problem is the parameters needed for the formula.
While the camera aperture is fixed, the other parameters can only be determined inaccurately.
According to EN12368~\cite{anforderungenLichtsignalanlagen}, the traffic light's width is standardized to 20 centimeters but can be 30 centimeters in hazardous areas, and in rare cases 10 centimeters are allowed.
Even more challenging to determine is the traffic light's width in the picture.
Furthermore, from a certain distance on, the difference in width is only marginal.

\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}
[
xlabel={traffic light width in px},
ylabel={distance in m},
grid,
scaled y ticks=false,
]
\addplot+[mark repeat=10,domain=1:21, samples=200] {0.2*2048/(2*tan(50))/x};
\end{axis}
\end{tikzpicture}
\caption{Traffic light size compared with its distance using a 100 degree camera opening angle}
\label{fig:plot_distance_vs_px} %fig:ID
\end{figure}

Using a 100 degree camera opening angle, a 2-pixel wide bounding box can belong to a traffic light 113 or 68 meters away, since the measured width is only accurate to whole pixels.
Due to the inverse proportionality, a usefully accurate distance indication is only given beneath a distance of about 40 meters.\myref{fig:plot_distance_vs_px}

% Image
\begin{figure}
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.85\textwidth}
\includegraphics[width=\linewidth]{../data/distance_measure.png}
\caption{Output of traffic lights with distance marker} %Bildunterschrift
\label{fig:distance_measure} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}