This commit is contained in:
Sebastian Seedorf
2020-11-25 15:14:56 +01:00
parent 4faabe5409
commit 135a6c7621
24 changed files with 967 additions and 118 deletions

View File

@@ -7,14 +7,15 @@
\publishers{
\begin{tabular}{l l}
\textbf{\normalsize{Year of Study:}} & \normalsize{\jahrgang} \tabularnewline
\textbf{\normalsize{Faculty:}} & \normalsize{} \tabularnewline
\textbf{\normalsize{Faculty:}} & \normalsize{Department of Mathematics and Computer Science} \tabularnewline
\textbf{\normalsize{Study Course:}} & \normalsize{Computer Science} \tabularnewline
\textbf{\normalsize{First Examiner:}} & \normalsize{\erstBetreuer} \tabularnewline
\textbf{\normalsize{Second Examiner:}} & \normalsize{\zweitBetreuer} \tabularnewline
\end{tabular}
}
\titlehead{
\hspace{15mm}
\includegraphics[scale=0.1]{../data/fu_logo.png}
\vspace{-10mm}
\hspace{25mm}
\includegraphics[scale=0.25]{../data/fu_logo.png}
}
\maketitle

View File

@@ -1,49 +1,19 @@
\section*{Abstract}
\addcontentsline{toc}{chapter}{Abstract}
Autonomous driving is becoming increasingly important in the 21st century.
It offers the possibility to complement existing mobility concepts and to enable new groups of people, such as people with limited mobility, to move independently.\cite{maurer2015autonomes}
In recent years, many new and innovative ideas have been established, blurring the boundaries between local public transport and individual transport more and more.
\ac{YOLO} is a widespread neural network architecture that has been used in many applications, especially since the release of the third version.
By now, implementations for different frameworks exist.
\ac{YOLO} is characterized by the high accuracy and speed with which it can recognize objects in an image.
Particularly in the field of autonomous driving, \ac{YOLO} has gained tremendous popularity.
The recognition of traffic lights in the successful \acs{FU} project \q{autonomos} still requires human assistance in some cases.
The BVG (Berliner Verkehrsgesellschaft) in Berlin, for example, offers the BerlKönig service, which functions as a mix of on-call bus and cab.
When ordering a ride, passengers specify the start and destination points and the number of people to be transported.
The system then bundles similar trips together so that the called BerlKönig bus can run several tours simultaneously.
That both reduces the fare compared to a cab ride and benefits the environment.\cite{berlkoenig}
Uber modernizes the cab industry and achieves a higher capacity utilization of existing resources.
Thus, Uber's drivers have a significantly increased workload and fewer empty runs.\cite{10.1257/aer.p20161002}
The cited article attributes these improvements partly to outdated and inefficient regulations in the cab industry that do not apply to Uber, but also to better driver coordination with \q{more efficient driver-passenger matching technology based on mobile Internet technology}.
In this master thesis, a different approach is pursued.
The traffic light recognition shall be automated with the help of the network YOLOv3.
It is investigated which special requirements traffic light recognition imposes and how a neural network can fulfill these specifications.
Afterward, YOLOv3 is applied to the problem with respect to these requirements.
In individual experiments, it is investigated which hyperparameters are the most suitable for the network.
The route planning for the drivers is usually already predefined by the system, which also guides them by suggesting possible connecting tours.
In the case of BerlKönig, the driver no longer has any room for maneuver.
Both concepts presented can be further optimized by autonomous vehicles.
Safety plays a unique role in the development of autonomous cars.
In theory, autonomous vehicles from Waymo, a subsidiary of the Alphabet Group, to which Google also belongs, have been driving since 2012; they covered more than 14 million kilometers in the period up to 2018 and, in autonomous mode, caused only one accident themselves.\cite{Herger_2018}
In practice, many questions are still unanswered.
Sensors and vehicle communication can be disturbed, legal liability issues are still unresolved, and the technical aspects, including implementation, are still partially unresolved.\cite{glancy2015autonomous}
Because of these open problems, Waymo cars only drive on specific routes in simple neighborhoods with little traffic in California's sunny Mountain View.
Concerning safety, the detection of traffic light signals in road traffic, in the following called traffic lights, is an essential part of autonomous driving.
Previous technology used at the Free University of Berlin to detect traffic lights is based on markings in maps in which the car searches for traffic lights using image processing methods.
Furthermore, a project carried out over the past two years in a test area in Berlin, in cooperation with the Senate Department for Environment, Traffic and Climate Protection and the Fraunhofer Institute FOKUS, has extended traffic light systems at intersections with transmitter masts.
These indicate not only the general condition of the crossings (including the layout of the turning lanes and road works) but also the current status of the traffic lights.\cite{safari}
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.85\textwidth}
\includegraphics[width=\linewidth]{../data/safari.png}
\caption{How SAFARI Works} %caption
\label{fig:safari} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
This method of detecting traffic lights works very well on test routes or routes that have been upgraded for this purpose, but not in unknown terrain.
The systems fail at temporary traffic lights at road works and in other countries that are not equipped with such a system.
In many situations, such retrofitting of existing traffic lights is not economically viable, especially since all traffic lights would have to be retrofitted for the use of autonomous cars without a transitional period.
In this thesis, a different approach to traffic light detection shall be applied.
Here, the hardware of the cameras and graphic units already installed in the autonomous car shall be used to detect traffic lights in real-time utilizing a neural network.
Several improvements were achieved in this thesis.
Traffic light recognition with the retrained network is better than using the pre-trained weights on the traffic light class.
Further improvements that became evident while using YOLO have been made or suggested. Nevertheless, there are still many possibilities for improvement based on the findings of this work.
At the end of the thesis, some of them are explained in more detail.

View File

@@ -1,26 +1,35 @@
{\let\cleardoublepage\relax
\newpage
\chapter*{List of Abbreviations}}
\addcontentsline{toc}{chapter}{Abkürzungsverzeichnis}
\addcontentsline{toc}{chapter}{List of Abbreviations}
\begin{acronym}[SEPSEP]
\acro{API}{Application Programming Interface}
\acro{CCR}{Center for Clinical Research}
\acro{CDISC}{Clinical Data Interchange Standards Consortium}
\acro{CRF}{Case Report Form}
\acro{CRUD}{Create-Read-Update-Delete}
\acro{CSV}{Comma-separated values}
\acro{CRUD}{Create\textendash Read\textendash Update\textendash Delete}
\acro{CSV}{Comma\textendash Separated Values}
\acro{DDS}{Data Definition Sheet}
\acro{FDA}{Food and Drug Administration}
\acro{FPS}{Frames per Second}
\acro{FU}{Freie Universität}
\acro{GUI}{Graphical User Interface}
\acro{HDD}{Hard Disk Drive}
\acro{HTML}{Hypertext Markup Language}
\acro{HTTPS}{Hypertext Transfer Protocol Secure}
\acro{HTTP}{Hypertext Transfer Protocol}
\acro{ID}{Identification}
\acro{IT}{Information Technology}
\acro{IoU}{Intersection over Union}
\acro{JSON}{JavaScript Object Notation}
\acro{JS}{JavaScript}
\acro{LDAP}{Lightweight Directory Access Protocol}
\acro{LIDAR}{Light Detection and Ranging}
\acro{mAP}{Mean Average Precision}
\acro{MDD}{Medical Device Directive}
\acro{MIT}{Massachusetts Institute of Technology}
\acro{PDF}{Portable Document Format}
\acro{RADAR}{Radio Detection and Ranging}
\acro{RDBMS}{Relational Database Management System}
\acro{REST}{Representational State Transfer}
\acro{SAS}{Statistical Analysis System}
@@ -28,6 +37,7 @@
\acro{SQL}{Structured Query Language}
\acro{URI}{Uniform Resource Identifier}
\acro{URL}{Uniform Resource Locator}
\acro{YOLO}{You Only Look Once}
\end{acronym}
% Reference with \acs{edc} - force the short form
% \acl{edc} - force the long form

View File

@@ -90,15 +90,32 @@ Furthermore, the mAP can only provide accuracy compared to the annotated trainin
If these are inaccurate or have a bias, the metric will not reflect it.
The metric is suitable for comparisons that use the same dataset and visualization.
\begin{table}[!htbp]
\footnotesize
\begin{tabularx}{\textwidth}{@{}lrrrr@{}}
\toprule
& \multicolumn{2}{l}{\textbf{YOLOv3}} & \multicolumn{2}{l}{\textbf{YOLOv3-Tiny}} \\ \toprule
& \multicolumn{1}{l}{\textbf{80 classes}} & \multicolumn{1}{l}{\textbf{Traffic light only}} & \multicolumn{1}{l}{\textbf{80 classes}} & \multicolumn{1}{l}{\textbf{Traffic light only}} \\ \midrule
\textbf{mAP with IoU of 0.25} & 0.91 & 0.75 & 0.60 & \\
\textbf{mAP with IoU of 0.50} & 0.77 & 0.35 & 0.07 & \\
\textbf{mAP with IoU of 0.75} & 0.37 & 0.15 & 0.00 & \\
\textbf{time (minimum)} & 93 ms & 98 ms & & 25 ms \\
\textbf{time (25th percentile)} & 97 ms & 101 ms & & 27 ms \\
\textbf{time (50th percentile)} & 98 ms & 102 ms & & 28 ms \\
\textbf{time (75th percentile)} & 100 ms & 104 ms & & 30 ms \\
\textbf{time (maximum)} & 135 ms & 135 ms & & 43 ms \\ \bottomrule
\end{tabularx}
\caption{Comparison of YOLOv3 variants with pre-trained weights}
\label{tab:comparison-default}
\end{table}
The reference weights achieve an AP\textsuperscript{IoU=0.5} of 0.35 in the class traffic lights, significantly below the average across all classes of 0.70.
A possible reason for this can again be the size.
An absolute translation of a few pixels in a small bounding box results in a much faster declining IoU than in large objects.
The tiny variant is about 3.5 times faster than the large network.
However, the average precision is only approximately two thirds as high.
The resulting annotations also clearly show the difference: Many traffic lights are not recognized or only inaccurately.
When viewing the annotated images, an additional problem becomes apparent.
In the COCO dataset, the traffic light class includes all traffic lights, including those that are not directed at the vehicle from the front or are only indirectly relevant, such as pedestrian lights.

View File

@@ -1,13 +1,26 @@
\chapter{Introduction}\label{ch:introduction}
\input{04a01-introduction}
\chapter{Fundamentals}\label{ch:fundamentals}
\input{04a01-requirement-analysis}
\input{04a11-requirement-analysis.tex}
\newpage
\input{04a02-yolo}
\input{04a12-yolo.tex}
\newpage
\input{04a03-datasets}
\input{04a13-datasets.tex}
\chapter{Implementation}\label{ch:implementation}
\input{04-a21-implementation-init}
\input{04a21-implementation-init.tex}
\input{04a22-implementation-changes.tex}
\chapter{Experiments}\label{ch:experiments}
\input{04a31-experiments}
\chapter{Conclusion}\label{ch:conclusion}
\input{04a-41-conclusion}

View File

@@ -0,0 +1,15 @@
%! Author = sebastian
%! Date = 10.11.20
% Preamble
\documentclass[11pt]{article}
% Packages
\usepackage{amsmath}
% Document
\begin{document}
\end{document}

View File

@@ -0,0 +1,6 @@
@article{ning2016spatially,
title={Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking},
author={Ning, Guanghan and Zhang, Zhi and Huang, Chen and He, Zhihai and Ren, Xiaobo and Wang, Haohong},
journal={arXiv preprint arXiv:1607.05781},
year={2016}
}

src/04a-41-conclusion.tex Normal file
View File

@@ -0,0 +1,38 @@
The experiments show that traffic light recognition with an optimized implementation of YOLOv3 is adequate.
YOLOv3 has many hyperparameters, which can and must be adapted.
The advantage of the algorithm is the flexibility in rescaling.
This can be dynamically adjusted depending on the situation and resource availability.
Especially in autonomous cars, the availability of resources such as energy and space is limited.\myref{subsubsec:portability}
YOLOv3 is, according to the results of the experiments, preferable to YOLOv3-Tiny.
Although the Tiny variant allows up to 8 times more frames per second on a system with an SSD, the accuracy is also significantly lower.
The additional layer in the Feature Pyramid makes YOLOv3 more suitable for small objects.
For traffic light detection, an average of 20 \ac{FPS}, which YOLOv3 offers, is sufficient.
\section{Perspective}\label{sec:perspective}
The current implementation uses the anchors suggested by the YOLOv3 paper.
These have been developed by analyzing the COCO dataset with the 80 standard classes.
A k-means algorithm has been run over all bounding boxes of all objects to find the nine best-distributed anchors.
The class \q{traffic lights} is exceptional here and has a substantially different distribution than the 80 very heterogeneous standard classes.
The anchors should therefore be recalculated with k-means so that the traffic lights are distributed more evenly over the anchors, allowing the network to specialize better.
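A minimal sketch of such a recalculation, using the standard YOLO approach of clustering box widths and heights with $1 - IoU$ as the distance measure (illustrative only, not the code used in the experiments):
\begin{verbatim}
import numpy as np

def iou_wh(boxes, anchors):
    # IoU of width/height pairs, as if all boxes shared one corner
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + anchors[None, :, 0] * anchors[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iterations=100, seed=0):
    # boxes: (n, 2) array of bounding-box widths and heights
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iterations):
        nearest = np.argmax(iou_wh(boxes, anchors), axis=1)  # distance = 1 - IoU
        for j in range(k):
            if np.any(nearest == j):
                anchors[j] = boxes[nearest == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
\end{verbatim}
Run over only the traffic light boxes of the training set, this would yield nine anchors specialized to the narrow, vertical shapes of traffic lights.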
To avoid depending on a fixed size even after training, the network should use the possibility of flexible scaling and change the image size every ten batches during training, as suggested in the paper.\cite{DBLP:journals/corr/abs-1804-02767}
Tensorflow does not offer the possibility to vary the input size of a model during training.
The model has to be saved and reloaded with a changed input size.
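A sketch of this workaround with a Keras-style model (names and paths here are illustrative, not the fork's actual code): because the network is fully convolutional, its weight shapes do not depend on the input size, so the same checkpoint fits a model rebuilt with a different input.
\begin{verbatim}
import tensorflow as tf

def build_fcn(size):
    # fully convolutional body: only the output grid changes with size
    inp = tf.keras.Input(shape=(size, size, 3))
    x = inp
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv2D(filters, 3, strides=2,
                                   padding="same", activation="relu")(x)
    return tf.keras.Model(inp, x)

model = build_fcn(416)
model.save_weights("checkpoint.tf")    # illustrative path
resized = build_fcn(608)               # identical layers, new input size
resized.load_weights("checkpoint.tf")  # weights fit; convolutions are size-agnostic
\end{verbatim}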
With the existing data, it is possible to distinguish traffic lights in color.
For this purpose, the size of the traffic lights and the identifiability of the traffic light state on the image are crucial.
Either the network decides everything at a glance, which is the principle of YOLO, or the detected traffic lights are processed again in a second stage to determine their class.
The advantage of the second method is that searching through the entire image is faster because a lower resolution suffices for that step.
YOLOv4 was relatively new when this master thesis was started.
This successor project, published by a third party, has addressed this problem in the further development of YOLOv3.
YOLO has problems with small objects.
These are predominantly targeted with the new version, and the recognition improves.\cite{DBLP:journals/corr/abs-2004-10934}
The change between YOLOv3 and YOLOv4 is substantial.
The use of recurrent layers tends to show good results.
On the one hand, annotated videos allow introducing a hard metric for the precision of bounding boxes in videos.\cite{Mao_2019_ICCV}
On the other hand, more advanced teacher learning can be used without having to emulate a label for the current frame beforehand.
Viewing the last n frames makes the network more stable against short dropouts.\cite{ning2016spatially}

View File

@@ -0,0 +1,54 @@
% Abstract
@book{maurer2015autonomes,
title={Autonomes Fahren},
author={Maurer, Markus and Gerdes, J Christian and Lenz, Barbara and Winner, Hermann},
year={2015},
publisher={Springer Berlin Heidelberg},
pages={176--177}
}
@misc{berlkoenig,
title = {Berlkoenig {\textbar} {Die} {Idee}},
url = {https://www.berlkoenig.de/die-idee},
abstract = {Der BerlKönig chauffiert dich sicher und zuverlässig durch Berlin. Einfach per App bestellen, bargeldlos zahlen \& los fahren.},
urldate = {2020-06-03}
}
@article{10.1257/aer.p20161002,
Author = {Cramer, Judd and Krueger, Alan B.},
Title = {Disruptive Change in the Taxi Business: The Case of Uber},
Journal = {American Economic Review},
Volume = {106},
Number = {5},
Year = {2016},
Month = {May},
Pages = {177--182},
DOI = {10.1257/aer.p20161002},
URL = {https://www.aeaweb.org/articles?id=10.1257/aer.p20161002}
}
@misc{Herger_2018,
title={Waymo: 14,4 Millionen Kilometer und ein Technologievorsprung von Jahren},
url={https://derletztefuehrerscheinneuling.com/2018/09/02/waymo-144-millionen-kilometer-und-technologievorsprung-von-jahren/},
abstractNote={Ein neuer Monat, ein neuer Meilenstein für Waymo, Googles Selbstfahrtechnologie-Gruppe. Neun Millionen Meilen (14,4 Millionen Kilometer) in autonomen Modus haben die Autos nun erreicht, bei einer a…},
journal={Der letzte Führerscheinneuling...},
author={Herger, Mario},
year={2018},
month={Sep}
}
@article{glancy2015autonomous,
title={Autonomous and automated and connected cars-oh my: first generation autonomous cars in the legal ecosystem},
author={Glancy, Dorothy J},
journal={Minn. JL Sci. \& Tech.},
volume={16},
pages={684},
year={2015},
publisher={HeinOnline}
}
@misc{safari,
title = {SAFARI Verkehrspolitik: Forschungs- und Entwicklungsprojekte / Land Berlin},
url={https://www.berlin.de/senuvk/verkehr/politik_planung/projekte/safari/index.shtml},
abstractNote={Senatsverwaltung für Umwelt, Verkehr und Klimaschutz in Berlin Verkehrspolitik / Forschungs- und Entwicklungsprojekte: SAFARI Sicheres automatisiertes und vernetztes Fahren auf dem Digitalen Testfeld Stadtverkehr in Berlin Reinickendorf}
}

View File

@@ -0,0 +1,46 @@
Autonomous driving is becoming increasingly important in the 21st century.
It offers the possibility to complement existing mobility concepts and to enable new groups of people, such as people with limited mobility, to move independently.\cite{maurer2015autonomes}
In recent years, many new and innovative ideas have been established, blurring the boundaries between local public transport and individual transport more and more.
The BVG (Berliner Verkehrsgesellschaft) in Berlin, for example, offers the BerlKönig service, which functions as a mix of on-call bus and cab.
When ordering a ride, passengers specify the start and destination points and the number of people to be transported.
The system then bundles similar trips together so that the called BerlKönig bus can run several tours simultaneously.
That both reduces the fare compared to a cab ride and benefits the environment.\cite{berlkoenig}
Uber modernizes the cab industry and achieves a higher capacity utilization of existing resources.
Thus, Uber's drivers have a significantly increased workload and fewer empty runs.\cite{10.1257/aer.p20161002}
The cited article attributes these improvements partly to outdated and inefficient regulations in the cab industry that do not apply to Uber, but also to better driver coordination with \q{more efficient driver-passenger matching technology based on mobile Internet technology}.
The route planning for the drivers is usually already predefined by the system, which also guides them by suggesting possible connecting tours.
In the case of BerlKönig, the driver no longer has any room for maneuver.
Both concepts presented can be further optimized by autonomous vehicles.
Safety plays a unique role in the development of autonomous cars.
In theory, autonomous vehicles from Waymo, a subsidiary of the Alphabet Group, to which Google also belongs, have been driving since 2012; they covered more than 14 million kilometers in the period up to 2018 and, in autonomous mode, caused only one accident themselves.\cite{Herger_2018}
In practice, many questions are still unanswered.
Sensors and vehicle communication can be disturbed, legal liability issues are still unresolved, and the technical aspects, including implementation, are still partially unresolved.\cite{glancy2015autonomous}
Because of these open problems, Waymo cars only drive on specific routes in simple neighborhoods with little traffic in California's sunny Mountain View.
Concerning safety, the detection of traffic light signals in road traffic, in the following called traffic lights, is an essential part of autonomous driving.
Previous technology used at the Free University of Berlin to detect traffic lights is based on markings in maps in which the car searches for traffic lights using image processing methods.
Furthermore, a project carried out over the past two years in a test area in Berlin, in cooperation with the Senate Department for Environment, Traffic and Climate Protection and the Fraunhofer Institute FOKUS, has extended traffic light systems at intersections with transmitter masts.
These indicate not only the general condition of the crossings (including the layout of the turning lanes and road works) but also the current status of the traffic lights.\cite{safari}
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.85\textwidth}
\includegraphics[width=\linewidth]{../data/safari.png}
\caption{How SAFARI Works} %caption
\label{fig:safari} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
This method of detecting traffic lights works very well on test routes or routes that have been upgraded for this purpose, but not in unknown terrain.
The systems fail at temporary traffic lights at road works and in other countries that are not equipped with such a system.
In many situations, such retrofitting of existing traffic lights is not economically viable, especially since all traffic lights would have to be retrofitted for the use of autonomous cars without a transitional period.
In this thesis, a different approach to traffic light detection shall be applied.
Here, the hardware of the cameras and graphic units already installed in the autonomous car shall be used to detect traffic lights in real-time utilizing a neural network.

View File

@@ -87,3 +87,8 @@
year={2001},
organization={IEEE}
}
@misc{J3016B,
title={SAE J3016B},
url={https://www.sae.org/standards/content/j3016_201806/},
abstractNote={This SAE Recommended Practice describes motor vehicle driving automation systems that perform part or all of the dynamic driving task (DDT) on a sustained basis. It provides a taxonomy with detailed definitions for six levels of driving automation, ranging from no driving automation (level 0) to ful} }

View File

@@ -8,10 +8,10 @@ Such applies to the mechanical, electrical, and software components.\cite{DBLP:c
The average age of passenger cars in Germany in 2010 was 8.1 years.
In 2019 it was already 9.5 years, and the trend is still rising.
For tractors, the average age in 2019 was even almost 30 years.\cite{kraftfahrtBundesamt}
Accordingly, the additional hardware used for artificial intelligence should also be durable and long-lasting.
Accordingly, the additional hardware used for artificial intelligence should also be durable and long\textendash lasting.
Furthermore, tractors, in particular, are already operating autonomously in many countries and are an \q{elementary component of smart farming}.
Smart Farming describes the use of modern communication and information technology in agriculture.\cite{smartFarming}
Cameras have already proven themselves in several industrial sectors and are also used in the automotive industry as rearview cameras and blind-spot assistants for trucks.
Cameras have already proven themselves in several industrial sectors and are also used in the automotive industry as rearview cameras and blind\textendash spot assistants for trucks.
\subsubsection{Portability}\label{subsubsec:portability}
Furthermore, the technology to be carried must have a reasonable degree of portability because laws and practical handling limit the space that vehicles may occupy.
@@ -39,33 +39,34 @@ The safety of vehicles has a high priority.
A safe vehicle prevents expensive repairs and saves the lives of outsiders and passengers.
Ultimately, this will also become a selling point, especially for autonomous cars.
After all, people entrust their lives to this machine.
Software built into it must be error-free, or at least exclude \q{false positives}, so that the vehicle does not accelerate unintentionally through manipulated traffic signs\cite{sicherheitsbedenken} or steer into oncoming traffic due to misleading points on the road\cite{teslaSecurity}.
Software built into it must be error\textendash free, or at least exclude \q{false positives}, so that the vehicle does not accelerate unintentionally through manipulated traffic signs\cite{sicherheitsbedenken} or steer into oncoming traffic due to misleading points on the road\cite{teslaSecurity}.
The calculations of the software that help to control the vehicle must be performed in real-time to ensure safety.
The calculations of the software that help to control the vehicle must be performed in real\textendash time to ensure safety.
Thus, the latencies and the reaction times remain at a low level.
\subsubsection{Production Costs}\label{subsubsec:production-costs}
Despite high safety requirements, the price of autonomous cars must be affordable.
That is one of the reasons why Tesla, for example, does not use costly LIDAR sensors.
According to Tesla boss Elon Musk, LIDAR sensors are too expensive, and all vision tasks are also accomplishable with cameras and low-cost RADAR (Radio Detection and Ranging) sensors.
Cameras provide significantly more information after resolving the three-dimensional perception, and humans also only use visible light for navigation.\cite{elonMusk}
That is one of the reasons why Tesla, for example, does not use costly \ac{LIDAR} sensors.
According to Tesla boss Elon Musk, \ac{LIDAR} sensors are too expensive, and all vision tasks are also accomplishable with cameras and low\textendash cost \ac{RADAR} sensors.
Cameras provide significantly more information after resolving the three\textendash dimensional perception, and humans also only use visible light for navigation.\cite{elonMusk}
\vspace{1cm}
\begin{chapquote}{Elon Musk, 2019\cite{elonMusk}}
\q{Lidar is a fools errand, and anyone relying on lidar is doomed - expensive sensors that are unnecessary. It is like having a whole bunch of expensive appendices. Like, one appendix is bad, well now you have a whole bunch of them, it is ridiculous.}
\q{Lidar is a fool's errand, and anyone relying on lidar is doomed \textemdash expensive sensors that are unnecessary. It is like having a whole bunch of expensive appendices. Like, one appendix is bad, well now you have a whole bunch of them, it is ridiculous.}
\end{chapquote}
\subsection{Requirements for Traffic Light Sensor Detection}\label{subsec:requirements-for-traffic-light-sensor-detection}
To achieve an autonomous vehicle of level 5, or already of autonomy level 3 when used in city traffic, the car must recognize traffic light heads correctly.
The current approach of traffic light detection with radio-controlled boxes at the roadside, as in the SAFARI project, works.
In order to make the progress of automation objective and also to enable legal discussions, the six levels of driving automation are described in the standard J3016 of the SAE (Society of Automotive Engineers).\cite{J3016B}
To achieve an autonomous vehicle of level 5, Full Driving Automation, or already of autonomy level 3, Conditional Driving Automation, when used in city traffic, the car must recognize traffic light heads correctly.
The current approach of traffic light detection with radio\textendash controlled boxes at the roadside, as in the SAFARI project, works.
However, it has the decisive disadvantage that all traffic lights need a retrofit.
Temporary intersections, such as road works, might not be detected or deviate from the given plan due to flexible modifications.
Road traffic, as it exists today, is designed for visual contact.
Humans navigate by sight alone, except for acoustic support.
Besides, this approach should not use LIDAR. These are, as already mentioned (see Production Costs), expensive and do not offer any particular added value since they work like cameras in the visible light spectrum or at the edge of it.\cite{hecht2018lidar}
Besides, this approach should not use LIDAR. These are, as already mentioned \myref{subsubsec:production-costs}, expensive and do not offer any particular added value since they work like cameras in the visible light spectrum or at the edge of it.\cite{hecht2018lidar}
Thus both have the same problems of particle penetration in snow, rain, dust, or hail.
RADAR does not have these problems, but due to its long wavelength of at least 4 mm, it can have a high inaccuracy - but a more extended range.\cite{rohling2001waveform}
RADAR does not have these problems, but due to its long wavelength of at least 4 mm, it can have a high inaccuracy, but also a longer range.\cite{rohling2001waveform}
Another advantage of using cameras is the recognition of the color of traffic lights.
Visual recognition alone can already detect the state.
@@ -76,10 +77,10 @@ Traditional methods require a given algorithm to extract the required informatio
A detailed description by algorithms is difficult due to the many different shapes and positions that light signal heads can take on and the danger of confusion with other objects such as the upper brake lights of trucks.
With neural networks, specifying such an algorithm is not necessary.
Instead, a neural network can extract this information from the given annotated example data.
There are already many large data sets available for the detection of objects in road traffic.\cite{DBLP:journals/corr/CordtsORREBFRS16,DBLP:conf/icra/FreginMKD18}
There are already many large datasets available for the detection of objects in road traffic.\cite{DBLP:journals/corr/CordtsORREBFRS16,DBLP:conf/icra/FreginMKD18}
Therefore, the use of such a network at this point makes sense.
The YOLO neural network allows the segmentation of images into bounding boxes and classification of those with a precision of 57.9mAP in 51 milliseconds (approximately 20 frames per second (FPS)) on a Titan X\@.\cite{DBLP:journals/corr/abs-1804-02767}
The YOLO neural network allows the segmentation of images into bounding boxes and classification of those with a precision of 57.9 mAP in 51 milliseconds (approximately 20 \ac{FPS}) on a Titan X\@.\cite{DBLP:journals/corr/abs-1804-02767}
In comparison, RetinaNet, a neural network with comparable functionalities, achieves the same precision in about four times the time.
That makes YOLO more suitable for use in real-time environments, such as in autonomous vehicles.

View File

@@ -1,8 +1,8 @@
\section{The Application of YOLOv3}\label{sec:the-application-of-yolov3}
YOLO is a fast convolutional network to detect objects in images and classify them at the same time.
\ac{YOLO} is a fast convolutional network to detect objects in images and classify them at the same time.
Joseph Redmon and Ali Farhadi developed the network, and their latest version, YOLOv3, was released in 2018.\cite{DBLP:journals/corr/abs-1804-02767}
In spring 2020, Alexey Bochkovskiy released the fourth version, YOLOv4, with further improvements.\cite{DBLP:journals/corr/abs-2004-10934}
This iteration increases the mAP (Mean Average Precision) while maintaining the speed of the mesh.
This iteration increases the \ac{mAP} while maintaining the speed of the mesh.
Since YOLOv4 does not bring any additional notable enhancements for detecting small objects, YOLOv3 is applied in this work instead.
Further optimizations will be made after a more detailed analysis to specialize YOLOv3 for small objects.
@@ -16,7 +16,7 @@ In contrast, variables can be changed at any time, even after the training of th
The basic functionality of YOLO does not change in the course of the versions.
The input for the network is an image with three dimensions.
In YOLOv1, the size of the image is predetermined due to the \q{Fully Connected Layer} at the end and equals 448 x 448 x 3 pixels.
The output is also a three-dimensional vector that describes the so-called output boxes - meaning in which part of the image with which size a particular object is located.
The output is also a three\textendash dimensional vector that describes the so\textendash called output boxes \textendash meaning in which part of the image with which size a particular object is located.
\begin{figure}[!htbp]
\vspace{0cm}
@@ -69,14 +69,14 @@ The attributes in the cell indicate whether an object was found in this cell.
A cell is responsible for an object if the center of the object is located in the respective cell.
Thus in the illustration, cell (1, 4) is responsible for recognizing the dog, and cell (2, 3) for recognizing the bicycle, where one is the first column from the left and four the fourth row from the top.
In the 30 attributes of a cell, YOLOv1 encodes two bounding boxes (with five attributes each) and a one-hot encoding of 20 classes.
In the 30 attributes of a cell, YOLOv1 encodes two bounding boxes (with five attributes each) and a one\textendash hot encoding of 20 classes.
In the five attributes of the bounding box, the position of the center (x, y) is relative to the center of the respective cell, and the width and height (w, h) relative to the total size of the image.
The fifth value specifies the \q{confidence}: how certain the network is that this bounding box actually contains an object.
The two middle graphs in the figure indicate which class has the highest activation in the respective cell (bottom) and which bounding boxes have been detected (top).
The more confident the network is about a bounding box, the thicker the surrounding frame in the figure.
It can be seen that the dog, bicycle, and car are framed more intensively, and the respective class in the cell of the centers is correctly recognized.
If the confidence for one of the two bounding boxes and a class from the One-Hot-Coding is greater than the threshold value 0.5, then an object with the recognized class exists in the image with the dimensions of the bounding box.
If the confidence for one of the two bounding boxes and a class from the One\textendash Hot\textendash Coding is greater than the threshold value 0.5, then an object with the recognized class exists in the image with the dimensions of the bounding box.
When detecting the traffic light class only, the tensor would have a size of 7x7x11 because just two bounding boxes and one class would be present.
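The general rule behind these numbers: an $S \times S$ grid with $B$ bounding boxes per cell and $C$ classes yields an output tensor of size
\begin{equation*}
S \times S \times (B \cdot 5 + C) = 7 \times 7 \times (2 \cdot 5 + 20) = 7 \times 7 \times 30,
\end{equation*}
which reduces to $7 \times 7 \times (2 \cdot 5 + 1) = 7 \times 7 \times 11$ for the traffic light class alone.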
@@ -86,7 +86,7 @@ Besides, the recall (percentage of actual positive results from all as correctly
Especially the recall is essential for the recognition of traffic lights because otherwise, many traffic lights are not recognized.
At the expense of precision (proportion of the actual positive results from all identified objects), the recall can be increased.
However, it would lead to worse results overall because the standard threshold value of 0.5 offers the best cost-benefit ratio (see figure).
However, it would lead to worse results overall because the standard threshold value of 0.5 offers the best cost\textendash benefit ratio (see figure).
The comparatively low localization accuracy results mainly from the inaccuracy of the anchor boxes at the end.
The network predicts several anchor boxes and one class for a single cell.
@@ -113,8 +113,8 @@ It occurs in the following text under the name \q{DarknetConv}\myref{par:darknet
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{1.0\textwidth}
\includegraphics[width=\linewidth]{../data/Darknet.png}
\minipage{1\textwidth}
\includegraphics[width=\linewidth]{../data/Darknet.pdf}
\caption{Network Architecture of Darknet-53} %Bildunterschrift
\label{fig:darknet} %fig:ID
\endminipage\hfill
@@ -130,19 +130,20 @@ That is better designed to handle the full utilization of the graphics card and
While the Darknet-19 used in YOLOv2 has a lower accuracy (74.1\%), Darknet-53 from YOLOv3 has a similar accuracy (77.2\%) as ResNet-152 (77.6\%).
Note that ResNet-152 performs 157\% more FLOPs (Floating Point Operations) than Darknet-53.
Because ResNet has more and less performing layers, Darknet achieves more operations per second, and therefore more frames per second (FPS).
Because ResNet has more, but less performant, layers, Darknet achieves more operations per second, and therefore more \ac{FPS}.
\def\arraystretch{1.5}
\begin{tabularx}{\textwidth}{|l|r|r|r|r|R{1cm}|} \hline
\textbf{Backbone} & \textbf{Top-1 in \%} & \textbf{Top-5 in \%} & \textbf{FLOP in $10^9$} & \textbf{GFLOP/s} & \textbf{FPS} \\ \hline
\textbf{Darknet-19} & $74.1$ & $91.8$ & $7.29$ & $1246$ & $171$ \\ \hline
\textbf{ResNet-101} & $77.6$ & $93.7$ & $19.7$ & $1039$ & $53$ \\ \hline
\textbf{ResNet-152} & $77.6$ & $93.8$ & $29.7$ & $1090$ & $37$\\ \hline
\textbf{Darknet-53} & $77.2$ & $93.8$ & $18.7$ & $1457$ & $78$ \\ \hline
\caption{Performance Comparison of Darknet and ResNet\cite{DBLP:journals/corr/abs-1804-02767}}
\label{tab:comparison_resnet_darknet}
\end{tabularx}
\def\arraystretch{1}
\begin{table}[!htbp]
\begin{tabularx}{\textwidth}{@{}llllll@{}}
\toprule
\textbf{Backbone} & \textbf{Top-1 in \%} & \textbf{Top-5 in \%} & \textbf{FLOP in $10^9$} & \textbf{GFLOP/s} & \textbf{\acs{FPS}} \\ \midrule
Darknet-19 & 74.1 & 91.8 & 7.29 & 1246 & 171 \\
ResNet-101 & 77.6 & 93.7 & 19.7 & 1039 & 53 \\
ResNet-152 & 77.6 & 93.8 & 29.4 & 1090 & 37 \\
Darknet-53 & 77.2 & 93.8 & 18.7 & 1457 & 78 \\ \bottomrule
\end{tabularx}
\caption{Performance Comparison of Darknet and ResNet\cite{DBLP:journals/corr/abs-1804-02767}}
\label{tab:comparison_resnet_darknet}
\end{table}
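The \acs{FPS} column follows directly from the two throughput columns; for Darknet-53, for example:
\begin{equation*}
\mathrm{FPS} \approx \frac{1457\,\mathrm{GFLOP/s}}{18.7\,\mathrm{GFLOP}} \approx 78\,\mathrm{s}^{-1}
\end{equation*}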
\paragraph{Darknet Convolutional Layer}\label{par:darknet-convolutional-layer}
The base of Darknet is the already described \q{DarknetConv}, a convolutional layer with subsequent batch normalization and Leaky ReLU as the activation function.
@@ -227,7 +228,7 @@ Currently, only the traffic light class gets trained, and the length of the four
\end{figure}
In YOLOv2, the offset of an anchor box \q{bx} and \q{by} is relative to the upper left corner of the image.
In contrast, the anchor box offset in YOLOv3 refers to the upper left corner of the containing cell.
In contrast, the anchor box's offset in YOLOv3 refers to the upper left corner of the containing cell.
Before training, the sizes of the anchor boxes are defined as hyperparameters.
Nine anchor boxes are already specified in the reference implementation of YOLOv3 and are used in this work.\cite{DBLP:conf/cvpr/RedmonF17}
@@ -256,16 +257,16 @@ The activation on the layer with more detail is higher and overshadows the activ
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{1.0\textwidth}
\includegraphics[width=\linewidth]{../data/YOLOv3.png}
\minipage{0.75\textwidth}
\includegraphics[width=\linewidth]{../data/YOLOv3.pdf}
\caption{Network Architecture of YOLOv3} %caption
\label{fig:darknet} %fig:ID
\label{fig:yolov3} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
\subsubsection{Flexible Image Size}
\subsubsection{Flexible Image Size}\label{subsubsec:flexible-image-size}
By using only \q{Convolutional Layers} and no \q{Fully Connected Layer} starting with YOLOv2, the output size of the network is variable: it depends on the original image size and corresponds to the input dimensions scaled by a factor of 1/32.
This makes Darknet incredibly flexible.
It can be applied to any image size, with the runtime increasing linearly with the number of pixels during training, testing, and execution.
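A minimal sketch illustrating this property (a toy stack of five stride-2 convolutions with a total stride of $2^5 = 32$, not the actual Darknet architecture):
\begin{verbatim}
import tensorflow as tf

inp = tf.keras.Input(shape=(None, None, 3))  # height and width stay open
x = inp
for filters in (32, 64, 128, 256, 512):
    x = tf.keras.layers.Conv2D(filters, 3, strides=2,
                               padding="same", activation="relu")(x)
model = tf.keras.Model(inp, x)

print(model(tf.zeros((1, 416, 416, 3))).shape)  # (1, 13, 13, 512)
print(model(tf.zeros((1, 608, 608, 3))).shape)  # (1, 19, 19, 512)
\end{verbatim}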

View File

@@ -1,7 +1,7 @@
\section{Used Datasets}\label{sec:used-datasets}
A central point in the development of neural networks is the use of large and precise data sets.
The creation of datasets for images is very time- and resource-consuming because these datasets have to be labeled individually.
Since there is intensive research in autonomous driving, many large and information-rich datasets are available.
A central point in the development of neural networks is the use of large and precise datasets.
The creation of datasets for images is very time- and resource-consuming because these datasets have to be labeled individually.
Since there is intensive research in autonomous driving, many large and information-rich datasets are available.
\subsection{Cityscapes Dataset}\label{subsec:cityscapes-dataset}
A collaboration between Daimler AG, the Max Planck Institute, and Darmstadt Technical University created the Cityscapes dataset in 2016.\cite{DBLP:journals/corr/CordtsORREBFRS16}
@@ -10,9 +10,9 @@ The Cityscapes dataset covers over 50 cities in Germany and neighboring countrie
The available images have a size of 1024 x 2048 pixels.
All images with its 30 classes of the dataset are labeled by hand.\cite{cityscapesYoutube}
The used version of the data set contains 41 cities with 22973 images, which are divided accordingly into training and test images.
The data set is designed for the segmentation of various objects and not specifically for the detection of traffic lights.
Therefore the data set contains only 25077 traffic lights in 6734 images.
The used version of the dataset contains 41 cities with 22973 images, which are divided accordingly into training and test images.
The dataset is designed for the segmentation of various objects and not specifically for the detection of traffic lights.
Therefore the dataset contains only 25077 traffic lights in 6734 images.
\subsection{DriveU Traffic Light Dataset (DTLD)}\label{subsec:driveu-traffic-light-dataset-(dtld)}
Traffic lights can appear in many different appearances.
@@ -23,12 +23,12 @@ An essential part is also the currently displayed color of the signal.
If individual trips are the basis of the datasets, the latter can have a bias.
Predominantly green, but also red traffic lights can be recognized preferentially.
A traffic light in the yellow phase may not be recognized because it appears comparingly less.
Balanced data sets are especially important at this point.
Balanced datasets are especially important at this point.
If a traffic light is still more than 100 meters away, it is only 1.7px wide, under the condition of a width of 20 centimeters (as specified in the European standard EN12368)\cite{anforderungenLichtsignalanlagen}, a camera viewing angle of 100 degrees on an image with a resolution of two megapixels, and thus hardly recognizable by a neural network.
To make matters worse, the image is often scaled down for faster training and detection.
The actual traffic light is now no longer visible on the image - the network detects the light only by its context.
All the more important is a large and variable data set.
The actual traffic light is now no longer visible on the image \textemdash the network detects the light only by its context.
All the more important is a large and variable dataset.
\begin{figure}[!htbp]
\vspace{0cm}
@@ -36,17 +36,17 @@ All the more important is a large and variable data set.
\endminipage\hfill
\minipage{0.65\textwidth}
\includegraphics[width=\linewidth]{../data/traffic_light_on_plane.png}
\caption{Traffic Light Size Within Image}
\label{fig:startseite} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\caption[Traffic Light Size Within Image]{\small If the distance d between the car and the traffic light equals 100m and the camera viewing angle $\Theta$ is 100 degrees, the plane p in that distance has a width of 238 meters.
The traffic light is 20cm wide, as stated in EN12368\cite{anforderungenLichtsignalanlagen}.
That results in a share of $0.08 \%$.
If the captured image is 2048px wide, the width of the traffic light corresponds to $1.7$ pixels.}
\label{fig:traffic_light_size} %fig:ID
\end{figure}
If the distance d between the car and the traffic light equals 100m and the camera viewing angle $\Theta$ is 100 degrees, the plane p in that distance has a width of 238 meters.
The traffic light is 20cm wide, as states in EN12368.
That results in a share of $0.08 \%$.
If the captured image is 2048px wide, the width of the traffic light corresponds to $1.7$.
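For reference, the arithmetic behind these numbers, assuming a simple pinhole camera model:
\begin{align*}
w_p &= 2\,d\,\tan(\Theta/2) = 2 \cdot 100\,\mathrm{m} \cdot \tan(50^\circ) \approx 238\,\mathrm{m} \\
\frac{0.2\,\mathrm{m}}{238\,\mathrm{m}} &\approx 0.084\,\% \qquad\Rightarrow\qquad 2048\,\mathrm{px} \cdot 0.00084 \approx 1.7\,\mathrm{px}
\end{align*}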
Cooperation between the University of Ulm and Daimler AG developed the DriveU Traffic Light Dataset.
On a total of 43875 pictures, it contains 232039 traffic lights taken in eleven different German cities.

View File

@@ -0,0 +1,70 @@
@book{Redmon_2020,
title = {pjreddie/darknet},
author = {Redmon, Joseph},
year = 2020,
month = oct,
url = {https://github.com/pjreddie/darknet},
abstractnote = {Convolutional Neural Networks. Contribute to pjreddie/darknet development by creating an account on GitHub.}
}
@book{Caesar2011_2020,
title = {Caesar2011/yolov3-tf2},
author = {Caesar2011},
year = 2020,
month = oct,
url = {https://github.com/Caesar2011/yolov3-tf2},
abstractnote = {YoloV3 Implemented in Tensorflow 2.0. Contribute to Caesar2011/yolov3-tf2 development by creating an account on GitHub.}
}
@misc{why_docker_2020,
title = {Why Docker? 2020},
year = 2020,
month = oct,
url = {https://www.docker.com/why-docker},
abstractnote = {Learn why Docker is the leading container platform — Freedom of app choice, agile operations and integrated container security for legacy and cloud-native applications.}
}
@article{7036275,
title = {Containers and Cloud: From LXC to Docker to Kubernetes},
author = {D. {Bernstein}},
year = 2014,
journal = {IEEE Cloud Computing},
volume = 1,
number = 3,
pages = {81--84}
}
@misc{yoloWebsite,
title = {YOLO: Real-Time Object Detection},
url = {https://pjreddie.com/darknet/yolo/},
abstractnote = {You only look once (YOLO) is a state-of-the-art, real-time object detection system.}
}
@inproceedings{Mao_2019_ICCV,
title = {A Delay Metric for Video Object Detection: What Average Precision Fails to Tell},
author = {Mao, Huizi and Yang, Xiaodong and Dally, William J.},
year = 2019,
month = oct,
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}
}
@article{doi:10.1021/ci0342472,
title = {The Problem of Overfitting},
author = {Hawkins, Douglas M.},
year = 2004,
journal = {Journal of Chemical Information and Computer Sciences},
volume = 44,
number = 1,
pages = {1--12},
doi = {10.1021/ci0342472},
url = {https://doi.org/10.1021/ci0342472},
note = {PMID: 14741005},
eprint = {https://doi.org/10.1021/ci0342472}
}
@misc{docker_replacing_vm,
title="Are Containers Replacing Virtual Machines?",
url={https://www.docker.com/blog/containers-replacing-virtual-machines/},
abstractNote={Learn from Docker experts to simplify and advance your app development and management with Docker. Stay up to date on Docker events and new version announcements!},
journal={Docker Blog},
year={2018},
month={Aug}
}

View File

@@ -0,0 +1,153 @@
Joseph Redmon, the developer of \ac{YOLO} and a researcher at the University of Washington, built it on a self-engineered framework named Darknet.
Darknet is comparable to Tensorflow, with a reduced feature set, but optimized for running \ac{YOLO}. He trained the reference implementation of YOLOv3, programmed on the basis of Darknet, on 80 different classes.\cite{Redmon_2020}
The class list already contains many classes related to autonomous driving and traffic, for instance, car, person, or traffic sign.
It also includes the class traffic light.
Running the network on the Cityscapes dataset and inspecting the results visually reveals a low level of precision.
That is due to the recognition of several traffic signs as traffic lights.
Both classes are very similar to each other.
Traffic lights and traffic signs are mounted on masts and are tiny objects that are challenging to spot with \ac{YOLO}. The developers of the reference implementation fed images of different input sizes to the network during training.
As previously mentioned, the network is translation invariant due to the lack of fully connected layers.
Training on different input sizes makes the network also robust against scaling.
\section{Used Software and Hardware}\label{sec:used-software-and-hardware}
This foundation provides reasonably good results, but further improvements remain necessary.
A version of YOLOv3 ported to Tensorflow is used to develop the network further.
Tensorflow offers a more straightforward interface for quick modifications.
The fork with all described changes is available on Github.\cite{Caesar2011_2020}
Docker is open-source software for containerizing applications.
An isolated user space, filesystem, and network allow the software to run independently without interference.
Thanks to a predetermined build script, the installation process is fast, and the container is portable and easy to share with others.\cite{why_docker_2020}
Systems based on virtual machines utilize a hypervisor to abstract a so-called guest operating system on which each application runs.
In contrast to a virtual machine, Docker containers make use of the operating system of the host machine.
That increases performance and reduces the overhead size used by each application.\cite{7036275}
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{1\textwidth}
\includegraphics[width=\linewidth]{../data/docker-architecture.png}
\caption{Docker Containers Versus Multiple Virtual Machines\cite{docker_replacing_vm}} %Bildunterschrift
\label{fig:docker-architecture} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
A Docker container, based on the official Tensorflow image, bundles Tensorflow, OpenCV, and the forked YOLOv3 repository together to enable a fast workflow.
The test results refer to the use of a GeForce RTX 2080 Ti with 11GB of GDDR6 memory.
The graphics card achieves roughly 13.4 TFLOPs.
\section{YOLO in Tensorflow}\label{subsec:yolo-in-tensorflow}
The network ported to Tensorflow includes an implementation of YOLOv3 and YOLOv3-Tiny.
The command-line interface (CLI) can also convert network weights available in the Darknet file format into a Tensorflow checkpoint.
For this purpose, the weight file and the number of classes on which those weights were trained are necessary.
For testing, only the Cityscapes dataset is used for the time being.
As expected, the mAP is identical to that of the Darknet implementation.
However, it can be noted that Tensorflow is slower than the Darknet implementation.
Darknet is optimized to run YOLO. For better comparability of the results, the speed of the Tensorflow implementation is taken as reference.
When training with the Tensorflow implementation, it is not possible to vary the image size during training.
Apart from that, the network offers the same range of functions.
One can expect that optimizing the network for the particular needs of traffic light detection can further increase the mean average precision.
The pre-trained network is trained on 80 classes and, therefore, not specialized.
Although this provides a broad data basis, because the reference weights are trained on the COCO dataset\cite{yoloWebsite}, and reduces susceptibility to failure, there is a risk of confusion with other classes.
As already mentioned, YOLOv3 is context-sensitive due to the convolutional layers.
Both traffic lights and traffic signs appear on traffic masts at about three meters height.
At this spot, the activation for both classes is relatively high.
The result can be a traffic sign incorrectly recognized as a traffic light (which leads to less precision) or a traffic light incorrectly recognized as a traffic sign (which leads to less recall).
Confusion with other classes is also possible, but the probability decreases with increasing differences between the classes.
If the training process only includes one class, the anchor boxes may also specialize.
Traffic lights are not horizontal in Germany.
If the network is trained for only one class, the anchor box for horizontal objects is less active or not active altogether.
The expectation is that traffic lights are less often recognized as squares and that the location error, which is comparatively high in YOLOv3\cite{DBLP:conf/cvpr/RedmonF17}, decreases.
\subsection{Evaluation of the Reference Weights}\label{subsec:evaluation-of-the-reference-weights}
The mean average precision (mAP) is a performance indicator for determining the quality of a classification network.
The calculation is relatively complex.
The metric AP indicates the average precision of a class across all images.
The mAP is the mean of the AP over all trained or relevant classes.
In the application case of traffic light recognition, only the detection of the class traffic light is relevant.
The mAP is identical to the AP for a single class.
The mAP has, over time, also established itself as an indicator for object recognition networks such as YOLO. There is no generally valid definition for these networks, and different datasets like VOC or COCO implement mAP differently.
In principle, a bounding box is considered correct if it has an IoU with a target box of more than 0.5; other IoU thresholds are also possible.
This thesis uses the definition of the mAP of COCO\@.
\begin{figure}
\begin{equation}
\begin{array}{rcl}
f^{IoU=u}(x_{c,i}) &=& \left\{\begin{array}{ll}
1 &\textup{, if bounding box of class $c$ with index $i$ has an IoU $\geq u$} \\
0 &\textup{, if bounding box of class $c$ with index $i$ has an IoU $< u$}
\end{array}\right. \\
AP^{IoU=u}_{c} &=& \frac{1}{len(c)}\sum_{i=0}^{len(c)-1} f^{IoU=u}(x_{c,i}) \\
mAP^{IoU=u} &=& \frac{1}{len(C)}\sum_{c=0}^{len(C)-1} AP^{IoU=u}_{c}
\end{array}
\end{equation}
\caption{Definition of mAP}
\label{equ:mAP}
\end{figure}
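As a plain-Python sketch of this simplified definition (boxes in corner format; the full COCO mAP additionally ranks detections by confidence, which is omitted here):
\begin{verbatim}
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ap_at_iou(predicted, targets, u=0.5):
    # fraction of predicted boxes of one class that match any
    # target box with IoU >= u (the f and AP terms above)
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if any(iou(p, t) >= u for t in targets))
    return hits / len(predicted)
\end{verbatim}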
However, the metric mAP has some weak points that the indicator does not express.
Videos have an additional time component.
The metric does not reflect the detection robustness of a bounding box.
The requirement does not yet include tracking and matching a bounding box over several frames, but whether, for instance, it is only detected in every second frame and \q{flickers}.\cite{Mao_2019_ICCV}
Furthermore, the mAP can only provide accuracy compared to the annotated training data.
If these are inaccurate or have a bias, the metric will not reflect it.
The metric is suitable for comparisons that use the same dataset and visualization.
\begin{table}[!htbp]
\footnotesize
\begin{tabularx}{\textwidth}{@{}lrrrr@{}}
\toprule
& \multicolumn{2}{l}{\textbf{Pre-Trained YOLOv3}} & \multicolumn{2}{l}{\textbf{Pre-Trained YOLOv3-Tiny}} \\ \toprule
& \multicolumn{1}{l}{\textbf{80 classes}} & \multicolumn{1}{l}{\textbf{Traffic light only}} & \multicolumn{1}{l}{\textbf{80 classes}} & \multicolumn{1}{l}{\textbf{Traffic light only}} \\ \midrule
\textbf{mAP with IoU of 0.25} & 0.91 & 0.75 & 0.60 & \\
\textbf{mAP with IoU of 0.50} & 0.77 & 0.35 & 0.07 & \\
\textbf{mAP with IoU of 0.75} & 0.37 & 0.15 & 0.00 & \\
\textbf{time (minimum)} & 93 ms & 98 ms & & 25 ms \\
\textbf{time (25th percentile)} & 97 ms & 101 ms & & 27 ms \\
\textbf{time (50th percentile)} & 98 ms & 102 ms & & 28 ms \\
\textbf{time (75th percentile)} & 100 ms & 104 ms & & 30 ms \\
\textbf{time (maximum)} & 135 ms & 135 ms & & 43 ms \\ \bottomrule
\end{tabularx}
\caption{Comparison of YOLOv3 variants with Pre-Trained Weights}
\label{tab:comparison-default}
\end{table}
The reference weights achieve an AP\textsuperscript{IoU=0.5} of 0.35 in the class traffic lights, significantly below the average across all classes of 0.70.
A possible reason for this can again be the size.
An absolute translation of a few pixels in a small bounding box results in a much faster declining IoU than in large objects.
The tiny variant is about 3.5 times faster than the large network.
However, the average precision is only approximately two thirds as high.
The resulting annotations also clearly show the difference: Many traffic lights are not recognized or only inaccurately.
When viewing the annotated images, an additional problem becomes apparent.
In the COCO dataset, the traffic light class includes all traffic lights, including those that are not directed at the vehicle from the front or are only indirectly relevant, such as pedestrian lights.
In addition, small yellow lights, such as those mounted on construction site pillars, also seem to belong to the class.
That is technically correct but not desirable.
This kind of labeling is an implication of human bias in the annotations.
The Cityscapes dataset mainly contains facing traffic lights.
DriveU distinguishes between the poses of traffic lights, including \q{frontal}.
\subsection{Investigation of Different Partial Aspects}\label{subsec:investigation-of-different-partial-aspects}
With this background, the following tests are based on a completely retrained network with the Cityscapes dataset.
The default configuration includes early stopping.
If the accuracy on the validation dataset does not improve during the last three epochs, the training has reached a plateau and stops.
Continuing the training would further increase the accuracy on the training dataset, but the accuracy on the test dataset would decrease.
Overfitting sets in.\cite{doi:10.1021/ci0342472}
The size is set to 416x416 pixels to allow comparison with the imported reference weights.
Optimization of the size parameter also takes place in the course of development.
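Expressed as a sketch with a standard Keras callback, where \texttt{model}, \texttt{train\_ds}, and \texttt{val\_ds} are placeholders for the project's model and datasets:
\begin{lstlisting}[language=Python]
from tensorflow.keras.callbacks import EarlyStopping

# patience=3 mirrors the three-epoch plateau rule described above;
# a validation metric such as accuracy or mAP could be monitored instead.
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
model.fit(train_ds, validation_data=val_ds,
          epochs=100, callbacks=[early_stop])
\end{lstlisting}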

View File

@@ -0,0 +1,6 @@
@misc{coco,
  title = {{COCO} -- Common Objects in Context},
  url = {https://cocodataset.org/#explore},
  year = {2020},
  month = {Oct}
}

View File

@@ -0,0 +1,248 @@
\section{YOLOv3 Against YOLOv3\textendash Tiny}\label{sec:yolo-against-yolo-tiny}
The goal is to determine how both networks fundamentally perform on traffic lights.
For this purpose, both networks are trained on the Cityscapes dataset without transfer learning.
The implementation is completed by a routine that converts the Cityscapes dataset, which comes as single images, into a file format that TensorFlow can read faster, so\textendash called tfrecords.
Tfrecords are a list of entities consisting of data and labels.
The input data is unified, and each image is stored in one entity together with the corresponding labels from a separate folder.
The label for an image is, in this case, a list of bounding boxes.
A bounding box is defined by its upper left and lower right coordinates.
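Such a conversion routine can be sketched as follows; the feature keys and helper names are illustrative assumptions and not the exact schema of the implementation:
\begin{lstlisting}[language=Python]
import tensorflow as tf

def make_example(image_path, boxes):
    """Pack one image and its bounding boxes into a tf.train.Example.
    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixels."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    flat = [float(c) for box in boxes for c in box]  # flatten the box list
    return tf.train.Example(features=tf.train.Features(feature={
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/boxes": tf.train.Feature(
            float_list=tf.train.FloatList(value=flat)),
    }))

def write_tfrecord(samples, out_path):
    """samples: iterable of (image_path, boxes) pairs."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for image_path, boxes in samples:
            writer.write(make_example(image_path, boxes).SerializeToString())
\end{lstlisting}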
The metric \ac{mAP}, introduced earlier, measures precision and is, therefore, one of the benchmarks for overall performance.\myref{equ:mAP}
The results show a significant difference in speed.
YOLOv3\textendash Tiny, with about one\textendash tenth of the weights, achieves about twice the \ac{FPS}\footnote{The system used for the test has an \ac{HDD} built in. Larger differences can be expected when using an SSD.}.
Due to the smaller capacity of the network, the \ac{mAP} is lower.
That manifests itself in dropouts in image recognition and in instability of the bounding boxes in videos: the center moves around the actual center, and the size alternates between too large and too small in successive frames.
YOLOv3 also has localization problems.
Although the misplaced frames are reduced with the extensive network, the detected size is still unstable and deviates strongly from the target.
It is to be expected that YOLOv3\textendash Tiny achieves worse results than YOLOv3.
Whether the higher frame rate and the lower computational effort are sufficient compensation will have to be shown by further tests.
\section{Varying the Image Input Size}\label{sec:varying-the-image-input-size}
Traffic lights have the particular characteristic of being extremely small.
As already explained in detail \myref{subsec:driveu-traffic-light-dataset-(dtld)}, a traffic light at a distance of 100 meters with an image width of 2048 pixels is only about 1.7 pixels wide.
Especially YOLOv3\textendash Tiny has problems with this because it lacks the feature pyramid layer for small objects.
Since the architecture of YOLOv3 consists only of convolutional layers, the time needed for recognition increases linearly with the image size.
When downsampling the image to a smaller resolution, the width in particular becomes a problem, because there is a risk that the traffic light is no longer discernible in the image.
The images of both datasets, Cityscapes and DriveU, have a resolution of 1024x2048 and are twice as wide as they are high\myref{fig:reshape_original}.
Downsampling to a square image would therefore affect the width more than the height\myref{fig:reshape_square}.
Scaling to a rectangular image helps to counteract the problem\myref{fig:reshape_rect}.
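Such a rectangular resize is straightforward in TensorFlow; a minimal sketch, assuming PNG input and the 416x832 target size evaluated later:
\begin{lstlisting}[language=Python]
import tensorflow as tf

def load_and_resize(path, target_h=416, target_w=832):
    """Load a 1024x2048 frame and resize it to a rectangular network
    input, preserving the 1:2 aspect ratio of Cityscapes and DriveU."""
    img = tf.io.decode_png(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (target_h, target_w))
    return img / 255.0  # scale to [0, 1]
\end{lstlisting}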
% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=\linewidth]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Original image} %caption
\label{fig:reshape_original} %fig:ID
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=0.5\linewidth,height=2.45cm]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Square reshape} %caption
\label{fig:reshape_square} %fig:ID
\endminipage\hfill
\minipage{0.33\textwidth}
\centerline{\includegraphics[width=0.66\linewidth]{../data/bremen_000032_000019_leftImg8bit.png}}
\caption{Same area, more width} %caption
\label{fig:reshape_rect} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
The previous Python implementation of YOLO can theoretically handle different image sizes, but bugs occur in the recognition of bounding boxes during training.
These are fixed, and additionally, an option is implemented to adjust the width independently of the height in order to train YOLOv3 on rectangular images.
In principle, a trained YOLOv3 network can be applied to different image sizes~\myref{subsubsec:flexible-image-size}, but detection quality is poor if traffic lights then appear with an unknown aspect ratio or size.
At this point, one can take advantage of the fact that traffic lights are usually located in the upper half of the image.
If the network tests only the upper half of the image, the speed almost doubles without a loss of precision.
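A sketch of this upper\textendash half trick, assuming a detector that takes a batched image tensor and returns boxes in relative (y-min, x-min, y-max, x-max) coordinates:
\begin{lstlisting}[language=Python]
import tensorflow as tf

def detect_upper_half(detector, image):
    """Run detection on the upper image half only; `detector` is assumed
    to take a batched float image and return boxes in relative
    (y_min, x_min, y_max, x_max) coordinates."""
    shape = tf.shape(image)
    half = tf.image.crop_to_bounding_box(image, 0, 0, shape[0] // 2, shape[1])
    boxes = detector(tf.expand_dims(half, 0))
    # y-coordinates are relative to the crop; rescale to the full image
    return boxes * tf.constant([0.5, 1.0, 0.5, 1.0])
\end{lstlisting}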
\section{Make Use of the Pre-Trained Darknet-53}\label{sec:make-use-of-the-pre-trained-darknet-53}
The pre-trained weights of YOLOv3 and the preceding Darknet are trained on the COCO dataset,\cite{yoloWebsite} which contains over 120,000 images.\cite{coco}
It is recommended to use the already trained Darknet and to train only the YOLO part with one's own data.
For this purpose, the trained weights for the Darknet are loaded and frozen so that they cannot be trained further.
The YOLO part is still initialized randomly and trained on the project's own datasets.
Despite a slight decrease in mAP, the detections are more stable and reliable.
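As a sketch, the freezing could look as follows, assuming a Keras build of YOLOv3 whose Darknet layers share a common name prefix; \texttt{build\_yolov3}, the weight file, and \texttt{yolo\_loss} are placeholders:
\begin{lstlisting}[language=Python]
import tensorflow as tf

model = build_yolov3(num_classes=1)                         # placeholder builder
model.load_weights("yolov3_coco_weights.h5", by_name=True)  # converted weights

for layer in model.layers:
    if layer.name.startswith("darknet"):  # assumed backbone name prefix
        layer.trainable = False           # freeze the pre-trained part

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=yolo_loss)             # loss function supplied elsewhere
\end{lstlisting}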
\section{Recurrent Neural Network}
Videos, such as those created while traveling in an autonomous car, contain not only a spatial component but also a temporal one.
YOLO, however, only considers the spatial component, and thus, in a video, all images are processed individually.
YOLOv3 needs to be extended by a temporal component in order to make use of the knowledge from the previous steps.
Neither dataset includes annotated videos, so the implementation must be based on the existing annotated photos.
There are two approaches to the implementation.
\subsection{Heat Map}\label{subsec:heat-map}
The concept of a heat map is based on the assumption that near the position where a traffic light was detected in the previous frame, there must also be a traffic light in the current frame, only shifted by a few pixels.
Instead of an image with the dimensions 3x416x832 as input tensor, a heat map is added next to the color channels, resulting in an input tensor of 4x416x832.
This additional layer contains the results of the previous steps.
The heat map is initially 0 at every position and not activated anywhere.
After the first frame, each bounding box and its area get added to the heat map.
These areas are now slightly warmer.
That repeats every frame, whereby the old values cool down a bit before the addition, so that areas that have not been activated for a long time return to 0.
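One update step of this scheme can be sketched as follows; the cooling and warming rates are placeholder values for the hyper-parameters discussed below:
\begin{lstlisting}[language=Python]
import numpy as np

COOL_RATE = 0.8  # placeholder: how fast old activations fade
WARM_RATE = 0.5  # placeholder: how much a new detection adds

def update_heat_map(heat, boxes):
    """One iteration of the heat map proposal.
    heat:  2-D array with the spatial size of the network input
    boxes: detected boxes as (x_min, y_min, x_max, y_max) pixel tuples"""
    heat = heat * COOL_RATE                  # cool down old values
    for x0, y0, x1, y1 in boxes:
        heat[y0:y1, x0:x1] += WARM_RATE      # warm up detected areas
    return np.clip(heat, 0.0, 1.0)           # keep the extra channel bounded
\end{lstlisting}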
% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{1.0\textwidth}
\includegraphics[width=\linewidth]{../data/heatmap.jpg}
\caption[Heatmap Proposal]{\textbullet~First Row: First iteration; Initial heat map not activated; Add bounding boxes to heat map \\ \textbullet~Second Row: Further iteration; Pass last heat map; Cool down; Add bounding boxes to heat map } %caption
\label{fig:heatmap} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
Annotated images can be used to achieve teacher learning.
To generate them, the labels are applied to the heat map with some translation towards the image center and some blur.
That emulates the result of the previous step.
This approach has the advantage that it is comparatively easy to implement and extends several frames back into the past.
Currently, the network is only used to detect traffic lights.
A great strength of YOLO is to recognize several classes at the same time without additional resources.
To get a network with multiple classes working, each new class needs its own heat map.
The input tensor then reaches a considerable depth, and the three color layers lose their influence.
Darknet has to be trained again.
The input tensor of the pre-trained weights has a fixed depth of three.
Since the use of the pre-trained weights gives the network high stability, they should be used.
During the implementation, many hyper-parameters accumulate: the cooling rate of the old values after each step, the warming rate of the newly added values, and the amount of blur and center shift added during training.
The correct choice of these hyper-parameters is fragile and makes goal-oriented training difficult.
The implementation of the prototype is therefore aborted in favor of another method.
\subsection{Recurrent Layers}
The use of recurrent layers is more challenging.
The output boxes of YOLOv3 from the previous step are combined with those from the current step and passed through LSTM cells, resulting in the new output boxes.
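A sketch of this wiring, with an illustrative flattened output size instead of the real grid dimensions:
\begin{lstlisting}[language=Python]
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 1024  # illustrative flattened size of one YOLO output grid

prev_out = layers.Input(shape=(NUM_FEATURES,))  # boxes from the last frame
curr_out = layers.Input(shape=(NUM_FEATURES,))  # boxes from this frame
seq = layers.Lambda(lambda t: tf.stack(t, axis=1))([prev_out, curr_out])
refined = layers.LSTM(NUM_FEATURES)(seq)        # new output boxes
rnn_head = tf.keras.Model([prev_out, curr_out], refined)
\end{lstlisting}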
% Image
\begin{figure}[!htbp]
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.50\textwidth}
\includegraphics[width=\linewidth]{../data/lstm.pdf}
\caption{Schematic of the Recurrent Extension Made to YOLOv3} %caption
\label{fig:lstm} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}
For teacher learning, the labels must be minimally randomized and then transformed into the output domain of the network.
The transformation requires reverse engineering the last layer of the network and computing its inverse functions.
The use of LSTM cells makes it possible to parse a video without additional manual calculations such as creating a heat map.
This approach works best when annotated training videos are already available.
That makes it possible not only to emulate the last frame but to actually provide, for example, the last 16 frames as input.
For this purpose, recorded ROS bags can be used in which traffic lights were annotated with classical image-processing methods.
The marginal improvements indicate that one frame in the past is not sufficient.
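Since several of YOLOv3's final activations are sigmoids, the transformation can be illustrated, under the simplifying assumption of a purely sigmoid output layer, as jittering the label and applying the logit function:
\begin{lstlisting}[language=Python]
import numpy as np

def inverse_sigmoid(y, eps=1e-6):
    """Logit: maps a (0, 1) activation back into raw network space."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

def label_to_network_domain(box, rng):
    """Jitter a ground-truth box slightly and map it through the inverse
    of the (assumed) sigmoid output layer, emulating a previous-frame
    prediction for teacher learning."""
    jittered = np.clip(box + rng.normal(0.0, 0.01, size=box.shape), 0.0, 1.0)
    return inverse_sigmoid(jittered)
\end{lstlisting}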
\section{Further Improvements}\label{sec:further-improvements}
The following improvements do not necessarily serve to increase precision, but they do show the limitations of YOLOv3.
\subsection{Detection of the Traffic Light Signal}\label{subsec:detection-of-the-traffic-light-signal}
Up to now, all traffic lights have been combined in one class.
The DriveU dataset differentiates the traffic light signal into individual classes depending on its state.
When creating the tfrecords for DriveU, a parameter for differentiation has already been extracted.
With this parameter, the dataset is divided into classes and trained.
The result is not very satisfying and will not be pursued further.
Especially for small, distant traffic lights, the recognition is imprecise.
The detection of small objects has to be improved first.
\subsection{Distance Estimation}\label{subsec:distance-estimation}
The closer a traffic light is, the larger it is in the image.
This circumstance can be used for distance estimation.
Based on this, one can determine a braking behavior.
In an earlier chapter, a formula was introduced\myref{equ:distance_measure} while arguing that a traffic light at a certain distance appears only tiny in the captured picture.
After rearranging, the traffic light distance can be determined from the camera opening angle, the traffic light width (20cm), and the proportion of the traffic light within the image relative to the image width (for example, 2px of 832px).
{\begin{figure}
\def\arraystretch{1.2}
\begin{equation}
\begin{array}{rcl}
a&:=&\textup{\q{Traffic light width (mostly $20cm$)}} \\
m&:=&\textup{\q{Measured width of bounding box in px}} \\
m&=&\frac{a}{p(d)}\cdot w \\
p(d)&:=&\textup{\q{Plane width in a distance $d$}} \\
p(d)&=&2d\cdot \tan(\frac{\theta}{2}) \\
w&:=&\textup{\q{Width of the captured image (e.g. 2048px)}} \\
\theta&:=&\textup{\q{Camera aperture}} \\
d&:=&\textup{\q{Distance of the traffic light}} \\
m&=&\frac{aw}{2d\cdot \tan(\frac{\theta}{2})} \\
d&=&\underbrace{\frac{aw}{2\cdot \tan(\frac{\theta}{2})}}_{const.}\cdot \frac{1}{m}
\end{array}\label{equ:distance_measure}
\end{equation}
\caption{Calculating the Distance of a Traffic Light}
\end{figure}
}
The problem is the parameters needed for the formula.
While the camera aperture is fixed, the other parameters can only be determined inaccurately.
According to EN12368\cite{anforderungenLichtsignalanlagen}, the traffic light's width is standardized to 20 centimeters but can be 30 centimeters in hazardous areas, and in rare cases 10 centimeters are allowed.
Even more challenging to determine is the traffic light width in the picture.
Furthermore, from a certain distance, the difference in width is only marginal.
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}
[
xlabel={traffic light width in px},
ylabel={distance in m},
grid,
scaled y ticks=false,
]
\addplot+[mark repeat=10,domain=1:21, samples=200] {0.1*2048.0/tan(0.87665)/x/78};
\end{axis}
\end{tikzpicture}
\caption{Traffic light size compared with its distance using a 100 degree camera opening angle}
\label{fig:plot_distance_vs_px} %fig:ID
\end{figure}
Using a 100 degree camera opening angle, a bounding box measured as 2 pixels wide can belong to a traffic light 113 or 68 meters away, since the measurement is quantized to whole pixels and the true width may lie anywhere between 1.5 and 2.5 pixels.
Due to the inverse proportionality, a usefully accurate distance indication is only possible below a distance of about 40 meters.\myref{fig:plot_distance_vs_px}
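Rearranged for $d$, the formula is easy to evaluate; a small sketch with the example values from the text (for a 2px wide box it yields roughly 86 meters, within the range just mentioned):
\begin{lstlisting}[language=Python]
import math

def traffic_light_distance(m_px, w_px=2048, a_m=0.20, theta_deg=100.0):
    """d = a*w / (2*tan(theta/2) * m), following the figure above;
    defaults: 2048px image width, 20cm light width, 100 degree aperture."""
    theta = math.radians(theta_deg)
    return (a_m * w_px) / (2.0 * math.tan(theta / 2.0) * m_px)

print(round(traffic_light_distance(2)))  # -> 86 (meters)
\end{lstlisting}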
% Image
\begin{figure}
\vspace{0cm}
\minipage{0.0\textwidth}
\endminipage\hfill
\minipage{0.85\textwidth}
\includegraphics[width=\linewidth]{../data/distance_measure.png}
\caption{Output of traffic lights with distance markers} %caption
\label{fig:distance_measure} %fig:ID
\endminipage\hfill
\minipage{0.0\textwidth}
\endminipage
\end{figure}

View File

src/04a31-experiments.tex Normal file
View File

@@ -0,0 +1,189 @@
The experiments take place in parallel with the implementation so that it can benefit from the results.
This chapter deals with the evaluation phase of the trained networks.
The networks are tested with the test dataset of Cityscapes.
Later on, test data from the DriveU set is added.
In addition to the hard metric \ac{mAP}, the tested and annotated images and videos are manually examined in detail.
A well\textendash trained network is characterized by finding an exact box around the corpus of every traffic light that a human can spot immediately.
Traffic lights that are not directly relevant to the driver, such as back sides or traffic lights for other road users, remain undetected.
In videos, the boxes should additionally neither jump between frames nor drop out.
\section{YOLOv3 Against YOLOv3\textendash Tiny}\label{sec:yolov3-against-yolov3-tiny}
Training on the training dataset leads to a minimum validation loss of 16.3835 for YOLOv3.
The loss indicates the absolute error of the network's predictions.
On its own, this value is not meaningful and is only used for the relative comparison of individual experiments within the same architecture.
\begin{table}[]
\centering
\begin{tabular}{@{}lrr@{}}
\toprule
& \multicolumn{1}{l}{\textbf{YOLOv3}} & \multicolumn{1}{l}{\textbf{YOLOv3\textendash Tiny}} \\ \midrule
mAP with IoU of 0.25 & 0.88 & 0.83 \\
mAP with IoU of 0.50 & 0.43 & 0.26 \\
mAP with IoU of 0.75 & 0.04 & 0.03 \\
time (min) & 89 ms & 22 ms \\
time (25\% percentile) & 96 ms & 24 ms \\
time (50\% percentile) & 98 ms & 25 ms \\
time (75\% percentile) & 101 ms & 25 ms \\
time (max) & 135 ms & 44 ms \\
avg. \acs{FPS} & 17 & 30 \\
\# of epochs & 14 & 9 \\
min. val loss & 16.3835 & 12.4516 \\ \bottomrule
\end{tabular}
\caption{Comparison: YOLOv3 Against YOLOv3\textendash Tiny}
\label{tab:compare_yolov3-against-yolov3-tiny}
\end{table}
When testing the images, YOLOv3\textendash Tiny is about four times faster.
When measuring the time for images, only the time it takes to perform the object recognition is taken into account.
The metric FPS specifies how many frames per second are displayed; the loading of new images is therefore included.
Using this metric, the time advantage of the small version shrinks to half.
The limiting factor is the \ac{HDD} used to store the datasets, which reaches its maximum reading and writing speed in many tests and training sessions.
The 200 \ac{FPS} on a weaker graphics card, as stated on the official YOLO website, supports this argument.\cite{yoloWebsite}
The speed of 30 \ac{FPS} or 17 \ac{FPS} with YOLO achieved under these conditions is still sufficient for traffic light detection.
The mAP of YOLOv3\textendash Tiny is about one third worse than that of the large network.
The low precision is also evident when examining images and videos.
The tiny YOLO variant fails to recognize many traffic lights in the pictures, especially small ones.
When driving, this effectively reduces the detection range: traffic lights are only recognized later.
%TODO IMAGE late detection
In videos, the inaccuracy of YOLOv3\textendash Tiny becomes noticeable through high instability.
That is also observable in YOLOv3, but not as prominently.
%TODO IMAGE video instability
A fundamental difference between the two versions is the number of levels of the feature pyramid.
YOLOv3 has one additional step with one\textendash eighth of the image width and height for the detection of small objects.
That is entirely missing in YOLOv3\textendash Tiny.
Besides, YOLOv3\textendash Tiny has no Darknet backbone and a lower depth.
The context\textendash sensitivity with which the tiny variant works cannot be as complex as that of YOLOv3.
\section{Varying the Image Input Size}\label{sec:varying-the-image-input-size2}
The larger the input image during training, the larger the validation loss.
A shift of a prediction by a given number of pixels affects the loss more in a larger input image than in a small one.
In contrast, a larger image captures more detail.
\begin{table}[]
\centering
\begin{tabular}{@{}lrrrr@{}}
\toprule
& \multicolumn{4}{l}{\textbf{YOLOv3}} \\ \midrule
Size & 416x416 & 832x832 & 512x1024 & 416x832 \\
Pixels in $10^5$ & 1.73 & 6.92 & 5.24 & 3.46 \\
Batch Size & 8 & 4 & 4 & 8 \\ \midrule
mAP with IoU of 0.25 & 0.88 & 0.84 & 0.85 & 0.86 \\
mAP with IoU of 0.50 & 0.43 & 0.44 & 0.44 & 0.43 \\
mAP with IoU of 0.75 & 0.04 & 0.07 & 0.07 & 0.06 \\
time (min) & 89 ms & 99 ms & 95 ms & 92 ms \\
time (25\% percentile) & 96 ms & 112 ms & 109 ms & 101 ms \\
time (50\% percentile) & 98 ms & 125 ms & 121 ms & 110 ms \\
time (75\% percentile) & 101 ms & 137 ms & 132 ms & 120 ms \\
time (max) & 135 ms & 150 ms & 143 ms & 140 ms \\
avg. \acs{FPS} & 17 & 13 & 14 & 17 \\
\# of epochs & 14 & 6 & 7 & 10 \\
min. val loss & 16.3835 & 21.8095 & 19.4198 & 17.5367 \\ \bottomrule
\end{tabular}
\caption{Varying the Image Input Size}
\label{tab:varying-the-image-input-size}
\end{table}
The time it takes the network to process an image increases linearly with the number of pixels.
As the image size increases, the mAP also increases, but with a monotonically decreasing slope.
Since the number of pixels is proportional to the required time, the computing time increases disproportionately to the mAP.
Even though the amount of data to be processed is larger with a large image as input, the required training time decreases: the validation loss plateaus after fewer epochs.
The target object is easier to identify with more details.
If the images are too small, the traffic lights are not identifiable, and the mAP is disproportionately bad.
With too large images, the additional gain in knowledge decreases because all relevant details are already present in the image.
Training on large images is also tricky because the batch size has to be reduced to ensure that the graphics card's memory is sufficient.
A smaller batch size makes the network more vulnerable to outliers and overfitting.
The optimal size for our problem is 416x832.
This size has roughly as many pixels as a 588x588 image and is therefore comparatively small.
Nevertheless, the width is doubled and the traffic lights are more clearly visible in the image.
\section{More Data}\label{sec:more-data}
The introduction of the DriveU dataset leads to a reduction of the validation loss.
\begin{table}
\footnotesize
\centering
\begin{tabularx}{\textwidth}{@{}lrrrrrr@{}}
\toprule
& \multicolumn{3}{l}{\textbf{YOLOv3}} & \multicolumn{3}{l}{\textbf{YOLOv3-Tiny}} \\ \midrule
& \multicolumn{1}{l}{\textbf{Cityscapes}} & \multicolumn{1}{l}{\textbf{CS + DU}} & \multicolumn{1}{l}{\textbf{DriveU}} & \multicolumn{1}{l}{\textbf{Cityscapes}} & \multicolumn{1}{l}{\textbf{CS + DU}} & \multicolumn{1}{l}{\textbf{DriveU}} \\ \midrule
\textbf{mAP with IoU of 0.25} & 0.88 & 0.87 & 0.88 & 0.83 & 0.78 & 0.81 \\
\textbf{mAP with IoU of 0.50} & 0.43 & 0.44 & 0.42 & 0.26 & 0.36 & 0.30 \\
\textbf{mAP with IoU of 0.75} & 0.04 & 0.07 & 0.05 & 0.03 & 0.04 & 0.03 \\
\textbf{time (min)} & 89 ms & 88 ms & 90 ms & 22 ms & 26 ms & 24 ms \\
\textbf{time (25\% percentile)} & 96 ms & 97 ms & 95 ms & 24 ms & 26 ms & 26 ms \\
\textbf{time (50\% percentile)} & 98 ms & 101 ms & 98 ms & 25 ms & 27 ms & 27 ms \\
\textbf{time (75\% percentile)} & 101 ms & 104 ms & 99 ms & 25 ms & 30 ms & 32 ms \\
\textbf{time (max)} & 135 ms & 132 ms & 133 ms & 43 ms & 42 ms & 49 ms \\ \bottomrule
\end{tabularx}
\caption{Comparison: More Data}
\label{tab:comparison:_more-data}
\end{table}
This experiment shows that the combination of both datasets leads to a higher precision than when used alone.
Even though Cityscapes contains some irrelevant traffic lights, the precision increases, and the videos become visibly more interference-free.
Both datasets have different white balance, color matching, and distribution of traffic light positions.
It seems likely that this will make the network more robust and give it a better understanding of the concept of traffic lights.
YOLOv3 benefits from additional data more than YOLOv3-Tiny.
The small YOLO variant does not have sufficient capacity to hold all the information from the dataset.
\section{Pre-Trained Weights}\label{sec:pre-trained-weights}
Using the pre-trained weights, the number of epochs required is further reduced.
Now that the Darknet part, with 18 million weights, is frozen, only about 71\% of the network is trained.
\begin{table}[]
\centering
\begin{tabular}{@{}lrr@{}}
\toprule
& \multicolumn{1}{l}{\textbf{YOLOv3 (CS+DU)}} & \multicolumn{1}{l}{\textbf{YOLOv3 (CS+DU; Darknet frozen)}} \\ \midrule
mAP with IoU of 0.25 & 0.87 & 0.82 \\
mAP with IoU of 0.50 & 0.44 & 0.42 \\
mAP with IoU of 0.75 & 0.07 & 0.08 \\
time (min) & 88 ms & 89 ms \\
time (25\% percentile) & 97 ms & 98 ms \\
time (50\% percentile) & 99 ms & 99 ms \\
time (75\% percentile) & 103 ms & 105 ms \\
time (max) & 127 ms & 133 ms \\
avg. \acs{FPS} & 17 & 17 \\
\# of epochs & 8 & 9 \\
min. val loss & 12.6568 & 12.4516 \\ \bottomrule
\end{tabular}
\caption{Comparison: Pre-Trained Weights}
\label{tab:compare_pre-trained-weights}
\end{table}
In comparison to previous experiments, the mAP has decreased slightly.
The Darknet-53 now focuses on general object recognition.
The complete information about the structure and context of traffic lights is inside the YOLO part.
The concentration of the traffic light information in the YOLO part makes the mAP decrease, but the task split with Darknet-53 makes the network more stable.
When looking at the images, the decreased mAP is not noticeable.
Especially the good annotation results on the videos make the approach worthwhile.
The detection is more stable, and the bounding box sizes no longer fluctuate as much between frames.
\section{Recurrent Layers}\label{sec:recurrent-layers}
The use of recurrent layers cannot be measured with mAP because the datasets used do not contain annotated videos.
When examining the videos manually, one can see fewer micro-dropouts than before, though not statistically significantly fewer.
Using the last frame alone is not sufficient to compensate for interruptions spanning several frames.
Some fluctuations in box size have disappeared, but there are still outliers.
Recurrent layers work in this application, but the input must reach further into the past and comprise, for example, the last 16 frames.

View File

@@ -16,9 +16,13 @@
\usepackage{url}
\usepackage{graphicx} %include images
%\usepackage{pdfpages} %include PDFs
\usepackage{amsmath}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage[a4paper, margin=1in]{geometry}
\usepackage[right]{eurosym} %euro sign
\usepackage{amssymb}
\usepackage{pdfpages}
\usepackage{cite} %citations
\usepackage[
colorlinks, % links without borders, in a color of choice
@@ -47,7 +51,8 @@
\usepackage{makecell}
\usepackage{enumitem}
\newcommand{\bc}[1]{{\centering\bf #1}}
\newcommand{\myref}[1]{{(see \autoref{#1}~\nameref{#1})}}
%\newcommand{\myref}[1]{{(see~\autoref{#1}~\q{\nameref{#1}})}}
\newcommand{\myref}[1]{{(see~\autoref{#1})}}
\newcommand{\q}[1]{{\enquote{#1}}}
@@ -70,7 +75,7 @@
%%%%%%%%%%%% Student name
\newcommand{\studentName}{Seedorf, Sebastian}
%%%%%%%%%%%% Type of thesis
\newcommand{\type}{Masterarbeit}
\newcommand{\type}{Master Thesis}
%%%%%%%%%%%% Topic
\newcommand{\topic}{Detection of Traffic Lights in Urban Traffic Using a Convolutional Neural Network (CNN)}
%%%%%%%%%%%% Matriculation number
@@ -83,7 +88,8 @@
\newcommand{\jahrgang}{2017}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{listings,xcolor} %code display
\usepackage{listings,xcolor}
\usepackage{cleveref}
\definecolor{dkgreen}{rgb}{0,.6,0}
\definecolor{dkblue}{rgb}{0,0,.6}
\definecolor{dkyellow}{cmyk}{0,0,.8,.3}
@@ -258,9 +264,9 @@
\pagenumbering{Roman}%Roman page numbers
\setcounter{page}{6}
\bibliographystyle{ieeetr}
\bibliography{02-abstract,04a01-requirement-analysis,04a02-yolo,04a03-datasets,04-a21-implementation-init}
\bibliography{04a11-requirement-analysis.bib}
% backup 02-abstract,04a01-requirement-analysis,04a02-yolo,04a03-datasets,04a21-implementation-init,04a22-implementation-changes,04a31-experiments,04a-41-conclusion
% backup 02-abstract,04a11-requirement-analysis,04a12-yolo,04a13-datasets,04a21-implementation-init,04a22-implementation-changes,04a31-experiments,04a-41-conclusion
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Statutory declaration
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@@ -38,5 +38,5 @@ F-AUT-3 & Falls der Benutzer auf die Login-Seite zugreifen möchte und bereits e
% Quote
\begin{chapquote}{Thomas Zink, Data Manager; \textit{Aug 3, memory protocol by Sebastian Seedorf}}
\q{At the moment, the system does not yet offer that many advantages compared to the Access-Database. Especially since not much new data is added on a regular basis, swapping in the new database is not a particular problem. Once the system soon reaches the point [reference to his tickets in the issue-tracker, ed. note] where DDS [\acl{DDS}, ed. note] can also be generated automatically as XLSX, it will save an enormous amount of work. It does not necessarily have to be XLSX; I would actually prefer PDF, but that is then a matter of implementation.}
\q{At the moment, the system does not yet offer that many advantages compared to the Access\textendash Database. Especially since not much new data is added on a regular basis, swapping in the new database is not a particular problem. Once the system soon reaches the point [reference to his tickets in the issue\textendash tracker, ed. note] where DDS [\acl{DDS}, ed. note] can also be generated automatically as XLSX, it will save an enormous amount of work. It does not necessarily have to be XLSX; I would actually prefer PDF, but that is then a matter of implementation.}
\end{chapquote}