# YOLOv3: An Incremental Improvement

University of Washington

University of Washington (UW, Washington, or U-Dub)
Allen Institute for Artificial Intelligence (Allen Institute for AI, AI2)
Computer Science (CS)
Computer Vision (CV)
interpretation [ɪnˌtɜːprəˈteɪʃn]: n. explanation; translation; performance
intuition [ˌɪntjuˈɪʃn]: n. instinctive understanding; intuitive knowledge
moderation [ˌmɒdəˈreɪʃn]: n. restraint; mildness; easing


arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

## Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At $320 \times 320$ YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 $AP_{50}$ in 51 ms on a Titan X, compared to 57.5 $AP_{50}$ in 198 ms by RetinaNet, similar performance but $3.8\times$ faster. As always, all the code is online at https://pjreddie.com/darknet/yolo/.

swell [swel]: v. to expand; to bulge; to increase; to grow louder; to fill (with emotion); n. a bulge; a gradual increase; a surge of feeling; an ocean swell; adj. (informal) excellent; very pleasant; stylish; adv. (informal) excellently


Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

## 1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.

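twitter ['twɪtə(r)]: n. excitement; chirping (of birds); nervous agitation; v. to chirp; to chatter; to post on the Twitter social network
intro [ˈɪntrəʊ]: n. prelude; preface; introduction
contemplate [ˈkɒntəmpleɪt]: vt. to ponder; to gaze at; to consider; to expect; vi. to think deeply
signpost [ˈsaɪnpəʊst]: n. road sign; guidepost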


## 2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.

### 2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, $t_{x}, t_{y}, t_{w}, t_{h}$. If the cell is offset from the top left corner of the image by $(c_{x}, c_{y})$ and the bounding box prior has width and height $p_{w}, p_{h}$, then the predictions correspond to:

$$
\begin{aligned}
b_{x} &= \sigma(t_{x}) + c_{x} \\
b_{y} &= \sigma(t_{y}) + c_{y} \\
b_{w} &= p_{w} e^{t_{w}} \\
b_{h} &= p_{h} e^{t_{h}}
\end{aligned}
$$

The anchor boxes are obtained by clustering. $(c_{x}, c_{y})$ is the offset of a cell from the top left corner of the image (the image is divided into an $S \times S$ grid of cells). $p_{w}, p_{h}$ are the width and height of the bounding box prior (anchor box).
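
To make the four equations concrete, here is a minimal Python sketch of the decoding step. The function name `decode_box` and the example cell and prior are made up for illustration; centre coordinates are assumed to be in grid-cell units and sizes in the units of the prior.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th) to a box (bx, by, bw, bh).

    cell  = (cx, cy): offset of the grid cell from the image's top left corner
    prior = (pw, ph): width and height of the bounding box prior
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx          # centre stays inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # size is a positive multiple of the prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero raw outputs put the centre in the middle of the cell and
# reproduce the prior's size exactly.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 4), prior=(116.0, 90.0))
print(box)  # (6.5, 4.5, 116.0, 90.0)
```

Because the sigmoid bounds the centre offset to (0, 1), a cell can only place a box centre inside itself.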

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_{\ast}$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_{\ast} - t_{\ast}$. This ground truth value can be easily computed by inverting the equations above.
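
Writing the inversion out may help. The notation $\hat{b}_{\ast}$ for the ground truth box is introduced here and is not in the source; $\sigma^{-1}(u) = \ln\frac{u}{1-u}$ is the inverse of the logistic function. Under those assumptions the targets would be:

$$
\begin{aligned}
\hat{t}_{x} &= \sigma^{-1}(\hat{b}_{x} - c_{x}) \\
\hat{t}_{y} &= \sigma^{-1}(\hat{b}_{y} - c_{y}) \\
\hat{t}_{w} &= \ln(\hat{b}_{w} / p_{w}) \\
\hat{t}_{h} &= \ln(\hat{b}_{h} / p_{h})
\end{aligned}
$$

Substituting these back into the prediction equations returns $\hat{b}_{x}, \hat{b}_{y}, \hat{b}_{w}, \hat{b}_{h}$, which is a quick way to check the signs.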

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

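In yolov3.cfg the number of training iterations is max_batches = 500200. When the dataset is small, the loss printed at every iteration can be nan. A possible cause is that the threshold makes the network ignore the bounding box outright, so no loss is produced.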

If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of 0.5. The predicted bounding box is ignored: coordinate and class predictions produce no loss, only the objectness prediction does.
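
The assignment rule can be sketched in a few lines of Python. This is a simplification under stated assumptions: overlap is measured on width and height only (boxes aligned at a shared centre), and the names `iou_wh` and `assign_priors` are hypothetical, not taken from Darknet.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h) and aligned at a shared centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def assign_priors(truth_wh, priors, ignore_thresh=0.5):
    """Label each prior 'positive', 'ignore' or 'negative' for one object."""
    overlaps = [iou_wh(truth_wh, p) for p in priors]
    best = max(range(len(priors)), key=lambda i: overlaps[i])
    labels = []
    for i, overlap in enumerate(overlaps):
        if i == best:
            labels.append("positive")   # the single prior assigned to the object
        elif overlap > ignore_thresh:
            labels.append("ignore")     # overlaps well but is not the best
        else:
            labels.append("negative")   # unassigned: objectness loss only
    return labels

# Hypothetical object of size 100 x 90 and four candidate priors.
labels = assign_priors((100, 90), [(116, 90), (156, 198), (373, 326), (90, 100)])
print(labels)  # ['positive', 'negative', 'negative', 'ignore']
```

Only one prior ends up positive per object, which matches the statement that, unlike [17], each ground truth object gets a single bounding box prior.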

invert [ɪnˈvɜːt]: vt. to reverse; to turn upside down; n. something inverted; adj. inverted
incur [ɪnˈkɜː(r)]: v. to bring upon oneself; to suffer; to give rise to
plagiarize ['pleɪdʒəraɪz]: v. to copy another's work and pass it off as one's own


Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

### 2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

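Class prediction changes the original single-label classification into multilabel classification. Structurally, the softmax layer used for single-label multi-class classification is replaced by independent logistic classifiers. A softmax layer assumes that an image or object belongs to exactly one class, but in more complex scenes an object can belong to several classes. For example, if the classes include woman and person and an image contains a woman, the detection should carry both the woman and the person label. This is multilabel classification, and it needs an independent logistic classifier that makes a binary decision for each class.

During training we use binary cross-entropy loss for the class predictions. YOLOv1 computed the class loss with sum-squared error, which converges less readily in training than cross-entropy, so cross-entropy is generally used for the class loss.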
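
A small numeric sketch shows why independent logistic outputs suit overlapping labels. The logits and the class order (Person, Woman, Dog) are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(logits, targets):
    """Sum of per-class binary cross-entropy over independent logistic outputs."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

# Hypothetical logits for the classes (Person, Woman, Dog).
logits = [4.0, 3.0, -5.0]
independent = [sigmoid(z) for z in logits]   # each class scored on its own
competing = softmax(logits)                  # classes compete for one unit of mass

print([round(p, 3) for p in independent])
print([round(p, 3) for p in competing])
```

With independent sigmoids both Person and Woman can score above 0.9, while the softmax forces the scores to sum to 1 and pushes Woman below 0.5.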

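multilabel classification: classification in which one example may carry several labels at once
cross entropy: a loss that measures the mismatch between predicted probabilities and target labels
prediction [prɪˈdɪkʃn]: n. forecast; prophecy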


### 2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is $N \times N \times [3 \ast (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
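
As a rough illustration of that tensor depth, the sketch below computes the channel count and splits one cell's vector into per-box fields. The field ordering (offsets, objectness, classes, one box after another) and the helper names are assumptions made for this example.

```python
def channels_per_cell(boxes_per_scale=3, num_classes=80):
    """Depth of the prediction tensor: boxes * (4 offsets + 1 objectness + classes)."""
    return boxes_per_scale * (4 + 1 + num_classes)

def split_cell_vector(vector, boxes_per_scale=3, num_classes=80):
    """Split one cell's flat prediction vector into per-box fields (assumed layout)."""
    stride = 4 + 1 + num_classes
    assert len(vector) == boxes_per_scale * stride
    boxes = []
    for b in range(boxes_per_scale):
        chunk = vector[b * stride:(b + 1) * stride]
        boxes.append({
            "offsets": chunk[:4],        # tx, ty, tw, th
            "objectness": chunk[4],
            "class_scores": chunk[5:],
        })
    return boxes

depth = channels_per_cell()                    # 3 * (4 + 1 + 80) = 255
cell = split_cell_vector(list(range(depth)))   # dummy values 0..254
print(depth, len(cell), len(cell[0]["class_scores"]))  # 255 3 80
```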

Next we take the feature map from 2 layers previous and upsample it by $2\times$. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

YOLOv2 added fine-grained features through a passthrough layer. YOLOv3 instead upsamples the feature map from 2 layers previous by $2\times$ and concatenates it with a feature map from earlier in the network. This gives semantic information from the upsampled features and fine-grained information from the earlier layer. The merged feature map then passes through a few more convolutional layers and produces a tensor twice the size of the previous one.
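
The shape bookkeeping of this merge can be checked with plain Python lists. The sketch assumes nearest-neighbour upsampling and uses made-up channel counts (4 and 6); it is not the Darknet implementation.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a feature map stored as fmap[y][x][c]."""
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in range(2)]   # repeat each column twice
        out.append(wide)
        out.append([list(px) for px in wide])               # repeat each row twice
    return out

def concat_channels(a, b):
    """Concatenate two feature maps of equal height and width along the channel axis."""
    assert len(a) == len(b) and len(a[0]) == len(b[0])
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def shape(fmap):
    return len(fmap), len(fmap[0]), len(fmap[0][0])

# Hypothetical channel counts: a coarse 13 x 13 map and an earlier 26 x 26 map.
coarse = [[[0.0] * 4 for _ in range(13)] for _ in range(13)]
earlier = [[[1.0] * 6 for _ in range(26)] for _ in range(26)]
merged = concat_channels(upsample2x(coarse), earlier)
print(shape(merged))  # (26, 26, 10)
```

Concatenation keeps both sets of channels side by side, so the following convolutions can weigh coarse semantic features against fine spatial ones.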

semantic [sɪˈmæntɪk]: adj. relating to meaning; of semantics (same as semantical)


We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: ($10 \times 13$), ($16 \times 30$), ($33 \times 23$), ($30 \times 61$), ($62 \times 45$), ($59 \times 119$), ($116 \times 90$), ($156 \times 198$), ($373 \times 326$).
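
For readers who want to see how such priors could be produced, here is a minimal k-means sketch on (width, height) pairs. It assumes the $1 - \text{IoU}$ distance described for dimension clusters in YOLO9000, a mean-based centroid update, and a deterministic initialisation chosen only to keep the example reproducible. The data is synthetic, not COCO.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h) and aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_priors(boxes, k, iterations=100):
    """Cluster (w, h) pairs with distance 1 - IoU; return k priors sorted by area."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Assumed initialisation: evenly spaced boxes after sorting by area.
    step = len(ordered) / k
    centroids = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            nearest = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            groups[nearest].append(box)
        updated = []
        for i, group in enumerate(groups):
            if not group:                      # keep the centroid of an empty cluster
                updated.append(centroids[i])
                continue
            w = sum(b[0] for b in group) / len(group)
            h = sum(b[1] for b in group) / len(group)
            updated.append((w, h))
        if updated == centroids:               # converged
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])

# Synthetic boxes in three obvious size groups.
boxes = [(10, 12), (12, 14), (11, 13), (60, 50), (62, 48), (64, 52),
         (300, 310), (320, 330), (310, 300)]
priors = kmeans_priors(boxes, k=3)
print(priors)
```

Sorting the result by area makes it easy to hand the smallest priors to the finest scale and the largest to the coarsest.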

Network resolution: $416 \times 416$

Layer 82, detection at scale 1
Feature map: $13 \times 13 = 169$ cells; stride = 32; $416 / 32 = 13$
Large-scale boxes: $13 \times 13 \times 3 = 507$
With stride = 32 the downsampling factor is high. The feature map has a large receptive field, which suits large objects in the image. COCO bounding box priors: ($116 \times 90$), ($156 \times 198$), ($373 \times 326$).

Layer 94, detection at scale 2
Feature map: $26 \times 26 = 676$ cells; stride = 16; $13 \times 2 = 26$
Medium-scale boxes: $26 \times 26 \times 3 = 2028$
With stride = 16 the downsampling factor is moderate. The feature map has a medium receptive field, which suits medium-sized objects in the image. COCO bounding box priors: ($30 \times 61$), ($62 \times 45$), ($59 \times 119$).

Layer 106, detection at scale 3
Feature map: $52 \times 52 = 2704$ cells; stride = 8; $26 \times 2 = 52$
Small-scale boxes: $52 \times 52 \times 3 = 8112$
With stride = 8 the downsampling factor is low. The feature map has a small receptive field, which suits small objects in the image. COCO bounding box priors: ($10 \times 13$), ($16 \times 30$), ($33 \times 23$).

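YOLOv3 makes the network deeper and, at the same time, narrower. During downsampling, semantic information (what the object is) grows stronger while positional information (where the object is) grows weaker. Upsampling afterwards expands the semantic features to the size of the earlier high-resolution feature maps so that semantic and positional information can be combined, which helps detect objects at different scales.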

In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is $N \times N \times [3 \ast (4 + 1 + 80)]$ for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

YOLOv2 uses 5 anchor box sizes and YOLOv3 uses 3 per scale, but YOLOv3 has 3 detection output layers, so it predicts more bounding boxes than YOLOv2.
YOLOv2: $13 \times 13 \times 5 = 845$.
YOLOv3: $(13 \times 13 + 26 \times 26 + 52 \times 52) \times 3 = (169 + 676 + 2704) \times 3 = 3549 \times 3 = 10647$.
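
The same arithmetic in Python, as a quick check. It assumes a $416 \times 416$ input and the strides 32, 16 and 8 listed in the notes; `detection_grid` is a name made up for this sketch.

```python
def detection_grid(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Grid size and number of predicted boxes for each detection scale."""
    rows = []
    for stride in strides:
        n = input_size // stride
        rows.append({"stride": stride, "grid": n, "boxes": n * n * boxes_per_cell})
    return rows

scales = detection_grid()
total = sum(row["boxes"] for row in scales)
for row in scales:
    print(row)
print("total boxes:", total)  # total boxes: 10647
```

Changing `input_size` to another multiple of 32 shows how the box count grows with resolution.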

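Feature Pyramid Network (FPN)

res1, res2, …, res8 indicate how many res_units a res_block contains. Each res_unit needs one add layer. There are 1 + 2 + 8 + 8 + 4 = 23 res_units in total, hence 23 add layers. Each res_block uses one zero padding, so the 5 res_blocks give 5 zero-padding layers.

There are 2 upsample operations and 2 concatenations.

YOLOv3 has no pooling layers and no fully connected layers. During the forward pass, feature maps are downsized by changing the stride of the convolution kernels.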
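
How a strided convolution shrinks the feature map follows from the standard output-size formula. The sketch assumes $3 \times 3$ kernels with stride 2 and padding 1, and five such layers between the input and the coarsest map.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 416
sizes = [size]
for _ in range(5):                 # five stride-2 convolutions (assumed)
    size = conv_output_size(size)
    sizes.append(size)
print(sizes)  # [416, 208, 104, 52, 26, 13]
```

The last three sizes, 52, 26 and 13, are the three detection grids for a $416 \times 416$ input.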

### 2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive $3 \times 3$ and $1 \times 1$ convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it… wait for it… Darknet-53!

YOLOv3's feature extractor is a hybrid model that combines Darknet-19, the network used in YOLOv2, with residual networks. It has 53 convolutional layers, hence the name Darknet-53. Darknet-53 is only the feature extraction network: YOLOv3 uses the convolutional layers before the Avgpool layer to extract features, and the multi-scale feature fusion and prediction branches are not part of Darknet-53.

newfangled ['nju:,fæŋɡl]: adj. novel; newly fashionable; of the newest style
stuff [stʌf]: n. things; material; filling; vt. to fill; to cram; vi. to overeat
residual [rɪˈzɪdjuəl]: adj. remaining; left over; n. remainder; residue; (statistics) residual error


Table 1. Darknet-53

$1\times$, $2\times$, $4\times$, $8\times$, … indicate how many times a residual unit is repeated. Each residual unit contains two convolutional layers.
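
A quick tally of the layers, as a sketch. The repeat counts 1, 2, 8, 8, 4 come from the notes above; the single initial convolution and the one stride-2 convolution in front of each group are assumptions about Table 1, which is not reproduced here.

```python
# Repeat counts of the residual groups (the 1x, 2x, 8x, 8x, 4x rows).
repeats = [1, 2, 8, 8, 4]

residual_units = sum(repeats)              # 23
convs_in_units = 2 * residual_units        # two convolutions per residual unit
downsampling_convs = len(repeats)          # assumed: one stride-2 convolution per group
stem_convs = 1                             # assumed: one initial convolution
conv_layers = stem_convs + downsampling_convs + convs_in_units
print(residual_units, conv_layers)  # 23 52
```

Under these assumptions the tally gives 52 convolutions, so reaching 53 would require counting one more layer, possibly the final connected layer of the classifier. That reading is a guess, not something the source states.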

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

accuracy [ˈækjərəsi]: n. precision; correctness


Each network is trained with identical settings and tested at $256 \times 256$, single crop accuracy. Run times are measured on a Titan X at $256 \times 256$. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and $1.5\times$ faster. Darknet-53 has similar performance to ResNet-152 and is $2\times$ faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.

par [pɑː(r)]: n. standard; face value; average amount; adj. standard; at face value


### 2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

Hard negative mining selects representative negative samples, that is, background regions that the classifier wrongly predicts as positives.

## 3. How We Do

YOLOv3 is pretty good! See Table 3. In terms of COCO’s weird average mean AP metric it is on par with the SSD variants but is $3\times$ faster. It is still quite a bit behind other models like RetinaNet in this metric though.

weird [wɪəd]: adj. strange; uncanny; supernatural; n. (Scottish) fate


However, when we look at the “old” detection metric of mAP at IOU= .5 (or $AP_{50}$ in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.
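
To see how a "decent" box can pass at IOU .5 and fail at a stricter threshold, here is a small IoU sketch. The boxes are invented: a $100 \times 100$ ground truth and a prediction shifted by 15 pixels in both directions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

truth = (0.0, 0.0, 100.0, 100.0)
shifted = (15.0, 15.0, 115.0, 115.0)     # hypothetical prediction
overlap = iou(truth, shifted)
print(round(overlap, 3), overlap >= 0.5, overlap >= 0.75)  # 0.566 True False
```

A detection like this counts as correct for $AP_{50}$ but as a miss at an IOU threshold of .75, which is the kind of gap the paragraph above describes.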

In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high $AP_{S}$ performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the $AP_{50}$ metric (see figure 5) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.

decent [ˈdiːsnt]: adj. respectable; proper; fairly good
reversal [rɪˈvɜːsl]: n. turnaround; inversion; revocation


Table 3. I’m seriously just stealing all these tables from [9] they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like $3.8\times$ longer to process an image. YOLOv3 is much better than SSD variants and comparable to state-of-the-art models on the $AP_{50}$ metric.

steal [stiːl]: vt. to plagiarize; to do stealthily; to take without permission; vi. to move stealthily; n. theft; a bargain


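YOLOv2 was insensitive to small objects, mainly because of its cell-based prediction stage. With multi-scale prediction added, YOLOv3 has improved on small objects, but it now performs less well on medium and large objects, which still needs work.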

## 4. Things We Tried That Didn’t Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.

Anchor box $x, y$ offset predictions. We tried using the normal anchor box prediction mechanism where you predict the $x, y$ offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.

Linear $x, y$ predictions instead of logistic. We tried using a linear activation to directly predict the $x, y$ offset instead of the logistic activation. This led to a couple point drop in mAP.

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
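
For reference, a minimal sketch of the focal loss for a single binary prediction, in the form given in the Focal Loss paper [9]. The defaults $\gamma = 2$ and $\alpha = 0.25$ are that paper's reported settings; the example probabilities are invented, and this is not the code used in the YOLOv3 experiment.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for probability p in (0, 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL = -alpha_t * (1 - p_t)**gamma * log(p_t), with p_t the true-label probability."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = 0.01   # background example already scored as unlikely to be an object
hard = 0.90   # background example wrongly scored as an object

# Fraction of the cross-entropy loss that survives the focal weighting.
print(focal_loss(easy, 0) / cross_entropy(easy, 0))
print(focal_loss(hard, 0) / cross_entropy(hard, 0))
```

The easy background example keeps only a tiny fraction of its cross-entropy loss while the hard one keeps most of it, which is the imbalance focal loss is meant to correct.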

Dual IOU thresholds and truth assignment. Faster RCNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is as a positive example, by [.3 - .7] it is ignored, less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn’t get good results.
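
The dual-threshold rule described above can be written as a small labelling function. This is a sketch of the rule as stated in the paragraph, with .7 and .3 as defaults; `label_prediction` is a hypothetical name, and treating exactly .7 as positive is an assumption.

```python
def label_prediction(ious, positive_thresh=0.7, negative_thresh=0.3):
    """Label one prediction from its IoU with every ground truth object."""
    best = max(ious) if ious else 0.0
    if best >= positive_thresh:
        return "positive"
    if best < negative_thresh:       # below the threshold for all objects
        return "negative"
    return "ignore"                  # in between: contributes no loss

print(label_prediction([0.10, 0.75]))  # positive
print(label_prediction([0.50, 0.20]))  # ignore
print(label_prediction([0.10, 0.29]))  # negative
```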

We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

eventually [ɪˈventʃuəli]: adv. finally; in the end


## 5. What This All Means

YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [18] If humans have a hard time telling the difference, how much does it matter?

cryptic [ˈkrɪptɪk]: adj. mysterious; obscure in meaning; hidden


But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to… wait, you’re saying that’s exactly what it will be used for?? Oh.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait…

I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

In closing, do not @ me. (Because I finally quit Twitter).

The author is funded by the Office of Naval Research and Google.

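collaboration [kəˌlæbəˈreɪʃn]: n. cooperation; collusion (with an enemy)
harvest [ˈhɑːvɪst]: n. crop; yield; result; vt. to reap; to gather; vi. to gather crops
owe [əʊ]: vt. to be in debt to; to be obliged to give; to attribute to; vi. to be in debt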


Figure 3. Again adapted from the [9], this time displaying speed/accuracy tradeoff on the mAP at .5 IOU metric. You can tell YOLOv3 is good because it’s very high and far to the left. Can you cite your own paper? Guess who’s going to try, this guy ! [16]. Oh, I forgot, we also fix a data loading bug in YOLOv2, that helped by like 2 mAP. Just sneaking this in here to not throw off layout.

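sneak [sniːk]: vi. to move furtively; to tell tales to a teacher; vt. to do or take furtively; n. a furtive person; a furtive act; an informer; adj. done in secret
layout [ˈleɪaʊt]: n. arrangement; design; plan; display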


## Rebuttal

We would like to thank the Reddit commenters, labmates, emailers, and passing shouts in the hallway for their lovely, heartfelt words. If you, like me, are reviewing for ICCV then we know you probably have 37 other papers you could be reading that you’ll invariably put off until the last week and then have some legend in the field email you about how you really should finish those reviews except it won’t entirely be clear what they’re saying and maybe they’re from the future? Anyway, this paper won’t have become what it will in time be without all the work your past selves will have done also in the past but only a little bit further forward, not like all the way until now forward. And if you tweeted about it I wouldn’t know. Just sayin.

Reviewer #2 AKA Dan Grossman (lol blinding who does that) insists that I point out here that our graphs have not one but two non-zero origins. You’re absolutely right Dan, that’s because it looks way better than admitting to ourselves that we’re all just here battling over 2-3% mAP. But here are the requested graphs. I threw in one with FPS too because we look just like super good when we plot on FPS.

Figure 4. Zero-axis charts are probably more intellectually honest… and we can still screw with the variables to make ourselves look good!

also known as (Aka, AKA, or a.k.a.)
rebuttal [rɪˈbʌtl]: n. refutation; counter-argument; disproof
commenter ['kɔmentər]: n. critic; commentator
Reddit (/ˈrɛdɪt/) is an American social news aggregation, web content rating, and discussion website.
aggregation [ˌæɡrɪˈɡeɪʃn]: n. gathering; collection; aggregate
shout [ʃaʊt]: v. to call out; to speak loudly; n. a loud call
hallway [ˈhɔːlweɪ]: n. corridor; entrance hall
legend [ˈledʒənd]: n. a celebrated figure; a myth; a caption or key (of a chart); an inscription
tweet [twiːt]: n. the chirp of a small bird; a post on Twitter; vi. to chirp
throw [θrəʊ]: v. to toss; to hurl; n. a toss; a gamble
battle [ˈbætl]: n. fight; struggle; vi. to struggle; vt. to fight against
laugh(ing) out loud (LOL or lol)


Reviewer #4 AKA JudasAdventus on Reddit writes “Entertaining read but the arguments against the MSCOCO metrics seem a bit weak”. Well, I always knew you would be the one to turn on me Judas. You know how when you work on a project and it only comes out alright so you have to figure out some way to justify how what you did actually was pretty cool? I was basically trying to do that and I lashed out at the COCO metrics a little bit. But now that I’ve staked out this hill I may as well die on it.

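lash [læʃ]: vt. to whip; to strike; to bind; to criticize harshly; vi. to strike out; n. a whip stroke; an eyelash; a rebuke
judas [ˈdʒuːdəs]: n. a peephole (in a door); (Judas) a betrayer of friends; a traitor
stake [steɪk]: n. a post; a wager; a prize; vt. to back financially; to fasten to a stake; to wager; vi. to bet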


See here’s the thing, mAP is already sort of broken so an update to it should maybe address some of the issues with it or at least justify why the updated version is better in some way. And that’s the big thing I took issue with was the lack of justification. For PASCAL VOC, the IOU threshold was “set deliberately low to account for inaccuracies in bounding boxes in the ground truth data” [2]. Does COCO have better labelling than VOC? This is definitely possible since COCO has segmentation masks maybe the labels are more trustworthy and thus we aren’t as worried about inaccuracy. But again, my problem was the lack of justification.

deliberately [dɪˈlɪbərətli]: adv. intentionally; carefully; cautiously
segmentation [ˌseɡmenˈteɪʃn]: n. division into parts; cell division
justification [ˌdʒʌstɪfɪˈkeɪʃn]: n. reason; defence; vindication


The COCO metric emphasizes better bounding boxes but that emphasis must mean it de-emphasizes something else, in this case classification accuracy. Is there a good reason to think that more precise bounding boxes are more important than better classification? A misclassified example is much more obvious than a bounding box that is slightly shifted.

mAP is already screwed up because all that matters is per-class rank ordering. For example, if your test set only has these two images then according to mAP two detectors that produce these results are JUST AS GOOD:

Figure 5. These two hypothetical detectors are perfect according to mAP over these two images. They are both perfect. Totally equal.
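
The rank-ordering point can be reproduced with a toy AP calculation. The sketch below uses the non-interpolated form of average precision (the mean of the precision at each true positive) and invented detections for a single class; it is not the figure's data and not the official PASCAL VOC or COCO evaluation code.

```python
def average_precision(detections, num_truths):
    """AP for one class from (confidence, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            hits += 1
            total += hits / rank       # precision at this true positive
    return total / num_truths

# Hypothetical detections for one class with 2 ground truth objects.
confident = [(0.99, True), (0.98, True), (0.10, False)]
hesitant = [(0.52, True), (0.48, True), (0.42, False)]
misranked = [(0.90, False), (0.80, True), (0.70, True)]

print(average_precision(confident, 2))   # 1.0
print(average_precision(hesitant, 2))    # 1.0
print(average_precision(misranked, 2) < 1.0)  # True
```

The confident and the hesitant detector get the same AP because only the ordering of scores within the class matters; the score drops only when a false positive outranks a true one.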

Now this is OBVIOUSLY an over-exaggeration of the problems with mAP but I guess my newly retconned point is that there are such obvious discrepancies between what people in the “real world” would care about and our current metrics that I think if we’re going to come up with new metrics we should focus on these discrepancies. Also, like, it’s already mean average precision, what do we even call the COCO metric, average mean average precision?

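exaggeration [ɪɡˌzædʒəˈreɪʃn]: n. overstatement; an exaggerated remark
discrepancy [dɪˈskrepənsi]: n. inconsistency; contradiction; difference
retcon: to revise an earlier account retroactively
screw [skruː]: vt. to twist; to squeeze; to coerce; n. a threaded fastener; a miser; vi. to turn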


Here’s a proposal, what people actually care about is given an image and a detector, how well will the detector find and classify objects in the image. What about getting rid of the per-class AP and just doing a global average precision? Or doing an AP calculation per-image and averaging over that?

Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.

stupid [ˈstjuːpɪd]: adj. foolish; dull; tedious; n. a fool


## References

[3] DSSD: Deconvolutional Single Shot Detector
[8] Feature Pyramid Networks for Object Detection
[11] SSD: Single Shot MultiBox Detector

## WORDBOOK

mean Average Precision (mAP)
floating point operations per second (FLOPS)
frame rate or frames per second (FPS)
hertz (Hz): unit of frequency
billion (Bn)
operations (Ops)
configuration (cfg)
AP small (AP_S)
AP medium (AP_M)
AP large (AP_L)
Feature Pyramid Network (FPN)
