# A Large-Scale Car Dataset for Fine-Grained Categorization and Verification

Linjie Yang, Ping Luo, Chen Change Loy, Xiaoou Tang

Department of Information Engineering, The Chinese University of Hong Kong
Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

fine-grained ['fain-'ɡreind]：adj. 细粒的，有细密纹理的
categorization [,kætəgərɪ'zeʃən]：n. 分类，分门别类，编目方法
verification [,verɪfɪ'keɪʃ(ə)n]：n. 确认，查证，核实
The Chinese University of Hong Kong，CUHK：香港中文大学，港中大


## Abstract

This paper aims to highlight vision-related tasks centered around the “car”, which has been largely neglected by the vision community in comparison to other objects. We show that there are still many interesting car-related problems and applications that are not yet well explored and researched. To facilitate future car-related research, in this paper we present our ongoing effort in collecting a large-scale dataset, “CompCars”, that covers not only different car views, but also their different internal and external parts, and rich attributes. Importantly, the dataset is constructed with a cross-modality nature, containing a surveillance-nature set and a web-nature set. We further demonstrate a few important applications exploiting the dataset, namely car model classification, car model verification, and attribute prediction. We also discuss specific challenges of the car-related problems and other potential applications that are worth further investigation. The latest dataset can be downloaded at http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html

neglect [nɪ'glekt]：vt. 疏忽，忽视，忽略 n. 疏忽，忽视，怠慢
explore [ɪk'splɔː; ek-]：vt. 探索，探测，探险 vi. 探索，探测，探险
facilitate [fə'sɪlɪteɪt]：vt. 促进，帮助，使容易


http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html

Update: This technical report serves as an extension to our earlier work [28] published in CVPR 2015. The experiments shown in Sec. 5 gain better performance on all three tasks, i.e. car model classification, attribute prediction, and car model verification, thanks to more training data and better network structures. The experimental results can serve as baselines in any later research works. The settings and the train/test splits are provided on the project page.

split [splɪt]：vt. 分离，使分离，劈开，离开，分解 vi. 离开，被劈开，断绝关系 n. 劈开，裂缝 adj. 劈开的


Update 2: This update provides preliminary experiment results for fine-grained classification on the surveillance data of CompCars. The train/test splits are provided in the updated dataset. See details in Section 6.

preliminary [prɪ'lɪmɪn(ə)rɪ]：n. 准备，预赛，初步措施 adj. 初步的，开始的，预备的


## 1. Introduction

Cars represent a revolution in mobility and convenience, bringing us the flexibility of moving from place to place. The societal benefits (and costs) are far-reaching. Cars are now indispensable to modern life as a vehicle for transportation. In many places, the car is also viewed as a tool to help project someone’s economic status, or as a reflection of our economic stratification. In addition, the car has evolved into a subject of interest amongst many car enthusiasts in the world. In general, the demand for cars has shifted over the years to cover not only practicality and reliability, but also high comfort and design. The enormous number of car designs and car models makes the car a rich object class, which can potentially foster more sophisticated and robust computer vision models and algorithms.

revolution [revə'luːʃ(ə)n]：n. 革命，旋转，运行，循环
mobility [məʊ'bɪlətɪ]：n. 移动性，机动性，迁移率
convenience [kən'viːnɪəns]：n. 便利，厕所，便利的事物
flexibility [,fleksɪ'bɪlɪtɪ]：n. 灵活性，弹性，适应性
benefit ['benɪfɪt]：n. 利益，好处，救济金 vt. 有益于，对...有益 vi. 受益，得益
transportation [trænspɔː'teɪʃ(ə)n; trɑːns-]：n. 运输，运输系统，运输工具，流放
stratification [,strætɪfɪ'keɪʃən]：n. 层理，成层
evolve [ɪ'vɒlv]：vt. 发展，进化，使逐步形成，推断出 vi. 发展，进展，进化，逐步形成
amongst [ə'mʌŋst]：prep. 在...之中，在...当中 (among)
enthusiast [ɪn'θjuːzɪæst; en-]：n. 狂热者，热心家
practicality [,præktɪ'kælɪtɪ]：n. 实用性，实际性，实际，实例
reliability [rɪ,laɪə'bɪlətɪ]：n. 可靠性
sophisticate [sə'fɪstɪkeɪt]：vt. 弄复杂，使变得世故，曲解 n. 久经世故的人，精通者 vi. 诡辩
cue [kjuː]：n. 提示，暗示，线索 vt. 给...暗示


Figure 1. (a) Can you predict the maximum speed of a car with only a photo? Get some cues from the examples. (b) The two SUV models are very similar in their side views, but are rather different in the front views. (c) The evolution of the headlights of two car models from 2006 to 2014 (left to right).

Cars present several unique properties that other objects cannot offer, which provide more challenges and facilitate a range of novel research topics in object categorization. Specifically, cars come in a larger number of models than most other categories, enabling a more challenging fine-grained task. In addition, cars yield large appearance differences in their unconstrained poses, which demands viewpoint-aware analyses and algorithms (see Fig. 1(b)). Importantly, the car category has a unique hierarchy with three levels from top to bottom: make, model, and released year. This structure indicates a direction to address the fine-grained task in a hierarchical way, which is only discussed by limited literature [17]. Apart from the categorization task, cars reveal a number of interesting computer vision problems. Firstly, different design styles are applied by different car manufacturers and in different years, which opens the door to fine-grained style analysis [14] and fine-grained part recognition (see Fig. 1(c)). Secondly, the car is an attractive topic for attribute prediction. In particular, cars have distinctive attributes such as car class, seating capacity, number of axles, maximum speed and displacement, which can be inferred from the appearance of the cars (see Fig. 1(a)). Lastly, in comparison to human face verification [22], car verification, which aims at verifying whether two cars belong to the same model, is an interesting and under-researched problem. The unconstrained viewpoints make car verification arguably more challenging than traditional face verification.

facilitate [fə'sɪlɪteɪt]：vt. 促进，帮助，使容易
categorization [,kætəgərɪ'zeʃən]：n. 分类，分门别类，编目方法
unconstraint [,ʌnkən'strent]：n. 自愿，无拘无束，自由自在
displacement [dɪs'pleɪsm(ə)nt]：n. 取代，位移，排水量
reveal [rɪ'viːl]：vt. 显示，透露，揭露，泄露 n. 揭露，暴露，门侧，窗侧
axle ['æks(ə)l]：n. 车轴，轮轴


Automated car model analysis, particularly fine-grained car categorization and verification, can be used for innumerable purposes in intelligent transportation systems, including regulation, description and indexing. For instance, fine-grained car categorization can be exploited to inexpensively automate and expedite toll payment in the lanes, based on different rates for different types of vehicles. In video surveillance applications, car verification from appearance helps track a car over a multi-camera network when car plate recognition fails. In post-event investigation, similar cars can be retrieved from the database with car verification algorithms. Car model analysis also bears significant value in personal car consumption. When people are planning to buy cars, they tend to observe cars in the street. Think of a mobile application that can instantly show a user the detailed information of a car once a car photo is taken. Such an application will provide great convenience when people want to know the information of an unrecognized car. Other applications, such as predicting popularity based on the appearance of a car and recommending cars with similar styles, can be beneficial for both manufacturers and consumers.

innumerable [ɪ'njuːm(ə)rəb(ə)l]：adj. 无数的，数不清的
exploit [ˈeksplɔɪt;ɪkˈsplɔɪt]：vt. 开发，开拓，剥削，开采 n. 勋绩，功绩
toll [təʊl]：n. 通行费，代价，钟声，伤亡人数 vt. 征收，敲钟 vi. 鸣钟，征税
bear [beə]：vt. 结果实，开花 vt. 忍受，承受，具有，支撑 n. 熊


Despite the huge research and practical interest, car model analysis has attracted little attention in the computer vision community. We believe the lack of high-quality datasets greatly limits the exploration of the community in this domain. To this end, we collect and organize a large-scale and comprehensive image database called “Comprehensive Cars”, or “CompCars” for short. The “CompCars” dataset is much larger in scale and diversity compared with the current car image datasets, containing 208,826 images of 1,716 car models from two scenarios: web-nature and surveillance-nature. In addition, the dataset is carefully labelled with viewpoints and car parts, as well as rich attributes such as type of car, seat capacity, and door number. The new dataset thus provides a comprehensive platform to validate the effectiveness of a wide range of computer vision algorithms. It is also ready to be utilized for realistic applications and numerous novel research topics. Moreover, the multi-scenario nature enables the use of the dataset for cross-modality research. The detailed description of CompCars is provided in Section 3.

realistic [rɪə'lɪstɪk]：adj. 现实的，现实主义的，逼真的，实在论的


To validate the usefulness of the dataset and to encourage the community to explore more novel research topics, we demonstrate several interesting applications with the dataset, including car model classification and verification based on convolutional neural networks (CNN) [13]. Another interesting task is to predict attributes from novel car models (see details in Section 4.2). The experiments reveal several challenges specific to the car-related problems. We conclude our analyses with a discussion in Section 7.

encourage [ɪn'kʌrɪdʒ; en-]：vt. 鼓励，怂恿，激励，支持


## 2. Related Work

Most previous car model research focuses on car model classification. Zhang et al. [31] propose an evolutionary computing framework to fit a wireframe model to the car in an image. The wireframe model is then employed for car model recognition. Hsiao et al. [7] construct 3D space curves using 2D training images, then match the 3D curves to 2D image curves using a 3D view-based alignment technique. The car model is finally determined from the alignment result. Lin et al. [15] optimize 3D model fitting and fine-grained classification jointly. All these works are restricted to a small number of car models. Recently, Krause et al. [10] propose to extract 3D car representations for classifying 196 car models. That experiment is the largest in scale that we are aware of. Car model classification is a fine-grained categorization task. In contrast to general object classification, fine-grained categorization aims at recognizing the subcategories within one object class. Following this line of research, many studies have proposed different datasets on a variety of categories: birds [25], dogs [16], cars [10], flowers [19], etc. But all these datasets are limited by their scales and subcategory numbers.

wireframe ['waiəfreim]：n. 线框
curve [kɜːv]：n. 曲线，弯曲，曲线球，曲线图表 vt. 弯，使弯曲 vi. 成曲形 adj. 弯曲的，曲线形的


To our knowledge, there is no previous attempt on the car model verification task. Closely related to car model verification, face verification has been a popular topic [8, 12, 22, 32]. The recent deep learning based algorithms [22] first train a deep neural network on human identity classification, then train a verification model with the feature extracted from the deep neural network. Joint Bayesian [2] is a widely-used verification model that models two faces jointly with an appropriate prior on the face representation. We adopt Joint Bayesian as a baseline model in car model verification.

prior ['praɪə]：adj. 优先的，在先的，在前的 n. 小修道院院长，大修道院的副院长，会长，犯罪前科


Attribute prediction of humans has been a popular research topic in recent years [1, 4, 12, 29]. However, a large portion of the labeled attributes in the current attribute datasets [4], such as long hair and short pants, lack strict criteria, which causes annotation ambiguities [1]. The attributes with ambiguities will potentially harm the effectiveness of evaluation on related datasets. In contrast, the attributes provided by CompCars (e.g. maximum speed, door number, seat capacity) all have strict criteria since they are set by the car manufacturers. The dataset is thus advantageous over the current datasets in terms of attribute validity.

portion ['pɔːʃ(ə)n]：n. 部分，一份，命运 vt. 分配，给...嫁妆
short pants：短裤
criteria [kraɪ'tɪərɪə]：n. 标准，条件 (criterion 的复数)
ambiguity [æmbɪ'gjuːɪtɪ]：n. 含糊，不明确，暧昧，模棱两可的话


Other car-related research includes detection [23], tracking [18] [26], joint detection and pose estimation [6, 27], and 3D parsing [33]. Fine-grained car models are not explored in these studies. Previous research related to car parts includes car logo recognition [20] and car style analysis based on mid-level features [14].

Similar to CompCars, the Cars dataset [10] also targets fine-grained tasks on the car category. Apart from its larger scale, our CompCars dataset offers several significant benefits in comparison to the Cars dataset. First, our dataset contains car images diversely distributed in all viewpoints (annotated by front, rear, side, front-side, and rear-side), while the Cars dataset mostly consists of front-side car images. Second, our dataset contains aligned car part images, which can be utilized for many computer vision algorithms that demand precise alignment. Third, our dataset provides rich attribute annotations for each car model, which are absent in the Cars dataset.

diversely：adv. 不同地，各色各样地


## 3. Properties of CompCars

The CompCars dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The images of the web-nature are collected from car forums, public websites, and search engines. The images of the surveillance-nature are collected by surveillance cameras. The data of these two scenarios are widely used in the real-world applications. They open the door for cross-modality analysis of cars. In particular, the web-nature data contains 163 car makes with 1,716 car models, covering most of the commercial car models in the recent ten years. There are a total of 136,727 images capturing the entire cars and 27,618 images capturing the car parts, where most of them are labeled with attributes and viewpoints. The surveillance-nature data contains 44,481 car images captured in the front view. Each image in the surveillance-nature partition is annotated with bounding box, model, and color of the car. Fig. 2 illustrates some examples of surveillance images, which are affected by large variations from lightings and haze. Note that the data from the surveillance-nature are significantly different from the web-nature data in Fig. 1, suggesting the great challenges in cross-scenario car analysis. Overall, the CompCars dataset offers four unique features in comparison to existing car image databases, namely car hierarchy, car attributes, viewpoints, and car parts.
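The paper A Large-Scale Car Dataset for Fine-Grained Categorization and Verification introduces the Comprehensive Cars (CompCars) dataset. CompCars is a large, category-rich public dataset for evaluating fine-grained vehicle recognition. Its images were collected from the web and from surveillance devices. The web images comprise 136,727 whole-car images covering 1,716 car models from 163 car makes, plus 27,618 car part images. The surveillance images comprise 44,481 front-view car images covering 281 car models.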

modality [mə(ʊ)'dælɪtɪ]：n. 形式，形态，程序，物理疗法，主要的感觉
hierarchy ['haɪərɑːkɪ]：n. 层级，等级制度
forum ['fɔːrəm]：n. 论坛，讨论会，法庭，公开讨论的广场


Figure 2. Sample images of the surveillance-nature data. The images have large appearance variations due to the varying conditions of light, weather, traffic, etc.

Car Hierarchy The car models can be organized into a large tree structure consisting of three layers, namely car make, car model, and year of manufacture, from top to bottom as depicted in Fig. 3. The complexity is further compounded by the fact that each car model can be produced in different years, yielding subtle differences in their appearances. For instance, three versions of “Audi A4L” were produced from 2009 to 2011.
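The three-level hierarchy described above can be pictured as a nested mapping. The following is a minimal sketch, not the dataset's actual file layout: the function name `build_hierarchy` and every record except the three Audi A4L years mentioned in the text are hypothetical.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Group (make, model, year) triples into a make -> model -> sorted years tree."""
    tree = defaultdict(lambda: defaultdict(set))
    for make, model, year in records:
        tree[make][model].add(year)
    # Convert to plain dicts with sorted year lists for easy inspection.
    return {
        make: {model: sorted(years) for model, years in models.items()}
        for make, models in tree.items()
    }


# Only the Audi A4L years (2009 to 2011) come from the text; the rest is made up.
records = [
    ("Audi", "A4L", 2009),
    ("Audi", "A4L", 2010),
    ("Audi", "A4L", 2011),
    ("Audi", "A6L", 2012),       # hypothetical year
    ("BMW", "5 Series", 2013),   # hypothetical year
]

hierarchy = build_hierarchy(records)
print(hierarchy["Audi"]["A4L"])
```

Keeping the year as the leaf level makes it easy to either merge all years of a model into one class, or treat each year as its own class for a harder task.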

depict [dɪ'pɪkt]：vt. 描述，描画
compound ['kɒmpaʊnd]：n. 化合物，混合物，复合词 adj. 复合的，混合的 v. 合成，混合，恶化，加重，和解，妥协


Figure 3. The tree structure of car model hierarchy. Several car models of Audi A4L in different years are also displayed.

Car Attributes Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car. These attributes provide rich information while learning the relations or similarities between different car models. For example, we define twelve types of cars, which are MPV, SUV, hatchback, sedan, minibus, fastback, estate, pickup, sports, crossover, convertible, and hardtop convertible, as shown in Fig. 4. Furthermore, these attributes can be partitioned into two groups: explicit and implicit attributes. The former group contains door number, seat number, and car type, which are represented by discrete values, while the latter group contains maximum speed and displacement (volume of an engine’s cylinders), represented by continuous values. Humans can easily tell the numbers of doors and seats from a car’s proper viewpoint, but hardly recognize its maximum speed and displacement. We conduct interesting experiments to predict these attributes in Section 4.2.

A minivan, people carrier, MPV (multi-purpose vehicle) or MUV (multi-utility vehicle) is a vehicle classification describing a high-roof vehicle with a flexible interior layout.
Sport-utility (vehicle), SUV or sport-ute is an automotive classification, typically a kind of station wagon / estate car with off-road vehicle features like raised ground clearance and ruggedness, and available four-wheel drive.
sedan [sɪ'dæn]：n. 轿车，轿子
fastback ['fɑːs(t)bæk]：n. 长坡度的车顶，斜背式车身小汽车，快速返回
estate [ɪ'steɪt; e-]：n. 房地产，财产，身份
pickup ['pɪkʌp]：n. 收集，整理，小卡车，拾起，搭车者，偶然结识者
crossover ['krɒsəʊvə]：n. 交叉，天桥，转线路，变向运球过人
hardtop ['hɑːdtɒp]：n. 有金属顶盖的汽车，室内电影院 adj. 有金属顶盖的 vt. 给...铺硬质路面


Figure 4. Each image displays a car from the 12 car types. The corresponding model names and car types are shown below the images.

Benz [benz]：n. 奔驰
Audi['ɔdi]：n. 奥迪公司，奥迪汽车
Chevrolet[ʃɛvrə'le]：n. 美国雪佛兰牌汽车
Mitsubishi[mi'tsubiʃi]：n. 三菱
Foton：n. 北汽福田汽车股份有限公司
Skoda：斯柯达
Dodge [dɑdʒ]：n. 躲闪，托词 vt. 躲避，避开 vi. 躲避，避开 n. 道奇
Nissan ['nɪsn]：n. 尼桑
Volvo ['vɔlvəu]：n. 沃尔沃


Viewpoints We also label five viewpoints for each car model, including front (F), rear (R), side (S), front-side (FS), and rear-side (RS). These viewpoints are labeled by several professional annotators. The quantity distribution of the labeled car images is shown in Table 1. Note that the numbers of viewpoint images are not balanced among different car models, because the images of some less popular car models are difficult to collect.

rear [rɪə]：vt. 培养，树立，栽种 vi. 暴跳，高耸 adv. 向后，在后面 adj. 后方的，后面的，背面的 n. 后面，屁股，后方部队


Car Parts We collect images capturing the eight car parts for each car model, including four exterior parts (i.e. headlight, taillight, fog light, and air intake) and four interior parts (i.e. console, steering wheel, dashboard, and gear lever). These images are roughly aligned for the convenience of further analysis. A summary and some examples are given in Table 2 and Fig. 5 respectively.

exterior [ɪk'stɪərɪə; ek-]：adj. 外部的，表面的，外在的 n. 外部，表面，外型，外貌
gear lever：变速杆
steering wheel：方向盘，驾驶盘，舵轮
air intake：进气口，吸气，吸气管
fog light：雾灯，雾天灯
taillight ['teɪlaɪt]：n. 尾灯，后灯


Table 1. Quantity distribution of the labeled car images in different viewpoints.

Table 2. Quantity distribution of the labeled car part images.

Figure 5. Each column displays 8 car parts from a car model. The corresponding car models are Buick GL8, Peugeot 207 hatchback, Volkswagen Jetta, and Hyundai Elantra from left to right, respectively.

Buick [bju:ɨk]：n. 别克
Peugeot：n. 法国标致
Volkswagen ['fɔ:lksvɑ:gən]：n. 大众汽车
Jetta：n. 捷达
Hyundai：n. 现代
Elantra：n. 伊兰特


## 4. Applications

In this section, we study three applications using CompCars, including fine-grained car classification, attribute prediction, and car verification. We select 78,126 images from the CompCars dataset and divide them into three subsets without overlaps. The first subset (Part-I) contains 431 car models with a total of 30,955 images capturing the entire car and 20,349 images capturing car parts. The second subset (Part-II) consists of 111 models with 4,454 images in total. The last subset (Part-III) contains 1,145 car models with 22,236 images. Fine-grained car classification is conducted using images in the first subset. For attribute prediction, the models are trained on the first subset but tested on the second one. The last subset is utilized for car verification.

subset ['sʌbset]：n. 子集，子设备，小团体


We investigate the above potential applications using Convolutional Neural Network (CNN), which achieves great empirical successes in many computer vision problems, such as object classification [11], detection [5], face alignment [30], and face verification [22, 32]. Specifically, we employ the Overfeat [21] model, which is pretrained on ImageNet classification task [3], and fine-tuned with the car images for car classification and attribute prediction. For car model verification, the fine-tuned model is employed as a feature extractor.

empirical [em'pɪrɪk(ə)l; ɪm-]：adj. 经验主义的，完全根据经验的，实证的


### 4.1. Fine-Grained Classification

We classify the car images into 431 car models. For each car model, the car images produced in different years are considered as a single category. One may treat them as different categories, leading to a more challenging problem because their differences are relatively small. Our experiments have two settings: fine-grained classification with the entire car images, and with the car parts. For both settings, we split the data evenly, using one half for training and the other half for testing. Car model labels are regarded as the training target, and logistic loss is used to fine-tune the Overfeat model.

#### 4.1.1 The Entire Car Images

We compare the recognition performances of the CNN models, which are fine-tuned with car images in specific viewpoints and all the viewpoints respectively, denoted as “front (F)”, “rear (R)”, “side (S)”, “front-side (FS)”, “rear-side (RS)”, and “All-View”. The performances of these six models are summarized in Table 3, where “FS” and “RS” achieve better performances than the other single-viewpoint models. Surprisingly, the “All-View” model yields the best performance, although it does not leverage the viewpoint information. This result reveals that the CNN model is capable of learning discriminative representations across different views. To verify this observation, we visualize the car images that trigger high responses with respect to each neuron in the last fully-connected layer. As shown in Fig. 6, these neurons capture car images of specific car models across different viewpoints.

trigger ['trɪgə]：vt. 引发，引起，触发 vi. 松开扳柄 n. 扳机，触发器，制滑机
neuro ['nju:rəʊ]：n. 神经


Table 3. Fine-grained classification results for the models trained on car images. Top-1 and Top-5 denote the top-1 and top-5 accuracy for car model classification, respectively. Make denotes the make level classification accuracy.

Figure 6. Images with the highest responses from two sample neurons. Each row corresponds to a neuron.

Several challenging cases are given in Fig. 7, where the images on the left hand side are the testing images and the images on the right hand side are the examples of the wrong predictions (of the “All-View” model). We found that most of the wrong predictions belong to the same car makes as the test images. We report the “top-1” accuracies of car make classification in the last row of Table 3, where the “All-View” model obtains a reasonably good result, indicating that a coarse-to-fine (i.e. from car make to model) classification is possible for fine-grained car recognition.
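The Top-1, Top-5, and make-level accuracies reported in Table 3 can be computed along the following lines. This is a minimal sketch under assumptions: the score matrix, the labels, and the `model_to_make` mapping are toy values invented for illustration, not results from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def make_accuracy(pred_models, true_models, model_to_make):
    """Fraction of samples whose predicted model belongs to the correct make."""
    correct = sum(
        model_to_make[p] == model_to_make[t]
        for p, t in zip(pred_models, true_models)
    )
    return correct / len(true_models)


# Toy scores for three images over three car models (hypothetical values).
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [1, 2, 0]
model_to_make = {0: "Audi", 1: "BMW", 2: "Audi"}  # hypothetical mapping

top1_preds = [max(range(len(row)), key=lambda c: row[c]) for row in scores]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
print(make_accuracy(top1_preds, labels, model_to_make))
```

In this toy case two of the three top-1 predictions are wrong at the model level but still land in the correct make, which mirrors the observation that most errors stay within the same make.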

coarse-to-fine：由粗到精
Lexus：n. 雷克萨斯


Figure 7. Sample test images that are mistakenly predicted as another model in their makes. Each row displays two samples and each sample is a test image followed by another image showing its mistakenly predicted model. The corresponding model name is shown under each image.

To observe the learned feature space of the “All-View” model, we project the features extracted from the last fully-connected layer to a two-dimensional embedding space using multi-dimensional scaling. Fig. 8 visualizes the projected features of twelve car models, where the images are chosen from different viewpoints. We observe that features from different models are separable in the 2D space and features of similar models are closer than those of dissimilar models. For instance, the distances between “BMW 5 Series” and “BMW 7 Series” are smaller than those between “BMW 5 Series” and “Chevrolet Captiva”.
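A projection of this kind can be sketched with classical (Torgerson) multi-dimensional scaling on Euclidean distances. The text does not say which MDS variant was used, so the choice of classical MDS here is an assumption, and the random features below merely stand in for CNN activations.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x."""
    sq = np.sum(x ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)


def classical_mds(features, dims=2):
    """Embed the rows of `features` into `dims` dimensions with classical MDS."""
    d2 = pairwise_sq_dists(features)
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]         # keep the largest ones
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


# Hypothetical stand-in for fully-connected layer features: 6 samples, 5 dims,
# constructed to lie in a 2D subspace so that two dimensions lose nothing.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
embedding = classical_mds(features, dims=2)
print(embedding.shape)
```

For real 4,096-dimensional features the 2D embedding only approximates the original distances, so nearby points in the plot suggest, rather than prove, similarity in feature space.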

Captiva：n. 科帕奇


Figure 8. The features of 12 car models that are projected to a two-dimensional embedding using multi-dimensional scaling. Most features are separated in the 2D plane with regard to different models. Features extracted from similar models such as BMW 5 Series and BMW 7 Series are close to each other. Best viewed in color.

We also conduct a cross-modality experiment, where the CNN model fine-tuned by the web-nature data is evaluated on the surveillance-nature data. Fig. 9 illustrates some predictions, suggesting that the model may account for data variations in a different modality to a certain extent. This experiment indicates that the features obtained from the web-nature data have potential to be transferred to data in the other scenario.

modality [mə(ʊ)'dælɪtɪ]：n. 形式，形态，程序，物理疗法，主要的感觉
variation [veərɪ'eɪʃ(ə)n]：n. 变化，变异，变种


Figure 9. Top-5 predicted classes of the classification model for eight cars in the surveillance-nature data. Below each image is the ground truth class and the probabilities for the top-5 predictions with the correct class labeled in red. Best viewed in color.

#### 4.1.2 Car Parts

Car enthusiasts are able to distinguish car models by examining the car parts. We investigate if the CNN model can mimic this strength. We train a CNN model using images from each of the eight car parts. The results are reported in Table 4, where “taillight” demonstrates the best accuracy. We visualize taillight images that have high responses with respect to each neuron in the last fully-connected layer. Fig. 10 displays such images with respect to two neurons. “Taillight” wins among the different car parts, most likely because of its relatively distinctive designs and the model name printed close to the taillight, which is a very informative feature for the CNN model.

We also combine predictions using the eight car part models by voting strategy. This strategy significantly improves the performance due to the complementary nature of different car parts.
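One simple way to realize such a voting strategy is a majority vote over the top-1 predictions of the eight part models. The text does not specify the exact voting rule, so the majority rule, the tie-breaking, and the predictions below are all assumptions made for illustration.

```python
from collections import Counter


def vote(predictions):
    """Return the label predicted most often; ties go to the label seen first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical top-1 predictions of the eight part models for one car.
part_predictions = {
    "headlight": "Audi A4L",
    "taillight": "Audi A4L",
    "fog light": "Audi A6L",
    "air intake": "Audi A4L",
    "console": "Audi A6L",
    "steering wheel": "Audi A4L",
    "dashboard": "Audi A6L",
    "gear lever": "Audi A4L",
}

combined = vote(list(part_predictions.values()))
print(combined)
```

An alternative that is also plausible is to average the class probabilities of the part models and take the arg max, which uses the confidence of each model rather than only its top choice.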

enthusiast [ɪn'θjuːzɪæst; en-]：n. 狂热者，热心家
neuron ['njʊərɒn]：n. 神经元，神经单位
mimic ['mɪmɪk]：vt. 模仿，摹拟 n. 效颦者，模仿者，仿制品，小丑 adj. 模仿的，模拟的，假装的
strength [streŋθ; streŋkθ]：n. 力量，力气，兵力，长处
vote [vəʊt]：n. 投票，选举，选票，得票数 vt. 提议，使投票，投票决定，公认 vi. 选举，投票


Figure 10. Taillight images with the highest responses from two sample neurons. Each row corresponds to a neuron.

Table 4. Fine-grained classification results for the models trained on car parts. Top-1 and Top-5 denote the top-1 and top-5 accuracy for car model classification, respectively.

### 4.2. Attribute Prediction

Humans can easily identify car attributes such as the numbers of doors and seats from a proper viewpoint, without knowing the car model. For example, a car image captured in the side view provides sufficient information about the door number and car type, but it is hard to infer these attributes from the frontal view. The appearance of a car also provides hints on the implicit attributes, such as the maximum speed and the displacement. For instance, a car model is probably designed for high-speed driving if it has a low under-pan and a streamlined body.

underpan ['ʌndə,pæn]：n. 底盘
streamline ['striːmlaɪn]：vt. 把...做成流线型，使现代化，组织，使合理化，使简单化 n. 流线，流线型 adj. 流线型的


In this section, we deliberately design a challenging experimental setting for attribute recognition, where the car models presented in the test images are exclusive from the training images. We fine-tune the CNN with the sum-of-square loss to model the continuous attributes, such as “maximum speed” and “displacement”, but a logistic loss to predict the discrete attributes such as “door number”, “seat number”, and “car type”. For example, the “door number” has four states, i.e. {2, 3, 4, 5} doors, while “seat number” also has four states, i.e. {2, 4, 5, > 5} seats. The attribute “car type” has twelve states as discussed in Sec. 3.
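The two kinds of training objectives mentioned above can be sketched as follows. Interpreting the multi-class “logistic loss” as softmax cross-entropy is an assumption, and all numeric values are invented for illustration.

```python
import math


def sum_of_squares_loss(pred, target):
    """Squared-error loss for continuous attributes (max speed, displacement)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def softmax_cross_entropy(logits, label):
    """Multi-class logistic loss for one sample, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


# "door number" has four states {2, 3, 4, 5}; the class index 2 means 4 doors.
door_states = [2, 3, 4, 5]
door_logits = [0.1, 0.2, 2.0, 0.3]      # hypothetical network outputs
door_loss = softmax_cross_entropy(door_logits, label=door_states.index(4))

# Continuous targets: (maximum speed in km/h, displacement in litres), hypothetical.
regression_loss = sum_of_squares_loss(pred=[205.0, 1.8], target=[210.0, 2.0])

print(door_loss, regression_loss)
```

In practice the continuous targets would likely be normalized before applying the squared-error loss, since maximum speed and displacement live on very different scales; the text does not state whether this was done.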

deliberately [dɪ'lɪbərətli]；adv. 故意地，谨慎地，慎重地
exclusive [ɪk'skluːsɪv; ek-]：adj. 独有的，排外的，专一的 n. 独家新闻，独家经营的项目，排外者


To study the effectiveness of different viewpoints for attribute prediction, we train CNN models for different viewpoints separately. Table 5 summarizes the results, where the “mean guess” represents the errors computed by using the mean of the training set as the prediction. We observe that the performances of “maximum speed” and “displacement” are insensitive to viewpoints. However, for the explicit attributes, the best accuracy is obtained under the side view. We also found that the implicit attributes are more difficult to predict than the explicit attributes. Several test images and their attribute predictions are provided in Fig. 11.

insensitive [ɪn'sensɪtɪv]：adj. 感觉迟钝的，对...没有感觉的


Table 5. Attribute prediction results for the five single viewpoint models. For the continuous attributes (maximum speed and displacement), we display the mean difference from the ground truth. For the discrete attributes (door and seat number, car type), we display the classification accuracy. Mean guess denotes the mean error with a prediction of the mean value on the training set.
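The “mean guess” baseline in Table 5 can be illustrated with a few lines of code. This sketch assumes the reported error is the mean absolute difference, which the caption suggests but does not state explicitly; the speed values are hypothetical.

```python
def mean_guess_error(train_values, test_values):
    """Mean absolute error when every test sample is predicted as the training mean."""
    guess = sum(train_values) / len(train_values)
    return sum(abs(v - guess) for v in test_values) / len(test_values)


# Hypothetical maximum speeds in km/h.
train_speeds = [180, 200, 220]
test_speeds = [190, 230]

print(mean_guess_error(train_speeds, test_speeds))  # 20.0
```

A learned model is only informative for a continuous attribute if its error falls clearly below this baseline.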

continuous [kən'tɪnjʊəs]：adj. 连续的，持续的，继续的，连绵不断的


Figure 11. Sample attribute predictions for four car images. The continuous predictions of maximum speed and displacement are rounded to nearest proper values.

### 4.3. Car Verification

In this section, we perform car verification following the pipeline of face verification [22]. In particular, we adopt the classification model in Section 4.1.1 as a feature extractor of the car images, and then apply Joint Bayesian [2] to train a verification model on the Part-II data. Finally, we test the performance of the model on the Part-III data, which includes 1,145 car models. The test data is organized into three sets, each of which has different difficulty, i.e. easy, medium, and hard. Each set contains 20,000 pairs of images, including 10,000 positive pairs and 10,000 negative pairs. Each image pair in the “easy set” is selected from the same viewpoint, while each pair in the “medium set” is selected from a pair of random viewpoints. Each negative pair in the “hard set” is chosen from the same car make.
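The constraints that define the three test sets can be expressed as predicates over image pairs. The sketch below is an illustration only: the `CarImage` record, the `sample_pairs` helper, and the toy image pool are hypothetical and do not reflect how the authors actually sampled their 20,000 pairs per set.

```python
import random
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CarImage:
    name: str        # hypothetical image identifier
    make: str
    model: str
    viewpoint: str   # one of F, R, S, FS, RS


def same_model(a, b):
    return (a.make, a.model) == (b.make, b.model)


def sample_pairs(images, keep, n, seed=0):
    """Sample up to n unordered image pairs for which keep(a, b) is true."""
    candidates = [(a, b) for a, b in combinations(images, 2) if keep(a, b)]
    return random.Random(seed).sample(candidates, min(n, len(candidates)))


images = [
    CarImage("a1", "Audi", "A4L", "F"),
    CarImage("a2", "Audi", "A4L", "S"),
    CarImage("a3", "Audi", "A4L", "F"),
    CarImage("b1", "Audi", "A6L", "F"),
    CarImage("c1", "BMW", "5 Series", "S"),
]

# Easy set: both images share a viewpoint (positive pairs shown here).
easy_positive = sample_pairs(
    images, lambda a, b: same_model(a, b) and a.viewpoint == b.viewpoint, n=10)
# Medium set: viewpoints are unconstrained (negative pairs shown here).
medium_negative = sample_pairs(images, lambda a, b: not same_model(a, b), n=10)
# Hard set: negative pairs come from the same make but different models.
hard_negative = sample_pairs(
    images, lambda a, b: a.make == b.make and not same_model(a, b), n=10)

print(len(easy_positive), len(medium_negative), len(hard_negative))
```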

Bayesian ['beɪzɪən]：adj. 贝叶斯定理的


Deeply learned features combined with Joint Bayesian have proven successful for face verification [22]. Joint Bayesian formulates the feature $x$ as the sum of two independent Gaussian variables

$$x = \mu + \epsilon,$$

where $\mu \sim N(0, S_{\mu})$ represents identity information, and $\epsilon \sim N(0, S_{\epsilon})$ the intra-category variations. Joint Bayesian models the joint probability of two objects given the intra- or extra-category variation hypothesis, $P(x_{1}, x_{2} \mid H_{I})$ and $P(x_{1}, x_{2} \mid H_{E})$. These two probabilities are also Gaussian, with covariances

$$\Sigma_{I} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & S_{\mu} \\ S_{\mu} & S_{\mu} + S_{\epsilon} \end{bmatrix}$$

and

$$\Sigma_{E} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & 0 \\ 0 & S_{\mu} + S_{\epsilon} \end{bmatrix},$$

respectively. $S_{\mu}$ and $S_{\epsilon}$ can be learned from data with the EM algorithm. In the testing stage, it calculates the likelihood ratio

$$r(x_{1}, x_{2}) = \log \frac{P(x_{1}, x_{2} \mid H_{I})}{P(x_{1}, x_{2} \mid H_{E})},$$

which has a closed-form solution. The feature extracted from the CNN model has a dimension of 4,096, which is reduced to 20 by PCA. The compressed features are then utilized to train the Joint Bayesian model. During the testing stage, each image pair is classified by comparing the likelihood ratio produced by Joint Bayesian with a threshold. This model is denoted as (CNN feature + Joint Bayesian).
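To make the test-time decision concrete, the log-likelihood ratio can be evaluated directly from the two zero-mean Gaussian densities over the stacked pair $(x_1, x_2)$: the intra-category covariance has $S_\mu + S_\epsilon$ on the diagonal blocks and $S_\mu$ on the off-diagonal blocks, while the extra-category covariance has zero off-diagonal blocks. The sketch below evaluates the densities numerically instead of using the closed-form expression, skips the EM training and PCA steps, and uses made-up covariances, features, and threshold.

```python
import numpy as np


def gaussian_logpdf(x, cov):
    """Log density of a zero-mean multivariate Gaussian with covariance `cov`."""
    d = x.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def log_likelihood_ratio(x1, x2, s_mu, s_eps):
    """log P(x1, x2 | H_I) - log P(x1, x2 | H_E) under the Joint Bayesian model."""
    d = x1.shape[0]
    total = s_mu + s_eps
    zero = np.zeros((d, d))
    sigma_intra = np.block([[total, s_mu], [s_mu, total]])   # same identity
    sigma_extra = np.block([[total, zero], [zero, total]])   # different identities
    pair = np.concatenate([x1, x2])
    return gaussian_logpdf(pair, sigma_intra) - gaussian_logpdf(pair, sigma_extra)


# Hypothetical 2-D example; in the text the features are 20-D after PCA.
s_mu = np.eye(2)            # assumed identity covariance
s_eps = 0.1 * np.eye(2)     # assumed intra-category covariance
x1 = np.array([1.0, -1.0])
same_score = log_likelihood_ratio(x1, x1 + np.array([0.05, -0.05]), s_mu, s_eps)
diff_score = log_likelihood_ratio(x1, -x1, s_mu, s_eps)

threshold = 0.0             # hypothetical; would be tuned on held-out pairs
print(same_score > threshold, diff_score > threshold)
```

A pair is declared to show the same car model when the ratio exceeds the threshold; sweeping the threshold trades false accepts against false rejects.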

The second method combines the CNN features and SVM, denoted as CNN feature + SVM. Here, SVM is a binary classifier using a pair of image features as input. The label ‘1’ represents a positive pair, while ‘0’ represents a negative pair. We extract 100,000 pairs of image features from Part-II data for training.

The performances of the two models are shown in Table 6 and the ROC curves for the “hard set” are plotted in Fig. 13. We observe that CNN feature + Joint Bayesian outperforms CNN feature + SVM by a large margin, indicating the advantage of Joint Bayesian for this task. However, it is less effective in car verification than in face verification, where CNN and Joint Bayesian nearly saturated the LFW dataset [8] and approached human performance [22]. Fig. 12 depicts several pairs of test images as well as their predictions by CNN feature + Joint Bayesian. We observe two major challenges. First, for an image pair of the same model but different viewpoints, it is difficult to obtain the correspondences directly from the raw image pixels. Second, the appearances of different car models of the same car make are extremely similar, so it is difficult to distinguish these car models using the entire images. Part localization or detection is therefore crucial for car verification.

crucial ['kruːʃ(ə)l]: adj. important; decisive; critical


Table 6. The verification accuracy of three baseline models.

Figure 12. Four test samples of verification and their prediction results. All these samples are very challenging and our model obtains correct results except for the last one.

Figure 13. The ROC curves of two baseline models for the hard set.
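For readers who want to reproduce curves like those in Fig. 13 from their own scores, a minimal sketch of the two evaluation quantities follows. It assumes each pair has a real-valued score (for example a likelihood ratio) and a label of 1 for positive or 0 for negative, and that both classes are present. The function names are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points, one per
    distinct score value, sweeping the threshold from strict to lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last pair sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def accuracy_at(scores, labels, threshold):
    """Fraction of pairs classified correctly when 'score > threshold' means positive."""
    hits = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)
```

A verification accuracy such as those reported in Table 6 corresponds to `accuracy_at` for one chosen threshold, while the ROC curve summarizes all thresholds at once.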

## 5. Updated Results: Comparing Different Deep Models

As an extension to the experiments in Section 4, we conduct experiments for fine-grained car classification, attribute prediction, and car verification with the entire dataset and different deep models, in order to explore the different capabilities of the models on these tasks. The split of the dataset into the three tasks is similar to Section 4, where the three subsets contain 431, 111, and 1,145 car models, with 52,083, 11,129, and 72,962 images respectively. The only difference is that we adopt the full set of CompCars in order to establish updated baseline experiments and to make use of the dataset to the largest extent. We keep the testing sets of car verification the same as those in Section 4.3.

We evaluate three network structures, namely AlexNet [11], Overfeat [21], and GoogLeNet [24], for all three tasks. All networks are pre-trained on the ImageNet classification task [3], and fine-tuned with the same mini-batch size, epochs, and learning rates for each task. All predictions of the deep models are produced with a single center crop of the image. We use Caffe [9] as the platform for our experiments. The experimental results can serve as baselines in any later research works. The train/test splits can be downloaded from the CompCars webpage: http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html
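The single-center-crop protocol mentioned above can be illustrated with a short NumPy sketch. The array layout (height, width, channels) and the sizes used in the usage note are assumptions for illustration. The crop size actually used for each network is not stated in this section.

```python
import numpy as np

def center_crop(image, crop_size):
    """Return the central crop_size x crop_size window of an H x W x C array."""
    height, width = image.shape[:2]
    if crop_size > min(height, width):
        raise ValueError("crop is larger than the image")
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    return image[top:top + crop_size, left:left + crop_size]
```

For example, `center_crop(image, 224)` on a 256 × 256 image keeps rows and columns 16 through 239. Predicting from this one crop avoids the cost of averaging over multiple crops or flips.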


### 5.1. Fine-Grained Classification

In this section, we classify the car images into 431 car models as in Section 4.1.1. We divide the data into 70% for training and 30% for testing. We train classification models using car images in all viewpoints. The performances of the three networks are summarized in Table 7. Overfeat beats AlexNet by a large margin of 6.0%, while GoogLeNet beats Overfeat by 3.3% in Top-1 accuracy, which is consistent with their performances on the ImageNet classification task. Given more data, the accuracy rises about 11% for Overfeat compared to Table 3. (Due to the difference in testing sets, the accuracies are not directly comparable; however, a rough estimate is still viable.) We also release the fine-tuned GoogLeNet model on the CompCars webpage.

Table 7. The classification accuracies of three deep models.
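The Top-1 accuracy discussed above is the fraction of test images whose highest-scoring class is the true car model. A minimal sketch is given below. The function name and the general `k` parameter are illustrative additions, since the text itself only discusses Top-1.

```python
def top_k_accuracy(score_rows, true_labels, k=1):
    """score_rows[i][c] is the score of class c for image i;
    true_labels[i] is the index of the correct class."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)
```

With 431 classes, each row of `score_rows` would hold 431 scores, typically the softmax outputs of the fine-tuned network.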

### 5.2. Attribute Prediction

We predict attributes for 111 car models that do not appear in the training set. Different from Section 4.2, where models are trained with cars in single viewpoints, we train with images in all viewpoints to build a compact model. Table 8 summarizes the results for the three networks, where “mean guess” represents the prediction with the mean of the values on the training set. GoogLeNet performs the best for all attributes and Overfeat is a close runner-up.

Table 8. Attribute prediction results of three deep models. For the continuous attributes (maximum speed and displacement), we display the mean difference from the ground truth (lower is better). For the discrete attributes (door and seat number, car type), we display the classification accuracy (higher is better).
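The two kinds of metrics in Table 8 and the “mean guess” baseline can be sketched in a few lines. This sketch assumes that “mean difference” means the mean absolute difference, which the caption does not state explicitly, and it applies the mean-guess baseline only to continuous attributes. All function names are hypothetical.

```python
def mean_absolute_difference(predictions, targets):
    """Continuous attributes (maximum speed, displacement): lower is better."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def mean_guess_error(train_values, test_values):
    """Error of always predicting the training-set mean ("mean guess")."""
    guess = sum(train_values) / len(train_values)
    return mean_absolute_difference([guess] * len(test_values), test_values)

def accuracy(predictions, targets):
    """Discrete attributes (door number, seat number, car type): higher is better."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

A learned model is only informative for a continuous attribute if its error is clearly below the value returned by `mean_guess_error`.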

### 5.3. Car Verification

The evaluation pipeline follows Section 4.3. We evaluate the three deep models combined with two verification models: Joint Bayesian [2] and an SVM with a polynomial kernel. The feature extracted from the CNN models is reduced to 200 dimensions by PCA before training and testing in all experiments.
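The PCA step can be sketched with NumPy as follows. This is a generic PCA via singular value decomposition, not necessarily the implementation used by the authors, and the function names are hypothetical. The projection is fitted on training features and then applied unchanged to test features.

```python
import numpy as np

def fit_pca(features, n_components):
    """features: array of shape (n_samples, n_dims). Returns (mean, components)."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the principal directions found by fit_pca."""
    return (features - mean) @ components.T
```

In the setting described above, `n_components` would be 200, and the reduced features would then be passed to Joint Bayesian or the SVM.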

polynomial [,pɒlɪ'nəʊmɪəl]: n. a polynomial; a scientific name consisting of two or more words; adj. polynomial; consisting of several names


The performances of the three networks combined with the two verification models are shown in Table 9, where each model is denoted by {name of the deep model} + {name of the verification model}. GoogLeNet + Joint Bayesian achieves the best performance in all three settings. For each deep model, Joint Bayesian outperforms SVM consistently. Compared to Table 6, Overfeat + Joint Bayesian yields a performance gain of 2% to 4% in the three settings, which is purely due to the increase in training data. The ROC curves for the three sets are plotted in Figure 14.

Table 9. The verification accuracies of six models.

Figure 14. The ROC curves of six verification models for (a) easy, (b) medium, and (c) hard set.

## 6. Fine-Grained Classification with Surveillance Data

This is a follow-up experiment for fine-grained classification with surveillance-nature data. The data includes 44,481 images of 281 different car models. 70% of the images are used for training and 30% for testing. The car images are all in front views under various environmental conditions, such as rain, fog, and night. We adopt the same three network structures (AlexNet, Overfeat, and GoogLeNet) as in the web-nature data applications for this task. The networks are also pre-trained on the ImageNet classification task, and the test is done with a single center crop. The car images are first cropped with the labeled bounding boxes with paddings of around 7% on each side. All cropped images are resized to 256 × 256 pixels. The experimental results are shown in Table 10. The three networks all achieve very high accuracies for this task. The result indicates that the fixed view (front view) greatly simplifies the fine-grained classification task, even when large environmental differences exist.

Table 10. The classification accuracies of three deep models on surveillance data.
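The bounding-box cropping with padding described above can be sketched as follows. The text does not say what the 7% is measured against, so this sketch assumes it is relative to the width and height of the bounding box, and it clips the result to the image borders. The function name and the box layout are hypothetical.

```python
def padded_box(box, image_size, pad_ratio=0.07):
    """box = (left, top, right, bottom) in pixels; image_size = (width, height).
    Expands each side by pad_ratio of the box size and clips to the image."""
    left, top, right, bottom = box
    width, height = image_size
    pad_x = pad_ratio * (right - left)
    pad_y = pad_ratio * (bottom - top)
    return (
        max(0, round(left - pad_x)),
        max(0, round(top - pad_y)),
        min(width, round(right + pad_x)),
        min(height, round(bottom + pad_y)),
    )
```

The padded region would then be cropped from the image and resized to 256 × 256 pixels before being fed to the network.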

## 7. Discussions

In this paper, we wish to promote the field of research related to “cars”, which is largely neglected by the computer vision community. To this end, we have introduced a large-scale car dataset called CompCars, which contains images with not only different viewpoints, but also car parts and rich attributes. CompCars provides a number of unique properties that other fine-grained datasets do not have, such as a much larger number of subcategories, a unique hierarchical structure, implicit and explicit attributes, and a large number of car part images which can be utilized for style analysis and part recognition. It also has a cross-modality nature, consisting of web-nature data and surveillance-nature data, ready to be used for cross-modality research. To validate the usefulness of the dataset and inspire the community to pursue other novel tasks, we have conducted baseline experiments on three tasks: car model classification, car model verification, and attribute prediction. The experimental results reveal several challenges of these tasks and provide qualitative observations of the data, which are beneficial for future research.

bear [beə]: vt. to bear fruit, to blossom; vt. to endure, to withstand, to have, to support; n. a bear (the animal)


There are many other potential tasks that can exploit CompCars. Image ranking is one of the long-lasting topics in the literature, and car model ranking can be adapted from this line of research to find the models that users are most interested in. The rich attributes of the dataset can be used to learn the relationships between different car models. Combined with the provided 3-level hierarchy, they will yield a stronger and more meaningful relationship graph for car models. Car images from different viewpoints can be utilized for ultra-wide baseline matching and 3D reconstruction, which can benefit recognition and verification in return.

long-lasting ['lɔ:ŋla:stiŋ]: adj. lasting for a long time


## References

Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
A Large-Scale Car Dataset for Fine-Grained Categorization and Verification
Automated Flower Classification over a Large Number of Classes
Dog Breed Classification Using Part Localization
3D Object Representations for Fine-Grained Categorization
The Caltech-UCSD Birds-200-2011 Dataset
https://github.com/intel/caffe/wiki/Model-Zoo

## WORDBOOK

Caltech-UCSD Birds-200-2011 (CUB-200-2011)

## KEY POINTS

To facilitate future car-related research, in this paper we present our on-going effort in collecting a large-scale dataset, “CompCars”, that covers not only different car views, but also their different internal and external parts, and rich attributes.

Importantly, the dataset is constructed with a cross-modality nature, containing a surveillance-nature set and a web-nature set.

The experiments shown in Sec. 5 gain better performance on all three tasks, i.e. car model classification, attribute prediction, and car model verification, thanks to more training data and better network structures.

Importantly, a unique hierarchy is presented for the car category, which is three levels from top to bottom: make, model, and released year.

In particular, cars have distinctive attributes such as car class, seating capacity, number of axles, maximum speed and displacement, which can be inferred from the appearance of the cars (see Fig. 1(a)).

To this end, we collect and organize a large-scale and comprehensive image database called “Comprehensive Cars”, with “CompCars” being short. The “CompCars” dataset is much larger in scale and diversity compared with the current car image datasets, containing 208,826 images of 1,716 car models from two scenarios: web-nature and surveillance-nature. In addition, the dataset is carefully labelled with viewpoints and car parts, as well as rich attributes such as type of car, seat capacity, and door number.

The recent deep learning based algorithms [22] first train a deep neural network on human identity classification, then train a verification model with the feature extracted from the deep neural network.

In contrast, the attributes provided by CompCars (e.g. maximum speed, door number, seat capacity) all have strict criteria since they are set by the car manufacturers.

The CompCars dataset contains data from two scenarios, including images from web-nature and surveillance-nature.

In particular, the web-nature data contains 163 car makes with 1,716 car models, covering most of the commercial car models in the recent ten years. There are a total of 136,727 images capturing the entire cars and 27,618 images capturing the car parts, where most of them are labeled with attributes and viewpoints. The surveillance-nature data contains 44,481 car images captured in the front view. Each image in the surveillance-nature partition is annotated with bounding box, model, and color of the car.

Overall, the CompCars dataset offers four unique features in comparison to existing car image databases, namely car hierarchy, car attributes, viewpoints, and car parts.

Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car.

For example, we define twelve types of cars, which are MPV, SUV, hatchback, sedan, minibus, fastback, estate, pickup, sports, crossover, convertible, and hardtop convertible, as shown in Fig. 4.

We also label five viewpoints for each car model, including front (F), rear (R), side (S), front-side (FS), and rear-side (RS).

The former group contains door number, seat number, and car type, which are represented by discrete values, while the latter group contains maximum speed and displacement (volume of an engine’s cylinders), represented by continuous values. Humans can easily tell the numbers of doors and seats from a car’s proper viewpoint, but hardly recognize its maximum speed and displacement.

We collect images capturing the eight car parts for each car model, including four exterior parts (i.e. headlight, taillight, fog light, and air intake) and four interior parts (i.e. console, steering wheel, dashboard, and gear lever).

For car model verification, the fine-tuned model is employed as a feature extractor.

For both settings, we divide the data into half for training and another half for testing. Car model labels are regarded as training target and logistic loss is used to fine-tune the Overfeat model.

Surprisingly, the “All-View” model yields the best performance, although it did not leverage the information of viewpoints. This result reveals that the CNN model is capable of learning discriminative representation across different views.

We report the “top-1” accuracies of car make classification in the last row of Table 3, where the “All-View” model obtains reasonably good results, indicating that a coarse-to-fine (i.e. from car make to model) classification is possible for fine-grained car recognition.

To observe the learned feature space of the “All-View” model, we project the features extracted from the last fully-connected layer to a two-dimensional embedding space using multi-dimensional scaling.

We observe that features from different models are separable in the 2D space and features of similar models are closer than those of dissimilar models.

We also conduct a cross-modality experiment, where the CNN model fine-tuned by the web-nature data is evaluated on the surveillance-nature data.

We investigate if the CNN model can mimic this strength.

We visualize taillight images that have high responses with respect to each neuron in the last fully-connected layer. Fig. 10 displays such images with respect to two neurons. “Taillight” wins among the different car parts, most likely due to the relatively more distinctive designs, and the model name printed close to the taillight, which is a very informative feature for the CNN model.

The appearance of a car also provides hints on the implicit attributes, such as the maximum speed and the displacement.

We fine-tune the CNN with the sum-of-square loss to model the continuous attributes, such as “maximum speed” and “displacement”, but a logistic loss to predict the discrete attributes such as “door number”, “seat number”, and “car type”.

We observe that the performances of “maximum speed” and “displacement” are insensitive to viewpoints. However, for the explicit attributes, the best accuracy is obtained under side view. We also found that the implicit attributes are more difficult to predict than the explicit attributes.

In particular, we adopt the classification model in Section 4.1.1 as a feature extractor of the car images, and then apply Joint Bayesian [2] to train a verification model on the Part-II data.

Deeply learned features combined with Joint Bayesian have proven successful for face verification [22].

The feature extracted from the CNN model has a dimension of 4,096, which is reduced to 20 by PCA. The compressed features are then utilized to train the Joint Bayesian model. During the testing stage, each image pair is classified by comparing the likelihood ratio produced by Joint Bayesian with a threshold. This model is denoted as CNN feature + Joint Bayesian.

However, its benefit in car verification is not as effective as in face verification, where CNN and Joint Bayesian nearly saturated the LFW dataset [8] and approached human performance [22].

We observe two major challenges. First, for the image pair of the same model but different viewpoints, it is difficult to obtain the correspondences directly from the raw image pixels. Second, the appearances of different car models of the same car make are extremely similar. It is difficult to distinguish these car models using the entire images.

We also release the fine-tuned GoogLeNet model on the CompCars webpage.

The feature extracted from the CNN models is reduced to 200 by PCA before training and testing in all experiments.

Compared to Table 6, Overfeat + Joint Bayesian yields a performance gain of 2% to 4% in the three settings, which is purely due to the increase in training data.

70% of the images are used for training and 30% for testing.

The result indicates that the fixed view (front view) greatly simplifies the fine-grained classification task, even when large environmental differences exist.

Following this line of research, many studies have proposed different datasets on a variety of categories: birds [25], dogs [16], cars [10], flowers [19], etc. But all these datasets are limited by their scales and subcategory numbers.

Similar to CompCars, the Cars dataset [10] also targets fine-grained tasks on the car category.
