# $k$-nearest neighbors algorithm

In pattern recognition, the $k$-nearest neighbors algorithm ($k$-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the $k$ closest training examples in the feature space. The output depends on whether $k$-NN is used for classification or regression.

• In $k$-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its $k$ nearest neighbors ($k$ is a positive integer, typically small). If $k = 1$, then the object is simply assigned to the class of that single nearest neighbor.

• In $k$-NN regression, the output is the property value for the object. This value is the average of the values of $k$ nearest neighbors.

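The plurality-vote rule above can be sketched in a few lines. This is a minimal illustration with made-up 2-D points and labels, not an optimized implementation:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by a plurality vote of its k nearest training points.

    `train` is a list of (vector, label) pairs; distance is Euclidean.
    """
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

points = [((0.0, 0.0), "blue"), ((0.1, 0.2), "blue"),
          ((1.0, 1.0), "red"), ((1.1, 0.9), "red"), ((0.9, 1.1), "red")]
print(knn_classify(points, (1.0, 0.8), k=3))  # "red": all 3 nearest are red
```

With $k = 1$ the same function reduces to the nearest-neighbor rule: the query simply takes the label of its single closest training point.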

Example of $k$-NN classification. The test sample (green dot) should be classified either to the first class of blue squares or to the second class of red triangles. If $k = 3$ (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If $k = 5$ (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

$k$-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The $k$-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, a useful technique can be used to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of $1/d$, where $d$ is the distance to the neighbor.
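A minimal sketch of this $1/d$ weighting scheme (hypothetical data; treating an exact zero-distance match as decisive is one common convention):

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, query, k=3):
    """Vote among the k nearest neighbours, each weighted by 1/d."""
    by_dist = sorted((math.dist(x, query), y) for x, y in train)
    weights = defaultdict(float)
    for d, label in by_dist[:k]:
        if d == 0:
            return label  # exact match dominates the vote
        weights[label] += 1.0 / d
    return max(weights, key=weights.get)

# One close "a" outweighs two distant "b"s, flipping the plain plurality vote.
train = [((0.1, 0.0), "a"), ((1.0, 0.0), "b"), ((0.0, 1.0), "b")]
print(weighted_knn_classify(train, (0.0, 0.0), k=3))  # "a": weight 10 vs. 1 + 1
```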



The neighbors are taken from a set of objects for which the class (for $k$-NN classification) or the object property value (for $k$-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.



A peculiarity of the $k$-NN algorithm is that it is sensitive to the local structure of the data. The algorithm has nothing to do with, and should not be confused with, the $k$-means algorithm.

## Statistical setting

Suppose we have pairs $(X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n})$ taking values in $\mathbb{R}^{d}\times \{1,2\}$, where $Y$ is the class label of $X$, so that $X|Y=r\sim P_{r}$ for $r = 1, 2$ (where $P_{r}$ are probability distributions). Given some norm $\|\cdot\|$ on $\mathbb{R}^{d}$ and a point $x\in \mathbb{R}^{d}$, let $(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ be a reordering of the training data such that $\|X_{(1)}-x\|\leq \dots \leq \|X_{(n)}-x\|$.

## Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, $k$ is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the $k$ training samples nearest to that query point.



A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, $k$-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric. Often, the classification accuracy of $k$-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
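Both of the basic metrics mentioned here are one-liners (a sketch; `math.dist` computes Euclidean distance, and the overlap metric simply counts mismatched positions):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.dist(a, b)

def hamming(a, b):
    """Overlap (Hamming) metric: number of positions where discrete features differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(hamming("karolin", "kathrin"))  # 3
```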



A drawback of the basic majority voting classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the $k$ nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its $k$ nearest neighbors. The class (or value, in regression problems) of each of the $k$ nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. $k$-NN can then be applied to the SOM.



## Parameter selection

The best choice of $k$ depends upon the data; generally, larger values of $k$ reduce the effect of noise on the classification, but make the boundaries between classes less distinct. A good $k$ can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when $k = 1$) is called the nearest neighbor algorithm.
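One simple heuristic is leave-one-out cross-validation over a few candidate values of $k$; a sketch with made-up data (odd candidates only, to avoid tied votes):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate of k-NN on `data`."""
    wrong = sum(knn_predict(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

data = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"),
        ((1.0, 1.0), "b"), ((0.9, 1.2), "b"), ((1.1, 0.8), "b")]
best_k = min((1, 3, 5), key=lambda k: loo_error(data, k))
```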



The accuracy of the $k$-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.



In binary (two-class) classification problems, it is helpful to choose $k$ to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal $k$ in this setting is via the bootstrap method.



## Decision boundary

Nearest neighbor rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.



## Metric learning

The $k$-nearest neighbor classification performance can often be significantly improved through (supervised) metric learning. Popular algorithms are neighbourhood components analysis and large margin nearest neighbor. Supervised metric learning algorithms use the label information to learn a new metric or pseudo-metric.



## Data reduction

Data reduction is one of the most important problems when working with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows:

• Select the class-outliers, that is, training data that are classified incorrectly by $k$-NN (for a given $k$).

• Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points that can be correctly classified by $k$-NN using prototypes. The absorbed points can then be removed from the training set.

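A rough sketch of both steps, on made-up data: class-outlier removal followed by a condensed-nearest-neighbour-style absorption pass. The single greedy pass here is a simplification of the full iterative procedure:

```python
import math
from collections import Counter

def knn_label(train, query, k):
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def split_prototypes(data, k=1):
    """Drop class-outliers, then greedily keep only the prototypes.

    A point is absorbed (and dropped) if the prototypes collected so far
    already classify it correctly; otherwise it becomes a prototype.
    """
    # 1. remove class-outliers: points misclassified by k-NN on the rest
    clean = [(x, y) for i, (x, y) in enumerate(data)
             if knn_label(data[:i] + data[i + 1:], x, k) == y]
    # 2. one greedy absorption pass over the remaining points
    prototypes = []
    for x, y in clean:
        if not prototypes or knn_label(prototypes, x, k) != y:
            prototypes.append((x, y))
    return prototypes

data = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"),
        ((1.0, 1.0), "b"), ((0.9, 1.2), "b"), ((1.1, 0.8), "b")]
prototypes = split_prototypes(data, k=1)  # two clusters -> two prototypes
```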



## Properties

The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but it is computationally intensive for large training sets. Using an approximate nearest neighbor search algorithm makes $k$-NN computationally tractable even for large data sets. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.

$k$-NN has some strong consistency results. As the amount of data approaches infinity, the two-class $k$-NN algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). Various improvements to the $k$-NN speed are possible by using proximity graphs.



## Feature extraction

When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g. the same measurement in both feet and meters), the input data is transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. If the features extracted are carefully chosen, it is expected that the feature set will capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Feature extraction is performed on raw data prior to applying the $k$-NN algorithm on the transformed data in feature space.



An example of a typical computer vision computation pipeline for face recognition using $k$-NN including feature extraction and dimension reduction pre-processing steps (usually implemented with OpenCV):

• Haar face detection
• Mean-shift tracking analysis
• PCA or Fisher LDA projection into feature space, followed by $k$-NN classification

## Dimension reduction

For high-dimensional data (e.g., more than 10 dimensions) dimension reduction is usually performed prior to applying the $k$-NN algorithm in order to avoid the effects of the curse of dimensionality.



The curse of dimensionality in the $k$-NN context basically means that Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector (imagine multiple points lying more or less on a circle with the query point at the center; the distance from the query to all data points in the search space is almost the same).
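This concentration effect can be observed directly. The sketch below (arbitrary sample count, fixed seed) compares the farthest-to-nearest distance ratio for uniform random points in low and high dimensions; in high dimensions the ratio approaches 1, i.e. all points are nearly equidistant from the query:

```python
import math
import random

def distance_spread(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest distance from the origin for
    uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    dists = [math.dist([0.0] * dim, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

print(distance_spread(2))     # large spread in low dimensions
print(distance_spread(1000))  # ratio close to 1: near-equidistant
```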

Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) techniques as a pre-processing step, followed by $k$-NN classification on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.

For very-high-dimensional datasets (e.g. when performing a similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate $k$-NN search using locality sensitive hashing, “random projections”, “sketches” or other high-dimensional similarity search techniques from the VLDB toolbox might be the only feasible option.



## The weighted nearest neighbour classifier

The $k$-nearest neighbour classifier can be viewed as assigning the $k$ nearest neighbours a weight $1/k$ and all others weight $0$. This can be generalised to weighted nearest neighbour classifiers, in which the $i$th nearest neighbour is assigned a weight $w_{ni}$, with $\sum_{i=1}^{n} w_{ni} = 1$. An analogous result on the strong consistency of weighted nearest neighbour classifiers also holds.



## The 1-nearest neighbor classifier

The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier, which assigns a point $x$ to the class of its closest neighbour in the feature space, that is, $C_{n}^{1nn}(x) = Y_{(1)}$.

As the size of training data set approaches infinity, the one nearest neighbour classifier guarantees an error rate of no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).



## Error rates

There are many results on the error rate of $k$ nearest neighbour classifiers. The $k$-nearest neighbour classifier is strongly consistent (that is, consistent for any joint distribution on $(X, Y)$) provided $k := k_{n}$ diverges and $k_{n}/n$ converges to zero as $n \to \infty$.



## $k$-NN regression

In $k$-NN regression, the $k$-NN algorithm is used for estimating continuous variables. One such algorithm uses a weighted average of the $k$ nearest neighbors, weighted by the inverse of their distance. This algorithm works as follows:

• Compute the Euclidean or Mahalanobis distance from the query example to the labeled examples.
• Order the labeled examples by increasing distance.
• Find a heuristically optimal number $k$ of nearest neighbors, based on RMSE. This is done using cross validation.
• Calculate an inverse distance weighted average with the $k$-nearest multivariate neighbors.

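The steps above can be sketched as follows (Euclidean distance only; the data and the value of $k$ are made up for illustration, and the cross-validation step for choosing $k$ is omitted):

```python
import math

def knn_regress(train, query, k=3):
    """Inverse-distance-weighted average of the k nearest neighbours.

    `train` is a list of (vector, value) pairs; distance is Euclidean.
    """
    nearest = sorted(((math.dist(x, query), v) for x, v in train))[:k]
    if nearest[0][0] == 0:  # exact match: return its value directly
        return nearest[0][1]
    total_w = sum(1.0 / d for d, _ in nearest)
    return sum(v / d for d, v in nearest) / total_w

samples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((10.0,), 50.0)]
print(knn_regress(samples, (1.5,), k=2))  # 4.0: equal weights on values 3.0 and 5.0
```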


## $k$-NN outlier

The distance to the $k$th nearest neighbor can also be seen as a local density estimate and thus is also a popular outlier score in anomaly detection. The larger the distance to the $k$-NN, the lower the local density, and the more likely the query point is an outlier. To take into account the whole neighborhood of the query point, the average distance to the $k$-NN can be used. Although quite simple, this outlier model, along with another classic data mining method, local outlier factor, also works quite well in comparison with more recent and more complex approaches, according to a large-scale experimental analysis.
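A minimal sketch of this score on hypothetical points; the isolated point receives a much larger $k$th-neighbour distance than the clustered ones:

```python
import math

def kth_neighbor_distance(data, point, k=3):
    """Distance to the k-th nearest neighbour of `point` (excluding itself):
    a simple outlier score; larger means lower local density."""
    dists = sorted(math.dist(point, other) for other in data if other != point)
    return dists[k - 1]

cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
far_point = (5.0, 5.0)
print(kth_neighbor_distance(cluster + [far_point], (0.0, 0.0), k=3))  # small: dense region
print(kth_neighbor_distance(cluster + [far_point], far_point, k=3))   # large: outlier
```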



## Validation of results

A confusion matrix or matching matrix is often used as a tool to validate the accuracy of $k$-NN classification. More robust statistical methods such as the likelihood-ratio test can also be applied.
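A confusion matrix is just a table of (actual, predicted) label counts; a minimal sketch with made-up labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Counts of (actual, predicted) label pairs from a classifier's test run."""
    return Counter(zip(actual, predicted))

actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]
cm = confusion_matrix(actual, predicted)
print(cm[("dog", "dog")])  # 2 correct "dog" predictions
print(cm[("cat", "dog")])  # 1 "cat" misclassified as "dog"
```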



## References

Locality-sensitive hashing (LSH): https://en.wikipedia.org/wiki/Locality-sensitive_hashing
