The Original sin/cos Encoding

This post was last updated on the evening of February 6, 2025.

Positional Encoding: The Original sin/cos Encoding

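1 sin/cos Encoding for 1D Sequences

1.1 Introduction

As is well known, the Transformer architecture has no built-in inductive bias about position, so positional information has to be injected separately. The original paper, "Attention Is All You Need"^[1], introduced the first positional encoding that is still in use today: the sin/cos positional encoding.

Suppose the model's input embedding is $x\in \mathbb{R}^{B\times T\times d}$. The positional encoding for a 1D sequence can then be written as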

\text{PE}_{t,2i} = \sin\left(\frac{t}{10000^{2i/d}}\right) \quad \text{PE}_{t,2i+1}=\cos\left(\frac{t}{10000^{2i/d}}\right)

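Here $t$ indexes the time dimension $T$, and $i\in[0, d/2-1]$ indexes pairs of channels along the channel dimension $d$. The formula shows that even and odd channels are encoded differently (sin on even channels, cos on odd channels), and that the encoding depends on both the token position $t$ and the channel index.

Once computed, this positional encoding is a fixed set of values, one vector per absolute position, which is why it is also called an absolute positional encoding.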
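
To make the formula concrete, here is a minimal sketch that transcribes it literally with two loops. The helper name `sincos_pe_reference` and the sizes used are illustrative assumptions, not part of the original post; it is meant as a slow reference, not as the implementation discussed below.

```python
import numpy as np


def sincos_pe_reference(num_positions: int, d: int) -> np.ndarray:
    """Loop-based transcription of the sin/cos formula (hypothetical helper)."""
    assert d % 2 == 0
    pe = np.zeros((num_positions, d), dtype=np.float64)
    for t in range(num_positions):
        for i in range(d // 2):
            angle = t / 10000 ** (2 * i / d)
            pe[t, 2 * i] = np.sin(angle)      # even channel 2i
            pe[t, 2 * i + 1] = np.cos(angle)  # odd channel 2i+1
    return pe


pe = sincos_pe_reference(num_positions=4, d=8)
print(pe.shape)  # (4, 8)
print(pe[0])     # position 0: sin(0) = 0 on even channels, cos(0) = 1 on odd channels
```

Each row is the encoding of one position, and each sin/cos pair shares one frequency.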

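1.2 Code Implementation

The formula shows that the channel size and the positions are all we need to compute the full set of encodings. In code, the main concern is computing them in parallel, that is, entirely with tensor operations.

We can design a function get_1d_sincos_pos_embed(embed_dim: int, pos: np.ndarray) that takes two arguments: embed_dim is the channel (embedding) size, and pos is a one-dimensional array of position ids. With $M$ positions in total, the function returns an $M\times D$ array, where $D$ is embed_dim (the $d$ of the formula).

First, we check that embed_dim is divisible by 2; otherwise the channels cannot be split evenly between sin and cos.

`assert embed_dim % 2 == 0`

Note that in the formula the exponent in the denominator is $2i/d$ for both the even and the odd channels, so we can first build the values of $2i/d$.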


omega = np.arange(embed_dim // 2, dtype=np.float64)
omega /= embed_dim / 2.
omega = 1. / 10000**omega  # (D/2,)

The first line creates $i\in [0,d/2-1]$, and the second line computes

\omega = \frac{i}{d/2} = \frac{2i}{d}

The third line then gives

\omega = \frac{1}{10000^{\omega}} = \frac{1}{10000^{2i/d}}
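
As a quick sanity check, the three lines can be run on a small channel size. The value `embed_dim = 8` is an illustrative assumption chosen so that the frequencies come out as round numbers.

```python
import numpy as np

embed_dim = 8  # small illustrative value, not taken from the original post
omega = np.arange(embed_dim // 2, dtype=np.float64)  # i = [0, 1, 2, 3]
omega /= embed_dim / 2.                              # 2i/d = [0, 0.25, 0.5, 0.75]
omega = 1. / 10000**omega                            # approximately [1, 0.1, 0.01, 0.001]
print(omega)
```

The frequencies decay geometrically from 1 towards 1/10000, so the first channels oscillate quickly with position and the last ones very slowly.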

We now have the complete set of scaling factors. Next, we multiply the computed $\omega$ with the positions.

pos = pos.reshape(-1)  # (M,)
out = np.einsum('m,d->md', pos, omega)  # (M, D/2), outer product

The $\omega$ we obtained is in fact a one-dimensional vector

\begin{bmatrix} \frac{1}{(10000^{0/d})} \\ \frac{1}{(10000^{2/d})} \\ \vdots \\ \frac{1}{(10000^{2i/d})} \end{bmatrix} \quad i\in[0,d/2-1]

The positions can likewise be written as a one-dimensional vector

\begin{bmatrix} 0 \\ 1 \\ \vdots \\ M-1 \end{bmatrix}

Their outer product is then

\begin{bmatrix} 0 \\ 1 \\ \vdots \\ M-1 \end{bmatrix}\otimes \begin{bmatrix} \frac{1}{(10000^{0/d})} \\ \frac{1}{(10000^{2/d})} \\ \vdots \\ \frac{1}{(10000^{2i/d})} \end{bmatrix} = \begin{bmatrix} 0\cdot \frac{1}{(10000^{0/d})} & 0 \cdot \frac{1}{(10000^{2/d})} &\cdots & 0\cdot \frac{1}{(10000^{2i/d})} \\ 1\cdot \frac{1}{(10000^{0/d})} & 1 \cdot \frac{1}{(10000^{2/d})} &\cdots & 1\cdot \frac{1}{(10000^{2i/d})} \\ \vdots & \vdots & \ddots & \vdots \\ (M-1)\cdot \frac{1}{(10000^{0/d})} & (M-1) \cdot \frac{1}{(10000^{2/d})} &\cdots & (M-1)\cdot \frac{1}{(10000^{2i/d})} \end{bmatrix}

This gives an $M\times d/2$ matrix that holds the angle for every position and every channel pair, that is,

\frac{t}{10000^{2i/d}}\quad i\in[0,d/2-1], t\in[0,M-1]
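
The einsum call is just an outer product. As a small sketch (the sizes `M = 5` and `d = 8` are illustrative assumptions), the same matrix can be obtained with `np.outer` or with broadcasting:

```python
import numpy as np

pos = np.arange(5, dtype=np.float64)           # M = 5 positions
omega = 1. / 10000 ** (np.arange(4) / 4.)      # D/2 = 4 frequencies, i.e. d = 8

out_einsum = np.einsum('m,d->md', pos, omega)  # the form used in this post
out_outer = np.outer(pos, omega)               # explicit outer product
out_bcast = pos[:, None] * omega[None, :]      # broadcasting (M, 1) * (1, D/2)

print(out_einsum.shape)                        # (5, 4)
print(np.allclose(out_einsum, out_outer))      # True
print(np.allclose(out_einsum, out_bcast))      # True
```

Which form to use is a matter of taste; the einsum spelling makes the index pattern explicit.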

Next, sin and cos are applied separately, and the two results are joined into a complete tensor with channel size $D$.


emb_sin = np.sin(out) # (M, D/2)
emb_cos = np.cos(out) # (M, D/2)
emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)

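It is immediately apparent that this implementation differs from the formula: the frequency index $i$ only runs over $d/2$ entries, and instead of interleaving the sin and cos terms on even and odd channels, all sin terms go into the first half of the channels and all cos terms into the second half. The two layouts differ only by a fixed permutation of the channels. This layout comes from the later tensor2tensor repository^[2], and nearly all subsequent open-source projects use this version of the positional encoding.

The complete code is as follows

import numpy as np


def get_1d_sincos_pos_embed(embed_dim, pos):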
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.
    omega = 1. / 10000**omega  # (D/2,)

    pos = pos.reshape(-1)  # (M,)
    out = np.einsum('m,d->md', pos, omega)  # (M, D/2), outer product

    emb_sin = np.sin(out) # (M, D/2)
    emb_cos = np.cos(out) # (M, D/2)

    emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)
    return emb
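
A short usage sketch follows. The function is repeated so that the snippet runs on its own; the sizes `M = 6` and `embed_dim = 8` are illustrative assumptions.

```python
import numpy as np


def get_1d_sincos_pos_embed(embed_dim, pos):
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.
    omega = 1. / 10000**omega  # (D/2,)

    pos = pos.reshape(-1)  # (M,)
    out = np.einsum('m,d->md', pos, omega)  # (M, D/2), outer product

    emb = np.concatenate([np.sin(out), np.cos(out)], axis=1)  # (M, D)
    return emb


pos = np.arange(6, dtype=np.float64)  # positions 0..5
emb = get_1d_sincos_pos_embed(8, pos)
print(emb.shape)  # (6, 8)
print(emb[0])     # position 0: four zeros (sin half), then four ones (cos half)
```

Every entry lies in [-1, 1], so the encoding can be added to the token embeddings without rescaling.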

1.3 Original Code Implementation

import numpy as np
import tensorflow as tf


def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)

The original code first computes the angles in get_angles

\omega = \frac{t}{10000^{2\cdot(i//2)/d}}

Integer division i//2 yields $[0, 0, 1, 1, 2, 2, 3,\dots]$, and multiplying by 2 gives $[0,0,2,2,4,4,6,\dots]$. For a fixed position $t$, the angles across the channels are therefore

\left[\frac{t}{10000^{0/d}}, \frac{t}{10000^{0/d}}, \frac{t}{10000^{2/d}}, \frac{t}{10000^{2/d}}, \frac{t}{10000^{4/d}},\dots\right]

The sin/cos calls and the in-place slice assignments that follow then produce

\left[\sin\left(\frac{t}{10000^{0/d}}\right),\cos\left(\frac{t}{10000^{0/d}}\right), \sin\left(\frac{t}{10000^{2/d}}\right), \cos\left(\frac{t}{10000^{2/d}}\right), \sin\left(\frac{t}{10000^{4/d}}\right),\dots\right]

The result of this code matches the original formula: sin and cos alternate on even and odd channels, and each even/odd pair shares the same exponent $2i/d$.
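
To see how the interleaved layout relates to the concatenated one from Section 1.2, the sketch below reimplements both in plain NumPy (no TensorFlow) and compares them. The function names `interleaved_pe` and `concatenated_pe` and the sizes are assumptions made for this illustration.

```python
import numpy as np


def interleaved_pe(num_positions, d_model):
    # NumPy-only version of the interleaved layout: sin on even, cos on odd channels.
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float64(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles


def concatenated_pe(num_positions, d_model):
    # Layout from Section 1.2: all sin terms first, then all cos terms.
    omega = 1. / 10000 ** (np.arange(d_model // 2, dtype=np.float64) / (d_model / 2.))
    out = np.outer(np.arange(num_positions, dtype=np.float64), omega)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)


a = interleaved_pe(10, 16)
b = concatenated_pe(10, 16)
print(np.allclose(a[:, 0::2], b[:, :8]))  # True: even channels equal the sin half
print(np.allclose(a[:, 1::2], b[:, 8:]))  # True: odd channels equal the cos half
```

The two versions contain the same values in a different channel order, so a weight matrix trained with one layout is not directly compatible with the other.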

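2 2D sin/cos Encoding

When a Transformer-style model processes image data, a two-dimensional positional encoding may be needed. The idea is simple: apply the 1D positional encoding separately along the image height and width.

We first create the 2D grid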


grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h)  # here w goes first

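Here grid_size is the side length of the square grid, that is, its height or width. meshgrid returns the coordinates of the 2D grid as a pair of two-dimensional arrays. np.meshgrid(X, Y) returns two coordinate arrays, the first holding the X coordinates and the second the Y coordinates, each of shape $len(Y)\times len(X)$.

Next we stack, reshape, and split the grid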
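
A small non-square example makes the shapes and the coordinate order easier to see. The sizes (height 2, width 3) are illustrative assumptions; the post itself uses a square grid.

```python
import numpy as np

grid_h = np.arange(2, dtype=np.float32)  # height 2: y in {0, 1}
grid_w = np.arange(3, dtype=np.float32)  # width 3: x in {0, 1, 2}
xs, ys = np.meshgrid(grid_w, grid_h)     # w goes first, as in the post

print(xs.shape, ys.shape)  # (2, 3) (2, 3), i.e. len(Y) x len(X)
print(xs)  # [[0. 1. 2.], [0. 1. 2.]]: x changes along each row
print(ys)  # [[0. 0. 0.], [1. 1. 1.]]: y changes down each column
```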

grid = np.stack(grid, axis=0)  # 2 x h x w
grid = grid.reshape([2, 1, grid_size, grid_size]) # 2 x 1 x h x w
pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)

def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
    assert embed_dim % 2 == 0

    # use half of dimensions to encode grid_h
    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)

    emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
    return emb

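Stacking the grid coordinates gives a tensor of size $2\times s\times s$, and the reshape adds one dimension to give $2\times 1\times s\times s$. From here on, height and width each get half of embed_dim: grid[0] and grid[1] are passed separately to the 1D encoding (here named get_1d_sincos_pos_embed_from_grid), which flattens the grid before computing, and the two results are concatenated into the complete encoding.

Author's note:
I think the naming here is misleading. The first array returned by np.meshgrid holds the x coordinates, that is, the position ids along the image width, and the second holds the y coordinates, that is, the position ids along the image height. So grid[0] actually corresponds to emb_w and grid[1] to emb_h. This probably has no practical effect, though, since transposing an image's positional information does not change the positional relationships to be learned.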
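
Putting the fragments together, a self-contained sketch of the 2D encoding could look like the following. The wrapper name `build_2d_pos_embed` is a hypothetical name chosen for this illustration, and the 1D function is assumed to be the same computation as in Section 1.2.

```python
import numpy as np


def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
    # Assumed to match the 1D function from Section 1.2.
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.
    omega = 1. / 10000**omega                # (D/2,)
    pos = pos.reshape(-1)                    # flatten the grid: (M,)
    out = np.einsum('m,d->md', pos, omega)   # (M, D/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)  # (M, D)


def build_2d_pos_embed(embed_dim, grid_size):
    # Hypothetical wrapper that assembles the steps shown in this section.
    assert embed_dim % 2 == 0
    grid_h = np.arange(grid_size, dtype=np.float32)
    grid_w = np.arange(grid_size, dtype=np.float32)
    grid = np.stack(np.meshgrid(grid_w, grid_h), axis=0)  # 2 x h x w
    grid = grid.reshape([2, 1, grid_size, grid_size])     # 2 x 1 x h x w
    emb_a = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
    emb_b = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)
    return np.concatenate([emb_a, emb_b], axis=1)                      # (H*W, D)


pos_embed = build_2d_pos_embed(embed_dim=8, grid_size=3)
print(pos_embed.shape)  # (9, 8): one row per grid cell
print(pos_embed[0])     # cell (0, 0): [0, 0, 1, 1, 0, 0, 1, 1]
```

Each grid cell receives a vector whose two halves encode its two coordinates independently.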
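
The note can be checked numerically on a small square grid. In this sketch, `sincos_1d` is a hypothetical compact stand-in for the 1D function, and the sizes are illustrative assumptions. It shows that grid[0] varies along the width axis, and that swapping the two halves gives the same result as transposing the grid.

```python
import numpy as np


def sincos_1d(embed_dim, pos):
    # Compact stand-in for the 1D sin/cos function from Section 1.2.
    omega = 1. / 10000 ** (np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.))
    out = np.outer(pos.reshape(-1), omega)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)


s, D = 3, 8
grid = np.stack(np.meshgrid(np.arange(s, dtype=np.float32),
                            np.arange(s, dtype=np.float32)), axis=0)  # (2, s, s)
print(grid[0])  # every row is [0, 1, 2]: the value changes along the width axis

# Order used in the post (grid[0] first) versus the swapped order.
as_written = np.concatenate([sincos_1d(D // 2, grid[0]), sincos_1d(D // 2, grid[1])], axis=1)
swapped = np.concatenate([sincos_1d(D // 2, grid[1]), sincos_1d(D // 2, grid[0])], axis=1)

as_written = as_written.reshape(s, s, D)
swapped = swapped.reshape(s, s, D)
print(np.allclose(as_written.transpose(1, 0, 2), swapped))  # True
```

On a square grid the two conventions therefore differ only by a transpose of the spatial axes.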

References

[1] A. Vaswani et al., “Attention Is All You Need,” Aug. 01, 2023, arXiv: arXiv:1706.03762. Accessed: Nov. 24, 2023. [Online]. Available: http://arxiv.org/abs/1706.03762
[2] https://github.com/tensorflow/tensor2tensor


https://jesseprince.github.io/2025/02/06/aigcs/position_encodes/original_sin_cos/

Author: 林正

Published: February 6, 2025
