Optimizer Factory -- Writing an Optimizer Factory with Layer-wise Learning Rate Decay

Last updated: January 28, 2024 (evening)

1 Introduction

Adjusting the learning rate per layer matters, but the stock torch.optim.Optimizer class has no per-layer adjustment, so we need to build that workflow ourselves. To create optimizers with this capability conveniently, the factory design pattern is a good fit: whenever we need a different optimizer, the optimizer factory can always "process" a torch optimizer for us and add per-layer adjustment to it.

2 Grouping Strategy

In practice we rarely want to adjust every single layer of the model by some rule. For example, a Transformer block contains qkv projections, layer norms, and so on; these can all be treated as one "layer" for learning-rate purposes. In other words, we only want to adjust the learning rate per block.

The same goes for CNN models: we might want the learning rate to change once every three blocks of depth. We therefore need a function that maps layers inside a PyTorch Module to a logical "layer"; in other words, we group the Module's layers.

def get_num_layer_for_vit(var_name, num_max_layer):
    if var_name in ("cls_token", "mask_token", "pos_embed"):
        return 0
    elif var_name.startswith("patch_embed"):
        return 0
    elif var_name.startswith("rel_pos_bias"):
        return num_max_layer - 1  # map relative position bias to last layer
    elif var_name.startswith("blocks"):
        layer_id = int(var_name.split('.')[1])  # ['blocks', '0', 'qkv', 'weight']
        return layer_id + 1  # increase layer one by one
    else:
        return num_max_layer - 1  # additional layers

The above is the grouping strategy used in the BEiT project. It can be read as follows:

  • If the parameter name is cls_token, mask_token, or pos_embed, it goes to layer 0.
  • If the parameter name starts with patch_embed, it also goes to layer 0.
  • If the parameter name starts with rel_pos_bias, it goes to the last layer.
  • If the parameter name starts with blocks, the string is first split on '.'. Parameter names typically look like blocks.0.qkv.weight (when the blocks are appended in a loop), so splitting yields ['blocks', '0', 'qkv', 'weight'], and the second element is the ID the block was given when it was created in the loop. The block is then assigned to layer ID + 1.
  • Any other parameter name goes to the last layer.

This yields the following grouping:

  • Layer 0: cls_token, mask_token, pos_embed, patch_embed, and so on.
  • Middle layers: computed as the block index after blocks plus 1.
  • Last layer: rel_pos_bias and all other parameters.

For a CNN model, take ConvNeXt as an example:

def get_num_layer_for_convnext(var_name):
    num_max_layer = 12
    if var_name.startswith("downsample_layers"):
        stage_id = int(var_name.split('.')[1])
        if stage_id == 0:  # stage 0
            layer_id = 0  # map to layer 0
        elif stage_id == 1 or stage_id == 2:  # stages 1 and 2
            layer_id = stage_id + 1  # map to layers 2 and 3
        elif stage_id == 3:  # stage 3
            layer_id = 12  # map to layer 12
        return layer_id

    elif var_name.startswith("stages"):
        stage_id = int(var_name.split('.')[1])  # ['stages', '0', '1', 'conv1', 'weight']
        block_id = int(var_name.split('.')[2])
        if stage_id == 0 or stage_id == 1:
            layer_id = stage_id + 1
            # stages 0 and 1 are mapped to layers 1 and 2
        elif stage_id == 2:
            layer_id = 3 + block_id // 3
            # three blocks in a group
        elif stage_id == 3:  # last stage
            layer_id = 12  # all to the last layer
        return layer_id
    else:
        return num_max_layer + 1  # other layers

In ConvNeXt, parameters are named mainly under downsample_layers and stages, so the function mainly checks whether a parameter belongs to a downsampler in downsample_layers or to a convolution block in stages. Since ConvNeXt builds its blocks in loops, downsampler names look like downsample_layers.0.0.weight and stage names look like stages.0.1.conv.weight. After splitting the name on '.', the downsampler's second element is its stage ID; for blocks, the second element is the stage ID and the third is the block ID. We don't use the downsampler's block ID because a downsampler only contains one layer norm and one conv, which need no further separation.

Under this grouping strategy, we end up with:

for ConvNeXt with depth of [3, 3, 27, 3]
layer 0: [sampler_s0] (stem)
layer 1: [stage_s0]
layer 2: [sampler_s1, stage_s1]
layer 3: [sampler_s2, stage_s2_blk(0,1,2)]
layer 4: [stage_s2_blk(3,4,5)]
layer 5: [stage_s2_blk(6,7,8)]
layer 6: [stage_s2_blk(9,10,11)]
layer 7: [stage_s2_blk(12,13,14)]
layer 8: [stage_s2_blk(15,16,17)]
layer 9: [stage_s2_blk(18,19,20)]
layer 10: [stage_s2_blk(21,22,23)]
layer 11: [stage_s2_blk(24,25,26)]
layer 12: [sampler_s3, stage_s3]
layer 13: other parameters

As you can see, the grouping strategy is up to you; for your own model you need to customize a get_num_layer_for_xxx.
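For instance, a hypothetical model with a stem followed by a flat sequence of blocks might use a mapping like this (the prefixes stem/body and num_max_layer=8 are made up for illustration, not taken from any real model):

```python
# Hypothetical template for a custom grouping function; the prefixes
# "stem"/"body" and num_max_layer=8 are illustrative assumptions.
def get_num_layer_for_mymodel(var_name, num_max_layer=8):
    if var_name.startswith("stem"):
        return 0                                # stem -> layer 0
    if var_name.startswith("body"):
        return int(var_name.split(".")[1]) + 1  # body.i -> layer i + 1
    return num_max_layer - 1                    # head, norm, etc. -> last layer

print(get_num_layer_for_mymodel("body.3.conv.weight"))  # 4
```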

3 Assigner

Next, given a layer ID, we need to assign a different learning rate scale to each layer. The scale here means lr = lr × scale; with a different scale per layer, each layer ends up with a different learning rate.

class LayerDecayValueAssigner(object):
    def __init__(self, values):
        self.values = values  # decay values

    def get_scale(self, layer_id):
        return self.values[layer_id]
        # use layer_id as index to obtain the decay scale

    def get_layer_id(self, var_name):
        return get_num_layer_for_convnext(var_name)  # use the custom layer map

First, the assigner is initialized with values. This can be a list holding the learning rate scale of each layer, for example [1, 0.9, 0.8, 0.7]: layer 0 has scale 1, so its parameters use the originally configured learning rate times 1; layer 1 has scale 0.9, so its learning rate is the original learning rate times 0.9, and so on.
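Instead of writing the list by hand, BEiT-style code typically builds it geometrically from a single decay rate, so that the last layer keeps scale 1.0 and each earlier layer is multiplied by the decay rate once more. A sketch, assuming 12 logical layers and a decay rate of 0.9 (both illustrative):

```python
# Geometric construction of the per-layer scales (a sketch following the
# pattern used in BEiT-style code); layer_decay and num_layers are assumptions.
layer_decay = 0.9
num_layers = 12  # number of logical layers from the grouping function
values = [layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]
print(len(values))   # 14
print(values[-1])    # 1.0 -> the last layer keeps the base learning rate
```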

get_scale(self, layer_id) returns the learning rate scale stored in the values list at the given layer ID.

get_layer_id(self, var_name) uses the custom grouping function to map a model parameter name to its layer number.

4 Parameter Grouping

Next, we group all of the model's parameters so they can later be passed to the optimizer.

def get_parameter_groups(
        model,
        weight_decay=1e-5,
        skip_list=(),
        get_num_layer=None,
        get_layer_scale=None
):
    parameter_group_names = {}
    parameter_group_vars = {}

This function takes the model, a weight_decay value, and the two functions get_num_layer and get_layer_scale; for these you typically pass the custom get_num_layer_for_xxx and the get_scale method of a LayerDecayValueAssigner object.

We also initialize two dicts: parameter_group_names records parameter names, and parameter_group_vars records the parameter tensors themselves.

First we iterate over the model's names and parameters:

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue  # frozen weights, skip them
    if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
        # bias, entries in the skip list, layer norm, and possibly layer scale
        group_name = "no_decay"
        this_weight_decay = 0.
    else:
        group_name = "decay"
        this_weight_decay = weight_decay  # assign the weight decay
    if get_num_layer is not None:  # if get_num_layer is provided, rearrange by layer
        layer_id = get_num_layer(name)  # get the layer id
        group_name = "layer_%d_%s" % (layer_id, group_name)
        # group_name is overridden to layer_(layer_id)_(group_name)
    else:
        layer_id = None  # no rearranging

Notice that without get_num_layer, group_name only takes the values no_decay and decay. Not providing get_num_layer means no per-layer adjustment is needed, so we only distinguish parameters with weight decay from those without.

If get_num_layer is provided, group_name is named as layer_(layer_id)_(decay or no_decay).

The next part is still inside the loop:

    if group_name not in parameter_group_names:
        if get_layer_scale is not None:
            # if a decay assigner is given
            scale = get_layer_scale(layer_id)
            # get the scale
        else:
            scale = 1.
            # no assigner means no decay; the scale is 1

        parameter_group_names[group_name] = {
            "weight_decay": this_weight_decay,
            "params": [],
            "lr_scale": scale
        }

        # initialize the storage format for a group:
        # - "weight_decay": the weight decay for this layer
        # - "params": model parameters
        # - "lr_scale": the scale of the learning rate

        parameter_group_vars[group_name] = {
            "weight_decay": this_weight_decay,
            "params": [],
            "lr_scale": scale
        }

    parameter_group_vars[group_name]["params"].append(param)
    # append the parameter tensor to this group
    parameter_group_names[group_name]["params"].append(name)
    # append the parameter name to this group

This code first checks whether the group_name built earlier is already in parameter_group_names. If not, it enters the if branch and checks whether get_layer_scale was provided; if so, the layer's scale is looked up using the previously obtained layer ID. If not, no per-layer adjustment is needed and the scale is 1.

Next, parameter_group_names and parameter_group_vars are both initialized in the following format:

  • weight_decay
  • params
  • lr_scale

Here weight_decay was determined by the earlier branch: groups with decay get the passed-in weight decay value, and groups without get 0. params collects the parameters seen during the iteration, and lr_scale holds the retrieved scale.

Then the parameter's tensor is appended to params under the matching group_name in parameter_group_vars, and the parameter's name is appended under the same group_name in parameter_group_names.

So parameter_group_names records the names of all parameters, while parameter_group_vars records their values.
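The lazy one-entry-per-group accumulation can be seen in miniature below (the parameter names and the weight decay value are toy values for illustration):

```python
# Miniature of the accumulation pattern: a group dict entry is created
# the first time its group_name appears, then params are appended to it.
groups = {}
for name in ["a.weight", "a.bias", "b.weight"]:
    group_name = "no_decay" if name.endswith(".bias") else "decay"
    if group_name not in groups:
        groups[group_name] = {"weight_decay": 0.5 if group_name == "decay" else 0.0,
                              "params": []}
    groups[group_name]["params"].append(name)

print(groups["decay"]["params"])     # ['a.weight', 'b.weight']
print(groups["no_decay"]["params"])  # ['a.bias']
```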

print("Param groups = %s" % json.dumps(parameter_group_names, indent=2))
# print the dict with json.dumps (requires `import json` at the top of the file)
return list(parameter_group_vars.values())
# convert the dict to a list and return the parameter groups
# format: a list whose elements are dicts in the format above

Finally, parameter_group_names is printed for inspection, and parameter_group_vars is returned: the dict's values are extracted and converted to a list.

Let's test it with ConvNeXt:

model = convnext_small()
parameters = get_parameter_groups(
    model=model,
    weight_decay=0.5,
)

We pass neither get_num_layer nor get_layer_scale; the output is:

Param groups = {
  "decay": {
    "weight_decay": 0.5,
    "params": [
      "downsample_layers.0.0.weight",
      "downsample_layers.1.1.weight",
      "downsample_layers.2.1.weight",
      "downsample_layers.3.1.weight",
      "stages.0.0.dwconv.weight",
      "stages.0.0.pwconv1.weight",
      "stages.0.0.pwconv2.weight",
      "stages.0.1.dwconv.weight",
      ......
      "stages.3.2.pwconv2.weight",
      "head.weight"
    ],
    "lr_scale": 1.0
  },
  "no_decay": {
    "weight_decay": 0.0,
    "params": [
      "downsample_layers.0.0.bias",
      "downsample_layers.0.1.weight",
      ...
      "downsample_layers.3.1.bias",
      "stages.0.0.gamma",
      "stages.0.0.dwconv.bias",
      "stages.0.0.norm.weight",
      "stages.0.0.norm.bias",
      "stages.0.0.pwconv1.bias",
      "stages.0.0.pwconv2.bias",
      ...
      "stages.3.2.pwconv2.bias",
      "norm.weight",
      "norm.bias",
      "head.bias"
    ],
    "lr_scale": 1.0
  }
}

As you can see, parameters that should not get weight decay, such as the gamma used for layer scale, the model's biases, and the weights and biases of norm layers, are all grouped under no_decay, and that group's weight_decay is set to 0.

Now let's add per-layer adjustment.

import numpy as np

values = np.linspace(1, 0.5, 14)
assigner = LayerDecayValueAssigner(values)
model = convnext_small()
parameters = get_parameter_groups(
    model=model,
    weight_decay=0.5,
    get_num_layer=get_num_layer_for_convnext,
    get_layer_scale=assigner.get_scale
)

The output is:

Param groups = {
  "layer_0_decay": {
    "weight_decay": 0.5,
    "params": [
      "downsample_layers.0.0.weight"
    ],
    "lr_scale": 1.0
  },
  "layer_0_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "downsample_layers.0.0.bias",
      "downsample_layers.0.1.weight",
      "downsample_layers.0.1.bias"
    ],
    "lr_scale": 1.0
  },
  "layer_2_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "downsample_layers.1.0.weight",
      "downsample_layers.1.0.bias",
      "downsample_layers.1.1.bias",
      "stages.1.0.gamma",
      "stages.1.0.dwconv.bias",
      "stages.1.0.norm.weight",
      "stages.1.0.norm.bias",
      "stages.1.0.pwconv1.bias",
      "stages.1.0.pwconv2.bias",
      "stages.1.1.gamma",
      "stages.1.1.dwconv.bias",
      "stages.1.1.norm.weight",
      "stages.1.1.norm.bias",
      "stages.1.1.pwconv1.bias",
      "stages.1.1.pwconv2.bias",
      "stages.1.2.gamma",
      "stages.1.2.dwconv.bias",
      "stages.1.2.norm.weight",
      "stages.1.2.norm.bias",
      "stages.1.2.pwconv1.bias",
      "stages.1.2.pwconv2.bias"
    ],
    "lr_scale": 0.9230769230769231
  },
  "layer_2_decay": {
    "weight_decay": 0.5,
    "params": [
      "downsample_layers.1.1.weight",
      "stages.1.0.dwconv.weight",
      "stages.1.0.pwconv1.weight",
      "stages.1.0.pwconv2.weight",
      "stages.1.1.dwconv.weight",
      "stages.1.1.pwconv1.weight",
      "stages.1.1.pwconv2.weight",
      "stages.1.2.dwconv.weight",
      "stages.1.2.pwconv1.weight",
      "stages.1.2.pwconv2.weight"
    ],
    "lr_scale": 0.9230769230769231
  },
  "layer_3_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "downsample_layers.2.0.weight",
      "downsample_layers.2.0.bias",
      "downsample_layers.2.1.bias",
      "stages.2.0.gamma",
      "stages.2.0.dwconv.bias",
      "stages.2.0.norm.weight",
      "stages.2.0.norm.bias",
      "stages.2.0.pwconv1.bias",
      "stages.2.0.pwconv2.bias",
      "stages.2.1.gamma",
      "stages.2.1.dwconv.bias",
      "stages.2.1.norm.weight",
      "stages.2.1.norm.bias",
      "stages.2.1.pwconv1.bias",
      "stages.2.1.pwconv2.bias",
      "stages.2.2.gamma",
      "stages.2.2.dwconv.bias",
      "stages.2.2.norm.weight",
      "stages.2.2.norm.bias",
      "stages.2.2.pwconv1.bias",
      "stages.2.2.pwconv2.bias"
    ],
    "lr_scale": 0.8846153846153846
  },
  "layer_3_decay": {
    "weight_decay": 0.5,
    "params": [
      "downsample_layers.2.1.weight",
      "stages.2.0.dwconv.weight",
      "stages.2.0.pwconv1.weight",
      "stages.2.0.pwconv2.weight",
      "stages.2.1.dwconv.weight",
      "stages.2.1.pwconv1.weight",
      "stages.2.1.pwconv2.weight",
      "stages.2.2.dwconv.weight",
      "stages.2.2.pwconv1.weight",
      "stages.2.2.pwconv2.weight"
    ],
    "lr_scale": 0.8846153846153846
  },
  "layer_12_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "downsample_layers.3.0.weight",
      "downsample_layers.3.0.bias",
      "downsample_layers.3.1.bias",
      "stages.3.0.gamma",
      "stages.3.0.dwconv.bias",
      "stages.3.0.norm.weight",
      "stages.3.0.norm.bias",
      "stages.3.0.pwconv1.bias",
      "stages.3.0.pwconv2.bias",
      "stages.3.1.gamma",
      "stages.3.1.dwconv.bias",
      "stages.3.1.norm.weight",
      "stages.3.1.norm.bias",
      "stages.3.1.pwconv1.bias",
      "stages.3.1.pwconv2.bias",
      "stages.3.2.gamma",
      "stages.3.2.dwconv.bias",
      "stages.3.2.norm.weight",
      "stages.3.2.norm.bias",
      "stages.3.2.pwconv1.bias",
      "stages.3.2.pwconv2.bias"
    ],
    "lr_scale": 0.5384615384615384
  },
  "layer_12_decay": {
    "weight_decay": 0.5,
    "params": [
      "downsample_layers.3.1.weight",
      "stages.3.0.dwconv.weight",
      "stages.3.0.pwconv1.weight",
      "stages.3.0.pwconv2.weight",
      "stages.3.1.dwconv.weight",
      "stages.3.1.pwconv1.weight",
      "stages.3.1.pwconv2.weight",
      "stages.3.2.dwconv.weight",
      "stages.3.2.pwconv1.weight",
      "stages.3.2.pwconv2.weight"
    ],
    "lr_scale": 0.5384615384615384
  },
  "layer_1_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.0.0.gamma",
      "stages.0.0.dwconv.bias",
      "stages.0.0.norm.weight",
      "stages.0.0.norm.bias",
      "stages.0.0.pwconv1.bias",
      "stages.0.0.pwconv2.bias",
      "stages.0.1.gamma",
      "stages.0.1.dwconv.bias",
      "stages.0.1.norm.weight",
      "stages.0.1.norm.bias",
      "stages.0.1.pwconv1.bias",
      "stages.0.1.pwconv2.bias",
      "stages.0.2.gamma",
      "stages.0.2.dwconv.bias",
      "stages.0.2.norm.weight",
      "stages.0.2.norm.bias",
      "stages.0.2.pwconv1.bias",
      "stages.0.2.pwconv2.bias"
    ],
    "lr_scale": 0.9615384615384616
  },
  "layer_1_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.0.0.dwconv.weight",
      "stages.0.0.pwconv1.weight",
      "stages.0.0.pwconv2.weight",
      "stages.0.1.dwconv.weight",
      "stages.0.1.pwconv1.weight",
      "stages.0.1.pwconv2.weight",
      "stages.0.2.dwconv.weight",
      "stages.0.2.pwconv1.weight",
      "stages.0.2.pwconv2.weight"
    ],
    "lr_scale": 0.9615384615384616
  },
  "layer_4_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.3.gamma",
      "stages.2.3.dwconv.bias",
      "stages.2.3.norm.weight",
      "stages.2.3.norm.bias",
      "stages.2.3.pwconv1.bias",
      "stages.2.3.pwconv2.bias",
      "stages.2.4.gamma",
      "stages.2.4.dwconv.bias",
      "stages.2.4.norm.weight",
      "stages.2.4.norm.bias",
      "stages.2.4.pwconv1.bias",
      "stages.2.4.pwconv2.bias",
      "stages.2.5.gamma",
      "stages.2.5.dwconv.bias",
      "stages.2.5.norm.weight",
      "stages.2.5.norm.bias",
      "stages.2.5.pwconv1.bias",
      "stages.2.5.pwconv2.bias"
    ],
    "lr_scale": 0.8461538461538461
  },
  "layer_4_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.3.dwconv.weight",
      "stages.2.3.pwconv1.weight",
      "stages.2.3.pwconv2.weight",
      "stages.2.4.dwconv.weight",
      "stages.2.4.pwconv1.weight",
      "stages.2.4.pwconv2.weight",
      "stages.2.5.dwconv.weight",
      "stages.2.5.pwconv1.weight",
      "stages.2.5.pwconv2.weight"
    ],
    "lr_scale": 0.8461538461538461
  },
  "layer_5_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.6.gamma",
      "stages.2.6.dwconv.bias",
      "stages.2.6.norm.weight",
      "stages.2.6.norm.bias",
      "stages.2.6.pwconv1.bias",
      "stages.2.6.pwconv2.bias",
      "stages.2.7.gamma",
      "stages.2.7.dwconv.bias",
      "stages.2.7.norm.weight",
      "stages.2.7.norm.bias",
      "stages.2.7.pwconv1.bias",
      "stages.2.7.pwconv2.bias",
      "stages.2.8.gamma",
      "stages.2.8.dwconv.bias",
      "stages.2.8.norm.weight",
      "stages.2.8.norm.bias",
      "stages.2.8.pwconv1.bias",
      "stages.2.8.pwconv2.bias"
    ],
    "lr_scale": 0.8076923076923077
  },
  "layer_5_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.6.dwconv.weight",
      "stages.2.6.pwconv1.weight",
      "stages.2.6.pwconv2.weight",
      "stages.2.7.dwconv.weight",
      "stages.2.7.pwconv1.weight",
      "stages.2.7.pwconv2.weight",
      "stages.2.8.dwconv.weight",
      "stages.2.8.pwconv1.weight",
      "stages.2.8.pwconv2.weight"
    ],
    "lr_scale": 0.8076923076923077
  },
  "layer_6_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.9.gamma",
      "stages.2.9.dwconv.bias",
      "stages.2.9.norm.weight",
      "stages.2.9.norm.bias",
      "stages.2.9.pwconv1.bias",
      "stages.2.9.pwconv2.bias",
      "stages.2.10.gamma",
      "stages.2.10.dwconv.bias",
      "stages.2.10.norm.weight",
      "stages.2.10.norm.bias",
      "stages.2.10.pwconv1.bias",
      "stages.2.10.pwconv2.bias",
      "stages.2.11.gamma",
      "stages.2.11.dwconv.bias",
      "stages.2.11.norm.weight",
      "stages.2.11.norm.bias",
      "stages.2.11.pwconv1.bias",
      "stages.2.11.pwconv2.bias"
    ],
    "lr_scale": 0.7692307692307692
  },
  "layer_6_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.9.dwconv.weight",
      "stages.2.9.pwconv1.weight",
      "stages.2.9.pwconv2.weight",
      "stages.2.10.dwconv.weight",
      "stages.2.10.pwconv1.weight",
      "stages.2.10.pwconv2.weight",
      "stages.2.11.dwconv.weight",
      "stages.2.11.pwconv1.weight",
      "stages.2.11.pwconv2.weight"
    ],
    "lr_scale": 0.7692307692307692
  },
  "layer_7_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.12.gamma",
      "stages.2.12.dwconv.bias",
      "stages.2.12.norm.weight",
      "stages.2.12.norm.bias",
      "stages.2.12.pwconv1.bias",
      "stages.2.12.pwconv2.bias",
      "stages.2.13.gamma",
      "stages.2.13.dwconv.bias",
      "stages.2.13.norm.weight",
      "stages.2.13.norm.bias",
      "stages.2.13.pwconv1.bias",
      "stages.2.13.pwconv2.bias",
      "stages.2.14.gamma",
      "stages.2.14.dwconv.bias",
      "stages.2.14.norm.weight",
      "stages.2.14.norm.bias",
      "stages.2.14.pwconv1.bias",
      "stages.2.14.pwconv2.bias"
    ],
    "lr_scale": 0.7307692307692307
  },
  "layer_7_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.12.dwconv.weight",
      "stages.2.12.pwconv1.weight",
      "stages.2.12.pwconv2.weight",
      "stages.2.13.dwconv.weight",
      "stages.2.13.pwconv1.weight",
      "stages.2.13.pwconv2.weight",
      "stages.2.14.dwconv.weight",
      "stages.2.14.pwconv1.weight",
      "stages.2.14.pwconv2.weight"
    ],
    "lr_scale": 0.7307692307692307
  },
  "layer_8_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.15.gamma",
      "stages.2.15.dwconv.bias",
      "stages.2.15.norm.weight",
      "stages.2.15.norm.bias",
      "stages.2.15.pwconv1.bias",
      "stages.2.15.pwconv2.bias",
      "stages.2.16.gamma",
      "stages.2.16.dwconv.bias",
      "stages.2.16.norm.weight",
      "stages.2.16.norm.bias",
      "stages.2.16.pwconv1.bias",
      "stages.2.16.pwconv2.bias",
      "stages.2.17.gamma",
      "stages.2.17.dwconv.bias",
      "stages.2.17.norm.weight",
      "stages.2.17.norm.bias",
      "stages.2.17.pwconv1.bias",
      "stages.2.17.pwconv2.bias"
    ],
    "lr_scale": 0.6923076923076923
  },
  "layer_8_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.15.dwconv.weight",
      "stages.2.15.pwconv1.weight",
      "stages.2.15.pwconv2.weight",
      "stages.2.16.dwconv.weight",
      "stages.2.16.pwconv1.weight",
      "stages.2.16.pwconv2.weight",
      "stages.2.17.dwconv.weight",
      "stages.2.17.pwconv1.weight",
      "stages.2.17.pwconv2.weight"
    ],
    "lr_scale": 0.6923076923076923
  },
  "layer_9_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.18.gamma",
      "stages.2.18.dwconv.bias",
      "stages.2.18.norm.weight",
      "stages.2.18.norm.bias",
      "stages.2.18.pwconv1.bias",
      "stages.2.18.pwconv2.bias",
      "stages.2.19.gamma",
      "stages.2.19.dwconv.bias",
      "stages.2.19.norm.weight",
      "stages.2.19.norm.bias",
      "stages.2.19.pwconv1.bias",
      "stages.2.19.pwconv2.bias",
      "stages.2.20.gamma",
      "stages.2.20.dwconv.bias",
      "stages.2.20.norm.weight",
      "stages.2.20.norm.bias",
      "stages.2.20.pwconv1.bias",
      "stages.2.20.pwconv2.bias"
    ],
    "lr_scale": 0.6538461538461539
  },
  "layer_9_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.18.dwconv.weight",
      "stages.2.18.pwconv1.weight",
      "stages.2.18.pwconv2.weight",
      "stages.2.19.dwconv.weight",
      "stages.2.19.pwconv1.weight",
      "stages.2.19.pwconv2.weight",
      "stages.2.20.dwconv.weight",
      "stages.2.20.pwconv1.weight",
      "stages.2.20.pwconv2.weight"
    ],
    "lr_scale": 0.6538461538461539
  },
  "layer_10_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.21.gamma",
      "stages.2.21.dwconv.bias",
      "stages.2.21.norm.weight",
      "stages.2.21.norm.bias",
      "stages.2.21.pwconv1.bias",
      "stages.2.21.pwconv2.bias",
      "stages.2.22.gamma",
      "stages.2.22.dwconv.bias",
      "stages.2.22.norm.weight",
      "stages.2.22.norm.bias",
      "stages.2.22.pwconv1.bias",
      "stages.2.22.pwconv2.bias",
      "stages.2.23.gamma",
      "stages.2.23.dwconv.bias",
      "stages.2.23.norm.weight",
      "stages.2.23.norm.bias",
      "stages.2.23.pwconv1.bias",
      "stages.2.23.pwconv2.bias"
    ],
    "lr_scale": 0.6153846153846154
  },
  "layer_10_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.21.dwconv.weight",
      "stages.2.21.pwconv1.weight",
      "stages.2.21.pwconv2.weight",
      "stages.2.22.dwconv.weight",
      "stages.2.22.pwconv1.weight",
      "stages.2.22.pwconv2.weight",
      "stages.2.23.dwconv.weight",
      "stages.2.23.pwconv1.weight",
      "stages.2.23.pwconv2.weight"
    ],
    "lr_scale": 0.6153846153846154
  },
  "layer_11_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "stages.2.24.gamma",
      "stages.2.24.dwconv.bias",
      "stages.2.24.norm.weight",
      "stages.2.24.norm.bias",
      "stages.2.24.pwconv1.bias",
      "stages.2.24.pwconv2.bias",
      "stages.2.25.gamma",
      "stages.2.25.dwconv.bias",
      "stages.2.25.norm.weight",
      "stages.2.25.norm.bias",
      "stages.2.25.pwconv1.bias",
      "stages.2.25.pwconv2.bias",
      "stages.2.26.gamma",
      "stages.2.26.dwconv.bias",
      "stages.2.26.norm.weight",
      "stages.2.26.norm.bias",
      "stages.2.26.pwconv1.bias",
      "stages.2.26.pwconv2.bias"
    ],
    "lr_scale": 0.5769230769230769
  },
  "layer_11_decay": {
    "weight_decay": 0.5,
    "params": [
      "stages.2.24.dwconv.weight",
      "stages.2.24.pwconv1.weight",
      "stages.2.24.pwconv2.weight",
      "stages.2.25.dwconv.weight",
      "stages.2.25.pwconv1.weight",
      "stages.2.25.pwconv2.weight",
      "stages.2.26.dwconv.weight",
      "stages.2.26.pwconv1.weight",
      "stages.2.26.pwconv2.weight"
    ],
    "lr_scale": 0.5769230769230769
  },
  "layer_13_no_decay": {
    "weight_decay": 0.0,
    "params": [
      "norm.weight",
      "norm.bias",
      "head.bias"
    ],
    "lr_scale": 0.5
  },
  "layer_13_decay": {
    "weight_decay": 0.5,
    "params": [
      "head.weight"
    ],
    "lr_scale": 0.5
  }
}

The grouping follows the rules: parameters with and without weight decay are separated, and parameters are also split by layer, each with its own lr_scale. There were 13 layers in total (0 through 12), but head counts as "other parameters", so it lands in a 14th layer (layer 13). The printing order is scrambled because groups appear in the order they are first encountered.

Now look at the return value. As seen in the code, the dict's values are converted to a list before being returned, so we iterate over that list:

for param_group in parameters:
    print(param_group.keys())

which gives:

dict_keys(['weight_decay', 'params', 'lr_scale'])

That is, the returned list contains the dicts configured earlier, each holding weight_decay, params, and lr_scale. Because the return calls values(), the group names such as decay, no_decay, and layer_0_decay are gone. Once the optimizer reads these dicts in, it automatically picks up the settings configured in each of them.
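Note that torch optimizers keep but ignore extra per-group keys such as lr_scale; in BEiT-style training loops the scale is applied by hand when setting each group's learning rate. A sketch with toy groups (the base learning rate and scales below are made-up values):

```python
# Sketch: torch optimizers ignore unknown per-group keys like "lr_scale",
# so training code applies it manually when (re)setting each group's lr.
base_lr = 4e-3  # illustrative base learning rate
param_groups = [
    {"params": [], "weight_decay": 0.5, "lr_scale": 1.0},
    {"params": [], "weight_decay": 0.0, "lr_scale": 0.5},
]
for group in param_groups:
    group["lr"] = base_lr * group["lr_scale"]

print([g["lr"] for g in param_groups])  # [0.004, 0.002]
```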

(not finished)


Optimizer Factory -- Writing an Optimizer Factory with Layer-wise Learning Rate Decay
https://jesseprince.github.io/2024/01/28/pytorch/optimfact/
Author: 林正
Published: January 28, 2024