Starting from an analysis of the similarities and differences between bert and adapter-bert
tokenization.py: the two files are completely identical, with no difference at all (character-for-character the same).
run_classifier.py: the two differ only slightly.
In bert, the class
class XnliProcessor(DataProcessor):
is removed in adapter-bert.
For output_weights and output_bias at lines 593 and 596 of bert, the tf.get_variable calls gain the argument
collections=["head", tf.GraphKeys.GLOBAL_VARIABLES]
The variables defined here are independent of the pre-trained model: they are the topmost classification layer added for the downstream classifier task, which is why the global collection they are added to is named "head".
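A minimal sketch of what this looks like in run_classifier.py (the shapes, initializers, and surrounding names here follow bert's create_model and are assumptions, not a verbatim quote of adapter-bert):

import tensorflow as tf  # TF 1.x

num_labels, hidden_size = 2, 768  # example sizes, for illustration only

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02),
    # new in adapter-bert: register the head variables in their own collection
    collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])

output_bias = tf.get_variable(
    "output_bias", [num_labels],
    initializer=tf.zeros_initializer(),
    # new in adapter-bert
    collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])

Because these variables are still listed under tf.GraphKeys.GLOBAL_VARIABLES, they continue to be saved in checkpoints; the extra "head" collection only matters for selecting trainable parameters in optimization.py.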
At line 787 of bert, the processor entry
"xnli": XnliProcessor,
is removed in adapter-bert.
- Removing XnliProcessor has no impact at all here, because the class is otherwise unused: Found 0 references in 0 files
optimization.py: the two differ slightly; the main change is that only the adapter-related parameters are selected for updating.
AdamWeightDecayOptimizer gains a new argument
adapter_weight_decay_rate=0.01
(lines 62, 96, and 107 of adapter-bert).
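A sketch of where the new argument sits in the constructor (the surrounding argument list and defaults follow bert's optimization.py; the exact position and default of the new argument in adapter-bert are assumptions):

import tensorflow as tf  # TF 1.x

class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """Constructor sketch only; the update logic is discussed below."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               adapter_weight_decay_rate=0.01,  # new in adapter-bert
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    super(AdamWeightDecayOptimizer, self).__init__(False, name)
    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.adapter_weight_decay_rate = adapter_weight_decay_rate  # new field
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay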
Line 70 of bert defines the set of parameters to be trained for the whole model; originally it was
tvars = tf.trainable_variables()
In adapter-bert this becomes:
tvars = []
for collection in ["adapters", "layer_norm", "head"]:
  tvars += tf.get_collection(collection)
At line 112, adapter-bert adds:
self._adapter_variable_names = {
    self._get_variable_name(v.name) for v in tf.get_collection("adapters")
}
At line 155 of adapter-bert, the original bert code is:
if self._do_use_weight_decay(param_name):
  update += self.weight_decay_rate * param
while adapter-bert has:
if self._do_use_weight_decay(param_name):
  if param_name in self._adapter_variable_names:
    update += self.adapter_weight_decay_rate * param
  else:
    update += self.weight_decay_rate * param
The last difference is the definition of
def _do_use_weight_decay(self, param_name):
which decides whether L2 weight decay is applied to a parameter. In bert it is:
def _do_use_weight_decay(self, param_name):
  """Whether to use L2 weight decay for `param_name`."""
  if not self.weight_decay_rate:
    return False
  if self.exclude_from_weight_decay:
    for r in self.exclude_from_weight_decay:
      if re.search(r, param_name) is not None:
        return False
  return True
while in adapter-bert it is:
def _do_use_weight_decay(self, param_name):
  """Whether to use L2 weight decay for `param_name`."""
  if param_name in self._adapter_variable_names:
    if not self.adapter_weight_decay_rate:
      return False
  else:
    if not self.weight_decay_rate:
      return False
  if self.exclude_from_weight_decay:
    for r in self.exclude_from_weight_decay:
      if re.search(r, param_name) is not None:
        return False
  return True
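Putting the optimization.py changes together, a condensed sketch of how only the adapter, layer-norm, and head variables end up being trained (this is not code from either repo; it assumes the AdamWeightDecayOptimizer above is in scope and omits create_optimizer's learning-rate schedule and gradient clipping):

import tensorflow as tf  # TF 1.x graph mode

def sketch_create_train_op(loss, learning_rate=3e-4):
  # Only variables registered in these three collections are trained;
  # the original pre-trained BERT weights stay frozen.
  tvars = []
  for collection in ["adapters", "layer_norm", "head"]:
    tvars += tf.get_collection(collection)

  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      adapter_weight_decay_rate=0.01,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

  grads = tf.gradients(loss, tvars)
  global_step = tf.train.get_or_create_global_step()
  return optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)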
The most important file is modeling.py; the line numbers below refer to adapter-bert.
Line 139: the __init__ function (BertModel's constructor) gains the parameter
adapter_fn="feedforward_adapter"
Parameter description: adapter_fn: (optional) string identifying a trainable adapter that takes a Tensor as input and returns a Tensor of the same shape.
In other words, a tensor's shape is unchanged after it passes through the adapter module.
Line 208: the call
self.all_encoder_layers = transformer_model(...)
gains the argument adapter_fn=get_adapter(adapter_fn)
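From the caller's perspective, enabling adapters is just one constructor argument. A hedged usage sketch (the other BertModel arguments follow bert's modeling.py; the placeholder shapes and config values are illustrative assumptions):

import tensorflow as tf
import modeling  # bert/adapter-bert modeling.py

input_ids = tf.placeholder(tf.int32, [None, 128])
input_mask = tf.placeholder(tf.int32, [None, 128])
config = modeling.BertConfig(vocab_size=30522)

model = modeling.BertModel(
    config=config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    adapter_fn="feedforward_adapter")  # the string is resolved by get_adapter

sequence_output = model.get_sequence_output()  # adapters applied inside every layer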
In the diff tool, a blue mark in the rightmost gutter means code that was newly added, while a black mark means existing code was modified. Comparing the two modeling.py files shows many blue regions, while the black (modified) regions are all very small.
The functions newly added in adapter-bert, which are also the core of the adapter method (starting at line 321):
def feedforward_adapter(input_tensor, hidden_size=64, init_scale=1e-3):
  """A feedforward adapter layer with a bottleneck.

  Implements a bottleneck layer with a user-specified nonlinearity and an
  identity residual connection. All variables created are added to the
  "adapters" collection.

  Args:
    input_tensor: input Tensor of shape [batch size, hidden dimension]
    hidden_size: dimension of the bottleneck layer.
    init_scale: Scale of the initialization distribution used for weights.

  Returns:
    Tensor of the same shape as x.
  """
  with tf.variable_scope("adapters"):
    in_size = input_tensor.get_shape().as_list()[1]
    w1 = tf.get_variable(
        "weights1", [in_size, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=init_scale),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    b1 = tf.get_variable(
        "biases1", [1, hidden_size],
        initializer=tf.zeros_initializer(),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    net = tf.tensordot(input_tensor, w1, [[1], [0]]) + b1
    net = gelu(net)
    w2 = tf.get_variable(
        "weights2", [hidden_size, in_size],
        initializer=tf.truncated_normal_initializer(stddev=init_scale),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    b2 = tf.get_variable(
        "biases2", [1, in_size],
        initializer=tf.zeros_initializer(),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    net = tf.tensordot(net, w2, [[1], [0]]) + b2
    return net + input_tensor
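A quick sanity check of the two properties that matter here, shape preservation and registration in the "adapters" collection (assumes TF 1.x graph mode and that feedforward_adapter and gelu from modeling.py are in scope):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 768])      # [batch*seq_len, hidden]
y = feedforward_adapter(x, hidden_size=64)

print(y.shape)                                   # (?, 768): same shape as the input
print([v.name for v in tf.get_collection("adapters")])
# weights1, biases1, weights2, biases2: exactly the variables trained later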
def get_adapter(function_string):
  """Maps a string to a Python function.

  Args:
    function_string: String name of the adapter function.

  Returns:
    A Python function corresponding to the adapter function. If
    `function_string` is None or empty, will return None.
    If `function_string` is not a string, it will return `function_string`.

  Raises:
    ValueError: The `function_string` does not correspond to a known
      adapter.
  """
  # We assume that anything that's not a string is already an adapter
  # function, so we just return it.
  if not isinstance(function_string, six.string_types):
    return function_string

  if not function_string:
    return None

  fn = function_string.lower()
  if fn == "feedforward_adapter":
    return feedforward_adapter
  else:
    raise ValueError("Unsupported adapters: %s" % fn)
Note that the adapter parameters are added to a global collection; the adapter implementation itself is very simple.
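A quick illustration of the three branches of get_adapter (behaviour follows the docstring above; assumes the modeling.py functions are in scope):

print(get_adapter("feedforward_adapter"))   # the feedforward_adapter function
print(get_adapter(None))                    # None: adapters disabled
print(get_adapter(feedforward_adapter))     # a callable is passed through unchanged
# get_adapter("my_other_adapter") would raise ValueError("Unsupported adapters: ...")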
Line 439: the layer_norm modification:
def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name,
      variables_collections=["layer_norm", tf.GraphKeys.GLOBAL_VARIABLES])
Only the variables_collections argument on the last line is newly added.
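The effect, sketched below, is that the layer-norm scale/offset variables land in the "layer_norm" collection that optimization.py later gathers (assumes TF 1.x with tf.contrib available; the printed variable names are illustrative):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128, 768])
_ = layer_norm(x, name="demo_norm")

print([v.name for v in tf.get_collection("layer_norm")])
# e.g. ['demo_norm/beta:0', 'demo_norm/gamma:0'], both trainable in adapter-bert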
At line 843,
def transformer_model(...)
gains the parameter adapter_fn=None.
The adapter step is inserted in two places: into attention_output at line 938 and into layer_output at line 964:
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
  attention_output = tf.layers.dense(
      attention_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  attention_output = dropout(attention_output, hidden_dropout_prob)
  if adapter_fn:
    attention_output = adapter_fn(attention_output)
  attention_output = layer_norm(attention_output + layer_input)
# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
  layer_output = tf.layers.dense(
      intermediate_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  layer_output = dropout(layer_output, hidden_dropout_prob)
  if adapter_fn:
    layer_output = adapter_fn(layer_output)
  layer_output = layer_norm(layer_output + attention_output)
prev_output = layer_output
all_layer_outputs.append(layer_output)
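In both places the pattern is the same: the adapter sits after the projection and dropout, right before the residual add and layer norm. A condensed, hedged restatement (not verbatim transformer_model code; the helper name and inlined dropout are my own, and layer_norm refers to the modified function above):

import tensorflow as tf

def adapted_sublayer(layer_input, sublayer_output, hidden_size,
                     hidden_dropout_prob=0.1, adapter_fn=None):
  """One post-sublayer block as adapter-bert wires it:
  dense -> dropout -> adapter -> residual add -> layer norm."""
  h = tf.layers.dense(sublayer_output, hidden_size)
  h = tf.nn.dropout(h, keep_prob=1.0 - hidden_dropout_prob)
  if adapter_fn:
    h = adapter_fn(h)        # same shape in, same shape out
  return layer_norm(h + layer_input)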