A preliminary analysis of adapter-BERT

We start from a comparison of the similarities and differences between BERT and adapter-BERT, file by file.

tokenization.py is completely identical in the two repositories; there is no difference at all (character for character).


run_classifier.py has only minor differences:

  1. The class XnliProcessor(DataProcessor): present in BERT is removed in adapter-bert.

  2. For output_weights and output_bias at lines 593 and 596 of BERT, the tf.get_variable calls gain the extra argument collections=["head", tf.GraphKeys.GLOBAL_VARIABLES].

    The parameters defined here are independent of the pretrained model: they form the top-most classification layer added for the downstream classifier task, which is why they are registered under a new global collection named "head" (see the sketch after this list).

  3. The entry "xnli": XnliProcessor, at line 787 of BERT is removed in adapter-bert.

  • In this file, removing XnliProcessor has no effect at all, because the class has: Found 0 references in 0 files.
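  • As a rough sketch of the change in point 2 (the variable names, shapes and initializers follow BERT's create_model; this is a reconstruction rather than a verbatim copy of the diff, and num_labels / hidden_size are toy values):

    import tensorflow as tf  # TF 1.x

    num_labels, hidden_size = 3, 768  # toy values for illustration only

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02),
        collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])  # new argument in adapter-bert

    output_bias = tf.get_variable(
        "output_bias", [num_labels],
        initializer=tf.zeros_initializer(),
        collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])  # new argument in adapter-bert

    The optimizer later picks these variables up via tf.get_collection("head").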

optimization.py has a few small differences, whose main purpose is to restrict the parameters that get updated to the adapter layers (together with layer norm and the classification head):

  1. A new parameter adapter_weight_decay_rate=0.01 is added to AdamWeightDecayOptimizer (lines 62, 96 and 107 of adapter-bert).

  2. Line 70 of BERT originally collected all trainable parameters of the model as tvars = tf.trainable_variables().

    In adapter-bert this becomes (a toy sketch after this list shows why this freezes the pretrained weights):

    tvars = []
    for collection in ["adapters", "layer_norm", "head"]:
      tvars += tf.get_collection(collection)
  3. Line 112 of adapter-bert adds:

    self._adapter_variable_names = {
        self._get_variable_name(v.name) for v in tf.get_collection("adapters")
    }
  4. At line 155 of adapter-bert, where BERT originally has

    if self._do_use_weight_decay(param_name):
      update += self.weight_decay_rate * param

    adapter-bert instead has

    if self._do_use_weight_decay(param_name):
      if param_name in self._adapter_variable_names:
        update += self.adapter_weight_decay_rate * param
      else:
        update += self.weight_decay_rate * param
  5. The last difference is the definition of def _do_use_weight_decay(self, param_name):, which decides whether L2 weight decay is applied to a parameter.

    In BERT it is

    def _do_use_weight_decay(self, param_name):
      """Whether to use L2 weight decay for `param_name`."""
      if not self.weight_decay_rate:
        return False
      if self.exclude_from_weight_decay:
        for r in self.exclude_from_weight_decay:
          if re.search(r, param_name) is not None:
            return False
      return True

    while in adapter-bert it becomes

    def _do_use_weight_decay(self, param_name):
      """Whether to use L2 weight decay for `param_name`."""
      if param_name in self._adapter_variable_names:
        if not self.adapter_weight_decay_rate:
          return False
      else:
        if not self.weight_decay_rate:
          return False

      if self.exclude_from_weight_decay:
        for r in self.exclude_from_weight_decay:
          if re.search(r, param_name) is not None:
            return False

      return True
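  • A minimal, self-contained sketch (toy variable names, TF 1.x assumed) of why the change in point 2 freezes the pretrained weights: a variable created with an explicit collections list that omits TRAINABLE_VARIABLES never appears in tf.trainable_variables(), so only what is gathered explicitly from the "adapters", "layer_norm" and "head" collections is handed to the optimizer.

    import tensorflow as tf  # TF 1.x

    # A pretrained-style variable: by default it joins both
    # GLOBAL_VARIABLES and TRAINABLE_VARIABLES.
    backbone = tf.get_variable("backbone_kernel", [4, 4])

    # An adapter-style variable: only in "adapters" and GLOBAL_VARIABLES,
    # mirroring how feedforward_adapter creates its weights.
    with tf.variable_scope("adapters"):
      adapter = tf.get_variable(
          "weights1", [4, 2],
          collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])

    # The adapter-bert way of collecting the variables to train.
    tvars = []
    for collection in ["adapters", "layer_norm", "head"]:
      tvars += tf.get_collection(collection)

    print([v.name for v in tvars])                     # ['adapters/weights1:0']
    print([v.name for v in tf.trainable_variables()])  # ['backbone_kernel:0']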

The most important file is modeling.py. The line numbers below refer to adapter-bert.

  1. The __init__ function at line 139 gains the parameter adapter_fn="feedforward_adapter" (a usage sketch is given at the end of this note).

    The docstring explains it as: adapter_fn: (optional) string identifying a trainable adapter that takes a Tensor as input and returns a Tensor of the same shape.

    The key point is that passing a tensor through the adapter module does not change its shape.

    At line 208, self.all_encoder_layers = transformer_model(...) gains the argument adapter_fn=get_adapter(adapter_fn).

  2. In the diff tool, the blue marks in the right-hand gutter mean newly added code, while the black marks mean modified code.

    Comparing the two modeling.py files, there are many blue regions, and the black (modified) parts are all very small.

    The functions newly added in adapter-bert, which are the core of the adapter method, start at line 321:

    def feedforward_adapter(input_tensor, hidden_size=64, init_scale=1e-3):
      """A feedforward adapter layer with a bottleneck.

      Implements a bottleneck layer with a user-specified nonlinearity and an
      identity residual connection. All variables created are added to the
      "adapters" collection.

      Args:
        input_tensor: input Tensor of shape [batch size, hidden dimension]
        hidden_size: dimension of the bottleneck layer.
        init_scale: Scale of the initialization distribution used for weights.

      Returns:
        Tensor of the same shape as x.
      """
      with tf.variable_scope("adapters"):
        in_size = input_tensor.get_shape().as_list()[1]
        w1 = tf.get_variable(
            "weights1", [in_size, hidden_size],
            initializer=tf.truncated_normal_initializer(stddev=init_scale),
            collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
        b1 = tf.get_variable(
            "biases1", [1, hidden_size],
            initializer=tf.zeros_initializer(),
            collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
        net = tf.tensordot(input_tensor, w1, [[1], [0]]) + b1

        net = gelu(net)

        w2 = tf.get_variable(
            "weights2", [hidden_size, in_size],
            initializer=tf.truncated_normal_initializer(stddev=init_scale),
            collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
        b2 = tf.get_variable(
            "biases2", [1, in_size],
            initializer=tf.zeros_initializer(),
            collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
        net = tf.tensordot(net, w2, [[1], [0]]) + b2

      return net + input_tensor
    def get_adapter(function_string):
      """Maps a string to a Python function.

      Args:
        function_string: String name of the adapter function.

      Returns:
        A Python function corresponding to the adapter function.
        If `function_string` is None or empty, it will return None.
        If `function_string` is not a string, it will return `function_string`.

      Raises:
        ValueError: The `function_string` does not correspond to a known
          adapter.
      """

      # We assume that anything that's not a string is already an adapter
      # function, so we just return it.
      if not isinstance(function_string, six.string_types):
        return function_string

      if not function_string:
        return None

      fn = function_string.lower()
      if fn == "feedforward_adapter":
        return feedforward_adapter
      else:
        raise ValueError("Unsupported adapters: %s" % fn)

    Note that the adapter parameters are added both to the "adapters" collection and to the global variables collection; the adapter implementation itself is very simple.
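    A quick sanity check, as a hypothetical snippet (it assumes it runs in the context of modeling.py, where gelu and feedforward_adapter are defined, under TF 1.x), of two properties discussed so far: the output keeps the input shape (point 1), and the new variables live only in the "adapters" collection and GLOBAL_VARIABLES, not in TRAINABLE_VARIABLES.

    import tensorflow as tf  # TF 1.x
    # assumes modeling.py's feedforward_adapter (and gelu) are in scope

    x = tf.placeholder(tf.float32, [None, 768])
    y = feedforward_adapter(x)

    print(y.shape)  # (?, 768) -- same shape as the input
    print([v.name for v in tf.get_collection("adapters")])
    # ['adapters/weights1:0', 'adapters/biases1:0',
    #  'adapters/weights2:0', 'adapters/biases2:0']
    print(tf.trainable_variables())
    # [] -- which is exactly why optimization.py gathers tvars from the
    # named collections instead of from tf.trainable_variables()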

  3. The modification to layer_norm at line 439:

    def layer_norm(input_tensor, name=None):
      """Run layer normalization on the last dimension of the tensor."""
      return tf.contrib.layers.layer_norm(
          inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name,
          variables_collections=["layer_norm", tf.GraphKeys.GLOBAL_VARIABLES])

    The variables_collections argument on the last line is the newly added part.

  4. At line 843, def transformer_model(...) gains the parameter adapter_fn=None.

    The adapter is inserted at two points in the flow: at attention_output (line 938) and at layer_output (line 964):

    # Run a linear projection of `hidden_size` then add a residual
    # with `layer_input`.
    with tf.variable_scope("output"):
      attention_output = tf.layers.dense(
          attention_output,
          hidden_size,
          kernel_initializer=create_initializer(initializer_range))
      attention_output = dropout(attention_output, hidden_dropout_prob)
      if adapter_fn:
        attention_output = adapter_fn(attention_output)
      attention_output = layer_norm(attention_output + layer_input)
    # Down-project back to `hidden_size` then add the residual.
    with tf.variable_scope("output"):
      layer_output = tf.layers.dense(
          intermediate_output,
          hidden_size,
          kernel_initializer=create_initializer(initializer_range))
      layer_output = dropout(layer_output, hidden_dropout_prob)
      if adapter_fn:
        layer_output = adapter_fn(layer_output)
      layer_output = layer_norm(layer_output + attention_output)
      prev_output = layer_output
      all_layer_outputs.append(layer_output)
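Putting the pieces together, here is a hedged usage sketch of building the model with adapters enabled. It assumes the constructor otherwise matches stock BERT's modeling.BertModel plus the new adapter_fn string from adapter-bert; the config values are toy numbers, not a real checkpoint:

    import tensorflow as tf  # TF 1.x
    import modeling          # adapter-bert's modeling.py on PYTHONPATH

    # Toy configuration just to build a graph; real runs load bert_config.json.
    bert_config = modeling.BertConfig(
        vocab_size=32, hidden_size=16, num_hidden_layers=2,
        num_attention_heads=2, intermediate_size=32)

    input_ids = tf.constant([[1, 2, 3, 0]], dtype=tf.int32)

    model = modeling.BertModel(
        config=bert_config,
        is_training=True,
        input_ids=input_ids,
        adapter_fn="feedforward_adapter")  # the new argument added by adapter-bert

    # Same shape as a plain BERT encoder: [batch_size, seq_length, hidden_size].
    sequence_output = model.get_sequence_output()

    # Gather the variables that would actually be trained, exactly as
    # optimization.py does it (here only adapter and layer-norm variables,
    # since no classification head has been built yet).
    tvars = []
    for collection in ["adapters", "layer_norm", "head"]:
      tvars += tf.get_collection(collection)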