Starting from an analysis of the similarities and differences between bert and adapter-bert
tokenization.py: the two files are completely identical, with no difference at all (character-for-character the same).
run_classifier.py: the two differ only slightly.
In bert, the class
class XnliProcessor(DataProcessor):
is removed in adapter-bert.
For output_weights and output_bias at lines 593 and 596 of bert, the tf.get_variable calls gain the argument
collections=["head", tf.GraphKeys.GLOBAL_VARIABLES]
The variables defined here are independent of the pre-trained model: they are the topmost classification layer added for the downstream classifier task, which is why the global collection they are added to is named "head".
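A minimal sketch of what this looks like in run_classifier.py (the shapes, initializers, and surrounding names here follow bert's create_model and are assumptions, not a verbatim quote of adapter-bert):

import tensorflow as tf  # TF 1.x

num_labels, hidden_size = 2, 768  # example sizes, for illustration only

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02),
    # new in adapter-bert: register the head variables in their own collection
    collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])

output_bias = tf.get_variable(
    "output_bias", [num_labels],
    initializer=tf.zeros_initializer(),
    # new in adapter-bert
    collections=["head", tf.GraphKeys.GLOBAL_VARIABLES])

Because these variables are still listed under tf.GraphKeys.GLOBAL_VARIABLES, they continue to be saved in checkpoints; the extra "head" collection only matters for selecting trainable parameters in optimization.py.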
At line 787 of bert, the processor entry
"xnli": XnliProcessor,
is removed in adapter-bert.
- Removing XnliProcessor has no impact at all here, because the class is otherwise unused: Found 0 references in 0 files
optimization.py: the two differ slightly; the main change is that only the adapter-related parameters are selected for updating.
AdamWeightDecayOptimizer gains a new argument
adapter_weight_decay_rate=0.01
(lines 62, 96, and 107 of adapter-bert).
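A sketch of where the new argument sits in the constructor (the surrounding argument list and defaults follow bert's optimization.py; the exact position and default of the new argument in adapter-bert are assumptions):

import tensorflow as tf  # TF 1.x

class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """Constructor sketch only; the update logic is discussed below."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               adapter_weight_decay_rate=0.01,  # new in adapter-bert
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    super(AdamWeightDecayOptimizer, self).__init__(False, name)
    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.adapter_weight_decay_rate = adapter_weight_decay_rate  # new field
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay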
Line 70 of bert defines the set of parameters to be trained for the whole model; originally it was
tvars = tf.trainable_variables()
In adapter-bert this becomes:
tvars = []
for collection in ["adapters", "layer_norm", "head"]:
  tvars += tf.get_collection(collection)
At line 112, adapter-bert adds:
self._adapter_variable_names = {
    self._get_variable_name(v.name) for v in tf.get_collection("adapters")
}
At line 155 of adapter-bert, the original bert code is:
if self._do_use_weight_decay(param_name):
  update += self.weight_decay_rate * param
while adapter-bert has:
if self._do_use_weight_decay(param_name):
  if param_name in self._adapter_variable_names:
    update += self.adapter_weight_decay_rate * param
  else:
    update += self.weight_decay_rate * param
The last difference is the definition of
def _do_use_weight_decay(self, param_name):
which decides whether L2 weight decay is applied to a parameter. In bert it is:
def _do_use_weight_decay(self, param_name):
  """Whether to use L2 weight decay for `param_name`."""
  if not self.weight_decay_rate:
    return False
  if self.exclude_from_weight_decay:
    for r in self.exclude_from_weight_decay:
      if re.search(r, param_name) is not None:
        return False
  return True
while in adapter-bert it is:
def _do_use_weight_decay(self, param_name):
  """Whether to use L2 weight decay for `param_name`."""
  if param_name in self._adapter_variable_names:
    if not self.adapter_weight_decay_rate:
      return False
  else:
    if not self.weight_decay_rate:
      return False
  if self.exclude_from_weight_decay:
    for r in self.exclude_from_weight_decay:
      if re.search(r, param_name) is not None:
        return False
  return True
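Putting the optimization.py changes together, a condensed sketch of how only the adapter, layer-norm, and head variables end up being trained (this is not code from either repo; it assumes the AdamWeightDecayOptimizer above is in scope and omits create_optimizer's learning-rate schedule and gradient clipping):

import tensorflow as tf  # TF 1.x graph mode

def sketch_create_train_op(loss, learning_rate=3e-4):
  # Only variables registered in these three collections are trained;
  # the original pre-trained BERT weights stay frozen.
  tvars = []
  for collection in ["adapters", "layer_norm", "head"]:
    tvars += tf.get_collection(collection)

  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      adapter_weight_decay_rate=0.01,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

  grads = tf.gradients(loss, tvars)
  global_step = tf.train.get_or_create_global_step()
  return optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)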
The most important file is modeling.py; the line numbers below refer to adapter-bert.
Line 139: the __init__ function (BertModel's constructor) gains the parameter
adapter_fn="feedforward_adapter"
Parameter description: adapter_fn: (optional) string identifying a trainable adapter that takes a Tensor as input and returns a Tensor of the same shape.
In other words, a tensor's shape is unchanged after it passes through the adapter module.
Line 208: the call
self.all_encoder_layers = transformer_model(...)
gains the argument adapter_fn=get_adapter(adapter_fn)
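From the caller's perspective, enabling adapters is just one constructor argument. A hedged usage sketch (the other BertModel arguments follow bert's modeling.py; the placeholder shapes and config values are illustrative assumptions):

import tensorflow as tf
import modeling  # bert/adapter-bert modeling.py

input_ids = tf.placeholder(tf.int32, [None, 128])
input_mask = tf.placeholder(tf.int32, [None, 128])
config = modeling.BertConfig(vocab_size=30522)

model = modeling.BertModel(
    config=config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    adapter_fn="feedforward_adapter")  # the string is resolved by get_adapter

sequence_output = model.get_sequence_output()  # adapters applied inside every layer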
In the diff tool, a blue mark in the rightmost gutter means code that was newly added, while a black mark means existing code was modified. Comparing the two modeling.py files shows many blue regions, while the black (modified) regions are all very small.
The functions newly added in adapter-bert, which are also the core of the adapter method (starting at line 321):
def feedforward_adapter(input_tensor, hidden_size=64, init_scale=1e-3):
  """A feedforward adapter layer with a bottleneck.

  Implements a bottleneck layer with a user-specified nonlinearity and an
  identity residual connection. All variables created are added to the
  "adapters" collection.

  Args:
    input_tensor: input Tensor of shape [batch size, hidden dimension]
    hidden_size: dimension of the bottleneck layer.
    init_scale: Scale of the initialization distribution used for weights.

  Returns:
    Tensor of the same shape as x.
  """
  with tf.variable_scope("adapters"):
    in_size = input_tensor.get_shape().as_list()[1]
    w1 = tf.get_variable(
        "weights1", [in_size, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=init_scale),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    b1 = tf.get_variable(
        "biases1", [1, hidden_size],
        initializer=tf.zeros_initializer(),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    net = tf.tensordot(input_tensor, w1, [[1], [0]]) + b1
    net = gelu(net)
    w2 = tf.get_variable(
        "weights2", [hidden_size, in_size],
        initializer=tf.truncated_normal_initializer(stddev=init_scale),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    b2 = tf.get_variable(
        "biases2", [1, in_size],
        initializer=tf.zeros_initializer(),
        collections=["adapters", tf.GraphKeys.GLOBAL_VARIABLES])
    net = tf.tensordot(net, w2, [[1], [0]]) + b2
    return net + input_tensor
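A quick sanity check of the two properties that matter here, shape preservation and registration in the "adapters" collection (assumes TF 1.x graph mode and that feedforward_adapter and gelu from modeling.py are in scope):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 768])      # [batch*seq_len, hidden]
y = feedforward_adapter(x, hidden_size=64)

print(y.shape)                                   # (?, 768): same shape as the input
print([v.name for v in tf.get_collection("adapters")])
# weights1, biases1, weights2, biases2: exactly the variables trained later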
def get_adapter(function_string):
  """Maps a string to a Python function.

  Args:
    function_string: String name of the adapter function.

  Returns:
    A Python function corresponding to the adapter function. If
    `function_string` is None or empty, will return None.
    If `function_string` is not a string, it will return `function_string`.

  Raises:
    ValueError: The `function_string` does not correspond to a known
      adapter.
  """
  # We assume that anything that's not a string is already an adapter
  # function, so we just return it.
  if not isinstance(function_string, six.string_types):
    return function_string

  if not function_string:
    return None

  fn = function_string.lower()
  if fn == "feedforward_adapter":
    return feedforward_adapter
  else:
    raise ValueError("Unsupported adapters: %s" % fn)
Note that the adapter parameters are added to a global collection; the adapter implementation itself is very simple.
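A quick illustration of the three branches of get_adapter (behaviour follows the docstring above; assumes the modeling.py functions are in scope):

print(get_adapter("feedforward_adapter"))   # the feedforward_adapter function
print(get_adapter(None))                    # None: adapters disabled
print(get_adapter(feedforward_adapter))     # a callable is passed through unchanged
# get_adapter("my_other_adapter") would raise ValueError("Unsupported adapters: ...")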
Line 439: the layer_norm modification:
def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name,
      variables_collections=["layer_norm", tf.GraphKeys.GLOBAL_VARIABLES])
Only the variables_collections argument on the last line is newly added.
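The effect, sketched below, is that the layer-norm scale/offset variables land in the "layer_norm" collection that optimization.py later gathers (assumes TF 1.x with tf.contrib available; the printed variable names are illustrative):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128, 768])
_ = layer_norm(x, name="demo_norm")

print([v.name for v in tf.get_collection("layer_norm")])
# e.g. ['demo_norm/beta:0', 'demo_norm/gamma:0'], both trainable in adapter-bert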
At line 843,
def transformer_model(...)
gains the parameter adapter_fn=None.
The adapter step is inserted in two places: into attention_output at line 938 and into layer_output at line 964:
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
  attention_output = tf.layers.dense(
      attention_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  attention_output = dropout(attention_output, hidden_dropout_prob)
  if adapter_fn:
    attention_output = adapter_fn(attention_output)
  attention_output = layer_norm(attention_output + layer_input)
# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
  layer_output = tf.layers.dense(
      intermediate_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  layer_output = dropout(layer_output, hidden_dropout_prob)
  if adapter_fn:
    layer_output = adapter_fn(layer_output)
  layer_output = layer_norm(layer_output + attention_output)
prev_output = layer_output
all_layer_outputs.append(layer_output)
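In both places the pattern is the same: the adapter sits after the projection and dropout, right before the residual add and layer norm. A condensed, hedged restatement (not verbatim transformer_model code; the helper name and inlined dropout are my own, and layer_norm refers to the modified function above):

import tensorflow as tf

def adapted_sublayer(layer_input, sublayer_output, hidden_size,
                     hidden_dropout_prob=0.1, adapter_fn=None):
  """One post-sublayer block as adapter-bert wires it:
  dense -> dropout -> adapter -> residual add -> layer norm."""
  h = tf.layers.dense(sublayer_output, hidden_size)
  h = tf.nn.dropout(h, keep_prob=1.0 - hidden_dropout_prob)
  if adapter_fn:
    h = adapter_fn(h)        # same shape in, same shape out
  return layer_norm(h + layer_input)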