Deep Learning - PythonTutorial_LSTM_GRU_Attention_LNLSTM_LNGRU

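This post covers the principles of LSTM, GRU, and seq2seq attention, along with code implementations that do not call the built-in APIs. It also implements layer normalization for LSTM and GRU. The main goal is to understand the algorithms behind these models better, and the code can also serve as scaffolding for further applications. Please credit the source when reposting.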

RNN

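We already know that for image-related tasks a CNN shares weights and offers translation invariance, and CNNs have achieved very good results. So how should sequence data be handled? Sequence data can be viewed as time-series data.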

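An ordinary RNN cell receives the input at the current time step, $x_{t}$, and the hidden-layer output of the previous time step, $h_{t-1}$. From these, the cell computes its hidden-layer output $h_{t}$ and the network output $y_{t}$.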
$$
\begin{align}
h_{t}&=\tanh(w_{h}*h_{t-1}+w_{i}*x_{t}) \\
y_{t}&=softmax(w_{o}*h_{t})
\end{align}
$$
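With the RNN structure in mind: the model uses information from the whole sequence to make its prediction, but it has a notable problem, namely that the inputs closest to the output have a disproportionately large influence on it. In addition, a traditional RNN is prone to exploding or vanishing gradients, because training relies on backpropagation through time (BPTT), which multiplies the gradient by the recurrent weights once per time step.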
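The recurrence above can be made concrete with a few lines of plain Python. This is only a minimal sketch: the weight values and the toy sizes are made up for illustration, and a tanh nonlinearity is applied to the hidden state, which is the usual convention for a vanilla RNN and is an assumption if the formula is read without it.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rnn_step(x_t, h_prev, w_i, w_h, w_o):
    """One vanilla RNN step: returns (h_t, y_t)."""
    pre = [a + b for a, b in zip(matvec(w_h, h_prev), matvec(w_i, x_t))]
    h_t = [math.tanh(p) for p in pre]          # assumed nonlinearity
    y_t = softmax(matvec(w_o, h_t))
    return h_t, y_t

# Toy sizes (hypothetical): input dim 2, hidden dim 2, 3 output classes.
w_i = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [0.0, 0.2]]
w_o = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, y = rnn_step(x, h, w_i, w_h, w_o)   # the same weights are reused at every step
```

The same `w_h` is applied at every step, which is why gradients flowing back through many steps can shrink or grow geometrically.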

LSTM & GRU

LSTM

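The long short-term memory (LSTM) unit, put simply, adds three gates to the traditional RNN to control the flow of information: the input gate (input_gate), the output gate (output_gate), and the forget gate (forget_gate). The inputs to the current cell are the current encoded vector $x_t$ and the hidden-state output of the previous time step, $h_{t-1}$, and the three gates are computed from these same two variables. Training the network amounts to learning the parameters of these computations. The formulas are: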
$$
\begin{align}
i,f,o&=sigmoid(wx_t+wh_{t-1}+b) \\
g&=tanh(wx_t+wh_{t-1}+b)
\end{align}
$$
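Here $i,f,o,g$ are the input gate, forget gate, output gate, and current (candidate) input, respectively, and the parameter $w$ stands for a different set of weights for each of the four outputs. The sigmoid function is used because its output lies between 0 and 1, which maps naturally onto how much information is allowed through. With these gates, the cell state at the current step is obtained from the previous cell state $cell_{t-1}$ combined with the forget gate and the input gate: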
$$
cell_{t}=f \odot cell_{t-1}+i \odot g
$$
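The hidden-state output at the current step, $h_{t}$, is what we are really after. It is controlled jointly by the current cell state and the output gate. It is the output of the LSTM at the current step and is propagated onward, and the output at the last step is taken as the output for the whole sequence: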
$$
h_{t}=o \odot tanh(cell_{t})
$$

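At each step, the LSTM receives the cell state and hidden-state output from the previous step, computes the current cell state and hidden output as described above, and passes both on until the sequence ends.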
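As a hedged illustration of one LSTM step, the sketch below follows the gate, cell, and hidden-state formulas literally using plain Python lists. The `params` layout (one weight matrix and bias per gate, acting on the concatenation of $x_t$ and $h_{t-1}$) and the random toy weights are assumptions made for this example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(w, b, v):
    """Compute w.v + b for a matrix w (list of rows), bias b and vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params maps 'i', 'f', 'o', 'g' to (w, b) pairs
    that act on the concatenated vector [x_t, h_prev]."""
    xh = x_t + h_prev  # list concatenation

    def gate(name, act):
        w, b = params[name]
        return [act(v) for v in affine(w, b, xh)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    # cell_t = f * cell_{t-1} + i * g  (element-wise)
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # h_t = o * tanh(cell_t)
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

# Toy configuration (hypothetical): 1 input feature, 2 hidden units.
rng = random.Random(0)
n_in, n_hid = 1, 2
params = {
    k: ([[rng.uniform(-1, 1) for _ in range(n_in + n_hid)] for _ in range(n_hid)],
        [0.0] * n_hid)
    for k in "ifog"
}
h, c = [0.0] * n_hid, [0.0] * n_hid
for x in ([0.5], [-1.0], [0.25]):
    h, c = lstm_step(x, h, c, params)   # both h and c are carried to the next step
```

Setting the forget-gate bias very high and the input-gate bias very low makes the cell state pass through unchanged, which is the behaviour the gates are meant to allow.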

Choice of activation functions

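The gates use sigmoid and the candidate output uses tanh. Both are saturating functions: once the input is large enough in magnitude, the output barely changes. With ReLU, which does not saturate for positive inputs, it would be hard to obtain a gating effect.

Sigmoid keeps its output between 0 and 1, which matches the physical meaning of a gate: when the input is large or small, the output is very close to 1 or 0, so the gate is effectively open or closed.

Tanh keeps its output between -1 and 1, which matches the zero-centered distributions found in most scenarios. At an input of 0 its gradient is also larger than that of sigmoid (1 versus 0.25), which helps the model converge faster.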
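The gradient comparison at zero and the saturation behaviour can be checked numerically. The sketch below uses the closed-form derivatives of the two functions; the variable names are chosen only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

grad_sigmoid_at_0 = d_sigmoid(0.0)   # 0.25
grad_tanh_at_0 = d_tanh(0.0)         # 1.0
gate_open = sigmoid(10.0)            # close to 1: the gate is effectively open
gate_closed = sigmoid(-10.0)         # close to 0: the gate is effectively closed
```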

When compute is limited, a hard 0/1 gate can also be used: set a threshold and output either 0 or 1.

GRU

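GRU is another gated recurrent unit, proposed in 2014. Compared with LSTM it needs less computation and performs comparably, so it is a reasonable first choice when designing a network.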

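Each GRU cell receives the hidden-state output of the previous time step, $h_{t-1}$, and the current encoded vector $x_{t}$. As with an ordinary RNN unit, the current GRU unit computes the current output $y_{t}$ and the hidden state $h_{t}$ that is passed to the next node.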

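Unlike LSTM, a GRU unit has two gates: the reset gate $r$ and the update gate $z$. They are computed in the same way as the LSTM gates, from the previous hidden-state output $h_{t-1}$ and the current encoded vector $x_{t}$: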
$$
r,z=sigmoid(w \cdot [x_{t},h_{t-1}])
$$
The reset gate is applied to the previous hidden-state output $h_{t-1}$:
$$
h_{t-1}^{r}=h_{t-1} \odot r
$$
Then $h_{t-1}^{r}$ is concatenated with the current encoded vector $x_{t}$ and passed through tanh, which scales the result into the range -1 to 1:
$$
h_{t}^{temp}=tanh(w \cdot [x_{t},h_{t-1}^{r}])
$$
Finally, the update gate $z$ balances the information carried over from the previous hidden state against the current information:
$$
h_{t}=z\odot h_{t-1}+(1-z) \odot h_{t}^{temp}
$$
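In effect, the update gate $z$ can be seen as merging the input gate and the forget gate of LSTM: $z$ decides how much of the old state to keep, and $1-z$ how much new information to admit.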
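The four GRU formulas can be traced in a short plain-Python sketch. The `params` layout and the random toy weights are assumptions for illustration. A bias term is included (the formulas omit it), and it is set to zero here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(w, b, v):
    """Compute w.v + b for a matrix w (list of rows), bias b and vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(w, b)]

def gru_step(x_t, h_prev, params):
    """One GRU step. params maps 'r', 'z', 'h' to (w, b) pairs that act
    on a concatenated [input, hidden] vector."""
    xh = x_t + h_prev
    w_r, b_r = params["r"]
    w_z, b_z = params["z"]
    w_h, b_h = params["h"]
    r = [sigmoid(v) for v in affine(w_r, b_r, xh)]          # reset gate
    z = [sigmoid(v) for v in affine(w_z, b_z, xh)]          # update gate
    h_reset = [hk * rk for hk, rk in zip(h_prev, r)]        # h_{t-1}^r
    h_temp = [math.tanh(v) for v in affine(w_h, b_h, x_t + h_reset)]
    # h_t = z * h_{t-1} + (1 - z) * h_temp  (element-wise)
    return [zk * hk + (1.0 - zk) * tk for zk, hk, tk in zip(z, h_prev, h_temp)]

# Toy configuration (hypothetical): 1 input feature, 2 hidden units.
rng = random.Random(0)
n_in, n_hid = 1, 2
params = {
    k: ([[rng.uniform(-1, 1) for _ in range(n_in + n_hid)] for _ in range(n_hid)],
        [0.0] * n_hid)
    for k in "rzh"
}
h = [0.0] * n_hid
for x in ([0.5], [-1.0], [0.25]):
    h = gru_step(x, h, params)
```

With the update gate saturated at 1, the step returns the previous hidden state unchanged, which shows how $z$ plays the role of a forget gate.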

LSTM code implementation

from __future__ import absolute_import, division, print_function

import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops.math_ops import sigmoid, tanh

class LSTMCell(RNNCell):
    """Long short-term memory cell, written by hand instead of calling the built-in LSTM API."""

    def __init__(self, name, num_units, input_size=None, activation=tanh):
        if input_size is not None:
            print("%s: The input_size parameter is deprecated." % self)
        self._name = name
        self._num_units = num_units
        self._activation = activation

    @property
    # NOTE: the LSTM state packs the memory cell and the hidden output together, so its size is 2*num_units
    def state_size(self):
        return 2*self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state):
        """Long short-term memory (LSTM) cell with num_units units.
            inputs: the input at the current step, shape is [B,D]
            state: state from the previous step, shape is [B,2*_num_units]; it holds the memory cell and the hidden output of the previous step
        """
        c_tm1, h_tm1 = tf.split(axis=1, num_or_size_splits=2, value=state)
        with tf.variable_scope(self._name):
            with vs.variable_scope("Gates"):  # input, forget and output gates plus the candidate input
                # One dense layer produces all four pre-activations; every bias starts at 1.0.
                value = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, h_tm1],
                                     axis=1),
                    units=4 * self._num_units, use_bias=True,
                    kernel_initializer=tf.orthogonal_initializer(),
                    bias_initializer=tf.constant_initializer(1.0))
                # i: input gate; f: forget gate; o: output gate; g: candidate input at the current step
                i, f, o, g = array_ops.split(value=value, num_or_size_splits=4, axis=1)
                # calculate gate and input vector
                i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), tanh(g)
            with vs.variable_scope("Candidate"):
                # new cell state: keep part of the old cell (f) and add the gated candidate (i*g)
                new_cell = f*c_tm1+i*g
            new_h = o * self._activation(new_cell)
            new_state = tf.concat(axis=1, values=[new_cell, new_h])

        return new_h, new_state

GRU code implementation

from __future__ import absolute_import, division, print_function

import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops.math_ops import sigmoid, tanh


class GRUCell(RNNCell):
    """Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078)."""

    def __init__(self, name, num_units, input_size=None, activation=tanh):
        if input_size is not None:
            print("%s: The input_size parameter is deprecated." % self)
        self._name = name
        self._num_units = num_units
        self._activation = activation

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state):
        """Gated recurrent unit (GRU) with num_units units.
            inputs: the input at each step, shape is [B,D]
            state: state from the previous step, shape is [B,_num_units]
        """
        with tf.variable_scope(self._name):
            with vs.variable_scope("Gates"):  # Reset gate and update gate.
                # We start with bias of 1.0 to not reset and not update.
                value = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, state], axis=1),
                    units=2 * self._num_units, use_bias=True,
                    bias_initializer=tf.constant_initializer(value=1.0))
                r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)
               
                r, u = sigmoid(r), sigmoid(u)
            with vs.variable_scope("Candidate"):
                Cand = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, r * state],
                                     axis=1),
                    units=self._num_units, use_bias=True)
                c = self._activation(Cand)
            new_h = u * state + (1 - u) * c
        return new_h, new_h

That completes an LSTM cell and a GRU cell; the computations follow the formulas exactly.

To use them, first configure the LSTMCell or GRUCell through the parameters of __init__, then connect the cell to the input data. The code above documents how the parameters are set.

For example

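# Assume the input has shape [32,40,256], i.e. batch=32, n_step=40, input_dim=256
# Generate random numbers to simulate the data
inputs=tf.random_normal([32,40,256],mean=0.0,stddev=1.0,dtype=tf.float32) 
# Create the LSTM cell; its input size is 256, matching the last dimension of inputs
lstm=LSTMCell('LSTM',num_units=512)
# Build the dynamic RNN output (dtype is required when no initial_state is given)
out,state=tf.nn.dynamic_rnn(lstm,inputs,dtype=tf.float32)
# Create the GRU cell
gru=GRUCell('GRU',num_units=512)
gru_out,gru_state=tf.nn.dynamic_rnn(gru,inputs,dtype=tf.float32)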

Start Jupyter from the command line to test this. The command is:

jupyter lab

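This completes the LSTM and GRU cells and their test. One point to note: the LSTM has to extract memory_cell and hidden from initial_state, so its input state has shape [batch,2*num_units]. The GRU uses the provided state directly to compute its gates, so its input state has shape [batch,num_units]. Because both cells were configured with the same parameters in the test, the output sequences computed by the two cells have the same shape.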
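The packed LSTM state can be illustrated without TensorFlow. The sketch below mimics the concatenate-then-split bookkeeping on nested Python lists; the helper names `pack_lstm_state` and `unpack_lstm_state` are hypothetical and exist only for this example.

```python
def pack_lstm_state(c, h):
    """Concatenate per-example cell and hidden vectors along the feature axis."""
    return [ci + hi for ci, hi in zip(c, h)]

def unpack_lstm_state(state):
    """Split a packed [batch, 2*num_units] state back into (cell, hidden)."""
    half = len(state[0]) // 2
    return [row[:half] for row in state], [row[half:] for row in state]

batch, num_units = 3, 4
c = [[float(b)] * num_units for b in range(batch)]         # [batch, num_units]
h = [[float(b) + 0.5] * num_units for b in range(batch)]   # [batch, num_units]

state = pack_lstm_state(c, h)             # [batch, 2*num_units], as the LSTM cell expects
c_back, h_back = unpack_lstm_state(state)
```

A GRU needs no such packing because its state is just the hidden vector of shape [batch, num_units].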

seq2seq

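seq2seq is short for sequence to sequence. As the name suggests, it is a model that maps one sequence to another, and it is mainly used for tasks such as machine translation, image captioning, and speech recognition.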

Core idea

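A deep neural network maps an input sequence to an output sequence. The process has two stages, encoding and decoding, hence encoder-decoder. In most implementations both the encoder and the decoder are recurrent neural networks, and the seq2seq model is trained end to end. The simplest seq2seq model encodes the input sequence and uses the final state of the encoder as the initial state of the decoder.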

Note that the encoder, which encodes the input data, is currently built with RNNs or CNNs, or entirely with attention as in the paper "Attention is all you need". The decoder is usually an RNN.

seq2seq decoding

Greedy search

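Decoding is the core of seq2seq. The most basic decoding method is greedy search: after choosing a metric, pick the single best result in the current state at every step (for example, apply softmax and then argmax, much like classifying each output position) until the sequence ends. Greedy search is cheap to compute, but it generally finds only a local optimum, so the results are often not very good.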
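A minimal sketch of greedy decoding follows. The decoder is replaced by a made-up lookup table `TABLE` that returns next-token probabilities given only the previous token. A real decoder would also condition on its hidden state, so this is an illustration of the search rule, not of a full model.

```python
# Hypothetical next-token distributions, keyed by the previous token.
TABLE = {
    "<s>": {"i": 0.6, "you": 0.4},
    "i": {"love": 0.7, "you": 0.2, "</s>": 0.1},
    "love": {"you": 0.8, "i": 0.1, "</s>": 0.1},
    "you": {"</s>": 0.9, "love": 0.1},
}

def step_probs(prev_token):
    """Stand-in for one decoder step followed by softmax."""
    return TABLE[prev_token]

def greedy_decode(step_fn, start, end, max_len=10):
    """At every step keep only the most probable token (argmax)."""
    tokens, prev = [], start
    for _ in range(max_len):
        probs = step_fn(prev)
        prev = max(probs, key=probs.get)   # argmax over the distribution
        if prev == end:
            break
        tokens.append(prev)
    return tokens

result = greedy_decode(step_probs, "<s>", "</s>")
```

Each step costs one pass over the vocabulary, but an early choice is never revisited, which is where the local-optimum problem comes from.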

beam search

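Beam search is a commonly used improvement. It is used at test time, because during training every decoder output has a ground-truth answer, so beam search is not needed to improve accuracy. It keeps the current b best candidates. At each decoding step it extends every kept candidate, ranks the extensions, and again keeps the top b. This repeats until decoding finishes, and the best candidate is returned as the result.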
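The procedure can be sketched as follows. As an assumption for illustration, the decoder is again a made-up table of next-token probabilities (`TABLE`), scores are summed log-probabilities, and finished hypotheses are simply set aside; production implementations usually add length normalization and other refinements.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.5, "d": 0.5},
    "b": {"c": 0.9, "d": 0.1},
    "c": {"</s>": 1.0},
    "d": {"</s>": 1.0},
}

def step_probs(prev_token):
    return TABLE[prev_token]

def beam_search(step_fn, start, end, beam_width=2, max_len=10):
    """Return (tokens, log_prob) of the best hypothesis found."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        # Extend every kept hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens[-1]).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Rank the extensions and keep the top beam_width.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    best_tokens, best_score = max(finished or beams, key=lambda cand: cand[1])
    return [t for t in best_tokens if t not in (start, end)], best_score

tokens, log_prob = beam_search(step_probs, "<s>", "</s>", beam_width=2)
```

In this toy table a greedy decoder would commit to "a" first (probability 0.6) and end with a sequence probability of 0.3, whereas a beam of width 2 keeps "b" alive and finds "b c" with probability 0.36.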

Other options at decoding time include stacking multiple RNN layers, adding dropout, and adding residual connections to the encoder.

Attention

Principle and formulas

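In the original seq2seq model, every element of the target sequence is generated from the information of the entire input sequence (the final state of the encoder is the initial state of the decoder, so the decoder can be seen as receiving all of the encoder's information). In real applications, however, each part of the target sequence is usually tied to specific key points of the input sequence. For example, if the input sequence is the Chinese "我爱你" and the English target is "I love you", then "我" clearly contributes the most to generating "I". The traditional seq2seq model does not take this relationship into account: it uses all three characters of "我爱你" to generate "I" and then generates the rest of the target sequence from "I", which is clearly not very reliable.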

The traditional seq2seq model is structured as follows.

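First, both the input sequence and the target sequence are embedded into word vectors. The encoder processes the input sequence and produces final_state, which is assumed to contain all the information of the input sequence; that is, the whole input sequence is compressed into the last state. In the decoder, the encoder state is the initial state and the target sequence is the input, and the output sequence is produced step by step. This model has two obvious problems. The final state of the encoder cannot retain the information of a long sequence. And because the entire input sequence is passed to the decoder at once when predicting, the model cannot capture which input words correspond to which output words. Attention was introduced to address this, and it is in effect a form of sequence alignment.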

Attention structure

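With attention, the decoder_cell receives the current input of the target sequence, the state of the previous step, the label of the previous step, and the context vector from the encoder. The context is obtained by multiplying the encoder outputs by the attention vector and summing. The attention vector can be understood as a weight distribution over the encoder output vectors. Weights suggest softmax, which normalizes all of them into the range $0-1$ so that they sum to 1. How is the attention weight distribution computed? As noted above, attention is introduced to align the input and output sequences, so at the current step we can use the current state of the decoder together with the encoder outputs to compute how well each encoder output vector aligns with the decoder at this step. This is in effect a conditional probability: the probability of the current attention given the encoder outputs.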
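The weight-then-sum computation can be sketched in plain Python. As an assumption for illustration, the alignment score here is a simple dot product between the decoder state and each encoder output, and the vectors are toy values.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Score each encoder output against the decoder state, normalize the
    scores with softmax, and return (weights, context)."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context = weighted sum of the encoder outputs.
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

# Three source positions with 2-dimensional encoder outputs (toy values).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [4.0, 0.0]
weights, context = attend(decoder_state, encoder_outputs)
```

The decoder state points along the first encoder output, so the first source position receives the largest weight and dominates the context vector.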

Next, the formulas for each part (Reference):

Here, the function score is used to compare the target hidden state $h_t$ with each of the source hidden states $\overline{h}_s$, and the result is normalized to produce attention weights (a distribution over source positions). There are various choices of the scoring function; popular scoring functions include the multiplicative and additive forms given in Eq. (4). Once computed, the attention vector $a_t$ is used to derive the softmax logit and loss. This is similar to the target hidden state at the top layer of a vanilla seq2seq model. The function f can also take other forms.

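In other words: the score function measures how well the current decoder hidden state aligns with each of the encoder output vectors. Common choices are the multiplicative form and the additive form (two ways of computing attention). The scores are normalized with softmax into a weight distribution, which is the attention-weight computation in Eq. (1). The attention weights are multiplied with all encoder outputs to give the current context vector, and the context vector together with the current decoder output vector gives the decoder output at this step, which can then be compared with the target label to compute the loss.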
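The numbered equations that this passage refers to are not present in the text. Assuming the passage follows the usual multiplicative and additive attention formulation, they can be written as below. The symbols $W$, $W_1$, $W_2$, $v_a$, and $W_c$ are assumed names for the learned parameters, and the numbering is inferred from the references to Eq. (1) and Eq. (4).

$$
\begin{align}
\alpha_{ts}&=\frac{\exp(score(h_t,\overline{h}_s))}{\sum_{s'}\exp(score(h_t,\overline{h}_{s'}))} && \text{(1) attention weights} \\
c_t&=\sum_{s}\alpha_{ts}\overline{h}_s && \text{(2) context vector} \\
a_t&=f(c_t,h_t)=tanh(W_c[c_t;h_t]) && \text{(3) attention vector} \\
score(h_t,\overline{h}_s)&=\begin{cases}h_t^{\top}W\overline{h}_s & \text{multiplicative} \\ v_a^{\top}tanh(W_1h_t+W_2\overline{h}_s) & \text{additive}\end{cases} && \text{(4) score functions}
\end{align}
$$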

Implementation

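As with the LSTM and GRU implementations above, the attention is designed by following the formulas. Because attention is tied to the decoder, it can be implemented as a cell: a decoder_Cell with an attention mechanism. The attention weights can be computed with a linear layer followed by softmax, or by summing directly and then normalizing to obtain the distribution. The two differ only in implementation; I have not noticed any difference between them in practice.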
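For the summing variant, a hedged plain-Python sketch is given below: the decoder state and each encoder output are projected, added, passed through tanh, and reduced to a scalar score before softmax. The projection matrices `w_dec` and `w_enc` and the vector `v` are hypothetical toy parameters, not values from the code in this post.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def additive_scores(dec_state, enc_outputs, w_dec, w_enc, v):
    """score_s = v . tanh(w_dec . h_t + w_enc . h_s) for every source position s."""
    proj_dec = matvec(w_dec, dec_state)
    scores = []
    for enc in enc_outputs:
        proj_enc = matvec(w_enc, enc)
        hidden = [math.tanh(a + b) for a, b in zip(proj_dec, proj_enc)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return scores

# Toy parameters (hypothetical): identity projections and a selector vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 0.0]
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dec_state = [2.0, 0.0]

scores = additive_scores(dec_state, enc_outputs, identity, identity, v)
weights = softmax(scores)   # attention distribution over the three positions
```

Both variants end in the same softmax normalization; only the way the raw score is produced differs.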

Attention design for the seq2seq model

from __future__ import absolute_import, division, print_function

import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell, LSTMCell, RNNCell
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops.math_ops import sigmoid, tanh


class MLTAttentionCell(RNNCell):
    """ Attention structure for the MIT model 
        input vector and target vector are all sequences
    """

    def __init__(self, name, num_units, encoder_output, decoder_cell=None, input_size=None):
        if input_size is not None:
            print("%s: The input_size parameter is deprecated." % self)
        self._name = name
        self._num_units = num_units
        if decoder_cell is None:
            self._decoder_cell = GRUCell(num_units=num_units)
        else:
            self._decoder_cell = decoder_cell
        self._encoder_output = encoder_output  # B,L,E_D
        self._max_length = self._encoder_output.get_shape().as_list()[1]

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state):
        """Attention decoder step: attend over the encoder output, then run the decoder cell.
            inputs: the input at each step, shape is [B,D]
            state: state from the previous step, shape is [B,_num_units]

        """
        hidden = state  # B,D
        with tf.variable_scope(self._name):
            _att = tf.layers.dense(
                inputs=tf.concat(values=[inputs, hidden], axis=1),
                units=self._max_length)  # B,L

            attention_weight = tf.nn.softmax(_att)  # B,L
            attention_weight = tf.expand_dims(attention_weight, axis=1)  # B,1,L

            att_applied = tf.matmul(attention_weight, self._encoder_output)[
                :, 0]  # (B,1,L)*(B,L,D)=(B,1,D)->(B,D)

            output = tf.layers.dense(
                inputs=tf.concat(values=[inputs, att_applied],
                                 axis=1),
                units=self._num_units)  # (B,D)

            output = tf.nn.relu(output)

            output, hidden = self._decoder_cell(output, hidden)

        return output, hidden

Attention design for ImageCaption

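In the ImageCaption task the input is an image and the output is a description of that image. The image is first encoded with a CNN into a sequence of image features, which is fed to the attention module. The rest is similar to an ordinary seq2seq model: at each step the encoded vector of the image label is the input, the attention weights are computed over the image sequence, and an RNN decodes the output.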

import collections

import numpy as np
import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell, LSTMCell, RNNCell

AttentionState = collections.namedtuple("AttentionState", ("cell_state", "output"))

Attention_weight = list()
class AttCell(RNNCell):
    def __init__(
            self, name, att_input, cell, n_hid, dim_att, dim_o, dropuout, vacab_size, batch_size, tiles,
            dtype=tf.float32):
        self._scope_name = name
        self._encoder_sequence = att_input  # obtained from img-cnn-rnn, shape [B,HW,E_DIM]
        self._cell = cell   # decoder rnn cell
        self._n_hid = n_hid  # number of hidden units of the decoder, D_DIM
        self._dim_att = dim_att   # attention dimension, an intermediate size; usually chosen equal to the last dimension of _encoder_sequence
        self._dim_o = dim_o   # dimension of the rnn output, equal to the num_unit set for the rnn
        self._dropout = dropuout
        self._vacab_size = vacab_size  # number of words in the vocabulary
        self._dtype = dtype
        self._batch_size = batch_size
        self._tiles = tiles
        self._n_regions = tf.shape(self._encoder_sequence)[1]  # HW
        self._n_channels = self._encoder_sequence.shape[2].value  # E_DIM

        # self._state_size = AttentionState(self._cell._state_size, self._dim_o)
        self._state_size = AttentionState(self._n_hid, self._dim_o)

        self._att_img = tf.layers.dense(
            inputs=self._encoder_sequence, units=self._dim_att, use_bias=False, name="att_img")  # B,HW,dim_att
        if self._tiles > 1:
            self._encoder_sequence = tf.expand_dims(
                self._encoder_sequence, axis=1)  # (B,1,HW,E_DIM)
            self._encoder_sequence = tf.tile(self._encoder_sequence, multiples=[
                1, self._tiles, 1, 1])  # (B,T,HW,E_DIM)
            self._encoder_sequence = tf.reshape(
                self._encoder_sequence, shape=[-1, self._n_regions, self._n_channels])  # (B*T,HW,E_DIM)

            self._att_img = tf.expand_dims(self._att_img, axis=1)  # batch,1,HW,dim_att
            self._att_img = tf.tile(self._att_img, multiples=[1, self._tiles, 1, 1])
            self._att_img = tf.reshape(self._att_img, shape=[-1, self._n_regions,
                                                             self._dim_att])

    @property
    def state_size(self):
        return self._state_size

    @property
    def output_size(self):
        return self._vacab_size  # because the cell returns logits, the output size is vocab_size

    @property
    def output_dtype(self):
        return self._dtype

    def initial_cell_state(self, cell):
        _states_0 = []
        for hidden_name in cell._state_size._fields:
            hidden_dim = getattr(cell._state_size, hidden_name)
            h = self._CalStateBasedSeq(hidden_name, hidden_dim)
            _states_0.append(h)

        initial_state_cell = type(cell.state_size)(*_states_0)

        return initial_state_cell

    def _CalStateBasedSeq(self, name, dim):
        """Returns initial state of dimension specified by dim"""
        scope = tf.get_variable_scope()
        with tf.variable_scope(scope):
            img_mean = tf.reduce_mean(self._encoder_sequence, axis=1)  # (N,H*W,C)-->(N,C)
            W = tf.get_variable("W_{}_0".format(name), shape=[self._n_channels, dim])
            b = tf.get_variable("b_{}_0".format(name), shape=[1, dim])
            h = tf.tanh(tf.matmul(img_mean, W) + b)

            return h

    def initial_state(self):
        """ setting initial state  and output """
        initial_states = self._CalStateBasedSeq('init_state', self._n_hid)  # batch,
        # initial_states = self.initial_cell_state(self._cell)  # batch,
        initial_out = self._CalStateBasedSeq('init_out', self._dim_o)

        return AttentionState(initial_states, initial_out)

    def _cal_att(self, hid_cur):
        """
        calculate attention weight 
        """
        with tf.variable_scope('att_cal'):
            # computes attention over the hidden vector
            # h [ batch,num_units]
            # att_h [batch,dim_att]
            att_h = tf.layers.dense(inputs=hid_cur, units=self._dim_att, use_bias=False)
            # sums the two contributions
            # att_h --> [batch,1,dim_att]
            att_h = tf.expand_dims(att_h, axis=1)
            # att_img [batch,h*w,_dim_att]
            # att_h [batch,1,dim_att]
            # att shape is [batch,h*w,dim_att]
            att = tf.tanh(self._att_img + att_h)

            # computes scalar product with beta vector
            # works faster with a matmul than with a * and a tf.reduce_sum
            att_beta = tf.get_variable("att_beta", shape=[self._dim_att, 1],
                                       dtype=tf.float32)
            # att_flat shape is [batch*h*w,dim_att]
            att_flat = tf.reshape(att, shape=[-1, self._dim_att])
            # [batch*h*w,1]
            e = tf.matmul(att_flat, att_beta)
            # [batch,h*w]
            e = tf.reshape(e, shape=[-1, self._n_regions])
            # compute weights
            # (B,HW)
            a = tf.nn.softmax(e)
            return a

    def step(self, embeding, attention_cell_state):
        """
        Args:
            embeding: shape is (B,EM_DIM)
            attention_cell_state: state from previous step comes from AttentionState 
        """
        _initial_state, output_tm1 = attention_cell_state
        scope = tf.get_variable_scope()
        with tf.variable_scope(scope, initializer=tf.orthogonal_initializer()):
            x = tf.concat([embeding, output_tm1], axis=-1)
            new_hid, new_cell_state = self._cell.__call__(inputs=x, state=_initial_state)
            _attention = self._cal_att(new_hid)

            def _debug_att(val):
                global Attention_weight
                Attention_weight = []
                Attention_weight += [val]
                return False

            print_func = tf.py_func(_debug_att, [_attention], [tf.bool])
            with tf.control_dependencies(print_func):
                _attention = tf.identity(_attention, name='Attention_weight')

            # B,HW,1
            attention = tf.expand_dims(_attention, axis=-1)
            # [B,HW,1]*[B,HW,C]
            # CONTEX SHAPE IS [B,C]
            contex = tf.reduce_sum(attention * self._encoder_sequence, axis=1)

            o_W_c = tf.get_variable("o_W_c", dtype=tf.float32,
                                    shape=(self._n_channels, self._dim_o))
            o_W_h = tf.get_variable("o_W_h", dtype=tf.float32,
                                    shape=(self._n_hid, self._dim_o))

            new_o = tf.tanh(tf.matmul(new_hid, o_W_h) + tf.matmul(contex, o_W_c))
            new_o = tf.nn.dropout(new_o, self._dropout)  # in TF 1.x the second argument is keep_prob

            y_W_o = tf.get_variable("y_W_o", dtype=tf.float32,
                                    shape=(self._dim_o, self._vacab_size))
            # logits for current step
            # shape is [batch_size,vocabsize] for each size
            logits = tf.matmul(new_o, y_W_o)
            new_state = AttentionState(new_cell_state, new_o)

            return logits, new_state

    def __call__(self, _inputs, _state):
        """
        The dynamic rnn function will use this call function to calculate step by step
        Args:
            inputs: the embedding of the previous word for training only
            state: (AttentionState) (h,c, o) where h is the hidden state and
                o is the vector used to make the prediction of
                the previous word
        """
        logits, state = self.step(_inputs, _state)
        return (logits, state)

Layer normalization implementation

For LSTM and GRU, layer normalization is not as readily available through an API call as BN is for CNNs, so the code below implements layer normalization directly.

LN_GRU code implementation

from __future__ import absolute_import, division, print_function

import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell, LSTMCell, RNNCell
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops.math_ops import sigmoid, tanh


class LNGRUCell(RNNCell):
    """Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078) with layer normalization."""

    def __init__(self, name, num_units, input_size=None, activation=tanh):
        if input_size is not None:
            print("%s: The input_size parameter is deprecated." % self)
        self._name = name
        self._num_units = num_units
        self._activation = activation

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def _LN(self, tensor, scope=None, epsilon=1e-5):
        assert(len(tensor.get_shape()) == 2)
        m, v = tf.nn.moments(tensor, [1], keep_dims=True)
        if not isinstance(scope, str):
            scope = ''
        with tf.variable_scope(scope + 'layer_norm'):
            scale = tf.get_variable('scale',
                                    shape=[tensor.get_shape()[1]],
                                    initializer=tf.constant_initializer(value=1.0))
            shift = tf.get_variable('shift',
                                    shape=[tensor.get_shape()[1]],
                                    initializer=tf.constant_initializer(value=0.))
        _LnInitial = (tensor - m) / tf.sqrt(v + epsilon)

        return _LnInitial * scale + shift

    def __call__(self, inputs, state):
        """Layer-normalized gated recurrent unit (GRU) with num_units units.
            inputs: the input at each step, shape is [B,D]
            state: state from the previous step, shape is [B,_num_units]
        """
        with tf.variable_scope(self._name):
            with vs.variable_scope("Gates"):  # Reset gate and update gate.
                # We start with bias of 1.0 to not reset and not update.
                value = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, state], axis=1),
                    units=2 * self._num_units, use_bias=True,
                    bias_initializer=tf.constant_initializer(value=1.0))
                r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)
                r = self._LN(r, scope='r/')
                u = self._LN(u, scope='u/')
                r, u = sigmoid(r), sigmoid(u)
            with vs.variable_scope("Candidate"):
                Cand = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, r * state],
                                     axis=1),
                    units=self._num_units, use_bias=True)
                c_pre = self._LN(Cand,  scope='new_h/')
                c = self._activation(c_pre)
            new_h = u * state + (1 - u) * c
        return new_h, new_h

LN_LSTM code implementation

class LNLSTMCell(RNNCell):
    """Long short-term memory cell with layer normalization."""

    def __init__(self, name, num_units, input_size=None, activation=tanh):
        if input_size is not None:
            print("%s: The input_size parameter is deprecated." % self)
        self._name = name
        self._num_units = num_units
        self._activation = activation

    @property
    def state_size(self):
        return 2*self._num_units

    @property
    def output_size(self):
        return self._num_units

    def _LN(self, tensor, scope=None, epsilon=1e-5):
        assert(len(tensor.get_shape()) == 2)
        m, v = tf.nn.moments(tensor, [1], keep_dims=True)
        if not isinstance(scope, str):
            scope = ''
        with tf.variable_scope(scope + 'layer_norm'):
            scale = tf.get_variable('scale',
                                    shape=[tensor.get_shape()[1]],
                                    initializer=tf.constant_initializer(value=1.0))
            shift = tf.get_variable('shift',
                                    shape=[tensor.get_shape()[1]],
                                    initializer=tf.constant_initializer(value=0.))
        _LnInitial = (tensor - m) / tf.sqrt(v + epsilon)

        return _LnInitial * scale + shift

    def __call__(self, inputs, state):
        """Layer-normalized long short-term memory (LSTM) cell with num_units units.
            inputs: the input at each step, shape is [B,D]
            state: state from the previous step, shape is [B,2*_num_units]; it holds the memory cell and the hidden output
        """
        c_tm1, h_tm1 = tf.split(axis=1, num_or_size_splits=2, value=state)
        with tf.variable_scope(self._name):
            with vs.variable_scope("Gates"):  # input, forget and output gates plus the candidate input
                # One dense layer produces all four pre-activations; every bias starts at 1.0.
                value = tf.layers.dense(
                    inputs=tf.concat(values=[inputs, h_tm1],
                                     axis=1),
                    units=4 * self._num_units, use_bias=True,
                    kernel_initializer=tf.orthogonal_initializer(),
                    bias_initializer=tf.constant_initializer(1.0))
                i, f, o, g = array_ops.split(value=value, num_or_size_splits=4, axis=1)

                i = self._LN(i, scope='input/')
                f = self._LN(f, scope='forget/')
                o = self._LN(o, scope='output/')
                g = self._LN(g, scope='instep/')

                i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), tanh(g)

            with vs.variable_scope("Candidate"):
                c_pre = f*c_tm1+i*g
                new_cell = self._LN(c_pre, scope='new_cell/')

            new_h = o * self._activation(new_cell)
            new_state = tf.concat(axis=1, values=[new_cell, new_h])

        return new_h, new_state

Summary

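As the code shows, layer normalization is added on top of the standard LSTM/GRU: subtract the mean of the variable and divide by its standard deviation, which is simple to do. These two cells can replace the standard GRU or LSTM in a model design, with the aim of improving the robustness of the model.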
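The normalization itself can be checked on a single vector in plain Python. This is a minimal sketch of the same arithmetic (mean and variance over the feature axis, then scale and shift), with the default scale of 1 and shift of 0 as assumptions matching a freshly initialized layer.

```python
import math

def layer_norm(vec, scale=None, shift=None, epsilon=1e-5):
    """Normalize one feature vector to zero mean and unit variance,
    then apply an element-wise scale and shift."""
    n = len(vec)
    scale = scale if scale is not None else [1.0] * n
    shift = shift if shift is not None else [0.0] * n
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    return [(v - mean) / math.sqrt(var + epsilon) * g + b
            for v, g, b in zip(vec, scale, shift)]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
mean_after = sum(normed) / len(normed)
var_after = sum((v - mean_after) ** 2 for v in normed) / len(normed)
```

After normalization the mean is zero up to rounding and the variance is just under 1 because of the epsilon term in the denominator.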