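1 RNN Principles

1.1 Backpropagation Through Time in RNNs

RNNs update their parameters with backpropagation through time (BPTT). This section briefly introduces how BPTT works and how it differs from standard backpropagation. It also covers the vanishing gradient problem, which motivated the development of LSTM and GRU.

1.1.1 The BPTT algorithm

We use the basic RNN structure to illustrate BPTT:

The figure above shows the basic RNN unrolled over time; the equations of the RNN are as follows: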
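A standard way to write these equations, assuming the usual vanilla-RNN formulation with a tanh hidden layer and a softmax output (input $x_t$, hidden state $s_t$, parameters $U$, $V$, $W$), is:

```latex
\begin{aligned}
s_t &= \tanh(U x_t + W s_{t-1}) \\
\widehat{y}_t &= \mathrm{softmax}(V s_t)
\end{aligned}
```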

Likewise, we define the loss as the cross-entropy loss:
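Under the usual definition of cross-entropy for a one-hot label $y_t$ and a softmax prediction $\widehat{y}_t$ (an assumption about the exact form intended here), the per-step and total losses can be written as:

```latex
\begin{aligned}
E_t(y_t, \widehat{y}_t) &= -\,y_t \log \widehat{y}_t \\
E(y, \widehat{y}) &= \sum_t E_t(y_t, \widehat{y}_t) = -\sum_t y_t \log \widehat{y}_t
\end{aligned}
```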

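Here $y_t$ is the correct label at time step $t$, and $\widehat{y}_t$ is our prediction. We usually treat a complete sentence (sequence) as one training example, so the total error is the sum of the errors at each time step.

Remember that our goal is to compute the gradients of the error with respect to the parameters $U$, $V$ and $W$, and then update the parameters with SGD. Just as we sum the errors, we also sum the gradients at each time step for one training example: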
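Written out for $W$, and assuming the total error is the plain sum of the per-step errors as described above, this reads:

```latex
\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}
```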

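This step is easy to follow: because the total error is the sum of the errors at each time step, the gradient formula above holds for $W$, $V$ and $U$ alike.

We compute these gradients with the chain rule. The rest of this section uses $E_3$ as the example: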
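For the output matrix $V$, a sketch of the derivation looks as follows. It assumes a softmax output with cross-entropy loss and introduces $z_3 = V s_3$ as a helper symbol that is not part of the original notation; $\otimes$ denotes the outer product of two vectors:

```latex
\frac{\partial E_3}{\partial V}
  = \frac{\partial E_3}{\partial \widehat{y}_3}\,
    \frac{\partial \widehat{y}_3}{\partial z_3}\,
    \frac{\partial z_3}{\partial V}
  = (\widehat{y}_3 - y_3) \otimes s_3
```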

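The expression above shows that $\dfrac{\partial E_3}{\partial V}$ depends only on values at the current time step: $y_3$, $\widehat{y}_3$ and $s_3$. The partial derivatives at the other time steps can be obtained in the same way, and with them the gradient for $V$ is straightforward to compute.

Next, consider $\dfrac{\partial E_3}{\partial W}$. From the equations above, the partial derivative of $E_3$ with respect to $W$ is: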
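Because $s_3$ depends on $W$ both directly and through $s_2$, $s_1$ and $s_0$, the chain rule gives one term per time step. A standard form, assuming the hidden states are indexed from $k=0$, is:

```latex
\frac{\partial E_3}{\partial W}
  = \sum_{k=0}^{3}
    \frac{\partial E_3}{\partial \widehat{y}_3}\,
    \frac{\partial \widehat{y}_3}{\partial s_3}\,
    \frac{\partial s_3}{\partial s_k}\,
    \frac{\partial s_k}{\partial W}
```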

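This formula applies the chain rule and sums the contribution of each time step to the gradient. In other words, because $W$ is used at every time step up to the output we care about, we need to backpropagate the gradient from $t=3$ through the network all the way back to $t=0$.

Note that this is exactly the same as the standard backpropagation algorithm used in deep feedforward networks. The key difference is that we sum the gradients for $W$ over the time steps: a traditional neural network does not share parameters across layers, so there is nothing to sum.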
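The summation over time steps can be made concrete with a small NumPy sketch. It assumes the vanilla RNN $s_t = \tanh(Ux_t + Ws_{t-1})$, $\widehat{y}_t = \mathrm{softmax}(Vs_t)$ with a zero initial state; the function names, sizes and random data are illustrative choices, not part of the original article. The BPTT gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(U, V, W, xs, s_init):
    """Run the RNN over xs; states[0] is s_init and states[t + 1] is s_t."""
    states, outputs = [s_init], []
    for x in xs:
        s = np.tanh(U @ x + W @ states[-1])
        z = V @ s
        e = np.exp(z - z.max())           # numerically stable softmax
        states.append(s)
        outputs.append(e / e.sum())
    return states, outputs

def loss_at(U, V, W, xs, s_init, t, label):
    """Cross-entropy error E_t for the correct class index `label`."""
    _, outputs = forward(U, V, W, xs, s_init)
    return -np.log(outputs[t][label])

def grad_W_bptt(U, V, W, xs, s_init, t, label):
    """dE_t/dW, accumulated over time steps k = t, t-1, ..., 0."""
    states, outputs = forward(U, V, W, xs, s_init)
    dz = outputs[t].copy()
    dz[label] -= 1.0                      # dE_t/dz_t = y_hat_t - y_t
    ds = V.T @ dz                         # dE_t/ds_t
    dW = np.zeros_like(W)
    for k in range(t, -1, -1):
        s_k, s_prev = states[k + 1], states[k]
        da = ds * (1.0 - s_k ** 2)        # back through tanh at step k
        dW += np.outer(da, s_prev)        # contribution of step k to dE_t/dW
        ds = W.T @ da                     # pass the gradient on to s_{k-1}
    return dW

rng = np.random.default_rng(0)
hidden, vocab = 4, 5
U = rng.normal(scale=0.5, size=(hidden, vocab))
V = rng.normal(scale=0.5, size=(vocab, hidden))
W = rng.normal(scale=0.5, size=(hidden, hidden))
xs = [np.eye(vocab)[i] for i in (0, 3, 1, 2)]    # one-hot inputs for t = 0..3
s_init = np.zeros(hidden)

dW = grad_W_bptt(U, V, W, xs, s_init, t=3, label=4)

def numeric_grad_W(eps=1e-6):
    """Central finite differences of E_3 with respect to each entry of W."""
    g = np.zeros_like(W)
    for i in range(hidden):
        for j in range(hidden):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (loss_at(U, V, Wp, xs, s_init, 3, 4)
                       - loss_at(U, V, Wm, xs, s_init, 3, 4)) / (2 * eps)
    return g

max_diff = np.abs(dW - numeric_grad_W()).max()
print("largest difference between BPTT and numeric gradient:", max_diff)
```

The loop in `grad_W_bptt` is the only place where this differs from backpropagation in a feedforward network: the same `W` receives one `np.outer` contribution per time step.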

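1.1.2 The vanishing gradient problem

A vanilla RNN cannot handle long-term dependencies well. The meaning of a sentence often depends on words that are far apart. For example, the core meaning of "The man who wore a wig on his head went inside" is that a man went inside, not that he wore a wig. A standard RNN has difficulty capturing this kind of information. To see why, let us look more closely at the gradient $\dfrac{\partial E_3}{\partial W}$ computed above:

In that expression, keep in mind that $\dfrac{\partial s_3}{\partial s_k}$ is itself a chain rule. For example, $\dfrac{\partial s_3}{\partial s_1}=\dfrac{\partial s_3}{\partial s_2}\dfrac{\partial s_2}{\partial s_1}$.

Also note that the derivative of a vector function with respect to a vector is a matrix (the Jacobian matrix) whose entries are all the element-wise derivatives. The gradient above can therefore be rewritten as: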
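Expanding $\dfrac{\partial s_3}{\partial s_k}$ into a product of one-step Jacobians, and again assuming hidden states indexed from $k=0$, the rewritten gradient takes this form:

```latex
\frac{\partial E_3}{\partial W}
  = \sum_{k=0}^{3}
    \frac{\partial E_3}{\partial \widehat{y}_3}\,
    \frac{\partial \widehat{y}_3}{\partial s_3}
    \left( \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \right)
    \frac{\partial s_k}{\partial W}
```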

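The ranges of tanh, sigmoid and their derivatives are shown in the figures below (see http://nn.readthedocs.io/en/rtd/transfer/):

Figure: tanh and its derivative

Figure: sigmoid and its derivative

Both tanh and sigmoid have derivatives that approach 0 at both ends, where the curves flatten out. A neuron in this regime is said to be saturated. Its gradient is close to 0, which also drives the gradients in earlier layers toward 0. Because the entries of the Jacobian matrices are small (the derivative of tanh is at most 1, that of sigmoid at most 1/4), multiplying several of them together ($t-k$ of them) shrinks the gradient exponentially, just as repeatedly multiplying small numbers does, and it effectively vanishes after a few time steps. The gradient contributions from "far away" time steps become 0, so the states at those steps do not contribute to learning: the network ends up unable to learn long-range dependencies. Vanishing gradients are not exclusive to RNNs; they also occur in deep feedforward networks. The difference is that RNNs tend to be very deep (in this example, as deep as the sentence is long), which makes the problem much more common.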
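The exponential shrinkage can be observed numerically. The sketch below assumes a tanh RNN whose recurrent matrix `W` has been rescaled so that its largest singular value is 0.9 (an illustrative choice, as are the sizes and the random inputs). It multiplies the one-step Jacobians $\partial s_t / \partial s_{t-1} = \mathrm{diag}(1 - s_t^2)\,W$ together and records the norm of the product.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
W = rng.normal(size=(hidden, hidden))
W *= 0.9 / np.linalg.norm(W, 2)           # assumption: largest singular value of W is 0.9
U = rng.normal(size=(hidden, hidden))

s = np.zeros(hidden)
jac_product = np.eye(hidden)              # accumulates d s_t / d s_0
norms = []
for t in range(50):
    s = np.tanh(U @ rng.normal(size=hidden) + W @ s)
    jac = np.diag(1.0 - s ** 2) @ W       # one-step Jacobian d s_t / d s_{t-1}
    jac_product = jac @ jac_product
    norms.append(np.linalg.norm(jac_product, 2))

for steps in (1, 10, 25, 50):
    print(f"{steps:2d} steps back: norm = {norms[steps - 1]:.3e}")
```

Since the tanh derivative never exceeds 1, each factor has norm at most 0.9 in this setup, so the product after $n$ steps is bounded by $0.9^n$; saturated units make it shrink even faster.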

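It is easy to imagine that, depending on the activation function and the network parameters, the entries of the Jacobian matrices could instead be large, which makes the gradients explode rather than vanish: the exploding gradient problem. Vanishing gradients have received more attention than exploding gradients for two reasons. First, exploding gradients are obvious: the gradients become NaN (not a number) and the program crashes. Second, clipping the gradients at a predefined threshold is a simple and effective remedy. Vanishing gradients are harder to deal with, because they are not obvious when they occur and it is less clear how to handle them.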
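Clipping at a predefined threshold can be sketched in a few lines. The version below rescales the gradient when its L2 norm exceeds the threshold, which is one common variant; the function name `clip_by_norm` and the threshold value are illustrative assumptions, and clipping each element to a fixed range is another option.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

large = np.array([3.0, 4.0])              # L2 norm 5, above the threshold
small = np.array([0.3, 0.4])              # L2 norm 0.5, below the threshold
clipped_large = clip_by_norm(large, max_norm=1.0)
clipped_small = clip_by_norm(small, max_norm=1.0)
print(clipped_large, clipped_small)
```

Rescaling keeps the direction of the gradient and only limits the step size, which is why it does not prevent learning the way a NaN gradient does.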

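Fortunately, there are a few ways to mitigate the vanishing gradient problem. Proper initialization of the $W$ matrix reduces its effect, and so does regularization. A better option is to use ReLU instead of tanh or sigmoid as the activation function. The derivative of ReLU is a constant, either 0 or 1, so it is less likely to suffer from vanishing gradients.

An even more common solution is to use the LSTM or GRU architectures. LSTM was first proposed in 1997 and is now widely used in NLP. GRU, first proposed in 2014, is a simplified variant of LSTM. Both architectures were designed to deal with vanishing gradients and to learn long-range dependencies effectively.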

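1.2 RNN variant: LSTM

The overall structure of an LSTM is shown below:

Next we take the LSTM apart step by step and look at how each gate works:

1.2.1 Forget gate

The first step in an LSTM is to decide what information to discard from the cell state. This decision is made by a structure called the "forget gate". The gate reads $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each value in the cell state $C_{t-1}$, where 1 means "keep this completely" and 0 means "discard this completely".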

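1.2.2 Input gate

The next step is to decide how much new information to add to the cell state. This has two parts: first, a sigmoid layer called the "input gate layer" decides which values to update; then a tanh layer creates a vector of candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we combine the two to update the cell state.

With these pieces in place, we can update the cell state as shown in the figure below, that is, update $C_{t-1}$ to $C_t$: the old state is scaled by the output of the forget gate, and the candidate values, scaled by the output of the input gate, are added as the new content.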

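1.2.3 Output gate

Finally, we need to decide what to output. The output is based on the cell state $C_t$, but it is a filtered version of it rather than $C_t$ itself.

First, a sigmoid layer decides which parts of $C_t$ will be output;

Then we pass $C_t$ through a tanh layer (which squashes the values to between -1 and 1) and multiply the result by the weights computed by the sigmoid layer, which gives the final output.

In the language model example, suppose the model has just seen a pronoun and may need to output a verb next. The output then depends on information about the pronoun, for example whether the verb should be singular or plural, so the information just learned about the pronoun must have been added to the cell state for the prediction to be correct.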
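The three gates and the state update can be put together in one step function. This is a minimal NumPy sketch of the commonly used LSTM formulation, in which every layer reads the concatenation $[h_{t-1}, x_t]$; the parameter names (`W_f`, `b_f`, and so on), the sizes and the random data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                         # update the cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                   # filtered output
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden = 4, 3
params = {name: rng.normal(scale=0.1, size=(hidden, hidden + input_size))
          for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: np.zeros(hidden) for name in ("b_f", "b_i", "b_c", "b_o")})

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 input vectors
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, `c_t` is almost a copy of `c_prev`, which is how the cell state carries information across many time steps.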

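1.3 RNN variant: GRU

GRU (Gated Recurrent Unit) was proposed by Cho et al. (2014). As shown in the figure below, a GRU has only two gates: the reset gate and the update gate. It also merges the cell state and the hidden state. The resulting model is simpler than a standard LSTM, and it has become very popular.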
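For comparison with the LSTM, here is a minimal NumPy sketch of one GRU step. It follows one common formulation, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; some references swap the roles of $z_t$ and $1 - z_t$. The parameter names, the bias terms and the sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step; the hidden state is the only state that is carried over."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])      # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])      # reset gate
    reset_concat = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ reset_concat + params["b_h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
input_size, hidden = 4, 3
params = {name: rng.normal(scale=0.1, size=(hidden, hidden + input_size))
          for name in ("W_z", "W_r", "W_h")}
params.update({name: np.zeros(hidden) for name in ("b_z", "b_r", "b_h")})

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 input vectors
    h = gru_step(x_t, h, params)
print(h)
```

There is no separate cell state and no output gate here, which is where the GRU saves parameters relative to the LSTM.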

2 RNN Implementations in TensorFlow

2.1 RNNs in tensorflow

tensorflow provides a basic RNN interface, tf.nn.rnn_cell.RNNCell. Let us first take a quick look at it:

class RNNCell(base_layer.Layer):
  """Abstract object representing an RNN cell.

  Every `RNNCell` must have the properties below and implement `call` with the signature `(output, next_state) = call(input, state)`.  The optional third input argument, `scope`, is allowed for backwards compatibility purposes; but should be left off for new subclasses.
  This definition of cell differs from the definition used in the literature. In the literature, 'cell' refers to an object with a single scalar output. This definition refers to a horizontal array of such units.
  An RNN cell, in the most abstract setting, is anything that has a state and performs some operation that takes a matrix of inputs. This operation results in an output matrix with `self.output_size` columns. If `self.state_size` is an integer, this operation also results in a new state matrix with `self.state_size` columns.  If `self.state_size` is a (possibly nested tuple of) TensorShape object(s), then it should return a matching structure of Tensors having shape `[batch_size].concatenate(s)` for each `s` in `self.batch_size`.
  """
  def __call__(self, inputs, state, scope=None):
    """Run this RNN cell on inputs, starting from the given state.
    Args:
      inputs: `2-D` tensor with shape `[batch_size x input_size]`.
      state: if `self.state_size` is an integer, this should be a `2-D Tensor` with shape `[batch_size x self.state_size]`.  Otherwise, if `self.state_size` is a tuple of integers, this should be a tuple with shapes `[batch_size x s] for s in self.state_size`.
      scope: VariableScope for the created subgraph; defaults to class name.
    Returns:
      A pair containing:
      - Output: A `2-D` tensor with shape `[batch_size x self.output_size]`.
      - New state: Either a single `2-D` tensor, or a tuple of tensors matching the arity and shapes of `state`.
    """
    if scope is not None:
      with vs.variable_scope(scope,
                             custom_getter=self._rnn_get_variable) as scope:
        return super(RNNCell, self).__call__(inputs, state, scope=scope)
    else:
      with vs.variable_scope(vs.get_variable_scope(),
                             custom_getter=self._rnn_get_variable):
        return super(RNNCell, self).__call__(inputs, state)
  def _rnn_get_variable(self, getter, *args, **kwargs):
    variable = getter(*args, **kwargs)
    trainable = (variable in tf_variables.trainable_variables() or
                 (isinstance(variable, tf_variables.PartitionedVariable) and
                  list(variable)[0] in tf_variables.trainable_variables()))
    if trainable and variable not in self._trainable_weights:
      self._trainable_weights.append(variable)
    elif not trainable and variable not in self._non_trainable_weights:
      self._non_trainable_weights.append(variable)
    return variable
  @property
  def state_size(self):
    """size(s) of state(s) used by this cell.
    It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.
    """
    raise NotImplementedError("Abstract method")
  @property
  def output_size(self):
    """Integer or TensorShape: size of outputs produced by this cell."""
    raise NotImplementedError("Abstract method")
  def build(self, _):
    # This tells the parent Layer object that it's OK to call
    # self.add_variable() inside the call() method.
    pass
  def zero_state(self, batch_size, dtype):
    """Return zero-filled state tensor(s).
    Args:
      batch_size: int, float, or unit Tensor representing the batch size.
      dtype: the data type to use for the state.
    Returns:
      If `state_size` is an int or TensorShape, then the return value is a `N-D` tensor of shape `[batch_size x state_size]` filled with zeros.
      If `state_size` is a nested list or tuple, then the return value is a nested list or tuple (of the same structure) of `2-D` tensors with the shapes `[batch_size x s]` for each s in `state_size`.
    """
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      state_size = self.state_size
      return _zero_state_tensors(state_size, batch_size, dtype)

The code above shows that RNNCell cannot be used directly: several of its methods have to be implemented by a subclass. Using RNNCell itself raises an error:

input = tf.random_uniform(shape=[3, 4, 6], dtype=tf.float32)
cell = tf.nn.rnn_cell.RNNCell(10)
init_state = cell.zero_state(4, dtype=tf.float32)
output, final_state = tf.nn.dynamic_rnn(cell, inputs=input, initial_state=init_state, time_major=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print (sess.run(output))
    print (sess.run(final_state))

Running the code above fails with the following error:

raise NotImplementedError("Abstract method")
NotImplementedError: Abstract method

This is exactly the error raised by the methods of the RNNCell class shown above.

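RNNCell can be regarded as the base class of all the other RNN cells, which is easy to see in the source code:

## BasicRNNCell
class BasicRNNCell(RNNCell):
## BasicLSTMCell
class BasicLSTMCell(RNNCell):
## GRUCell
class GRUCell(RNNCell):
## ... and the other RNN variants

So if we replace RNNCell in the code above with BasicRNNCell, BasicLSTMCell or GRUCell, it runs correctly.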

import tensorflow as tf

BATCH_SIZE = 4
## time_major=True is used below, so the input shape is [max_time, batch_size, depth] = [3, 4, 6]:
## 3 is max_time (the length of the text) and 6 is the word-vector size
input = tf.random_uniform(shape=[3, BATCH_SIZE, 6], dtype=tf.float32)
cell = tf.nn.rnn_cell.BasicRNNCell(10)
## or use LSTMCell
# cell = tf.nn.rnn_cell.LSTMCell(10)
## or use GRUCell
# cell = tf.nn.rnn_cell.GRUCell(10)
init_state = cell.zero_state(BATCH_SIZE, dtype=tf.float32)
output, final_state = tf.nn.dynamic_rnn(cell, inputs=input, initial_state=init_state, time_major=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ## fetch both tensors in one run so that they come from the same random input
    output_value, final_state_value = sess.run([output, final_state])
    print (output_value)
    print (final_state_value)

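In the example above, BasicRNNCell, BasicLSTMCell and GRUCell all take a num_units parameter when the cell is defined. It is important to understand what the parameters of these cells mean:
BasicLSTMCell

def __init__(self, num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None):

The parameters have the following meanings:

num_units: the official API describes it as "The number of units in the LSTM cell", that is, the dimensionality of $h_t$ and $C_t$ in the LSTM diagram above ($h_t$ and $C_t$ have the same dimensionality). Once it is specified, the cell sizes its weights $W$ and biases $b$ automatically;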

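forget_bias: the bias added to the forget gate before the sigmoid is applied (1.0 by default), so that the gate starts out mostly open and the cell forgets little at the beginning of training;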

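state_is_tuple: whether the state is accepted and returned as a (c, h) tuple; the default is True, and according to the official documentation False will be deprecated;

activation: the activation function; when activation is None, tanh is used by default;

The parameters of BasicRNNCell, GRUCell and the other cells are similar to those of BasicLSTMCell and are not repeated here.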
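Two of these parameters are easy to illustrate with plain arithmetic. The sketch below uses only the standard library; `lstm_kernel_shape` is a hypothetical helper written for this example, based on the assumption that BasicLSTMCell stores the weights of its four layers in a single kernel of shape [input_depth + num_units, 4 * num_units].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_kernel_shape(input_depth, num_units):
    """Hypothetical helper: assumed shape of the combined LSTM weight matrix."""
    # The four layers (input gate, candidate, forget gate, output gate)
    # each map [x_t, h_{t-1}] to num_units values.
    return (input_depth + num_units, 4 * num_units)

forget_bias = 1.0
pre_activation = 0.0     # forget-gate pre-activation, e.g. early in training
gate_without_bias = sigmoid(pre_activation)                 # 0.5
gate_with_bias = sigmoid(pre_activation + forget_bias)      # about 0.731

kernel_shape = lstm_kernel_shape(input_depth=6, num_units=10)
print(gate_without_bias, gate_with_bias, kernel_shape)
```

With forget_bias=1.0 the forget gate keeps roughly 73% of the cell state instead of 50% when its pre-activation is zero, and the kernel shape shows why only num_units has to be given: the input depth is read from the data when the cell is built.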

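2.2 About output and state

In my view, one of the hardest things to understand about RNNs is the difference between output and state: output is what the cell emits, state is what it carries forward. In tensorflow, dynamic_rnn, static_rnn, bidirectional_dynamic_rnn and static_bidirectional_rnn all return an (outputs, last_states) tuple. Note that last_states is the final state, whereas outputs holds the output at every time step. Without a clear understanding of this point it is hard to make progress on RNN tasks in tensorflow.

output and state mean different things, and hold different values, in the basic RNN and its variants. Below we look at what they mean for the most basic cells:

BasicRNNCell
The basic RNN structure is shown below:

In the basic RNN structure, the output can be regarded as equal to the hidden state. Let us look at the values of outputs and last_states produced by the following code: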

import numpy as np
import tensorflow as tf

def dynamic_rnn_test():
    BATCH_SIZE = 2
    EMBEDDING_DIM = 4
    X = np.random.randn(BATCH_SIZE, 5, EMBEDDING_DIM)
    X_lengths = [5, 5]
    cell = tf.nn.rnn_cell.BasicRNNCell(num_units=10)
    outputs, last_states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float64, sequence_length=X_lengths, inputs=X)
    result = tf.contrib.learn.run_n({"outputs": outputs, "last_states": last_states}, n=1, feed_dict=None)
    print (result[0]["outputs"])
    print ("--------------------------------")
    print (result[0]["last_states"])

The output of the code above is:

## outputs
[[[-0.26285439 -0.73199998  0.67373167 -0.42019807 -0.76447828 -0.15671307  0.19419611 -0.06485997 -0.59310542  0.41760793]
  [-0.51952513  0.61765864  0.54485767  0.35961272  0.09553398  0.68890209   -0.46678386  0.34405317  0.8904701  -0.04432281]
  [ 0.96647506 -0.50980204  0.55754585  0.93328233  0.57254379  0.6663917  -0.40768854  0.86358991 -0.58068622 -0.72018298]
  [ 0.3345003  -0.09220678  0.69535521 -0.01648253 -0.21293752 -0.12114425  0.14904557  0.59020341  0.3342177   0.25945014]
  [ 0.05128395  0.86625483  0.28549682  0.76454802  0.44757274  0.691485  0.00960586  0.23504622  0.75175537 -0.33478982]]
 [[-0.27505881  0.78801392 -0.92769186  0.38675853  0.31331528 -0.79453833  0.77526593 -0.34045865  0.52494778  0.08722081]
  [ 0.21659185 -0.05254756 -0.46941906 -0.70990551  0.82241305  0.7653751  -0.75469825 -0.65669409 -0.68308972 -0.54132448]
  [ 0.6928769  -0.80066683  0.02133818 -0.66396161 -0.48229484 -0.80333658  0.66119584  0.79458079 -0.73295564 -0.65123496]
  [-0.84663     0.26150571 -0.35573722 -0.88728337 -0.70946976 -0.59880986  0.95380342  0.63640031  0.14041671 -0.74008235]
  [-0.49611388 -0.6615701  -0.91717102 -0.7921021   0.19823286 -0.52368639  0.73433595 -0.42381531 -0.22037713 -0.6572696 ]]]
--------------------------------
## last_states
[[ 0.05128395  0.86625483  0.28549682  0.76454802  0.44757274  0.691485  0.00960586  0.23504622  0.75175537 -0.33478982]
 [-0.49611388 -0.6615701  -0.91717102 -0.7921021   0.19823286 -0.52368639  0.73433595 -0.42381531 -0.22037713 -0.6572696 ]]

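Compare outputs[0][4] (the output of the first sample at the last time step) with last_states[0] (the final state of the first sample), and outputs[1][4] (the output of the second sample at the last time step) with last_states[1] (the final state of the second sample). They are equal, which confirms the statement above.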
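The same relationship can be reproduced without tensorflow. The NumPy sketch below assumes that the basic cell computes tanh of a single matrix product over the concatenated input and previous state and returns that value both as output and as new state; the function name `basic_rnn` and the random weights are illustrative.

```python
import numpy as np

def basic_rnn(inputs, kernel, bias):
    """Unroll a basic RNN over batch-major inputs [batch_size, max_time, depth]."""
    batch_size, max_time, _ = inputs.shape
    num_units = bias.shape[0]
    state = np.zeros((batch_size, num_units))
    outputs = np.zeros((batch_size, max_time, num_units))
    for t in range(max_time):
        # the new state is also the output of this time step
        state = np.tanh(np.concatenate([inputs[:, t], state], axis=1) @ kernel + bias)
        outputs[:, t] = state
    return outputs, state

rng = np.random.default_rng(0)
batch_size, max_time, depth, num_units = 2, 5, 4, 10
X = rng.normal(size=(batch_size, max_time, depth))
kernel = rng.normal(scale=0.5, size=(depth + num_units, num_units))
bias = np.zeros(num_units)

outputs, last_state = basic_rnn(X, kernel, bias)
print(outputs.shape, last_state.shape)
print(np.array_equal(outputs[:, -1], last_state))
```

The output at the last time step and the final state are the same array of values here, which mirrors the comparison of outputs[0][4] with last_states[0] above.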

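BasicLSTMCell
An LSTM differs from the basic RNN (see section 1.2): it introduces gates and keeps a separate cell state in addition to the hidden state, so what BasicLSTMCell returns differs from BasicRNNCell. Let us look at the basic usage of BasicLSTMCell with an example: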

import numpy as np
import tensorflow as tf

BATCH_SIZE = 2
EMBEDDING_DIM = 4
X = np.random.randn(BATCH_SIZE, 5, EMBEDDING_DIM)
X_lengths = [5, 5]
## use an LSTM cell
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=10)
outputs, last_states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float64, sequence_length=X_lengths, inputs=X)
result = tf.contrib.learn.run_n({"outputs": outputs, "last_states": last_states}, n=1, feed_dict=None)
print (result[0]["outputs"])
print ("--------------------------------")
print (result[0]["last_states"])

The result of running it is:

[[[-0.02559858  0.06167649  0.1645186   0.05837131  0.01109476 -0.02938359  0.11438462 -0.0655154   0.05643917 -0.17126968]
  [ 0.03442813  0.01858283  0.19324815 -0.08899226  0.03568535 -0.04120232  0.1620414  -0.12981834  0.13737953 -0.10649411]
  [ 0.021796   -0.0292876   0.04972559 -0.04365079  0.06611464 -0.0123974  0.04471634 -0.09371935  0.07161399 -0.00129043]
  [-0.06856613  0.08481594  0.08859627 -0.07172004 -0.02254162  0.04920269  0.06426967 -0.07178349  0.06880909 -0.03122769]
  [-0.05681122  0.1265717   0.08145183 -0.10992898 -0.04531312  0.08419307  0.05815578 -0.03600487  0.06829341 -0.00815202]]
 [[-0.03048013 -0.05028687  0.04530328 -0.01116215 -0.00322128 -0.0376331  0.05989264 -0.1386925  -0.02739475 -0.0416665 ]
  [-0.07246373  0.00922893 -0.02089626  0.12696067  0.05484725 -0.05276134  0.02418303 -0.0003094  -0.04619291 -0.02940275]
  [-0.06912543  0.06466857  0.22031627 -0.07334317 -0.03599558  0.01374829  0.12909539 -0.1685715   0.05465224 -0.19901284]
  [-0.0769867   0.05043309  0.08731908  0.00185187  0.00557504  0.007338  0.0641817  -0.0849491   0.0245508  -0.07668919]
  [-0.01582939  0.00979516 -0.02073626  0.09953952  0.10595823 -0.0135512  -0.12155518  0.04029387  0.00712342  0.02277357]]]
--------------------------------
LSTMStateTuple(
c=array([[-0.13211159,  0.26529373,  0.18125151, -0.19673843, -0.10883727,  0.16908338,  0.10463188, -0.08444297,  0.17317917, -0.01578971],  [-0.03322975,  0.02126845, -0.04260041,  0.19423348,  0.22194511,  -0.03170695, -0.19370151,  0.10526997,  0.0245572 ,  0.05014028]]), 
h=array([[-0.05681122,  0.1265717 ,  0.08145183, -0.10992898, -0.04531312,  0.08419307,  0.05815578, -0.03600487,  0.06829341, -0.00815202],  [-0.01582939,  0.00979516, -0.02073626,  0.09953952,  0.10595823,  -0.0135512 , -0.12155518,  0.04029387,  0.00712342,  0.02277357]]))

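The results show that, as with BasicRNNCell, the outputs returned by BasicLSTMCell hold the output at every time step. (These outputs are in fact the hidden state at each time step; a common next step is to pass outputs through a fully connected layer and a softmax layer for classification.) What differs is last_states: for BasicLSTMCell it is an LSTMStateTuple with two elements, c and h, where c is the internal cell state at the last time step and h is the hidden state at the last time step.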

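GRUCell
From the description of GRU in section 1.3, the outputs of a GRU have the same form as those of LSTM and BasicRNNCell, and its last_states, like that of BasicRNNCell, contains only the hidden state at the last time step. Again, an example: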

import numpy as np
import tensorflow as tf

BATCH_SIZE = 2
EMBEDDING_DIM = 4
X = np.random.randn(BATCH_SIZE, 5, EMBEDDING_DIM)
X_lengths = [5, 5]
## use a GRU cell
cell = tf.nn.rnn_cell.GRUCell(num_units=10)
outputs, last_states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float64, sequence_length=X_lengths, inputs=X)
result = tf.contrib.learn.run_n({"outputs": outputs, "last_states": last_states}, n=1, feed_dict=None)
print (result[0]["outputs"])
print ("--------------------------------")
print (result[0]["last_states"])

The output is:

[[[ 0.2818741   0.03127117 -0.10587379  0.04028049  0.10053002  0.15848186  -0.18849411  0.27622443  0.38123248 -0.13761087]
  [ 0.20203697  0.27380701  0.20594786  0.32964536 -0.03476539  0.0324929  -0.17276558  0.23946512  0.25474486 -0.03569277]
  [ 0.09995877  0.03133022  0.03788231  0.33481101  0.05394468  0.17044128  -0.22957891  0.07784969  0.12172921 -0.11151596]
  [-0.01079724  0.34425545  0.36282874  0.51701521 -0.13545613  0.20845521  -0.16279659  0.08200397 -0.07883915 -0.0671937 ]
  [ 0.04381321  0.10883886  0.37020907  0.42074759  0.14924879  0.07081199  -0.20527748 -0.0342331  -0.01571459  0.01904762]]
 [[ 0.08714771  0.49216403  0.23638074  0.54007724 -0.12808233  0.05203507  0.04589614  0.20300933  0.00669649 -0.08931576]
  [ 0.15230049  0.31014089  0.25244098  0.44602376 -0.04282711  0.13599053  -0.01098503  0.14189271  0.04150135 -0.06910757]
  [ 0.3996701   0.10472691  0.21537184  0.39543418  0.22428281  0.07584328  -0.20120173  0.10623939  0.26915325 -0.09094824]
  [ 0.38323232  0.09812629  0.04226342  0.37831236  0.27365562  0.20740802  -0.24894298  0.1094313   0.2308372  -0.12473171]
  [ 0.27563199  0.01112365  0.06366856  0.41799209  0.45473254  0.27676832  -0.34215252  0.0085023   0.23020847 -0.23767658]]]
--------------------------------
[[ 0.04381321  0.10883886  0.37020907  0.42074759  0.14924879  0.07081199  -0.20527748 -0.0342331  -0.01571459  0.01904762]
 [ 0.27563199  0.01112365  0.06366856  0.41799209  0.45473254  0.27676832  -0.34215252  0.0085023   0.23020847 -0.23767658]]

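2.2 dynamic_rnn and static_rnn

The code above used a new API, dynamic_rnn. How is it used? Let us first look at its parameters:

def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None):

Let us use an example to explain these parameters:

Suppose the input to the RNN has shape [2, 5, 4], where 2 is batch_size, 5 is the maximum text length and 4 is embedding_size. There are two samples; suppose the second text has length 3 only (two sentences of different lengths: the first has length 5 and the second has length 3). Conventionally we then zero-pad the second sentence.

Suppose the cell has HIDDEN_SIZE=10. dynamic_rnn returns two values, outputs and last_states. outputs has shape [batch_size, max_length, HIDDEN_SIZE] and holds the output at every time step; last_states is the final state, which for an LSTM cell is a (c, h) tuple whose elements both have shape [batch_size, HIDDEN_SIZE].

Now for a very important parameter: sequence_length, which gives the length of each text. In the example above we set sequence_length = [5, 3], meaning the first sentence has length 5 and the second has length 3. When this parameter is passed, tensorflow skips the computation on the padding after step 3 of the second sample: its state from step 3 is copied through to the last step and returned in last_states, and its outputs beyond step 3 are set to zero.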
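The effect of sequence_length can be imitated in NumPy. The sketch below is not how tensorflow implements it internally; it only reproduces the behaviour described above (state copied through, outputs zeroed) for a basic tanh cell. The function name `rnn_with_lengths` and the random weights are illustrative assumptions.

```python
import numpy as np

def rnn_with_lengths(inputs, lengths, kernel, bias):
    """Basic RNN over [batch_size, max_time, depth] that honours per-sample lengths."""
    batch_size, max_time, _ = inputs.shape
    num_units = bias.shape[0]
    lengths = np.asarray(lengths)
    state = np.zeros((batch_size, num_units))
    outputs = np.zeros((batch_size, max_time, num_units))
    for t in range(max_time):
        new_state = np.tanh(np.concatenate([inputs[:, t], state], axis=1) @ kernel + bias)
        active = (t < lengths)[:, None]               # True while a sample is still running
        state = np.where(active, new_state, state)    # finished samples keep their state
        outputs[:, t] = np.where(active, new_state, 0.0)  # and emit zeros
    return outputs, state

rng = np.random.default_rng(0)
batch_size, max_time, depth, num_units = 2, 5, 4, 10
X = rng.normal(size=(batch_size, max_time, depth))
X[1, 3:] = 0                                          # zero-pad the second sample
kernel = rng.normal(scale=0.5, size=(depth + num_units, num_units))
bias = np.zeros(num_units)

outputs, last_state = rnn_with_lengths(X, [5, 3], kernel, bias)
print(outputs[1])
print(last_state[1])
```

For the second sample, the final state equals its output at step 3 (index 2), and the two padded steps produce all-zero outputs.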

The complete example:

import numpy as np
import tensorflow as tf

BATCH_SIZE = 2
EMBEDDING_DIM = 4
X = np.random.randn(BATCH_SIZE, 5, EMBEDDING_DIM)
## zero-pad the second sample after its first 3 time steps
X[1, 3:] = 0
X_lengths = [5, 3]
print (X)
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=10)
outputs, last_states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float64, sequence_length=X_lengths, inputs=X)
result = tf.contrib.learn.run_n({"outputs": outputs, "last_states": last_states}, n=1, feed_dict=None)
## print the outputs of both samples
print (result[0]["outputs"])
print ("--------------------------------")
print (result[0]["last_states"])

The result of running it is:

## outputs: the last two time steps of the second sample (index 1) are all zeros
[[[ 0.05207716  0.02861694  0.02288652  0.0409087   0.02277852  0.08781847  -0.00247622  0.06011031  0.13020951  0.000251  ]
  [-0.07430806  0.01455017  0.09025484  0.15882481 -0.09267721  0.06793234  -0.1265452   0.01313293  0.0492417   0.07891826]
  [ 0.07847199  0.08677312 -0.00230764  0.05670163 -0.0037383   0.06010096  -0.08496443  0.05015436  0.05824103  0.07590659]
  [-0.07346585 -0.02648895  0.03345542  0.09968405 -0.18647533 -0.07887278  -0.07283668 -0.12607698 -0.05489173  0.06213264]
  [ 0.01166963  0.02953469  0.03331421  0.09969857 -0.07744646  0.03921037  -0.11335107 -0.01561479  0.03017359  0.13571448]]
 [[-0.05919386 -0.00281228  0.00711866  0.06299696 -0.04266599  0.0259747  -0.07946968 -0.00289627 -0.02592751  0.07808545]
  [ 0.03129078 -0.02892244 -0.01596121 -0.0068038   0.04108508  0.01717525  0.02310689  0.0062713   0.01016119 -0.05758133]
  [ 0.2078819  -0.07376988 -0.12247892 -0.07339147  0.06875815 -0.07224105  0.09830238  0.04630288 -0.03923044 -0.15668869]
  [ 0.          0.          0.          0.          0.          0.          0.
    0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.          0.
    0.          0.          0.        ]]]
## last_states:
LSTMStateTuple(
c=array([[ 0.02389863,  0.06531061,  0.06234521,  0.20854867, -0.18754284,  0.08002822, -0.2021616 , -0.04034186,  0.05810034,  0.24871611],
       [ 0.45062326, -0.10235791, -0.27717666, -0.1610663 ,  0.23795195,  -0.12335596,  0.16761175,  0.07196362, -0.0869157 , -0.29036735]]), 
h=array([[ 0.01166963,  0.02953469,  0.03331421,  0.09969857, -0.07744646,  0.03921037, -0.11335107, -0.01561479,  0.03017359,  0.13571448],
       [ 0.2078819 , -0.07376988, -0.12247892, -0.07339147,  0.06875815,  -0.07224105,  0.09830238,  0.04630288, -0.03923044, -0.15668869]]))

Since there is a dynamic_rnn, there is naturally also a static_rnn. Here are its parameters:

def static_rnn(cell,
               inputs,
               initial_state=None,
               dtype=None,
               sequence_length=None,
               scope=None):

2.3 Bidirectional RNNs in tensorflow

tensorflow provides a bidirectional RNN interface, tf.nn.bidirectional_dynamic_rnn(). Let us first look at its signature:

def bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs, sequence_length=None, initial_state_fw=None, initial_state_bw=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None):

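Parameters:
cell_fw/cell_bw: the forward and backward RNN cells;
inputs: the input sequence;
sequence_length: the sequence lengths;
initial_state_fw/initial_state_bw: the initial states of the forward and backward cells, usually all zeros;
dtype: the data type;
time_major: an important parameter that determines the layout of the input and output tensors. If time_major is True, the tensors are time-major and must have shape [max_time, batch_size, depth]; if time_major is False, they are batch-major and must have shape [batch_size, max_time, depth]. The default is time_major=False, because in most cases tensorflow handles data batch-first, but time_major=True is somewhat more efficient because it avoids transposing the tensors at the beginning and end of the RNN computation. Here depth is the dimensionality of the input word vectors (the embedding size), and max_time can be understood as the sentence length (usually that of the longest sentence in the batch; shorter ones need padding).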
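The two layouts differ only by a transpose of the first two axes. A small NumPy sketch, with sizes chosen for illustration, shows the conversion that time_major=True lets the RNN skip:

```python
import numpy as np

batch_size, max_time, depth = 2, 5, 4
# batch-major layout: [batch_size, max_time, depth]
batch_major = np.arange(batch_size * max_time * depth).reshape(batch_size, max_time, depth)
# time-major layout: [max_time, batch_size, depth]
time_major = np.transpose(batch_major, (1, 0, 2))

print(batch_major.shape, time_major.shape)
# the word vector of sample 1 at time step 3 is the same in both layouts
print(np.array_equal(time_major[3, 1], batch_major[1, 3]))
```

In the time-major layout, `time_major[t]` is the whole batch at step `t`, which is the slice an RNN consumes at each iteration.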

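Return value
bidirectional_dynamic_rnn returns a tuple of the form (outputs, outputs_state),
where:

outputs=(outputs_fw, outputs_bw) is a tuple containing the output tensors of the forward cell and the backward cell. If time_major=False, each tensor has shape [batch_size, max_time, depth]; for text, max_time can be understood as the sentence length (usually that of the longest sentence; shorter ones need padding), and depth here is the output size of the corresponding cell (its num_units);

outputs_state=(outputs_state_fw, outputs_state_bw) is a tuple containing the final states of the forward and backward passes. With LSTM cells, outputs_state_fw and outputs_state_bw are both of type LSTMStateTuple, consisting of (c, h): the memory cell state and the hidden state;

Because bidirectional_dynamic_rnn returns the forward and backward results separately, they still need to be concatenated to obtain the final result, which can be done with tf.concat(outputs, 2).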
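The shape arithmetic of this concatenation can be checked with NumPy stand-ins for the two output tensors. The arrays below are placeholders filled with constants, not real RNN outputs, and the sizes are illustrative.

```python
import numpy as np

batch_size, max_time, num_units = 2, 5, 10
# placeholders for the forward and backward outputs, each [batch_size, max_time, num_units]
outputs_fw = np.ones((batch_size, max_time, num_units))
outputs_bw = -np.ones((batch_size, max_time, num_units))

# concatenate along the last axis (axis 2), as tf.concat(outputs, 2) does
outputs = np.concatenate((outputs_fw, outputs_bw), axis=2)
print(outputs.shape)
```

Every time step now carries 2 * num_units features: the first half comes from the forward cell and the second half from the backward cell.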

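Defining the forward and backward cells
cell_fw and cell_bw are defined in exactly the same way. If both cells are LSTM cells the result is a bidirectional LSTM; if both are GRU cells it is a bidirectional GRU: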

HIDDEN_SIZE = 100
## define the LSTM cells
cell_lstm_fw = tf.nn.rnn_cell.LSTMCell(HIDDEN_SIZE)
cell_lstm_bw = tf.nn.rnn_cell.LSTMCell(HIDDEN_SIZE)
## define the GRU cells
cell_gru_fw = tf.contrib.rnn.GRUCell(HIDDEN_SIZE)
cell_gru_bw = tf.contrib.rnn.GRUCell(HIDDEN_SIZE)
## or use the GRUCell from the nn module instead
cell_gru_fw = tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE)
cell_gru_bw = tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE)

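One point to note: when declaring an RNN/LSTM/GRU cell, you only need to pass HIDDEN_SIZE; the cell adapts to the dimensionality of the input data automatically.

The functionality under tf.contrib is contributed by the community; once it has been further validated and has matured, it is moved into the official modules.

Inside bidirectional_dynamic_rnn, the input sequence is reversed with array_ops.reverse_sequence so that the backward cell processes it from the last time step to the first.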
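What a length-aware reversal does to a padded batch can be sketched in NumPy. The function below is an illustrative stand-in written for this example, not the tensorflow implementation: it reverses only the first `lengths[i]` time steps of each sample and leaves the padding where it is.

```python
import numpy as np

def reverse_sequence(x, lengths):
    """Reverse the first lengths[i] time steps of each x[i]; leave padding untouched."""
    out = x.copy()
    for i, n in enumerate(lengths):
        out[i, :n] = x[i, :n][::-1]
    return out

# two sequences of token ids, the second one padded with zeros after 3 steps
batch = np.array([[1, 2, 3, 4, 5],
                  [6, 7, 8, 0, 0]])
reversed_batch = reverse_sequence(batch, lengths=[5, 3])
print(reversed_batch)
```

Reversing only the valid part matters for padded batches: the backward cell then starts from the last real token rather than from padding.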

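In practice, you only need to pass the two cells you have defined as arguments (a dtype is required when no initial state is given):

(outputs, outputs_state) = tf.nn.bidirectional_dynamic_rnn(cell_gru_fw, cell_gru_bw, inputs_embedded, dtype=tf.float32)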

Note that inputs_embedded is the input tensor, with shape [batch_size, max_time, depth].

The final output is outputs = tf.concat((outputs_fw, outputs_bw), 2), or simply outputs = tf.concat(outputs, 2).

If the final state is also needed, (outputs_state_fw, outputs_state_bw) has to be processed further:

final_state_c = tf.concat((outputs_state_fw.c, outputs_state_bw.c), 1)
final_state_h = tf.concat((outputs_state_fw.h, outputs_state_bw.h), 1)
outputs_final_state = tf.contrib.rnn.LSTMStateTuple(c=final_state_c, h=final_state_h)

The basic steps for building a bidirectional LSTM are as follows:

import tensorflow as tf

vocab_size = 1000
embedding_size = 50
batch_size = 100
max_time = 10
hidden_units = 10
inputs = tf.placeholder(shape=(batch_size, max_time), dtype=tf.int32, name='inputs')
embedding = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), dtype=tf.float32)
inputs_embedded = tf.nn.embedding_lookup(embedding, inputs)
# use a separate cell object for each direction
lstm_cell_fw = tf.nn.rnn_cell.BasicLSTMCell(hidden_units)
lstm_cell_bw = tf.nn.rnn_cell.BasicLSTMCell(hidden_units)
# sequence_length needs one length per sample; here every sample is assumed to use all max_time steps
sequence_length = tf.fill([batch_size], max_time)
((outputs_fw, outputs_bw), (outputs_state_fw, outputs_state_bw)) = tf.nn.bidirectional_dynamic_rnn(
    lstm_cell_fw, lstm_cell_bw, inputs_embedded, sequence_length=sequence_length, dtype=tf.float32)
outputs = tf.concat((outputs_fw, outputs_bw), 2)
final_state_c = tf.concat((outputs_state_fw.c, outputs_state_bw.c), 1)
final_state_h = tf.concat((outputs_state_fw.h, outputs_state_bw.h), 1)
outputs_final_state = tf.contrib.rnn.LSTMStateTuple(c=final_state_c,
                                                    h=final_state_h)

That is the basic workflow for implementing a bidirectional LSTM.
