16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset

  • IMDb (the Internet Movie Database) movie reviews dataset
    • Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher Potts, "Learning Word Vectors for Sentiment Analysis," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011, pp. 142–150. http://www.aclweb.org/anthology/P11-1015.
  • Perform binary classification to predict whether a review’s sentiment is positive or negative

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • Recurrent neural networks (RNNs) process sequences of data
    • time series
    • text in sentences
  • “Recurrent” because the neural network contains loops
    • Output of a given layer becomes the input to that same layer in the next time step
  • Time step
    • Next point in time for a time series
    • Next word in a sequence of words for a text sequence
  • Loops in RNNs help them learn relationships among data in the sequence (see the sketch below)
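  • The following is a minimal sketch (not from the book's example) of the recurrence idea in plain NumPy: the state produced at one time step is fed back in at the next, so each output depends on everything seen so far; the weights W, U and bias b here are just random placeholders
import numpy as np

# toy "RNN cell" (not from the book's example):
# new_state = tanh(W @ input + U @ previous_state + b)
rng = np.random.default_rng(seed=11)
W = rng.normal(size=(4, 3))   # input-to-hidden weights (hidden size 4, input size 3)
U = rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
b = np.zeros(4)               # bias

state = np.zeros(4)                  # initial hidden state
sequence = rng.normal(size=(5, 3))   # a sequence of 5 time steps, 3 features each

for x in sequence:                   # this loop is what makes the network "recurrent"
    state = np.tanh(W @ x + U @ state + b)   # output feeds back in at the next step

print(state)                         # final state summarizes the whole sequence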

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • “Good” on its own has positive sentiment
  • “Not good” has negative sentiment
    • “not” is earlier in the sequence
  • RNNs take into account the relationships among earlier and later data in a sequence
  • Here, the words that determined sentiment were adjacent
  • When determining text's meaning, there can be many words to consider and an arbitrary number of words between them

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • Long Short-Term Memory (LSTM) layer makes the neural network recurrent
  • Optimized to handle learning from sequences
  • RNNs have been used for many tasks including:[1],[2],[3]
    • predictive text input—displaying possible next words as you type,
    • sentiment analysis
    • responding to questions with predicted best answers from a corpus
    • inter-language translation
    • automated video closed captioning
    • speech recognition
    • speech synthesis

16.9.1 Loading the IMDb Movie Reviews Dataset

  • Contains 25,000 training samples and 25,000 testing samples, each labeled with its positive (1) or negative (0) sentiment
In [1]:
from tensorflow.keras.datasets import imdb
  • Over 88,000 unique words in the dataset
  • Can specify number of unique words to import when loading training and testing data
  • We'll use top 10,000 most frequently occurring words
    • Due to system memory limitations and because we're intentionally training on a CPU
    • Most people don't have systems with Tensorflow-compatible GPUs or TPUs
  • More data takes longer to train, but may produce better models

16.9.1 Loading the IMDb Movie Reviews Dataset (cont.)

  • load_data replaces any words outside the top 10,000 with a placeholder value (discussed shortly)
In [2]:
number_of_words = 10000

NOTE: The following cell was added to work around a known issue between TensorFlow/Keras and NumPy at the time we created these slides; the issue is fixed in a forthcoming version. This cell's code comes from StackOverflow.

In [3]:
import numpy as np

# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words=number_of_words)
In [5]:
# This cell completes the workaround mentioned above
# restore np.load for future normal usage
np.load = np_load_old

16.9.2 Data Exploration

  • Check sample and target dimensions
  • Note that X_train and X_test appear to be one-dimensional
    • They're actually NumPy arrays of objects (lists of integers)
In [6]:
X_train.shape
Out[6]:
(25000,)
In [7]:
y_train.shape
Out[7]:
(25000,)
In [8]:
X_test.shape
Out[8]:
(25000,)
In [9]:
y_test.shape
Out[9]:
(25000,)

16.9.2 Data Exploration (cont.)

  • The arrays y_train and y_test are one-dimensional arrays containing 1s and 0s, indicating whether each review is positive or negative
  • X_train and X_test are lists of integers, each representing one review’s contents
  • Keras models require numeric data; the IMDb dataset comes preprocessed for you
In [10]:
%pprint  # toggle pretty printing, so elements don't display vertically
Pretty printing has been turned OFF
In [11]:
X_train[123]
Out[11]:
[1, 307, 5, 1301, 20, 1026, 2511, 87, 2775, 52, 116, 5, 31, 7, 4, 91, 1220, 102, 13, 28, 110, 11, 6, 137, 13, 115, 219, 141, 35, 221, 956, 54, 13, 16, 11, 2714, 61, 322, 423, 12, 38, 76, 59, 1803, 72, 8, 2, 23, 5, 967, 12, 38, 85, 62, 358, 99]

Movie Review Encodings

  • Because the movie reviews are numerically encoded, to view their original text, you need to know the word to which each number corresponds
  • Keras’s IMDb dataset provides a dictionary that maps the words to their indexes
  • Each word’s value is its frequency ranking among all words in the dataset
    • Ranking 1 is the most frequently occurring word
    • Ranking 2 is the second most frequently occurring word
    • ...

Movie Review Encodings (cont.)

  • Ranking values are offset by 3 in the training/testing samples
    • Most frequently occurring word has the value 4 wherever it appears in a review
  • 0, 1 and 2 in each encoded review are reserved:
    • padding (0)
      • All training/testing samples must have same dimensions
      • Some reviews may need to be padded with 0 and some shortened
    • start of a sequence (1) — a token that Keras uses internally for learning purposes
    • unknown word (2) — typically a word that was not loaded
      • load_data uses 2 for words with frequency rankings greater than num_words

Decoding a Movie Review

  • Must account for offset when decoding reviews
  • Get the word-to-index dictionary
In [12]:
word_to_index = imdb.get_word_index()
  • The word 'great' might appear in a positive movie review:
In [13]:
word_to_index['great']  # 84th most frequent word
Out[13]:
84

Decoding a Movie Review (cont.)

  • Reverse the word_to_index mapping, so we can look up words by their frequency rankings
In [14]:
index_to_word = {index: word for (word, index) in word_to_index.items()}
  • Top 50 words; the most frequent word has the key 1 in the new dictionary
In [15]:
[index_to_word[i] for i in range(1, 51)]
Out[15]:
['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be', 'one', 'all', 'at', 'by', 'an', 'they', 'who', 'so', 'from', 'like', 'her', 'or', 'just', 'about', "it's", 'out', 'has', 'if', 'some', 'there', 'what', 'good', 'more']

Decoding a Movie Review (cont.)

  • Now, we can decode a review
  • i - 3 accounts for the frequency-ranking offsets in the encoded reviews
  • For i values 0 through 2, get returns '?'; otherwise, get returns the word with the key i - 3 in the index_to_word dictionary
In [16]:
' '.join([index_to_word.get(i - 3, '?') for i in X_train[123]])
Out[16]:
'? beautiful and touching movie rich colors great settings good acting and one of the most charming movies i have seen in a while i never saw such an interesting setting when i was in china my wife liked it so much she asked me to ? on and rate it so other would enjoy too'
  • Can see from y_train[123] that this review is classified as positive
In [17]:
y_train[123]
Out[17]:
1
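  • As an aside, here is a hedged sketch (not part of the book's example) of encoding new text the same way load_data encodes the reviews: apply the offset of 3 to each frequency ranking, begin with the token 1 and use 2 for any word outside the top 10,000; the helper name encode_review is our own
# hypothetical helper (not in the book); assumes the offset-by-3 scheme described above
def encode_review(text, word_to_index, num_words=10000):
    """Encode text like the IMDb samples: 1 = start, 2 = unknown/out-of-vocabulary."""
    encoded = [1]  # 1 marks the start of a sequence
    for word in text.lower().split():
        index = word_to_index.get(word, 0) + 3  # apply the offset of 3
        # use 2 for unknown words or words outside the top num_words
        encoded.append(index if 3 < index < num_words else 2)
    return encoded

encode_review('not good', word_to_index)  # [1, 24, 52], given the rankings shown above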

16.9.3 Data Preparation

  • Number of words per review varies
  • Keras requires all samples to have the same dimensions
  • Prepare data for learning
    • Restrict every review to the same number of words
    • Pad some with 0s, truncate others
  • pad_sequences function reshapes samples and returns a 2D array
In [18]:
words_per_review = 200  
In [19]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [20]:
X_train = pad_sequences(X_train, maxlen=words_per_review)
In [21]:
X_train.shape
Out[21]:
(25000, 200)
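  • A small illustration (not from the book) of pad_sequences' default behavior, which pads and truncates at the front of each sequence
# not part of the book's example: by default, pad_sequences pads and
# truncates at the beginning of each sequence
pad_sequences([[5, 6, 7]], maxlen=5)           # array([[0, 0, 5, 6, 7]])
pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5)  # array([[2, 3, 4, 5, 6]])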

16.9.3 Data Preparation (cont.)

  • Must also reshape X_test for evaluating the model later
In [22]:
X_test = pad_sequences(X_test, maxlen=words_per_review) 
In [23]:
X_test.shape
Out[23]:
(25000, 200)

Splitting the Test Data into Validation and Test Data

  • Split the 25,000 test samples into 20,000 test samples and 5,000 validation samples
  • We'll pass the validation samples to the model’s fit method via its validation_data argument
  • Use Scikit-learn’s train_test_split function
In [24]:
from sklearn.model_selection import train_test_split
In [25]:
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, random_state=11, test_size=0.20) 
  • Confirm the split by checking X_test’s and X_val’s shapes:
In [26]:
X_test.shape
Out[26]:
(20000, 200)
In [27]:
X_val.shape
Out[27]:
(5000, 200)

16.9.4 Creating the Neural Network

  • Begin with a Sequential model and import the other layers
In [28]:
from tensorflow.keras.models import Sequential
In [29]:
rnn = Sequential()
In [30]:
from tensorflow.keras.layers import Dense, LSTM, Embedding

Adding an Embedding Layer

  • Our convnet example used one-hot encoding to convert MNIST’s integer labels into categorical data
    • Result for each label was a vector in which all but one element was 0
  • Could do that for index values that represent words, but with 10,000 unique words:
    • Need a 10,000-by-10,000 array to represent all words
    • 100,000,000 elements and almost all would be 0
    • For all 88,000+ unique words in the dataset, need nearly eight billion elements!
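  • A quick back-of-the-envelope check (not in the book) of those one-hot sizes
# quick check (not in the book): one-hot storage grows with the square of the vocabulary size
print(10_000 ** 2)   # 100000000 elements for the top 10,000 words
print(88_000 ** 2)   # 7744000000 elements (nearly eight billion) for 88,000+ words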

Adding an Embedding Layer (cont.)

  • To reduce dimensionality, RNNs that process text sequences typically begin with an embedding layer
  • Encodes each word in a more compact dense-vector representation
  • These capture the word’s context—how a given word relates to words around it
  • Help RNN learn word relationships
  • Predefined word embeddings, such as Word2Vec and GloVe
    • Can load into neural networks to save training time
    • Sometimes used to add basic word relationships to a model when smaller amounts of training data are available
    • Improve model accuracy by building upon previously learned word relationships, rather than trying to learn those relationships with insufficient data
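  • A hedged sketch (not part of the book's example) of how one might load pretrained GloVe vectors into a Keras Embedding layer; the file 'glove.6B.100d.txt' refers to the vectors downloadable from https://nlp.stanford.edu/projects/glove/, and the index offset mirrors the IMDb encoding described earlier
# hedged sketch (not part of the book's example)
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

embedding_dim = 100  # glove.6B.100d.txt provides 100-dimensional vectors
embedding_matrix = np.zeros((number_of_words, embedding_dim))

with open('glove.6B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        word, *vector = line.split()
        index = word_to_index.get(word, 0) + 3  # same offset of 3 as the IMDb encoding
        if 3 < index < number_of_words:
            embedding_matrix[index] = np.asarray(vector, dtype='float32')

# initialize the layer with the pretrained vectors and freeze them
pretrained_embedding = Embedding(
    input_dim=number_of_words, output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=words_per_review, trainable=False)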

Adding an Embedding Layer (cont.)

In [31]:
rnn.add(Embedding(input_dim=number_of_words, output_dim=128,
                  input_length=words_per_review))
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
  • input_dim=number_of_words—Number of unique words
  • output_dim=128—Size of each word embedding
  • input_length=words_per_review—Number of words in each input sample

Adding an LSTM Layer

In [32]:
rnn.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/keras/backend.py:4010: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
  • units: the number of neurons in the layer
    • More neurons means network can remember more
    • Guideline: Value between length of the sequences (200 in this example) and number of classes to predict (2 in this example)
  • dropout: the percentage of neurons to randomly disable when processing the layer’s input and output
    • Like pooling layers in a convnet, dropout is a proven technique that reduces overfitting
    • Keras also provides a Dropout layer that you can add to your models (see the sketch after this list)
  • recurrent_dropout: the percentage of neurons to randomly disable when the layer’s output is fed back into the layer again, allowing the network to learn from what it has seen previously
    • Mechanics of how the LSTM layer performs its task are beyond scope.
      • Chollet says: “you don’t need to understand anything about the specific architecture of an LSTM cell; as a human, it shouldn’t be your job to understand it. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time.”
      • Chollet, François. Deep Learning with Python. p. 204. Shelter Island, NY: Manning Publications, 2018.
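  • As mentioned above, Keras provides a standalone Dropout layer; here is a minimal sketch (not part of this chapter's model) of using it between Dense layers
from tensorflow.keras.layers import Dropout

# hypothetical model fragment (not part of this chapter's model):
# randomly disable 20% of the previous layer's outputs during training
model_fragment = Sequential()
model_fragment.add(Dense(units=64, activation='relu', input_shape=(100,)))
model_fragment.add(Dropout(rate=0.2))
model_fragment.add(Dense(units=1, activation='sigmoid'))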

Adding a Dense Output Layer

  • Reduce the LSTM layer’s output to one result indicating whether a review is positive or negative, thus the value 1 for the units argument
  • 'sigmoid' activation function is preferred for binary classification
    • Chollet, François. Deep Learning with Python. p.114. Shelter Island, NY: Manning Publications, 2018.
    • Reduces arbitrary values into the range 0.0–1.0, producing a probability
In [33]:
rnn.add(Dense(units=1, activation='sigmoid'))
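  • A small illustration (not from the book) of the sigmoid calculation, which squashes any real value into the range 0.0–1.0
# illustration only (not from the book)
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e**-x)"""
    return 1 / (1 + np.exp(-x))

sigmoid(np.array([-5.0, 0.0, 5.0]))  # approximately [0.0067, 0.5, 0.9933]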

Compiling the Model and Displaying the Summary

  • Two possible outputs, so we use the binary_crossentropy loss function:
In [34]:
rnn.compile(optimizer='adam',
            loss='binary_crossentropy', 
            metrics=['accuracy'])
  • Fewer layers than our convnet, but nearly three times as many parameters (the network’s weights)
    • More parameters means more training time
    • The large number of parameters primarily comes from the number of words in the vocabulary (we loaded 10,000) times the number of neurons in the Embedding layer’s output (128)
In [35]:
rnn.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 128)          1280000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
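  • As a check (not in the book), the parameter counts in the summary can be reproduced with the standard Keras layer formulas
# quick check (not in the book) of the summary's parameter counts
embedding_params = 10_000 * 128                   # vocabulary size * embedding dimensions
lstm_params = 4 * (128 * 128 + 128 * 128 + 128)   # 4 gates * (input weights + recurrent weights + biases)
dense_params = 128 * 1 + 1                        # one weight per LSTM output plus one bias
print(embedding_params, lstm_params, dense_params)       # 1280000 131584 129
print(embedding_params + lstm_params + dense_params)     # 1411713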

16.9.5 Training and Evaluating the Model

  • Each training epoch takes significantly longer for the RNN model than for our convnet
    • Due to the larger number of parameters (weights) our RNN model needs to learn
  • Note that the call below passes the 20,000 test samples, not (X_val, y_val), as validation_data; to validate on the 5,000-sample validation set created earlier, pass validation_data=(X_val, y_val) instead
In [36]:
rnn.fit(X_train, y_train, epochs=10, batch_size=32, 
        validation_data=(X_test, y_test))
Train on 25000 samples, validate on 20000 samples
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
25000/25000 [==============================] - 322s 13ms/sample - loss: 0.4724 - acc: 0.7806 - val_loss: 0.3799 - val_acc: 0.8342
Epoch 2/10
25000/25000 [==============================] - 298s 12ms/sample - loss: 0.3411 - acc: 0.8574 - val_loss: 0.3446 - val_acc: 0.8567
Epoch 3/10
25000/25000 [==============================] - 282s 11ms/sample - loss: 0.2696 - acc: 0.8920 - val_loss: 0.3375 - val_acc: 0.8571
Epoch 4/10
25000/25000 [==============================] - 274s 11ms/sample - loss: 0.2149 - acc: 0.9159 - val_loss: 0.4126 - val_acc: 0.8597
Epoch 5/10
25000/25000 [==============================] - 254s 10ms/sample - loss: 0.1677 - acc: 0.9374 - val_loss: 0.4108 - val_acc: 0.8607
Epoch 6/10
25000/25000 [==============================] - 249s 10ms/sample - loss: 0.1284 - acc: 0.9529 - val_loss: 0.4379 - val_acc: 0.8576
Epoch 7/10
25000/25000 [==============================] - 254s 10ms/sample - loss: 0.1033 - acc: 0.9636 - val_loss: 0.4487 - val_acc: 0.8599
Epoch 8/10
25000/25000 [==============================] - 246s 10ms/sample - loss: 0.0845 - acc: 0.9715 - val_loss: 0.4942 - val_acc: 0.8597
Epoch 9/10
25000/25000 [==============================] - 250s 10ms/sample - loss: 0.0716 - acc: 0.9756 - val_loss: 0.4808 - val_acc: 0.8472
Epoch 10/10
25000/25000 [==============================] - 270s 11ms/sample - loss: 0.0553 - acc: 0.9809 - val_loss: 0.5299 - val_acc: 0.8511
Out[36]:
<tensorflow.python.keras.callbacks.History object at 0x140abeac8>
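  • If you capture fit's return value (e.g., history = rnn.fit(...)), you can graph the training and validation metrics; a minimal sketch (not in the book), assuming the metric key 'acc' shown in the output above (newer TensorFlow versions name it 'accuracy')
# sketch (not in the book); assumes you captured history = rnn.fit(...)
import matplotlib.pyplot as plt

def plot_history(history, metric='acc'):
    """Plot training vs. validation values of a metric from a Keras History object."""
    plt.plot(history.history[metric], label='training')
    plt.plot(history.history['val_' + metric], label='validation')
    plt.xlabel('epoch')
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# hypothetical usage: plot_history(history, metric='acc')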

16.9.5 Training and Evaluating the Model (cont.)

  • Function evaluate returns the loss and accuracy values
In [37]:
results = rnn.evaluate(X_test, y_test)
20000/20000 [==============================] - 34s 2ms/sample - loss: 0.5299 - acc: 0.8511
In [38]:
results
Out[38]:
[0.5299076704353094, 0.85115]
  • Accuracy seems low compared to our convnet, but this is a much more difficult problem
    • Many IMDb sentiment-analysis binary-classification studies show results in the high 80s
  • We did reasonably well with our small recurrent neural network of only three layers
    • We have not tried to tune our model
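  • To classify new text with the trained model, you could encode it the same way the dataset is encoded and pad it to 200 words; a hedged sketch (not from the book) that reuses the hypothetical encode_review helper sketched earlier
# not part of the book's example; encode_review was sketched in the data-exploration discussion
new_review = 'not good at all one of the worst movies i have seen'
encoded = pad_sequences([encode_review(new_review, word_to_index)],
                        maxlen=words_per_review)
probability = rnn.predict(encoded)[0][0]  # sigmoid output near 1.0 means positive
print('positive' if probability > 0.5 else 'negative')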

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 16 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.