16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset

  • IMDb (the Internet Movie Database) movie reviews dataset
    • Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher Potts, "Learning Word Vectors for Sentiment Analysis," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011, pp. 142–150. http://www.aclweb.org/anthology/P11-1015.
  • Perform binary classification to predict whether a review’s sentiment is positive or negative

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • Recurrent neural networks (RNNs) process sequences of data
    • time series
    • text in sentences
  • “Recurrent” because the neural network contains loops
    • Output of a given layer becomes the input to that same layer in the next time step
  • Time step
    • Next point in time for a time series
    • Next word in a sequence of words for a text sequence
  • Loops in RNNs help them learn relationships among data in the sequence (see the sketch below)
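  • The following is a minimal sketch (not from the book's example) of the recurrence idea in plain NumPy: the state produced at one time step is fed back in at the next, so each output depends on everything seen so far; the weights W, U and bias b here are just random placeholders
import numpy as np

# toy "RNN cell" (not from the book's example):
# new_state = tanh(W @ input + U @ previous_state + b)
rng = np.random.default_rng(seed=11)
W = rng.normal(size=(4, 3))   # input-to-hidden weights (hidden size 4, input size 3)
U = rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
b = np.zeros(4)               # bias

state = np.zeros(4)                  # initial hidden state
sequence = rng.normal(size=(5, 3))   # a sequence of 5 time steps, 3 features each

for x in sequence:                   # this loop is what makes the network "recurrent"
    state = np.tanh(W @ x + U @ state + b)   # output feeds back in at the next step

print(state)                         # final state summarizes the whole sequence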

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • “Good” on its own has positive sentiment
  • “Not good” has negative sentiment
    • “not” is earlier in the sequence
  • RNNs take into account the relationships among earlier and later data in a sequence
  • Here, the words that determined sentiment were adjacent
  • When determining text's meaning, there can be many words to consider and an arbitrary number of words between them

16.9 Recurrent Neural Networks for Sequences; Sentiment Analysis with the IMDb Dataset (cont.)

  • Long Short-Term Memory (LSTM) layer makes the neural network recurrent
  • Optimized to handle learning from sequences
  • RNNs have been used for many tasks including:[1],[2],[3]
    • predictive text input—displaying possible next words as you type,
    • sentiment analysis
    • responding to questions with predicted best answers from a corpus
    • inter-language translation
    • automated video closed captioning
    • speech recognition
    • speech synthesis

16.9.1 Loading the IMDb Movie Reviews Dataset

  • Contains 25,000 training samples and 25,000 testing samples, each labeled with its positive (1) or negative (0) sentiment
In [1]:
from tensorflow.keras.datasets import imdb
  • Over 88,000 unique words in the dataset
  • Can specify number of unique words to import when loading training and testing data
  • We'll use top 10,000 most frequently occurring words
    • Due to system memory limitations and because we're intentionally training on a CPU
    • Most people don't have systems with Tensorflow-compatible GPUs or TPUs
  • More data takes longer to train, but may produce better models

16.9.1 Loading the IMDb Movie Reviews Dataset (cont.)

  • load_data replaces any words outside the top 10,000 with a placeholder value (discussed shortly)
In [2]:
number_of_words = 10000

NOTE: The following cell was added to work around a known issue between TensorFlow/Keras and NumPy at the time we created these slides; the issue is fixed in a forthcoming version. This cell's code comes from StackOverflow.

In [3]:
import numpy as np

# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words=number_of_words)
In [5]:
# This cell completes the workaround mentioned above
# restore np.load for future normal usage
np.load = np_load_old

16.9.2 Data Exploration

  • Check sample and target dimensions
  • Note that X_train and X_test appear to be one-dimensional
    • They're actually NumPy arrays of objects (lists of integers)
In [6]:
X_train.shape
Out[6]:
(25000,)
In [7]:
y_train.shape
Out[7]:
(25000,)
In [8]:
X_test.shape
Out[8]:
(25000,)
In [9]:
y_test.shape
Out[9]:
(25000,)

16.9.2 Data Exploration (cont.)

  • The arrays y_train and y_test are one-dimensional arrays containing 1s and 0s, indicating whether each review is positive or negative
  • X_train and X_test are lists of integers, each representing one review’s contents
  • Keras models require numeric data; the IMDb dataset comes preprocessed for you
In [10]:
%pprint  # toggle pretty printing, so elements don't display vertically
Pretty printing has been turned OFF
In [11]:
X_train[123]
Out[11]:
[1, 307, 5, 1301, 20, 1026, 2511, 87, 2775, 52, 116, 5, 31, 7, 4, 91, 1220, 102, 13, 28, 110, 11, 6, 137, 13, 115, 219, 141, 35, 221, 956, 54, 13, 16, 11, 2714, 61, 322, 423, 12, 38, 76, 59, 1803, 72, 8, 2, 23, 5, 967, 12, 38, 85, 62, 358, 99]

Movie Review Encodings

  • Because the movie reviews are numerically encoded, to view their original text, you need to know the word to which each number corresponds
  • Keras’s IMDb dataset provides a dictionary that maps the words to their indexes
  • Each word’s value is its frequency ranking among all words in the dataset
    • Ranking 1 is the most frequently occurring word
    • Ranking 2 is the second most frequently occurring word
    • ...

Movie Review Encodings (cont.)

  • Ranking values are offset by 3 in the training/testing samples
    • Most frequently occurring word has the value 4 wherever it appears in a review
  • 0, 1 and 2 in each encoded review are reserved:
    • padding (0)
      • All training/testing samples must have same dimensions
      • Some reviews may need to be padded with 0 and some shortened
    • start of a sequence (1) — a token that Keras uses internally for learning purposes
    • unknown word (2) — typically a word that was not loaded
      • load_data uses 2 for words with frequency rankings greater than num_words

Decoding a Movie Review

  • Must account for offset when decoding reviews
  • Get the word-to-index dictionary
In [12]:
word_to_index = imdb.get_word_index()
  • The word 'great' might appear in a positive movie review:
In [13]:
word_to_index['great']  # 84th most frequent word
Out[13]:
84

Decoding a Movie Review (cont.)

  • Reverse the word_to_index mapping, so we can look up words by their frequency rankings
In [14]:
index_to_word = {index: word for (word, index) in word_to_index.items()}
  • Top 50 words; the most frequent word has the key 1 in the new dictionary
In [15]:
[index_to_word[i] for i in range(1, 51)]
Out[15]:
['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be', 'one', 'all', 'at', 'by', 'an', 'they', 'who', 'so', 'from', 'like', 'her', 'or', 'just', 'about', "it's", 'out', 'has', 'if', 'some', 'there', 'what', 'good', 'more']

Decoding a Movie Review (cont.)

  • Now, we can decode a review
  • i - 3 accounts for the frequency-ranking offsets in the encoded reviews
  • For i values 0 through 2, get returns '?'; otherwise, get returns the word with the key i - 3 in the index_to_word dictionary
In [16]:
' '.join([index_to_word.get(i - 3, '?') for i in X_train[123]])
Out[16]:
'? beautiful and touching movie rich colors great settings good acting and one of the most charming movies i have seen in a while i never saw such an interesting setting when i was in china my wife liked it so much she asked me to ? on and rate it so other would enjoy too'
  • Can see from y_train[123] that this review is classified as positive
In [17]:
y_train[123]
Out[17]:
1
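  • As an aside, here is a hedged sketch (not part of the book's example) of encoding new text the same way load_data encodes the reviews: apply the offset of 3 to each frequency ranking, begin with the token 1 and use 2 for any word outside the top 10,000; the helper name encode_review is our own
# hypothetical helper (not in the book); assumes the offset-by-3 scheme described above
def encode_review(text, word_to_index, num_words=10000):
    """Encode text like the IMDb samples: 1 = start, 2 = unknown/out-of-vocabulary."""
    encoded = [1]  # 1 marks the start of a sequence
    for word in text.lower().split():
        index = word_to_index.get(word, 0) + 3  # apply the offset of 3
        # use 2 for unknown words or words outside the top num_words
        encoded.append(index if 3 < index < num_words else 2)
    return encoded

encode_review('not good', word_to_index)  # [1, 24, 52], given the rankings shown above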

16.9.3 Data Preparation

  • Number of words per review varies
  • Keras requires all samples to have the same dimensions
  • Prepare data for learning
    • Restrict every review to the same number of words
    • Pad some with 0s, truncate others
  • pad_sequences function reshapes samples and returns a 2D array
In [18]:
words_per_review = 200  
In [19]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [20]:
X_train = pad_sequences(X_train, maxlen=words_per_review)
In [21]:
X_train.shape
Out[21]:
(25000, 200)
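  • A small illustration (not from the book) of pad_sequences' default behavior, which pads and truncates at the front of each sequence
# not part of the book's example: by default, pad_sequences pads and
# truncates at the beginning of each sequence
pad_sequences([[5, 6, 7]], maxlen=5)           # array([[0, 0, 5, 6, 7]])
pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5)  # array([[2, 3, 4, 5, 6]])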

16.9.3 Data Preparation (cont.)

  • Must also reshape X_test for evaluating the model later
In [22]:
X_test = pad_sequences(X_test, maxlen=words_per_review) 
In [23]:
X_test.shape
Out[23]:
(25000, 200)

Splitting the Test Data into Validation and Test Data

  • Split the 25,000 test samples into 20,000 test samples and 5,000 validation samples
  • We'll pass the validation samples to the model’s fit method via its validation_data argument
  • Use Scikit-learn’s train_test_split function
In [24]:
from sklearn.model_selection import train_test_split
In [25]:
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, random_state=11, test_size=0.20) 
  • Confirm the split by checking X_test’s and X_val’s shapes:
In [26]:
X_test.shape
Out[26]:
(20000, 200)
In [27]:
X_val.shape
Out[27]:
(5000, 200)

16.9.4 Creating the Neural Network

  • Begin with a Sequential model and import the other layers
In [28]:
from tensorflow.keras.models import Sequential
In [29]:
rnn = Sequential()
In [30]:
from tensorflow.keras.layers import Dense, LSTM, Embedding

Adding an Embedding Layer

  • Our convnet example used one-hot encoding to convert MNIST’s integer labels into categorical data
    • Result for each label was a vector in which all but one element was 0
  • Could do that for index values that represent words, but with 10,000 unique words:
    • Need a 10,000-by-10,000 array to represent all words
    • 100,000,000 elements and almost all would be 0
    • For all 88,000+ unique words in the dataset, need nearly eight billion elements!
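  • A quick back-of-the-envelope check (not in the book) of those one-hot sizes
# quick check (not in the book): one-hot storage grows with the square of the vocabulary size
print(10_000 ** 2)   # 100000000 elements for the top 10,000 words
print(88_000 ** 2)   # 7744000000 elements (nearly eight billion) for 88,000+ words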

Adding an Embedding Layer (cont.)

  • To reduce dimensionality, RNNs that process text sequences typically begin with an embedding layer
  • Encodes each word in a more compact dense-vector representation
  • These capture the word’s context—how a given word relates to words around it
  • Help RNN learn word relationships
  • Predefined word embeddings, such as Word2Vec and GloVe
    • Can load into neural networks to save training time
    • Sometimes used to add basic word relationships to a model when smaller amounts of training data are available
    • Improve model accuracy by building upon previously learned word relationships, rather than trying to learn those relationships with insufficient data
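  • A hedged sketch (not part of the book's example) of how one might load pretrained GloVe vectors into a Keras Embedding layer; the file 'glove.6B.100d.txt' refers to the vectors downloadable from https://nlp.stanford.edu/projects/glove/, and the index offset mirrors the IMDb encoding described earlier
# hedged sketch (not part of the book's example)
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

embedding_dim = 100  # glove.6B.100d.txt provides 100-dimensional vectors
embedding_matrix = np.zeros((number_of_words, embedding_dim))

with open('glove.6B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        word, *vector = line.split()
        index = word_to_index.get(word, 0) + 3  # same offset of 3 as the IMDb encoding
        if 3 < index < number_of_words:
            embedding_matrix[index] = np.asarray(vector, dtype='float32')

# initialize the layer with the pretrained vectors and freeze them
pretrained_embedding = Embedding(
    input_dim=number_of_words, output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=words_per_review, trainable=False)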

Adding an Embedding Layer (cont.)

In [31]:
rnn.add(Embedding(input_dim=number_of_words, output_dim=128,
                  input_length=words_per_review))
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
  • input_dim=number_of_words—Number of unique words
  • output_dim=128—Size of each word embedding
  • input_length=words_per_review—Number of words in each input sample

Adding an LSTM Layer

In [32]:
rnn.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/keras/backend.py:4010: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
  • units: the number of neurons in the layer
    • More neurons means network can remember more
    • Guideline: Value between length of the sequences (200 in this example) and number of classes to predict (2 in this example)
  • dropout: the percentage of neurons to randomly disable when processing the layer’s input and output
    • Like pooling layers in a convnet, dropout is a proven technique that reduces overfitting
    • Keras also provides a Dropout layer that you can add to your models (see the sketch after this list)
  • recurrent_dropout: the percentage of neurons to randomly disable when the layer’s output is fed back into the layer again, allowing the network to learn from what it has seen previously
    • Mechanics of how the LSTM layer performs its task are beyond scope.
      • Chollet says: “you don’t need to understand anything about the specific architecture of an LSTM cell; as a human, it shouldn’t be your job to understand it. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time.”
      • Chollet, François. Deep Learning with Python. p. 204. Shelter Island, NY: Manning Publications, 2018.
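  • As mentioned above, Keras provides a standalone Dropout layer; here is a minimal sketch (not part of this chapter's model) of using it between Dense layers
from tensorflow.keras.layers import Dropout

# hypothetical model fragment (not part of this chapter's model):
# randomly disable 20% of the previous layer's outputs during training
model_fragment = Sequential()
model_fragment.add(Dense(units=64, activation='relu', input_shape=(100,)))
model_fragment.add(Dropout(rate=0.2))
model_fragment.add(Dense(units=1, activation='sigmoid'))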

Adding a Dense Output Layer

  • Reduce the LSTM layer’s output to one result indicating whether a review is positive or negative, thus the value 1 for the units argument
  • 'sigmoid' activation function is preferred for binary classification
    • Chollet, François. Deep Learning with Python. p.114. Shelter Island, NY: Manning Publications, 2018.
    • Reduces arbitrary values into the range 0.0–1.0, producing a probability
In [33]:
rnn.add(Dense(units=1, activation='sigmoid'))
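  • A small illustration (not from the book) of the sigmoid calculation, which squashes any real value into the range 0.0–1.0
# illustration only (not from the book)
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e**-x)"""
    return 1 / (1 + np.exp(-x))

sigmoid(np.array([-5.0, 0.0, 5.0]))  # approximately [0.0067, 0.5, 0.9933]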

Compiling the Model and Displaying the Summary

  • Two possible outputs, so we use the binary_crossentropy loss function:
In [34]:
rnn.compile(optimizer='adam',
            loss='binary_crossentropy', 
            metrics=['accuracy'])
  • Fewer layers than our convnet, but nearly three times as many parameters (the network’s weights)
    • More parameters means more training time
    • The large number of parameters primarily comes from the number of words in the vocabulary (we loaded 10,000) times the number of neurons in the Embedding layer’s output (128)
In [35]:
rnn.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 128)          1280000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
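  • As a check (not in the book), the parameter counts in the summary can be reproduced with the standard Keras layer formulas
# quick check (not in the book) of the summary's parameter counts
embedding_params = 10_000 * 128                   # vocabulary size * embedding dimensions
lstm_params = 4 * (128 * 128 + 128 * 128 + 128)   # 4 gates * (input weights + recurrent weights + biases)
dense_params = 128 * 1 + 1                        # one weight per LSTM output plus one bias
print(embedding_params, lstm_params, dense_params)       # 1280000 131584 129
print(embedding_params + lstm_params + dense_params)     # 1411713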

16.9.5 Training and Evaluating the Model

  • Each training epoch takes significantly longer for the RNN model than for our convnet
    • Due to the larger number of parameters (weights) our RNN model needs to learn
  • Note that the call below passes the 20,000 test samples, not (X_val, y_val), as validation_data; to validate on the 5,000-sample validation set created earlier, pass validation_data=(X_val, y_val) instead
In [36]:
rnn.fit(X_train, y_train, epochs=10, batch_size=32, 
        validation_data=(X_test, y_test))
Train on 25000 samples, validate on 20000 samples
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
25000/25000 [==============================] - 322s 13ms/sample - loss: 0.4724 - acc: 0.7806 - val_loss: 0.3799 - val_acc: 0.8342
Epoch 2/10
25000/25000 [==============================] - 298s 12ms/sample - loss: 0.3411 - acc: 0.8574 - val_loss: 0.3446 - val_acc: 0.8567
Epoch 3/10
25000/25000 [==============================] - 282s 11ms/sample - loss: 0.2696 - acc: 0.8920 - val_loss: 0.3375 - val_acc: 0.8571
Epoch 4/10
25000/25000 [==============================] - 274s 11ms/sample - loss: 0.2149 - acc: 0.9159 - val_loss: 0.4126 - val_acc: 0.8597
Epoch 5/10
25000/25000 [==============================] - 254s 10ms/sample - loss: 0.1677 - acc: 0.9374 - val_loss: 0.4108 - val_acc: 0.8607
Epoch 6/10
25000/25000 [==============================] - 249s 10ms/sample - loss: 0.1284 - acc: 0.9529 - val_loss: 0.4379 - val_acc: 0.8576
Epoch 7/10
25000/25000 [==============================] - 254s 10ms/sample - loss: 0.1033 - acc: 0.9636 - val_loss: 0.4487 - val_acc: 0.8599
Epoch 8/10
25000/25000 [==============================] - 246s 10ms/sample - loss: 0.0845 - acc: 0.9715 - val_loss: 0.4942 - val_acc: 0.8597
Epoch 9/10
25000/25000 [==============================] - 250s 10ms/sample - loss: 0.0716 - acc: 0.9756 - val_loss: 0.4808 - val_acc: 0.8472
Epoch 10/10
25000/25000 [==============================] - 270s 11ms/sample - loss: 0.0553 - acc: 0.9809 - val_loss: 0.5299 - val_acc: 0.8511
Out[36]:
<tensorflow.python.keras.callbacks.History object at 0x140abeac8>
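  • If you capture fit's return value (e.g., history = rnn.fit(...)), you can graph the training and validation metrics; a minimal sketch (not in the book), assuming the metric key 'acc' shown in the output above (newer TensorFlow versions name it 'accuracy')
# sketch (not in the book); assumes you captured history = rnn.fit(...)
import matplotlib.pyplot as plt

def plot_history(history, metric='acc'):
    """Plot training vs. validation values of a metric from a Keras History object."""
    plt.plot(history.history[metric], label='training')
    plt.plot(history.history['val_' + metric], label='validation')
    plt.xlabel('epoch')
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# hypothetical usage: plot_history(history, metric='acc')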

16.9.5 Training and Evaluating the Model (cont.)

  • Function evaluate returns the loss and accuracy values
In [37]:
results = rnn.evaluate(X_test, y_test)
20000/20000 [==============================] - 34s 2ms/sample - loss: 0.5299 - acc: 0.8511
In [38]:
results
Out[38]:
[0.5299076704353094, 0.85115]
  • Accuracy seems low compared to our convnet, but this is a much more difficult problem
    • Many IMDb sentiment-analysis binary-classification studies show results in the high 80s
  • We did reasonably well with our small recurrent neural network of only three layers
    • We have not tried to tune our model
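  • To classify new text with the trained model, you could encode it the same way the dataset is encoded and pad it to 200 words; a hedged sketch (not from the book) that reuses the hypothetical encode_review helper sketched earlier
# not part of the book's example; encode_review was sketched in the data-exploration discussion
new_review = 'not good at all one of the worst movies i have seen'
encoded = pad_sequences([encode_review(new_review, word_to_index)],
                        maxlen=words_per_review)
probability = rnn.predict(encoded)[0][0]  # sigmoid output near 1.0 means positive
print('positive' if probability > 0.5 else 'negative')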

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 16 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.