In part 1 and part 2 of this series of posts on Text Classification in Keras, we got a step-by-step intro to:
- processing text in Keras.
- embedding vectors as a way of representing words.
- defining a sequential model from scratch.
Since we are working with a real dataset from the Toxic Comment Classification Challenge on Kaggle, we can always see how our models would score on the leaderboard if we competed with the final submissions. For instance, the fully connected model we used in part 2 gave good results, but it would put us at the bottom of the leaderboard. One of the main problems of such a model with text data is that it just looks at a sequence of single words without capturing the meaning of longer chunks (n-grams).
In this post we will go over some more advanced architectures, including Convolutional and Recurrent Neural Networks (CNNs and RNNs), which perform better with text because they capture more information about the structure of the language. Thanks to the design of Keras layers, we will just add or remove a few lines to achieve this. The following sections will discuss three models:
- CNN
- CNN+LSTM
- GRU+CNN
Data Processing
In the previous posts, the data processing part was covered in detail. Here we will follow the same steps of tokenization, creating sequences and padding. You can refer back to the previous parts for more details.
## load libraries ---------------------------
library(keras)
library(tidyverse)
library(here)
## read train and test data
train_data <- read_csv(here("static/data/", "toxic_comments/train.csv"))
test_data <- read_csv(here("static/data/", "toxic_comments/test.csv"))
vocab_size <- 25000
max_len <- 250
## use keras tokenizer
tokenizer <- text_tokenizer(num_words = vocab_size) %>%
fit_text_tokenizer(train_data$comment_text)
## create sequences
train_seq <- texts_to_sequences(tokenizer, train_data$comment_text)
test_seq <- texts_to_sequences(tokenizer, test_data$comment_text)
## pad sequence
x_train <- pad_sequences(train_seq, maxlen = max_len, padding = "post")
x_test <- pad_sequences(test_seq, maxlen = max_len, padding = "post")
## extract targets columns and convert to matrix
y_train <- train_data %>%
select(toxic:identity_hate) %>%
as.matrix()
Now we have the training data and labels x_train and y_train. Note that it is good to keep a portion of the data for validation, so as not to touch the real test set before the final submission. But here we will just use the whole training data in our examples.
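If you do want a hold-out set, a minimal sketch could look like this (the 10% split, the seed and the object names are illustrative choices, not part of the original workflow):
## optional: hold out 10% of the training data for validation
set.seed(42)
val_idx <- sample(nrow(x_train), size = floor(0.1 * nrow(x_train)))
x_val <- x_train[val_idx, ]
y_val <- y_train[val_idx, ]
x_train_sub <- x_train[-val_idx, ]
y_train_sub <- y_train[-val_idx, ]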
Keras Models
CNN
This model will include the following:
- layer_embedding() to represent each word with a vector of length emd_size. This always comes right after the inputs.
- layer_dropout(), which serves as a method of regularization, as it drops some inputs.
- a convolutional layer, layer_conv_1d(), since text data is represented in 1D (unlike images, where each channel comes in 2D). In this layer you can see new hyperparameters like:
  - filters: the length of the output from this layer.
  - kernel_size: the size of the sliding window used over the sequence. So if it is 3, we can consider that the window looks at consecutive trigrams.
- layer_global_max_pooling_1d() to get the maximum of each filter's output. The intuition behind this is that, with a good model, each filter can capture something from the n-grams, and the maximum value is representative of when a neuron is activated by a certain input.
- layer_dense(), which is a fully connected layer. You can try adding multiple ones. Then add the output layer with 6 units and sigmoid activation.
## define hyper parameters
emd_size <- 50
filters <- 250
kernel_size <- 3
hidden_dims <- 256
## define model
model_cnn <- keras_model_sequential() %>%
# embedding layer
layer_embedding(input_dim = vocab_size,
output_dim = emd_size,
input_length = max_len) %>%
layer_dropout(0.2) %>%
# add a Convolution1D
layer_conv_1d(
filters, kernel_size,
padding = "valid", activation = "relu", strides = 1) %>%
# apply max pooling
layer_global_max_pooling_1d() %>%
# add fully connected layer:
layer_dense(hidden_dims) %>%
# apply 20% layer dropout
layer_dropout(0.2) %>%
layer_activation("relu") %>%
# output layer then sigmoid
layer_dense(6) %>%
layer_activation("sigmoid")
## display model summary
summary(model_cnn)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## embedding_1 (Embedding) (None, 250, 50) 1250000
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 250, 50) 0
## ___________________________________________________________________________
## conv1d_1 (Conv1D) (None, 248, 250) 37750
## ___________________________________________________________________________
## global_max_pooling1d_1 (GlobalMa (None, 250) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 256) 64256
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## activation_1 (Activation) (None, 256) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 6) 1542
## ___________________________________________________________________________
## activation_2 (Activation) (None, 6) 0
## ===========================================================================
## Total params: 1,353,548
## Trainable params: 1,353,548
## Non-trainable params: 0
## ___________________________________________________________________________
## specify model optimizer, loss and metrics
model_cnn %>% compile(
loss = "binary_crossentropy",
optimizer = "adam",
metrics = "accuracy")
## fit model
history <- model_cnn %>%
fit(x_train, y_train,
epochs = 4,
batch_size = 64,
validation_split = 0.05,
verbose = 1)
## predict on test data
predicted_prob <- predict_proba(model_cnn, x_test)
If you fit this model for a few epochs and predict the test labels, it can get you around 0.973 in the private score. This is an improvement compared to the pure fully connected model.
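To actually submit, the predicted probabilities need to be joined with the test ids and written to a CSV. A minimal sketch, assuming test_data carries an id column and that the label order matches the columns we selected for y_train (file name is arbitrary):
## assemble a submission file (label order assumed to match y_train)
colnames(predicted_prob) <- c("toxic", "severe_toxic", "obscene",
                              "threat", "insult", "identity_hate")
submission <- bind_cols(test_data %>% select(id), as_tibble(predicted_prob))
write_csv(submission, "submission_cnn.csv")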
CNN+LSTM
One of the other possible architectures combines convolutional layers with Long Short-Term Memory (LSTM) layers, a special type of Recurrent Neural Network. The promise of LSTM is that it handles long sequences in a way that lets the network learn what to keep and what to forget.
We can modify the previous model by adding a layer_lstm() after the layer_conv_1d() and the pooling layer.
Notice that this model is computationally heavier than the previous one. It is preferable to use a GPU for training if one is available. For this, you can use layer_cudnn_lstm(), which runs only on GPU with the TensorFlow backend.
## define hyper parameters
emd_size <- 50
filters <- 250
kernel_size <- 3
hidden_dims <- 256
pool_size <- 4
lstm_op_size <- 64
model_cnn_lstm <- keras_model_sequential() %>%
# embedding layer
layer_embedding(input_dim = vocab_size,
output_dim = emd_size,
input_length = max_len) %>%
layer_dropout(0.2) %>%
# add a Convolution1D
layer_conv_1d(
filters, kernel_size,
padding = "valid", activation = "relu", strides = 1) %>%
## add pooling layer
layer_max_pooling_1d(pool_size = pool_size) %>%
## add lstm layer
layer_lstm(units = lstm_op_size) %>%
# add fully connected layer:
layer_dense(hidden_dims) %>%
# apply 20% layer dropout
layer_dropout(0.2) %>%
layer_activation("relu") %>%
# output layer then sigmoid
layer_dense(6) %>%
layer_activation("sigmoid")
## display model summary
summary(model_cnn_lstm)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## embedding_2 (Embedding) (None, 250, 50) 1250000
## ___________________________________________________________________________
## dropout_3 (Dropout) (None, 250, 50) 0
## ___________________________________________________________________________
## conv1d_2 (Conv1D) (None, 248, 250) 37750
## ___________________________________________________________________________
## max_pooling1d_1 (MaxPooling1D) (None, 62, 250) 0
## ___________________________________________________________________________
## lstm_1 (LSTM) (None, 64) 80640
## ___________________________________________________________________________
## dense_3 (Dense) (None, 256) 16640
## ___________________________________________________________________________
## dropout_4 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## activation_3 (Activation) (None, 256) 0
## ___________________________________________________________________________
## dense_4 (Dense) (None, 6) 1542
## ___________________________________________________________________________
## activation_4 (Activation) (None, 6) 0
## ===========================================================================
## Total params: 1,386,572
## Trainable params: 1,386,572
## Non-trainable params: 0
## ___________________________________________________________________________
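Compiling and fitting this model follows the same pattern as the CNN model; a brief sketch with the same (illustrative) settings used earlier:
## compile and fit the CNN+LSTM model with the same settings as before
model_cnn_lstm %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy")
history_cnn_lstm <- model_cnn_lstm %>%
  fit(x_train, y_train,
      epochs = 4,
      batch_size = 64,
      validation_split = 0.05,
      verbose = 1)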
GRU+CNN
A simpler version of LSTM is the Gated Recurrent Unit (GRU). It is supposed to be more computationally efficient, so we can also try it with a CNN. But this time we will put the GRU layer before the CNN layer.
Notice that we added bidirectional(layer_gru(units = gru_units)), which does the following:
- adds a layer_gru() with the defined units as the output.
- sets the return_sequences argument to TRUE to return the full output sequence.
- wraps it in bidirectional(), which helps the network learn about words from previous as well as future time steps. Other RNN layers like layer_lstm() can also be passed to bidirectional(), as in the short sketch after this list.
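For illustration only (this is not one of the models trained in this post, and the 64 units are an arbitrary choice), wrapping an LSTM works the same way:
## hypothetical example: bidirectional() around an LSTM layer
model_bilstm <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size,
                  output_dim = emd_size,
                  input_length = max_len) %>%
  bidirectional(layer_lstm(units = 64, return_sequences = TRUE)) %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(6, activation = "sigmoid")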
Similar to layer_cudnn_lstm(), there is a layer_cudnn_gru() layer that works only on GPU.
## define hyper parameters
emd_size <- 50
filters <- 250
kernel_size <- 3
hidden_dims <- 256
gru_units <- 128
## define model
model_gru_cnn <- keras_model_sequential() %>%
# embedding layer
layer_embedding(input_dim = vocab_size,
output_dim = emd_size,
input_length = max_len) %>%
layer_dropout(0.2) %>%
## add bidirectional gru
bidirectional(layer_gru(units = gru_units, return_sequences = TRUE)) %>%
## add a Convolution1D
layer_conv_1d(
filters, kernel_size,
padding = "valid", activation = "relu", strides = 1) %>%
## add pooling layer
layer_global_max_pooling_1d() %>%
# add fully connected layer:
layer_dense(hidden_dims) %>%
# apply 20% layer dropout
layer_dropout(0.2) %>%
layer_activation("relu") %>%
# output layer then sigmoid
layer_dense(6) %>%
layer_activation("sigmoid")
## display model summary
summary(model_gru_cnn)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## embedding_3 (Embedding) (None, 250, 50) 1250000
## ___________________________________________________________________________
## dropout_5 (Dropout) (None, 250, 50) 0
## ___________________________________________________________________________
## bidirectional_1 (Bidirectional) (None, 250, 256) 137472
## ___________________________________________________________________________
## conv1d_3 (Conv1D) (None, 248, 250) 192250
## ___________________________________________________________________________
## global_max_pooling1d_2 (GlobalMa (None, 250) 0
## ___________________________________________________________________________
## dense_5 (Dense) (None, 256) 64256
## ___________________________________________________________________________
## dropout_6 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## activation_5 (Activation) (None, 256) 0
## ___________________________________________________________________________
## dense_6 (Dense) (None, 6) 1542
## ___________________________________________________________________________
## activation_6 (Activation) (None, 6) 0
## ===========================================================================
## Total params: 1,645,520
## Trainable params: 1,645,520
## Non-trainable params: 0
## ___________________________________________________________________________
With some tuning, this combination can boost your score compared to the model with just a CNN.
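One simple way to tune the number of epochs, for example, is to train with early stopping on a validation split; a hedged sketch (the epoch and patience values here are illustrative, not tuned):
## compile and fit with early stopping on the validation loss
model_gru_cnn %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy")
history_gru_cnn <- model_gru_cnn %>%
  fit(x_train, y_train,
      epochs = 10,
      batch_size = 64,
      validation_split = 0.05,
      callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 1)),
      verbose = 1)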
In conclusion
The nature of text, which carries semantic relations and meaning learned over long sequences, requires more advanced models than simple fully connected layers. CNNs and RNNs can capture these features and provide good results. But there is more we can do with data processing, pre-trained embeddings and Keras layers to get even better results. We will see this in future posts!