Convolutional Neural Networks (CNN) with Text

Andrew Jamieson
6 min read · Oct 19, 2020

CNNs have been successfully used for image processing for years and have more recently become popular for text processing. This blog gives some background on Neural Networks, followed by some basic code for a Keras sequential model, along with a brief description of how to use it.

What is a Neural Network?

A Neural Network is inspired by our understanding of the workings of the human brain. When we see an object, specialized cells in the visual cortex receive that input and respond to different visual features like edges, straight lines, and angles. This information is passed to other neurons so that the brain can combine and interpret the input, giving us the ability to recognize the face of an old friend or perceive the location and direction of a ball.

Photo by Chris Moore on Unsplash

In an artificial neural network, each neuron takes a variety of inputs and performs a calculation on them. This produces an output which, if it meets a certain threshold, can be fed to other neurons. The Neural Network is a collection of these neurons which send and receive information between each other to produce an output at the end.

Each piece of input data that the neuron uses is given a weighting based on its importance. Each input is multiplied by its weight, and the results are summed to give us a weighted sum. If the weighted sum of the inputs is greater than a threshold value, the neuron will pass on its data. We can take the negative value of the threshold and move it to the same side of the equation as the weighted sum. Representing all of the weights with (w) and all of the inputs with (x), and renaming the negative threshold value as bias (b), we can create a very simple mathematical equation for the neuron’s output:

w·x + b > 0

This is important because the weights and bias of the neurons are optimized when we train the Neural Network.

Simple Neural Network — with 1x weight and bias example
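To make this concrete, here is a minimal sketch of a single neuron in Python (the input, weight, and bias values are made up purely for illustration):

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # inputs (made up)
w = np.array([0.8, 0.1, 0.4])    # weights (made up)
b = -1.0                         # bias, i.e. the negative threshold

weighted_sum = np.dot(w, x) + b  # w·x + b
fires = weighted_sum > 0         # the neuron passes its data on only if this is positive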

Training

If our model was perfect, our output prediction would be exactly what the actual object is. So, if we were trying to predict whether an image was a cat or not, our prediction for a picture of a cat would be a solid “1”. However, it’s very rare to get an exact prediction, and we might be equally happy with say 0.95. This means that our model is highly confident that our picture of Fluffy is in fact a cat. We would not be so happy with our model if Fluffy’s picture only generated a 0.5, or even worse 0.3.

Photo by Max Baskakov on Unsplash

To evaluate how our model is performing we sum up all these errors — the differences between actual and predicted. This is called a Cost Function. One of the more common cost functions is called quadratic cost or mean squared error. This function takes the differences and squares them, which ensures that all values are positive and the larger differences are penalized by having a greater impact on the total cost. In practice, the quadratic cost is not always the best cost function. We often use a function called cross-entropy cost instead, especially for classification problems.
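As a rough sketch, both costs can be computed in a few lines of Python (the labels and predictions below are invented for illustration):

import numpy as np

y_true = np.array([1, 0, 1, 1])           # actual labels (invented)
y_pred = np.array([0.95, 0.1, 0.5, 0.3])  # model predictions (invented)

mse = np.mean((y_true - y_pred) ** 2)     # quadratic cost / mean squared error
cross_entropy = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))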

Now that we have a way to measure our model, we can train it. The goal is to minimize the cost function by adjusting our weights and biases. The main way to achieve this is through Gradient Descent. Gradient Descent works by taking small steps or iterations in the direction of the steepest descent of the cost function until we reach the minimum. When the data is complicated and spread over many dimensions, it can be difficult to find the exact minimum. We can get around this problem by using a tool called Stochastic Gradient Descent. This involves taking random samples of the data, called batches, and then averaging them to get close to the true gradient. There is one final step: Back Propagation. Back Propagation takes the cost function backward layer by layer through the model, starting with the final layer. It calculates the gradient of the cost function with respect to every weight and bias and adjusts them accordingly. This is the learning of machine learning. The weights and biases iteratively become more precise and the model improves.
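As an illustrative sketch (not the author’s code), one gradient descent loop on a toy one-dimensional cost looks like this:

# Toy cost function: cost(w) = (w - 3) ** 2, minimized at w = 3
learning_rate = 0.1
w = 0.0
for step in range(50):
    gradient = 2 * (w - 3)            # derivative of the cost with respect to w
    w = w - learning_rate * gradient  # small step in the direction of steepest descent
# after the loop, w is close to 3, the minimum of the cost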

Convolutional Neural Network

A convolution is a mathematical combination of two functions to create a third function. A convolution layer is a group of filters that slide over an image, just like our eyes move across text as we read, moving from left to right, top to bottom. Each filter looks for a particular feature in the image. In the first layer, this is a low-level feature, like a straight line. It is also common to use a Pooling Layer between convolutional layers to reduce the complexity of the model. The most common of these is Max Pooling, which keeps only the maximum value in each window, preserving the important information while reducing the size.
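To illustrate a filter sliding over data and then max pooling, here is a toy example in one dimension (all numbers invented; image filters work the same way in two dimensions):

import numpy as np

sequence = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])  # input values (invented)
kernel = np.array([0.5, 1.0, 0.5])                     # one filter of size 3

# slide the filter across the sequence (a "valid" 1D convolution)
conv = np.array([np.dot(sequence[i:i + 3], kernel) for i in range(len(sequence) - 2)])

# max pooling with a window of 2: keep only the largest value in each window
pooled = np.array([conv[i:i + 2].max() for i in range(0, len(conv) - 1, 2)])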

Photo by Chris Lawton on Unsplash

How does this work with text?

For the most part, working with text is very similar. We need to convert the text into a numerical form, which is called tokenizing. Next, we turn the tokenized text into embeddings, or multi-dimensional vectors. The other major difference is that we use a 1D convolutional layer rather than a 2D one, because text is a sequence along a single dimension, rather than an image with two dimensions (height and width).

Preprocessing

There are a few steps of preprocessing, which I won’t cover in detail here. Some things to consider are: converting to lowercase, removing stop words, stemming or lemmatization, n-grams, and padding or truncating sequences to make them equal length.
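As a rough sketch, the tokenizing and padding steps with Keras might look like this (the example texts and the choice of maxlen=100 are placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["the cat sat on the mat", "dogs are great"]  # placeholder documents

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # words -> integer ids
X = pad_sequences(sequences, maxlen=100)         # pad/truncate to equal length

vocab_length = len(tokenizer.word_index) + 1     # vocabulary size + 1, used by the Embedding layer below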

Code Example

Amazingly, this is all you need to run a CNN model. Let’s look at each piece in turn.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, Conv1D, GlobalMaxPooling1D, Dense, Dropout

model = Sequential()
model.add(Embedding(vocab_length, 64, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(64, 3, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model = Sequential()

Sequential model: means that layers are applied one after another, in the order they are added.

#First layer
model.add(Embedding(vocab_length, 64, input_length=100))
model.add(SpatialDropout1D(0.2))

input_dim = vocab_length: the size of the vocabulary + 1

output_dim = 64: the dimension of the dense embedding, i.e. each word is represented as a 64-dimensional word vector.

input_length = 100: The text sequences are 100 words long

SpatialDropout1D = 0.2: Randomly drops 20% of the embedding feature maps to avoid overfitting. Unlike regular Dropout, it drops entire 1D feature maps instead of individual elements.

#Second layer
model.add(Conv1D(64, 3, activation='relu'))
model.add(GlobalMaxPooling1D())

filters = 64: The dimensionality of the output space (i.e. the number of output filters in the convolution).

kernel_size=3: specifies the length of the 1D convolution window. With a choice of 3, we are using triplets of word embeddings.

activation: This is the output from the layer. We are using a ReLU or Rectified Linear Unit activation function, which returns the input unchanged if it is positive and zero otherwise. It is computationally efficient.

#Third layer
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.5))

units=64: Positive integer, the dimensionality of the output space.

activation: This is the output from the layer. We are again using a ReLU activation function.

Dropout=0.5: Randomly removes 50% of the neurons to avoid overfitting. It’s common to increase the dropout in later layers.

#Output layer
model.add(Dense(1, activation='sigmoid'))

The output layer reduces dimensions to 1, and sigmoid activation ensures that output is between 0 and 1.
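For reference, the sigmoid function squashes any real number into the (0, 1) range; a minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # maps any real value to a score between 0 and 1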

Running the Model

model.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, batch_size=32, epochs=4, verbose=1, validation_data=(X_test, y_test))

Optimizer = ‘nadam’: Nadam is Adam with Nesterov momentum. This improves upon stochastic gradient descent.

Loss = ‘binary_crossentropy’: the loss function used for binary classification.

Metrics = ‘acc’: Setting metrics=[‘acc’] gives feedback on model accuracy in addition to the default feedback on loss.

Batch size = 32: Rather than processing all the data at once, we split it into smaller chunks during each round of training. This makes the data more manageable and can also help the optimizer avoid getting stuck in a local minimum.

Epochs = 4: the number of complete passes over the training data that we would like to train for. CNN NLP models tend to overfit to the data much faster than image models do.
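Once training finishes, you might check performance on the held-out data and generate predictions along these lines (a sketch, reusing X_test and y_test from above):

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
probabilities = model.predict(X_test)        # sigmoid outputs between 0 and 1
labels = (probabilities > 0.5).astype(int)   # threshold into class labels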

Happy Modelling


Andrew Jamieson

Andrew has an analytics background with over 20 years of experience in various industries, working with world-leading brands.