Deep Learning for Time Series and NLP
Project
Problem
In this project we develop a deep learning algorithm, based on a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) network, that is applicable to time series classification and natural language processing (NLP) tasks. The algorithm development is demonstrated on a 12-way text classification problem. A limited sample of training sentences, their corresponding labels, and test sentences is provided in the repository.
sentence: letwmktwviuhskenamuhtwlpuhraletwlrvienskuhqvmvamuhxepmuhlekrpmamuhtwamuluhezpmlexeuhtwulultwvienqjuhtwmvypkrxzuhsktwmkpmiwuhskenamuhtwlpuhenuhxepmuhtwmkpmiwuhtwamuluhxepmuhsaendfuhtwamulbhbhsaendfuhqvijsaenvileenmcuhqvtwiwleenameebhbhsaendfuhtwvipmuhtwvipmlruhsaiwtvenmvleenmkvimvuhqvenamuhvienezuhenuhxepmuhskiwlepmdfuhtwamuluhqgqvtwskkrulmvuleniwuhvitwiwiwenxeuhvimvuhletwulvimvdfuhsaiwulqvpmezuhqvmvuhulmvuhvitwamdf
label: 10
The sentences in the file have been encoded with a deterministic one-to-one character mapping. This demonstrates that the algorithm is data-agnostic: it can be applied to any other NLP or time series classification problem.
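For illustration, such a deterministic one-to-one mapping could be as simple as a fixed substitution alphabet. The actual cipher used to encode xtrain.txt is not disclosed; the alphabet below is purely hypothetical:
# Hypothetical example of a deterministic one-to-one character mapping;
# the actual encoding used for the data set is not disclosed.
plain  = 'abcdefghijklmnopqrstuvwxyz'
cipher = 'qwertyuiopasdfghjklzxcvbnm'  # made-up substitution alphabet
encode = str.maketrans(plain, cipher)
decode = str.maketrans(cipher, plain)
print('hello'.translate(encode))  # itssg
print('itssg'.translate(decode))  # hello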
Repository
The repository contains the following project files:
Deep-Learning-for-Time-Series-and-NLP # main folder
├── challenge.py # code in Python script
├── challenge.ipynb # code in iPython notebook
├── xtrain.txt # limited sample training/validation set
├── ytrain.txt # limited sample labels for training/validation set
├── xtest.txt # limited sample test set
└── cnn_lstm-180-0.87.hdf5 # sample saved Keras model (HDF5 format)
Solution
A complete, step-by-step approach to solving the problem is described next. The following steps are included:
- Reading in the files
- Data wrangling
- Partitioning the data into train/validation/test sets
- Building the Keras/TensorFlow model
- Checking validation set accuracy
- Saving the best model
- Writing test set predictions to file
We will be using the following tools:
- Python
- Keras
- Jupyter Notebook
Import the required libraries
import numpy as np
import csv
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from keras.models import load_model, Model
from keras.layers import Dense, Activation, Dropout, Input, LSTM, Lambda, Bidirectional, Conv1D, MaxPooling1D
import keras.callbacks
Read in the files
def readIn(filepath):
    '''
    Arguments:
    filepath -- path of the file to be read
    Returns:
    x -- list of strings read from the input file
    '''
    # Read in the file
    with open(filepath, 'r') as f:
        x = f.read()
    x = x.splitlines()  # Split the file contents into a list of sentences
    return x
x_traindev = readIn('./xtrain.txt')
y_traindev = readIn('./ytrain.txt')
Throughout this Python example, we use the following notation:
- m is the number of examples
- n_v is the vocabulary size (i.e., vocab_size)
- n_c is the number of classes (i.e., the number of novels to predict)
- n_a is the dimensionality of the hidden state in the LSTM cell
- Tx is the length of each string; all strings will be padded or truncated to this common length
Set the desired string length and number of classes
# Compute the maximum string length
Txmax = len(max(x_traindev, key=len))
print(Txmax)
# We set the desired string length to 448, which is 5 characters shorter than Txmax.
# Since 448 is divisible by 2^4, the four max-pooling layers later halve it cleanly
# (448 -> 224 -> 112 -> 56 -> 28), with no significant performance loss from truncation.
Tx = 448
# Set the number of classes
n_c = 12
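As a quick sanity check of this choice (a sketch; the 4 corresponds to the four pooling layers defined later):
# With four max-pooling layers of stride 2, Tx must halve cleanly four times:
# 448 -> 224 -> 112 -> 56 -> 28
length = Tx
for _ in range(4):
    assert length % 2 == 0
    length //= 2
print(length)  # 28 timesteps will be fed into the LSTM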
Each sentence in the data set is a sequence of characters, so we build a character-level sentence classification model.
First, build a character dictionary. Then, convert each training sample (i.e., sentence) from a character sequence to an integer sequence using the dictionary.
def char2ix(x):
    '''
    Arguments:
    x -- list of strings
    Returns:
    char_to_ix -- dictionary mapping character to integer
    n_v -- vocabulary size
    '''
    # Build a dictionary mapping each unique character to an index 0..n_v-1.
    # Since splitlines() has already removed the newlines, the vocabulary here
    # is just the 26 lowercase letters a-z.
    string = ''.join(x)
    chars_set = set(string)  # Set of unique characters
    n_v = len(chars_set)  # Vocabulary size
    char_to_ix = {ch: ii for ii, ch in enumerate(sorted(chars_set))}
    print(char_to_ix)
    return char_to_ix, n_v
def listOfStr_to_array(x, Tx, func):
    '''
    Arguments:
    x -- list of strings
    Tx -- maximum length of string
    func -- function to map characters to integers
    Returns:
    x_out -- numpy array. Shape = (m, Tx).
    '''
    # Pre-fill with -1, the padding index; tf.one_hot will later turn -1
    # into an all-zero vector
    x_out = np.ones((len(x), Tx), dtype=np.int64) * -1
    for ii, string in enumerate(x):  # Iterate over each string in the file
        x_array = np.array(list(map(func, string)))  # Map characters to integers
        if len(x_array) <= Tx:
            x_out[ii, :len(x_array)] = x_array  # Pad short strings
        else:
            x_out[ii, :] = x_array[:Tx]  # Truncate long strings
    return x_out
char_to_ix, n_v = char2ix(x_traindev)
x_traindev = listOfStr_to_array(x_traindev, Tx, lambda x: char_to_ix[x])
# Check the type and shape of x_traindev
print('x_traindev is of type {} and of shape {}'.format(type(x_traindev), x_traindev.shape))
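As a quick illustration of the encoding and padding (the indices shown assume the training text contains all 26 lowercase letters, so that 'a' maps to 0, 'b' to 1, and so on):
# Toy example: integer-encode a short string and pad it to length 6
demo = listOfStr_to_array(['abc'], 6, lambda c: char_to_ix[c])
print(demo)  # [[ 0  1  2 -1 -1 -1]] -- encoded characters, padded with -1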
Cast the y labels into an integer sequence
y_traindev = np.array(list(map(int, y_traindev)))
Shuffle and stratify-split the data into train and dev sets
x_train, x_dev, y_train, y_dev = train_test_split(x_traindev,
                                                  y_traindev,
                                                  test_size=0.20,
                                                  random_state=1,
                                                  shuffle=True,
                                                  stratify=y_traindev)
# Check the type and shape of x_train and x_dev
print('x_train is of type {} and of shape {}'.format(type(x_train), x_train.shape))
print('x_dev is of type {} and of shape {}'.format(type(x_dev), x_dev.shape))
Since the classes are imbalanced, we compute class weights and apply them to the cost function
# Compute class weights using y_train
unique, counts = np.unique(y_train, return_counts=True)
print('Class distribution:')
print(dict(zip(unique, counts)))
class_weight_vec = compute_class_weight(class_weight='balanced',
                                        classes=np.arange(n_c),
                                        y=y_train)
class_weight_dict = {ii: w for ii, w in enumerate(class_weight_vec)}
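Under the hood, 'balanced' weights each class c by n_samples / (n_classes * count_c), so rarer classes receive proportionally larger weights. A toy check with hypothetical counts:
# Hypothetical counts for a 2-class example
counts_demo = {0: 100, 1: 50}
n_samples = sum(counts_demo.values())  # 150
weights = {c: n_samples / (len(counts_demo) * k) for c, k in counts_demo.items()}
print(weights)  # {0: 0.75, 1: 1.5} -- the minority class is upweighted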
Convert training and dev labels to one hot matrices
# Convert an array of integer labels into a one-hot encoded numpy array
def convert_to_one_hot(x, n):
    '''
    Arguments:
    x -- array of integers.
    n -- number of unique values. scalar value.
    Returns:
    y -- one hot encoded matrix of integers. shape = (len(x), n).
    '''
    # Row i of the identity matrix is the one-hot vector for label i
    y = np.eye(n)[x.reshape(-1)]
    return y
y_train_onehot = convert_to_one_hot(y_train, n_c)
y_dev_onehot = convert_to_one_hot(y_dev, n_c)
print('y_train_onehot is of type {} and of shape {}'.format(type(y_train_onehot), y_train_onehot.shape))
print('y_dev_onehot is of type {} and of shape {}'.format(type(y_dev_onehot), y_dev_onehot.shape))
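For example:
# Labels [0, 2, 1] with n = 3 classes
print(convert_to_one_hot(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]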
Build the machine learning model as follows:
CNN (4 layers) -> LSTM (2 layers) -> Dense (1 layer) -> Softmax -> Cross-entropy loss
- The CNN is suitable for character-level sequence (or time series) classification. It extracts relevant patterns from the sequences along the feature and time dimensions.
- The LSTM captures long-range sequential dependencies in the downsampled sequence.
- A fully connected Dense layer with Softmax maps the output to a probability distribution over the 12 classes.
- Cross-entropy loss is used as the cost for this multiclass classification problem.
def one_hot(x, numOfVocab):
    '''
    Arguments:
    x -- input array of shape (m, Tx) where m = len(x)
    numOfVocab -- vocabulary size
    Returns:
    return -- output array of shape (m, Tx, numOfVocab)
    '''
    # Import inside the function so the Lambda layer stays self-contained
    # when the model is serialized and reloaded
    import tensorflow as tf
    return tf.to_float(tf.one_hot(x, numOfVocab, on_value=1, off_value=0, axis=-1))
def one_hot_outshape(x):
    '''
    Arguments:
    x -- input shape
    Returns:
    return -- one hot encoded shape of x, i.e., (m, Tx, numOfVocab)
    '''
    return x[0], x[1], 26  # 26 = n_v, the vocabulary size
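The -1 padding introduced earlier interacts neatly with tf.one_hot: any index outside [0, depth) is mapped to the all-off vector, so padded positions become all-zero rows. A quick check (using a TF1-style session, matching the tf.to_float API used above):
# Index -1 falls outside [0, 3), so it yields an all-zero row
demo = tf.one_hot([0, 2, -1], depth=3, on_value=1, off_value=0)
with tf.Session() as sess:
    print(sess.run(demo))
# [[1 0 0]
#  [0 0 1]
#  [0 0 0]]  <- the padded position contributes nothing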
print('Building the model')
kernel_length = [ 5, 3, 3, 3]
filter_num = [64, 64, 128, 128]
pool_length = [ 2, 2, 2, 2]
x_input = Input(shape=(Tx,), dtype='int64')
# The one-hot encoded vectors for each input string are built on the fly using the Lambda layer
# Desired shape of input data X = (m, Tx, n_v)
x_one_hot = Lambda(one_hot, output_shape=one_hot_outshape, arguments={'numOfVocab': 26})(x_input)
x_conv = x_one_hot
# Stack of four Conv1D -> ReLU -> MaxPooling1D blocks
for ii in range(len(filter_num)):
    x_conv = Conv1D(filters=filter_num[ii],
                    kernel_size=kernel_length[ii],
                    padding='same')(x_conv)
    x_conv = Activation('relu')(x_conv)
    x_conv = MaxPooling1D(pool_size=pool_length[ii])(x_conv)
# Two recurrent layers: a bidirectional LSTM followed by a unidirectional LSTM
x_lstm1 = Bidirectional(LSTM(128, return_sequences=True))(x_conv)
x_lstm1_dropout = Dropout(0.3)(x_lstm1)
x_lstm2 = LSTM(128, return_sequences=False)(x_lstm1_dropout)
x_lstm2_dropout = Dropout(0.3)(x_lstm2)
x_out = Dense(n_c, activation='softmax')(x_lstm2_dropout)
model = Model(inputs=x_input, outputs=x_out, name='textClassifier')
# Check model summary
model.summary()
Building the model
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 448) 0
_________________________________________________________________
lambda_1 (Lambda) (None, 448, 26) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 448, 64) 8384
_________________________________________________________________
activation_1 (Activation) (None, 448, 64) 0
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 224, 64) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 224, 64) 12352
_________________________________________________________________
activation_2 (Activation) (None, 224, 64) 0
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 112, 64) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 112, 128) 24704
_________________________________________________________________
activation_3 (Activation) (None, 112, 128) 0
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 56, 128) 0
_________________________________________________________________
conv1d_4 (Conv1D) (None, 56, 128) 49280
_________________________________________________________________
activation_4 (Activation) (None, 56, 128) 0
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 28, 128) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 28, 256) 263168
_________________________________________________________________
dropout_1 (Dropout) (None, 28, 256) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 128) 197120
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 12) 1548
=================================================================
Total params: 556,556
Trainable params: 556,556
Non-trainable params: 0
_________________________________________________________________
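The parameter counts in the summary can be verified by hand:
# Conv1D: (kernel_size * in_channels + 1 bias) * filters
print((5*26 + 1) * 64)                # 8384   -> conv1d_1
print((3*64 + 1) * 64)                # 12352  -> conv1d_2
print((3*64 + 1) * 128)               # 24704  -> conv1d_3
print((3*128 + 1) * 128)              # 49280  -> conv1d_4
# LSTM: 4 * (input_dim + units + 1) * units; doubled for the bidirectional layer
print(2 * 4 * (128 + 128 + 1) * 128)  # 263168 -> bidirectional_1
print(4 * (256 + 128 + 1) * 128)      # 197120 -> lstm_2
# Dense: (in_features + 1) * units
print((128 + 1) * 12)                 # 1548   -> dense_1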
Define an optimizer and compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Save the model at regular epoch intervals
checkpoint = keras.callbacks.ModelCheckpoint('./cnn_lstm' + '-{epoch:02d}-{val_acc:.2f}.hdf5',
                                             monitor='val_loss',
                                             verbose=1,
                                             save_best_only=False,
                                             save_weights_only=False,
                                             mode='auto',
                                             period=10)  # Save every 10 epochs
Fit the model
print('Fitting the model now')
model.fit(x_train, y_train_onehot,
          validation_data=(x_dev, y_dev_onehot),
          batch_size=256,
          epochs=130,
          shuffle=True,
          callbacks=[checkpoint],
          class_weight=class_weight_dict)
# Save model to continue later
model.save('./cnn_lstm.hdf5')
We have already trained the above model for 180 epochs on our full training set, which yielded a training set accuracy of ~99% and a validation set accuracy of ~87%.
Hence, we now load the saved model containing the trained parameters.
#Load model to continue testing
#Our saved file is named "cnn_lstm-180-0.87.hdf5"
model = load_model('./cnn_lstm-180-0.87.hdf5')
First, check the accuracy on the training and validation sets.
_, acc_train = model.evaluate(x_train, y_train_onehot)
_, acc_dev = model.evaluate(x_dev, y_dev_onehot)
print('Train set accuracy: {}'.format(acc_train))
print('Validation set accuracy: {}'.format(acc_dev))
Read in the test set
#Read in the test set
x_test = readIn('./xtest.txt')
x_test = listOfStr_to_array(x_test, Tx, lambda x: char_to_ix[x])
# Check the type and shape of x_test
print('x_test is of type {} and of shape {}'.format(type(x_test), x_test.shape))
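Note that char_to_ix was built from the training text, so an unseen character in the test set would raise a KeyError. That does not happen here, since train and test share the same alphabet, but a defensive variant (hypothetical) could fall back to the -1 padding index, which the Lambda layer turns into an all-zero vector:
# Hypothetical defensive mapping: unseen characters become padding (-1)
x_test_safe = listOfStr_to_array(readIn('./xtest.txt'), Tx,
                                 lambda c: char_to_ix.get(c, -1))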
Make predictions on the test set and write the predicted class labels to file
def write2file(filename, resultList):
    '''Write one predicted label per line to the given file.'''
    with open(filename, 'w', newline='') as outputFile:
        writer = csv.writer(outputFile)
        for result in resultList:
            writer.writerow([result])

# Make predictions
pred = model.predict(x_test)
# Write the predicted class labels to file
pred_labels = np.argmax(pred, axis=1)
write2file('ytest.txt', pred_labels)
The end