Friday, 15 March 2019

Some experiments with the Keras model

Today, while learning about machine learning with Keras using the splendid book 'Deep Learning with Python' by François Chollet, I tried some model improvements: after all, that is what deep learning is all about.

Currently the accuracy is slightly above 0.6.

The model used for this is based on convolutional layers:

from keras.models import Sequential
from keras.layers import (Lambda, Conv1D, MaxPooling1D,
                          GlobalAveragePooling1D, Dropout, Dense)
from keras import backend as K

model = Sequential()
# Turn the integer word indices into one-hot vectors on the fly.
model.add(Lambda(K.one_hot,
                 arguments={'num_classes': model_max},
                 output_shape=output_shape,
                 input_shape=input_shape,
                 dtype='int32'))
model.add(Conv1D(256, 6, activation='relu'))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(labels_max, activation='softmax'))
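
For reference, a minimal compile-and-fit sketch for this model; the optimizer, loss, batch size and the names x_train, y_train, x_val and y_val are placeholders rather than the exact setup used here:

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',  # assumes integer class labels
              metrics=['accuracy'])

history = model.fit(x_train, y_train,   # x_train: padded sequences of word indices
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))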

Looking at our data, it seemed to me that the 'from' address of the emails could play a large role. However, in our current preprocessing, each word used fewer than 20 times is ignored. I decided to try to ensure that the 'from' address is always part of the reduced vocabulary:

    vocab_short = [key for key, _ in z.most_common()]
    ...
    for i, x in enumerate(docs):
        vocab_short.append(x[0])

    vocab_short = set(vocab_short)
    ...
    model_max = len(vocab_short)  # maximal number of words in the reduced wordlist
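
To make this explicit, a fuller sketch of the vocabulary construction could look like the code below. The threshold of 20 matches the preprocessing described above, but the way z is built and the exact shape of docs (with the 'from' address as the first token of each mail) are assumptions:

from collections import Counter

# docs is assumed to be a list of tokenized mails whose first token (x[0]) is the 'from' address.
z = Counter(token for doc in docs for token in doc)

# Keep only words that occur at least 20 times...
vocab_short = [key for key, count in z.most_common() if count >= 20]

# ...but force every 'from' address into the vocabulary, even the rare ones.
for x in docs:
    vocab_short.append(x[0])

vocab_short = set(vocab_short)
model_max = len(vocab_short)  # size of the reduced wordlist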


However, this change didn't increase the accuracy much, if at all.
The second attempt was to use an Embedding layer instead of the Lambda layer as input. This layer should add learning capability by mapping each word to a multidimensional vector; the distance between two word vectors should then become an indication of their relative 'meaning'.
However, using just that did not bring much improvement:

model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))
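
The rest of the model can stay the same as in the convolutional version above. For completeness, here is a sketch of that variant with the same convolutional stack on top of the Embedding layer (a reconstruction, so the details may differ slightly from the code actually run):

from keras.models import Sequential
from keras.layers import (Embedding, Conv1D, MaxPooling1D,
                          GlobalAveragePooling1D, Dropout, Dense)

model = Sequential()
# The Embedding layer replaces the one-hot Lambda layer as a trainable input encoding.
model.add(Embedding(model_max, 256, input_length=totShape))
model.add(Conv1D(256, 6, activation='relu'))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(labels_max, activation='softmax'))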


Reading further in the textbook, it seems that an LSTM might give better results. This type of layer has a mechanism that takes the historical context of a word into account, so in a sense the previous words used. This turned out to be helpful, and increased the accuracy to close to 0.73!

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))
model.add(LSTM(256))  # the LSTM takes the order of the words into account
model.add(Dense(labels_max, activation='softmax'))
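
To check the result, the model can be compiled and evaluated in the usual way; the optimizer, loss and the names x_test and y_test below are placeholders, not the exact code used here:

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # shows the parameter count of the Embedding + LSTM stack

test_loss, test_acc = model.evaluate(x_test, y_test)
print('test accuracy:', test_acc)  # around 0.73 in this experiment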


