Currently the accuracy is slightly above 0.6.
The model used for this is based on convolutional layers:
from keras.models import Sequential
from keras.layers import Lambda, Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense
from keras import backend as K

model = Sequential()
# One-hot encode the integer word indices on the fly
model.add(Lambda(K.one_hot, arguments={'num_classes': model_max},
                 output_shape=output_shape, input_shape=input_shape, dtype='int32'))
model.add(Conv1D(256, 6, activation='relu'))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(labels_max, activation='softmax'))
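As an aside, the Lambda layer simply applies K.one_hot to the integer-encoded input, so the shape variables have to match the reduced vocabulary. The snippet below only illustrates how they relate; the concrete numbers are placeholders, since in the real pipeline model_max comes from the vocabulary size and totShape from the padding length:

# Illustration only: placeholder values showing how the shapes line up.
totShape = 500                        # assumed padded length of each document
model_max = 20000                     # assumed size of the reduced vocabulary
input_shape = (totShape,)             # one integer word index per position
output_shape = (totShape, model_max)  # K.one_hot expands each index to a one-hot vector of length model_max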
Looking at our data, it seemed to me that the 'from' address of emails could play a large role. However, in our current preprocessing, each word used fewer than 20 times is ignored. I decided to ensure that the 'from' address is always part of the reduced vocabulary:
vocab_short = [key for key, _ in z.most_common()]
...
# Make sure the 'from' address of every document is kept in the vocabulary
for i, x in enumerate(docs):
    vocab_short.append(x[0])
vocab_short = set(vocab_short)
...
model_max = len(vocab_short)  # Maximal number of words in the reduced wordlist
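Put together, the vocabulary step looks roughly like the sketch below. The elided parts are my own reconstruction, assuming z is a collections.Counter over all tokens and that the 'from' address is the first token of each document (as the loop above suggests); the cutoff of 20 occurrences matches the preprocessing described earlier:

from collections import Counter

# Assumed setup: docs is a list of tokenised emails, with the 'from' address as
# the first token of each document.
z = Counter(token for doc in docs for token in doc)

# Keep only words used at least 20 times (the existing cutoff)...
vocab_short = [key for key, count in z.most_common() if count >= 20]

# ...but always keep the 'from' address, even when it is rare.
for x in docs:
    vocab_short.append(x[0])

vocab_short = set(vocab_short)
model_max = len(vocab_short)  # maximal number of words in the reduced wordlist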
However, this change didn't increase the accuracy much, if at all.
The second attempt was adding an Embedding layer instead of the Lambda layer as input. This layer should add learning capability by mapping each word to a multi-dimensional vector, so that the distance between two word vectors becomes an indication of their relative 'meaning'.
However, there was not much improvement using just that:
model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))  # learn a 256-dimensional vector per word
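For completeness, this is roughly what the convolutional model looks like with the Embedding layer swapped in for the one-hot Lambda layer; the layer sizes are copied from the model above, so treat it as a sketch rather than the exact configuration that was run:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense

model = Sequential()
# Learned 256-dimensional word vectors instead of fixed one-hot vectors
model.add(Embedding(model_max, 256, input_length=totShape))
model.add(Conv1D(256, 6, activation='relu'))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(labels_max, activation='softmax'))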
Reading further in the textbook, it seems that an LSTM might give better results. This learning model has a mechanism that takes the historical context of a word into account, so in a sense the previous words used. This indeed appeared to be helpful, and increased the accuracy to close to 0.73!
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))
model.add(LSTM(256))  # summarises the whole word sequence into a single vector
model.add(Dense(labels_max, activation='softmax'))
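For reference, training such a model in Keras would typically look like the snippet below. The loss and training parameters are assumptions on my part (they depend on whether the labels are integer class indices or one-hot vectors), not something taken from the runs above:

# Assumed: X_train holds padded sequences of word indices of length totShape,
# and y_train holds integer class labels in the range [0, labels_max).
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=32,
          epochs=10,
          validation_split=0.1)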