Tuesday, 26 March 2019

Python for dummies

If a Python script runs in the test environment but not on the production server, chances are that the environments differ: during testing an extra 'pip install' is quickly done and easily forgotten. Often the script complains and points you to the missing package. However, this is not always the case.

I found some code to check which packages are installed, and adapted it slightly:

import sys
import pkg_resources

print("=== python version:")
print(sys.version_info)

print("=== python packages:")

# Collect every distribution visible in this environment and
# format it pip-style, as 'name==version'.
dists = list(pkg_resources.working_set)
installed_packages_list = sorted("%s==%s" % (d.key, d.version)
                                 for d in dists)

for i, dist in enumerate(installed_packages_list):
    print(str(i) + ' - ' + dist)
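
To compare two machines, it helps to dump this list to a file on each of them. A minimal addition (the file name is my own choice):

# Write the listing to a file, so the outputs of the test and
# production environments can be diffed side by side.
with open('packages.txt', 'w') as f:
    f.write('\n'.join(installed_packages_list))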

It turned out that many more packages were installed in the (MS-SQL) Python environment on the server, but hardly any package from the test environment was present there in the same version (the test environment often had a newer one), and some were not installed at all.
The first AI model had nevertheless run on both; I only ran into problems with newer AI models.

What I didn't know is that pip offers a way to 'sync' environments:

  1. Run pip freeze > requirements.txt on the remote machine
  2. Copy that requirements.txt file to your local machine
  3. Important: if no GPU is present on the target machine, change tensorflow-gpu to tensorflow
  4. In your local virtual environment, run pip install -r requirements.txt
The versions below are important to avoid errors caused by the absence of a GPU on the server (so I guess this can also be arranged in the requirements.txt file):

python -m pip install --ignore-installed tensorflow==1.4.0

python -m pip install --upgrade keras==2.1.3
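
To see exactly where two environments diverge, the two freeze files (one per machine) can be compared with a few lines of Python. A minimal sketch; the file names server.txt and local.txt are just examples:

# Compare two 'pip freeze' outputs and print every package whose
# version differs, or that is missing on one side.
def read_freeze(path):
    with open(path) as f:
        return dict(line.strip().split('==')
                    for line in f if '==' in line)

server = read_freeze('server.txt')
local = read_freeze('local.txt')

for name in sorted(set(server) | set(local)):
    if server.get(name) != local.get(name):
        print(name, 'server:', server.get(name, '-'),
              'local:', local.get(name, '-'))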

Tuesday, 19 March 2019

IFilter troubles

For AI one needs indexed documents, so each document should be converted to plain text. My company has a nice component to split .msg files into their individual parts. For the other file types I found a project that uses IFilters to parse them:

https://www.codeproject.com/articles/13391/using-ifilter-in-c

However, this alone doesn't do the job (anymore?). That is mainly due to the move to 64-bit systems. To make it use the Adobe PDF 64-bit IFilter I had to:
  •  Ensure my project compiled to 64-bit instead of 32-bit.
  •  Ensure the registry code looked in the 64-bit part of the registry, by using RegistryView:
            // Open the 64-bit view of HKLM explicitly; otherwise the
            // lookup can land in WOW6432Node and miss the 64-bit
            // IFilter registrations.
            using (var hklm = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine,
                                                      RegistryView.Registry64))
            {
                RegistryKey rk = hklm.OpenSubKey(key);
                ....
            }
  • Finally: the Adobe 11 IFilter doesn't implement all the necessary methods! Very strange. Luckily I found a link where I could download the version 9 one:
    ftp://ftp.adobe.com/pub/adobe/acrobat/win/9.x/

Friday, 15 March 2019

Some experiments with the Keras model

Today, while learning about machine learning with Keras from the splendid book 'Deep Learning with Python' by François Chollet, I tried some model improvements: after all, that is what deep learning is all about.

Currently the accuracy is slightly above 0.6.

The model used for this is based on Convolutional layers:

from keras import backend as K
from keras.models import Sequential
from keras.layers import (Lambda, Conv1D, MaxPooling1D,
                          GlobalAveragePooling1D, Dropout, Dense)

model = Sequential()
# One-hot encode the integer word indices on the fly.
model.add(Lambda(K.one_hot,
                 arguments={'num_classes': model_max},
                 output_shape=output_shape,
                 input_shape=input_shape,
                 dtype='int32'))
model.add(Conv1D(256, 6, activation='relu'))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(labels_max, activation='softmax'))
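
For completeness, this is roughly how such a model is compiled and trained; x_train, y_train and the hyperparameters below are placeholders, not the values from this experiment:

# Hypothetical training call: x_train holds the padded integer word
# sequences, y_train the integer class labels.
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=10,
          batch_size=128,
          validation_split=0.2)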

Looking at our data, it seemed to me that the 'from' address of the emails could play a large role. However, in our current preprocessing, each word used fewer than 20 times is ignored. I decided to make sure that the 'from' address is always part of the reduced vocabulary:

    vocab_short = [key for key, _ in z.most_common()]
    ...
    # Append the 'from' address (the first token of each document),
    # so it always survives the vocabulary reduction.
    for i, x in enumerate(docs):
        vocab_short.append(x[0])

    vocab_short = set(vocab_short)
    ...
    model_max = len(vocab_short)  # maximal number of words in the
                                  # reduced word list


However, this change didn't increase the accuracy much, if at all.
The second attempt was to use an Embedding layer as input instead of the Lambda layer. Such a layer should add learning capacity by mapping each word to a multidimensional vector; the distance between two word vectors should then become an indication of their relative 'meaning'.
However, not much improvement using just that:

model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))
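
Once trained, the learned vectors can be inspected, which makes the 'distance as meaning' idea concrete. A small sketch; word_a and word_b are hypothetical word indices from the vocabulary:

import numpy as np

# The first layer's weights form the model_max x 256 embedding matrix.
embeddings = model.layers[0].get_weights()[0]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# word_a and word_b are hypothetical word indices.
print(cosine_similarity(embeddings[word_a], embeddings[word_b]))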


Reading further in the textbook, it seems that an LSTM might give better results. This learning model has a mechanism that should take the historical context of a word into account, so in a sense the previous words used. This indeed turned out to be helpful, and increased the accuracy to close to 0.73!

model = Sequential()
model.add(Embedding(model_max, 256, input_length=totShape))
model.add(LSTM(256))
model.add(Dense(labels_max, activation='softmax'))
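
An accuracy figure like this would typically be measured on held-out data, roughly like this (x_test and y_test are placeholders for the hold-out set):

# Evaluate the trained model on the hold-out set.
loss, acc = model.evaluate(x_test, y_test)
print('test accuracy:', acc)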