Friday, 24 April 2020

Use of common dictionaries on top of private word list

We investigated what part of our AI functionality makes use of common words that are available in public word lists, and therefore what part relies heavily on custom words.

Introduction
We use the content of emails and other documents to let AI suggest filing locations in our DMS. For this, the documents already present are crawled, and individual words are held against an ever-extending, client-specific word list, thereby replacing the text parts with lists of indices.
Using this method we arrive at macro F1 scores of 80% or higher, depending on several configuration choices. The word list also contains very client-specific words, such as email addresses, names and even numbers.
By checking which words are also present in commonly available word lists, we were able to run the AI training either with the complete word list, or with only the words that also occur in these common word lists.
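As an illustration, a minimal sketch of this encoding step (names like vocabulary and encode are hypothetical; the production crawler is more involved):

# Sketch: replace a text with a list of indices into an ever-extending word list.
vocabulary = {}  # client-specific word list: word -> index

def encode(text):
    indices = []
    for word in text.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)  # extend the word list with a new word
        indices.append(vocabulary[word])
    return indices

print(encode("please file this invoice under contracts"))  # [0, 1, 2, 3, 4, 5]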

Method
Because most of the text in our DMS is either in English or in Dutch, we used the word lists below to mark which of the words we already had could be considered common. We could then train our model either with all words, or with only the 'common' words.

Dutch: https://github.com/OpenTaal/opentaal-wordlist
English: https://github.com/dwyl/english-words/
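A minimal sketch of this marking step, assuming both lists have been downloaded as plain-text files with one word per line (the file names are placeholders):

# Sketch: mark which crawled words also occur in the public word lists.
def load_wordlist(path):
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

common = load_wordlist('opentaal-wordlist.txt') | load_wordlist('english-words.txt')

def split_by_origin(word_counts, min_count=20):
    # Keep only often-used words (the same >20 threshold mentioned under
    # Preliminary analytics below), then split them into common and tenant-specific.
    often_used = {w: c for w, c in word_counts.items() if c > min_count}
    common_words = {w for w in often_used if w in common}
    tenant_words = set(often_used) - common_words
    return common_words, tenant_words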


Preliminary analytics

Our word list contained about 800,000 different words from the crawling process. After marking them using the common word lists, it appeared that only 13% could be marked as common. However, for training our AI ignores words that are not used often enough (the current limit is 20 occurrences). When looking only at these remaining relevant words, 60% appeared to be common.

All words:

Used   Origin   Count
Yes    Tenant    694547
Yes    Common    102701
No     Common    724001
Sum             1521249

Fraction of all used words found in the common dictionaries: 0.128819
Often-used words (more than 20 occurrences):

Origin   Language   Words
Common   English     4435
Tenant   Unknown     6330
Common   Dutch       5329
Common   Unknown       31
Sum                 16125

Fraction of often-used words found in the common dictionaries: 0.605519


Results

When training with only the words from the common libraries, we arrived at an F1 value of 75% instead of 80%. As expected, this score is lower than when the proprietary words are included in the training, but the value is still quite good. For document type classification, we have far too few documents labeled. Without knowing the significance at these levels, it seems, however, that the F1 is even slightly better when only common words are used: 24% instead of 21%.
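For context, the macro F1 we report is the unweighted mean of the per-class F1 scores, e.g. with scikit-learn (toy labels, for illustration only):

# Macro F1: average the F1 score of each class, all classes weighted equally.
from sklearn.metrics import f1_score

y_true = ['contracts', 'invoices', 'invoices', 'email', 'contracts']
y_pred = ['contracts', 'invoices', 'email', 'email', 'contracts']
print(f1_score(y_true, y_pred, average='macro'))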


Thursday, 2 January 2020

Back to the configuration of the testing machine

Because our AI testing machine is used for multiple sorts of tests, its configuration is compromised now and then, resulting in training runs that don't use the GPU.
This configuration, however, is not trivial. Many online tutorials are written for older versions of the software, while newer versions of some parts only work in special environments, and not on Windows.

This configuration seems to work now on our Windows 10 system:

  • CUDA 10.0
  • tensorflow-gpu==1.14.0
  • Python 3.6.5
  • Keras 2.2.4

There are several things to be aware of:
  • Remove all other Python installations.
  • CUDA also needs cuDNN. A zip file corresponding to the CUDA installation can be downloaded; the files in its folders should then be copied to the corresponding folders of the CUDA installation.
  • The CUDA folders ..\CUDA\v10.0\libnvvp and ..\CUDA\v10.0\bin should be included in the system's Path variable.
  • Visual Studio Build Tools 2017 and Nvidia Nsight Visual Studio Edition should be uninstalled before installing CUDA, otherwise the installation will fail.
  • Don't forget to update pip to the latest version before installing from requirements.txt.
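After setting this up, a quick sanity check in Python (TensorFlow 1.x API) shows whether the GPU is actually visible:

# Verify that this TensorFlow 1.x installation can see the GPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)              # should print 1.14.0
print(tf.test.is_gpu_available())  # True when CUDA and cuDNN are set up correctly
print([d.name for d in device_lib.list_local_devices()])  # should list /device:GPU:0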

Monday, 14 October 2019

Azure again

It looks like, for using one or more GPUs on Azure, one cannot just use an Nvidia-based Docker image with normal tensorflow-gpu: special Azure Python packages and code are needed to access the GPUs.

First the azureml-sdk must be pip-installed:

pip install --upgrade azureml-sdk

According to this blog, these libraries should be imported in Python:

import azureml
from azureml.core import Experiment
from azureml.core import Workspace, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()

exp = Experiment(workspace=ws, name='my experiment name')

Running on a GPU-enabled Azure Machine Learning compute cluster:

cluster_name = "gpucluster"
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)



TensorFlow is retrieved from azureml:

from azureml.train.dnn import TensorFlow

script_params = {
    '--data-folder': dataset.as_named_input('mnist').as_mount(),
    '--batch-size': 50,
    '--first-layer-neurons': 300,
    '--second-layer-neurons': 100,
    '--learning-rate': 0.001
}

est = TensorFlow(source_directory=script_folder,
                 entry_script='keras_mnist.py',
                 script_params=script_params,
                 compute_target=compute_target,
                 pip_packages=['keras', 'matplotlib'],
                 use_gpu=True)

run = exp.submit(est)
run.wait_for_completion(show_output=True)

Friday, 27 September 2019

Docker II

Some Docker commands:

Build a new image:
docker build -t mycontainer .
Start the container:
docker run mycontainer
Start the container detached (returning the container ID):
docker run --detach mycontainer
Stop the container:
docker stop mycontainer
Restart the container (silent):
docker start mycontainer
Restart the container (with console attached):
docker attach mycontainer
Keep data changes made in a running container (commit):
docker commit CONTAINER_ID mycontainer
Execute a command on a running container:
docker exec -it CONTAINER_ID bash
Remove all containers:
docker rm $(docker ps -a -q)
Show only running containers:
docker ps
Show all containers:
docker ps -a
Run with environment variables:
docker run --env-file ./env.list mycontainer

Tuesday, 24 September 2019

Docker

Docker seems to be a sophisticated way to test server applications and to bring them into production. Docker can host various operating systems.

To run Docker on a Windows 10 system, one needs Windows Pro on hardware that supports virtualization (I guess most current hardware will support it. Here are the exact specs.)

The Docker container should eventually run on Azure, so I followed this tutorial. After installing Docker Desktop, it is necessary to download a basic template of a Docker container, probably most often retrieved from GitHub. So after installing Git, I chose to download the Django container mentioned in the tutorial, which has Python and Flask included.

The Dockerfile in this download contains the build configuration for the container. After updating the requirements.txt file for Python, I tried to build the container.
However, I got an error installing pyodbc: in the end I needed a rather complicated command to get ODBC installed correctly.


FROM python:3.6
# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
ADD . /app

# unixODBC
RUN apt-get update && \
    apt-get install -y apt-transport-https && \
    curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
    curl https://packages.microsoft.com/config/debian/9/prod.list > /etc/apt/sources.list.d/mssql-release.list && \
    apt-get update && \
    ACCEPT_EULA=Y apt-get install msodbcsql17 unixodbc-dev -y
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 8080 available to the world outside this container
EXPOSE 8080

# Define environment variable
ENV NAME World

# Run app.py when the container launches
CMD ["python", "app.py"]



The build command is run in a PowerShell prompt (don't forget the dot at the end):

PS E:\docker_prep\docker-django-webapp-linux> docker build -t mycontainer_ai .

And to run:

PS E:\docker_prep\docker-django-webapp-linux> docker run -p 8080:8080 mycontainer_ai


Tuesday, 17 September 2019

And then to Azure II

We want customers to be able to specify their own place to host their database and to store the machine learning file.
The database connection is clear: just specify the connection string to the Azure database and you're ready to go (well, use pyodbc).
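For reference, a minimal pyodbc sketch (server, database and credentials are placeholders; it assumes ODBC Driver 17 for SQL Server is installed):

# Sketch: connect to an Azure SQL database with pyodbc (placeholder values).
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver.database.windows.net;'
    'DATABASE=mydatabase;'
    'UID=myuser;PWD=mypassword'
)
cursor = conn.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchone())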

Azure offers many services to store something, and it is not immediately clear which one to use for this purpose.

After looking around, it appeared that the Workspace I created earlier also includes a blob storage. There you can find the account name and access key to use:


So after installing the correct packages, the code below worked.


pip install azure-storage-blob azure-mgmt-storage

from azure.storage.blob import BlockBlobService, PublicAccess

blob_service = BlockBlobService('[your account name]', '[your access key]')

blob_service.create_container(
    'mycontainername',
    public_access=PublicAccess.Blob
)

blob_service.create_blob_from_bytes(
    'mycontainername',
    'myblobname',
    b'hello from my python file'
)

print(blob_service.make_blob_url('mycontainername', 'myblobname'))
The example code from here shows how to download a blob to a file:
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)

and upload:

block_blob_service.create_blob_from_path(container_name, local_file_name, full_path_to_file)