Introduction
We use the content of emails and other documents to let AI suggest filing locations in our DMS. For this, the documents already present are crawled, and each word is matched against an ever-extending, client-specific word list, so that the text is replaced by lists of word indices.
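As an illustration of that indexing step (a minimal sketch, not our actual crawler code), the idea could look like the snippet below; all names and the example sentence are made up.

```python
# Minimal sketch of the indexing idea: each word of a document is looked up in a
# growing, client-specific word list and the text is replaced by the indices.
import re

word_to_index = {}  # client-specific word list, extended as new words appear

def text_to_indices(text):
    indices = []
    for word in re.findall(r"\w+", text.lower()):
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)  # extend the word list
        indices.append(word_to_index[word])
    return indices

print(text_to_indices("Invoice 4711 for ACME B.V., please file under Finance"))
```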
Using this method we arrive at macro F1 scores of 80% or higher, depending on several configuration choices. The word list also contains very client-specific words, such as email addresses, names and even numbers.
By checking which words are also present in publicly available word lists, we were able to run the AI training either with the complete word list, or with only the words that also occur in these common word lists.
Method
Because most of the text in our DMS is either in English or in Dutch, we used the two word lists below to mark which of the words we already had could be considered common. We could then train our model either with all words or with only the 'common' words; a short sketch of this marking step follows the links.
Dutch: https://github.com/OpenTaal/opentaal-wordlist
English: https://github.com/dwyl/english-words/
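As a rough sketch of the marking step, assuming both lists have been downloaded as plain-text files with one word per line (the file paths and the tenant words below are assumptions, not our real data):

```python
# Sketch: mark each tenant word as 'Common' if it occurs in either public list.
def load_wordlist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

common_words = (load_wordlist("opentaal-wordlist/wordlist.txt")    # Dutch (OpenTaal)
                | load_wordlist("english-words/words_alpha.txt"))  # English (dwyl)

tenant_words = ["factuur", "invoice", "j.doe@example.com", "4711"]  # illustrative
origin = {w: ("Common" if w.lower() in common_words else "Tenant") for w in tenant_words}
print(origin)
```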
Preliminary analytics
Our word list contained about 800,000 different words from the crawling process. After marking them against the common word lists, it appeared that only 13% could be marked as common. However, for training our AI ignores words that are not used often (20 occurrences is currently the limit). When looking only at these remaining relevant words, 60% appeared to be common. The tables below summarise the counts, and a short sketch of the filter follows them.
All words:

Used | Origin | Count
Yes  | Tenant | 694547
Yes  | Common | 102701
No   | Common | 724001
Sum  |        | 1521249

Fraction of used words that appear in the common word lists: 0.128819
Often used words (> 20 occurrences):

Origin | Language | Words
Common | English  | 4435
Tenant | Unknown  | 6330
Common | Dutch    | 5329
Common | Unknown  | 31
Sum    |          | 16125

Fraction of often used words that appear in the common word lists: 0.605519
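The usage threshold and the common fraction boil down to a simple filter over the word counts. In the sketch below, `usage_count` and `origin` are tiny illustrative stand-ins for the real crawl statistics.

```python
# Sketch of the 'often used' filter (> 20 uses) and the fraction of common words.
usage_count = {"invoice": 512, "factuur": 347, "j.doe@example.com": 3, "4711": 25}
origin = {"invoice": "Common", "factuur": "Common",
          "j.doe@example.com": "Tenant", "4711": "Tenant"}

MIN_USES = 20  # words used 20 times or fewer are ignored for training
often_used = [w for w, n in usage_count.items() if n > MIN_USES]
fraction_common = sum(origin[w] == "Common" for w in often_used) / len(often_used)
print(f"{len(often_used)} often-used words, {fraction_common:.0%} common")
```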
Results
When training with only the words from the common word lists, we arrived at an F1 value of 75% instead of 80%. As expected, this score is lower than when the proprietary words are included in training, but it is still quite good. For document type classification we have far too few documents labeled. Without knowing whether the differences are significant at these levels, the F1 nevertheless appears to be slightly better when only common words are used: 24% instead of 21%.
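For reference, a macro-averaged F1 of this kind can be computed as sketched below, here using scikit-learn as an example; the label arrays are placeholders and this is not necessarily the tooling behind our numbers.

```python
# Sketch of the macro-averaged F1 comparison; only the metric call is the point.
from sklearn.metrics import f1_score

y_true           = ["Finance", "HR", "Finance", "Legal", "HR"]   # gold filing locations
y_pred_all_words = ["Finance", "HR", "Legal",   "Legal", "HR"]   # model trained on all words
y_pred_common    = ["Finance", "HR", "Finance", "HR",    "HR"]   # model trained on common words only

print("all words:   ", f1_score(y_true, y_pred_all_words, average="macro"))
print("common words:", f1_score(y_true, y_pred_common,   average="macro"))
```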