Tuesday, February 26, 2019

Python microservice

To get rid of the SQL dependency for the production web service, and to be able to keep the trained model in memory between calls, I tried a Python microservice.

A microservice works as a small and simple web server. For this, many internet resources suggest 'Flask'. Indeed, it turned out to be quite simple to use.

Of course, Flask first has to be installed in the Python environment:
pip install flask

After that, running a Python script with

from flask import Flask
app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello World!"


if __name__ == '__main__':
    app.run()
will launch a process listening on port 5000. The parameter of app.route can be more specific and have an HTTP method assigned:


@app.route("/email", methods=["POST"])
When JSON is posted, the content can be read with:

    emailFrom = request.json['From']
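
For reference, below is a minimal sketch of such a handler. The 'From' field comes from the snippet above; the other field names and the prediction step are assumptions for illustration:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/email", methods=["POST"])
def classify_email():
    # read the fields from the posted JSON body
    emailFrom = request.json['From']
    emailTo = request.json.get('To', '')
    emailCc = request.json.get('Cc', '')
    # ... run the prediction here and build the response (see below) ...
    return jsonify([])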


I used 'Postman' to test the service.




Returning the JSON took some experimentation. In the end it should be a JSON array in which all values are plain strings:


    res = []
    for i, idx in enumerate(best5idx):
        t = str(idx)              # convert the (numpy) index to a plain Python string
        res.append({'code': t})

    return jsonify(res)
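
When tested with Postman, the response body then looks something like the following (the actual index values of course depend on the trained model):

[
    {"code": "12"},
    {"code": "7"},
    {"code": "31"},
    {"code": "3"},
    {"code": "25"}
]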

A trick is necessary to reuse the trained model, that is, to keep it in memory. It has to do with TensorFlow supporting multiple threads, and you have to make sure you use the correct session/graph:

So at the root (module level) of the script you will need:


import tensorflow as tf

global graph
graph = tf.get_default_graph()   # capture the default graph at startup

model = get_model()              # load/compile the trained model once

Then in the method where you want to do the prediction:


    lmodel = model
    with graph.as_default():     # switch to the graph captured at startup
        res = lmodel.predict(docs)
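
Putting these pieces together, a minimal sketch of the whole service could look like the one below. Here get_model() and prepare_input() are placeholders for the model construction and the preprocessing of the posted fields, and the selection of the five best classes is an assumption based on the earlier snippets:

import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)

# load the model and capture the TensorFlow graph once, at startup
global graph
graph = tf.get_default_graph()
model = get_model()                        # placeholder: build the model and load the weights

@app.route("/email", methods=["POST"])
def predict():
    docs = prepare_input(request.json)     # placeholder: turn the posted fields into model input
    with graph.as_default():               # use the graph captured at startup, not the thread's default
        scores = model.predict(docs)
    best5idx = scores[0].argsort()[-5:][::-1]   # indices of the five highest-scoring classes
    return jsonify([{'code': str(idx)} for idx in best5idx])

if __name__ == '__main__':
    app.run()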




Machine learning in SQL2017 II

I succeeded in implementing the machine learning on my laptop. My previous post on SQL Server describes the required configuration. For the 2019 version (and possibly also for the 2017 version) it becomes possible to use different versions of Python, using a special tool. It seems that for each version a different PythonLauncher.dll is needed, so the process probably depends on Microsoft preparing different versions of this file.

There are many ways to access SQL from C#. I used the internally developed Data Access Layer for this. Here is the preparation of the SQL statement:


string startSql = EscapeQuote(string.Format(@"Select words = '{0}',  emailFrom = '{1}', emailTo = '{2}', emailCc = '{3}', emailFileRef = '{4}'", CheckNullOrEmpty(d.Words), CheckNullOrEmpty(d.EmailFrom), CheckNullOrEmpty(d.EmailTo), CheckNullOrEmpty(d.EmailCC), CheckNullOrEmpty(d.FileName)));



//Apply learning model:

string pythonScript = GetPythonScript();

string sql = string.Format(@"EXECUTE sp_execute_external_script
@language = N'Python'
, @script = N'{0}'
, @input_data_1 = N'{1}'
, @input_data_1_name = N'doc_data'
WITH RESULT SETS(([Matter] CHAR(100) NOT NULL));", pythonScript, startSql);

(one has to be sure to also escape the quotes in the Python script string)

It appeared that for passing data from SQL to Python and back, the 'pandas' DataFrame structure is used:


import pandas as pd

# Input params (doc_data is the DataFrame passed in via @input_data_1_name):
words = str(doc_data.iat[0,0])
emailFrom = str(doc_data.iat[0,1])
emailTo = str(doc_data.iat[0,2])
emailCc = str(doc_data.iat[0,3])
fileName = str(doc_data.iat[0,4])

# ... model loading and prediction, producing best5idx ...

OutputDataSet = pd.DataFrame(best5idx)


Simply assigning the results to OutputDataSet is enough to return them to SQL. The 'WITH RESULT SETS' clause assigns headers (and types) to the output table.
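
Because the result set declares [Matter] as CHAR(100), it can help to convert the predictions to strings before assigning them. A small sketch, continuing the script above, where idx_to_matter is an assumed lookup from class index to matter code:

# sketch: map the predicted class indices to matter codes (idx_to_matter is an assumed dict)
matters = [str(idx_to_matter.get(i, i)) for i in best5idx]
OutputDataSet = pd.DataFrame(matters)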

However, I encountered two issues:
  • When deployed on the server I got a 'memory error'. The model has to load more than 600 MB of weight values. I found that when more than 70% of the server's memory is in use, one can get this kind of error. The problem could possibly be resolved simply by ordering more memory for the virtual server (7.50 euro/month).
  • It looked like the model had to be recompiled and the weights file reloaded for each prediction, because SQL Server starts the Python script, executes it, and closes the session after returning the results. A partial workaround could have been to speed this up by declaring a virtual disk.
The developed system involved going from the REST service in C# to SQL to Python and back. Apart from the two issues it seemed overcomplicated, so I decided to try another approach: the Python microservice.

Friday, February 22, 2019

Costs for MS AI

In a production environment one will want to retrain the model when new data have been added. For the case at hand, emails and other new documents will be added while using the DMS, so that these new documents and their metadata can be used for training.

In production one would like to use a server for this. The problem is that servers usually don't have graphics cards to do the learning. Virtual Windows servers can be rented for 25-50 euro a month, and one would like to find an AI solution for an amount in the same range.

So what can one do?

  • A top-of-the-line NVIDIA DGX-1 with Tesla V100s costs 150,000 euro and offers about 1,000 TFLOPS of performance.
  • Windows Azure seems to offer a reasonable solution for about 100 euro/month. The Machine Learning Studio S1 tier includes 100,000 API transactions and 25 API compute hours.
  • LeaderGPU offers servers with GPUs starting at 365 euro per month (2x GTX 1080).

Nvidia changed the EULA: GeForce cards are no longer allowed in data centers.


Nvidia Lowers DGX Station Pricing

SQL Server 2017

From version 2017, SQL Server supports running Python scripts from within SQL. This means that you can provide some values from SQL Server, run the Python script, and then use the result again. It looks like this:

execute sp_execute_external_script
@language = N'Python',
@script = N'
import sys
print(sys.version)'



Before you can do this, the external scripts feature should be enabled:

sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;


You can install the 'Advanced Analytics' feature either via the options when installing or expanding a SQL Server instance, or standalone (don't use both options together!).
The SQL settings for Python can be found in:
C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Binn\pythonlauncher.config


This is the default content of this file:

PYTHONHOME=C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES
ENV_PYTHONIOENCODING=UTF-8
ENV_ExaMpiCommDllPath=C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\Lib\site-packages\revoscalepy\rxLibs\
MPI_HOME=C:\Program Files\Microsoft MPI
INSTANCE_NAME=MSSQLSERVER
TRACE_LEVEL=1
JOB_CLEANUP_ON_EXIT=1
USER_POOL_SIZE=0
WORKING_DIRECTORY=C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\ExtensibilityData
PKG_MGMT_MODE=0




I installed the C++ 2015 redistributable (needed for TensorFlow).

I wanted to use another Python version because TensorFlow doesn't install correctly in the bundled one (and I use 3.6 for the current AI experiments).

Unfortunately one cannot change the above PYTHONHOME folder to another Python version. One needs the revoscalepy package, and this only works with the provided 3.5.2 version of Python.


Eventually I discovered that the TensorFlow version caused the issue on the server. Installing version 1.4 works.

(use an admin command prompt)
python -m pip install --ignore-installed tensorflow==1.4.0

python -m pip install --upgrade keras==2.1.3 (newer versions give an error)

The bundled numpy is out of date:
python -m pip uninstall numpy

python -m pip install --upgrade numpy
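
To check that SQL Server's Python environment ended up with the intended versions, a quick sanity check like the one below can be run with the bundled python.exe (the expected version numbers are the ones installed above):

import sys
import numpy
import tensorflow as tf
import keras

print(sys.version)        # the bundled 3.5.2
print(tf.__version__)     # expected: 1.4.0
print(keras.__version__)  # expected: 2.1.3
print(numpy.__version__)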





There is a feature called 'binding' to upgrade an R or Python instance. This will probably mainly be available for SQL Server 2019.

Thursday, February 21, 2019

The AI Quest

Early in 2019 I was assigned the task of finding out how we could incorporate artificial intelligence (AI) into our product line.
The main part of the product consists of a document management system (DMS) that keeps office documents (Word, Excel, PDF, email) on SharePoint. A component in the various Microsoft Office products allows the user to assign metadata to the document and use SharePoint as an organization-wide file system.
The AI challenge was to be able to suggest both metadata and a storage location (which is also a kind of metadata tag) to users for new documents and emails.



Most of our clients are from the legal sector, where 'Matters' are used as the primary classification system for documents.

So the environment to work with is:
  • A client component hosted in MS Office applications
  • SharePoint (both on-premise and online)
I had no previous knowledge of AI, but I soon discovered that most AI research is done using Python, and only a few components and libraries were available within my comfort zone (and that of the Microsoft platform), which is C#.
Part of that is because AI research requires using the GPU for parallel processing, if you want to spend minutes instead of hours on each AI experiment. This is well supported in Python, where Nvidia's CUDA software is most prominently used.
So while for accessing SharePoint and MS Office documents, and for providing web services on IIS, one would like to use a .NET language, for the AI environment one has to use Python. This is a nasty gap to bridge. Fortunately Microsoft also realized this a few years ago, and started to incorporate Python integration in several ways. Then again, this is all quite new, and not all of it has matured.