Friday, 24 April 2020

Use of common dictionaries on top of private word list

We investigated which part of our AI functionality makes use of common words that are available in public word lists, and therefore which part relies heavily on custom words.

Introduction
We use the content of emails and other documents to let AI suggest filing locations in our DMS. For this, existing documents are crawled, and individual words are checked against an ever-extending, client-specific word list, replacing the text with lists of word indices.
Using this method we arrive at macro F1 scores of 80% or higher, depending on several configuration choices. The word list also contains highly client-specific words, such as email addresses, names, and even numbers.
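The indexing step above can be sketched as follows. This is a minimal illustration, not our actual crawler: the class name, tokenization, and storage are assumptions made for the example.

```python
class WordIndex:
    """Maps words to integer indices, extending the word list
    whenever a previously unseen word appears (a simplified sketch
    of the encoding step described above)."""

    def __init__(self):
        self.index = {}  # word -> integer index

    def encode(self, text):
        indices = []
        for word in text.lower().split():
            if word not in self.index:
                # extend the client-specific word list with the new word
                self.index[word] = len(self.index)
            indices.append(self.index[word])
        return indices


vocab = WordIndex()
ids = vocab.encode("invoice from acme invoice")
# the repeated word "invoice" maps to the same index twice
```

In the real system the word list persists across documents, so the same vocabulary grows as more documents are crawled.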
By checking which words are also present in commonly available word lists, we were able to run the AI training either with the complete word list, or with only the words from these common word lists.

Method
Because most of the text in our DMS is either in English or in Dutch, we used the word lists below to mark which of the words we already had could be considered common. Then we could train our model either with all words, or only with the 'common' words.

Dutch: https://github.com/OpenTaal/opentaal-wordlist
English: https://github.com/dwyl/english-words/
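The marking step can be sketched as set membership against the downloaded lists. This is an illustrative sketch: the file paths and the `Common`/`Tenant` labels mirror the tables below, but the actual pipeline code is not shown in this post.

```python
def load_wordlist(path):
    """Load a plain-text word list (one word per line) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def mark_common(vocabulary, common_sets):
    """Label each word 'Common' if it appears in any of the public
    word lists, otherwise 'Tenant' (client-specific)."""
    return {
        word: "Common" if any(word in s for s in common_sets) else "Tenant"
        for word in vocabulary
    }


# Hypothetical tiny lists standing in for the full OpenTaal and
# dwyl/english-words downloads:
dutch = {"fiets", "huis"}
english = {"house", "invoice"}
marked = mark_common(["invoice", "acme-b.v.", "fiets"], [dutch, english])
```

With the real lists, `common_sets` would be `[load_wordlist("nl.txt"), load_wordlist("en.txt")]` for whatever local file names the downloads were saved under.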


Preliminary analytics

Our word list contained about 800,000 different words from the crawling process. After marking them using the common libraries, it appeared that only 13% could be marked as common. However, for training, our AI ignores words that are not used often (20 occurrences is currently the limit for this). When looking only at these remaining relevant words, 60% appeared to be common.
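The frequency filter and the fractions reported below can be computed as in this sketch. The function and the tiny example data are assumptions for illustration; only the threshold of 20 comes from the text above.

```python
def common_fraction(word_counts, marked, min_count=None):
    """Fraction of words marked 'Common', optionally restricted to
    words occurring more than min_count times.

    word_counts: {word: number of occurrences}
    marked:      {word: 'Common' or 'Tenant'}
    """
    words = [w for w, c in word_counts.items()
             if min_count is None or c > min_count]
    if not words:
        return 0.0
    common = sum(1 for w in words if marked[w] == "Common")
    return common / len(words)


# Hypothetical mini-vocabulary: 2 of 3 words are common overall,
# but only 1 of the 2 often-used (>20) words is common.
counts = {"invoice": 35, "acme-b.v.": 50, "huis": 3}
origin = {"invoice": "Common", "acme-b.v.": "Tenant", "huis": "Common"}
```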

All words:

Used   Origin   Count
Yes    Tenant     694,547
Yes    Common     102,701
No     Common     724,001
Sum              1,521,249

Fraction of all used words found in the common dictionaries: 0.128819
Often used words (>20 occurrences):

Origin   Language   Words
Common   English     4,435
Tenant   Unknown     6,330
Common   Dutch       5,329
Common   Unknown        31
Sum                 16,125

Fraction of often-used words found in the common dictionaries: 0.605519


Results

When training using only the words from the common library, we arrived at an F1 value of 75% instead of 80%. As expected, this score is lower than when the proprietary words are included in training, but the value is still quite good. For document type classification, we have far too few labeled documents. Without knowing the statistical significance at these levels, the F1 nevertheless appears to be slightly better when only common words are used: 24% instead of 21%.

