Friday 24 April 2020

Use of common dictionaries on top of a private word list

We investigated which part of our AI functionality relies on common words that are available in public word lists, and which part therefore depends heavily on custom words.

Introduction
We use the content of emails and other documents to let AI suggest filing locations in our DMS. To do this, the documents already present are crawled, and each word is looked up in an ever-growing client-specific word list, so that the text is replaced by a list of indices.
Using this method we arrive at macro F1 scores of 80% or higher, depending on several configuration choices. The word list also contains very client-specific words, such as email addresses, names and even numbers.
By checking which words are also present in commonly available word lists, we were able to run the AI training either with the complete word list or with only the words that also occur in these common word lists.
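
The indexing step described above can be sketched as follows. This is a minimal illustration, not our production crawler: the class name `WordList`, the functions `tokenize` and `encode`, and the simple whitespace tokenizer are all assumptions made for the example.

```python
class WordList:
    """Growing word list that maps each distinct word to a stable index."""

    def __init__(self):
        self.index_of = {}  # word -> index
        self.words = []     # index -> word

    def lookup(self, word):
        # Unseen words are appended, so existing indices never change.
        if word not in self.index_of:
            self.index_of[word] = len(self.words)
            self.words.append(word)
        return self.index_of[word]


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, trim surrounding
    # punctuation. Email addresses and numbers survive as single tokens.
    tokens = (raw.strip(".,;:!?()\"'") for raw in text.lower().split())
    return [t for t in tokens if t]


def encode(text, word_list):
    """Replace a text by the list of indices of its words."""
    return [word_list.lookup(word) for word in tokenize(text)]


word_list = WordList()
first = encode("Invoice for jan@example.com.", word_list)
second = encode("Second invoice, for 2020", word_list)
print(first)   # [0, 1, 2]
print(second)  # [3, 0, 1, 4]
```

Because new words are only ever appended, documents encoded earlier remain valid as the word list keeps growing.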

Method
Because most of the text in our DMS is in either English or Dutch, we used the word lists below to mark which of the words we already had could be considered common. We could then train our model either with all words or with only the 'common' words.

Dutch: https://github.com/OpenTaal/opentaal-wordlist
English: https://github.com/dwyl/english-words/
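
The marking step can be sketched roughly as below. The function names, the case-insensitive matching and the rule that Dutch is checked before English are assumptions for illustration; the small in-memory lists stand in for the downloaded word list files, which are assumed to hold one word per line.

```python
def load_word_list(lines):
    """Build a lowercase set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def mark_origin(words, dutch, english):
    """Label each word as Common (with a language) or Tenant."""
    marked = {}
    for word in words:
        key = word.lower()
        if key in dutch:
            # Assumption: a word present in both lists is counted as Dutch.
            marked[word] = ("Common", "Dutch")
        elif key in english:
            marked[word] = ("Common", "English")
        else:
            marked[word] = ("Tenant", "Unknown")
    return marked


# Stand-ins for the public word lists.
dutch = load_word_list(["fiets\n", "factuur\n", "water\n"])
english = load_word_list(["invoice\n", "water\n"])

marked = mark_origin(["Factuur", "invoice", "jan@example.com", "water"],
                     dutch, english)

# Vocabulary for the 'common words only' training run.
common_only = [w for w, (origin, _) in marked.items() if origin == "Common"]
print(common_only)  # ['Factuur', 'invoice', 'water']
```

Set membership keeps each lookup cheap, which matters when several hundred thousand crawled words have to be checked against the public lists.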


Preliminary analytics

Our word list contained about 800,000 different words from the crawling process. After marking them against the common word lists, it appeared that only 13% could be marked as common. However, during training our AI ignores words that are not used often (the limit is currently 20 occurrences). Looking only at the remaining relevant words, 60% appeared to be common.

All words:
Used   Origin   Count
Yes    Tenant    694547
Yes    Common    102701
No     Common    724001
Sum             1521249

Fraction of all used words in common dictionary: 0.128819

Often used words (>20):
Origin   Language   Words
Common   English     4435
Tenant   Unknown     6330
Common   Dutch       5329
Common   Unknown       31
Sum                 16125

Fraction of often used words in common dictionary: 0.605519
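
As a check, both fractions can be recomputed from the counts in the tables. The variable names are ours. The arithmetic suggests that the second fraction counts only the English and Dutch words, leaving out the 31 common words of unknown language.

```python
# Counts copied from the 'All words' table.
used_tenant = 694547
used_common = 102701
unused_common = 724001

total_used = used_tenant + used_common            # 797248, 'about 800,000'
fraction_all = used_common / total_used

# Counts copied from the 'Often used words' table.
often_used = {
    ("Common", "English"): 4435,
    ("Tenant", "Unknown"): 6330,
    ("Common", "Dutch"): 5329,
    ("Common", "Unknown"): 31,
}
total_often = sum(often_used.values())            # 16125

# Only the English and Dutch rows reproduce the reported value.
known_language = (often_used[("Common", "English")]
                  + often_used[("Common", "Dutch")])
fraction_often = known_language / total_often

print(f"{fraction_all:.6f}")    # 0.128819
print(f"{fraction_often:.6f}")  # 0.605519
```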


Results

When training with only the words from the common word lists, we arrived at an F1 value of 75% instead of 80%. As expected, this score is lower than when the proprietary words are included in training, but the value is still quite good. For document type classification, we have far too few labeled documents. Without knowing the significance at these levels, it nevertheless seems that the F1 is even slightly better when only common words are used: 24% instead of 21%.
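
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a class with few labeled documents weighs as much as a large one. The sketch below uses this standard definition. It is not our evaluation code, and the folder names are made up for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); counted as 0 when nothing matches.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical filing locations for four documents.
y_true = ["contracts", "contracts", "invoices", "hr"]
y_pred = ["contracts", "invoices", "invoices", "hr"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7778
```

With very few labeled documents per class, a single misclassification shifts a per-class F1 considerably, which is one reason to read the 24% versus 21% difference with caution.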