Introduction
We use the content of emails and other documents to let AI suggest filing locations in our DMS. For this, the documents already present are crawled, and each word is matched against an ever-extending, client-specific word list, so that the text is replaced by lists of word indices.
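As an illustration of that indexing step (a minimal sketch, not our actual crawler code), the idea could look like the snippet below; all names and the example sentence are made up.

```python
# Minimal sketch of the indexing idea: each word of a document is looked up in a
# growing, client-specific word list and the text is replaced by the indices.
import re

word_to_index = {}  # client-specific word list, extended as new words appear

def text_to_indices(text):
    indices = []
    for word in re.findall(r"\w+", text.lower()):
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)  # extend the word list
        indices.append(word_to_index[word])
    return indices

print(text_to_indices("Invoice 4711 for ACME B.V., please file under Finance"))
```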
Using this method we arrive at macro F1 scores of 80% or higher, depending on several configuration choices. The word list also contains very client-specific words, such as email addresses, names and even numbers.
By checking which words are also present in publicly available word lists, we were able to run the AI training either with the complete word list, or with only the words that also occur in these common word lists.
Method
Because most of the text in our DMS is either in English or in Dutch, we used the two word lists below to mark which of the words we already had could be considered common. We could then train our model either with all words or with only the 'common' words; a short sketch of this marking step follows the links.
Dutch: https://github.com/OpenTaal/opentaal-wordlist
English: https://github.com/dwyl/english-words/
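As a rough sketch of the marking step, assuming both lists have been downloaded as plain-text files with one word per line (the file paths and the tenant words below are assumptions, not our real data):

```python
# Sketch: mark each tenant word as 'Common' if it occurs in either public list.
def load_wordlist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

common_words = (load_wordlist("opentaal-wordlist/wordlist.txt")    # Dutch (OpenTaal)
                | load_wordlist("english-words/words_alpha.txt"))  # English (dwyl)

tenant_words = ["factuur", "invoice", "j.doe@example.com", "4711"]  # illustrative
origin = {w: ("Common" if w.lower() in common_words else "Tenant") for w in tenant_words}
print(origin)
```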
Preliminary analytics
Our word list contained about 800,000 different words from the crawling process. After marking them against the common word lists, it appeared that only 13% could be marked as common. However, for training our AI ignores words that are not used often (20 occurrences is currently the limit). When looking only at these remaining relevant words, 60% appeared to be common. The tables below summarise the counts, and a short sketch of the filter follows them.
All words:

Used | Origin | Count
Yes  | Tenant | 694547
Yes  | Common | 102701
No   | Common | 724001
Sum  |        | 1521249

Fraction of used words that appear in the common word lists: 0.128819
Often used words (> 20 occurrences):

Origin | Language | Words
Common | English  | 4435
Tenant | Unknown  | 6330
Common | Dutch    | 5329
Common | Unknown  | 31
Sum    |          | 16125

Fraction of often used words that appear in the common word lists: 0.605519
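The usage threshold and the common fraction boil down to a simple filter over the word counts. In the sketch below, `usage_count` and `origin` are tiny illustrative stand-ins for the real crawl statistics.

```python
# Sketch of the 'often used' filter (> 20 uses) and the fraction of common words.
usage_count = {"invoice": 512, "factuur": 347, "j.doe@example.com": 3, "4711": 25}
origin = {"invoice": "Common", "factuur": "Common",
          "j.doe@example.com": "Tenant", "4711": "Tenant"}

MIN_USES = 20  # words used 20 times or fewer are ignored for training
often_used = [w for w, n in usage_count.items() if n > MIN_USES]
fraction_common = sum(origin[w] == "Common" for w in often_used) / len(often_used)
print(f"{len(often_used)} often-used words, {fraction_common:.0%} common")
```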
Results
When training with only the words from the common word lists, we arrived at an F1 value of 75% instead of 80%. As expected, this score is lower than when the proprietary words are included in training, but it is still quite good. For document type classification we have far too few documents labeled. Without knowing whether the differences are significant at these levels, the F1 nevertheless appears to be slightly better when only common words are used: 24% instead of 21%.
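For reference, a macro-averaged F1 of this kind can be computed as sketched below, here using scikit-learn as an example; the label arrays are placeholders and this is not necessarily the tooling behind our numbers.

```python
# Sketch of the macro-averaged F1 comparison; only the metric call is the point.
from sklearn.metrics import f1_score

y_true           = ["Finance", "HR", "Finance", "Legal", "HR"]   # gold filing locations
y_pred_all_words = ["Finance", "HR", "Legal",   "Legal", "HR"]   # model trained on all words
y_pred_common    = ["Finance", "HR", "Finance", "HR",    "HR"]   # model trained on common words only

print("all words:   ", f1_score(y_true, y_pred_all_words, average="macro"))
print("common words:", f1_score(y_true, y_pred_common,   average="macro"))
```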