Hello all,

We needed to estimate heap consumption for running TokenNameFinder models. This 
was quite easy in the beginning: check the compressed model file size, measure 
the maximum heap consumption after a full GC, do this for various (at first 
Latin-script only) models, and take the average. It came out at roughly 13x the 
size of the compressed file.
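
To be concrete, the measurement is roughly the sketch below (simplified; the 
class name and the way the model file is passed in are just illustrative):

import java.io.File;
import java.io.FileInputStream;

import opennlp.tools.namefind.TokenNameFinderModel;

public class ModelHeapCheck {

    public static void main(String[] args) throws Exception {
        File modelFile = new File(args[0]); // some compressed *.bin model
        long fileSize = modelFile.length();

        long before = usedHeap();
        TokenNameFinderModel model;
        try (FileInputStream in = new FileInputStream(modelFile)) {
            model = new TokenNameFinderModel(in);
        }
        long after = usedHeap();

        System.out.printf("%s: file %d bytes, heap %d bytes, ratio %.1fx%n",
                model.getLanguage(), fileSize, after - before,
                (after - before) / (double) fileSize);
    }

    private static long usedHeap() {
        System.gc(); // request a full GC before sampling
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}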

Things changed when we added multibyte scripts: averaged together with the 
Latin-script models, the ratio went up to 19-20x the original file size. Why is 
that the case? In Java, strings and chars are always UTF-16, right?
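
To make the question concrete, this is the kind of difference I mean (toy 
strings, purely as an illustration):

import java.nio.charset.StandardCharsets;

public class CharSizes {
    public static void main(String[] args) {
        String latin = "Amsterdam"; // Latin-script example
        String cjk = "東京都";       // multibyte-script example

        // Serialized as UTF-8: Latin characters take 1 byte, these CJK characters 3 bytes each.
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length); // 9
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);   // 9

        // String.length() counts UTF-16 code units (Java chars), regardless of script.
        System.out.println(latin.length()); // 9
        System.out.println(cjk.length());   // 3
    }
}

So the relation between serialized size and in-memory size differs per script, 
and I suspect that is part of what we are seeing, but I don't understand the 
full picture.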

A second, more difficult problem is estimating the minimal heap needed for 
training these models. Our datasets can be huge even after fuzzy deduplication, 
varying from a few dozen MB to several GB and requiring humongous heaps. 
Indexing the training data can take a very long time, which makes trial and 
error costly. There also doesn't seem to be a simple rule mapping a number of 
sentences or MB to a given minimal heap. Multibyte scripts, by the way, also 
appear to require additional heap. (A sketch of how we train follows below.)
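
For context, the training itself is essentially the standard NameFinderME.train 
call, roughly as in this simplified sketch (language code and file names are 
placeholders):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNameFinder {

    public static void main(String[] args) throws Exception {
        // Training data in the OpenNLP name finder format, one sentence per line.
        MarkableFileInputStreamFactory inFactory =
                new MarkableFileInputStreamFactory(new File("train.txt"));

        try (ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(inFactory, StandardCharsets.UTF_8))) {

            TokenNameFinderModel model = NameFinderME.train(
                    "en",                               // language code (placeholder)
                    null,                               // default entity type
                    samples,
                    TrainingParameters.defaultParams(), // library defaults
                    new TokenNameFinderFactory());      // default feature generation

            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream(new File("ner-model.bin")))) {
                model.serialize(out);
            }
        }
    }
}

It is during the indexing/training phase of this call that we currently have to 
find the minimal heap by trial and error.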

Are there any hints the community can share that would help me estimate the 
heap needed for training? For each training file we produce, we have exact 
statistics on the number of sentences and, per entity type, the number of 
unique and total annotations.

We use the default features.xml for all models at the moment.

Many thanks,
Markus
