Hello all,

We needed to estimate heap consumption for running TokenNameFinder models. This was quite easy in the beginning: check the compressed model file size, measure maximum heap consumption after a full GC, do this for various (at first Latin-script-only) models, and take the average. It came out at about 13x the size of the compressed file.
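For context, this is roughly how we take that measurement; a minimal sketch, where the model path is just a placeholder and Runtime.gc() is only a hint rather than a guaranteed full GC:

import java.io.File;

import opennlp.tools.namefind.TokenNameFinderModel;

public class ModelHeapRatio {

    public static void main(String[] args) throws Exception {
        // Placeholder path to one of our compressed model files.
        File modelFile = new File(args[0]);

        long before = usedHeap();
        TokenNameFinderModel model = new TokenNameFinderModel(modelFile);
        long after = usedHeap();

        double ratio = (after - before) / (double) modelFile.length();
        System.out.printf("%s (%s): %.1fx the compressed file size%n",
                modelFile.getName(), model.getLanguage(), ratio);
    }

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint; not guaranteed to trigger a full GC
        return rt.totalMemory() - rt.freeMemory();
    }
}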
That changed when we added multibyte scripts: including the Latin-script models, the average went up to 19-20x the original file size. Why is that the case? In Java, strings and chars are always UTF-16, right?

A second, more difficult problem is estimating the minimal heap needed for training these models. Our datasets can be huge even after fuzzy deduplication, varying from a few dozen MB to over several GB and requiring humongous heaps. Indexing the training data can take a very long time, which makes trial and error costly. There also doesn't really seem to be a rule relating the number of sentences or MB of training data to a given minimal heap. Multibyte scripts, by the way, also appear to require additional heap.

Are there any hints the community can share that would help me estimate the heap needed for training? For each training file we produce, we have exact statistics on the number of sentences and, per entity type, the number of unique and total annotations. We use the default features.xml for all models at this moment.

Many thanks,
Markus
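P.S. For completeness, this is roughly how we invoke training through the API; the file names and language are placeholders, and the no-arg TokenNameFinderFactory falls back to the default feature generation, which is what we mean by the default features.xml:

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNameFinder {

    public static void main(String[] args) throws Exception {
        // Placeholder training file in the OpenNLP name finder format.
        File trainFile = new File("ner-train.txt");

        try (ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(trainFile),
                        StandardCharsets.UTF_8))) {

            // Heap usage peaks while the samples are indexed into events,
            // before the actual training iterations start.
            TokenNameFinderModel model = NameFinderME.train(
                    "en",                          // placeholder language
                    null,                          // keep entity types from the data
                    samples,
                    TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory()); // default feature generation

            model.serialize(new File("ner-model.bin")); // placeholder output path
        }
    }
}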