Thilo Goetz wrote:
Kirk True wrote:
Hi Adam,

Kirk,

In this test are you running a CPE or just an AnalysisEngine?  If it
is a CPE do you know what your CAS Pool size is?

It's an AnalysisEngine.

When a CAS is created it allocates a large heap, which is then
filled as you create annotations.  By default I believe this is
500,000 cells (2MB) per CAS, but it can be overridden (see
UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
definitely be one source of memory overhead.  As you saw, it does not
grow with larger documents; it will only grow if you create enough
annotations to fill up the allocated space.

I noticed that this is tweakable and set it to something insanely
small (like 100). But, as you said, it grows as the number of
annotations grows. Since the parameter is under the umbrella of
performance, I'd assume it would actually be better to
pre-allocate close to what we're going to use.
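A minimal sketch of overriding that initial heap size. The property
key string here is assumed to match UIMAFramework.CAS_INITIAL_HEAP_SIZE;
the snippet only builds the Properties object, with the UIMA calls that
would consume it shown in comments:

```java
import java.util.Properties;

public class CasHeapTuning {
    public static void main(String[] args) {
        // Shrink the per-CAS initial heap from the 500,000-cell default.
        // Key name assumed to match UIMAFramework.CAS_INITIAL_HEAP_SIZE.
        Properties perf = new Properties();
        perf.setProperty("cas_initial_heap_size", "100000"); // in 4-byte cells
        // In UIMA you would hand this to the framework, e.g.:
        //   UIMAFramework.produceAnalysisEngine(desc,
        //       Map.of(Resource.PARAM_PERFORMANCE_TUNING_SETTINGS, perf));
        System.out.println(perf.getProperty("cas_initial_heap_size"));
    }
}
```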
[...]

Yes.

You can estimate data use on the heap as follows. Each FS uses at
least one int for the type information, plus one int per feature. So
a vanilla annotation is 3 ints: one for the type, and one each for
the start and end features. If you have two additional features,
that's 5 ints, or 20 bytes. If you use the JCas, or you create
FeatureStructure Java objects, you incur the additional overhead of
a Java object for each annotation. It's small, but I can't say off
the top of my head exactly how small. Both the FeatureStructure Java
object and the JCas Java object have 2 fields: a Java "int" (4 bytes)
and a Java reference (4 bytes, unless it's a 64-bit Java, I think).
Plus you have to add the Java overhead for an object, which might be
8 bytes, but I'm not sure.
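To make that arithmetic concrete, here is a tiny sketch (plain Java,
no UIMA) of the rule described above: one 4-byte heap cell for the
type code plus one cell per feature:

```java
public class FsHeapEstimate {
    // One 4-byte heap cell for the type code, plus one cell per feature.
    static int fsHeapBytes(int numFeatures) {
        return (1 + numFeatures) * 4;
    }

    public static void main(String[] args) {
        // Vanilla annotation: just the built-in begin and end features.
        System.out.println(fsHeapBytes(2)); // 3 ints = 12 bytes
        // Annotation with two additional features: 5 ints = 20 bytes.
        System.out.println(fsHeapBytes(4));
    }
}
```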
Plus, the JCas objects are held in a HashMap (or some such, Marshall
correct me if I'm wrong), which incurs additional memory overhead.
True. The key is a wrapped "int", the value is a Java "ref", and then you have the hash table overhead.
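As a rough illustration of that extra cost, the cover-object cache
behaves like a map from the FS's heap address to its Java wrapper,
so each entry pays for a boxed Integer key, a reference, and a hash
bucket. (The actual UIMA data structure may differ; this is only a
conceptual stand-in.)

```java
import java.util.HashMap;
import java.util.Map;

public class CoverObjectCacheSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in for the JCas cover-object cache:
        // FS heap address -> JCas wrapper object.
        Map<Integer, Object> cache = new HashMap<>();
        int fsAddr = 17;                 // made-up heap address
        cache.put(fsAddr, new Object()); // key is auto-boxed to an Integer
        System.out.println(cache.containsKey(17));
    }
}
```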

In my experience, the CAS can easily reach 10 to 20 times the size of
the input document. If you have information-rich token annotations,
that's not really surprising (and this is without using the JCas). If
you were to manually create Java objects carrying the same
information, you would see roughly the same kind of overhead.
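As a back-of-the-envelope check of that 10-20x figure (every number
here is made up purely for illustration):

```java
public class CasSizeRatio {
    public static void main(String[] args) {
        int docBytes = 10_000;      // hypothetical 10 KB document
        int tokens = 2_000;         // assume ~1 token per 5 bytes of text
        int bytesPerToken = 40;     // type code, begin/end, a few features,
                                    // plus assumed index overhead
        // The document text is also stored in the CAS, as UTF-16 (2x).
        int casBytes = tokens * bytesPerToken + 2 * docBytes;
        System.out.println((double) casBytes / docBytes);
    }
}
```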
Two more points:

If you have variable sized documents, you might want to consider "chunking" - that is, breaking very large documents up into multiple CASes. A CAS Consumer can collect the chunks at the
end of the processing pipeline and re-assemble things.
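The splitting half of that idea can be sketched with nothing but
string arithmetic; chunk() here is a hypothetical helper, not a UIMA
API, and the reassembly a CAS Consumer would do amounts to joining
the pieces back in order:

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentChunker {
    // Hypothetical helper: break a large document into fixed-size pieces,
    // each of which would become its own CAS.
    static List<String> chunk(String doc, int maxChars) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < doc.length(); i += maxChars) {
            chunks.add(doc.substring(i, Math.min(doc.length(), i + maxChars)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4)); // [abcd, efgh, ij]
        // A CAS Consumer at the end of the pipeline could rejoin them:
        System.out.println(String.join("", chunk("abcdefghij", 4)));
    }
}
```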

Finally, when you "reset" a CAS, if it has expanded due to an unusually large number of feature structures, it will gradually shrink back down to a more nominal size. There is code in the reset logic that does this adjustment.

-Marshall
