Hi Kirk -
Thanks for posting your test case. It pointed out an inefficiency in
how one of the hash maps was being used - this is now being improved.
Your basic question concerned expectations for storing things in the
CAS. Here's the basic model.
The CAS stores most things in int[] arrays.
A Feature Structure takes up a number of entries in the int[]: one plus
the number of features. So a plain annotation takes 1 + 3 = 4 words
(1 word = 4 bytes). Your annotation type adds 2 more features, so each
instance takes 1 + 5 = 6 words in the int[].
Feature Structure features which are Strings take another 4 slots
(16 bytes) in the int[] arrays per string referenced, in addition to
whatever storage Java uses for the strings themselves. Sun Java 6_01
appears to use 32 + 2 * number_of_characters bytes to store a string,
so in your test case each String in Java took 32 + 2 * 5 = 42 bytes.
Indexes typically take one additional word per annotation indexed.
All of the int[] objects grow as needed, by quantum jumps, so at any
particular time, the number of words allocated is often larger than the
number used.
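To make this concrete, here is a back-of-the-envelope estimate for your
annotation type (two extra features, both 5-character Strings), using the
numbers above. Treat it as a sketch - the constants are specific to Sun
Java 6_01, and the estimate ignores the growth slack just mentioned:

public class CasSpaceEstimate {

  // CAS heap words used by one Feature Structure: 1 + number of features.
  static int fsWords(int numFeatures) {
    return 1 + numFeatures;
  }

  // Extra CAS heap words per String-valued feature actually referenced.
  static final int WORDS_PER_STRING_REF = 4;

  // Approximate Java heap bytes for a String in Sun Java 6_01.
  static int javaStringBytes(int numChars) {
    return 32 + 2 * numChars;
  }

  public static void main(String[] args) {
    // Your type: annotation (sofa, begin, end) plus two String features.
    int words = fsWords(3 + 2)             // 6 words for the FS itself
              + 2 * WORDS_PER_STRING_REF   // 8 words for the two string refs
              + 1;                         // 1 word in the index
    int bytes = words * 4                  // 1 word = 4 bytes
              + 2 * javaStringBytes(5);    // two 5-character Java Strings
    System.out.println(words + " CAS words, ~" + bytes
        + " bytes per annotation (before any JCas overhead)");
  }
}

That works out to 15 CAS words (60 bytes) plus 84 bytes of Java String
storage per annotation, before the JCas-related costs described next.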
To reference CAS objects from a Java program, one of two interfaces is
used: the "JCas" interface or the plain "CAS" Java interface. Both of
these create a two-field Java object for each referenced CAS object. In
Sun's Java 6_01 there is an always-present overhead of 8 bytes per Java
object, so each of these Java objects takes 16 bytes.
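For example, a loop like the following (an illustrative fragment, not
code from your test) ends up creating, or looking up, one of these small
Java objects for every annotation it touches:

import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class CoverObjectExample {
  // Each annotation reached through JCas gets a ~16-byte Java object
  // wrapping its int reference, in addition to its data in the CAS int[].
  static void touchAllAnnotations(JCas jcas) {
    FSIterator it = jcas.getAnnotationIndex().iterator();
    while (it.hasNext()) {
      Annotation a = (Annotation) it.next();  // Java object created/found here
      a.getCoveredText();                     // plain Java access to CAS data
    }
  }
}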
In addition, the JCas implementation keeps a hash map whose keys are the
CAS object references (ints) and whose values are the corresponding JCas
objects, once they are created. This hash map takes additional space: in
the runs you did, it took about 46 bytes per entry. We've done some
redesign and reduced this to about 10 additional bytes per annotation.
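To show where the savings come from, here is a sketch of the idea (not
the actual UIMA code): the old scheme pays for a boxed Integer key, a
HashMap$Entry, and a table slot per annotation, while an open-addressed
table probed directly on the int reference that the JCas object already
carries only pays for the table slots:

class JCasCover {
  final int addr;                         // the CAS object reference (an int)
  JCasCover(int addr) { this.addr = addr; }
}

class AddressKeyedTable {
  private JCasCover[] table = new JCasCover[256];  // power of 2, <= 50% full
  private int size;

  JCasCover get(int addr) {
    int i = addr & (table.length - 1);
    while (table[i] != null) {
      if (table[i].addr == addr) {
        return table[i];
      }
      i = (i + 1) & (table.length - 1);    // linear probing
    }
    return null;                           // not yet created
  }

  void put(JCasCover v) {
    if (2 * (size + 1) > table.length) {
      resize();                            // keep load factor at or below 50%
    }
    int i = v.addr & (table.length - 1);
    while (table[i] != null && table[i].addr != v.addr) {
      i = (i + 1) & (table.length - 1);
    }
    if (table[i] == null) {
      size++;
    }
    table[i] = v;
  }

  private void resize() {
    JCasCover[] old = table;
    table = new JCasCover[old.length * 2]; // doubles when it grows
    size = 0;
    for (JCasCover v : old) {
      if (v != null) {
        put(v);
      }
    }
  }
}

At a 50% load factor that is two 4-byte slots per entry, which is where
the roughly 10 bytes per annotation comes from.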
So your numbers are basically correct, except that the entries due to the
JCas hash map overhead have now changed. The old implementation:

  Integer                    1,600,000  (UIMA internal (Annotation))
  java.util.HashMap$Entry    2,400,000  (UIMA internal)
  Object[]                     600,000  (a guess - the table part of the hash
                                         table: 100K entries (1 per
                                         annotation) * 4 bytes, plus some for
                                         the 75% load factor of this table,
                                         plus extra because the table expands
                                         by a factor of 2 when growing)
  -------------------------------------
                             4,600,000  (about 46 bytes per annotation)
The new implementation makes these numbers look more like this:

  Integer                            0  (UIMA internal (Annotation)) -
                                         no longer needed
  java.util.HashMap$Entry            0  (UIMA internal) - no longer used
  Object[]                   1,000,000  (an approximation - the table part of
                                         the hash table: 100K entries (1 per
                                         annotation) * 4 bytes, plus some for
                                         the 50% load factor of this table)
  -------------------------------------
                             1,000,000  (about 10 bytes per annotation)
This reduces 4.6 MB down to 1 MB overhead for 100K annotations.
When I measured the int[] use for Sun's Java 6_01, using Sun's heap dump
facility and the "jhat" tool, I found the int[] size to be 8.1 MB (versus
your measurement of 9.3 MB). I'm not sure where the difference comes from.
If I add all these up, I get a "UIMA overhead" of 8.1 MB + 1 MB = 9.1 MB,
or roughly 91 bytes per annotation. Before the "fix" for the JCas instance
hash map, this would be closer to your reported overhead: 8.1 MB + 4.6 MB
= 12.7 MB, or about 127 bytes per annotation. So your report triggered a
significant improvement in the implementation (it will be in the next
release) - thank you!
I hope this helps give you a better model of what to expect in terms of
space utilization.
-Marshall
Kirk True wrote:
Hi all,
I have begun seeing heavy memory use when processing largish documents
through a UIMA pipeline. I wanted to make sure that what I'm seeing with
regard to UIMA's internal memory use is on par with expectations.
It looks like, for either a 1,500,000 byte or a 15,000,000 byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?
Details:
In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in one view of the CAS, then it
transforms the bytes into an HTML string which is set in a second view
of the CAS. Next, for each view, the annotator creates 50,000
annotations. Each annotation has two 5-character attributes. I profiled
the application with two profilers (JProbe and YourKit), taking heap
snapshots before and after processing, and both showed similar results.
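In outline, the annotator does something like this (the class, view, and
feature-setter names here are simplified placeholders, and the document
size is a parameter in the real code):

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;

public class TestAnnotator extends JCasAnnotator_ImplBase {

  private static final int ANNOTATIONS_PER_VIEW = 50000;

  private int documentSize = 1500000;  // a config parameter in the real code

  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    try {
      // One view holds the raw document, a second view an HTML rendering.
      JCas plainView = aJCas.createView("plain");
      plainView.setDocumentText(buildDocument(documentSize));
      JCas htmlView = aJCas.createView("html");
      htmlView.setDocumentText("<html><body>" + plainView.getDocumentText()
          + "</body></html>");

      // 50,000 annotations per view, each with two 5-character String features.
      for (JCas view : new JCas[] { plainView, htmlView }) {
        for (int i = 0; i < ANNOTATIONS_PER_VIEW; i++) {
          TestCaseAnnotation a = new TestCaseAnnotation(view, i, i + 10);
          a.setFirstAttribute("aaaaa");
          a.setSecondAttribute("bbbbb");
          a.addToIndexes();
        }
      }
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  private static String buildDocument(int numBytes) {
    StringBuilder sb = new StringBuilder(numBytes);
    for (int i = 0; i < numBytes; i++) {
      sb.append('x');
    }
    return sb.toString();
  }
}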
I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the memory usage for my data (i.e. the two views of the
document and the strings making up the annotations) plus UIMA's internal
structures is:
For a 1,500,000 byte document:

  Original document           1,500,000
  HTML document               2,800,000
  TestCaseAnnotation          1,600,000
  Annotation strings          4,800,000
  Annotation char[]s          2,400,000
  Integer                     1,600,000  (UIMA internal (Annotation))
  int[]                       9,300,000  (UIMA internal)
  java.util.HashMap$Entry     2,400,000  (UIMA internal)
  --------------------------------------
                             26,400,000
For a 15,000,000 byte document:

  Original document          15,000,000
  HTML document              28,000,000
  TestCaseAnnotation          1,600,000
  Annotation strings          4,800,000
  Annotation char[]s          2,400,000
  Integer                     1,600,000  (UIMA internal (Annotation))
  int[]                       9,300,000  (UIMA internal)
  java.util.HashMap$Entry     2,400,000  (UIMA internal)
  --------------------------------------
                             65,100,000
I can post the code for the test cases if it helps.
Thanks,
Kirk