Hi Kirk -

Thanks for posting your test case. It did point up an inefficiency in how one of the hash maps was being used - this is now being improved. Your basic question concerned expectations for storing things in the CAS. Here's the basic model.

The CAS stores most things in int[] arrays.

Feature Structures take up a number of entries in the int[]: one plus the number of features. So an annotation takes 1 + 3 = 4 words. (1 word = 4 bytes). Your annotation type added 2 more features, so these take 6 words in the int[].

Feature Structure features which are Strings take another 4 slots in the int[] arrays (16 bytes), per string being referenced, in addition to whatever storage Java uses for strings. Sun Java 6_01 appears to use 32 + 2 * number_of_characters bytes to store a string. In your test case, each String in Java took 42 bytes.

Indexes take one word per annotation indexed, typically.
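As a rough sketch of this model (plain arithmetic, not UIMA code), the per-annotation int[] usage can be estimated as below. The function names are made up for illustration, and the example assumes that both added features are Strings and that each annotation is indexed once.

```python
WORD_BYTES = 4  # one int[] slot is one 4-byte word


def fs_words(num_features):
    """Slots used by one feature structure: one plus one per feature."""
    return 1 + num_features


def java_string_bytes(num_chars):
    """Approximate heap size of a String as observed on Sun Java 6_01."""
    return 32 + 2 * num_chars


def cas_bytes_per_annotation(extra_features, string_features, indexed=True):
    """Estimated int[] bytes in use for one annotation (hypothetical helper)."""
    base_features = 3                    # a plain annotation: 1 + 3 = 4 words
    words = fs_words(base_features + extra_features)
    words += 4 * string_features         # 4 extra slots per referenced string
    if indexed:
        words += 1                       # typically one index word per annotation
    return words * WORD_BYTES


per_annotation = cas_bytes_per_annotation(extra_features=2, string_features=2)
print(per_annotation)             # 60 bytes: (6 + 8 + 1) words * 4
print(per_annotation * 100_000)   # 6000000 bytes for 100K annotations
print(java_string_bytes(5))       # 42 bytes per 5-character String
```

This counts only the words in use; the int[] arrays actually allocated can be larger than this estimate.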

All of the int[] objects grow as needed, by quantum jumps, so at any particular time, the number of words allocated is often larger than the number used.

To reference CAS objects from a Java program, one of two interfaces is used: the "JCas" interface or the plain "CAS" Java interface. Both of these create a 2-field Java object for each referenced CAS object. In Sun's Java 6_01, there is an always-present overhead of 8 bytes per Java object, so these Java objects take 16 bytes.

In addition, the JCas implementation keeps a hash map where the keys are the CAS object reference (an int), and the values are the corresponding JCas object, once it is created. This hash map takes additional space: in the runs you did, this took about 46 bytes per entry. We've done some redesign and reduced this to about 10 bytes additional, per annotation.
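To make the Java-side cost concrete, here is a small arithmetic sketch using the figures above. It assumes 4 bytes per field (which is what makes a 2-field object come to 16 bytes) and one wrapper object per annotation; the names are illustrative only.

```python
OBJECT_HEADER_BYTES = 8   # per-object overhead on Sun Java 6_01
FIELD_BYTES = 4           # assumption: each of the 2 fields takes 4 bytes


def wrapper_bytes(num_fields=2):
    """Size of the Java object created for one referenced CAS object."""
    return OBJECT_HEADER_BYTES + num_fields * FIELD_BYTES


def jcas_overhead_bytes(annotations, map_bytes_per_entry):
    """Wrapper objects plus the JCas hash map, for a number of annotations."""
    return annotations * (wrapper_bytes() + map_bytes_per_entry)


n = 100_000
print(wrapper_bytes())               # 16 bytes per wrapper object
print(jcas_overhead_bytes(n, 46))    # 6200000 bytes with the old hash map
print(jcas_overhead_bytes(n, 10))    # 2600000 bytes with the redesigned one
```

The 16 bytes per wrapper matches the 1,600,000 bytes reported for 100,000 TestCaseAnnotation objects.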

So your numbers are basically correct, except we've now changed the entries due to the JCas hash map overhead from:

   Integer                   1,600,000 (UIMA internal (Annotation))
   java.util.HashMap$Entry   2,400,000 (UIMA internal)
   Object[]                    600,000 (a guess, for the table part of the hash table:
                                        100K entries (1 per annotation) * 4 bytes, plus
                                        some for the 75% load factor of this table, plus
                                        extra due to the table expanding by a factor of 2
                                        when growing)


The new implementation makes these numbers look more like this:

   Integer                   0         (UIMA internal (Annotation)) - no longer needed
   java.util.HashMap$Entry   0         (UIMA internal) - no longer used
   Object[]                  1,000,000 (an approximation, for the table part of the hash table:
                                        100K entries (1 per annotation) * 4 bytes, plus
                                        some for the 50% load factor of this table)

This reduces 4.6 MB down to 1 MB overhead for 100K annotations.

When I measured the int[] use for Java 6_01 from Sun, using Sun's heapDump and "jhat" tool, I found the int[] size to be 8.1 MB (versus your measurement of 9.3 MB). I'm not sure where the difference comes from. If I add all these up, I get a "UIMA overhead" of 8.1 MB + 1 MB = 9.1 MB. Before the "fix" for the JCas instance hashmap, this would be closer to your reported overhead: 8.1 MB + 4.6 MB = 12.7 MB. So your reporting triggered a significant improvement in the implementation (will be in the next release) - thank you!
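For reference, the before/after totals can be re-added from the figures in this message. This is just a check of the arithmetic; the dictionary and function names are illustrative, and 1 MB is taken as 1,000,000 bytes, as in the numbers above.

```python
# JCas hash map entries from the two tables above, in bytes.
old_map = {"Integer": 1_600_000, "java.util.HashMap$Entry": 2_400_000, "Object[]": 600_000}
new_map = {"Integer": 0, "java.util.HashMap$Entry": 0, "Object[]": 1_000_000}

INT_ARRAY_BYTES = 8_100_000  # int[] size measured with heapDump / jhat


def total_overhead(map_entries):
    """int[] storage plus the JCas hash map entries, in bytes."""
    return INT_ARRAY_BYTES + sum(map_entries.values())


print(sum(old_map.values()))      # 4600000  (4.6 MB hash map overhead before)
print(sum(new_map.values()))      # 1000000  (1 MB after)
print(total_overhead(old_map))    # 12700000 (12.7 MB before the fix)
print(total_overhead(new_map))    # 9100000  (9.1 MB after the fix)
```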

I hope this helps in having a better model of what to expect in terms of space utilization.

-Marshall

Kirk True wrote:
Hi all,

I have begun seeing heavy memory use when processing largish
documents through a UIMA pipeline. I wanted to make sure what I'm
seeing with regard to UIMA's internal memory use is on par with
expectations.

It looks like either for a 1,500,000 byte or a 15,000,000 byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?

Details:

In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in a view in the CAS, then it
transforms the bytes to an HTML string that is then set in a view in
the CAS. Next, for each view, the annotator creates 50,000 annotations.
Each annotation has two 5-character attributes. I profiled my
application using two profilers (JProbe and YourKit) and took heap
snapshots before and after processing was performed and saw similar
results.

I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the expected memory usage for just my data (i.e. the two
views of the document and the strings making up the annotations) is:

For a 1,500,000 byte document:

    Original document         1,500,000
    HTML document             2,800,000
    TestCaseAnnotation        1,600,000
    Annotation strings        4,800,000
    Annotation char[]s        2,400,000
    Integer                   1,600,000 (UIMA internal (Annotation))
    int[]                     9,300,000 (UIMA internal)
    java.util.HashMap$Entry   2,400,000 (UIMA internal)
    -----------------------------------
                             26,400,000

For a 15,000,000 byte document:

    Original document        15,000,000
    HTML document            28,000,000
    TestCaseAnnotation        1,600,000
    Annotation strings        4,800,000
    Annotation char[]s        2,400,000
    Integer                   1,600,000 (UIMA internal (Annotation))
    int[]                     9,300,000 (UIMA internal)
    java.util.HashMap$Entry   2,400,000 (UIMA internal)
    -----------------------------------
                             65,100,000

I can post the code for the test cases if it helps.

Thanks,
Kirk


