Hi Kirk -
Thanks for posting your test case. It pointed out an inefficiency in
how one of the hash maps was being used - this is now being improved.
Your basic question concerned expectations for storing things in the
CAS. Here's the basic model.
The CAS stores most things in int[] arrays.
A Feature Structure takes up a number of entries in the int[]: one plus
the number of features. So a plain annotation takes 1 + 3 = 4 words
(1 word = 4 bytes). Your annotation type adds 2 more features, so each
instance takes 1 + 5 = 6 words in the int[].
Feature Structure features which are Strings take another 4 slots
(16 bytes) in the int[] arrays per string referenced, in addition to
whatever storage Java uses for the strings themselves. Sun Java 6_01
appears to use 32 + 2 * number_of_characters bytes to store a string,
so in your test case each String in Java took 32 + 2 * 5 = 42 bytes.
Indexes typically take one additional word per annotation indexed.
All of the int[] objects grow as needed, by quantum jumps, so at any
particular time, the number of words allocated is often larger than the
number used.
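To make this concrete, here is a back-of-the-envelope estimate for your
annotation type (two extra features, both 5-character Strings), using the
numbers above. Treat it as a sketch - the constants are specific to Sun
Java 6_01, and the estimate ignores the growth slack just mentioned:

public class CasSpaceEstimate {

  // CAS heap words used by one Feature Structure: 1 + number of features.
  static int fsWords(int numFeatures) {
    return 1 + numFeatures;
  }

  // Extra CAS heap words per String-valued feature actually referenced.
  static final int WORDS_PER_STRING_REF = 4;

  // Approximate Java heap bytes for a String in Sun Java 6_01.
  static int javaStringBytes(int numChars) {
    return 32 + 2 * numChars;
  }

  public static void main(String[] args) {
    // Your type: annotation (sofa, begin, end) plus two String features.
    int words = fsWords(3 + 2)             // 6 words for the FS itself
              + 2 * WORDS_PER_STRING_REF   // 8 words for the two string refs
              + 1;                         // 1 word in the index
    int bytes = words * 4                  // 1 word = 4 bytes
              + 2 * javaStringBytes(5);    // two 5-character Java Strings
    System.out.println(words + " CAS words, ~" + bytes
        + " bytes per annotation (before any JCas overhead)");
  }
}

That works out to 15 CAS words (60 bytes) plus 84 bytes of Java String
storage per annotation, before the JCas-related costs described next.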
To reference CAS objects from a Java program, one of two interfaces is
used: the "JCas" interface or the plain "CAS" Java interface. Both of
these create a two-field Java object for each referenced CAS object. In
Sun's Java 6_01 there is an always-present overhead of 8 bytes per Java
object, so each of these Java objects takes 16 bytes.
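For example, a loop like the following (an illustrative fragment, not
code from your test) ends up creating, or looking up, one of these small
Java objects for every annotation it touches:

import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class CoverObjectExample {
  // Each annotation reached through JCas gets a ~16-byte Java object
  // wrapping its int reference, in addition to its data in the CAS int[].
  static void touchAllAnnotations(JCas jcas) {
    FSIterator it = jcas.getAnnotationIndex().iterator();
    while (it.hasNext()) {
      Annotation a = (Annotation) it.next();  // Java object created/found here
      a.getCoveredText();                     // plain Java access to CAS data
    }
  }
}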
In addition, the JCas implementation keeps a hash map whose keys are the
CAS object references (ints) and whose values are the corresponding JCas
objects, once they are created. This hash map takes additional space: in
the runs you did, it took about 46 bytes per entry. We've done some
redesign and reduced this to about 10 additional bytes per annotation.
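To show where the savings come from, here is a sketch of the idea (not
the actual UIMA code): the old scheme pays for a boxed Integer key, a
HashMap$Entry, and a table slot per annotation, while an open-addressed
table probed directly on the int reference that the JCas object already
carries only pays for the table slots:

class JCasCover {
  final int addr;                         // the CAS object reference (an int)
  JCasCover(int addr) { this.addr = addr; }
}

class AddressKeyedTable {
  private JCasCover[] table = new JCasCover[256];  // power of 2, <= 50% full
  private int size;

  JCasCover get(int addr) {
    int i = addr & (table.length - 1);
    while (table[i] != null) {
      if (table[i].addr == addr) {
        return table[i];
      }
      i = (i + 1) & (table.length - 1);    // linear probing
    }
    return null;                           // not yet created
  }

  void put(JCasCover v) {
    if (2 * (size + 1) > table.length) {
      resize();                            // keep load factor at or below 50%
    }
    int i = v.addr & (table.length - 1);
    while (table[i] != null && table[i].addr != v.addr) {
      i = (i + 1) & (table.length - 1);
    }
    if (table[i] == null) {
      size++;
    }
    table[i] = v;
  }

  private void resize() {
    JCasCover[] old = table;
    table = new JCasCover[old.length * 2]; // doubles when it grows
    size = 0;
    for (JCasCover v : old) {
      if (v != null) {
        put(v);
      }
    }
  }
}

At a 50% load factor that is two 4-byte slots per entry, which is where
the roughly 10 bytes per annotation comes from.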
So your numbers are basically correct, except that the entries due to the
JCas hash map overhead have now changed. The old implementation:

  Integer                    1,600,000  (UIMA internal (Annotation))
  java.util.HashMap$Entry    2,400,000  (UIMA internal)
  Object[]                     600,000  (a guess - the table part of the hash
                                         table: 100K entries (1 per
                                         annotation) * 4 bytes, plus some for
                                         the 75% load factor of this table,
                                         plus extra because the table expands
                                         by a factor of 2 when growing)
  -------------------------------------
                             4,600,000  (about 46 bytes per annotation)
The new implementation makes these numbers look more like this:

  Integer                            0  (UIMA internal (Annotation)) -
                                         no longer needed
  java.util.HashMap$Entry            0  (UIMA internal) - no longer used
  Object[]                   1,000,000  (an approximation - the table part of
                                         the hash table: 100K entries (1 per
                                         annotation) * 4 bytes, plus some for
                                         the 50% load factor of this table)
  -------------------------------------
                             1,000,000  (about 10 bytes per annotation)
This reduces 4.6 MB down to 1 MB overhead for 100K annotations.
When I measured the int[] use for Sun's Java 6_01, using Sun's heap dump
facility and the "jhat" tool, I found the int[] size to be 8.1 MB (versus
your measurement of 9.3 MB). I'm not sure where the difference comes from.
If I add all these up, I get a "UIMA overhead" of 8.1 MB + 1 MB = 9.1 MB,
or roughly 91 bytes per annotation. Before the "fix" for the JCas instance
hash map, this would be closer to your reported overhead: 8.1 MB + 4.6 MB
= 12.7 MB, or about 127 bytes per annotation. So your report triggered a
significant improvement in the implementation (it will be in the next
release) - thank you!
I hope this helps give you a better model of what to expect in terms of
space utilization.
-Marshall
Kirk True wrote:
Hi all,
I have begun seeing heavy memory use when processing largish documents
through a UIMA pipeline. I wanted to make sure that what I'm seeing with
regard to UIMA's internal memory use is on par with expectations.
It looks like, for either a 1,500,000 byte or a 15,000,000 byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?
Details:
In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in one view of the CAS, then it
transforms the bytes into an HTML string which is set in a second view
of the CAS. Next, for each view, the annotator creates 50,000
annotations. Each annotation has two 5-character attributes. I profiled
the application with two profilers (JProbe and YourKit), taking heap
snapshots before and after processing, and both showed similar results.
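In outline, the annotator does something like this (the class, view, and
feature-setter names here are simplified placeholders, and the document
size is a parameter in the real code):

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;

public class TestAnnotator extends JCasAnnotator_ImplBase {

  private static final int ANNOTATIONS_PER_VIEW = 50000;

  private int documentSize = 1500000;  // a config parameter in the real code

  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    try {
      // One view holds the raw document, a second view an HTML rendering.
      JCas plainView = aJCas.createView("plain");
      plainView.setDocumentText(buildDocument(documentSize));
      JCas htmlView = aJCas.createView("html");
      htmlView.setDocumentText("<html><body>" + plainView.getDocumentText()
          + "</body></html>");

      // 50,000 annotations per view, each with two 5-character String features.
      for (JCas view : new JCas[] { plainView, htmlView }) {
        for (int i = 0; i < ANNOTATIONS_PER_VIEW; i++) {
          TestCaseAnnotation a = new TestCaseAnnotation(view, i, i + 10);
          a.setFirstAttribute("aaaaa");
          a.setSecondAttribute("bbbbb");
          a.addToIndexes();
        }
      }
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  private static String buildDocument(int numBytes) {
    StringBuilder sb = new StringBuilder(numBytes);
    for (int i = 0; i < numBytes; i++) {
      sb.append('x');
    }
    return sb.toString();
  }
}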
I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the memory usage for my data (i.e. the two views of the
document and the strings making up the annotations) plus UIMA's internal
structures is:
For a 1,500,000 byte document:

  Original document           1,500,000
  HTML document               2,800,000
  TestCaseAnnotation          1,600,000
  Annotation strings          4,800,000
  Annotation char[]s          2,400,000
  Integer                     1,600,000  (UIMA internal (Annotation))
  int[]                       9,300,000  (UIMA internal)
  java.util.HashMap$Entry     2,400,000  (UIMA internal)
  --------------------------------------
                             26,400,000
For a 15,000,000 byte document:

  Original document          15,000,000
  HTML document              28,000,000
  TestCaseAnnotation          1,600,000
  Annotation strings          4,800,000
  Annotation char[]s          2,400,000
  Integer                     1,600,000  (UIMA internal (Annotation))
  int[]                       9,300,000  (UIMA internal)
  java.util.HashMap$Entry     2,400,000  (UIMA internal)
  --------------------------------------
                             65,100,000
I can post the code for the test cases if it helps.
Thanks,
Kirk