Hi all,
I have begun seeing heavy memory use when processing largish
documents through a UIMA pipeline. I wanted to make sure what I'm
seeing with regard to UIMA's internal memory use is on par with
expectations.
It looks like for either a 1,500,000-byte or a 15,000,000-byte document
with the same annotations (100,000 annotations, each carrying 10
characters of attribute data), we incur a ~13 MB "overhead" for
internal UIMA data structures. Is this in line with expectations?
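
As a rough sanity check on the ~13 MB figure, here is a minimal Java
sketch that adds up the three rows marked "UIMA internal" in the
breakdown further down and divides by the annotation count. The class
and constant names are made up for illustration; the numbers are the
profiler figures from this post.

```java
// Hypothetical helper, not part of UIMA: sums the profiler rows
// labelled "UIMA internal" and derives a per-annotation cost.
final class OverheadPerAnnotation {
    static final long INTEGER_BYTES = 1_600_000L;       // Integer row
    static final long INT_ARRAY_BYTES = 9_300_000L;     // int[] row
    static final long HASHMAP_ENTRY_BYTES = 2_400_000L; // java.util.HashMap$Entry row
    static final long ANNOTATION_COUNT = 100_000L;      // 2 views x 50,000 annotations

    static long totalOverheadBytes() {
        return INTEGER_BYTES + INT_ARRAY_BYTES + HASHMAP_ENTRY_BYTES;
    }

    static long overheadPerAnnotationBytes() {
        return totalOverheadBytes() / ANNOTATION_COUNT;
    }

    public static void main(String[] args) {
        System.out.println("total overhead bytes: " + totalOverheadBytes());
        System.out.println("bytes per annotation: " + overheadPerAnnotationBytes());
    }
}
```

By this arithmetic the overhead is 13,300,000 bytes, or about 133 bytes
per annotation, independent of document size.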
Details:
In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes and sets it in one view of the CAS, then
transforms the bytes to an HTML string and sets that in a second view.
Next, for each view, the annotator creates 50,000 annotations. Each
annotation has two 5-character attributes. I profiled my application
with two profilers (JProbe and YourKit), taking heap snapshots before
and after processing, and both showed similar results.
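
For anyone who wants a quick before/after number without a profiler,
the following is a minimal sketch using only java.lang.Runtime. It is
not what JProbe or YourKit do, and it is far less precise than a heap
snapshot: System.gc() is only a hint, so the delta is approximate. The
class name is made up, and the byte[] allocation is just a stand-in for
running the pipeline.

```java
// Hypothetical helper: approximates retained heap growth across a
// piece of work. A stand-in for profiler snapshots, not a replacement.
final class HeapDelta {
    /** Approximate used heap in bytes, after requesting a GC. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a request only; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Stand-in for processing a document through the pipeline.
        byte[] document = new byte[1_500_000];

        long after = usedHeapBytes();
        System.out.println("approx. bytes retained: " + (after - before));

        // Reference the array after measuring so it stays reachable.
        System.out.println("document length: " + document.length);
    }
}
```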
I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the memory usage for my data (i.e. the two views of the
document and the strings making up the annotations) plus the UIMA
internals breaks down as follows:
For a 1,500,000 byte document (sizes in bytes):
Original document         1,500,000
HTML document             2,800,000
TestCaseAnnotation        1,600,000
Annotation strings        4,800,000
Annotation char[]s        2,400,000
Integer                   1,600,000  (UIMA internal (Annotation))
int[]                     9,300,000  (UIMA internal)
java.util.HashMap$Entry   2,400,000  (UIMA internal)
-----------------------------------
Total                    26,400,000
For a 15,000,000 byte document (sizes in bytes):
Original document        15,000,000
HTML document            28,000,000
TestCaseAnnotation        1,600,000
Annotation strings        4,800,000
Annotation char[]s        2,400,000
Integer                   1,600,000  (UIMA internal (Annotation))
int[]                     9,300,000  (UIMA internal)
java.util.HashMap$Entry   2,400,000  (UIMA internal)
-----------------------------------
Total                    65,100,000
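
The two breakdowns above suggest a simple linear size factor, sketched
below in Java. This is only an extrapolation from these two data
points, under two assumptions: the HTML view stays at 28/15 of the
original size (2,800,000 for 1,500,000 bytes), and the annotation
count stays at 100,000, so every row that did not change between the
two runs is treated as a fixed 22,100,000 bytes. The class and method
names are made up.

```java
// Hypothetical estimator fitted to the two measurements in this post.
// It assumes a fixed 100,000 annotations and a constant HTML ratio.
final class SizeFactor {
    // Sum of the rows identical in both runs: TestCaseAnnotation,
    // Annotation strings, Annotation char[]s, Integer, int[], HashMap$Entry.
    static final long FIXED_BYTES =
            1_600_000L + 4_800_000L + 2_400_000L
          + 1_600_000L + 9_300_000L + 2_400_000L;

    /** Estimated total bytes for a document of the given size. */
    static long estimateTotalBytes(long documentBytes) {
        long htmlBytes = documentBytes * 28 / 15; // observed HTML expansion
        return documentBytes + htmlBytes + FIXED_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(estimateTotalBytes(1_500_000L));
        System.out.println(estimateTotalBytes(15_000_000L));
    }
}
```

Under those assumptions the estimate reproduces both totals above, and
the ratio of total memory to original document size falls from about
17.6 for the small document to about 4.3 for the large one.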
I can post the code for the test cases if it helps.
Thanks,
Kirk