The indexes use int[] arrays. Kirk - what indexes do you have defined (if any)? Do you "addToIndexes..." any of
the annotations you create?

-Marshall

Adam Lally wrote:
On 5/18/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
You can estimate data use on the heap as follows. Each FS uses at least one int for the type information, plus one int for each feature it has. So a vanilla annotation is 3 ints: one for the type, and one each for the start and end features. If you have two additional features, that's 5 ints, so 20 bytes. If you use the JCas, you incur the additional overhead of a Java object for each annotation. It's small, but I can't say off the top of my head exactly how small. Plus, the JCas objects are held in a HashMap (or some such; Marshall, correct me if I'm wrong), which incurs additional memory overhead.
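
As a rough illustration of the rule of thumb above, here is a minimal Java sketch. It assumes 4-byte ints and counts only the type cell plus one cell per feature; the class and method names are hypothetical and not part of UIMA, and JCas object and HashMap overhead is deliberately left out.

```java
// Back-of-the-envelope CAS heap estimate: one int for the type,
// plus one int per feature. Names here are illustrative only.
class CasHeapEstimate {
    static final int BYTES_PER_INT = 4;

    // Number of int cells one feature structure occupies on the heap.
    static int intsPerFs(int featureCount) {
        return 1 + featureCount;
    }

    // Estimated heap bytes for fsCount feature structures of the same shape.
    static long heapBytes(long fsCount, int featureCount) {
        return fsCount * intsPerFs(featureCount) * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        // Vanilla annotation: start and end features only.
        System.out.println("vanilla annotation ints: " + intsPerFs(2));
        // Annotation with two additional features.
        System.out.println("4-feature annotation bytes: " + heapBytes(1, 4));
        // 100,000 such annotations.
        System.out.println("100,000 annotations bytes: " + heapBytes(100_000, 4));
    }
}
```

This only covers the int cells themselves, so it gives a lower bound on what a heap profiler would show.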

In my experience, the CAS can easily reach 10 to 20 times the size of the input document. If you have information-rich token annotations, that's not really surprising. (And this is without using JCas.) If you were to manually create Java objects that carry the same information, you would see roughly
the same kind of overhead.


Using these numbers can we account for the 9,300,000 bytes of integer arrays?

100,000 annotations of size 5 cells = 500,000 ints, which is exactly
the default heap size.  But with the Sofa FS this will exceed the
default heap size.  It will grow by another 500,000 (I think).

So that accounts for 1,000,000 ints = 4,000,000 bytes.

Where are the other 5,300,000?
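
The accounting above can be checked mechanically. The sketch below only restates the figures from this thread (100,000 annotations of 5 cells, a guessed further 500,000 ints of heap growth, 9,300,000 bytes observed); the class and method names are made up for illustration, and the heap-growth figure is the "I think" estimate, not a measured value.

```java
// Restates the int[] accounting from the thread as arithmetic.
class IntArrayAccounting {
    static final long ANNOTATIONS = 100_000;
    static final long CELLS_PER_ANNOTATION = 5;
    static final long HEAP_GROWTH_INTS = 500_000;  // guessed growth increment
    static final long OBSERVED_BYTES = 9_300_000;  // int[] bytes seen in the profile
    static final long BYTES_PER_INT = 4;

    // Ints explained by the annotations plus one heap growth step.
    static long accountedInts() {
        return ANNOTATIONS * CELLS_PER_ANNOTATION + HEAP_GROWTH_INTS;
    }

    static long accountedBytes() {
        return accountedInts() * BYTES_PER_INT;
    }

    // Bytes of int[] that the estimate does not explain.
    static long unaccountedBytes() {
        return OBSERVED_BYTES - accountedBytes();
    }

    public static void main(String[] args) {
        System.out.println("accounted ints:    " + accountedInts());
        System.out.println("accounted bytes:   " + accountedBytes());
        System.out.println("unaccounted bytes: " + unaccountedBytes());
    }
}
```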



Likewise, what about the 1,600,000 bytes of Integers?  The JCas hash
map only accounts for one per annotation, which in this case should
only be 400,000 bytes.
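
To make the size of that gap concrete, the sketch below divides the observed and expected Integer bytes by the annotation count. The numbers come from this thread; the class and method names are hypothetical.

```java
// Compares observed vs. expected Integer bytes per annotation.
class IntegerAccounting {
    static final long ANNOTATIONS = 100_000;
    static final long OBSERVED_INTEGER_BYTES = 1_600_000;
    static final long EXPECTED_INTEGER_BYTES = 400_000;

    static long observedBytesPerAnnotation() {
        return OBSERVED_INTEGER_BYTES / ANNOTATIONS;
    }

    static long expectedBytesPerAnnotation() {
        return EXPECTED_INTEGER_BYTES / ANNOTATIONS;
    }

    // How many times larger the observed figure is than the expected one.
    static long observedToExpectedRatio() {
        return OBSERVED_INTEGER_BYTES / EXPECTED_INTEGER_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("observed bytes per annotation: " + observedBytesPerAnnotation());
        System.out.println("expected bytes per annotation: " + expectedBytesPerAnnotation());
        System.out.println("ratio: " + observedToExpectedRatio());
    }
}
```

The observed figure works out to 16 bytes per annotation against an expected 4. That could mean four Integers per annotation, or a single Integer object whose real footprint, object header included, is closer to 16 bytes than 4. The second reading is an assumption about the JVM's object layout, not something established in this thread.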

Maybe it would be useful to get Kirk's test case so we can take a look
at where exactly the memory is being used.  I think it would need to
be attached to a JIRA issue with the "grant license to Apache" box
checked?

-Adam


