On 5/18/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
You can estimate data use on the heap as follows. Each FS uses at least one
int for the type information, plus whatever features it has. So a vanilla
annotation is 3 ints, one for the type, and one for the start and end features,
respectively. If you have two additional features, that's 5 ints, so 20 bytes.
If you use the JCas, you incur an additional overhead of a Java object for
each annotation. It's small, but I can't say off the top of my head how small
exactly. Plus, the JCas objects are held in a HashMap (or some such, Marshall
correct me if I'm wrong), which incurs additional memory overhead.
In my experience, the CAS can easily reach 10 to 20 times the size of the input
document. If you have information-rich token annotations, that's not really
surprising. (And this is without using JCas.) If you were to manually
create Java objects that carry the same information, you would see roughly
the same kind of overhead.
Using these numbers, can we account for the 9,300,000 bytes of integer arrays?
100,000 annotations of size 5 cells = 500,000 ints, which is exactly
the default heap size. But with the Sofa FS this will exceed the
default heap size. It will grow by another 500,000 (I think).
So that accounts for 1,000,000 ints = 4,000,000 bytes.
Where are the other 5,300,000?
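Spelled out as a small Java sketch, the accounting looks like this. The class name is hypothetical, and the 500,000-int growth step is only my guess from above, not a confirmed property of the CAS heap.

```java
// Hypothetical accounting sketch, not a UIMA API.
final class IntArrayAccounting {
    static final long BYTES_PER_INT = 4;

    // Ints used by the annotations themselves.
    static long annotationInts(long annotations, int cellsEach) {
        return annotations * cellsEach;
    }

    // Bytes of the observed int arrays that the estimate does not explain.
    static long unaccountedBytes(long observedBytes, long accountedInts) {
        return observedBytes - accountedInts * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        long observedBytes = 9_300_000L;                // reported size of the integer arrays
        long used = annotationInts(100_000L, 5);        // 500,000 ints, the default heap size
        long assumedGrowth = 500_000L;                  // assumption: one growth step of 500,000 ints
        long accountedInts = used + assumedGrowth;      // 1,000,000 ints
        System.out.println(accountedInts * BYTES_PER_INT);                  // 4000000
        System.out.println(unaccountedBytes(observedBytes, accountedInts)); // 5300000
    }
}
```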
Likewise, what about the 1,600,000 bytes of Integers? The JCas hash
map only accounts for one per annotation, which in this case should
only be 400,000 bytes.
Maybe it would be useful to get Kirk's test case so we can take a look
at where exactly the memory is being used. I think it would need to
be attached to a JIRA issue with the "grant license to Apache" box
checked?
-Adam