Hi Marshall,
> The indexes use int[] arrays.
>
> Kirk - what indexes do you have defined (if any)? Do you
> "addToIndexes..." any of the annotations you create?
Yes - I'm adding all annotations to the indexes.
If it helps, here's the source code for the annotator and the shim
application from which it is run:
http://www.mustardgrain.com/files/testcaseannotator.zip
Thanks for all the feedback!
Kirk
> -Marshall
>
> Adam Lally wrote:
> > On 5/18/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> >> You can estimate data use on the heap as follows. Each FS uses at
> >> least one int for the type information, plus whatever features it
> >> has. So a vanilla annotation is 3 ints, one for the type, and one
> >> for the start and end features, respectively. If you have two
> >> additional features, that's 5 ints, so 20 bytes. If you use the
> >> JCas, you incur an additional overhead of a Java object for each
> >> annotation. It's small, but I can't say off the top of my head how
> >> small exactly. Plus, the JCas objects are held in a HashMap (or
> >> some such, Marshall correct me if I'm wrong), which incurs
> >> additional memory overhead.
> >>
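A rough sketch of the estimate above, in Java. The class and method names are made up for illustration, and it only counts the int cells on the CAS heap: JCas cover objects, the map that holds them, and index arrays are not included.

```java
public class CasHeapEstimate {
    // Each heap cell is one Java int.
    static final int BYTES_PER_INT = 4;

    // One cell for the type, two for the start and end features,
    // plus one per additional feature (per the estimate in the thread).
    static int intsPerAnnotation(int extraFeatures) {
        return 1 + 2 + extraFeatures;
    }

    // Estimated heap bytes for a number of annotations of the same shape.
    static long heapBytes(long annotationCount, int extraFeatures) {
        return annotationCount * intsPerAnnotation(extraFeatures) * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        System.out.println("vanilla annotation, ints: " + intsPerAnnotation(0));
        System.out.println("two extra features, bytes: " + heapBytes(1, 2));
        System.out.println("100,000 such annotations, bytes: " + heapBytes(100_000, 2));
    }
}
```

With two extra features this gives 5 ints, or 20 bytes, per annotation, matching the figure quoted above.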
> >> In my experience, the CAS can easily reach 10 to 20 times the size
> >> of the input document. If you have information-rich token
> >> annotations, that's not really surprising. (And this is without
> >> using JCas.) Imagine you were to manually create Java objects that
> >> carry the same information; you would see roughly the same kind of
> >> overhead.
> >>
> >
> > Using these numbers can we account for the 9,300,000 bytes of
> > integer arrays?
> >
> > 100,000 annotations of size 5 cells = 500,000 ints, which is exactly
> > the default heap size. But with the Sofa FS this will exceed the
> > default heap size. It will grow by another 500,000 (I think).
> >
> > So that accounts for 1,000,000 ints = 4,000,000 bytes.
> >
> > Where are the other 5,300,000?
> >
> >
> >
> > Likewise, what about the 1,600,000 bytes of Integers? The JCAS hash
> > map only accounts for one per annotation, which in this case should
> > only be 400,000 bytes.
> >
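The accounting above can be written out as a small Java sketch. All figures are the ones quoted in this thread; the class and method names are hypothetical, and the 500,000-int growth step is the "(I think)" guess from the message, not a confirmed value.

```java
public class MemoryAccounting {
    static final long BYTES_PER_INT = 4;

    // Bytes held by the heap int[] after one growth step.
    static long expectedIntArrayBytes(long initialHeapInts, long growthInts) {
        return (initialHeapInts + growthInts) * BYTES_PER_INT;
    }

    // Bytes reported by the profiler that the estimate does not explain.
    static long unexplainedBytes(long observedBytes, long expectedBytes) {
        return observedBytes - expectedBytes;
    }

    // Average size per instance implied by a reported total.
    static long impliedBytesPerInstance(long observedBytes, long instances) {
        return observedBytes / instances;
    }

    public static void main(String[] args) {
        long annotations = 100_000;
        long usedInts = annotations * 5;                          // 500,000 ints
        long expected = expectedIntArrayBytes(500_000, 500_000);  // 4,000,000 bytes

        System.out.println("ints used by annotations: " + usedInts);
        System.out.println("expected int[] bytes: " + expected);
        System.out.println("unexplained int[] bytes: "
                + unexplainedBytes(9_300_000, expected));
        // Assumes one Integer per annotation, as suggested in the thread.
        System.out.println("implied bytes per Integer: "
                + impliedBytesPerInstance(1_600_000, annotations));
    }
}
```

Under the one-Integer-per-annotation assumption, the reported 1,600,000 bytes works out to 16 bytes per Integer instance rather than 4. Whether that matches the real per-object size depends on the JVM, so it would need checking against the profiler output.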
> > Maybe it would be useful to get Kirk's test case so we can take a
> > look at where exactly the memory is being used. I think it would
> > need to be attached to a JIRA issue with the "grant license to
> > Apache" box checked?
> >
> > -Adam
> >