Hi Marshall,

> The indexes use int[] arrays.
>
> Kirk - what indexes do you have defined (if any)?  Do you
> "addToIndexes..." any of the annotations you create?

Yes - I'm adding all annotations to the indexes.

If it helps, here's the source code for the annotator and the shim
application from which it is run:

    http://www.mustardgrain.com/files/testcaseannotator.zip

Thanks for all the feedback!

Kirk

> -Marshall
> 
> Adam Lally wrote:
> > On 5/18/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> >> You can estimate data use on the heap as follows.  Each FS uses at
> >> least one int for the type information, plus whatever features it
> >> has.  So a vanilla annotation is 3 ints, one for the type, and one
> >> for the start and end features, respectively.  If you have two
> >> additional features, that's 5 ints, so 20 bytes.  If you use the
> >> JCas, you incur an additional overhead of a Java object for each
> >> annotation.  It's small, but I can't say off the top of my head how
> >> small exactly.  Plus, the JCas objects are held in a HashMap (or
> >> some such, Marshall correct me if I'm wrong), which incurs
> >> additional memory overhead.
> >>
> >> In my experience, the CAS can easily reach 10 to 20 times the size
> >> of the input document.  If you have information-rich token
> >> annotations, that's not really surprising.  (And this is without
> >> using JCas).  If you were to manually create Java objects that
> >> carry the same information, you would see roughly the same kind of
> >> overhead.
> >>
> >
> > Using these numbers can we account for the 9,300,000 bytes of
> > integer arrays?
> >
> > 100,000 annotations of size 5 cells = 500,000 ints, which is exactly
> > the default heap size.  But with the Sofa FS this will exceed the
> > default heap size.  It will grow by another 500,000 (I think).
> >
> > So that accounts for 1,000,000 ints = 4,000,000 bytes.
> >
> > Where are the other 5,300,000?
> >
> >
> >
> > Likewise, what about the 1,600,000 bytes of Integers?  The JCAS hash
> > map only accounts for one per annotation, which in this case should
> > only be 400,000 bytes.
> >
> > Maybe it would be useful to get Kirk's test case so we can take a
> > look at where exactly the memory is being used.  I think it would
> > need to be attached to a JIRA issue with the grant license to Apache
> > box checked?
> >
> > -Adam
> >
> >
> 
> 
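The arithmetic in the quoted messages can be written out as a short Java sketch. It is only a back-of-the-envelope calculation, not UIMA code: the class and method names are hypothetical, it assumes each heap cell is a 4-byte int, and the growth step of 500,000 cells is the guess marked "(I think)" in the quoted message rather than a confirmed value.

```java
public final class CasHeapEstimate {
    // Assumption: each CAS heap cell is one 4-byte Java int.
    static final int BYTES_PER_INT = 4;
    // Per the quoted estimate: type + start + end for a plain annotation.
    static final int BASE_ANNOTATION_CELLS = 3;

    /** Cells used by one annotation with the given number of extra features. */
    static long cellsPerAnnotation(int extraFeatures) {
        return BASE_ANNOTATION_CELLS + extraFeatures;
    }

    /** Cells used by all annotations together. */
    static long heapCells(long annotations, int extraFeatures) {
        return annotations * cellsPerAnnotation(extraFeatures);
    }

    /** Converts a cell count to bytes. */
    static long bytes(long cells) {
        return cells * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        long used = heapCells(100_000, 2);      // 500,000 cells
        long growthGuess = 500_000;             // quoted guess, unconfirmed
        long allocated = used + growthGuess;    // 1,000,000 cells
        long accounted = bytes(allocated);      // 4,000,000 bytes
        long observed = 9_300_000L;             // int[] bytes reported in the thread

        System.out.println("bytes per annotation: " + bytes(cellsPerAnnotation(2)));
        System.out.println("accounted bytes:      " + accounted);
        System.out.println("unexplained bytes:    " + (observed - accounted));
    }
}
```

With the numbers from the thread this leaves 5,300,000 bytes of int[] unexplained, which is the gap the quoted message asks about.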
