Hi Thilo,

In addition to the impact on binary serialization performance, there will
also be XMI serialization issues. Heap location is currently used in CAS
merging for parallel processing steps, and will be used to implement a
delta-CAS transport model that only sends out to services the data they
require and only sends back new and modified data. The same design used for
delta CAS will allow us to give users details on CAS changes for every
processing step, when desired for debugging, with little or essentially no
extra overhead. These requirements may be easily handled in a new CAS
design, but we should take them into account in the redesign process, not
after an implementation.
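
To make the dependency on heap location concrete, here is a rough sketch of how a heap "mark" can separate new data from modified data. It is an illustration only, not the actual CASImpl code: the class DeltaHeapSketch and all of its methods are hypothetical names, and the heap is reduced to a plain int array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of mark-based delta tracking; not the UIMA heap. */
public class DeltaHeapSketch {
    private int[] heap = new int[16];
    private int top = 1;   // next free cell; address 0 is reserved as "null"
    private int mark = 1;  // cells at or above the mark were allocated after it
    private final TreeSet<Integer> modifiedOld = new TreeSet<>();

    /** Appends a structure with the given number of cells, returns its address. */
    public int allocate(int cells) {
        if (top + cells > heap.length) {
            heap = Arrays.copyOf(heap, Math.max(heap.length * 2, top + cells));
        }
        int addr = top;
        top += cells;
        return addr;
    }

    /** Writes a cell; writes below the mark are recorded as modifications. */
    public void set(int addr, int value) {
        if (addr < 1 || addr >= top) {
            throw new IndexOutOfBoundsException("bad heap address " + addr);
        }
        if (addr < mark) {
            modifiedOld.add(addr);
        }
        heap[addr] = value;
    }

    public int get(int addr) {
        return heap[addr];
    }

    /** Call before a processing step: everything allocated so far becomes "old". */
    public void createMark() {
        mark = top;
        modifiedOld.clear();
    }

    /** True if the address was allocated after the last mark. */
    public boolean isNew(int addr) {
        return addr >= mark && addr < top;
    }

    /** Addresses of pre-existing cells written since the last mark, ascending. */
    public List<Integer> modifiedOldCells() {
        return new ArrayList<>(modifiedOld);
    }
}
```

With this layout, the delta to send back after a step is the cell range from the mark to the top of the heap, plus the recorded old cells. A heap built from Java objects has no such ordering by address, so a redesign would need some other way to tell new from old.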

On the other hand, all serialization issues can be ignored in order to do
some performance testing with a new design. Java object creation may be much
faster, but there may be other issues. For example, my understanding is that
there is a fairly significant memory overhead per Java object that may
increase overall CAS space requirements, at least in some circumstances.
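
As a back-of-the-envelope illustration of that overhead, the sketch below compares the two representations. Everything in it is an assumption made for the example: the class and method names are hypothetical, the int-heap layout is taken to be one type-code cell plus one 4-byte cell per feature, and the object header size and alignment are parameters because they vary by JVM.

```java
/** Hypothetical footprint estimate; header size and alignment are assumptions. */
public class FsFootprintSketch {

    /** Int-array heap: one type-code cell plus one 4-byte cell per feature. */
    public static long intHeapBytes(int features) {
        return 4L * (1 + features);
    }

    /** Java object: header plus 4 bytes per field, rounded up to the alignment. */
    public static long javaObjectBytes(int features, int headerBytes, int alignment) {
        long raw = headerBytes + 4L * features;
        return ((raw + alignment - 1) / alignment) * alignment;
    }

    public static void main(String[] args) {
        int features = 3; // e.g. an annotation with sofa, begin, end
        System.out.println("int heap:    " + intHeapBytes(features) + " bytes");
        System.out.println("Java object: " + javaObjectBytes(features, 16, 8) + " bytes");
    }
}
```

The estimate leaves out the references that indexes and other FSs would hold to each object, so the real difference would have to be measured rather than computed.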

I'd be happy to participate in discussions, but will be unable to contribute
to coding for at least a couple of months.

Eddie

On 10/17/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
>
> I'm thinking about experimenting with alternative heap
> implementations in the CAS.  In particular, I would like
> to try out a heap impl that uses regular Java objects to
> represent feature structures, as opposed to our proprietary
> binary heap.
>
> Our current heap design was created when object creation
> in Java was very expensive.  I ran experiments at the time
> that showed that creating FSs the way we do today was about
> twice as fast as creating Java objects.  However, there
> are many reasons to run this experiment again today:
>
> * Object creation in Java is a lot faster today.  The speed
>    advantage may be very much reduced, or even gone
>    completely.
>
> * FS creation is not where a typical annotator spends its
>    time.  Only for annotators that create a lot of annotations
>    with little computation effort (such as tokenizers) is this
>    at all significant.
>
> * Our current heap implementation pre-allocates a lot of
>    memory.  This works relatively well for medium size CASes,
>    but it has disadvantages both for very small and very
>    large CASes.  When using Java objects to represent FSs,
>    we leave the memory allocation to the JVM, which seems
>    like the right thing to do.
>
> * We have no garbage collection on the heap.  Once created,
>    FSs stay there for the lifetime of the heap.
>    This is not a problem for most annotators, but there are
>    situations where this behavior is highly undesirable.
>    Using Java objects instead, we would benefit from the
>    garbage collector of the JVM.
>
> So here's the rub.  Before I even start with this, I would
> like to refactor the CAS implementation so I can see what
> I'm doing.  The CASImpl class has grown organically for many
> years now, and it's due for a major overhaul.  I will not
> change any APIs, of course, but I'll probably leave no stone
> unturned in the implementation.  Any objections to that?
>
> Secondly, I will need help with the CAS serialization.  The
> current binary serialization depends completely on the
> heap layout.  Eddie, would you have time to work with me
> on that?  I would like to make the serialization independent
> of the heap implementation and only rely on the low-level
> CAS APIs.  That might be a tiny bit slower (which is still
> to be determined), but it will give us better encapsulation
> and more flexibility with various heap implementations.
>
> Let me know what you think.
>
> --Thilo
>
