I'm thinking about experimenting with alternative heap
implementations in the CAS.  In particular, I would like
to try out a heap impl that uses regular Java objects to
represent feature structures, as opposed to our proprietary
binary heap.

Our current heap design was created when object creation
in Java was very expensive.  I ran experiments at the time
that showed that creating FSs the way we do today was about
twice as fast as creating Java objects.  However, there
are many reasons to run this experiment again today:

 * Object creation in Java is a lot faster today.  The speed
   advantage may be very much reduced, or even gone
   completely.

 * FS creation is not where a typical annotator spends its
   time.  It is significant only for annotators that create
   a lot of annotations with little computational effort
   (such as tokenizers).

 * Our current heap implementation pre-allocates a lot of
   memory.  This works relatively well for medium size CASes,
   but it has disadvantages both for very small and very
   large CASes.  When using Java objects to represent FSs,
   we leave the memory allocation to the JVM, which seems
   like the right thing to do.

 * We have no garbage collection on the heap.  Once created,
   FSs stay there for the lifetime of the heap.
   This is not a problem for most annotators, but there are
   situations where this behavior is highly undesirable.
   Using Java objects instead, we would benefit from the
   garbage collector of the JVM.
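
To make the comparison concrete, here is a rough sketch of the two
representations.  All names in it (IntArrayHeap, ObjectFS, the cell
layout of type code followed by feature slots) are made up for
illustration and are not the actual CAS classes or heap layout.

```java
/** Hypothetical int-array heap: each FS occupies cells [typeCode, feat0, feat1, ...]. */
final class IntArrayHeap {
    private int[] cells;
    private int next = 1; // address 0 is reserved to mean "null reference"

    IntArrayHeap(int initialCapacity) {
        // Memory is allocated up front, whether or not it is ever used.
        cells = new int[Math.max(initialCapacity, 2)];
    }

    /** Allocates an FS with the given number of feature slots and returns its address. */
    int createFS(int typeCode, int numFeatures) {
        int needed = next + 1 + numFeatures;
        if (needed > cells.length) {
            // Grow by copying; cells are never reclaimed once handed out.
            cells = java.util.Arrays.copyOf(cells, Math.max(needed, cells.length * 2));
        }
        int addr = next;
        cells[addr] = typeCode;
        next = needed;
        return addr;
    }

    int getType(int addr) { return cells[addr]; }
    void setInt(int addr, int feature, int value) { cells[addr + 1 + feature] = value; }
    int getInt(int addr, int feature) { return cells[addr + 1 + feature]; }
    int cellsInUse() { return next; }
}

/** Hypothetical object-based FS: allocation and reclamation are left to the JVM. */
final class ObjectFS {
    final int typeCode;
    private final int[] features;

    ObjectFS(int typeCode, int numFeatures) {
        this.typeCode = typeCode;
        this.features = new int[numFeatures];
    }

    void setInt(int feature, int value) { features[feature] = value; }
    int getInt(int feature) { return features[feature]; }
}

final class HeapSketchDemo {
    public static void main(String[] args) {
        IntArrayHeap heap = new IntArrayHeap(1024);
        int token = heap.createFS(7, 2);   // 7 is an arbitrary type code
        heap.setInt(token, 0, 0);          // e.g. a begin offset
        heap.setInt(token, 1, 5);          // e.g. an end offset

        ObjectFS sameToken = new ObjectFS(7, 2);
        sameToken.setInt(0, 0);
        sameToken.setInt(1, 5);

        System.out.println(heap.getInt(token, 1) == sameToken.getInt(1));
    }
}
```

In the array version, an FS that is no longer referenced still
occupies its cells until the whole heap is reset.  In the object
version, it becomes eligible for garbage collection as soon as
nothing refers to it.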

So here's the rub.  Before I even start with this, I would
like to refactor the CAS implementation so I can see what
I'm doing.  The CASImpl class has grown organically for many
years now, and it's due for a major overhaul.  I will not
change any APIs, of course, but I'll probably leave no stone
unturned in the implementation.  Any objections to that?

Secondly, I will need help with the CAS serialization.  The
current binary serialization depends completely on the
heap layout.  Eddie, would you have time to work with me
on that?  I would like to make the serialization independent
of the heap implementation and only rely on the low-level
CAS APIs.  That might be a tiny bit slower (which is still
to be determined), but it will give us better encapsulation
and more flexibility with various heap implementations.
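
As a rough sketch of what "independent of the heap implementation"
could look like: the serializer below only talks to a small accessor
interface and never sees how FSs are stored.  The interface FsAccess,
its methods, and the output format are all hypothetical and are not
the existing low-level CAS API or the current binary format.

```java
/** Hypothetical low-level view of the FSs in a CAS; any heap impl could provide it. */
interface FsAccess {
    int[] allRefs();                    // opaque references to every FS
    int typeOf(int ref);
    int featureCount(int ref);
    int intValue(int ref, int feature);
}

/** Writes FSs using only FsAccess, so it does not depend on the storage layout. */
final class PortableSerializer {
    static byte[] serialize(FsAccess cas) {
        java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
        java.io.DataOutputStream out = new java.io.DataOutputStream(bytes);
        try {
            int[] refs = cas.allRefs();
            out.writeInt(refs.length);
            for (int ref : refs) {
                out.writeInt(cas.typeOf(ref));
                int n = cas.featureCount(ref);
                out.writeInt(n);
                for (int i = 0; i < n; i++) {
                    out.writeInt(cas.intValue(ref, i));
                }
            }
            out.flush();
        } catch (java.io.IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new java.io.UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}

/** One possible backing store: a list of int arrays, each [typeCode, feat0, feat1, ...]. */
final class ListBackedStore implements FsAccess {
    private final java.util.List<int[]> fss = new java.util.ArrayList<>();

    /** Adds an FS and returns its reference (here simply the list index). */
    int add(int typeCode, int... features) {
        int[] fs = new int[features.length + 1];
        fs[0] = typeCode;
        System.arraycopy(features, 0, fs, 1, features.length);
        fss.add(fs);
        return fss.size() - 1;
    }

    @Override public int[] allRefs() {
        int[] refs = new int[fss.size()];
        for (int i = 0; i < refs.length; i++) refs[i] = i;
        return refs;
    }
    @Override public int typeOf(int ref) { return fss.get(ref)[0]; }
    @Override public int featureCount(int ref) { return fss.get(ref).length - 1; }
    @Override public int intValue(int ref, int feature) { return fss.get(ref)[feature + 1]; }
}
```

The extra method call per value is where the "tiny bit slower" would
come from.  Note that the sketch only handles int-valued features; a
real version would also have to map FS-valued features to references
that stay stable across heap implementations.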

Let me know what you think.

--Thilo
