I'm thinking about experimenting with alternative heap implementations in the CAS. In particular, I would like to try out a heap impl that uses regular Java objects to represent feature structures, as opposed to our proprietary binary heap.
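To make the idea a bit more concrete, here is a rough sketch of the two approaches behind one common interface. All names here (FsHeap, ArrayHeap, ObjectHeap, HeapDemo) are hypothetical and only for illustration; they are not the existing CASImpl classes or any UIMA API, and the flat int array is just an assumed stand-in for the binary heap.

```java
import java.util.Arrays;

// Hypothetical low-level heap abstraction. H is the handle type for an FS:
// an address for the array-based heap, an object reference for the object heap.
interface FsHeap<H> {
    H createFs(int typeCode, int featureCount);
    int getType(H fs);
    int getIntValue(H fs, int feature);
    void setIntValue(H fs, int feature, int value);
}

// Stand-in for a binary heap: one pre-allocated int array.
// FSs are never reclaimed; the array only grows.
final class ArrayHeap implements FsHeap<Integer> {
    private int[] cells = new int[1024]; // pre-allocated up front
    private int next = 1;                // address 0 is reserved as "null"

    public Integer createFs(int typeCode, int featureCount) {
        int addr = next;
        int needed = next + 1 + featureCount;
        if (needed > cells.length) {
            cells = Arrays.copyOf(cells, Math.max(needed, cells.length * 2));
        }
        cells[addr] = typeCode;          // first cell holds the type code
        next = needed;
        return addr;
    }

    public int getType(Integer fs) { return cells[fs]; }
    public int getIntValue(Integer fs, int feature) { return cells[fs + 1 + feature]; }
    public void setIntValue(Integer fs, int feature, int value) { cells[fs + 1 + feature] = value; }
}

// Object-based variant: each FS is a plain Java object, so allocation
// and garbage collection are left to the JVM.
final class ObjectHeap implements FsHeap<ObjectHeap.Fs> {
    static final class Fs {
        final int typeCode;
        final int[] values;
        Fs(int typeCode, int featureCount) {
            this.typeCode = typeCode;
            this.values = new int[featureCount];
        }
    }

    public Fs createFs(int typeCode, int featureCount) { return new Fs(typeCode, featureCount); }
    public int getType(Fs fs) { return fs.typeCode; }
    public int getIntValue(Fs fs, int feature) { return fs.values[feature]; }
    public void setIntValue(Fs fs, int feature, int value) { fs.values[feature] = value; }
}

final class HeapDemo {
    // Client code written only against FsHeap does not care about the layout.
    static <H> int roundTrip(FsHeap<H> heap) {
        H fs = heap.createFs(7, 2);
        heap.setIntValue(fs, 0, 40);
        heap.setIntValue(fs, 1, 2);
        return heap.getIntValue(fs, 0) + heap.getIntValue(fs, 1);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new ArrayHeap()));
        System.out.println(roundTrip(new ObjectHeap()));
    }
}
```

Code such as roundTrip, which only goes through the interface, works unchanged with either heap. The same property is what a heap-independent serialization would need.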
Our current heap design was created when object creation in Java was very expensive. I ran experiments at the time that showed that creating FSs the way we do today was about twice as fast as creating Java objects. However, there are many reasons to run this experiment again today:

* Object creation in Java is a lot faster today. The speed advantage may be very much reduced, or even gone completely.
* FS creation is not where a typical annotator spends its time. It is only significant for annotators that create a lot of annotations with little computational effort (such as tokenizers).
* Our current heap implementation pre-allocates a lot of memory. This works relatively well for medium-size CASes, but it has disadvantages for both very small and very large CASes. When using Java objects to represent FSs, we leave the memory allocation to the JVM, which seems like the right thing to do.
* We have no garbage collection on the heap. Once created, FSs stay there for the lifetime of the heap. This is not a problem for most annotators, but there are situations where this behavior is highly undesirable. Using Java objects instead, we would benefit from the garbage collector of the JVM.

So here's the rub. Before I even start with this, I would like to refactor the CAS implementation so I can see what I'm doing. The CASImpl class has grown organically for many years now, and it's due for a major overhaul. I will not change any APIs, of course, but I'll probably leave no stone unturned in the implementation. Any objections to that?

Secondly, I will need help with the CAS serialization. The current binary serialization depends completely on the heap layout. Eddie, would you have time to work with me on that? I would like to make the serialization independent of the heap implementation so that it relies only on the low-level CAS APIs. That might be a tiny bit slower (which is still to be determined), but it will give us better encapsulation and more flexibility with various heap implementations.

Let me know what you think.

--Thilo
