I have no basic objections to refactoring the CAS Impl, just some concerns. They are the obvious ones: space and time impacts, and reliability.
I was reminded of the importance of these just today when someone at lunch mentioned they were doing runs to collect statistical info for further processing, and one run was taking 11 hours.

I think one incentive for UIMA adoption has been the priority it has given to being both space and time efficient. The best outcome of your refactoring would be something that, in addition to making things conceptually clearer, sped things up and had a smaller footprint ;-)

Cheers. -Marshall

Thilo Goetz wrote:
> I'm thinking about experimenting with alternative heap
> implementations in the CAS. In particular, I would like
> to try out a heap impl that uses regular Java objects to
> represent feature structures, as opposed to our proprietary
> binary heap.
>
> Our current heap design was created when object creation
> in Java was very expensive. I ran experiments at the time
> that showed that creating FSs the way we do today was about
> twice as fast as creating Java objects. However, there
> are many reasons to run this experiment again today:
>
> * Object creation in Java is a lot faster today. The speed
>   advantage may be very much reduced, or even gone
>   completely.
>
> * FS creation is not where a typical annotator spends its
>   time. Only for annotators that create a lot of annotations
>   with little computation effort (such as tokenizers) is this
>   at all significant.
>
> * Our current heap implementation pre-allocates a lot of
>   memory. This works relatively well for medium size CASes,
>   but it has disadvantages both for very small and very
>   large CASes. When using Java objects to represent FSs,
>   we leave the memory allocation to the JVM, which seems
>   like the right thing to do.
>
> * We have no garbage collection on the heap. FSs that are
>   once created stay there for the lifetime of the heap.
>   This is not a problem for most annotators, but there are
>   situations where this behavior is highly undesirable.
>   Using Java objects instead, we would benefit from the
>   garbage collector of the JVM.
>
> So here's the rub. Before I even start with this, I would
> like to refactor the CAS implementation so I can see what
> I'm doing. The CASImpl class has grown organically for many
> years now, and it's due for a major overhaul. I will not
> change any APIs, of course, but I'll probably leave no stone
> unturned in the implementation. Any objections to that?
>
> Secondly, I will need help with the CAS serialization. The
> current binary serialization depends completely on the
> heap layout. Eddie, would you have time to work with me
> on that? I would like to make the serialization independent
> of the heap implementation and only rely on the low-level
> CAS APIs. That might be a tiny bit slower (which is still
> to be determined), but it will give us better encapsulation
> and more flexibility with various heap implementations.
>
> Let me know what you think.
>
> --Thilo
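
To make the trade-off in the quoted message concrete, here is a minimal sketch of the two representations being compared. It assumes, purely for illustration, a heap that packs feature structures into one pre-allocated int array; the real CASImpl layout is not described in this thread. All names (HeapSketch, IntHeap, ObjectFs) are hypothetical and are not UIMA APIs.

```java
import java.util.Arrays;

// Hypothetical sketch, not UIMA code: two ways to store a feature structure (FS).
final class HeapSketch {

    // Variant 1: FSs packed into one int array and addressed by offset.
    // Memory is allocated up front and nothing is ever reclaimed.
    static final class IntHeap {
        private int[] cells;
        private int next = 1; // cell 0 is reserved so that address 0 can mean "null"

        IntHeap(int initialCapacity) {
            cells = new int[initialCapacity];
        }

        // Reserve one cell for the type code plus one cell per feature.
        int create(int typeCode, int numFeatures) {
            int needed = next + 1 + numFeatures;
            if (needed > cells.length) {
                cells = Arrays.copyOf(cells, Math.max(needed, cells.length * 2));
            }
            int addr = next;
            cells[addr] = typeCode;
            next = needed;
            return addr;
        }

        void set(int addr, int feature, int value) { cells[addr + 1 + feature] = value; }
        int get(int addr, int feature) { return cells[addr + 1 + feature]; }

        int allocatedCells() { return cells.length; }
        int usedCells() { return next; } // includes the reserved cell 0
    }

    // Variant 2: each FS is an ordinary Java object. The JVM decides when to
    // allocate, and unreachable FSs become eligible for garbage collection.
    static final class ObjectFs {
        final int typeCode;
        final int[] features;

        ObjectFs(int typeCode, int numFeatures) {
            this.typeCode = typeCode;
            this.features = new int[numFeatures];
        }
    }

    public static void main(String[] args) {
        IntHeap heap = new IntHeap(1000);
        int token = heap.create(7, 2); // type 7 with two int features (say, begin and end)
        heap.set(token, 0, 10);
        heap.set(token, 1, 15);
        System.out.println("int heap: used " + heap.usedCells()
                + " of " + heap.allocatedCells() + " cells");

        ObjectFs fs = new ObjectFs(7, 2);
        fs.features[0] = 10;
        fs.features[1] = 15;
        System.out.println("object FS: begin=" + fs.features[0] + ", end=" + fs.features[1]);
    }
}
```

In the first variant a small CAS still pays for the whole array and a large one pays for array copies on growth, which is the pre-allocation concern raised in the quoted message. The second variant trades that for per-object overhead, so the space and time impact would need to be measured rather than assumed.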
