Eddie Epstein wrote:
>>> Copying the behavior would be appropriate, unless there is some other way to
>>> easily distinguish pre-existing FS.
>> To my mind, the place to keep track of something like that
>> is the serialization code. It has to iterate over the whole
>> CAS anyway and can do that kind of tracking. It seems wrong
>> to put that kind of requirement on the heap implementation.
>
> Doing this in the serialization code will not work. There is no way for this
> to efficiently detect which existing FS have had feature values changed.
> More importantly, it eliminates the ability to track CAS changes for
> colocated annotators, something that has been repeatedly asked for to improve
> debugging and to track provenance.

Now wait a minute. The current heap implementation can't do that either.
All we were talking about was to know which FSs were *added* since the
CAS was serialized. That is something you can do now by remembering the
top heap position, and I am planning to support this with the new heap
impl as well. Knowing what FSs were *modified* is an entirely different
proposition.

>>> Given no warning against doing this from an application, the fact that it
>>> works and that it is fairly intuitive to do so means that there are likely
>>> existing UIMA applications doing it. Of course we all are willing to break
>>> existing user code when it gets in the way of some neat improvement :)
>> So you agree that maintaining this behavior is not a requirement?
>
> No, not without further discussion. Maybe we should call for a vote?
>
>>> Blob serialization, like the binary serialization used between C++ and Java,
>>> leaves the Java Cas with a string heap rather than a string list. It would
>>> be easy to change blob deserialization to recreate a string list instead,
>>> and measure the performance difference.
>> I'll take your word for it, though I still don't see what this
>> has to do with what we were talking about. In the new heap I'm
>> thinking about, there will be no such thing as a String heap or
>> list. Strings will just be referenced directly from the objects
>> representing FSs.
>
> It sounds like you have no concern for binary serialization performance.

I don't know what makes you say that. That is not the impression I
wanted to give, at least ;-) I'll admit it's not my primary concern.
To repeat: I simply do not understand what you mean to show by your
string heap vs. string list test. I'm not unwilling, just intellectually
incapable.

> Changing the heap design to enable garbage collection at the expense of
> seriously degrading performance for existing users that are strongly
> dependent on efficient CAS serialization does not sound viable.

I agree completely. If this turns out to seriously degrade performance
for *any* important scenario, it's out. However, I'm not sure it will
degrade performance, not even for binary serialization. Otherwise I
wouldn't be suggesting this.

--Thilo

> How about re-implementing the heap as a pluggable component so that the
> existing design would still be available?
>
> Eddie
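
For readers following the thread, the "remember the top heap position" idea from the reply above can be illustrated with a small sketch. This is not the UIMA CAS heap or its API: `ToyHeap`, `mark()`, and `addedSince()` are hypothetical names, and each FS is reduced to a single int value so the point stays visible. It assumes an append-only heap where an FS address is simply its position.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy append-only heap. Hypothetical illustration only, not UIMA code. */
final class ToyHeap {
    // Each FS is modeled as one int "feature value"; its address is its index.
    private final List<Integer> cells = new ArrayList<>();

    /** Allocates a new FS at the top of the heap and returns its address. */
    int add(int value) {
        cells.add(value);
        return cells.size() - 1;
    }

    /** Overwrites a feature value in place; the heap top does not move. */
    void set(int addr, int value) {
        cells.set(addr, value);
    }

    /** Current top of heap; a caller would remember this at serialization time. */
    int mark() {
        return cells.size();
    }

    /** Addresses of all FSs allocated at or above the remembered mark. */
    List<Integer> addedSince(int mark) {
        List<Integer> added = new ArrayList<>();
        for (int addr = mark; addr < cells.size(); addr++) {
            added.add(addr);
        }
        return added;
    }
}
```

In this sketch, calling `set()` on an address below the mark leaves `addedSince()` unchanged, which is the distinction drawn in the thread: a remembered heap top identifies *added* FSs, while detecting *modified* ones would need some additional mechanism.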
