> > Doing this in the serialization code will not work. There is no way
> > for this to efficiently detect which existing FS have had feature
> > values changed. More importantly, it eliminates the ability to track
> > CAS changes for colocated annotators, something that has been
> > repeatedly asked for to improve debugging and to track provenance.
>
> Now wait a minute. The current heap implementation can't
> do that either. All we were talking about was to know which
> FSs were *added* since the CAS was serialized. That is
> something you can do now by remembering the top heap position,
> and I am planning to support this with the new heap impl as
> well. Knowing what FSs were *modified* is an entirely different
> proposition.
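To make the distinction in the quoted text concrete, here is a minimal sketch of the "remember the top heap position" idea. It assumes a simplified int-array heap; the class `SketchHeap` and all of its methods are hypothetical names for illustration, not the actual UIMA CAS heap API. It shows that a saved heap top is enough to tell which FSs were added after serialization, while a write to an older FS leaves no trace.

```java
import java.util.Arrays;

/** Hypothetical, simplified int-array heap; not the actual UIMA CAS heap. */
final class SketchHeap {
    private int[] cells = new int[16];
    private int top = 1;   // cell 0 is reserved as the null reference
    private int mark = 1;  // heap top recorded at the last serialization

    /** Allocates an FS occupying `size` cells and returns its address. */
    int addFs(int size) {
        if (top + size > cells.length) {
            cells = Arrays.copyOf(cells, Math.max(cells.length * 2, top + size));
        }
        int addr = top;
        top += size;
        return addr;
    }

    /** Called when the CAS is serialized: remembers the current heap top. */
    void markSerialized() {
        mark = top;
    }

    /** True if the FS at `addr` was allocated after the last serialization. */
    boolean isNew(int addr) {
        return addr >= mark;
    }

    /** Writes a feature value. Nothing here records that an old FS changed. */
    void setFeature(int addr, int offset, int value) {
        cells[addr + offset] = value;
    }
}
```

With this layout, `isNew` answers the "added" question with one comparison. Answering the "modified" question would need extra bookkeeping inside `setFeature`, which is the change discussed in the reply below.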
Right, recording the fact that old FS have been modified will require
changes. The ability to recognize old FS quickly is key, thanks. I was
mainly commenting that serialization was not a good place to do this stuff.

> >>> Given no warning against doing this from an application, the fact
> >>> that it works and that it is fairly intuitive to do so means that
> >>> there are likely existing UIMA applications doing it. Of course we
> >>> all are willing to break existing user code when it gets in the way
> >>> of some neat improvement :)
> >> So you agree that maintaining this behavior is not a requirement?
> >
> > No, not without further discussion.
>
> Maybe we should call for a vote?

Sure. What exactly are we voting for, breaking this just for remote
annotators, or for all annotators?

> >>> Blob serialization, like the binary serialization used between C++
> >>> and Java, leaves the Java Cas with a string heap rather than a
> >>> string list. It would be easy to change blob deserialization to
> >>> recreate a string list instead, and measure the performance
> >>> difference.
> >> I'll take your word for it, though I still don't see what this
> >> has to do with what we were talking about. In the new heap I'm
> >> thinking about, there will be no such thing as a String heap or
> >> list. Strings will just be referenced directly from the objects
> >> representing FSs.
> >
> > It sounds like you have no concern for binary serialization performance.
>
> I don't know what makes you say that. That is not the
> impression I wanted to give, at least ;-) I'll admit
> it's not my primary concern. To repeat: I simply do not
> understand what you mean to show by your string heap vs.
> string list test. I'm not unwilling, just intellectually
> incapable.
My concern is that deserializing FS into a single int array is much faster
than creating individual Java objects for each FS; same for strings, so
doing a simple experiment with strings would be relevant. Maybe I am
completely confused?

> > Changing the heap design to enable garbage collection at the expense of
> > seriously degrading performance for existing users that are strongly
> > dependent on efficient CAS serialization does not sound viable.
>
> I agree completely. If this turns out to seriously degrade
> performance for *any* important scenario, it's out. However,
> I'm not sure it will degrade performance, not even for binary
> serialization. Otherwise I wouldn't be suggesting this.

Oh good, my worries are over :)

Eddie
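For reference, the two deserialization strategies contrasted in the message above (one int array for the whole heap versus one Java object per FS) can be sketched as follows. This is only an illustration under stated assumptions: the wire format `[type, slotCount, slots...]`, the class `DeserializeSketch`, and `FsObject` are all hypothetical and not the UIMA binary format. The sketch shows where the per-FS allocations come from and makes no measurement claim.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the two layouts; not UIMA code. */
final class DeserializeSketch {

    /** Hypothetical per-FS object: a type code plus its feature slots. */
    static final class FsObject {
        final int type;
        final int[] slots;

        FsObject(int type, int[] slots) {
            this.type = type;
            this.slots = slots;
        }
    }

    /** Flat layout: the whole heap is copied into one int[] in a single step. */
    static int[] toFlatHeap(int[] wire) {
        return wire.clone();
    }

    /**
     * Object layout: each record in the assumed wire format is
     * [type, slotCount, slots...]; one object and one array are
     * allocated per FS.
     */
    static List<FsObject> toObjects(int[] wire) {
        List<FsObject> out = new ArrayList<>();
        int i = 0;
        while (i < wire.length) {
            int type = wire[i];
            int n = wire[i + 1];
            int[] slots = new int[n];
            System.arraycopy(wire, i + 2, slots, 0, n);
            out.add(new FsObject(type, slots));
            i += 2 + n;
        }
        return out;
    }
}
```

Timing both methods on a large input (and doing the same for a string heap versus individual `String` objects) would be one way to run the simple experiment suggested in the message.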
