Thank you very much for looking into this. It is really appreciated, and I think it touches upon something important: data migration in general.
I agree that some of these solutions can appear specific, awkward or complex, and that the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data, because this shows up directly in the cost of processing and storage. I found the compressed binary format to be much better than XMI in this regard, although I have to admit it has been a while since I benchmarked this.

Given that UIMA already has a well-described type system, maybe it just lacks a way to describe schema evolution, similar to Apache Avro or other serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.

Regarding XMI, I would like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimal example that avoids involving the parts that have to do with our pipeline and type system; I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.
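To make the default mapping Marshall describes in the quoted text below a bit more concrete, here is a minimal sketch in plain Python. It does not use any UIMA API: the `FS` class and the functions `map_by_name` and `migrate_array` are hypothetical names I made up for illustration, and a feature structure is modelled simply as a type name plus a dictionary of slots.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FS:
    """Hypothetical stand-in for a feature structure: a type name plus named slots."""
    type_name: str
    features: Dict[str, Any] = field(default_factory=dict)


def map_by_name(source: FS, target_type: str, target_features: List[str]) -> FS:
    """Change the type and copy equal-named features; all other features are dropped."""
    copied = {name: source.features[name]
              for name in target_features
              if name in source.features}  # features missing in the source stay unset
    return FS(target_type, copied)


def migrate_array(elements: List[FS], source_type: str,
                  target_type: str, target_features: List[str]) -> List[FS]:
    """Map only the array elements of the changed element type; leave others untouched."""
    return [map_by_name(e, target_type, target_features)
            if e.type_name == source_type else e
            for e in elements]


# The case from this thread: FeatureAnnotation (5 slots) becomes FeatureRecord (2 slots).
old = FS("FeatureAnnotation",
         {"sofaRef": 1, "begin": 0, "end": 5, "name": "color", "value": "red"})
new_array = migrate_array([old], "FeatureAnnotation", "FeatureRecord", ["name", "value"])
print(new_array[0])
```

This only covers the mapping itself. A real deserializer would still have to restrict it to the instances referenced by the FSArray whose element type changed, which is the part Marshall points out as complex.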
Cheers,
Mario

> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>
> In this case, the original looks kind-of like this:
>
> Container
>    features -> FSArray of FeatureAnnotation each of which
>                has 5 slots: sofaRef, begin, end, name, value
>
> the new TypeSystem has
>
> Container
>    features -> FSArray of FeatureRecord each of which
>                has 2 slots: name, value
>
> The deserializer code would need some way to decide how to
>    1) create an FSArray of FeatureRecord,
>    2) for each element,
>       map the FeatureAnnotation to a new instance of FeatureRecord
>
> I guess I could imagine a default mapping (for item 2 above) of
>    1) change the type from A to B
>    2) set equal-named features from A to B, drop other features
>
> This mapping would need to apply to a subset of the A's and B's, namely, only
> those referenced by the FSArray where the element type changed. Seems complex
> and specific to this use case though.
>
> -Marshall
>
>
> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>> I can reproduce the problem, and see what is happening. The deserialization
>>> code compares the two type systems, and allows for some mismatches (things
>>> present in one and not in the other), but it doesn't allow for having a
>>> feature whose range (value) is type XXXX in one type system and type YYYY
>>> in the other.
>>> See CasTypeSystemMapper lines 299 - 315.
>> Without reading the code in detail - could we not relax this check such that
>> the element type of FSArrays is not checked and the code simply assumes that
>> the source element type has the same features as the target element type
>> (with the usual lenient handling of missing features in the target type)? -
>> Kind of a "duck typing" approach?
>>
>> Cheers,
>>
>> -- Richard
