Re: Migrating type system of form 6 compressed CAS binaries

Mario Juric Tue, 17 Sep 2019 00:38:07 -0700

Thank you very much for looking into this. It is really appreciated, and I 
think it touches upon something important: data migration in general.

I agree that some of these solutions can appear specific, awkward or complex, 
and that the way forward is not to address our use case alone. I think there is 
a need for a compact and efficient binary serialization format for the CAS when 
dealing with large amounts of data, because this is directly visible in the 
costs of processing and storage. I found the compressed binary format to be 
much better than XMI in this regard, although I have to admit it’s been a while 
since I benchmarked this. Given that UIMA already has a well-described type 
system, maybe it just lacks a way to describe schema evolution, similar to what 
Apache Avro and other serialisation frameworks offer. I think a more formal 
approach to data migration would be critical to any larger operational setup.

Regarding XMI, I would like to provide some input on the problem we are 
observing, so that it can be solved. We are primarily using XMI for 
inspection/debugging purposes, and we are sometimes not able to do this because 
of this error. I will try to extract a minimal example to avoid involving parts 
that have to do with our pipeline and type system, and I think this would also 
be the best way to illustrate that the problem exists outside of this context. 
However, converting all our data to XMI first in order to do the conversion in 
our example would not be very practical for us, because it involves a large 
amount of data.

Cheers,
Mario

> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
> 
> In this case, the original looks kind-of like this:
> 
> Container
>    features -> FSArray of FeatureAnnotation each of which
>                              has 5 slots: sofaRef, begin, end, name, value
> 
> the new TypeSystem has
> 
> Container
>    features -> FSArray of FeatureRecord each of which
>                               has 2 slots: name, value
> 
> The deserializer code would need some way to decide how to
>    1) create an FSArray of FeatureRecord,
>    2) for each element,
>       map the FeatureAnnotation to a new instance of FeatureRecord
> 
> I guess I could imagine a default mapping (for item 2 above) of
>   1) change the type from A to B
>   2) set equal-named features from A to B, drop other features
> 
> This mapping would need to apply to a subset of the A's and B's, namely, only
> those referenced by the FSArray where the element type changed.  Seems complex
> and specific to this use case though.
> 
> -Marshall
> 
> 
> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>> I can reproduce the problem, and see what is happening.  The deserialization
>>> code compares the two type systems, and allows for some mismatches (things
>>> present in one and not in the other), but it doesn't allow for having a
>>> feature whose range (value) is type XXXX in one type system and type YYYY
>>> in the other.
>>> See CasTypeSystemMapper lines 299 - 315.
>> Without reading the code in detail - could we not relax this check such that 
>> the element type of FSArrays is not checked and the code simply assumes that 
>> the source element type has the same features as the target element type 
>> (with the usual lenient handling of missing features in the target type)? - 
>> Kind of a "duck typing" approach?
>> 
>> Cheers,
>> 
>> -- Richard
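
For illustration, the default mapping discussed in the quoted messages (change the element type, copy equal-named features, drop the rest) could be sketched roughly as below. This is only a toy model in plain Python and does not use the UIMA API or touch the form 6 deserializer; the `FS` class and the `migrate_element` / `migrate_container` helpers are hypothetical names introduced for this sketch, and feature structures are modelled as simple dictionaries.

```python
from dataclasses import dataclass


@dataclass
class FS:
    """Toy stand-in for a feature structure: a type name plus named features."""
    type_name: str
    features: dict


def migrate_element(source, target_type, target_features):
    """Change the type and keep only features the target type also declares."""
    kept = {name: value
            for name, value in source.features.items()
            if name in target_features}
    return FS(target_type, kept)


def migrate_container(container, array_feature, target_type, target_features):
    """Apply the mapping only to elements referenced by one array feature."""
    migrated = [migrate_element(element, target_type, target_features)
                for element in container.features[array_feature]]
    new_features = dict(container.features)  # leave the source container untouched
    new_features[array_feature] = migrated
    return FS(container.type_name, new_features)


# Old layout: Container.features -> array of FeatureAnnotation (5 slots)
old = FS("Container", {"features": [
    FS("FeatureAnnotation",
       {"sofaRef": 1, "begin": 0, "end": 5, "name": "color", "value": "red"}),
]})

# New layout: Container.features -> array of FeatureRecord (2 slots)
new = migrate_container(old, "features", "FeatureRecord", {"name", "value"})
```

The mapping is scoped to the elements reachable through the one array feature whose element type changed, which matches the restriction mentioned in the quoted message; any other instances of the old type would be left alone.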
