Hi Marshall, I added the following feature request to Apache Jira:
https://issues.apache.org/jira/browse/UIMA-6128

Hope it makes sense :) Thanks a lot for the help, it’s appreciated.

Cheers,
Mario

> On 23 Sep 2019, at 16:33, Marshall Schor <[email protected]> wrote:
>
> Re: serializing using XML 1.1
>
> This was not thought of when setting up the CasIOUtils.
>
> The way it was done (above) was using some more "primitive/lower level" APIs,
> rather than the CasIOUtils.
>
> Please open a Jira ticket for this, with perhaps some suggestions on how it
> might be specified in the CasIOUtils APIs.
>
> Thanks! -Marshall
>
> On 9/23/2019 3:45 AM, Mario Juric wrote:
>> Hi Marshall,
>>
>> Thanks for the thorough and excellent investigation.
>>
>> We are looking into possible normalisation/cleanup of whitespace/invisible
>> characters, but I don’t think we can necessarily do the same for some of
>> the other characters. It sounds to me, though, that serialising to XML 1.1
>> could also be a simple fix right now, but can this be configured?
>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s
>> something you have working in your branch.
>>
>> Regarding the other problem: it seems that the JDK bug is fixed from Java 9
>> onwards. Do you think switching to a more recent Java version would make a
>> difference? I think we can also try this out ourselves when we look into
>> migrating to UIMA 3 once our current deliveries are complete. We would
>> also like to switch to Java 11, and like the UIMA 3 migration it will
>> require some thorough testing.
>>
>> Cheers,
>> Mario
>>
>>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>>
>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>> XML char, which is the \u0002.
>>>
>>> This is in part because the XML version being used is XML 1.0.
>>>
>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>> XML 1.1:
>>>
>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>   try {
>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>   } finally {
>>>     out.close();
>>>   }
>>>
>>> This succeeds and serializes this using XML 1.1.
>>>
>>> I also tried serializing some doc text which includes "\u77987". That did
>>> not serialize correctly.
>>> I could see it while tracing, down in the innards of some internal SAX
>>> Java code
>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>> where it was "correct" in the Java string.
>>>
>>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>>
>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in UTF-8 starts
>>> with a 3-byte encoding:
>>>   1110 xxxx 10xx xxxx 10xx xxxx
>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy
>>> to me.
>>>
>>> But I think it's out of our hands - it's somewhere deep in the SAX
>>> transform Java code.
>>>
>>> I looked for a bug report and found some:
>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>
>>> Bottom line is, I think, to clean out these characters early :-).
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>> here's an idea.
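If the test string above was written as the Java literal "\u77987" (an assumption; the original test source isn't shown in the thread), the "fishy" bytes are actually expected: a \uXXXX escape consumes exactly four hex digits, so the literal denotes U+7798 followed by the digit '7'. A small standalone check:

```java
import java.nio.charset.StandardCharsets;

public class EscapeCheck {
    public static void main(String[] args) {
        // \uXXXX in Java source consumes exactly four hex digits, so this
        // literal denotes TWO characters: U+7798 followed by the digit '7'.
        String s = "\u77987";
        System.out.println(s.length());        // 2
        System.out.println((int) s.charAt(0)); // 30616 (0x7798)

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // U+7798 encodes as the 3 bytes E7 9E 98; '7' is 0x37 - exactly
        // the 4-byte sequence observed in the serialized output.
        System.out.println(hex.toString().trim()); // E7 9E 98 37
    }
}
```

Under that assumption the SAX serializer was faithful to its input, and the intended supplementary character would need to be written as "\uD80C\uDCA3" instead.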
>>>> If you have a string with the surrogate pair 𓂣 at position 10, and
>>>> you have some Java code which is iterating through the string and
>>>> getting the code point at each character offset, then that code will
>>>> produce:
>>>>
>>>>   at position 10: the code point 77987
>>>>   at position 11: the code point 56483
>>>>
>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>> assuming you have a character at each position, if you don't handle
>>>> surrogate pairs.
>>>>
>>>> The 56483 is just the lower bits of the surrogate pair, added to 0xDC00
>>>> (see https://tools.ietf.org/html/rfc2781).
>>>>
>>>> I worry that even tools like the CVD or similar may not work properly,
>>>> since they're not designed to handle surrogate pairs, I think, so I
>>>> have no idea if they would work well enough for you.
>>>>
>>>> I'll poke around some more to see if I can enable the conversion for
>>>> document strings.
>>>>
>>>> -Marshall
>>>>
>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>> Thanks Marshall,
>>>>>
>>>>> Encoding the characters like you suggest should work just fine for us,
>>>>> as long as we can serialize and deserialise the XMI so that we can
>>>>> open the content in a tool like the CVD or similar. These characters
>>>>> are just noise from the original content that happens to remain in the
>>>>> CAS, but they are not visible in our final output because they are
>>>>> basically filtered out one way or the other by downstream components.
>>>>> They become a problem, though, when they make it more difficult for us
>>>>> to inspect the content.
>>>>>
>>>>> Regarding the feature name issue: might you have an idea why we are
>>>>> getting a different XMI output for the same character in our actual
>>>>> pipeline, where it results in "𓂣�"? I investigated the value in the
>>>>> debugger again, and as you illustrate it is also just a single code
>>>>> point with the value 77987.
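The per-offset iteration pitfall described above reproduces in a few lines. This is a standalone sketch; the string content is made up, with the surrogate pair placed at index 10 to match the description:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Ten ASCII chars, then the surrogate pair encoding U+130A3 (𓂣).
        String s = "0123456789" + "\uD80C\uDCA3";

        // Per-offset access treats the pair as two separate values:
        System.out.println(s.codePointAt(10)); // 77987 (the real code point)
        System.out.println(s.codePointAt(11)); // 56483 (0xDC00 + the low 10 bits)

        // Surrogate-aware iteration advances by charCount() per code point
        // and sees the pair exactly once:
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        System.out.println(count); // 11, although s.length() is 12
    }
}
```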
>>>>> We are simply not able to load this XMI because of this, but
>>>>> unfortunately I couldn’t reproduce it in my small example.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>>
>>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>>> properties, due to that Unicode character.
>>>>>>
>>>>>> Here's what I see: the FeatureRecord "name" field is set to one
>>>>>> Unicode character that must be encoded as two Java characters.
>>>>>>
>>>>>> When output, it shows up in the XMI as
>>>>>>   <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>
>>>>>> which seems correct. The name field has only one (extended) Unicode
>>>>>> character (taking two Java characters to represent), due to setting
>>>>>> it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>
>>>>>> When read in, the name field is assigned to a String; that string
>>>>>> says it has a length of 2 (but that's because it takes two Java chars
>>>>>> to represent this character). If you have the name string in a
>>>>>> variable "n", then System.out.println(n.codePointAt(0)) shows
>>>>>> (correctly) 77987, and n.codePointCount(0, n.length()) is, as
>>>>>> expected, 1.
>>>>>>
>>>>>> So the string value serialization and deserialization seem to be
>>>>>> "working".
>>>>>>
>>>>>> The other code - for the sofa (document) serialization - is throwing
>>>>>> that error because, as currently designed, the serialization code
>>>>>> checks for these kinds of characters and, if found, throws
>>>>>> that exception.
>>>>>> The code checking is in XMLUtils.checkForNonXmlCharacters.
>>>>>>
>>>>>> This is because it's highly likely that "fixing this" in the same way
>>>>>> as the other would result in hard-to-diagnose future errors: the
>>>>>> subject-of-analysis string is processed with begin/end offsets all
>>>>>> over the place, under the assumption that none of the characters are
>>>>>> coded as surrogate pairs.
>>>>>>
>>>>>> We could change the code to output these like the name, as, e.g., 𓂣
>>>>>>
>>>>>> Would that help in your case, or do you imagine other kinds of things
>>>>>> might break (due to begin/end offsets no longer being on character
>>>>>> boundaries, for example)?
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I investigated the XMI issue as promised, and these are my findings.
>>>>>>>
>>>>>>> It is related to special Unicode characters that are not handled by
>>>>>>> XMI serialisation, and there seem to be two distinct categories of
>>>>>>> issues we have identified so far:
>>>>>>>
>>>>>>> 1) The document text of the CAS contains special Unicode characters.
>>>>>>> 2) Annotations with String features have values containing special
>>>>>>>    Unicode characters.
>>>>>>>
>>>>>>> In both cases we could for sure solve the problem if we did a better
>>>>>>> clean-up job upstream, but with the amount and variety of data we
>>>>>>> receive there is always a chance something passes through, and some
>>>>>>> of it may in the general case even be valid content.
>>>>>>>
>>>>>>> The first case is easy to reproduce with the OddDocumentText example
>>>>>>> I attached. In this example the text is a snippet taken from the
>>>>>>> content of a parsed XML document.
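The suggestion above (emit such characters the way the name attribute comes out) could look roughly like the following. This is an illustrative sketch, not the UIMA serializer's actual code; escapeSupplementary is a made-up helper that replaces each supplementary code point with a numeric character reference:

```java
public class NcrEscape {
    // Replace each supplementary code point with a numeric character
    // reference so the emitted text stays within the BMP.
    static String escapeSupplementary(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // 2 for supplementary, 1 otherwise
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSupplementary("name=\uD80C\uDCA3")); // name=&#77987;
    }
}
```

Note this shares the drawback Marshall mentions: downstream consumers relying on begin/end character offsets would still see the two-char surrogate pair in the CAS, not the escaped form.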
>>>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>>>> example, because I am getting slightly different output to what I
>>>>>>> have in our real setup. The OddFeatureText example is based on the
>>>>>>> simple type system I shared previously. The name value of a
>>>>>>> FeatureRecord contains special Unicode characters that I found in a
>>>>>>> similar data structure in our actual CAS. The value comes from an
>>>>>>> external knowledge base holding some noisy strings, which in this
>>>>>>> case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>> using the small example, it only outputs the first of the two
>>>>>>> characters in "\uD80C\uDCA3", which yields the value "𓂣" in the
>>>>>>> XMI, but in our actual setup both character values are written, as
>>>>>>> "𓂣�". This means that the attached example will for some reason
>>>>>>> parse the XMI again, but it will not work in the case where both
>>>>>>> characters are written the way we experience it. The XMI can be
>>>>>>> manually changed so that both character values are included the way
>>>>>>> it happens in our output, and in that case a SAXParserException
>>>>>>> happens.
>>>>>>>
>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser
>>>>>>> to handle any of this, but it will be good to know in any case :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thank you very much for looking into this. It is really
>>>>>>>> appreciated, and I think it touches upon something important, which
>>>>>>>> is about data migration in general.
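The SAXParserException described above is consistent with the XML spec: U+130A3 itself is a legal XML 1.0 character, but the surrogate code points (D800-DFFF) are not, even when written as numeric character references. A quick check with the JDK's default parser; this is a sketch, and the element and attribute names are invented:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NcrValidity {
    // Returns true if the XML fragment parses, false on a SAX error.
    static boolean parses(String xml) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. invalid character reference
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A reference to the full code point is a valid XML 1.0 character...
        System.out.println(parses("<r name='&#77987;'/>"));         // true
        // ...but a reference to a lone surrogate half is not.
        System.out.println(parses("<r name='&#77987;&#56483;'/>")); // false
    }
}
```

This suggests the real pipeline is emitting the two UTF-16 code units separately rather than as one code point, which no conformant parser will accept back.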
>>>>>>>> I agree that some of these solutions can appear specific, awkward,
>>>>>>>> or complex, and that the way forward is not to address our use case
>>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>>> serialization format for the CAS when dealing with large amounts of
>>>>>>>> data, because this is directly visible in the costs of processing
>>>>>>>> and storage, and I found the compressed binary format to be much
>>>>>>>> better than XMI in this regard, although I have to admit it’s been
>>>>>>>> a while since I benchmarked this. Given that UIMA already has a
>>>>>>>> well-described type system, maybe it just lacks a way to describe
>>>>>>>> schema evolution similar to Apache Avro or similar serialisation
>>>>>>>> frameworks. I think a more formal approach to data migration would
>>>>>>>> be critical to any larger operational setup.
>>>>>>>>
>>>>>>>> Regarding XMI, I’d like to provide some input on the problem we are
>>>>>>>> observing, so that it can be solved. We are primarily using XMI for
>>>>>>>> inspection/debugging purposes, and we are sometimes not able to do
>>>>>>>> this because of this error. I will try to extract a minimal example
>>>>>>>> to avoid involving parts that have to do with our pipeline and type
>>>>>>>> system, and I think this would also be the best way to illustrate
>>>>>>>> that the problem exists outside of this context. However,
>>>>>>>> converting all our data to XMI first in order to do the conversion
>>>>>>>> in our example would not be very practical for us, because it
>>>>>>>> involves a large amount of data.
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> In this case, the original looks kind of like this:
>>>>>>>>>
>>>>>>>>>   Container
>>>>>>>>>     features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>>                 has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>
>>>>>>>>> The new TypeSystem has:
>>>>>>>>>
>>>>>>>>>   Container
>>>>>>>>>     features -> FSArray of FeatureRecord, each of which
>>>>>>>>>                 has 2 slots: name, value
>>>>>>>>>
>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>>   2) for each element, map the FeatureAnnotation to a new instance
>>>>>>>>>      of FeatureRecord.
>>>>>>>>>
>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of:
>>>>>>>>>   1) change the type from A to B
>>>>>>>>>   2) set equal-named features from A to B, drop other features
>>>>>>>>>
>>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>>> namely only those referenced by the FSArray where the element type
>>>>>>>>> changed. Seems complex and specific to this use case, though.
>>>>>>>>>
>>>>>>>>> -Marshall
>>>>>>>>>
>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>> I can reproduce the problem, and see what is happening.
>>>>>>>>>>> The deserialization code compares the two type systems and
>>>>>>>>>>> allows for some mismatches (things present in one and not in the
>>>>>>>>>>> other), but it doesn't allow a feature whose range (value) is
>>>>>>>>>>> type XXXX in one type system and type YYYY in the other.
>>>>>>>>>>> See CasTypeSystemMapper, lines 299-315.
>>>>>>>>>>
>>>>>>>>>> Without reading the code in detail: could we not relax this check
>>>>>>>>>> such that the element type of FSArrays is not checked, and the
>>>>>>>>>> code simply assumes that the source element type has the same
>>>>>>>>>> features as the target element type (with the usual lenient
>>>>>>>>>> handling of missing features in the target type)? Kind of a
>>>>>>>>>> "duck typing" approach?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> -- Richard
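The default mapping sketched in the thread (change the type, copy equal-named features, drop the rest) can be illustrated independently of the CAS APIs. Modeling feature structures as plain maps is an editorial simplification; only the slot names come from the thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LenientMapping {
    // Copy only the features that exist in the target type; drop the rest.
    static Map<String, Object> mapTo(Map<String, Object> source,
                                     Set<String> targetFeatures) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatures.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue());
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // A FeatureAnnotation-like FS with 5 slots...
        Map<String, Object> featureAnnotation = new LinkedHashMap<>();
        featureAnnotation.put("sofaRef", 1);
        featureAnnotation.put("begin", 0);
        featureAnnotation.put("end", 4);
        featureAnnotation.put("name", "colour");
        featureAnnotation.put("value", "1.0");

        // ...mapped onto a FeatureRecord-like target with 2 slots.
        Map<String, Object> featureRecord =
                mapTo(featureAnnotation, Set.of("name", "value"));
        System.out.println(featureRecord); // {name=colour, value=1.0}
    }
}
```

In the real deserializer this lenient copy would additionally have to be restricted to the FSArrays whose element type changed, which is the part Marshall flags as complex.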
