Yes, makes sense, thanks for posting the Jira. If no one else steps up to work on this, I'll probably take a look in a few days. -Marshall
On 9/24/2019 6:47 AM, Mario Juric wrote: > Hi Marshall, > > I added the following feature request to Apache Jira: > > https://issues.apache.org/jira/browse/UIMA-6128 > > Hope it makes sense :) > > Thanks a lot for the help, it’s appreciated. > > Cheers, > Mario > > > > > > > > > > > > > >> On 23 Sep 2019, at 16:33 , Marshall Schor <[email protected]> wrote: >> >> Re: serializing using XML 1.1 >> >> This was not thought of, when setting up the CasIOUtils. >> >> The way it was done (above) was using some more "primitive/lower level" APIs, >> rather than the CasIOUtils. >> >> Please open a Jira ticket for this, with perhaps some suggestions on how it >> might be specified in the CasIOUtils APIs. >> >> Thanks! -Marshall >> >> On 9/23/2019 3:45 AM, Mario Juric wrote: >>> Hi Marshall, >>> >>> Thanks for the thorough and excellent investigation. >>> >>> We are looking into possible normalisation/cleanup of whitespace/invisible >>> characters, but I don’t think we can necessarily do the same for some of >>> the other characters. It sounds to me though that serialising to XML 1.1 >>> could also be a simple fix right now, but can this be configured? >>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s >>> something you have working in your branch. >>> >>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 >>> and after. Do you think switching to a more recent Java version would make >>> a difference? I think we can also try this out ourselves when we look into >>> migrating to UIMA 3 once our current deliveries are complete. We also like >>> to switch to Java 11, and like UIMA 3 migration it will require some >>> thorough testing. >>> >>> Cheers, >>> Mario >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>>> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote: >>>> >>>> In the test "OddDocumentText", this produces a "throw" due to an invalid >>>> xml >>>> char, which is the \u0002. 
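For reference, the reason \u0002 is rejected comes down to the Char productions of the two XML versions, which XML 1.1 relaxed. A minimal sketch of the two checks (the class and method names here are illustrative only, not UIMA or JDK API):

```java
// Which code points the two XML versions accept as characters.
// Ranges follow the Char productions of the XML 1.0 and 1.1 specs;
// the helper names are illustrative, not part of any library.
public class XmlCharCheck {

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean validXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean validXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(validXml10(0x0002)); // false: rejected by XML 1.0
        System.out.println(validXml11(0x0002)); // true:  accepted by XML 1.1
    }
}
```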
>>>> >>>> This is in part because the xml version being used is xml 1.0. >>>> >>>> XML 1.1 expanded the set of valid characters to include \u0002. >>>> >>>> Here's a snip from the XmiCasSerializerTest class which serializes with >>>> xml 1.1: >>>> >>>> XmiCasSerializer xmiCasSerializer = new >>>> XmiCasSerializer(jCas.getTypeSystem()); >>>> OutputStream out = new FileOutputStream(new File >>>> ("odd-doc-txt-v11.xmi")); >>>> try { >>>> XMLSerializer xml11Serializer = new XMLSerializer(out); >>>> xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1"); >>>> xmiCasSerializer.serialize(jCas.getCas(), >>>> xml11Serializer.getContentHandler()); >>>> } >>>> finally { >>>> out.close(); >>>> } >>>> >>>> This succeeds and serializes this using xml 1.1. >>>> >>>> I also tried serializing some doc text which includes U+130A3 (decimal 77987). That did >>>> not >>>> serialize correctly. >>>> I could see it in the code while tracing up to some point down in the >>>> innards of >>>> some internal >>>> sax java code >>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize where >>>> it was >>>> "Correct" in the Java string. >>>> >>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837. >>>> >>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte >>>> encoding: >>>> 1110 xxxx 10xx xxxx 10xx xxxx >>>> >>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me. >>>> >>>> But I think it's out of our hands - it's somewhere deep in the sax >>>> transform >>>> java code. >>>> >>>> I looked for a bug report and found some >>>> https://bugs.openjdk.java.net/browse/JDK-8058175 >>>> >>>> Bottom line is, I think, to clean out these characters early :-) . >>>> >>>> -Marshall >>>> >>>> >>>> On 9/20/2019 1:28 PM, Marshall Schor wrote: >>>>> here's an idea. 
>>>>> >>>>> If you have a string, with the surrogate pair 𓂣 at position 10, and >>>>> you >>>>> have some Java code, which is iterating through the string and getting the >>>>> code-point at each character offset, then that code will produce: >>>>> >>>>> at position 10: the code-point 77987 >>>>> at position 11: the code-point 56483 >>>>> >>>>> Of course, it's a "bug" to iterate through a string of characters, >>>>> assuming you >>>>> have characters at each point, if you don't handle surrogate pairs. >>>>> >>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 >>>>> (see >>>>> https://tools.ietf.org/html/rfc2781 ) >>>>> >>>>> I worry that even tools like the CVD or similar may not work properly, >>>>> since >>>>> they're not designed to handle surrogate pairs, I think, so I have no >>>>> idea if >>>>> they would work well enough for you. >>>>> >>>>> I'll poke around some more to see if I can enable the conversion for >>>>> document >>>>> strings. >>>>> >>>>> -Marshall >>>>> >>>>> On 9/20/2019 11:09 AM, Mario Juric wrote: >>>>>> Thanks Marshall, >>>>>> >>>>>> Encoding the characters like you suggest should work just fine for us as >>>>>> long as we can serialize and deserialise the XMI, so that we can open >>>>>> the content in a tool like the CVD or similar. These characters are just >>>>>> noise from the original content that happen to remain in the CAS, but >>>>>> they are not visible in our final output because they are basically >>>>>> filtered out one way or the other by downstream components. They become >>>>>> a problem though when they make it more difficult for us to inspect the >>>>>> content. >>>>>> >>>>>> Regarding the feature name issue: Might you have an idea why we are >>>>>> getting a different XMI output for the same character in our actual >>>>>> pipeline, where it results in "𓂣�”? 
I investigated the >>>>>> value in the debugger again, and like you are illustrating it is also >>>>>> just a single codepoint with the value 77987. We are simply not able to >>>>>> load this XMI because of this, but unfortunately I couldn’t reproduce it >>>>>> in my small example. >>>>>> >>>>>> Cheers, >>>>>> Mario >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote: >>>>>>> >>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, >>>>>>> due to >>>>>>> that unicode character. >>>>>>> >>>>>>> Here's what I see: The FeatureRecord "name" field is set to a >>>>>>> 1-unicode-character, that must be encoded as 2 java characters. >>>>>>> >>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord >>>>>>> xmi:id="18" >>>>>>> name="𓂣" value="1.0"/> >>>>>>> which seems correct. The name field only has 1 (extended)unicode >>>>>>> character >>>>>>> (taking 2 Java characters to represent), >>>>>>> due to setting it with this code: String oddName = "\uD80C\uDCA3"; >>>>>>> >>>>>>> When read in, the name field is assigned to a String, that string says >>>>>>> it has a >>>>>>> length of 2 (but that's because it takes 2 java chars to represent this >>>>>>> char). >>>>>>> If you have the name string in a variable "n", and do >>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987. >>>>>>> n.codePointCount(0, n.length()) is, as expected, 1. >>>>>>> >>>>>>> So, the string value serialization and deserialization seems to be >>>>>>> "working". >>>>>>> >>>>>>> The other code - for the sofa (document) serialization, is throwing >>>>>>> that error, >>>>>>> because as currently designed, the >>>>>>> serialization code checks for these kinds of characters, and if found >>>>>>> throws >>>>>>> that exception. 
The code checking is >>>>>>> in XMLUtils.checkForNonXmlCharacters >>>>>>> >>>>>>> This is because it's highly likely that "fixing this" in the same way >>>>>>> as the >>>>>>> other, would result in hard-to-diagnose >>>>>>> future errors, because the subject of analysis string is processed with >>>>>>> begin / >>>>>>> end offset all over the place, and makes >>>>>>> the assumption that the characters are all not coded as surrogate pairs. >>>>>>> >>>>>>> We could change the code to output these like the name, as, e.g., >>>>>>> 𓂣 >>>>>>> >>>>>>> Would that help in your case, or do you imagine other kinds of things >>>>>>> might >>>>>>> break (due to begin/end offsets no longer >>>>>>> being on character boundaries, for example). >>>>>>> >>>>>>> -Marshall >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I investigated the XMI issue as promised and these are my findings. >>>>>>>> >>>>>>>> It is related to special unicode characters that are not handled by XMI >>>>>>>> serialisation, and there seems to be two distinct categories of issues >>>>>>>> we have >>>>>>>> identified so far. >>>>>>>> >>>>>>>> 1) The document text of the CAS contains special unicode characters >>>>>>>> 2) Annotations with String features have values containing special >>>>>>>> unicode >>>>>>>> characters >>>>>>>> >>>>>>>> In both cases we could for sure solve the problem if we did a better >>>>>>>> clean up >>>>>>>> job upstream, but with the amount and variety of data we receive there >>>>>>>> is >>>>>>>> always a chance something passes through, and some of it may in the >>>>>>>> general >>>>>>>> case even be valid content. >>>>>>>> >>>>>>>> The first case is easy to reproduce with the OddDocumentText example I >>>>>>>> attached. In this example the text is a snippet taken from the content >>>>>>>> of a >>>>>>>> parsed XML document. 
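To make the surrogate-pair behaviour in this thread concrete: the hieroglyph U+130A3 takes two Java chars, counts as one code point (77987), and its correct UTF-8 form is four bytes. A standalone sketch using only the JDK, no UIMA:

```java
import java.nio.charset.StandardCharsets;

// Demonstrates how the supplementary character U+130A3 behaves in Java:
// two chars (a surrogate pair), one code point, four UTF-8 bytes.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD80C\uDCA3"; // U+130A3, the hieroglyph from the thread

        System.out.println(s.length());                      // 2 Java chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.codePointAt(0));                // 77987 (0x130A3)
        System.out.println(s.codePointAt(1));                // 56483: the low surrogate alone,
                                                             // what naive per-char iteration sees

        // Correct UTF-8 is the 4-byte sequence F0 93 82 A3, not the 3-byte
        // E7 9E 98 seen from the broken serializer path described above.
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // F09382A3
    }
}
```

This reproduces the numbers quoted earlier in the thread: code-point-aware code sees one character, 77987, while per-char iteration sees 77987 followed by 56483.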
>>>>>>>> >>>>>>>> The other case was not possible to reproduce with the OddFeatureText >>>>>>>> example, >>>>>>>> because I am getting slightly different output to what I have in our >>>>>>>> real >>>>>>>> setup. The OddFeatureText example is based on the simple type system I >>>>>>>> shared >>>>>>>> previously. The name value of a FeatureRecord contains special unicode >>>>>>>> characters that I found in a similar data structure in our actual CAS. >>>>>>>> The >>>>>>>> value comes from an external knowledge base holding some noisy >>>>>>>> strings, which >>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to >>>>>>>> XMI >>>>>>>> using the small example it only outputs the first of the two >>>>>>>> characters in >>>>>>>> "\uD80C\uDCA3”, which yields the value "𓂣” in the XMI, but in >>>>>>>> our >>>>>>>> actual setup both character values are written as "𓂣�”. >>>>>>>> This >>>>>>>> means that the attached example will for some reason parse the XMI >>>>>>>> again, but >>>>>>>> it will not work in the case where both characters are written the way >>>>>>>> we >>>>>>>> experience it. The XMI can be manually changed, so that both character >>>>>>>> values >>>>>>>> are included the way it happens in our output, and in this case a >>>>>>>> SAXParserException happens. >>>>>>>> >>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to >>>>>>>> handle >>>>>>>> any of this, but it will be good to know in any case :) >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Mario >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected] >>>>>>>>> <mailto:[email protected]> <mailto:[email protected] >>>>>>>>> <mailto:[email protected]>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Thank you very much for looking into this. 
It is really appreciated >>>>>>>>> and I >>>>>>>>> think it touches upon something important, which is about data >>>>>>>>> migration in >>>>>>>>> general. >>>>>>>>> >>>>>>>>> I agree that some of these solutions can appear specific, awkward or >>>>>>>>> complex >>>>>>>>> and the way forward is not to address our use case alone. I think >>>>>>>>> there is a >>>>>>>>> need for a compact and efficient binary serialization format for the >>>>>>>>> CAS when >>>>>>>>> dealing with large amounts of data because this is directly visible >>>>>>>>> in costs >>>>>>>>> of processing and storing, and I found the compressed binary format >>>>>>>>> to be >>>>>>>>> much better than XMI in this regard, although I have to admit it’s >>>>>>>>> been a >>>>>>>>> while since I benchmarked this. Given that UIMA already has a well >>>>>>>>> described >>>>>>>>> type system then maybe it just lacks a way to describe schema >>>>>>>>> evolution >>>>>>>>> similar to Apache Avro or similar serialisation frameworks. I think a >>>>>>>>> more >>>>>>>>> formal approach to data migration would be critical to any larger >>>>>>>>> operational >>>>>>>>> setup. >>>>>>>>> >>>>>>>>> Regarding XMI I like to provide some input to the problem we are >>>>>>>>> observing, >>>>>>>>> so that it can be solved. We are primarily using XMI for >>>>>>>>> inspection/debugging >>>>>>>>> purposes, and we are sometimes not able to do this because of this >>>>>>>>> error. I >>>>>>>>> will try to extract a minimum example to avoid involving parts that >>>>>>>>> has to do >>>>>>>>> with our pipeline and type system, and I think this would also be the >>>>>>>>> best >>>>>>>>> way to illustrate that the problem exists outside of this context. >>>>>>>>> However, >>>>>>>>> converting all our data to XMI first in order to do the conversion in >>>>>>>>> our >>>>>>>>> example would not be very practical for us, because it involves a >>>>>>>>> large >>>>>>>>> amount of data. 
>>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Mario >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected] >>>>>>>>>> <mailto:[email protected]> >>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote: >>>>>>>>>> >>>>>>>>>> In this case, the original looks kind-of like this: >>>>>>>>>> >>>>>>>>>> Container >>>>>>>>>> features -> FSArray of FeatureAnnotation each of which >>>>>>>>>> has 5 slots: sofaRef, begin, end, name, >>>>>>>>>> value >>>>>>>>>> >>>>>>>>>> the new TypeSystem has >>>>>>>>>> >>>>>>>>>> Container >>>>>>>>>> features -> FSArray of FeatureRecord each of which >>>>>>>>>> has 2 slots: name, value >>>>>>>>>> >>>>>>>>>> The deserializer code would need some way to decide how to >>>>>>>>>> 1) create an FSArray of FeatureRecord, >>>>>>>>>> 2) for each element, >>>>>>>>>> map the FeatureAnnotation to a new instance of FeatureRecord >>>>>>>>>> >>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of >>>>>>>>>> 1) change the type from A to B >>>>>>>>>> 2) set equal-named features from A to B, drop other features >>>>>>>>>> >>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, >>>>>>>>>> namely, only >>>>>>>>>> those referenced by the FSArray where the element type changed. >>>>>>>>>> Seems complex >>>>>>>>>> and specific to this use case though. >>>>>>>>>> >>>>>>>>>> -Marshall >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote: >>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected] >>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote: >>>>>>>>>>>> I can reproduce the problem, and see what is happening. 
The >>>>>>>>>>>> deserialization >>>>>>>>>>>> code compares the two type systems, and allows for some mismatches >>>>>>>>>>>> (things >>>>>>>>>>>> present in one and not in the other), but it doesn't allow for >>>>>>>>>>>> having a >>>>>>>>>>>> feature >>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY >>>>>>>>>>>> in the >>>>>>>>>>>> other. >>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315. >>>>>>>>>>> Without reading the code in detail - could we not relax this check >>>>>>>>>>> such >>>>>>>>>>> that the element type of FSArrays is not checked and the code simply >>>>>>>>>>> assumes that the source element type has the same features as the >>>>>>>>>>> target >>>>>>>>>>> element type (with the usual lenient handling of missing features >>>>>>>>>>> in the >>>>>>>>>>> target type)? - Kind of a "duck typing" approach? >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> >>>>>>>>>>> -- Richard >
