Hi Marshall, I added the following feature request to Apache Jira:
https://issues.apache.org/jira/browse/UIMA-6128

Hope it makes sense :) Thanks a lot for the help, it’s appreciated.

Cheers,
Mario

> On 23 Sep 2019, at 16:33, Marshall Schor <[email protected]> wrote:
>
> Re: serializing using XML 1.1
>
> This was not thought of when setting up the CasIOUtils.
>
> The way it was done (above) was using some more "primitive/lower level" APIs,
> rather than the CasIOUtils.
>
> Please open a Jira ticket for this, with perhaps some suggestions on how it
> might be specified in the CasIOUtils APIs.
>
> Thanks! -Marshall
>
> On 9/23/2019 3:45 AM, Mario Juric wrote:
>> Hi Marshall,
>>
>> Thanks for the thorough and excellent investigation.
>>
>> We are looking into possible normalisation/cleanup of whitespace/invisible
>> characters, but I don’t think we can necessarily do the same for some of
>> the other characters. It sounds to me, though, that serialising to XML 1.1
>> could also be a simple fix right now, but can this be configured?
>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s
>> something you have working in your branch.
>>
>> Regarding the other problem: it seems that the JDK bug is fixed from Java 9
>> onwards. Do you think switching to a more recent Java version would make a
>> difference? I think we can also try this out ourselves when we look into
>> migrating to UIMA 3 once our current deliveries are complete. We would
>> also like to switch to Java 11, and like the UIMA 3 migration it will
>> require some thorough testing.
>>
>> Cheers,
>> Mario
>>
>>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>>
>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>> XML char, which is the \u0002.
>>>
>>> This is in part because the XML version being used is XML 1.0.
>>>
>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>> XML 1.1:
>>>
>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>   try {
>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>   } finally {
>>>     out.close();
>>>   }
>>>
>>> This succeeds and serializes this using XML 1.1.
>>>
>>> I also tried serializing some doc text which includes "\u77987". That did
>>> not serialize correctly.
>>> I could see it while tracing, down in the innards of some internal SAX
>>> Java code
>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>> where it was "correct" in the Java string.
>>>
>>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>>
>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in UTF-8 starts
>>> with a 3-byte encoding:
>>>   1110 xxxx 10xx xxxx 10xx xxxx
>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy
>>> to me.
>>>
>>> But I think it's out of our hands - it's somewhere deep in the SAX
>>> transform Java code.
>>>
>>> I looked for a bug report and found some:
>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>
>>> Bottom line is, I think, to clean out these characters early :-).
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>> here's an idea.
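If the test string above was written as the Java literal "\u77987" (an assumption; the original test source isn't shown in the thread), the "fishy" bytes are actually expected: a \uXXXX escape consumes exactly four hex digits, so the literal denotes U+7798 followed by the digit '7'. A small standalone check:

```java
import java.nio.charset.StandardCharsets;

public class EscapeCheck {
    public static void main(String[] args) {
        // \uXXXX in Java source consumes exactly four hex digits, so this
        // literal denotes TWO characters: U+7798 followed by the digit '7'.
        String s = "\u77987";
        System.out.println(s.length());        // 2
        System.out.println((int) s.charAt(0)); // 30616 (0x7798)

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // U+7798 encodes as the 3 bytes E7 9E 98; '7' is 0x37 - exactly
        // the 4-byte sequence observed in the serialized output.
        System.out.println(hex.toString().trim()); // E7 9E 98 37
    }
}
```

Under that assumption the SAX serializer was faithful to its input, and the intended supplementary character would need to be written as "\uD80C\uDCA3" instead.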
>>>> If you have a string with the surrogate pair 𓂣 at position 10, and
>>>> you have some Java code which is iterating through the string and
>>>> getting the code point at each character offset, then that code will
>>>> produce:
>>>>
>>>>   at position 10: the code point 77987
>>>>   at position 11: the code point 56483
>>>>
>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>> assuming you have a character at each position, if you don't handle
>>>> surrogate pairs.
>>>>
>>>> The 56483 is just the lower bits of the surrogate pair, added to 0xDC00
>>>> (see https://tools.ietf.org/html/rfc2781).
>>>>
>>>> I worry that even tools like the CVD or similar may not work properly,
>>>> since they're not designed to handle surrogate pairs, I think, so I
>>>> have no idea if they would work well enough for you.
>>>>
>>>> I'll poke around some more to see if I can enable the conversion for
>>>> document strings.
>>>>
>>>> -Marshall
>>>>
>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>> Thanks Marshall,
>>>>>
>>>>> Encoding the characters like you suggest should work just fine for us,
>>>>> as long as we can serialize and deserialise the XMI so that we can
>>>>> open the content in a tool like the CVD or similar. These characters
>>>>> are just noise from the original content that happens to remain in the
>>>>> CAS, but they are not visible in our final output because they are
>>>>> basically filtered out one way or the other by downstream components.
>>>>> They become a problem, though, when they make it more difficult for us
>>>>> to inspect the content.
>>>>>
>>>>> Regarding the feature name issue: might you have an idea why we are
>>>>> getting a different XMI output for the same character in our actual
>>>>> pipeline, where it results in "𓂣�"? I investigated the value in the
>>>>> debugger again, and as you illustrate it is also just a single code
>>>>> point with the value 77987.
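The per-offset iteration pitfall described above reproduces in a few lines. This is a standalone sketch; the string content is made up, with the surrogate pair placed at index 10 to match the description:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Ten ASCII chars, then the surrogate pair encoding U+130A3 (𓂣).
        String s = "0123456789" + "\uD80C\uDCA3";

        // Per-offset access treats the pair as two separate values:
        System.out.println(s.codePointAt(10)); // 77987 (the real code point)
        System.out.println(s.codePointAt(11)); // 56483 (0xDC00 + the low 10 bits)

        // Surrogate-aware iteration advances by charCount() per code point
        // and sees the pair exactly once:
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        System.out.println(count); // 11, although s.length() is 12
    }
}
```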
>>>>> We are simply not able to load this XMI because of this, but
>>>>> unfortunately I couldn’t reproduce it in my small example.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>>
>>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>>> properties, due to that Unicode character.
>>>>>>
>>>>>> Here's what I see: the FeatureRecord "name" field is set to one
>>>>>> Unicode character that must be encoded as two Java characters.
>>>>>>
>>>>>> When output, it shows up in the XMI as
>>>>>>   <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>
>>>>>> which seems correct. The name field has only one (extended) Unicode
>>>>>> character (taking two Java characters to represent), due to setting
>>>>>> it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>
>>>>>> When read in, the name field is assigned to a String; that string
>>>>>> says it has a length of 2 (but that's because it takes two Java chars
>>>>>> to represent this character). If you have the name string in a
>>>>>> variable "n", then System.out.println(n.codePointAt(0)) shows
>>>>>> (correctly) 77987, and n.codePointCount(0, n.length()) is, as
>>>>>> expected, 1.
>>>>>>
>>>>>> So the string value serialization and deserialization seem to be
>>>>>> "working".
>>>>>>
>>>>>> The other code - for the sofa (document) serialization - is throwing
>>>>>> that error because, as currently designed, the serialization code
>>>>>> checks for these kinds of characters and, if found, throws
>>>>>> that exception.
>>>>>> The code checking is in XMLUtils.checkForNonXmlCharacters.
>>>>>>
>>>>>> This is because it's highly likely that "fixing this" in the same way
>>>>>> as the other would result in hard-to-diagnose future errors: the
>>>>>> subject-of-analysis string is processed with begin/end offsets all
>>>>>> over the place, under the assumption that none of the characters are
>>>>>> coded as surrogate pairs.
>>>>>>
>>>>>> We could change the code to output these like the name, as, e.g., 𓂣
>>>>>>
>>>>>> Would that help in your case, or do you imagine other kinds of things
>>>>>> might break (due to begin/end offsets no longer being on character
>>>>>> boundaries, for example)?
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I investigated the XMI issue as promised, and these are my findings.
>>>>>>>
>>>>>>> It is related to special Unicode characters that are not handled by
>>>>>>> XMI serialisation, and there seem to be two distinct categories of
>>>>>>> issues we have identified so far:
>>>>>>>
>>>>>>> 1) The document text of the CAS contains special Unicode characters.
>>>>>>> 2) Annotations with String features have values containing special
>>>>>>>    Unicode characters.
>>>>>>>
>>>>>>> In both cases we could for sure solve the problem if we did a better
>>>>>>> clean-up job upstream, but with the amount and variety of data we
>>>>>>> receive there is always a chance something passes through, and some
>>>>>>> of it may in the general case even be valid content.
>>>>>>>
>>>>>>> The first case is easy to reproduce with the OddDocumentText example
>>>>>>> I attached. In this example the text is a snippet taken from the
>>>>>>> content of a parsed XML document.
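The suggestion above (emit such characters the way the name attribute comes out) could look roughly like the following. This is an illustrative sketch, not the UIMA serializer's actual code; escapeSupplementary is a made-up helper that replaces each supplementary code point with a numeric character reference:

```java
public class NcrEscape {
    // Replace each supplementary code point with a numeric character
    // reference so the emitted text stays within the BMP.
    static String escapeSupplementary(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // 2 for supplementary, 1 otherwise
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSupplementary("name=\uD80C\uDCA3")); // name=&#77987;
    }
}
```

Note this shares the drawback Marshall mentions: downstream consumers relying on begin/end character offsets would still see the two-char surrogate pair in the CAS, not the escaped form.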
>>>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>>>> example, because I am getting slightly different output to what I
>>>>>>> have in our real setup. The OddFeatureText example is based on the
>>>>>>> simple type system I shared previously. The name value of a
>>>>>>> FeatureRecord contains special Unicode characters that I found in a
>>>>>>> similar data structure in our actual CAS. The value comes from an
>>>>>>> external knowledge base holding some noisy strings, which in this
>>>>>>> case is a hieroglyph entity. However, when I write the CAS to XMI
>>>>>>> using the small example, it only outputs the first of the two
>>>>>>> characters in "\uD80C\uDCA3", which yields the value "𓂣" in the
>>>>>>> XMI, but in our actual setup both character values are written, as
>>>>>>> "𓂣�". This means that the attached example will for some reason
>>>>>>> parse the XMI again, but it will not work in the case where both
>>>>>>> characters are written the way we experience it. The XMI can be
>>>>>>> manually changed so that both character values are included the way
>>>>>>> it happens in our output, and in that case a SAXParserException
>>>>>>> happens.
>>>>>>>
>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser
>>>>>>> to handle any of this, but it will be good to know in any case :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thank you very much for looking into this. It is really
>>>>>>>> appreciated, and I think it touches upon something important, which
>>>>>>>> is about data migration in general.
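The SAXParserException described above is consistent with the XML spec: U+130A3 itself is a legal XML 1.0 character, but the surrogate code points (D800-DFFF) are not, even when written as numeric character references. A quick check with the JDK's default parser; this is a sketch, and the element and attribute names are invented:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NcrValidity {
    // Returns true if the XML fragment parses, false on a SAX error.
    static boolean parses(String xml) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. invalid character reference
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A reference to the full code point is a valid XML 1.0 character...
        System.out.println(parses("<r name='&#77987;'/>"));         // true
        // ...but a reference to a lone surrogate half is not.
        System.out.println(parses("<r name='&#77987;&#56483;'/>")); // false
    }
}
```

This suggests the real pipeline is emitting the two UTF-16 code units separately rather than as one code point, which no conformant parser will accept back.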
>>>>>>>> I agree that some of these solutions can appear specific, awkward,
>>>>>>>> or complex, and that the way forward is not to address our use case
>>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>>> serialization format for the CAS when dealing with large amounts of
>>>>>>>> data, because this is directly visible in the costs of processing
>>>>>>>> and storage, and I found the compressed binary format to be much
>>>>>>>> better than XMI in this regard, although I have to admit it’s been
>>>>>>>> a while since I benchmarked this. Given that UIMA already has a
>>>>>>>> well-described type system, maybe it just lacks a way to describe
>>>>>>>> schema evolution similar to Apache Avro or similar serialisation
>>>>>>>> frameworks. I think a more formal approach to data migration would
>>>>>>>> be critical to any larger operational setup.
>>>>>>>>
>>>>>>>> Regarding XMI, I’d like to provide some input on the problem we are
>>>>>>>> observing, so that it can be solved. We are primarily using XMI for
>>>>>>>> inspection/debugging purposes, and we are sometimes not able to do
>>>>>>>> this because of this error. I will try to extract a minimal example
>>>>>>>> to avoid involving parts that have to do with our pipeline and type
>>>>>>>> system, and I think this would also be the best way to illustrate
>>>>>>>> that the problem exists outside of this context. However,
>>>>>>>> converting all our data to XMI first in order to do the conversion
>>>>>>>> in our example would not be very practical for us, because it
>>>>>>>> involves a large amount of data.
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> In this case, the original looks kind of like this:
>>>>>>>>>
>>>>>>>>>   Container
>>>>>>>>>     features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>>                 has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>
>>>>>>>>> The new TypeSystem has:
>>>>>>>>>
>>>>>>>>>   Container
>>>>>>>>>     features -> FSArray of FeatureRecord, each of which
>>>>>>>>>                 has 2 slots: name, value
>>>>>>>>>
>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>>   2) for each element, map the FeatureAnnotation to a new instance
>>>>>>>>>      of FeatureRecord.
>>>>>>>>>
>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of:
>>>>>>>>>   1) change the type from A to B
>>>>>>>>>   2) set equal-named features from A to B, drop other features
>>>>>>>>>
>>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>>> namely only those referenced by the FSArray where the element type
>>>>>>>>> changed. Seems complex and specific to this use case, though.
>>>>>>>>>
>>>>>>>>> -Marshall
>>>>>>>>>
>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>> I can reproduce the problem, and see what is happening.
>>>>>>>>>>> The deserialization code compares the two type systems and
>>>>>>>>>>> allows for some mismatches (things present in one and not in the
>>>>>>>>>>> other), but it doesn't allow a feature whose range (value) is
>>>>>>>>>>> type XXXX in one type system and type YYYY in the other.
>>>>>>>>>>> See CasTypeSystemMapper, lines 299-315.
>>>>>>>>>>
>>>>>>>>>> Without reading the code in detail: could we not relax this check
>>>>>>>>>> such that the element type of FSArrays is not checked, and the
>>>>>>>>>> code simply assumes that the source element type has the same
>>>>>>>>>> features as the target element type (with the usual lenient
>>>>>>>>>> handling of missing features in the target type)? Kind of a
>>>>>>>>>> "duck typing" approach?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> -- Richard
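The default mapping sketched in the thread (change the type, copy equal-named features, drop the rest) can be illustrated independently of the CAS APIs. Modeling feature structures as plain maps is an editorial simplification; only the slot names come from the thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LenientMapping {
    // Copy only the features that exist in the target type; drop the rest.
    static Map<String, Object> mapTo(Map<String, Object> source,
                                     Set<String> targetFeatures) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatures.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue());
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // A FeatureAnnotation-like FS with 5 slots...
        Map<String, Object> featureAnnotation = new LinkedHashMap<>();
        featureAnnotation.put("sofaRef", 1);
        featureAnnotation.put("begin", 0);
        featureAnnotation.put("end", 4);
        featureAnnotation.put("name", "colour");
        featureAnnotation.put("value", "1.0");

        // ...mapped onto a FeatureRecord-like target with 2 slots.
        Map<String, Object> featureRecord =
                mapTo(featureAnnotation, Set.of("name", "value"));
        System.out.println(featureRecord); // {name=colour, value=1.0}
    }
}
```

In the real deserializer this lenient copy would additionally have to be restricted to the FSArrays whose element type changed, which is the part Marshall flags as complex.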
