Yes, makes sense, thanks for posting the Jira. If no one else steps up to work on this, I'll probably take a look in a few days. -Marshall
On 9/24/2019 6:47 AM, Mario Juric wrote: > Hi Marshall, > > I added the following feature request to Apache Jira: > > https://issues.apache.org/jira/browse/UIMA-6128 > > Hope it makes sense :) > > Thanks a lot for the help, it’s appreciated. > > Cheers, > Mario > > > > > > > > > > > > > >> On 23 Sep 2019, at 16:33 , Marshall Schor <[email protected]> wrote: >> >> Re: serializing using XML 1.1 >> >> This was not thought of, when setting up the CasIOUtils. >> >> The way it was done (above) was using some more "primitive/lower level" APIs, >> rather than the CasIOUtils. >> >> Please open a Jira ticket for this, with perhaps some suggestions on how it >> might be specified in the CasIOUtils APIs. >> >> Thanks! -Marshall >> >> On 9/23/2019 3:45 AM, Mario Juric wrote: >>> Hi Marshall, >>> >>> Thanks for the thorough and excellent investigation. >>> >>> We are looking into possible normalisation/cleanup of whitespace/invisible >>> characters, but I don’t think we can necessarily do the same for some of >>> the other characters. It sounds to me though that serialising to XML 1.1 >>> could also be a simple fix right now, but can this be configured? >>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s >>> something you have working in your branch. >>> >>> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 >>> and after. Do you think switching to a more recent Java version would make >>> a difference? I think we can also try this out ourselves when we look into >>> migrating to UIMA 3 once our current deliveries are complete. We also like >>> to switch to Java 11, and like UIMA 3 migration it will require some >>> thorough testing. >>> >>> Cheers, >>> Mario >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>>> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote: >>>> >>>> In the test "OddDocumentText", this produces a "throw" due to an invalid >>>> xml >>>> char, which is the \u0002. 
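For reference, the reason \u0002 is rejected comes down to the Char productions of the two XML versions, which XML 1.1 relaxed. A minimal sketch of the two checks (the class and method names here are illustrative only, not UIMA or JDK API):

```java
// Which code points the two XML versions accept as characters.
// Ranges follow the Char productions of the XML 1.0 and 1.1 specs;
// the helper names are illustrative, not part of any library.
public class XmlCharCheck {

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean validXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean validXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(validXml10(0x0002)); // false: rejected by XML 1.0
        System.out.println(validXml11(0x0002)); // true:  accepted by XML 1.1
    }
}
```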
>>>> >>>> This is in part because the xml version being used is xml 1.0. >>>> >>>> XML 1.1 expanded the set of valid characters to include \u0002. >>>> >>>> Here's a snip from the XmiCasSerializerTest class which serializes with >>>> xml 1.1: >>>> >>>> XmiCasSerializer xmiCasSerializer = new >>>> XmiCasSerializer(jCas.getTypeSystem()); >>>> OutputStream out = new FileOutputStream(new File >>>> ("odd-doc-txt-v11.xmi")); >>>> try { >>>> XMLSerializer xml11Serializer = new XMLSerializer(out); >>>> xml11Serializer.setOutputProperty(OutputKeys.VERSION,"1.1"); >>>> xmiCasSerializer.serialize(jCas.getCas(), >>>> xml11Serializer.getContentHandler()); >>>> } >>>> finally { >>>> out.close(); >>>> } >>>> >>>> This succeeds and serializes this using xml 1.1. >>>> >>>> I also tried serializing some doc text which includes U+130A3 (decimal 77987). That did >>>> not >>>> serialize correctly. >>>> I could see it in the code while tracing up to some point down in the >>>> innards of >>>> some internal >>>> sax java code >>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize where >>>> it was >>>> "Correct" in the Java string. >>>> >>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837. >>>> >>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte >>>> encoding: >>>> 1110 xxxx 10xx xxxx 10xx xxxx >>>> >>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me. >>>> >>>> But I think it's out of our hands - it's somewhere deep in the sax >>>> transform >>>> java code. >>>> >>>> I looked for a bug report and found some >>>> https://bugs.openjdk.java.net/browse/JDK-8058175 >>>> >>>> Bottom line is, I think, to clean out these characters early :-) . >>>> >>>> -Marshall >>>> >>>> >>>> On 9/20/2019 1:28 PM, Marshall Schor wrote: >>>>> here's an idea. 
>>>>> >>>>> If you have a string, with the surrogate pair 𓂣 at position 10, and >>>>> you >>>>> have some Java code, which is iterating through the string and getting the >>>>> code-point at each character offset, then that code will produce: >>>>> >>>>> at position 10: the code-point 77987 >>>>> at position 11: the code-point 56483 >>>>> >>>>> Of course, it's a "bug" to iterate through a string of characters, >>>>> assuming you >>>>> have characters at each point, if you don't handle surrogate pairs. >>>>> >>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 >>>>> (see >>>>> https://tools.ietf.org/html/rfc2781 ) >>>>> >>>>> I worry that even tools like the CVD or similar may not work properly, >>>>> since >>>>> they're not designed to handle surrogate pairs, I think, so I have no >>>>> idea if >>>>> they would work well enough for you. >>>>> >>>>> I'll poke around some more to see if I can enable the conversion for >>>>> document >>>>> strings. >>>>> >>>>> -Marshall >>>>> >>>>> On 9/20/2019 11:09 AM, Mario Juric wrote: >>>>>> Thanks Marshall, >>>>>> >>>>>> Encoding the characters like you suggest should work just fine for us as >>>>>> long as we can serialize and deserialise the XMI, so that we can open >>>>>> the content in a tool like the CVD or similar. These characters are just >>>>>> noise from the original content that happen to remain in the CAS, but >>>>>> they are not visible in our final output because they are basically >>>>>> filtered out one way or the other by downstream components. They become >>>>>> a problem though when they make it more difficult for us to inspect the >>>>>> content. >>>>>> >>>>>> Regarding the feature name issue: Might you have an idea why we are >>>>>> getting a different XMI output for the same character in our actual >>>>>> pipeline, where it results in "𓂣�”? 
I investigated the >>>>>> value in the debugger again, and like you are illustrating it is also >>>>>> just a single codepoint with the value 77987. We are simply not able to >>>>>> load this XMI because of this, but unfortunately I couldn’t reproduce it >>>>>> in my small example. >>>>>> >>>>>> Cheers, >>>>>> Mario >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote: >>>>>>> >>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, >>>>>>> due to >>>>>>> that unicode character. >>>>>>> >>>>>>> Here's what I see: The FeatureRecord "name" field is set to a >>>>>>> 1-unicode-character, that must be encoded as 2 java characters. >>>>>>> >>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord >>>>>>> xmi:id="18" >>>>>>> name="𓂣" value="1.0"/> >>>>>>> which seems correct. The name field only has 1 (extended)unicode >>>>>>> character >>>>>>> (taking 2 Java characters to represent), >>>>>>> due to setting it with this code: String oddName = "\uD80C\uDCA3"; >>>>>>> >>>>>>> When read in, the name field is assigned to a String, that string says >>>>>>> it has a >>>>>>> length of 2 (but that's because it takes 2 java chars to represent this >>>>>>> char). >>>>>>> If you have the name string in a variable "n", and do >>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987. >>>>>>> n.codePointCount(0, n.length()) is, as expected, 1. >>>>>>> >>>>>>> So, the string value serialization and deserialization seems to be >>>>>>> "working". >>>>>>> >>>>>>> The other code - for the sofa (document) serialization, is throwing >>>>>>> that error, >>>>>>> because as currently designed, the >>>>>>> serialization code checks for these kinds of characters, and if found >>>>>>> throws >>>>>>> that exception. 
The code checking is >>>>>>> in XMLUtils.checkForNonXmlCharacters >>>>>>> >>>>>>> This is because it's highly likely that "fixing this" in the same way >>>>>>> as the >>>>>>> other, would result in hard-to-diagnose >>>>>>> future errors, because the subject of analysis string is processed with >>>>>>> begin / >>>>>>> end offset all over the place, and makes >>>>>>> the assumption that the characters are all not coded as surrogate pairs. >>>>>>> >>>>>>> We could change the code to output these like the name, as, e.g., >>>>>>> 𓂣 >>>>>>> >>>>>>> Would that help in your case, or do you imagine other kinds of things >>>>>>> might >>>>>>> break (due to begin/end offsets no longer >>>>>>> being on character boundaries, for example). >>>>>>> >>>>>>> -Marshall >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I investigated the XMI issue as promised and these are my findings. >>>>>>>> >>>>>>>> It is related to special unicode characters that are not handled by XMI >>>>>>>> serialisation, and there seems to be two distinct categories of issues >>>>>>>> we have >>>>>>>> identified so far. >>>>>>>> >>>>>>>> 1) The document text of the CAS contains special unicode characters >>>>>>>> 2) Annotations with String features have values containing special >>>>>>>> unicode >>>>>>>> characters >>>>>>>> >>>>>>>> In both cases we could for sure solve the problem if we did a better >>>>>>>> clean up >>>>>>>> job upstream, but with the amount and variety of data we receive there >>>>>>>> is >>>>>>>> always a chance something passes through, and some of it may in the >>>>>>>> general >>>>>>>> case even be valid content. >>>>>>>> >>>>>>>> The first case is easy to reproduce with the OddDocumentText example I >>>>>>>> attached. In this example the text is a snippet taken from the content >>>>>>>> of a >>>>>>>> parsed XML document. 
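To make the surrogate-pair behaviour in this thread concrete: the hieroglyph U+130A3 takes two Java chars, counts as one code point (77987), and its correct UTF-8 form is four bytes. A standalone sketch using only the JDK, no UIMA:

```java
import java.nio.charset.StandardCharsets;

// Demonstrates how the supplementary character U+130A3 behaves in Java:
// two chars (a surrogate pair), one code point, four UTF-8 bytes.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD80C\uDCA3"; // U+130A3, the hieroglyph from the thread

        System.out.println(s.length());                      // 2 Java chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.codePointAt(0));                // 77987 (0x130A3)
        System.out.println(s.codePointAt(1));                // 56483: the low surrogate alone,
                                                             // what naive per-char iteration sees

        // Correct UTF-8 is the 4-byte sequence F0 93 82 A3, not the 3-byte
        // E7 9E 98 seen from the broken serializer path described above.
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // F09382A3
    }
}
```

This reproduces the numbers quoted earlier in the thread: code-point-aware code sees one character, 77987, while per-char iteration sees 77987 followed by 56483.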
>>>>>>>> >>>>>>>> The other case was not possible to reproduce with the OddFeatureText >>>>>>>> example, >>>>>>>> because I am getting slightly different output to what I have in our >>>>>>>> real >>>>>>>> setup. The OddFeatureText example is based on the simple type system I >>>>>>>> shared >>>>>>>> previously. The name value of a FeatureRecord contains special unicode >>>>>>>> characters that I found in a similar data structure in our actual CAS. >>>>>>>> The >>>>>>>> value comes from an external knowledge base holding some noisy >>>>>>>> strings, which >>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS to >>>>>>>> XMI >>>>>>>> using the small example it only outputs the first of the two >>>>>>>> characters in >>>>>>>> "\uD80C\uDCA3”, which yields the value "𓂣” in the XMI, but in >>>>>>>> our >>>>>>>> actual setup both character values are written as "𓂣�”. >>>>>>>> This >>>>>>>> means that the attached example will for some reason parse the XMI >>>>>>>> again, but >>>>>>>> it will not work in the case where both characters are written the way >>>>>>>> we >>>>>>>> experience it. The XMI can be manually changed, so that both character >>>>>>>> values >>>>>>>> are included the way it happens in our output, and in this case a >>>>>>>> SAXParserException happens. >>>>>>>> >>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to >>>>>>>> handle >>>>>>>> any of this, but it will be good to know in any case :) >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Mario >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected] >>>>>>>>> <mailto:[email protected]> <mailto:[email protected] >>>>>>>>> <mailto:[email protected]>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Thank you very much for looking into this. 
It is really appreciated >>>>>>>>> and I >>>>>>>>> think it touches upon something important, which is about data >>>>>>>>> migration in >>>>>>>>> general. >>>>>>>>> >>>>>>>>> I agree that some of these solutions can appear specific, awkward or >>>>>>>>> complex >>>>>>>>> and the way forward is not to address our use case alone. I think >>>>>>>>> there is a >>>>>>>>> need for a compact and efficient binary serialization format for the >>>>>>>>> CAS when >>>>>>>>> dealing with large amounts of data because this is directly visible >>>>>>>>> in costs >>>>>>>>> of processing and storing, and I found the compressed binary format >>>>>>>>> to be >>>>>>>>> much better than XMI in this regard, although I have to admit it’s >>>>>>>>> been a >>>>>>>>> while since I benchmarked this. Given that UIMA already has a well >>>>>>>>> described >>>>>>>>> type system then maybe it just lacks a way to describe schema >>>>>>>>> evolution >>>>>>>>> similar to Apache Avro or similar serialisation frameworks. I think a >>>>>>>>> more >>>>>>>>> formal approach to data migration would be critical to any larger >>>>>>>>> operational >>>>>>>>> setup. >>>>>>>>> >>>>>>>>> Regarding XMI I like to provide some input to the problem we are >>>>>>>>> observing, >>>>>>>>> so that it can be solved. We are primarily using XMI for >>>>>>>>> inspection/debugging >>>>>>>>> purposes, and we are sometimes not able to do this because of this >>>>>>>>> error. I >>>>>>>>> will try to extract a minimum example to avoid involving parts that >>>>>>>>> has to do >>>>>>>>> with our pipeline and type system, and I think this would also be the >>>>>>>>> best >>>>>>>>> way to illustrate that the problem exists outside of this context. >>>>>>>>> However, >>>>>>>>> converting all our data to XMI first in order to do the conversion in >>>>>>>>> our >>>>>>>>> example would not be very practical for us, because it involves a >>>>>>>>> large >>>>>>>>> amount of data. 
>>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Mario >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected] >>>>>>>>>> <mailto:[email protected]> >>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote: >>>>>>>>>> >>>>>>>>>> In this case, the original looks kind-of like this: >>>>>>>>>> >>>>>>>>>> Container >>>>>>>>>> features -> FSArray of FeatureAnnotation each of which >>>>>>>>>> has 5 slots: sofaRef, begin, end, name, >>>>>>>>>> value >>>>>>>>>> >>>>>>>>>> the new TypeSystem has >>>>>>>>>> >>>>>>>>>> Container >>>>>>>>>> features -> FSArray of FeatureRecord each of which >>>>>>>>>> has 2 slots: name, value >>>>>>>>>> >>>>>>>>>> The deserializer code would need some way to decide how to >>>>>>>>>> 1) create an FSArray of FeatureRecord, >>>>>>>>>> 2) for each element, >>>>>>>>>> map the FeatureAnnotation to a new instance of FeatureRecord >>>>>>>>>> >>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of >>>>>>>>>> 1) change the type from A to B >>>>>>>>>> 2) set equal-named features from A to B, drop other features >>>>>>>>>> >>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, >>>>>>>>>> namely, only >>>>>>>>>> those referenced by the FSArray where the element type changed. >>>>>>>>>> Seems complex >>>>>>>>>> and specific to this use case though. >>>>>>>>>> >>>>>>>>>> -Marshall >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote: >>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected] >>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote: >>>>>>>>>>>> I can reproduce the problem, and see what is happening. 
The >>>>>>>>>>>> deserialization >>>>>>>>>>>> code compares the two type systems, and allows for some mismatches >>>>>>>>>>>> (things >>>>>>>>>>>> present in one and not in the other), but it doesn't allow for >>>>>>>>>>>> having a >>>>>>>>>>>> feature >>>>>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY >>>>>>>>>>>> in the >>>>>>>>>>>> other. >>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315. >>>>>>>>>>> Without reading the code in detail - could we not relax this check >>>>>>>>>>> such >>>>>>>>>>> that the element type of FSArrays is not checked and the code simply >>>>>>>>>>> assumes that the source element type has the same features as the >>>>>>>>>>> target >>>>>>>>>>> element type (with the usual lenient handling of missing features >>>>>>>>>>> in the >>>>>>>>>>> target type)? - Kind of a "duck typing" approach? >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> >>>>>>>>>>> -- Richard >
