Here's working code that serializes in XML 1.1 format.
The key idea is to set the output property OutputKeys.VERSION to "1.1".

    XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
    OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
    try {
      XMLSerializer xml11Serializer = new XMLSerializer(out);
      xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
      xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
    } finally {
      out.close();
    }
This is from a test case. -Marshall
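
[Editorial sketch, not from the original test case: the reason the version property matters is the XML "Char" production. XML 1.0 forbids most C0 control characters such as \u0002, while XML 1.1 allows every code point except \u0000. The helper class and method names below are illustrative, plain JDK code, not UIMA API.]

```java
// Sketch of the XML 1.0 vs 1.1 character validity rules (illustrative helper).
public class XmlCharCheck {

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println("\\u0002 valid in XML 1.0: " + isValidXml10(0x2)); // false
        System.out.println("\\u0002 valid in XML 1.1: " + isValidXml11(0x2)); // true
    }
}
```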
On 9/25/2019 2:16 PM, Mario Juric wrote:
> Thanks Marshall,
>
> If you prefer, I can also have a look at it, although I probably need to
> finish something first within the next 3-4 weeks. It would probably get me
> started faster if you could share some of your experimental sample code.
>
> Cheers,
> Mario
>
>> On 24 Sep 2019, at 21:32 , Marshall Schor <[email protected]> wrote:
>>
>> yes, makes sense, thanks for posting the Jira.
>>
>> If no one else steps up to work on this, I'll probably take a look in a few
>> days. -Marshall
>>
>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> I added the following feature request to Apache Jira:
>>>
>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>
>>> Hope it makes sense :)
>>>
>>> Thanks a lot for the help, it’s appreciated.
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 23 Sep 2019, at 16:33 , Marshall Schor <[email protected]> wrote:
>>>>
>>>> Re: serializing using XML 1.1
>>>>
>>>> This was not thought of, when setting up the CasIOUtils.
>>>>
>>>> The way it was done (above) was using some more "primitive/lower level"
>>>> APIs,
>>>> rather than the CasIOUtils.
>>>>
>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>>> might be specified in the CasIOUtils APIs.
>>>>
>>>> Thanks! -Marshall
>>>>
>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>> Hi Marshall,
>>>>>
>>>>> Thanks for the thorough and excellent investigation.
>>>>>
>>>>> We are looking into possible normalisation/cleanup of
>>>>> whitespace/invisible characters, but I don’t think we can necessarily do
>>>>> the same for some of the other characters. It sounds to me though that
>>>>> serialising to XML 1.1 could also be a simple fix right now, but can this
>>>>> be configured? CasIOUtils doesn’t seem to have an option for this, so I
>>>>> assume it’s something you have working in your branch.
>>>>>
>>>>> Regarding the other problem: it seems that the JDK bug is fixed in Java
>>>>> 9 and later. Do you think switching to a more recent Java version would
>>>>> make a difference? I think we can also try this out ourselves when we
>>>>> look into migrating to UIMA 3 once our current deliveries are complete.
>>>>> We would also like to switch to Java 11, and like the UIMA 3 migration
>>>>> it will require some thorough testing.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote:
>>>>>>
>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>>>>> xml
>>>>>> char, which is the \u0002.
>>>>>>
>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>
>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>
>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>>>>> xml 1.1:
>>>>>>
>>>>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>>   try {
>>>>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>>   } finally {
>>>>>>     out.close();
>>>>>>   }
>>>>>>
>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>
>>>>>> I also tried serializing some doc text which includes the code point
>>>>>> 77987 (U+130A3). That did not serialize correctly.
>>>>>> While tracing, I could follow the value down into the innards of some
>>>>>> internal sax java code
>>>>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>>>>> where it was still correct in the Java string.
>>>>>>
>>>>>> When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.
>>>>>>
>>>>>> That is 1110 0111 1001 1110 1001 1000 0011 0111, whose first three bytes
>>>>>> form a utf-8 3-byte encoding (1110 xxxx 10xx xxxx 10xx xxxx)
>>>>>>
>>>>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8" - so it looks fishy to me.
>>>>>>
>>>>>> But I think it's out of our hands - it's somewhere deep in the sax
>>>>>> transform
>>>>>> java code.
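
[Editorial sketch, not from the thread's test case: the correct UTF-8 bytes for that code point can be checked with the JDK alone. U+130A3 should serialize as the 4-byte sequence F0 93 82 A3, not the E7 9E 98 37 observed above. The class and method names here are illustrative.]

```java
import java.nio.charset.StandardCharsets;

// Sketch: what the UTF-8 bytes for U+130A3 (code point 77987) should be.
public class Utf8Demo {
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\uD80C\uDCA3";      // the surrogate pair encoding U+130A3
        System.out.println(utf8Hex(s)); // F0 93 82 A3 - the correct 4-byte encoding
    }
}
```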
>>>>>>
>>>>>> I looked for a bug report and found some
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>
>>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>> here's an idea.
>>>>>>>
>>>>>>> If you have a string, with the surrogate pair 𓂣 at position 10,
>>>>>>> and you
>>>>>>> have some Java code, which is iterating through the string and getting
>>>>>>> the
>>>>>>> code-point at each character offset, then that code will produce:
>>>>>>>
>>>>>>> at position 10: the code-point 77987
>>>>>>> at position 11: the code-point 56483
>>>>>>>
>>>>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>>>>> assuming you
>>>>>>> have characters at each point, if you don't handle surrogate pairs.
>>>>>>>
>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00
>>>>>>> (see
>>>>>>> https://tools.ietf.org/html/rfc2781 )
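
[Editorial sketch of the buggy iteration described above, plain JDK, no UIMA; the class and method names are mine. Stepping through the string one char at a time, instead of advancing by Character.charCount, yields 77987 at the high-surrogate index and the bare low surrogate 56483 (0xDCA3) at the next index.]

```java
// Sketch: iterating char-by-char over a string containing a surrogate pair.
public class SurrogateDemo {
    public static int[] codePointPerCharIndex(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            // Bug illustrated: stepping by 1 char, not by Character.charCount(cp)
            out[i] = s.codePointAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "0123456789" + "\uD80C\uDCA3"; // surrogate pair at index 10
        int[] cps = codePointPerCharIndex(s);
        System.out.println(cps[10]); // 77987 - the real code point, U+130A3
        System.out.println(cps[11]); // 56483 - just the low surrogate, 0xDCA3
    }
}
```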
>>>>>>>
>>>>>>> I worry that even tools like the CVD or similar may not work properly,
>>>>>>> since
>>>>>>> they're not designed to handle surrogate pairs, I think, so I have no
>>>>>>> idea if
>>>>>>> they would work well enough for you.
>>>>>>>
>>>>>>> I'll poke around some more to see if I can enable the conversion for
>>>>>>> document
>>>>>>> strings.
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>> Thanks Marshall,
>>>>>>>>
>>>>>>>> Encoding the characters like you suggest should work just fine for us
>>>>>>>> as long as we can serialize and deserialise the XMI, so that we can
>>>>>>>> open the content in a tool like the CVD or similar. These characters
>>>>>>>> are just noise from the original content that happen to remain in the
>>>>>>>> CAS, but they are not visible in our final output because they are
>>>>>>>> basically filtered out one way or the other by downstream components.
>>>>>>>> They become a problem though when they make it more difficult for us
>>>>>>>> to inspect the content.
>>>>>>>>
>>>>>>>> Regarding the feature name issue: Might you have an idea why we are
>>>>>>>> getting a different XMI output for the same character in our actual
>>>>>>>> pipeline, where it results in "𓂣�”? I investigated the
>>>>>>>> value in the debugger again, and like you are illustrating it is also
>>>>>>>> just a single codepoint with the value 77987. We are simply not able
>>>>>>>> to load this XMI because of this, but unfortunately I couldn’t
>>>>>>>> reproduce it in my small example.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>>>>>> properties, due to
>>>>>>>>> that unicode character.
>>>>>>>>>
>>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a
>>>>>>>>> 1-unicode-character string, which must be encoded as 2 java characters.
>>>>>>>>>
>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord
>>>>>>>>> xmi:id="18"
>>>>>>>>> name="𓂣" value="1.0"/>
>>>>>>>>> which seems correct. The name field only has 1 (extended)unicode
>>>>>>>>> character
>>>>>>>>> (taking 2 Java characters to represent),
>>>>>>>>> due to setting it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>>>>
>>>>>>>>> When read in, the name field is assigned to a String, that string
>>>>>>>>> says it has a
>>>>>>>>> length of 2 (but that's because it takes 2 java chars to represent
>>>>>>>>> this char).
>>>>>>>>> If you have the name string in a variable "n", and do
>>>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>
>>>>>>>>> So, the string value serialization and deserialization seems to be
>>>>>>>>> "working".
>>>>>>>>>
>>>>>>>>> The other code - for the sofa (document) serialization - is throwing
>>>>>>>>> that error because, as currently designed, the serialization code
>>>>>>>>> checks for these kinds of characters and, if any are found, throws
>>>>>>>>> that exception. The checking is done in
>>>>>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>>>>>
>>>>>>>>> This is because it's highly likely that "fixing this" in the same way
>>>>>>>>> as the other would result in hard-to-diagnose future errors: the
>>>>>>>>> subject-of-analysis string is processed with begin / end offsets all
>>>>>>>>> over the place, under the assumption that none of the characters are
>>>>>>>>> coded as surrogate pairs.
>>>>>>>>>
>>>>>>>>> We could change the code to output these like the name, as, e.g.,
>>>>>>>>> 𓂣
>>>>>>>>>
>>>>>>>>> Would that help in your case, or do you imagine other kinds of things
>>>>>>>>> might
>>>>>>>>> break (due to begin/end offsets no longer
>>>>>>>>> being on character boundaries, for example).
>>>>>>>>>
>>>>>>>>> -Marshall
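
[Editorial sketch of the idea just mentioned - emitting supplementary characters as numeric character references, e.g. &#77987; for U+130A3. This is an illustrative plain-JDK helper, not the actual UIMA serializer code.]

```java
// Sketch: escape supplementary characters as XML numeric character references.
public class NcrEscape {
    public static String escapeSupplementary(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp); // advance past both surrogate halves
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSupplementary("abc\uD80C\uDCA3def")); // abc&#77987;def
    }
}
```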
>>>>>>>>>
>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>>>
>>>>>>>>>> It is related to special unicode characters that are not handled by
>>>>>>>>>> XMI
>>>>>>>>>> serialisation, and there seem to be two distinct categories of
>>>>>>>>>> issues we have identified so far.
>>>>>>>>>>
>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>> 2) Annotations with String features have values containing special
>>>>>>>>>> unicode
>>>>>>>>>> characters
>>>>>>>>>>
>>>>>>>>>> In both cases we could for sure solve the problem if we did a better
>>>>>>>>>> clean up
>>>>>>>>>> job upstream, but with the amount and variety of data we receive
>>>>>>>>>> there is
>>>>>>>>>> always a chance something passes through, and some of it may in the
>>>>>>>>>> general
>>>>>>>>>> case even be valid content.
>>>>>>>>>>
>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example
>>>>>>>>>> I
>>>>>>>>>> attached. In this example the text is a snippet taken from the
>>>>>>>>>> content of a
>>>>>>>>>> parsed XML document.
>>>>>>>>>>
>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>>>>>>> example,
>>>>>>>>>> because I am getting slightly different output to what I have in our
>>>>>>>>>> real
>>>>>>>>>> setup. The OddFeatureText example is based on the simple type system
>>>>>>>>>> I shared
>>>>>>>>>> previously. The name value of a FeatureRecord contains special
>>>>>>>>>> unicode
>>>>>>>>>> characters that I found in a similar data structure in our actual
>>>>>>>>>> CAS. The
>>>>>>>>>> value comes from an external knowledge base holding some noisy
>>>>>>>>>> strings, which
>>>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS
>>>>>>>>>> to XMI
>>>>>>>>>> using the small example it only outputs the first of the two
>>>>>>>>>> characters in
>>>>>>>>>> "\uD80C\uDCA3”, which yields the value "𓂣” in the XMI, but in
>>>>>>>>>> our
>>>>>>>>>> actual setup both character values are written as
>>>>>>>>>> "𓂣�”. This
>>>>>>>>>> means that the attached example will for some reason parse the XMI
>>>>>>>>>> again, but
>>>>>>>>>> it will not work in the case where both characters are written the
>>>>>>>>>> way we
>>>>>>>>>> experience it. The XMI can be manually changed, so that both
>>>>>>>>>> character values
>>>>>>>>>> are included the way it happens in our output, and in this case a
>>>>>>>>>> SAXParserException happens.
>>>>>>>>>>
>>>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser
>>>>>>>>>> to handle
>>>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated
>>>>>>>>>>> and I
>>>>>>>>>>> think it touches upon something important, which is about data
>>>>>>>>>>> migration in
>>>>>>>>>>> general.
>>>>>>>>>>>
>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward
>>>>>>>>>>> or complex
>>>>>>>>>>> and the way forward is not to address our use case alone. I think
>>>>>>>>>>> there is a
>>>>>>>>>>> need for a compact and efficient binary serialization format for
>>>>>>>>>>> the CAS when
>>>>>>>>>>> dealing with large amounts of data because this is directly visible
>>>>>>>>>>> in costs
>>>>>>>>>>> of processing and storing, and I found the compressed binary format
>>>>>>>>>>> to be
>>>>>>>>>>> much better than XMI in this regard, although I have to admit it’s
>>>>>>>>>>> been a
>>>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well
>>>>>>>>>>> described
>>>>>>>>>>> type system then maybe it just lacks a way to describe schema
>>>>>>>>>>> evolution
>>>>>>>>>>> similar to Apache Avro or other serialisation frameworks. I think
>>>>>>>>>>> a more
>>>>>>>>>>> formal approach to data migration would be critical to any larger
>>>>>>>>>>> operational
>>>>>>>>>>> setup.
>>>>>>>>>>>
>>>>>>>>>>> Regarding XMI I like to provide some input to the problem we are
>>>>>>>>>>> observing,
>>>>>>>>>>> so that it can be solved. We are primarily using XMI for
>>>>>>>>>>> inspection/debugging
>>>>>>>>>>> purposes, and we are sometimes not able to do this because of this
>>>>>>>>>>> error. I
>>>>>>>>>>> will try to extract a minimal example to avoid involving parts that
>>>>>>>>>>> have to do
>>>>>>>>>>> with our pipeline and type system, and I think this would also be
>>>>>>>>>>> the best
>>>>>>>>>>> way to illustrate that the problem exists outside of this context.
>>>>>>>>>>> However,
>>>>>>>>>>> converting all our data to XMI first in order to do the conversion
>>>>>>>>>>> in our
>>>>>>>>>>> example would not be very practical for us, because it involves a
>>>>>>>>>>> large
>>>>>>>>>>> amount of data.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
>>>>>>>>>>>
>>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>>> has 5 slots: sofaRef, begin, end, name,
>>>>>>>>>>>> value
>>>>>>>>>>>>
>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureRecord each of which
>>>>>>>>>>>> has 2 slots: name, value
>>>>>>>>>>>>
>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>> map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>
>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
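
[Editorial sketch of that default mapping, with plain Java Maps standing in for feature structures; this is an illustration of the idea, not actual CasTypeSystemMapper code.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustration: copy equal-named features from a source FS (type A) into
// a target FS (type B), dropping features the target type doesn't have.
public class FeatureMapper {
    public static Map<String, Object> mapFeatures(Map<String, Object> source,
                                                  Set<String> targetFeatures) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatures.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue());
            }
        }
        return target;
    }
}
```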
>>>>>>>>>>>>
>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>>>>>> namely, only
>>>>>>>>>>>> those referenced by the FSArray where the element type changed.
>>>>>>>>>>>> Seems complex
>>>>>>>>>>>> and specific to this use case though.
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>>>>>>> deserialization
>>>>>>>>>>>>>> code compares the two type systems, and allows for some
>>>>>>>>>>>>>> mismatches (things
>>>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for
>>>>>>>>>>>>>> having a
>>>>>>>>>>>>>> feature
>>>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type
>>>>>>>>>>>>>> YYYY in the
>>>>>>>>>>>>>> other.
>>>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>> Without reading the code in detail - could we not relax this
>>>>>>>>>>>>> check such
>>>>>>>>>>>>> that the element type of FSArrays is not checked and the code
>>>>>>>>>>>>> simply
>>>>>>>>>>>>> assumes that the source element type has the same features as the
>>>>>>>>>>>>> target
>>>>>>>>>>>>> element type (with the usual lenient handling of missing features
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Richard
>