Here's working code that serializes in XML 1.1 format.
The key idea is to set the output property OutputKeys.VERSION to "1.1".

    XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
    OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
    try {
      XMLSerializer xml11Serializer = new XMLSerializer(out);
      xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
      xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
    } finally {
      out.close();
    }
This is from a test case. -Marshall
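
[Editorial sketch, not from the original test case: the reason the version property matters is the XML "Char" production. XML 1.0 forbids most C0 control characters such as \u0002, while XML 1.1 allows every code point except \u0000. The helper class and method names below are illustrative, plain JDK code, not UIMA API.]

```java
// Sketch of the XML 1.0 vs 1.1 character validity rules (illustrative helper).
public class XmlCharCheck {

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println("\\u0002 valid in XML 1.0: " + isValidXml10(0x2)); // false
        System.out.println("\\u0002 valid in XML 1.1: " + isValidXml11(0x2)); // true
    }
}
```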
On 9/25/2019 2:16 PM, Mario Juric wrote:
> Thanks Marshall,
>
> If you prefer, I can also have a look at it, although I probably need to
> finish something first within the next 3-4 weeks. It would probably get me
> started faster if you could share some of your experimental sample code.
>
> Cheers,
> Mario
>
>> On 24 Sep 2019, at 21:32 , Marshall Schor <[email protected]> wrote:
>>
>> yes, makes sense, thanks for posting the Jira.
>>
>> If no one else steps up to work on this, I'll probably take a look in a few
>> days. -Marshall
>>
>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>> Hi Marshall,
>>>
>>> I added the following feature request to Apache Jira:
>>>
>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>
>>> Hope it makes sense :)
>>>
>>> Thanks a lot for the help, it’s appreciated.
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 23 Sep 2019, at 16:33 , Marshall Schor <[email protected]> wrote:
>>>>
>>>> Re: serializing using XML 1.1
>>>>
>>>> This was not thought of, when setting up the CasIOUtils.
>>>>
>>>> The way it was done (above) was using some more "primitive/lower level"
>>>> APIs,
>>>> rather than the CasIOUtils.
>>>>
>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>>> might be specified in the CasIOUtils APIs.
>>>>
>>>> Thanks! -Marshall
>>>>
>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>> Hi Marshall,
>>>>>
>>>>> Thanks for the thorough and excellent investigation.
>>>>>
>>>>> We are looking into possible normalisation/cleanup of
>>>>> whitespace/invisible characters, but I don’t think we can necessarily do
>>>>> the same for some of the other characters. It sounds to me though that
>>>>> serialising to XML 1.1 could also be a simple fix right now, but can this
>>>>> be configured? CasIOUtils doesn’t seem to have an option for this, so I
>>>>> assume it’s something you have working in your branch.
>>>>>
>>>>> Regarding the other problem: it seems that the JDK bug is fixed in Java
>>>>> 9 and later. Do you think switching to a more recent Java version would
>>>>> make a difference? I think we can also try this out ourselves when we
>>>>> look into migrating to UIMA 3 once our current deliveries are complete.
>>>>> We would also like to switch to Java 11, and like the UIMA 3 migration
>>>>> it will require some thorough testing.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote:
>>>>>>
>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>>>>> xml
>>>>>> char, which is the \u0002.
>>>>>>
>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>
>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>
>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>>>>> xml 1.1:
>>>>>>
>>>>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>>   try {
>>>>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>>   } finally {
>>>>>>     out.close();
>>>>>>   }
>>>>>>
>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>
>>>>>> I also tried serializing some doc text which includes the code point
>>>>>> 77987 (U+130A3). That did not serialize correctly.
>>>>>> While tracing, I could follow the value down into the innards of some
>>>>>> internal sax java code
>>>>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>>>>> where it was still correct in the Java string.
>>>>>>
>>>>>> When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.
>>>>>>
>>>>>> That is 1110 0111 1001 1110 1001 1000 0011 0111, whose first three bytes
>>>>>> form a utf-8 3-byte encoding (1110 xxxx 10xx xxxx 10xx xxxx)
>>>>>>
>>>>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8" - so it looks fishy to me.
>>>>>>
>>>>>> But I think it's out of our hands - it's somewhere deep in the sax
>>>>>> transform
>>>>>> java code.
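
[Editorial sketch, not from the thread's test case: the correct UTF-8 bytes for that code point can be checked with the JDK alone. U+130A3 should serialize as the 4-byte sequence F0 93 82 A3, not the E7 9E 98 37 observed above. The class and method names here are illustrative.]

```java
import java.nio.charset.StandardCharsets;

// Sketch: what the UTF-8 bytes for U+130A3 (code point 77987) should be.
public class Utf8Demo {
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\uD80C\uDCA3";      // the surrogate pair encoding U+130A3
        System.out.println(utf8Hex(s)); // F0 93 82 A3 - the correct 4-byte encoding
    }
}
```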
>>>>>>
>>>>>> I looked for a bug report and found some
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>
>>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>> here's an idea.
>>>>>>>
>>>>>>> If you have a string, with the surrogate pair 𓂣 at position 10,
>>>>>>> and you
>>>>>>> have some Java code, which is iterating through the string and getting
>>>>>>> the
>>>>>>> code-point at each character offset, then that code will produce:
>>>>>>>
>>>>>>> at position 10: the code-point 77987
>>>>>>> at position 11: the code-point 56483
>>>>>>>
>>>>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>>>>> assuming you
>>>>>>> have characters at each point, if you don't handle surrogate pairs.
>>>>>>>
>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00
>>>>>>> (see
>>>>>>> https://tools.ietf.org/html/rfc2781 )
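
[Editorial sketch of the buggy iteration described above, plain JDK, no UIMA; the class and method names are mine. Stepping through the string one char at a time, instead of advancing by Character.charCount, yields 77987 at the high-surrogate index and the bare low surrogate 56483 (0xDCA3) at the next index.]

```java
// Sketch: iterating char-by-char over a string containing a surrogate pair.
public class SurrogateDemo {
    public static int[] codePointPerCharIndex(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            // Bug illustrated: stepping by 1 char, not by Character.charCount(cp)
            out[i] = s.codePointAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "0123456789" + "\uD80C\uDCA3"; // surrogate pair at index 10
        int[] cps = codePointPerCharIndex(s);
        System.out.println(cps[10]); // 77987 - the real code point, U+130A3
        System.out.println(cps[11]); // 56483 - just the low surrogate, 0xDCA3
    }
}
```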
>>>>>>>
>>>>>>> I worry that even tools like the CVD or similar may not work properly,
>>>>>>> since
>>>>>>> they're not designed to handle surrogate pairs, I think, so I have no
>>>>>>> idea if
>>>>>>> they would work well enough for you.
>>>>>>>
>>>>>>> I'll poke around some more to see if I can enable the conversion for
>>>>>>> document
>>>>>>> strings.
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>> Thanks Marshall,
>>>>>>>>
>>>>>>>> Encoding the characters like you suggest should work just fine for us
>>>>>>>> as long as we can serialize and deserialise the XMI, so that we can
>>>>>>>> open the content in a tool like the CVD or similar. These characters
>>>>>>>> are just noise from the original content that happen to remain in the
>>>>>>>> CAS, but they are not visible in our final output because they are
>>>>>>>> basically filtered out one way or the other by downstream components.
>>>>>>>> They become a problem though when they make it more difficult for us
>>>>>>>> to inspect the content.
>>>>>>>>
>>>>>>>> Regarding the feature name issue: Might you have an idea why we are
>>>>>>>> getting a different XMI output for the same character in our actual
>>>>>>>> pipeline, where it results in "𓂣�”? I investigated the
>>>>>>>> value in the debugger again, and like you are illustrating it is also
>>>>>>>> just a single codepoint with the value 77987. We are simply not able
>>>>>>>> to load this XMI because of this, but unfortunately I couldn’t
>>>>>>>> reproduce it in my small example.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>>>>>> properties, due to
>>>>>>>>> that unicode character.
>>>>>>>>>
>>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a
>>>>>>>>> 1-unicode-character string, which must be encoded as 2 java characters.
>>>>>>>>>
>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord
>>>>>>>>> xmi:id="18"
>>>>>>>>> name="𓂣" value="1.0"/>
>>>>>>>>> which seems correct. The name field only has 1 (extended)unicode
>>>>>>>>> character
>>>>>>>>> (taking 2 Java characters to represent),
>>>>>>>>> due to setting it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>>>>
>>>>>>>>> When read in, the name field is assigned to a String, that string
>>>>>>>>> says it has a
>>>>>>>>> length of 2 (but that's because it takes 2 java chars to represent
>>>>>>>>> this char).
>>>>>>>>> If you have the name string in a variable "n", and do
>>>>>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>
>>>>>>>>> So, the string value serialization and deserialization seems to be
>>>>>>>>> "working".
>>>>>>>>>
>>>>>>>>> The other code - for the sofa (document) serialization - is throwing
>>>>>>>>> that error because, as currently designed, the serialization code
>>>>>>>>> checks for these kinds of characters and, if any are found, throws
>>>>>>>>> that exception. The checking is done in
>>>>>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>>>>>
>>>>>>>>> This is because it's highly likely that "fixing this" in the same way
>>>>>>>>> as the other would result in hard-to-diagnose future errors: the
>>>>>>>>> subject-of-analysis string is processed with begin / end offsets all
>>>>>>>>> over the place, under the assumption that none of the characters are
>>>>>>>>> coded as surrogate pairs.
>>>>>>>>>
>>>>>>>>> We could change the code to output these like the name, as, e.g.,
>>>>>>>>> 𓂣
>>>>>>>>>
>>>>>>>>> Would that help in your case, or do you imagine other kinds of things
>>>>>>>>> might
>>>>>>>>> break (due to begin/end offsets no longer
>>>>>>>>> being on character boundaries, for example).
>>>>>>>>>
>>>>>>>>> -Marshall
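
[Editorial sketch of the idea just mentioned - emitting supplementary characters as numeric character references, e.g. &#77987; for U+130A3. This is an illustrative plain-JDK helper, not the actual UIMA serializer code.]

```java
// Sketch: escape supplementary characters as XML numeric character references.
public class NcrEscape {
    public static String escapeSupplementary(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp); // advance past both surrogate halves
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSupplementary("abc\uD80C\uDCA3def")); // abc&#77987;def
    }
}
```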
>>>>>>>>>
>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>>>
>>>>>>>>>> It is related to special unicode characters that are not handled by
>>>>>>>>>> XMI
>>>>>>>>>> serialisation, and there seem to be two distinct categories of
>>>>>>>>>> issues we have identified so far.
>>>>>>>>>>
>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>> 2) Annotations with String features have values containing special
>>>>>>>>>> unicode
>>>>>>>>>> characters
>>>>>>>>>>
>>>>>>>>>> In both cases we could for sure solve the problem if we did a better
>>>>>>>>>> clean up
>>>>>>>>>> job upstream, but with the amount and variety of data we receive
>>>>>>>>>> there is
>>>>>>>>>> always a chance something passes through, and some of it may in the
>>>>>>>>>> general
>>>>>>>>>> case even be valid content.
>>>>>>>>>>
>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example
>>>>>>>>>> I
>>>>>>>>>> attached. In this example the text is a snippet taken from the
>>>>>>>>>> content of a
>>>>>>>>>> parsed XML document.
>>>>>>>>>>
>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>>>>>>> example,
>>>>>>>>>> because I am getting slightly different output to what I have in our
>>>>>>>>>> real
>>>>>>>>>> setup. The OddFeatureText example is based on the simple type system
>>>>>>>>>> I shared
>>>>>>>>>> previously. The name value of a FeatureRecord contains special
>>>>>>>>>> unicode
>>>>>>>>>> characters that I found in a similar data structure in our actual
>>>>>>>>>> CAS. The
>>>>>>>>>> value comes from an external knowledge base holding some noisy
>>>>>>>>>> strings, which
>>>>>>>>>> in this case is a hieroglyph entity. However, when I write the CAS
>>>>>>>>>> to XMI
>>>>>>>>>> using the small example it only outputs the first of the two
>>>>>>>>>> characters in
>>>>>>>>>> "\uD80C\uDCA3”, which yields the value "𓂣” in the XMI, but in
>>>>>>>>>> our
>>>>>>>>>> actual setup both character values are written as
>>>>>>>>>> "𓂣�”. This
>>>>>>>>>> means that the attached example will for some reason parse the XMI
>>>>>>>>>> again, but
>>>>>>>>>> it will not work in the case where both characters are written the
>>>>>>>>>> way we
>>>>>>>>>> experience it. The XMI can be manually changed, so that both
>>>>>>>>>> character values
>>>>>>>>>> are included the way it happens in our output, and in this case a
>>>>>>>>>> SAXParserException happens.
>>>>>>>>>>
>>>>>>>>>> I don’t know whether it is outside the scope of the XMI serialiser
>>>>>>>>>> to handle
>>>>>>>>>> any of this, but it will be good to know in any case :)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated
>>>>>>>>>>> and I
>>>>>>>>>>> think it touches upon something important, which is about data
>>>>>>>>>>> migration in
>>>>>>>>>>> general.
>>>>>>>>>>>
>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward
>>>>>>>>>>> or complex
>>>>>>>>>>> and the way forward is not to address our use case alone. I think
>>>>>>>>>>> there is a
>>>>>>>>>>> need for a compact and efficient binary serialization format for
>>>>>>>>>>> the CAS when
>>>>>>>>>>> dealing with large amounts of data because this is directly visible
>>>>>>>>>>> in costs
>>>>>>>>>>> of processing and storing, and I found the compressed binary format
>>>>>>>>>>> to be
>>>>>>>>>>> much better than XMI in this regard, although I have to admit it’s
>>>>>>>>>>> been a
>>>>>>>>>>> while since I benchmarked this. Given that UIMA already has a well
>>>>>>>>>>> described
>>>>>>>>>>> type system then maybe it just lacks a way to describe schema
>>>>>>>>>>> evolution
>>>>>>>>>>> similar to Apache Avro or other serialisation frameworks. I think
>>>>>>>>>>> a more
>>>>>>>>>>> formal approach to data migration would be critical to any larger
>>>>>>>>>>> operational
>>>>>>>>>>> setup.
>>>>>>>>>>>
>>>>>>>>>>> Regarding XMI I like to provide some input to the problem we are
>>>>>>>>>>> observing,
>>>>>>>>>>> so that it can be solved. We are primarily using XMI for
>>>>>>>>>>> inspection/debugging
>>>>>>>>>>> purposes, and we are sometimes not able to do this because of this
>>>>>>>>>>> error. I
>>>>>>>>>>> will try to extract a minimal example to avoid involving parts that
>>>>>>>>>>> have to do
>>>>>>>>>>> with our pipeline and type system, and I think this would also be
>>>>>>>>>>> the best
>>>>>>>>>>> way to illustrate that the problem exists outside of this context.
>>>>>>>>>>> However,
>>>>>>>>>>> converting all our data to XMI first in order to do the conversion
>>>>>>>>>>> in our
>>>>>>>>>>> example would not be very practical for us, because it involves a
>>>>>>>>>>> large
>>>>>>>>>>> amount of data.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
>>>>>>>>>>>
>>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureAnnotation each of which
>>>>>>>>>>>> has 5 slots: sofaRef, begin, end, name,
>>>>>>>>>>>> value
>>>>>>>>>>>>
>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>
>>>>>>>>>>>> Container
>>>>>>>>>>>> features -> FSArray of FeatureRecord each of which
>>>>>>>>>>>> has 2 slots: name, value
>>>>>>>>>>>>
>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>> map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>
>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
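
[Editorial sketch of that default mapping, with plain Java Maps standing in for feature structures; this is an illustration of the idea, not actual CasTypeSystemMapper code.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustration: copy equal-named features from a source FS (type A) into
// a target FS (type B), dropping features the target type doesn't have.
public class FeatureMapper {
    public static Map<String, Object> mapFeatures(Map<String, Object> source,
                                                  Set<String> targetFeatures) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatures.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue());
            }
        }
        return target;
    }
}
```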
>>>>>>>>>>>>
>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>>>>>> namely, only
>>>>>>>>>>>> those referenced by the FSArray where the element type changed.
>>>>>>>>>>>> Seems complex
>>>>>>>>>>>> and specific to this use case though.
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>>>>>>> deserialization
>>>>>>>>>>>>>> code compares the two type systems, and allows for some
>>>>>>>>>>>>>> mismatches (things
>>>>>>>>>>>>>> present in one and not in the other), but it doesn't allow for
>>>>>>>>>>>>>> having a
>>>>>>>>>>>>>> feature
>>>>>>>>>>>>>> whose range (value) is type XXXX in one type system and type
>>>>>>>>>>>>>> YYYY in the
>>>>>>>>>>>>>> other.
>>>>>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>> Without reading the code in detail - could we not relax this
>>>>>>>>>>>>> check such
>>>>>>>>>>>>> that the element type of FSArrays is not checked and the code
>>>>>>>>>>>>> simply
>>>>>>>>>>>>> assumes that the source element type has the same features as the
>>>>>>>>>>>>> target
>>>>>>>>>>>>> element type (with the usual lenient handling of missing features
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Richard
>