Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks. I will take a look at it and then get back to you.

Cheers,
Mario

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Marshall Schor
Here's code that works that serializes in 1.1 format.

The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".

XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
try {
  XMLSerializer xml11Serializer = new XMLSerializer(out);
  xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
  xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
}
finally {
  out.close();
}

This is from a test case. -Marshall
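For context on why the VERSION output property matters: XML 1.0's Char production excludes most C0 control characters (below \u0020 only #x9, #xA and #xD are legal), while XML 1.1 permits every code point from #x1 upward, apart from surrogates and the #xFFFE/#xFFFF noncharacters. A minimal sketch of the two validity checks, taken from the character ranges in the XML specs (the class and method names are illustrative, not UIMA or JDK APIs):

```java
// Character-range checks from the Char productions of the XML 1.0 and 1.1 specs.
public class XmlCharValidity {

    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x2)); // false: \u0002 is not a legal XML 1.0 character
        System.out.println(isXml11Char(0x2)); // true: legal in XML 1.1
    }
}
```

This is why serializing a CAS whose document text contains \u0002 throws under the default XML 1.0 output but succeeds once the version is set to "1.1".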

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks Marshall,

If you prefer then I can also have a look at it, although I probably need to 
finish something first within the next 3-4 weeks. It would probably get me 
started faster if you could share some of your experimental sample code.

Cheers,
Mario

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-24 Thread Marshall Schor
yes, makes sense, thanks for posting the Jira.

If no one else steps up to work on this, I'll probably take a look in a few
days. -Marshall

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-24 Thread Mario Juric
Hi Marshall,

I added the following feature request to Apache Jira:

https://issues.apache.org/jira/browse/UIMA-6128

Hope it makes sense :)

Thanks a lot for the help, it’s appreciated.

Cheers,
Mario

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor
re: using a later Java - that might make a difference, since fixes keep getting
added.

Some fixes, however, as you've noted, are backported to previous
versions.

-Marshall

On 9/23/2019 3:45 AM, Mario Juric wrote:
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace/invisible 
> characters, but I don’t think we can necessarily do the same for some of the 
> other characters. It sounds to me though that serialising to XML 1.1 could 
> also be a simple fix right now, but can this be configured? CasIOUtils 
> doesn’t seem to have an option for this, so I assume it’s something you have 
> working in your branch.
>
> Regarding the other problem. It seems that the JDK bug is fixed from Java 9 
> and after. Do you think switching to a more recent Java version would make a 
> difference? I think we can also try this out ourselves when we look into 
> migrating to UIMA 3 once our current deliveries are complete. We also like to 
> switch to Java 11, and like UIMA 3 migration it will require some thorough 
> testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52 , Marshall Schor  wrote:
>>
>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
>> char, which is the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>
>> XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>> OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>> try {
>>   XMLSerializer xml11Serializer = new XMLSerializer(out);
>>   xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>   xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>> }
>> finally {
>>   out.close();
>> }
>>
>> This succeeds and serializes this using xml 1.1.
>>
>> I also tried serializing some doc text which includes \u77987.  That did not
>> serialize correctly.
>> While tracing, I could see it down in the innards of some internal sax java code,
>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize, where it was
>> "Correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>
>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte
>> encoding:
>>
>>   1110xxxx 10xxxxxx 10xxxxxx
>>
>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy to me.
>>
>> But I think it's out of our hands - it's somewhere deep in the sax transform
>> java code.
>>
>> I looked for a bug report and found some
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line, I think, is to clean out these characters early :-).
>>
>> -Marshall
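[Editorial note: the E79E 9837 bytes above have a plausible mundane explanation, though this is an assumption on my part rather than something stated in the thread. In Java source, the literal "\u77987" is the four-digit escape \u7798 followed by the ASCII digit '7', and U+7798 encodes in UTF-8 as exactly E7 9E 98, with '7' contributing the trailing 0x37. A small self-contained demo (class and helper names are mine):]

```java
import java.nio.charset.StandardCharsets;

// Demonstrates that "\u77987" in Java source is parsed as the single
// character \u7798 followed by the literal digit '7', whose combined
// UTF-8 bytes are exactly E7 9E 98 37.
public class Utf8EscapeDemo {

    // Render a byte array as uppercase hex, two digits per byte.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u77987".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // E79E9837
    }
}
```

So the serializer output may not be fishy at all: it is the correct UTF-8 for the two characters the Java compiler actually saw.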
>>
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> here's an idea.
>>>
>>> If you have a string, with the surrogate pair  at position 10, and 
>>> you
>>> have some Java code, which is iterating through the string and getting the
>>> code-point at each character offset, then that code will produce:
>>>
>>> at position 10:  the code-point 77987
>>> at position 11:  the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string of characters, assuming 
>>> you
>>> have characters at each point, if you don't handle surrogate pairs.
>>>
>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see
>>> https://tools.ietf.org/html/rfc2781 )
>>>
>>> I worry that even tools like the CVD or similar may not work properly, since
>>> they're not designed to handle surrogate pairs, I think, so I have no idea 
>>> if
>>> they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for 
>>> document
>>> strings.
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
 Thanks Marshall,

 Encoding the characters like you suggest should work just fine for us as 
 long as we can serialize and deserialise the XMI, so that we can open the 
 content in a tool like the CVD or similar. These characters are just noise 
 from the original content that happen to remain in the CAS, but they are 
 not visible in our final output because they are basically filtered out 
 one way or the other by downstream components. They become a problem 
 though when they make it more difficult for us to inspect the content.

 Regarding the feature name issue: Might you have an idea why we are 
 getting a different XMI output for the same character in our actual 
 pipeline, where it results in "”? I investigated the value 
 in the debugger again, and like you are illustrating it is also 
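[Editorial note: the charAt-offset behaviour Marshall describes in the quoted 1:28 PM message is easy to reproduce. The sketch below is mine, not code from the thread; it assumes the surrogate pair in question encodes code point 77987, i.e. U+130A3, which Java represents as the pair \uD80C \uDCA3.]

```java
// Demonstrates how iterating a String by char offsets splits a surrogate
// pair into two "code points", reproducing the 77987 / 56483 values.
public class SurrogatePairDemo {

    // Collect codePointAt(i) for every char offset i. codePointAt pairs a
    // high surrogate with the following low surrogate, but an offset that
    // lands on the low surrogate alone yields just that surrogate's value
    // (0xDC00 plus the low ten bits of the code point).
    public static int[] codePointsByCharOffset(String s) {
        int[] result = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            result[i] = s.codePointAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Ten filler chars, then U+130A3 (code point 77987) at position 10.
        String s = "0123456789" + new String(Character.toChars(0x130A3));
        int[] cps = codePointsByCharOffset(s);
        System.out.println(cps[10]); // 77987
        System.out.println(cps[11]); // 56483 (= 0xDCA3, the bare low surrogate)
    }
}
```

Advancing with Character.offsetByCodePoints, or streaming via String.codePoints(), avoids the double count.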

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor
Re: serializing using XML 1.1

This was not thought of when setting up the CasIOUtils.

The way it was done (above) was using some more "primitive/lower level" APIs,
rather than the CasIOUtils.

Please open a Jira ticket for this, with perhaps some suggestions on how it
might be specified in the CasIOUtils APIs.

Thanks! -Marshall

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Mario Juric
Hi Marshall,

Seems the bug was already resolved in 8u92 by one of the backports:

https://bugs.openjdk.java.net/browse/JDK-8141098 


Cheers,
Mario














Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Mario Juric
Hi Marshall,

Thanks for the thorough and excellent investigation.

We are looking into possible normalisation/cleanup of whitespace/invisible 
characters, but I don’t think we can necessarily do the same for some of the 
other characters. It sounds to me though that serialising to XML 1.1 could also 
be a simple fix right now, but can this be configured? CasIOUtils doesn’t seem 
to have an option for this, so I assume it’s something you have working in your 
branch.

Regarding the other problem: it seems that the JDK bug is fixed from Java 9 onwards. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We would also like to switch to Java 11, and like the UIMA 3 migration it will require some thorough testing.

Cheers,
Mario














Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor
In the test "OddDocumentText", this produces a "throw" due to an invalid xml
char, which is the \u0002.

This is in part because the xml version being used is xml 1.0.

XML 1.1 expanded the set of valid characters to include \u0002.

Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:

    XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
    OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
    try {
      XMLSerializer xml11Serializer = new XMLSerializer(out);
      xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
      xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
    } finally {
      out.close();
    }

This succeeds and serializes the CAS using XML 1.1.

I also tried serializing some doc text which includes code point 77987 (U+130A3). That did not serialize correctly.
I could see it while tracing down into the innards of some internal SAX java code
(com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize), where it was
"correct" in the Java string.

When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.

That is 1110 0111 1001 1110 1001 1000 0011 0111; the first three bytes match the UTF-8 3-byte pattern:
    1110xxxx 10xxxxxx 10xxxxxx

whose payload bits are 0111 0111 1001 1000, i.e. hex 7798, so it looks fishy to me.
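For comparison, the correct UTF-8 encoding of U+130A3 is the four bytes F0 93 82 A3, which Java's own encoder produces. A quick standalone check (plain Java, no UIMA or SAX involved):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // U+130A3 as a Java string is the surrogate pair D80C DCA3
        byte[] b = "\uD80C\uDCA3".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte x : b) hex.append(String.format("%02X ", x));
        System.out.println(hex.toString().trim()); // F0 93 82 A3
    }
}
```

So the E7 9E 98 37 output above cannot be a correct encoding of that code point; the bug is downstream of the Java string.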

But I think it's out of our hands - it's somewhere deep in the sax transform
java code.

I looked for a bug report and found some
https://bugs.openjdk.java.net/browse/JDK-8058175

Bottom line is, I think, to clean out these characters early :-).
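A sketch of that early cleanup (a hypothetical helper, not part of UIMA): drop anything outside the XML 1.0 character ranges before the text is ever set on the CAS.

```java
public class XmlCleaner {
    // Removes characters that are not legal in XML 1.0 documents.
    // Note: supplementary characters (>= U+10000) ARE legal XML and are kept.
    static String stripNonXml10(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (ok) sb.appendCodePoint(cp);
            i += Character.charCount(cp); // advances 2 chars over surrogate pairs
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0002 is stripped, ordinary text and surrogate pairs pass through
        System.out.println(stripNonXml10("a\u0002b")); // ab
    }
}
```

Whether you also want to strip legal-but-troublesome supplementary characters (to keep begin/end offsets surrogate-free) is a separate policy decision.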

-Marshall


Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor
here's an idea.

If you have a string with the surrogate pair \uD80C\uDCA3 at position 10, and you
have some Java code which iterates through the string and gets the
code-point at each character offset, then that code will produce:

at position 10:  the code-point 77987
at position 11:  the code-point 56483

Of course, it's a "bug" to iterate through a string of characters, assuming you
have characters at each point, if you don't handle surrogate pairs.

The 56483 is just the low 10 bits of the surrogate pair added to 0xDC00 (see
https://tools.ietf.org/html/rfc2781 )
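A minimal sketch of the behaviour described above (plain Java, nothing UIMA-specific assumed):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+130A3 (decimal 77987) encoded as the surrogate pair D80C DCA3,
        // sitting at char position 10
        String s = "0123456789" + "\uD80C\uDCA3";

        // Naive: asking for a "code point" at every char offset
        System.out.println(s.codePointAt(10)); // 77987 (the real code point)
        System.out.println(s.codePointAt(11)); // 56483 (low surrogate 0xDCA3 alone)

        // Correct: advance by Character.charCount(cp) per code point
        for (int i = 10; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("code point: " + cp);
            i += Character.charCount(cp); // steps over both chars of the pair
        }
        // the loop prints only: code point: 77987
    }
}
```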

I worry that even tools like the CVD or similar may not work properly, since
they're not designed to handle surrogate pairs, I think, so I have no idea if
they would work well enough for you.

I'll poke around some more to see if I can enable the conversion for document
strings.

-Marshall

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Mario Juric
Thanks Marshall,

Encoding the characters like you suggest should work just fine for us as long 
as we can serialize and deserialise the XMI, so that we can open the content in 
a tool like the CVD or similar. These characters are just noise from the 
original content that happen to remain in the CAS, but they are not visible in 
our final output because they are basically filtered out one way or the other 
by downstream components. They become a problem though when they make it more 
difficult for us to inspect the content.

Regarding the feature name issue: Might you have an idea why we are getting a 
different XMI output for the same character in our actual pipeline, where it 
results in "”? I investigated the value in the debugger again, 
and like you are illustrating it is also just a single codepoint with the value 
77987. We are simply not able to load this XMI because of this, but 
unfortunately I couldn’t reproduce it in my small example.

Cheers,
Mario













Re: Migrating type system of form 6 compressed CAS binaries

2019-09-19 Thread Marshall Schor
The odd-feature-text seems to work OK, but has some unusual properties, due to
that unicode character.

Here's what I see:  The FeatureRecord "name" field is set to a single
Unicode character that must be encoded as 2 java characters.

When output, it shows up in the xmi as 
which seems correct.  The name field only has 1 (extended)unicode character
(taking 2 Java characters to represent),
due to setting it with this code:   String oddName = "\uD80C\uDCA3";

When read in, the name field is assigned to a String, that string says it has a
length of 2 (but that's because it takes 2 java chars to represent this char).
If you have the name string in a variable "n", and do
System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
n.codePointCount(0, n.length()) is, as expected, 1.

So, the string value serialization and deserialization seems to be "working".

The other code path - for the sofa (document) serialization - is throwing that error
because, as currently designed, the serialization code checks for these kinds of
characters and, if found, throws that exception.  The checking code is
in XMLUtils.checkForNonXmlCharacters.

This is because "fixing this" in the same way as the other would very likely result
in hard-to-diagnose future errors: the subject-of-analysis string is processed with
begin/end offsets all over the place, under the assumption that none of its
characters are coded as surrogate pairs.
We could change the code to output these like the name, as, e.g.,   

Would that help in your case, or do you imagine other kinds of things might
break (due to begin/end offsets no longer
being on character boundaries, for example).
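To make the offset concern concrete - a hypothetical illustration, not actual UIMA code: an end offset that lands between the two halves of a surrogate pair produces a malformed covered-text string.

```java
public class OffsetSplitDemo {
    public static void main(String[] args) {
        // U+130A3 occupies the two char positions 3 and 4
        String doc = "hi \uD80C\uDCA3!";
        // An annotation with begin=0, end=4 would split the pair:
        String covered = doc.substring(0, 4);
        // covered now ends with an unpaired high surrogate
        System.out.println(covered.endsWith("\uD80C")); // true
    }
}
```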

-Marshall






Re: Migrating type system of form 6 compressed CAS binaries

2019-09-18 Thread Mario Juric
Hi,I investigated the XMI issue as promised and these are my findings.It is related to special unicode characters that are not handled by XMI serialisation, and there seems to be two distinct categories of issues we have identified so far.1) The document text of the CAS contains special unicode characters2) Annotations with String features have values containing special unicode charactersIn both cases we could for sure solve the problem if we did a better clean up job upstream, but with the amount and variety of data we receive there is always a chance something passes through, and some of it may in the general case even be valid content.The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.The other case was not possible to reproduce with the OddFeatureText example, because I am getting slightly different output to what I have in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special unicode characters that I found in a similar data structure in our actual CAS. The value comes from an external knowledge base holding some noisy strings, which in this case is a hieroglyph entity. However, when I write the CAS to XMI using the small example it only outputs the first of the two characters in "\uD80C\uDCA3”, which yields the value "” in the XMI, but in our actual setup both character values are written as "”. This means that the attached example will for some reason parse the XMI again, but it will not work in the case where both characters are written the way we experience it. The XMI can be manually changed, so that both character values are included the way it happens in our output, and in this case a SAXParserException happens.I don’t know whether it is outside the scope of the XMI serialiser to handle any of this, but it will be good to know in any case :)
Cheers,
Mario
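The failure mode described above (only the high half of the surrogate pair "\uD80C\uDCA3" surviving serialization) can be illustrated without any UIMA code: a lone surrogate is not a valid XML character, so a SAX parser rejects the document on re-read. Below is a minimal sketch of such a validity check; the class and method names are made up for illustration.

```java
// Sketch of the two failure modes discussed above, in plain Java:
// (a) unpaired surrogates, e.g. when only the high half of "\uD80C\uDCA3"
//     is written, and
// (b) control characters outside the XML 1.0 Char production.
class XmlCharCheck {

    /** Returns true if s can appear as character data in well-formed XML 1.0. */
    static boolean isValidXml10(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // a high surrogate must be followed by a low surrogate
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // valid pair, skip the low half
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high half
            } else if (!(c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD))) {
                return false; // control character etc. outside the Char production
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10("\uD80C\uDCA3")); // complete pair: true
        System.out.println(isValidXml10("\uD80C"));       // lone high surrogate: false
        System.out.println(isValidXml10("ok \u0001"));    // control char: false
    }
}
```

Note that XML 1.1 relaxes the rules for control characters (hence the `OutputKeys.VERSION` = "1.1" workaround discussed later in the thread), but an unpaired surrogate is invalid in both versions.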

OddDocumentText.java
Description: Binary data


OddFeatureText.java
Description: Binary data


On 17 Sep 2019, at 09:36 , Mario Juric  wrote:

Thank you very much for looking into this. It is really appreciated, and I think it touches upon something important, which is data migration in general.

I agree that some of these solutions can appear specific, awkward or complex, and the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data, because this is directly visible in the costs of processing and storing, and I found the compressed binary format to be much better than XMI in this regard, although I have to admit it's been a while since I benchmarked this. Given that UIMA already has a well-described type system, maybe it just lacks a way to describe schema evolution similar to Apache Avro or similar serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.

Regarding XMI, I would like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimal example to avoid involving parts that have to do with our pipeline and type system, and I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.

Cheers,
Mario


Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
In this case, the original looks kind-of like this:

Container
   features -> FSArray of FeatureAnnotation each of which
 has 5 slots: sofaRef, begin, end, name, value

the new TypeSystem has

Container
   features -> FSArray of FeatureRecord each of which
                              has 2 slots: name, value

The deserializer code would need some way to decide how to
   1) create an FSArray of FeatureRecord,
   2) for each element,
  map the FeatureAnnotation to a new instance of FeatureRecord

I guess I could imagine a default mapping (for item 2 above) of
  1) change the type from A to B
  2) set equal-named features from A to B, drop other features

This mapping would need to apply to a subset of the A's and B's, namely, only
those referenced by the FSArray where the element type changed.  Seems complex
and specific to this use case though.
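A rough sketch of that default mapping, with plain Java Maps standing in for feature structures (nothing here is UIMA API; the FeatureAnnotation/FeatureRecord slot names are taken from the thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed default mapping: given a source FS of type A (here
// just a Map of feature name -> value) and the target type B's feature set,
// copy the equal-named features and drop the others.
class DefaultFeatureMapping {

    static Map<String, Object> mapToTargetType(Map<String, Object> source,
                                               Set<String> targetFeatures) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatures.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue()); // equal-named: copy
            }
            // features missing in the target (sofaRef, begin, end) are dropped
        }
        return target;
    }

    public static void main(String[] args) {
        // FeatureAnnotation: sofaRef, begin, end, name, value
        Map<String, Object> featureAnnotation = new LinkedHashMap<>();
        featureAnnotation.put("sofaRef", 1);
        featureAnnotation.put("begin", 0);
        featureAnnotation.put("end", 5);
        featureAnnotation.put("name", "color");
        featureAnnotation.put("value", "red");

        // FeatureRecord: name, value
        Map<String, Object> record =
                mapToTargetType(featureAnnotation, Set.of("name", "value"));
        System.out.println(record); // {name=color, value=red}
    }
}
```

The hard part in the real deserializer is not this copy step but deciding *which* instances it applies to, i.e. only those reachable through the FSArray whose element type changed.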

-Marshall


On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
> On 16. Sep 2019, at 19:05, Marshall Schor  wrote:
>> I can reproduce the problem, and see what is happening.  The deserialization
>> code compares the two type systems, and allows for some mismatches (things
>> present in one and not in the other), but it doesn't allow for having a 
>> feature
>> whose range (value) is type  in one type system and type  in the 
>> other. 
>> See CasTypeSystemMapper lines 299 - 315.
> Without reading the code in detail - could we not relax this check such that 
> the element type of FSArrays is not checked and the code simply assumes that 
> the source element type has the same features as the target element type 
> (with the usual lenient handling of missing features in the target type)? - 
> Kind of a "duck typing" approach?
>
> Cheers,
>
> -- Richard


Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Richard Eckart de Castilho
On 16. Sep 2019, at 19:05, Marshall Schor  wrote:
> 
> I can reproduce the problem, and see what is happening.  The deserialization
> code compares the two type systems, and allows for some mismatches (things
> present in one and not in the other), but it doesn't allow for having a 
> feature
> whose range (value) is type  in one type system and type  in the 
> other. 
> See CasTypeSystemMapper lines 299 - 315.

Without reading the code in detail - could we not relax this check such that 
the element type of FSArrays is not checked and the code simply assumes that 
the source element type has the same features as the target element type (with 
the usual lenient handling of missing features in the target type)? - Kind of a 
"duck typing" approach?

Cheers,

-- Richard

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
I can reproduce the problem, and see what is happening.  The deserialization
code compares the two type systems, and allows for some mismatches (things
present in one and not in the other), but it doesn't allow for having a feature
whose range (value) is type  in one type system and type  in the other. 
See CasTypeSystemMapper lines 299 - 315.

It may not be easy to fix.  Basically, the deserialization routines are set up
with a lenient kind of accommodation for different type systems, where they can
"skip" over types and features that are missing. 

This particular transformation needs to run a value conversion - from
FeatureAnnotation to FeatureRecord. 

I'm thinking of various approaches, and putting these out for others to expand
upon, etc.

1) Along the lines of Richard's remark, fix the xmi serialization to work with
all binary data, perhaps by base-64 encoding problematic (or specified by
feature name, or all) values, or - if it turns out to just be some "bug" -
fixing the bug.

2) Allow the user to specify some kind of call-back function, in the
deserializer, when the range of the feature doesn't match.  This would take some
kind of representation of the feature value in typesystem1, and the type of the
feature value in type system 2, and would need to produce the value in type
system 2.  This may be quite problematic/awkward to carry out in all the
generalized edge cases, for instance if there are "forward" references to things
not yet deserialized, etc.
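The call-back in approach 2 might look something like the following purely hypothetical interface; it does not exist in UIMA and only illustrates the shape of the idea:

```java
// Hypothetical shape of the call-back from approach 2: invoked by the
// deserializer whenever a feature's range type differs between the stored
// type system and the target type system. Arrays of strings stand in for
// feature-structure slots; this interface is NOT part of UIMA.
@FunctionalInterface
interface RangeMismatchConverter<S, T> {
    /** Produce the target-type-system value for a stored source value. */
    T convert(String featureName, S storedValue);
}

class CallbackSketch {
    public static void main(String[] args) {
        // e.g. turn a stored 5-slot FeatureAnnotation into a 2-slot
        // FeatureRecord, keeping only the name and value slots
        RangeMismatchConverter<String[], String[]> toRecord =
                (feature, slots) -> new String[] { slots[3], slots[4] };

        String[] featureAnnotation = { "1", "0", "5", "color", "red" };
        String[] featureRecord = toRecord.convert("features", featureAnnotation);
        System.out.println(String.join(",", featureRecord)); // color,red
    }
}
```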

At this point, I think #1 could be quite feasible.  To investigate further, it
would help to have a small test case where the xmi serialization currently is
not readable (due to - as you think - character coding issues).

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Best Regards,
>
> Mario Juric
> Principal Engineer
> *UNSILO.ai* 
> mobile:  +45 3082 4100
>
> skype: mario.juric.dk 
>
>
>
>
> Hi Marshall,
>
> I have a small test case  with 3 files excluding any JCasGen generated types
> and UIMAfit types file.
>
> First you will have to generate the types and run the SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference in
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries again without saving them first using the new type
> system.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>
>
>
>
>
>
>
>
>
>
>> On 13 Sep 2019, at 21:55 , Mario Juric > >
>> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon as I get the time to do it.
>> This will also get me a better understanding of the format.
>>
>>
>> Cheers,
>> Mario
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> On 13 Sep 2019, at 19:32 , Marshall Schor >> > wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem 
>>> with
>>> a small type system. 
>>>
>>> With that, I could run in the debugger and see exactly what was happening, 
>>> and
>>> see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch 
>>> between
>>> type systems, but mainly one where one is a subset of the other - see the
>>> javadoc for the method
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
 Just a quick follow up.

 I played a bit around with the CasIOUtils, and it seems that it is possible
 to load and use the embedded type system, i.e. the old type system with X,
 but I found no way to replace it with the new type system and make the
 necessary mappings to Y. I tried to see if I could use the CasCopier in a
 separate step but it expectedly fails when it reaches to the FSArray of X
 in the source CAS because the destination type system requires elements of
 type Y. I could make my own modified version of the CasCopier that could
 take some mapping functions for each pair of source 

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric
Yes, these were just generated from the type system file using JCasGen.



> On 16 Sep 2019, at 15:32 , Marshall Schor  wrote:
> 
> oops, ignore that - I see Container is a JCas class ...  -M
> 
> On 9/16/2019 9:30 AM, Marshall Schor wrote:
>> I may have some version problems.  The LoadCompressedBinary has refs to a class
>> "Container", but I don't seem to have that class - where is it coming from?
>> 
>> -Marshall
>> 
>> On 9/16/2019 8:11 AM, Mario Juric wrote:
>>> Hi Marshall,
>>> 
>>> I have a small test case  with 3 files excluding any JCasGen generated types
>>> and UIMAfit types file.
>>> 
>>> First you will have to generate the types and run the SaveCompressedBinary 
>>> to
>>> produce the 3 binary forms I have been experimenting with. You should then 
>>> be
>>> able to run LoadCompressedBinaries as expected.
>>> 
>>> Next you need to change the element type of Container.features from
>>> FeatureAnnotation to FeatureRecord in the type system and generate the type
>>> system again. Also change the FeatureAnnotation reference in
>>> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
>>> previously stored binaries again without saving them first using the new 
>>> type
>>> system.
>>> 
>>> You can see I have played with different ways of loading just to see if
>>> anything worked, but much of it seems to result in exactly the same calls in
>>> the lower layers. I didn’t get entirely the same results with the CAS we
>>> actually store as in this example. E.g. I experienced some EOF with the
>>> compressed filtered whereas I only get a class cast exception during
>>> verification in this example. Note also that we keep both types in the new
>>> type system, but we want to change the element type of the FSArray in the
>>> Container.
>>> 
>>> Hope this will yield some useful insights and thanks a lot :)
>>> 
>>> Cheers
>>> Mario
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
 On 13 Sep 2019, at 21:55 , Mario Juric >>> >
 wrote:
 
 Thanks Marshall,
 
 I’ll get back to you with a small sample as soon as I get the time to do it.
 This will also get me a better understanding of the format.
 
 
 Cheers,
 Mario
 
 
 
 
 
 
 
 
 
 
 
 
> On 13 Sep 2019, at 19:32 , Marshall Schor  > wrote:
> 
> I'm wondering if you could post a very small test case showing this 
> problem with
> a small type system. 
> 
> With that, I could run in the debugger and see exactly what was 
> happening, and
> see whether or not some small fix would make this work.
> 
> The Deserializer for this already supports a certain type of mismatch 
> between
> type systems, but mainly one where one is a subset of the other - see the
> javadoc for the method
> 
> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
> 
> But it must not currently cover this particular case.
> 
> -Marshall
> 
> On 9/13/2019 10:48 AM, Mario Juric wrote:
>> Just a quick follow up.
>> 
>> I played a bit around with the CasIOUtils, and it seems that it is 
>> possible
>> to load and use the embedded type system, i.e. the old type system with 
>> X,
>> but I found no way to replace it with the new type system and make the
>> necessary mappings to Y. I tried to see if I could use the CasCopier in a
>> separate step but it expectedly fails when it reaches to the FSArray of X
>> in the source CAS because the destination type system requires elements 
>> of
>> type Y. I could make my own modified version of the CasCopier that could
>> take some mapping functions for each pair of source and destination types
>> that need to be mapped, but this is where it starts to get too 
>> complicated,
>> so I found it not to be worth it at this point, since we might then want 
>> to
>> reprocess everything from scratch anyway.
>> 
>> Cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 12 Sep 2019, at 10:41 , Mario Juric >> > wrote:
>>> 
>>> Hi,
>>> 
>>> We use form 6 compressed binaries to persist the CAS. We now want to 
>>> make
>>> a change to the type system that is not directly compatible, although in
>>> principle the new type system is really a subset from a data 
>>> perspective,
>>> so we want to migrate existing binaries to the new type system, but we
>>> don’t know how. The change is as follows:
>>> 
>>> In the existing type system we have a type A with a FSArray feature of
>>> element 

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
oops, ignore that - I see Container is a JCas class ...  -M

On 9/16/2019 9:30 AM, Marshall Schor wrote:
> I may have some version problems.  The LoadCompressedBinary has refs to a class
> "Container", but I don't seem to have that class - where is it coming from?
>
> -Marshall
>
> On 9/16/2019 8:11 AM, Mario Juric wrote:
>> Hi Marshall,
>>
>> I have a small test case  with 3 files excluding any JCasGen generated types
>> and UIMAfit types file.
>>
>> First you will have to generate the types and run the SaveCompressedBinary to
>> produce the 3 binary forms I have been experimenting with. You should then 
>> be
>> able to run LoadCompressedBinaries as expected.
>>
>> Next you need to change the element type of Container.features from
>> FeatureAnnotation to FeatureRecord in the type system and generate the type
>> system again. Also change the FeatureAnnotation reference in
>> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
>> previously stored binaries again without saving them first using the new type
>> system.
>>
>> You can see I have played with different ways of loading just to see if
>> anything worked, but much of it seems to result in exactly the same calls in
>> the lower layers. I didn’t get entirely the same results with the CAS we
>> actually store as in this example. E.g. I experienced some EOF with the
>> compressed filtered whereas I only get a class cast exception during
>> verification in this example. Note also that we keep both types in the new
>> type system, but we want to change the element type of the FSArray in the
>> Container.
>>
>> Hope this will yield some useful insights and thanks a lot :)
>>
>> Cheers
>> Mario
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> On 13 Sep 2019, at 21:55 , Mario Juric >> >
>>> wrote:
>>>
>>> Thanks Marshall,
>>>
>>> I’ll get back to you with a small sample as soon as I get the time to do it.
>>> This will also get me a better understanding of the format.
>>>
>>>
>>> Cheers,
>>> Mario
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
 On 13 Sep 2019, at 19:32 , Marshall Schor >>> > wrote:

 I'm wondering if you could post a very small test case showing this 
 problem with
 a small type system. 

 With that, I could run in the debugger and see exactly what was happening, 
 and
 see whether or not some small fix would make this work.

 The Deserializer for this already supports a certain type of mismatch 
 between
 type systems, but mainly one where one is a subset of the other - see the
 javadoc for the method

 org.apache.uima.cas.impl.BinaryCasSerDes6.java.

 But it must not currently cover this particular case.

 -Marshall

 On 9/13/2019 10:48 AM, Mario Juric wrote:
> Just a quick follow up.
>
> I played a bit around with the CasIOUtils, and it seems that it is 
> possible
> to load and use the embedded type system, i.e. the old type system with X,
> but I found no way to replace it with the new type system and make the
> necessary mappings to Y. I tried to see if I could use the CasCopier in a
> separate step but it expectedly fails when it reaches to the FSArray of X
> in the source CAS because the destination type system requires elements of
> type Y. I could make my own modified version of the CasCopier that could
> take some mapping functions for each pair of source and destination types
> that need to be mapped, but this is where it starts to get too 
> complicated,
> so I found it not to be worth it at this point, since we might then want 
> to
> reprocess everything from scratch anyway.
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>
>> On 12 Sep 2019, at 10:41 , Mario Juric > > wrote:
>>
>> Hi,
>>
>> We use form 6 compressed binaries to persist the CAS. We now want to make
>> a change to the type system that is not directly compatible, although in
>> principle the new type system is really a subset from a data perspective,
>> so we want to migrate existing binaries to the new type system, but we
>> don’t know how. The change is as follows:
>>
>> In the existing type system we have a type A with a FSArray feature of
>> element type X, and we want to change X to Y where Y contains a genuine
>> feature subset of X. This means we basically want to replace X with Y for
>> the FSArray and ditch a few attributes of X when loading the CAS into the
>> new type system.
>>
>> Had the CAS been stored in JSON this would be trivial by just mapping the
>> attributes that they have in common, but when I try 

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
I may have some version problems.  The LoadCompressedBinary has refs to a class
"Container", but I don't seem to have that class - where is it coming from?

-Marshall

On 9/16/2019 8:11 AM, Mario Juric wrote:
>
> Hi Marshall,
>
> I have a small test case  with 3 files excluding any JCasGen generated types
> and UIMAfit types file.
>
> First you will have to generate the types and run the SaveCompressedBinary to
> produce the 3 binary forms I have been experimenting with. You should then be
> able to run LoadCompressedBinaries as expected.
>
> Next you need to change the element type of Container.features from
> FeatureAnnotation to FeatureRecord in the type system and generate the type
> system again. Also change the FeatureAnnotation reference in
> LoadCompressedBinaries l. 25 to FeatureRecord and then try to reload the
> previously stored binaries again without saving them first using the new type
> system.
>
> You can see I have played with different ways of loading just to see if
> anything worked, but much of it seems to result in exactly the same calls in
> the lower layers. I didn’t get entirely the same results with the CAS we
> actually store as in this example. E.g. I experienced some EOF with the
> compressed filtered whereas I only get a class cast exception during
> verification in this example. Note also that we keep both types in the new
> type system, but we want to change the element type of the FSArray in the
> Container.
>
> Hope this will yield some useful insights and thanks a lot :)
>
> Cheers
> Mario
>
>
>
>
>
>
>
>
>
>
>
>> On 13 Sep 2019, at 21:55 , Mario Juric > >
>> wrote:
>>
>> Thanks Marshall,
>>
>> I’ll get back to you with a small sample as soon as I get the time to do it.
>> This will also get me a better understanding of the format.
>>
>>
>> Cheers,
>> Mario
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> On 13 Sep 2019, at 19:32 , Marshall Schor >> > wrote:
>>>
>>> I'm wondering if you could post a very small test case showing this problem 
>>> with
>>> a small type system. 
>>>
>>> With that, I could run in the debugger and see exactly what was happening, 
>>> and
>>> see whether or not some small fix would make this work.
>>>
>>> The Deserializer for this already supports a certain type of mismatch 
>>> between
>>> type systems, but mainly one where one is a subset of the other - see the
>>> javadoc for the method
>>>
>>> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
>>>
>>> But it must not currently cover this particular case.
>>>
>>> -Marshall
>>>
>>> On 9/13/2019 10:48 AM, Mario Juric wrote:
 Just a quick follow up.

 I played a bit around with the CasIOUtils, and it seems that it is possible
 to load and use the embedded type system, i.e. the old type system with X,
 but I found no way to replace it with the new type system and make the
 necessary mappings to Y. I tried to see if I could use the CasCopier in a
 separate step but it expectedly fails when it reaches to the FSArray of X
 in the source CAS because the destination type system requires elements of
 type Y. I could make my own modified version of the CasCopier that could
 take some mapping functions for each pair of source and destination types
 that need to be mapped, but this is where it starts to get too complicated,
 so I found it not to be worth it at this point, since we might then want to
 reprocess everything from scratch anyway.

 Cheers,
 Mario













> On 12 Sep 2019, at 10:41 , Mario Juric  > wrote:
>
> Hi,
>
> We use form 6 compressed binaries to persist the CAS. We now want to make
> a change to the type system that is not directly compatible, although in
> principle the new type system is really a subset from a data perspective,
> so we want to migrate existing binaries to the new type system, but we
> don’t know how. The change is as follows:
>
> In the existing type system we have a type A with a FSArray feature of
> element type X, and we want to change X to Y where Y contains a genuine
> feature subset of X. This means we basically want to replace X with Y for
> the FSArray and ditch a few attributes of X when loading the CAS into the
> new type system.
>
> Had the CAS been stored in JSON this would be trivial by just mapping the
> attributes that they have in common, but when I try to load the CAS binary
> into the new target type system it chokes with an EOF, so I don’t know if
> that is at all possible with a form 6 compressed CAS binary?
>
> I poked around a bit in the reference, API and mailing list archive, but I
> was not able to find 

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric

Best Regards,
Mario Juric
Principal Engineer
UNSILO.ai
mobile: +45 3082 4100
skype: mario.juric.dk

Hi Marshall,

I have a small test case with 3 files, excluding any JCasGen generated types and UIMAfit types file.

First you will have to generate the types and run the SaveCompressedBinary to produce the 3 binary forms I have been experimenting with. You should then be able to run LoadCompressedBinaries as expected.

Next you need to change the element type of Container.features from FeatureAnnotation to FeatureRecord in the type system and generate the type system again. Also change the FeatureAnnotation reference in LoadCompressedBinaries l. 25 to FeatureRecord, and then try to reload the previously stored binaries again without saving them first using the new type system.

You can see I have played with different ways of loading just to see if anything worked, but much of it seems to result in exactly the same calls in the lower layers. I didn't get entirely the same results with the CAS we actually store as in this example. E.g. I experienced some EOF with the compressed filtered form, whereas I only get a class cast exception during verification in this example. Note also that we keep both types in the new type system, but we want to change the element type of the FSArray in the Container.

Hope this will yield some useful insights, and thanks a lot :)

Cheers
Mario

LoadCompressedBinary.java
Description: Binary data


SaveCompressedBinary.java
Description: Binary data


SimpleTypeSystem_TS.xml
Description: XML document


On 13 Sep 2019, at 21:55 , Mario Juric  wrote:

Thanks Marshall,

I'll get back to you with a small sample as soon as I get the time to do it. This will also get me a better understanding of the format.
Cheers,Mario

On 13 Sep 2019, at 19:32 , Marshall Schor  wrote:

I'm wondering if you could post a very small test case showing this problem with a small type system.

With that, I could run in the debugger and see exactly what was happening, and see whether or not some small fix would make this work.

The Deserializer for this already supports a certain type of mismatch between type systems, but mainly one where one is a subset of the other - see the javadoc for the method

org.apache.uima.cas.impl.BinaryCasSerDes6.java.

But it must not currently cover this particular case.

-Marshall

On 9/13/2019 10:48 AM, Mario Juric wrote:

Just a quick follow up.

I played around a bit with the CasIOUtils, and it seems that it is possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I could use the CasCopier in a separate step, but it fails, as expected, when it reaches the FSArray of X in the source CAS, because the destination type system requires elements of type Y. I could make my own modified version of the CasCopier that could take some mapping functions for each pair of source and destination types that need to be mapped, but this is where it starts to get too complicated, so I found it not to be worth it at this point, since we might then want to reprocess everything from scratch anyway.

Cheers,
Mario

On 12 Sep 2019, at 10:41 , Mario Juric  wrote:

Hi,

We use form 6 compressed binaries to persist the CAS. We now want to make a change to the type system that is not directly compatible, although in principle the new type system is really a subset from a data perspective, so we want to migrate existing binaries to the new type system, but we don't know how. The change is as follows:

In the existing type system we have a type A with a FSArray feature of element type X, and we want to change X to Y, where Y contains a genuine feature subset of X. This means we basically want to replace X with Y for the FSArray and ditch a few attributes of X when loading the CAS into the new type system.

Had the CAS been stored in JSON this would be trivial by just mapping the attributes that they have in common, but when I try to load the CAS binary into the new target type system it chokes with an EOF, so I don't know if that is at all possible with a form 6 compressed CAS binary?

I poked around a bit in the reference, API and mailing list archive, but I was not able to find anything useful. I can of course keep parallel attributes for both X and Y and then have a separate step that makes an explicit conversion/copy, but I prefer to avoid this. I would appreciate any input to the problem, thanks :)

Cheers,
Mario

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric
Hi Richard,

Unfortunately no. We have experienced some instability with the XMI format 
where it wasn’t possible to read the data after writing it, and we would 
probably not be able to convert a percentage of documents this way. 
Superficially it appears to be related to encoding issues, but I will try to 
see if I can recreate a small example at some point.

Cheers,
Mario












> On 14 Sep 2019, at 01:06 , Richard Eckart de Castilho  wrote:
> 
> Hi Mario,
> 
>> On 13. Sep 2019, at 16:48, Mario Juric  wrote:
>> 
>> I tried to see if I could use the CasCopier in a separate step but it 
>> expectedly fails when it reaches to the FSArray of X in the source CAS 
>> because the destination type system requires elements of type Y.
> 
> How about converting your data to XMI, patching the name of the array element 
> type from the old to the new name, and loading that data back in leniently?
> 
> -- Richard



Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Richard Eckart de Castilho
Hi Mario,

> On 13. Sep 2019, at 16:48, Mario Juric  wrote:
> 
> I tried to see if I could use the CasCopier in a separate step but it 
> expectedly fails when it reaches to the FSArray of X in the source CAS 
> because the destination type system requires elements of type Y.

How about converting your data to XMI, patching the name of the array element 
type from the old to the new name, and loading that data back in leniently?

-- Richard
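For reference, the patching step of this suggestion could be as simple as a string rewrite on the XMI text. The fragment below is made up for illustration; in a real setup the XMI would be produced by XmiCasSerializer and read back with lenient deserialization.

```java
// Sketch of the "convert to XMI, patch the element type name, re-load
// leniently" idea. Only the text-patching step is shown, applied to a
// fabricated XMI fragment using the type names from this thread.
class PatchXmi {

    /** Rename all opening and closing tags of oldTag to newTag. */
    static String patchTypeName(String xmi, String oldTag, String newTag) {
        return xmi.replace("<" + oldTag, "<" + newTag)
                  .replace("</" + oldTag + ">", "</" + newTag + ">");
    }

    public static void main(String[] args) {
        String xmi = "<types:FeatureAnnotation xmi:id=\"7\" name=\"color\" value=\"red\"/>";
        String patched = patchTypeName(xmi,
                "types:FeatureAnnotation", "types:FeatureRecord");
        System.out.println(patched);
        // <types:FeatureRecord xmi:id="7" name="color" value="red"/>
    }
}
```

A plain `String.replace` like this is only safe if the old type name cannot occur as a prefix of another tag name or inside attribute values; a real migration tool would patch the element names with an XML-aware rewrite instead.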

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Mario Juric
Thanks Marshall,

I'll get back to you with a small sample as soon as I get the time to do it. This 
will also get me a better understanding of the format.


Cheers,
Mario












> On 13 Sep 2019, at 19:32 , Marshall Schor  wrote:
> 
> I'm wondering if you could post a very small test case showing this problem 
> with
> a small type system. 
> 
> With that, I could run in the debugger and see exactly what was happening, and
> see whether or not some small fix would make this work.
> 
> The Deserializer for this already supports a certain type of mismatch between
> type systems, but mainly one where one is a subset of the other - see the
> javadoc for the method
> 
> org.apache.uima.cas.impl.BinaryCasSerDes6.java.
> 
> But it must not currently cover this particular case.
> 
> -Marshall
> 
> On 9/13/2019 10:48 AM, Mario Juric wrote:
>> Just a quick follow up.
>> 
>> I played a bit around with the CasIOUtils, and it seems that it is possible 
>> to load and use the embedded type system, i.e. the old type system with X, 
>> but I found no way to replace it with the new type system and make the 
>> necessary mappings to Y. I tried to see if I could use the CasCopier in a 
> separate step but it expectedly fails when it reaches the FSArray of X in 
>> the source CAS because the destination type system requires elements of type 
>> Y. I could make my own modified version of the CasCopier that could take 
>> some mapping functions for each pair of source and destination types that 
>> need to be mapped, but this is where it starts to get too complicated, so I 
>> found it not to be worth it at this point, since we might then want to 
>> reprocess everything from scratch anyway.
>> 
>> Cheers,
>> Mario
>> 
>>> On 12 Sep 2019, at 10:41 , Mario Juric  wrote:
>>> 
>>> Hi,
>>> 
>>> We use form 6 compressed binaries to persist the CAS. We now want to make a 
>>> change to the type system that is not directly compatible, although in 
>>> principle the new type system is really a subset from a data perspective, 
>>> so we want to migrate existing binaries to the new type system, but we 
>>> don’t know how. The change is as follows:
>>> 
>>> In the existing type system we have a type A with a FSArray feature of 
>>> element type X, and we want to change X to Y where Y contains a genuine 
>>> feature subset of X. This means we basically want to replace X with Y for 
>>> the FSArray and ditch a few attributes of X when loading the CAS into the 
>>> new type system.
>>> 
>>> Had the CAS been stored in JSON this would be trivial by just mapping the 
>>> attributes that they have in common, but when I try to load the CAS binary 
>>> into the new target type system it chokes with an EOF, so I don’t know if 
>>> that is at all possible with a form 6 compressed CAS binary?
>>> 
>> I poked around a bit in the reference, API and mailing list archive but I 
>>> was not able to find anything useful. I can of course keep parallel 
>>> attributes for both X and Y and then have a separate step that makes an 
>>> explicit conversion/copy, but I prefer to avoid this. I would appreciate 
>>> any input to the problem, thanks :)
>>> 
>>> Cheers,
>>> Mario
>>> 
>> 



Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Marshall Schor
I'm wondering if you could post a very small test case showing this problem with
a small type system. 

With that, I could run in the debugger and see exactly what was happening, and
see whether or not some small fix would make this work.

The Deserializer for this already supports a certain type of mismatch between
type systems, but mainly one where one is a subset of the other - see the
javadoc for the class

org.apache.uima.cas.impl.BinaryCasSerDes6.

But it must not currently cover this particular case.

-Marshall
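[Editor's note] Combining the suggestions in this thread, a migration could go through XMI using UIMA's CasIOUtils (available since UIMA 2.9.0). This is an untested sketch under stated assumptions: the file names are placeholders, and it assumes lenient XMI loading is what discards the features that Y no longer declares.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.SerialFormat;
import org.apache.uima.util.CasIOUtils;
import org.apache.uima.util.CasLoadMode;

public class FormSixMigrationSketch {

    // oldCas is created from the old type system (with X),
    // newCas from the new type system (with Y).
    public static void migrate(CAS oldCas, CAS newCas) throws Exception {
        // 1. Load the form-6 compressed binary into a CAS that still
        //    uses the old type system.
        try (InputStream in = new FileInputStream("doc.bin")) {
            CasIOUtils.load(in, oldCas);
        }
        // 2. Dump it as XMI text; patch the element name X -> Y in the
        //    file externally (text editor or a small script).
        try (OutputStream out = new FileOutputStream("doc.xmi")) {
            CasIOUtils.save(oldCas, out, SerialFormat.XMI);
        }
        // 3. Load the patched XMI leniently into the new type system;
        //    features that Y does not declare are ignored.
        try (InputStream in = new FileInputStream("doc-patched.xmi")) {
            CasIOUtils.load(in, null, newCas, CasLoadMode.LENIENT);
        }
    }
}
```

Afterwards the migrated CAS can be re-saved in form 6 with CasIOUtils.save and SerialFormat.COMPRESSED_FILTERED_TSI if the binary format is still wanted.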

On 9/13/2019 10:48 AM, Mario Juric wrote:
> Just a quick follow up.
>
> I played a bit around with the CasIOUtils, and it seems that it is possible 
> to load and use the embedded type system, i.e. the old type system with X, 
> but I found no way to replace it with the new type system and make the 
> necessary mappings to Y. I tried to see if I could use the CasCopier in a 
> separate step but it expectedly fails when it reaches the FSArray of X in 
> the source CAS because the destination type system requires elements of type 
> Y. I could make my own modified version of the CasCopier that could take some 
> mapping functions for each pair of source and destination types that need to 
> be mapped, but this is where it starts to get too complicated, so I found it 
> not to be worth it at this point, since we might then want to reprocess 
> everything from scratch anyway.
>
> Cheers,
> Mario
>
>> On 12 Sep 2019, at 10:41 , Mario Juric  wrote:
>>
>> Hi,
>>
>> We use form 6 compressed binaries to persist the CAS. We now want to make a 
>> change to the type system that is not directly compatible, although in 
>> principle the new type system is really a subset from a data perspective, so 
>> we want to migrate existing binaries to the new type system, but we don’t 
>> know how. The change is as follows:
>>
>> In the existing type system we have a type A with a FSArray feature of 
>> element type X, and we want to change X to Y where Y contains a genuine 
>> feature subset of X. This means we basically want to replace X with Y for 
>> the FSArray and ditch a few attributes of X when loading the CAS into the 
>> new type system.
>>
>> Had the CAS been stored in JSON this would be trivial by just mapping the 
>> attributes that they have in common, but when I try to load the CAS binary 
>> into the new target type system it chokes with an EOF, so I don’t know if 
>> that is at all possible with a form 6 compressed CAS binary?
>>
>> I poked around a bit in the reference, API and mailing list archive but I 
>> was not able to find anything useful. I can of course keep parallel 
>> attributes for both X and Y and then have a separate step that makes an 
>> explicit conversion/copy, but I prefer to avoid this. I would appreciate any 
>> input to the problem, thanks :)
>>
>> Cheers,
>> Mario
>>
>


Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Mario Juric
Just a quick follow up.

I played a bit around with the CasIOUtils, and it seems that it is possible to 
load and use the embedded type system, i.e. the old type system with X, but I 
found no way to replace it with the new type system and make the necessary 
mappings to Y. I tried to see if I could use the CasCopier in a separate step 
but it expectedly fails when it reaches the FSArray of X in the source CAS 
because the destination type system requires elements of type Y. I could make 
my own modified version of the CasCopier that could take some mapping functions 
for each pair of source and destination types that need to be mapped, but this 
is where it starts to get too complicated, so I found it not to be worth it at 
this point, since we might then want to reprocess everything from scratch 
anyway.

Cheers,
Mario
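[Editor's note] The modified-CasCopier idea above — one mapping function per pair of source and destination types — can be illustrated framework-free by treating feature structures as plain maps. Everything here (class, type and feature names) is hypothetical, not UIMA API; it only shows the shape of the registry Mario describes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TypeMapper {

    // One converter per source type name; each returns the migrated features.
    private final Map<String, UnaryOperator<Map<String, Object>>> converters = new HashMap<>();

    public void register(String sourceType, UnaryOperator<Map<String, Object>> fn) {
        converters.put(sourceType, fn);
    }

    // Unmapped types pass through unchanged, mirroring a copier's default.
    public Map<String, Object> convert(String sourceType, Map<String, Object> features) {
        return converters.getOrDefault(sourceType, UnaryOperator.identity()).apply(features);
    }

    public static void main(String[] args) {
        TypeMapper m = new TypeMapper();
        // X -> Y keeps only the features Y declares (a genuine subset of X).
        m.register("X", feats -> {
            Map<String, Object> out = new LinkedHashMap<>(feats);
            out.keySet().retainAll(Arrays.asList("begin", "end"));
            return out;
        });
        Map<String, Object> x = new LinkedHashMap<>();
        x.put("begin", 0);
        x.put("end", 5);
        x.put("extra", "dropped");
        System.out.println(m.convert("X", x)); // prints {begin=0, end=5}
    }
}
```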

> On 12 Sep 2019, at 10:41 , Mario Juric  wrote:
> 
> Hi,
> 
> We use form 6 compressed binaries to persist the CAS. We now want to make a 
> change to the type system that is not directly compatible, although in 
> principle the new type system is really a subset from a data perspective, so 
> we want to migrate existing binaries to the new type system, but we don’t 
> know how. The change is as follows:
> 
> In the existing type system we have a type A with a FSArray feature of 
> element type X, and we want to change X to Y where Y contains a genuine 
> feature subset of X. This means we basically want to replace X with Y for the 
> FSArray and ditch a few attributes of X when loading the CAS into the new 
> type system.
> 
> Had the CAS been stored in JSON this would be trivial by just mapping the 
> attributes that they have in common, but when I try to load the CAS binary 
> into the new target type system it chokes with an EOF, so I don’t know if 
> that is at all possible with a form 6 compressed CAS binary?
> 
> I poked around a bit in the reference, API and mailing list archive but I 
> was not able to find anything useful. I can of course keep parallel 
> attributes for both X and Y and then have a separate step that makes an 
> explicit conversion/copy, but I prefer to avoid this. I would appreciate any 
> input to the problem, thanks :)
> 
> Cheers,
> Mario
> 