Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks. I will take a look at it and then get back to you.

Cheers,
Mario

> On 25 Sep 2019, at 20:46 , Marshall Schor  wrote:
> 
> Here's code that works that serializes in 1.1 format.
> 
> The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".
> 
> XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
> OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
> try {
>   XMLSerializer xml11Serializer = new XMLSerializer(out);
>   xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>   xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
> }
> finally {
>   out.close();
> }
> 
> This is from a test case. -Marshall

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Marshall Schor
Here's working code that serializes in XML 1.1 format.

The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".

XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
try {
  XMLSerializer xml11Serializer = new XMLSerializer(out);
  xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
  xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
}
finally {
  out.close();
}

This is from a test case. -Marshall
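
For background on why switching the version helps, here is a minimal JDK-only sketch of the character ranges the two XML versions allow. The ranges follow the Char productions of the XML 1.0 and XML 1.1 specifications; the class and method names are hypothetical and are not part of UIMA. Note that XML 1.1 accepts \u0002 only as a "restricted" character, so a serializer still has to write it as a character reference.

```java
class XmlCharRanges {
    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the Char production of XML 1.1. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical cleanup helper: drops every code point XML 1.0 rejects. */
    static String stripInvalidXml10(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCharRanges::isXml10Char)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x2));  // false
        System.out.println(isXml11Char(0x2));  // true
        System.out.println(stripInvalidXml10("a\u0002b"));  // ab
    }
}
```

Dropping characters changes offsets in the document text, so a cleanup like this would have to run before any annotations are created.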

On 9/25/2019 2:16 PM, Mario Juric wrote:
> Thanks Marshall,
>
> If you prefer, I can also have a look at it, although I probably need to 
> finish something else first within the next 3-4 weeks. It would probably get 
> me started faster if you could share some of your experimental sample code.
>
> Cheers,
> Mario

Re: Ruta 2.7.0 SeedLexer issue with special unicode characters

2019-09-25 Thread Mario Juric
Hi Peter,

Just one more thing that came to mind: is there a particular reason for 
throwing a java.lang.Error instead of an exception?

Normally that is something thrown only by the JVM when it is really impossible 
to continue the process, e.g. out of memory, linkage errors or fatal VM 
failures. It is normally not meant to be caught, so our UIMA runtime environment 
exits because of this, although it’s not a big issue when we run the process as 
a service, since it is then restarted automatically. It just seems a bit drastic 
when only the document in question needs to fail.

Cheers,
Mario
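
To illustrate the point about java.lang.Error, here is a minimal sketch of a per-document loop, assuming a hypothetical DocumentProcessor interface that stands in for the analysis step (it is not Ruta or UIMA API). A plain catch (Exception e) does not see an Error, so the loop needs a separate clause if one bad document should not stop the run. Catching Error is generally discouraged, so this is a workaround sketch rather than a recommendation.

```java
import java.util.ArrayList;
import java.util.List;

class PerDocumentIsolation {
    /** Hypothetical stand-in for the analysis step applied to one document. */
    interface DocumentProcessor {
        void process(String doc);
    }

    /** Processes every document and returns the ones that failed. */
    static List<String> processAll(List<String> docs, DocumentProcessor processor) {
        List<String> failed = new ArrayList<>();
        for (String doc : docs) {
            try {
                processor.process(doc);
            } catch (Exception e) {
                // Ordinary failures: only this document fails.
                failed.add(doc);
            } catch (Error e) {
                // Genuine JVM problems (out of memory, stack overflow) still propagate.
                if (e instanceof VirtualMachineError) {
                    throw e;
                }
                // An Error thrown by library code is treated as a document failure.
                failed.add(doc);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("fine");
        docs.add("odd");
        List<String> failed = processAll(docs, doc -> {
            if (doc.equals("odd")) {
                throw new Error("simulated lexer failure");
            }
        });
        System.out.println(failed);  // [odd]
    }
}
```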

> On 23 Sep 2019, at 09:48 , Mario Juric  wrote:
> 
> Thanks Peter,
> 
> I will await your confirmation of the fix, but I guess we will then stick 
> with 2.6.1 until the next Ruta release :)
> 
> Cheers,
> Mario
> 
>> On 20 Sep 2019, at 18:09, Peter Klügl wrote:
>> 
>> Hi Mario,
>> 
>> 
>> I did not have the chance to have a look at your example yet...
>> 
>> 
>> Most likely, this problem is already fixed in the current trunk, but I
>> was not able to find the time for a new release. In 2.7.0, there was a
>> small modification in the lexer rules for the seeding, which
>> unfortunately had some unintended side effects in the generated code,
>> especially with unusual unicode characters. I'll try to verify that with
>> your example in the next few days.
>> 
>> 
>> Best,
>> 
>> 
>> Peter
>> 
>> Am 19.09.2019 um 12:35 schrieb Mario Juric:
>>> Hi Peter,
>>> 
>>> After upgrading to Ruta 2.7.0 a while ago we started getting some
>>> errors from the SeedLexer, which we didn’t have before. They appear
>>> to be related to odd unicode characters that we haven’t cleaned properly
>>> upstream, but the previous version 2.6.1 consumes them and our
>>> pipeline completes without error. I attached a small sample program
>>> with a dummy ruta script to reproduce it.
>>> 
>>> Which version has the correct behaviour in such cases? 2.7.0 or 2.6.1?
>>> 
>>> 
>>> Cheers,
>>> Mario
>>> 
>> -- 
>> Dr. Peter Klügl
>> R&D Text Mining/Machine Learning
>> 
>> Averbis GmbH
>> Salzstr. 15
>> 79098 Freiburg
>> Germany
>> 
>> Fon: +49 761 708 394 0
>> Fax: +49 761 708 394 10
>> Email: peter.klu...@averbis.com 
>> Web: https://averbis.com
>> 
>> Headquarters: Freiburg im Breisgau
>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>> 
> 



Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks Marshall,

If you prefer, I can also have a look at it, although I probably need to 
finish something else first within the next 3-4 weeks. It would probably get 
me started faster if you could share some of your experimental sample code.

Cheers,
Mario

> On 24 Sep 2019, at 21:32 , Marshall Schor  wrote:
> 
> yes, makes sense, thanks for posting the Jira.
> 
> If no one else steps up to work on this, I'll probably take a look in a few
> days. -Marshall
> 
> On 9/24/2019 6:47 AM, Mario Juric wrote:
>> Hi Marshall,
>> 
>> I added the following feature request to Apache Jira:
>> 
>> https://issues.apache.org/jira/browse/UIMA-6128
>> 
>> Hope it makes sense :)
>> 
>> Thanks a lot for the help, it’s appreciated.
>> 
>> Cheers,
>> Mario
>> 
>>> On 23 Sep 2019, at 16:33 , Marshall Schor  wrote:
>>> 
>>> Re: serializing using XML 1.1
>>> 
>>> This was not thought of, when setting up the CasIOUtils.
>>> 
>>> The way it was done (above) was using some more "primitive/lower level" 
>>> APIs,
>>> rather than the CasIOUtils.
>>> 
>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>> might be specified in the CasIOUtils APIs.
>>> 
>>> Thanks! -Marshall
>>> 
>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>>
>>>> Thanks for the thorough and excellent investigation.
>>>>
>>>> We are looking into possible normalisation/cleanup of whitespace/invisible
>>>> characters, but I don’t think we can necessarily do the same for some of
>>>> the other characters. It sounds to me though that serialising to XML 1.1
>>>> could also be a simple fix right now, but can this be configured?
>>>> CasIOUtils doesn’t seem to have an option for this, so I assume it’s
>>>> something you have working in your branch.
>>>>
>>>> Regarding the other problem: it seems that the JDK bug is fixed in Java 9
>>>> and later. Do you think switching to a more recent Java version would
>>>> make a difference? I think we can also try this out ourselves when we look
>>>> into migrating to UIMA 3 once our current deliveries are complete. We would
>>>> also like to switch to Java 11, and like the UIMA 3 migration it will
>>>> require some thorough testing.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 20 Sep 2019, at 20:52, Marshall Schor wrote:
>>>>>
>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>>>> XML char, which is the \u0002.
>>>>>
>>>>> This is in part because the XML version being used is XML 1.0.
>>>>>
>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>
>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>>>> XML 1.1:
>>>>>
>>>>>   XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>   try {
>>>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>     xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>   }
>>>>>   finally {
>>>>>     out.close();
>>>>>   }
>>>>>
>>>>> This succeeds and serializes this using XML 1.1.
>>>>>
>>>>> I also tried serializing some doc text which includes \u77987. That did
>>>>> not serialize correctly.
>>>>> I could see it in the code while tracing up to some point down in the
>>>>> innards of some internal SAX Java code,
>>>>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerializer, where
>>>>> it was "Correct" in the Java string.
>>>>>
>>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>>
>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3
>>>>> byte encoding:
>>>>>   1110 xxxx  10xx xxxx  10xx xxxx
>>>>>
>>>>> of 0111 0111 1001 1000 which in hex is "7 7 9 8" so it looks fishy to me.
>>>>>
>>>>> But I think it's out of our hands - it's somewhere deep in the SAX
>>>>> transform Java code.
>>>>>
>>>>> I looked for a bug report and found some
>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>
>>>>> Bottom line, I think, is to clean out these characters early :-) .
>>>>>
>>>>> -Marshall
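
The byte arithmetic above can be checked with a short JDK-only sketch. It rests on an assumption that the thread does not confirm: that the test text was written as the Java string literal "\u77987". Java reads such a literal as the single char U+7798 followed by the digit 7, and the UTF-8 encoding of that pair is exactly the four bytes E7 9E 98 37 reported above. The code point 77987 (decimal), by contrast, is U+130A3 and encodes as a different 4-byte sequence. Class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8LiteralCheck {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Returns the UTF-8 encoding of the string as hex. */
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Assumed form of the test text: U+7798 followed by the digit 7.
        String literal = "\u77987";
        System.out.println(literal.length());   // 2
        System.out.println(utf8Hex(literal));   // E7 9E 98 37

        // Code point 77987 (decimal) is U+130A3, stored as a surrogate pair.
        String supplementary = new String(Character.toChars(77987));
        System.out.println(supplementary.length());   // 2
        System.out.println(utf8Hex(supplementary));   // F0 93 82 A3
    }
}
```

If that assumption holds, the output seen in the test would be a correct encoding of the literal rather than a serializer defect. The linked JDK report may still apply to genuine supplementary characters.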
>>>>>
>>>>>
>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>> here's an idea.
>>>>>>
>>>>>> If you have a string, with the surrogate pair at position 10, and you
>>>>>> have some Java code, which is iterating through the string and getting
>>>>>> the code-point at each character offset, then that code will produce:
>>>>>>
>>>>>> at position 10:  the code-point 77987
>>>>>> at position 11:  the code-point 56483
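
The two values listed for positions 10 and 11 can be reproduced with a small JDK-only sketch. It assumes the character in question is the code point 77987 (U+130A3), placed after ten ASCII characters; the class and method names are hypothetical. The second method shows one way to visit each code point exactly once.

```java
class CodePointWalk {
    /** Naive walk: asks for a code point at every char index. */
    static int[] perCharIndex(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.codePointAt(i);
        }
        return out;
    }

    /** Code-point-aware walk: one entry per code point. */
    static int[] perCodePoint(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // Ten ASCII chars, then the surrogate pair at char positions 10 and 11.
        String s = "0123456789" + new String(Character.toChars(77987)) + "x";

        int[] naive = perCharIndex(s);
        System.out.println(naive[10]);  // 77987
        System.out.println(naive[11]);  // 56483

        int[] proper = perCodePoint(s);
        System.out.println(proper.length);  // 12
        System.out.println(proper[10]);     // 77987
    }
}
```

The value 56483 is 0xDCA3, the low surrogate of the pair, which codePointAt returns when it is asked about the second half on its own.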
>>>>>>
>>>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>>>> assuming you have