Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks. I will take a look at it and then get back to you. Cheers, Mario > On 25 Sep 2019, at 20:46 , Marshall Schor wrote: > > Here's code that works that serializes in 1.1 format. > > The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1". > > XmiCasSerializer

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Marshall Schor
Here's code that works that serializes in 1.1 format. The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1". XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem()); OutputStream out = new FileOutputStream(new File ("odd-doc-txt-v11.xmi")); try {

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-25 Thread Mario Juric
Thanks Marshall, If you prefer, I can also have a look at it, although I probably need to finish something else first within the next 3-4 weeks. It would probably get me started faster if you could share some of your experimental sample code. Cheers, Mario > On 24 Sep 2019, at

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-24 Thread Marshall Schor
yes, makes sense, thanks for posting the Jira. If no one else steps up to work on this, I'll probably take a look in a few days. -Marshall On 9/24/2019 6:47 AM, Mario Juric wrote: > Hi Marshall, > > I added the following feature request to Apache Jira: > >

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-24 Thread Mario Juric
Hi Marshall, I added the following feature request to Apache Jira: https://issues.apache.org/jira/browse/UIMA-6128 Hope it makes sense :) Thanks a lot for the help, it’s appreciated. Cheers, Mario > On 23 Sep 2019, at 16:33 , Marshall Schor wrote: > > Re: serializing using XML

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor
re: using a later Java - that might make a difference, since fixes keep getting added. For some fixes, however, as you've noted, the fixes are backported to previous versions. -Marshall On 9/23/2019 3:45 AM, Mario Juric wrote: > Hi Marshall, > > Thanks for the thorough and excellent

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Marshall Schor
Re: serializing using XML 1.1 This was not thought of, when setting up the CasIOUtils. The way it was done (above) was using some more "primitive/lower level" APIs, rather than the CasIOUtils. Please open a Jira ticket for this, with perhaps some suggestions on how it might be specified in the

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Mario Juric
Hi Marshall, Seems the bug was already resolved for 8u92 in one of the backports: https://bugs.openjdk.java.net/browse/JDK-8141098 Cheers, Mario > On 23 Sep 2019, at 09:45 , Mario Juric wrote: > > Hi Marshall, > > Thanks for

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-23 Thread Mario Juric
Hi Marshall, Thanks for the thorough and excellent investigation. We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don’t think we can necessarily do the same for some of the other characters. It sounds to me though that serialising to XML 1.1 could

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor
In the test "OddDocumentText", this produces a "throw" due to an invalid xml char, which is the \u0002. This is in part because the xml version being used is xml 1.0. XML 1.1 expanded the set of valid characters to include \u0002. Here's a snip from the XmiCasSerializerTest class which
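The difference between the two XML versions can be sketched with the `Char` productions from the XML 1.0 and XML 1.1 specifications. This is a minimal illustration using only the JDK, not the check that UIMA or the XmiCasSerializer performs internally; the class name `XmlCharCheck` and its method names are made up for this example. Note that XML 1.1 still requires control characters such as \u0002 to be written as character references (for example `&#2;`) rather than as literal characters.

```java
public final class XmlCharCheck {
    private XmlCharCheck() {}

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the XML 1.1 Char production. */
    public static boolean isValidXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the char offset of the first code point that XML 1.0 does not
     * allow, or -1 if the whole text is acceptable.
     */
    public static int firstInvalidXml10Offset(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            // Advance by 2 chars for supplementary code points, 1 otherwise.
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "abc\u0002def";
        System.out.println(firstInvalidXml10Offset(text)); // prints 3
        System.out.println(isValidXml10(0x2));             // prints false
        System.out.println(isValidXml11(0x2));             // prints true
    }
}
```

A scan like this could be run over the document text before serialization to decide whether XML 1.0 output is safe for a given CAS, which is an assumption about workflow rather than something the thread prescribes.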

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Marshall Schor
here's an idea. If you have a string, with the surrogate pair at position 10, and you have some Java code, which is iterating through the string and getting the code-point at each character offset, then that code will produce: at position 10:  the code-point 77987 at position 11:  the
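The surrogate-pair behaviour described here can be reproduced with plain JDK string methods. The sketch below is an illustration only, not code from the thread: the class `SurrogateDemo` and its methods are hypothetical names. It places code point 77987 (U+130A3) at char offset 10 and compares a naive walk over every char offset with a walk over whole code points.

```java
public final class SurrogateDemo {
    private SurrogateDemo() {}

    /** Ten ASCII chars, then code point 77987 (a surrogate pair), then 'x'. */
    public static String build() {
        StringBuilder sb = new StringBuilder("0123456789");
        sb.appendCodePoint(77987);
        sb.append('x');
        return sb.toString();
    }

    /** Naive walk: asks for the code point at every char offset. */
    public static int[] codePointsByCharOffset(String s) {
        int[] result = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            result[i] = s.codePointAt(i);
        }
        return result;
    }

    /** Walk that visits each code point exactly once. */
    public static int[] codePointsProperly(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = build();
        int[] naive = codePointsByCharOffset(s);
        System.out.println(s.length());   // prints 13
        System.out.println(naive[10]);    // prints 77987
        System.out.println(naive[11]);    // prints 56483 (the low surrogate 0xDCA3 alone)
        System.out.println(codePointsProperly(s).length); // prints 12
    }
}
```

The naive walk reports the full code point at offset 10 and then the bare low surrogate at offset 11, so any code that validates or escapes characters offset by offset sees one extra, invalid-looking value.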

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-20 Thread Mario Juric
Thanks Marshall, Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-19 Thread Marshall Schor
The odd-feature-text seems to work OK, but has some unusual properties, due to that unicode character. Here's what I see:  The FeatureRecord "name" field is set to a 1-unicode-character, that must be encoded as 2 java characters. When output, it shows up in the xmi as which seems correct.  The

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-18 Thread Mario Juric
Hi, I investigated the XMI issue as promised and these are my findings. It is related to special unicode characters that are not handled by XMI serialisation, and there seem to be two distinct categories of issues we have identified so far. 1) The document text of the CAS contains special unicode

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
In this case, the original looks kind-of like this: Container features -> FSArray of FeatureAnnotation, each of which has 5 slots: sofaRef, begin, end, name, value. The new TypeSystem has Container features -> FSArray of FeatureRecord, each of which

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Richard Eckart de Castilho
On 16. Sep 2019, at 19:05, Marshall Schor wrote: > > I can reproduce the problem, and see what is happening. The deserialization > code compares the two type systems, and allows for some mismatches (things > present in one and not in the other), but it doesn't allow for having a > feature >

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
I can reproduce the problem, and see what is happening.  The deserialization code compares the two type systems, and allows for some mismatches (things present in one and not in the other), but it doesn't allow for having a feature whose range (value) is type in one type system and type

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric
Yes, these were just generated from the type system file using JCasGen. > On 16 Sep 2019, at 15:32 , Marshall Schor wrote: > > oops, ignore that - I see Container is a JCas class ... -M > > On 9/16/2019 9:30 AM, Marshall Schor wrote: >> I may have some version problems. The

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
oops, ignore that - I see Container is a JCas class ... -M On 9/16/2019 9:30 AM, Marshall Schor wrote: > I may have some version problems. The LoadCompressedBinary has refs to a class > "Container", but I don't seem to have that class - where is it coming from? > > -Marshall > > On 9/16/2019 8:11

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Marshall Schor
I may have some version problems. The LoadCompressedBinary has refs to a class "Container", but I don't seem to have that class - where is it coming from? -Marshall On 9/16/2019 8:11 AM, Mario Juric wrote: > > Best Regards, > > Mario Juric > Principal Engineer > *UNSILO.ai* >

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric
Best Regards, Mario Juric, Principal Engineer, UNSILO.ai, mobile: +45 3082 4100, skype: mario.juric.dk. Hi Marshall, I have a small test case with 3 files excluding any JCasGen generated types and uimaFIT types file. First you will have to generate the types and run the SaveCompressedBinary to produce the 3

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-16 Thread Mario Juric
Hi Richard, Unfortunately no. We have experienced some instability with the XMI format where it wasn’t possible to read the data after writing it, and we would probably not be able to convert a percentage of documents this way. Superficially it appears to be related to encoding issues, but I

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Richard Eckart de Castilho
Hi Mario, > On 13. Sep 2019, at 16:48, Mario Juric wrote: > > I tried to see if I could use the CasCopier in a separate step but it > expectedly fails when it reaches to the FSArray of X in the source CAS > because the destination type system requires elements of type Y. How about converting

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Mario Juric
Thanks Marshall, I’ll get back to you with a small sample as soon as I get the time to do it. This will also get me a better understanding of the format. Cheers, Mario > On 13 Sep 2019, at 19:32 , Marshall Schor wrote: > > I'm wondering if you could post a very small test case

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Marshall Schor
I'm wondering if you could post a very small test case showing this problem with a small type system.  With that, I could run in the debugger and see exactly what was happening, and see whether or not some small fix would make this work. The Deserializer for this already supports a certain type

Re: Migrating type system of form 6 compressed CAS binaries

2019-09-13 Thread Mario Juric
Just a quick follow-up. I played around a bit with the CasIOUtils, and it seems that it is possible to load and use the embedded type system, i.e. the old type system with X, but I found no way to replace it with the new type system and make the necessary mappings to Y. I tried to see if I