Since these characters are above the basic UTF-16 limit (xFFFF), each one is represented as 2 UTF-16 code units with high & low surrogate prefixes. So 55322 + 56704 are xD81A + xDD80, and after removing the 6-bit surrogate prefixes (the top 6 bits of xD8 & xDC) we have two 10-bit numbers, x01A + x180, which combine as x6980. After adding 2^16 = x10000 (since only characters above this need surrogate pairs) we have the expected x16980. So one mystery is why they appear in the CAS with the &# notation. When I dump the CAS in the FileSystemCollectionReader I see the UTF-8 characters, e.g. in hex f096 a680 f096 a690. What collection reader are you using?
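For anyone who wants to check the arithmetic above, here is a minimal Java sketch. The class and method names (SurrogateDemo, combine, utf8Hex) are made up for illustration and are not part of UIMA; only the java.lang and java.nio calls are real. It combines the two surrogates by hand, compares the result with Character.toCodePoint, and prints the UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {

    /** Combine a UTF-16 surrogate pair into a code point by hand. */
    static int combine(char high, char low) {
        int hi10 = high & 0x3FF; // drop the 6-bit 110110 prefix, keep 10 bits
        int lo10 = low & 0x3FF;  // drop the 6-bit 110111 prefix, keep 10 bits
        return ((hi10 << 10) | lo10) + 0x10000; // add 2^16
    }

    /** Hex dump of the UTF-8 bytes of a string. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF so a negative byte is not sign-extended
            // (which is what produces values like ffffff96).
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char high = (char) 55322; // xD81A
        char low = (char) 56704;  // xDD80
        int cp = combine(high, low);
        System.out.println(Integer.toHexString(cp));                    // 16980
        System.out.println(cp == Character.toCodePoint(high, low));     // true
        System.out.println(utf8Hex(new String(Character.toChars(cp)))); // f096a680
    }
}
```

The same check with 55322 + 56720 (xD81A + xDD90) should give x16990.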
On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera <[email protected]> wrote:

> This is the cas serialized to xmi before it is sent to the uima-as service,
> serialized with XmiCasSerializer.serialize(cas.getCas(), outStream).
> The representation of the characters in this serialization does not
> match the representation of the problem characters. What gets
> serialized is the code point escape sequences corresponding to the Bamum
> characters, two code points for each character.
> Why can this happen? Any suggestions?
>
> <?xml version="1.0" encoding="UTF-8"?><xmi:XMI
> xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
> xmlns:tcas="http:///uima/tcas.ecore"
> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
> xmi:version="2.0"><cas:NULL xmi:id="0"/><tcas:DocumentAnnotation
> xmi:id="8" sofa="1" begin="0" end="12"
> language="x-unspecified"/><cas:Sofa xmi:id="1" sofaNum="1"
> sofaID="_InitialView" mimeType="text" sofaString="��
> �� � �"/><cas:View sofa="1" members="8"/></xmi:XMI>
>
> 2016-12-16 14:06 GMT-05:00, Burn Lewis <[email protected]>:
> > Sorry, I missed the supplement set. So the tests I did with x16980 &
> > x16990 are valid. runRemoteAsyncAE uses the same
> > FileSystemCollectionReader as runAE does ... did you use a different
> > collection reader? If a custom one perhaps you could serialize the cas
> > to a file as XMI and verify that the XMI is legal.
> >
> > On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <[email protected]> wrote:
> >
> >> According to Wikipedia, the Bamum script
> >> (https://en.wikipedia.org/wiki/Bamum_script) has another
> >> valid range, U+16800–U+16A3F; any of these characters generates the
> >> same log trace. I will continue testing Marshall Schor's
> >> suggestion.
> >>
> >> 2016-12-14 18:07 GMT-05:00, Burn Lewis <[email protected]>:
> >> > I think there's another problem ... the characters we have tested
> >> > with are not in the Bamum unicode set. The first 2 that Marshall
> >> > listed in utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 &
> >> > x16990 and the 3rd (EF BF BD) is xFFFD. This last one is the
> >> > "replacement character" used when an illegal character is
> >> > encountered. According to Wikipedia the 88 Bamum characters are in
> >> > the range xA6A0 - xA6F7.
> >> >
> >> > In order to reproduce your problem we need to use the same
> >> > codepoints. Can you tell us what the hex values of the failing
> >> > characters are, in UTF-8 or UTF-16?
> >> >
> >> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
> >> > runAE, following the quick test described in the UIMA-AS README.
> >> >
> >> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <[email protected]> wrote:
> >> >
> >> >> Maybe we've been on the wrong line of thinking.
> >> >>
> >> >> Perhaps the translation between UTF-8 (during transportation) and
> >> >> the string characters is fine, but the XML parsing is restricting
> >> >> the character set it uses.
> >> >>
> >> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
> >> >>
> >> >> where it says valid xml characters exclude the "surrogates", which
> >> >> your characters I think are.
> >> >>
> >> >> So, perhaps it's XML parsing which is complaining (and it appears
> >> >> this is so, from the stack trace).
> >> >>
> >> >> We should point out that UIMA's character offsets (like begin and
> >> >> end) were designed with Java String character offsets, and will
> >> >> perhaps not work correctly when surrogates are being used.
> >> >>
> >> >> A possible workaround for this particular issue may be to switch to
> >> >> binary serialization, instead of xmi serialization. This has a
> >> >> restriction in that the type systems must be identical (between the
> >> >> client and server).
> >> >>
> >> >> We could possibly get more confirmation of this hypothesis if you
> >> >> could say what the stack trace was, beyond the first bit which you
> >> >> stated in your original note. There should be more stack trace
> >> >> information, further down, starting with "caused by ..." which may
> >> >> provide more helpful information.
> >> >>
> >> >> -Marshall
> >> >>
> >> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
> >> >> > We also ran that test with the uima framework and the RunAE tool,
> >> >> > with the characters in a file as you did, and indeed there is no
> >> >> > problem. The problem arises when using uima-as: sendCAS() with
> >> >> > UimaAsynchronousEngine, and then deserializing the cas with
> >> >> > deserializeCasFromXmi() in the remote uima-as service, where I
> >> >> > get the mentioned exception
> >> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >> > Character reference "&#"
> >> >> >
> >> >> > In my case I don't read any file and don't use
> >> >> > FileSystemCollectionReader. The user enters the text, the text is
> >> >> > stored in a Java string (utf-16) and set on the cas to be
> >> >> > processed, using setDocumentLanguage; then I send the cas.
> >> >> >
> >> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <[email protected]>:
> >> >> >> I put these 3 characters as UTF-8 in a file in examples/data and
> >> >> >> ran the MeetingDetector annotator as described in section 3.4 of
> >> >> >> the README, adding the option "-o out". In that folder I found
> >> >> >> the returned results in xmi format with the characters in the
> >> >> >> sofaString element. The relevant part of this file in hex is:
> >> >> >>
> >> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
> >> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
> >> >> >>
> >> >> >> Note that the FileSystemCollectionReader by default uses the
> >> >> >> system encoding but you could add a ConfigurationParameterSetting
> >> >> >> of UTF-8 for the Encoding parameter in its descriptor.
> >> >> >>
> >> >> >> With the client & server on different (Linux) machines I see no
> >> >> >> problem with sending UTF-8 characters.
> >> >> >>
> >> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <[email protected]> wrote:
> >> >> >>
> >> >> >>> Another question: I assume there are perhaps 2 machines
> >> >> >>> involved here (it's a UIMA-AS setup).
> >> >> >>>
> >> >> >>> From the exception, it appears that the error happens when the
> >> >> >>> client sends the CAS to the remote.
> >> >> >>>
> >> >> >>> Can you print out the Linux (assuming that's the OS) default
> >> >> >>> locale for both machines? (e.g. type into a command line
> >> >> >>> "locale" and see what each machine has as its default character
> >> >> >>> encoding).
> >> >> >>>
> >> >> >>> Please let us know what these are.
> >> >> >>>
> >> >> >>> Thanks. -Marshall
> >> >> >>>
> >> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >> >> >>>> Yes, these are the values of the troublesome characters. Using
> >> >> >>>> Integer.toHexString() to print out each byte shows
> >> >> >>>>
> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >> >> >>>>
> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >> >> >>>>
> >> >> >>>> ffffffef ffffffbf ffffffbd
> >> >> >>>>
> >> >> >>>> ffffffef ffffffbf ffffffbd
> >> >> >>>>
> >> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <[email protected]>:
> >> >> >>>>> Hi Nelson,
> >> >> >>>>>
> >> >> >>>>> Looking into this... Can you please confirm that the UTF-8
> >> >> >>>>> coding of the troublesome characters, in hexadecimal, is:
> >> >> >>>>>
> >> >> >>>>> F0 96 A6 80
> >> >> >>>>>
> >> >> >>>>> F0 96 A6 90
> >> >> >>>>>
> >> >> >>>>> EF BF BD
> >> >> >>>>>
> >> >> >>>>> EF BF BD
> >> >> >>>>>
> >> >> >>>>> If you have the string in Java, please try converting it to a
> >> >> >>>>> UTF-8 string using something like:
> >> >> >>>>> byte[] theBytes = myTestString.getBytes("UTF-8");
> >> >> >>>>>
> >> >> >>>>> and then print out theBytes in hex; they should look like the
> >> >> >>>>> above. If not, please let us know what the values are instead.
> >> >> >>>>>
> >> >> >>>>> Thanks. -Marshall
> >> >> >>>>>
> >> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >> >> >>>>>> Hi, I read your explanation and looked at the link, but in my
> >> >> >>>>>> case I don't read any xml file. I just copy the text, get a
> >> >> >>>>>> new input cas from UimaAsynchronousEngine with getCAS(), set
> >> >> >>>>>> the text in the cas, and send the request with sendCAS(). I
> >> >> >>>>>> use the uima-as API 2.9.0 on the client side. Apparently the
> >> >> >>>>>> characters are replaced by their corresponding entities when
> >> >> >>>>>> the cas is serialized for sending, but I get the mentioned
> >> >> >>>>>> exception "org.xml.sax.SAXParseException; lineNumber: 1;
> >> >> >>>>>> columnNumber: 571; Character reference "&#"
> >> >> >>>>>> in the installed uima-as framework when it tries to
> >> >> >>>>>> deserialize the cas with deserializeCasFromXmi() so that the
> >> >> >>>>>> service can process it.
> >> >> >>>>>>
> >> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <[email protected]>:
> >> >> >>>>>>> Hi Nelson,
> >> >> >>>>>>>
> >> >> >>>>>>> I can't see the characters (sorry).
> >> >> >>>>>>>
> >> >> >>>>>>> This might be an issue caused by a discrepancy between the
> >> >> >>>>>>> coding of the file being read, and the coding indicated on
> >> >> >>>>>>> the xml header. Can you check that those two things are the
> >> >> >>>>>>> same?
> >> >> >>>>>>>
> >> >> >>>>>>> See
> >> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >> >> >>>>>>> for example.
> >> >> >>>>>>>
> >> >> >>>>>>> -Marshall
> >> >> >>>>>>>
> >> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >> >> >>>>>>>> I tried to process the following text in a service deployed
> >> >> >>>>>>>> in uima-as, because it is input to my application. This is
> >> >> >>>>>>>> the text: 𖦀 𖦐 � �.
> >> >> >>>>>>>> These characters belong to the Bamum language, and
> >> >> >>>>>>>> apparently are not invalid xml characters, because tools
> >> >> >>>>>>>> such as browsers interpret and display them. After getting
> >> >> >>>>>>>> a new input cas for processing, setting the text, and
> >> >> >>>>>>>> sending the request, I get the exception shown below in
> >> >> >>>>>>>> uima-as. The uima-as framework keeps working and recovers
> >> >> >>>>>>>> correctly; it just does not process these characters.
> >> >> >>>>>>>> Could you tell me what is happening with these characters?
> >> >> >>>>>>>> Is one of them an invalid character for the uima-as
> >> >> >>>>>>>> framework?
> >> >> >>>>>>>>
> >> >> >>>>>>>> 04:00:31.606 - 14:
> >> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >> >> >>>>>>>> WARNING:
> >> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >> >>>>>>>> Character reference "&#
> >> >> >>>>>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >> >> >>>>>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >> >> >>>>>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
