I understand, and yes, these characters should not appear in the serialized CAS, but they do appear when I serialize with XmiCasSerializer.serialize(cas.getCas(), outStream):
...<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="�� �� � �"/>...

In my application I do not use FileSystemCollectionReader. The user enters the text, the text is stored in a Java string (UTF-16) and set on the CAS that will be processed, using setDocumentLanguage, and then I send the CAS.

2016-12-18 23:06 GMT-05:00, Burn Lewis <[email protected]>:
> Since these characters are above the basic UTF-16 limit they are
> represented as 2 UTF-16 characters with high & low surrogate prefixes. So
> 55322 + 56704 are xD81A + xDD80 and after removing the 6-bit surrogate
> prefixes of D8 & DC we have 2 10-bit numbers 1A + 180 which combine as
> 6980, and after adding 2^16 (since only characters above this need
> surrogate pairs) we have the expected x16980.
> So one mystery is: their appearance in the CAS with the &# notation. When
> I dump the CAS in the FileSystemCollectionReader I see the UTF-8 character,
> e.g. in hex f096 a680 f096 a690.
> What collection reader are you using?
>
> On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera <[email protected]>
> wrote:
>
>> This is the CAS serialized to XMI before it is sent to the uima-as service,
>> serialized with XmiCasSerializer.serialize(cas.getCas(), outStream).
>> The representation of the characters in this serialization does not
>> match the representation of the problem characters. What is being
>> serialized are the code point escape sequences corresponding to the Bamum
>> characters, two code points for each character.
>> Why can this happen?
>> Any suggestions?
>>
>> <?xml version="1.0" encoding="UTF-8"?><xmi:XMI
>> xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
>> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
>> xmlns:tcas="http:///uima/tcas.ecore"
>> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
>> xmi:version="2.0"><cas:NULL xmi:id="0"/><tcas:DocumentAnnotation
>> xmi:id="8" sofa="1" begin="0" end="12"
>> language="x-unspecified"/><cas:Sofa xmi:id="1" sofaNum="1"
>> sofaID="_InitialView" mimeType="text" sofaString="�� �� � �"/><cas:View sofa="1" members="8"/></xmi:XMI>
>>
>>
>> 2016-12-16 14:06 GMT-05:00, Burn Lewis <[email protected]>:
>> > Sorry, I missed the supplement set. So the tests I did with x16980 &
>> > x16990 are valid. runRemoteAsyncAE uses the same
>> > FileSystemCollectionReader as runAE does ... did you use a different
>> > collection reader? If a custom one perhaps you could serialize the cas to
>> > a file as XMI and verify that the XMI is legal.
>> >
>> > On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <[email protected]>
>> > wrote:
>> >
>> >> According to Wikipedia, the Bamum script
>> >> (https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
>> >> U+16800–U+16A3F; any of these characters generates the same log trace.
>> >> I will continue to test Marshall Schor's suggestion.
>> >>
>> >> 2016-12-14 18:07 GMT-05:00, Burn Lewis <[email protected]>:
>> >> > I think there's another problem ... the characters we have tested with
>> >> > are not in the Bamum unicode set. The first 2 that Marshall listed in
>> >> > utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd
>> >> > (EF BF BD) is xFFFD. This last one is the "replacement character" used
>> >> > when an illegal character is encountered. According to Wikipedia the 88
>> >> > Bamum characters are in the range xA6A0 - xA6F7.
>> >> >
>> >> > In order to reproduce your problem we need to use the same codepoints.
>> >> > Can you tell us what the hex values of the failing characters are, in
>> >> > UTF-8 or UTF-16?
>> >> >
>> >> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
>> >> > runAE, following the quick test described in the UIMA-AS README.
>> >> >
>> >> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <[email protected]> wrote:
>> >> >
>> >> >> Maybe we've been on the wrong line of thinking.
>> >> >>
>> >> >> Perhaps the translation between UTF-8 (during transportation) and the
>> >> >> string characters is fine, but the XML parsing is restricting the
>> >> >> character set it uses.
>> >> >>
>> >> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>> >> >>
>> >> >> where it says valid xml characters exclude the "surrogates", which
>> >> >> your characters I think are.
>> >> >>
>> >> >> So, perhaps it's XML parsing which is complaining (and it appears this
>> >> >> is so, from the stack trace).
>> >> >>
>> >> >> We should point out that UIMA's character offsets (like begin and end)
>> >> >> were designed with Java String character offsets, and will perhaps not
>> >> >> work correctly when surrogates are being used.
>> >> >>
>> >> >> A possible workaround for this particular issue may be to switch to
>> >> >> binary serialization, instead of xmi serialization. This has a
>> >> >> restriction in that the type systems must be identical (between the
>> >> >> client and server).
>> >> >>
>> >> >> We could possibly get more confirmation of this hypothesis if you could
>> >> >> say what the stack trace was, beyond the first bit which you stated in
>> >> >> your original note. There should be more stack trace information,
>> >> >> further down, starting with "caused by ..."
>> >> >> which may provide more helpful information.
>> >> >>
>> >> >> -Marshall
>> >> >>
>> >> >>
>> >> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
>> >> >> > We also did that test with the UIMA framework and the RunAE tool, with
>> >> >> > the characters in a file as you did, and indeed there is no problem.
>> >> >> > The problem occurs with uima-as, using sendCAS() with
>> >> >> > UimaAsynchronousEngine: when the remote uima-as service tries to
>> >> >> > deserialize the cas with deserializeCasFromXmi(), I get the mentioned
>> >> >> > exception
>> >> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >> >> > Character reference "&#"
>> >> >> >
>> >> >> > In my case I don't read any file and do not use
>> >> >> > FileSystemCollectionReader.
>> >> >> > The user enters the text, the text is stored in a Java string
>> >> >> > (UTF-16) and set on the cas that will be processed, using
>> >> >> > setDocumentLanguage, and then I send the cas.
>> >> >> >
>> >> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <[email protected]>:
>> >> >> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
>> >> >> >> the MeetingDetector annotator as described in section 3.4 of the
>> >> >> >> README, adding the option "-o out". In that folder I found the
>> >> >> >> returned results in xmi format with the characters in the sofaString
>> >> >> >> element. The relevant part of this file in hex is:
>> >> >> >>
>> >> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
>> >> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>> >> >> >>
>> >> >> >> Note that the FileSystemCollectionReader by default uses the system
>> >> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
>> >> >> >> for the Encoding parameter in its descriptor.
>> >> >> >>
>> >> >> >> With the client & server on different (Linux) machines I see no
>> >> >> >> problem with sending UTF-8 characters.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <[email protected]> wrote:
>> >> >> >>
>> >> >> >>> another question: I assume there are perhaps 2 machines involved,
>> >> >> >>> here (it's a UIMA-AS setup).
>> >> >> >>>
>> >> >> >>> From the exception, it appears that the error happens when the client
>> >> >> >>> sends the CAS to the remote.
>> >> >> >>>
>> >> >> >>> Can you print out the Linux (assuming that's the OS) default locale
>> >> >> >>> for both machines? (e.g. type into a command line "locale" and see
>> >> >> >>> what each machine has as its default character encoding).
>> >> >> >>>
>> >> >> >>> Please let us know what these are.
>> >> >> >>>
>> >> >> >>> Thanks. -Marshall
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> >> >> >>>> Yes, these are the values of the troublesome characters. Using
>> >> >> >>>> Integer.toHexString() to print out each byte shows
>> >> >> >>>>
>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
>> >> >> >>>>
>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
>> >> >> >>>>
>> >> >> >>>> ffffffef ffffffbf ffffffbd
>> >> >> >>>>
>> >> >> >>>> ffffffef ffffffbf ffffffbd
>> >> >> >>>>
>> >> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <[email protected]>:
>> >> >> >>>>> Hi Nelson,
>> >> >> >>>>>
>> >> >> >>>>> Looking into this...
>> >> >> >>>>> Can you please confirm that the UTF-8 coding of the
>> >> >> >>>>> troublesome characters, in hexadecimal, is:
>> >> >> >>>>>
>> >> >> >>>>> F0 96 A6 80
>> >> >> >>>>>
>> >> >> >>>>> F0 96 A6 90
>> >> >> >>>>>
>> >> >> >>>>> EF BF BD
>> >> >> >>>>>
>> >> >> >>>>> EF BF BD
>> >> >> >>>>>
>> >> >> >>>>> If you have the string in Java, please try converting it to a UTF-8
>> >> >> >>>>> string using something like:
>> >> >> >>>>> byte[] theBytes = myTestString.getBytes("UTF-8");
>> >> >> >>>>>
>> >> >> >>>>> and then print out theBytes in hex; they should look like the above.
>> >> >> >>>>> If not, please let us know what the values are instead.
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> Thanks. -Marshall
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >> >> >>>>>> Hi, I read your explanation and looked at the link, but in my case I
>> >> >> >>>>>> don't read any xml file. I just copy the text, get a new input cas
>> >> >> >>>>>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> >> >> >>>>>> and send the request with sendCAS(). I use uima-as API 2.9.0 on the
>> >> >> >>>>>> client side. Apparently the characters are replaced by their
>> >> >> >>>>>> corresponding entities when the cas is serialized for sending, but I
>> >> >> >>>>>> get the mentioned exception "org.xml.sax.SAXParseException;
>> >> >> >>>>>> lineNumber: 1; columnNumber: 571; Character reference "&#"
>> >> >> >>>>>> in the installed uima-as framework when it tries to deserialize the
>> >> >> >>>>>> cas with deserializeCasFromXmi(), to be processed by the service.
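A note on the hex output exchanged above: the "ffffff" prefixes in the Integer.toHexString() output come from Java bytes being signed, so each byte is sign-extended to an int before formatting. The following is only a minimal sketch of the check Marshall asked for, with the byte masked so the values match his list; the class and method names (Utf8HexDump, utf8Hex) are made up for illustration and are not part of UIMA.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of UIMA: prints the UTF-8 bytes of a string in hex.
class Utf8HexDump {

    /** Returns the UTF-8 bytes of s as space-separated, two-digit, upper-case hex. */
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF: without it, a byte such as 0xF0 is sign-extended
            // and Integer.toHexString prints "fffffff0".
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+16980 written as its UTF-16 surrogate pair, then U+FFFD.
        System.out.println(utf8Hex("\uD81A\uDD80")); // F0 96 A6 80
        System.out.println(utf8Hex("\uFFFD"));       // EF BF BD
    }
}
```

With the mask applied, the output lines up with the F0 96 A6 80 and EF BF BD values quoted in the thread.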
>> >> >> >>>>>>
>> >> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <[email protected]>:
>> >> >> >>>>>>> Hi Nelson,
>> >> >> >>>>>>>
>> >> >> >>>>>>> I can't see the characters (sorry).
>> >> >> >>>>>>>
>> >> >> >>>>>>> This might be an issue caused by a discrepancy between the coding of
>> >> >> >>>>>>> the file being read, and the coding indicated on the xml header. Can
>> >> >> >>>>>>> you check that those two things are the same?
>> >> >> >>>>>>>
>> >> >> >>>>>>> See
>> >> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >> >> >>>>>>> for example.
>> >> >> >>>>>>>
>> >> >> >>>>>>> -Marshall
>> >> >> >>>>>>>
>> >> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >> >> >>>>>>>> I tried to process the following text in a service deployed in
>> >> >> >>>>>>>> uima-as, because it is input of my application. This is the text:
>> >> >> >>>>>>>> 𖦀 𖦐 � �.
>> >> >> >>>>>>>> These characters correspond to the Bamum language, and apparently
>> >> >> >>>>>>>> are not invalid xml characters because tools such as browsers
>> >> >> >>>>>>>> interpret and display them. After getting a new input cas for
>> >> >> >>>>>>>> processing, setting the text and sending the request, I get the
>> >> >> >>>>>>>> exception shown below in uima-as. The uima-as framework keeps
>> >> >> >>>>>>>> working and recovers correctly; it just does not process these
>> >> >> >>>>>>>> characters.
>> >> >> >>>>>>>> Could you tell me what happens with these characters? Is one of
>> >> >> >>>>>>>> them an invalid character for the uima-as framework?
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> 04:00:31.606 - 14:
>> >> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >> >> >>>>>>>> WARNING:
>> >> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >> >> >>>>>>>> Character reference "&#
>> >> >> >>>>>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >> >> >>>>>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >> >> >>>>>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >> >> >>>>>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
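For reference, the surrogate arithmetic Burn describes in this thread, and the reason a reference such as &#55322; is rejected by an XML parser, can be checked with a few lines of Java. This is only a sketch: the class and method names (SurrogateMath, combine, isXml10Char) are invented for illustration, the validity check is my reading of the Char production in the XML 1.0 specification, and nothing here is UIMA code.

```java
// Hypothetical helper, not part of UIMA: surrogate pair arithmetic and an
// XML 1.0 character check, for illustration only.
class SurrogateMath {

    /** Combines a high and a low UTF-16 surrogate into one code point. */
    static int combine(int high, int low) {
        int top = high - 0xD800;    // strip the D8 prefix: 0xD81A -> 0x01A
        int bottom = low - 0xDC00;  // strip the DC prefix: 0xDD80 -> 0x180
        // Two 10-bit halves, offset by 2^16 (0x10000).
        return 0x10000 + (top << 10) + bottom;
    }

    /** True if cp is allowed by the XML 1.0 Char production (assumed reading of the spec). */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int cp = combine(55322, 56704);                // xD81A + xDD80
        System.out.println(Integer.toHexString(cp));   // 16980
        // The JDK's own conversion gives the same code point.
        System.out.println(cp == Character.toCodePoint((char) 55322, (char) 56704)); // true
        // A lone surrogate is not a legal XML character, the full code point is.
        System.out.println(isXml10Char(55322));        // false
        System.out.println(isXml10Char(cp));           // true
    }
}
```

If this reading is right, a serializer that writes each UTF-16 code unit as its own numeric reference (&#55322;&#56704;) produces XML that a conforming parser must reject, whereas a single reference to the combined code point (&#x16980;), or the raw UTF-8 bytes, would be accepted.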
