According to Wikipedia (https://en.wikipedia.org/wiki/Bamum_script), the Bamum script has a second valid range, U+16800–U+16A3F, and any of these characters generates the same log trace. I will continue testing Marshall Schor's suggestion.
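For reference, here is a minimal Java sketch for checking the code points discussed in this thread. It decodes the UTF-8 bytes, tests them against the range given above, and compares a whole code point with a single UTF-16 code unit under the XML 1.0 `Char` production. The class and method names are hypothetical and not part of UIMA or UIMA-AS. The sketch does not reproduce what the XMI serializer actually emits. If a serializer wrote each surrogate half as its own numeric character reference, a parser would be expected to reject it, but the thread does not confirm that this is what happens here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of UIMA or UIMA-AS.
class BamumCodePointCheck {

    // Decode UTF-8 bytes and return the first Unicode code point.
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Range U+16800..U+16A3F, as given in the message above.
    static boolean inBamumSupplement(int cp) {
        return cp >= 0x16800 && cp <= 0x16A3F;
    }

    // XML 1.0 "Char" production: surrogate code units (D800..DFFF) are
    // excluded, supplementary code points (10000..10FFFF) are allowed.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Format bytes as two-digit hex. Masking with 0xFF avoids the
    // sign-extended form ("fffffff0") that Integer.toHexString(byte) gives.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80};
        int cp = firstCodePoint(utf8);
        String s = new String(Character.toChars(cp));

        System.out.println("bytes: " + hex(utf8));
        System.out.println("code point: U+" + Integer.toHexString(cp).toUpperCase());
        System.out.println("in range U+16800..U+16A3F: " + inBamumSupplement(cp));
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("code point is an XML 1.0 Char: " + isXml10Char(cp));
        System.out.println("first code unit alone is an XML 1.0 Char: " + isXml10Char(s.charAt(0)));
    }
}
```

The same helpers can be applied to EF BF BD, which decodes to U+FFFD (the replacement character) and falls outside the range above.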
2016-12-14 18:07 GMT-05:00, Burn Lewis:
> I think there's another problem ... the characters we have tested with are
> not in the Bamum unicode set. The first 2 that Marshall listed in UTF-8
> (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
> BD) is xFFFD. This last one is the "replacement character" used when an
> illegal character is encountered. According to Wikipedia the 88 Bamum
> characters are in the range xA6A0 - xA6F7.
>
> In order to reproduce your problem we need to use the same codepoints. Can
> you tell us what the hex values of the failing characters are, in UTF-8 or
> UTF-16?
>
> By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
> following the quick test described in the UIMA-AS README.
>
> On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor wrote:
>
>> Maybe we've been on the wrong line of thinking.
>>
>> Perhaps the translation between UTF-8 (during transportation) and the
>> string characters is fine, but the XML parsing is restricting the
>> character set it uses.
>>
>> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>>
>> where it says valid xml characters exclude the "surrogates", which your
>> characters I think are.
>>
>> So, perhaps it's XML parsing which is complaining (and it appears this is
>> so, from the stack trace).
>>
>> We should point out that UIMA's character offsets (like begin and end)
>> were designed with Java String character offsets, and will perhaps not
>> work correctly when surrogates are being used.
>>
>> A possible workaround for this particular issue may be to switch to
>> binary serialization, instead of xmi serialization. This has a
>> restriction in that the type systems must be identical (between the
>> client and server).
>>
>> We could possibly get more confirmation of this hypothesis if you could
>> say what the stack trace was, beyond the first bit which you stated in
>> your original note. There should be more stack trace information,
>> further down, starting with "caused by ..." which may provide more
>> helpful information.
>>
>> -Marshall
>>
>>
>> On 12/14/2016 9:38 AM, nelson rivera wrote:
>> > We also ran that test with the uima framework and the RunAE tool, with
>> > the characters in a file as you did, and indeed there is no problem.
>> > The problem occurs with uima-as, using sendCAS() with
>> > UimaAsynchronousEngine: when the remote uima-as service tries to
>> > deserialize the cas in deserializeCasFromXmi(), I get the mentioned
>> > exception
>> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> > Character reference "&#"
>> >
>> > In my case I don't read any file and don't use FileSystemCollectionReader.
>> > The user enters the text, the text is stored in a Java string
>> > (utf-16) and it is set on the cas that will be processed, using
>> > setDocumentLanguage, then I send the cas.
>> >
>> > 2016-12-13 15:10 GMT-05:00, Burn Lewis:
>> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
>> >> the MeetingDetector annotator as described in section 3.4 of the
>> >> README, adding the option "-o out". In that folder I found the
>> >> returned results in xmi format with the characters in the sofaString
>> >> element. The relevant part of this file in hex is:
>> >>
>> >> 000002e0: 7472 696e 673d 22f0 96a6 80f0 96a6 90ef  tring=".........
>> >> 000002f0: bfbd 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>> >>
>> >> Note that the FileSystemCollectionReader by default uses the system
>> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
>> >> for the Encoding parameter in its descriptor.
>> >>
>> >> With the client & server on different (Linux) machines I see no
>> >> problem with sending UTF-8 characters.
>> >>
>> >>
>> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor wrote:
>> >>
>> >>> another question: I assume there are perhaps 2 machines involved,
>> >>> here (it's a UIMA-AS setup).
>> >>>
>> >>> From the exception, it appears that the error happens when the client
>> >>> sends the CAS to the remote.
>> >>>
>> >>> Can you print out the Linux (assuming that's the OS) default locale
>> >>> for both machines? (e.g. type into a command line "locale" and see
>> >>> what each machine has as its default character encoding).
>> >>>
>> >>> Please let us know what these are.
>> >>>
>> >>> Thanks. -Marshall
>> >>>
>> >>>
>> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> >>>> Yes, these are the values of the troublesome characters. Using
>> >>>> Integer.toHexString() to print out each byte shows
>> >>>>
>> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
>> >>>>
>> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
>> >>>>
>> >>>> ffffffef ffffffbf ffffffbd
>> >>>>
>> >>>> ffffffef ffffffbf ffffffbd
>> >>>>
>> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor:
>> >>>>> Hi Nelson,
>> >>>>>
>> >>>>> Looking into this... Can you please confirm that the UTF-8 coding
>> >>>>> of the troublesome characters, in hexadecimal, is:
>> >>>>>
>> >>>>> F0 96 A6 80
>> >>>>>
>> >>>>> F0 96 A6 90
>> >>>>>
>> >>>>> EF BF BD
>> >>>>>
>> >>>>> EF BF BD
>> >>>>>
>> >>>>> If you have the string in Java, please try converting it to a UTF-8
>> >>>>> string using something like:
>> >>>>> byte[] theBytes = myTestString.getBytes("UTF-8");
>> >>>>>
>> >>>>> and then print out theBytes in hex; they should look like the
>> >>>>> above. If not, please let us know what the values are instead.
>> >>>>>
>> >>>>>
>> >>>>> Thanks. -Marshall
>> >>>>>
>> >>>>>
>> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >>>>>> Hi, I read your explanation and saw the link, but in my case I
>> >>>>>> don't read any xml file. I just copy the text, get a new input cas
>> >>>>>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> >>>>>> and send the request with sendCAS(). I use uima-as API 2.9.0 on the
>> >>>>>> client side. Apparently the characters are replaced by their
>> >>>>>> corresponding entities when the cas is serialized for sending, but
>> >>>>>> I get the mentioned exception
>> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1;
>> >>>>>> columnNumber: 571; Character reference "&#"
>> >>>>>> in the installed uima-as framework when it tries to deserialize the
>> >>>>>> cas in deserializeCasFromXmi(), to be processed by the service.
>> >>>>>>
>> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor:
>> >>>>>>> Hi Nelson,
>> >>>>>>>
>> >>>>>>> I can't see the characters (sorry).
>> >>>>>>>
>> >>>>>>> This might be an issue caused by a discrepancy between the coding
>> >>>>>>> of the file being read, and the coding indicated on the xml
>> >>>>>>> header. Can you check that those two things are the same?
>> >>>>>>>
>> >>>>>>> See
>> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >>>>>>> for example.
>> >>>>>>>
>> >>>>>>> -Marshall
>> >>>>>>>
>> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >>>>>>>> I tried to process the following text in a service deployed in
>> >>>>>>>> uima-as, because it is input to my application. This is the
>> >>>>>>>> text: 𖦀 𖦐 � �.
>> >>>>>>>> These characters belong to the Bamum language and are apparently
>> >>>>>>>> not invalid xml characters, because tools such as browsers
>> >>>>>>>> interpret and display them. After getting a new input cas for
>> >>>>>>>> processing, setting the text and sending the request, I get the
>> >>>>>>>> exception shown below in uima-as. The uima-as framework keeps
>> >>>>>>>> working and recovers correctly; it just does not process these
>> >>>>>>>> characters.
>> >>>>>>>> Could you tell me what happens with these characters? Is one of
>> >>>>>>>> them an invalid character for the uima-as framework?
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 04:00:31.606 - 14:
>> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >>>>>>>> WARNING:
>> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >>>>>>>> Character reference "&#
>> >>>>>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >>>>>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >>>>>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >>>>>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
