I put these 3 characters as UTF-8 in a file in examples/data and ran the MeetingDetector annotator as described in section 3.4 of the README, adding the option "-o out". In that folder I found the returned results in XMI format, with the characters in the sofaString element. The relevant part of this file in hex is:

000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V

Note that the FileSystemCollectionReader by default uses the system encoding, but you could add a ConfigurationParameterSetting of UTF-8 for the Encoding parameter in its descriptor. With the client and server on different (Linux) machines I see no problem with sending UTF-8 characters.

On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <[email protected]> wrote:

> Another question: I assume there are perhaps 2 machines involved here (it's a
> UIMA-AS setup).
>
> From the exception, it appears that the error happens when the client sends
> the CAS to the remote.
>
> Can you print out the Linux (assuming that's the OS) default locale for both
> machines? (e.g. type "locale" into a command line and see what each machine
> has as its default character encoding).
>
> Please let us know what these are.
>
> Thanks. -Marshall
>
> On 12/12/2016 1:58 PM, nelson rivera wrote:
> > Yes these are the values of the troublesome characters, using
> > Integer.toHexString() to print out each byte, shows
> >
> > fffffff0 ffffff96 ffffffa6 ffffff80
> >
> > fffffff0 ffffff96 ffffffa6 ffffff90
> >
> > ffffffef ffffffbf ffffffbd
> >
> > ffffffef ffffffbf ffffffbd
> >
> > 2016-12-12 11:35 GMT-05:00, Marshall Schor <[email protected]>:
> >> Hi Nelson,
> >>
> >> Looking into this... Can you please confirm that the UTF-8 coding of the
> >> troublesome characters, in hexadecimal, is:
> >>
> >> F0 96 A6 80
> >>
> >> F0 96 A6 90
> >>
> >> EF BF BD
> >>
> >> EF BF BD
> >>
> >> If you have the string in Java, please try converting it to a UTF-8 string
> >> using something like:
> >> byte[] theBytes = myTestString.getBytes("UTF-8");
> >>
> >> and then print out theBytes in hex; they should look like the above. If not,
> >> please let us know what the values are instead.
> >>
> >> Thanks.
> >> -Marshall
> >>
> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >>> Hi, I read your explanation and looked at the link, but in my case I
> >>> don't read any XML file. I just copy the text, get a new input CAS
> >>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS and
> >>> send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on the
> >>> client side. Apparently the characters are replaced by their
> >>> corresponding entities when the CAS is serialized for sending, but I get
> >>> the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> >>> columnNumber: 571; Character reference "&#"
> >>> in the installed UIMA-AS framework when it tries to deserialize the CAS
> >>> with deserializeCasFromXmi() so that the service can process it.
> >>>
> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <[email protected]>:
> >>>> Hi Nelson,
> >>>>
> >>>> I can't see the characters (sorry).
> >>>>
> >>>> This might be an issue caused by a discrepancy between the coding of the
> >>>> file being read, and the coding indicated on the xml header. Can you
> >>>> check that those two things are the same?
> >>>>
> >>>> See
> >>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >>>> for example.
> >>>>
> >>>> -Marshall
> >>>>
> >>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >>>>> I tried to process the following text in a service deployed in UIMA-AS,
> >>>>> because it is input to my application. This is the text: 𖦀 𖦐 � �.
> >>>>> These characters correspond to the Bamun language, and apparently are
> >>>>> not invalid XML characters, because tools such as browsers interpret
> >>>>> and display them. After getting a new input CAS for processing, setting
> >>>>> the text and sending the request, I get the exception shown below in
> >>>>> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
> >>>>> just does not process these characters.
> >>>>> Could you tell me what happens with these characters? Is one of them
> >>>>> an invalid character for the UIMA-AS framework?
> >>>>>
> >>>>> 04:00:31.606 - 14:
> >>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >>>>> WARNING:
> >>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >>>>> Character reference "&#
> >>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
> >>>>>
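
As a footnote to the byte values exchanged in this thread: Integer.toHexString() applied to a Java byte prints values such as fffffff0 because the byte is sign-extended to an int first; masking with 0xFF gives the two-digit form used in the hex dump. The sketch below is only an illustration, not part of UIMA: the class name Utf8HexDump and the helpers toHex, isXml10Char and describe are made up for this example. It prints the code point and UTF-8 bytes of each character, and checks each code point against the Char production of the XML 1.0 specification. The byte sequence EF BF BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for this thread; nothing here is a UIMA API.
class Utf8HexDump {

    /** Formats bytes as two-digit uppercase hex, masking off sign extension. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF turns the signed byte into 0..255, avoiding "fffffff0".
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Char production of XML 1.0: lone surrogates (D800..DFFF) are excluded. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** One line per code point: code point, UTF-8 bytes, XML 1.0 validity. */
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String single = new String(Character.toChars(cp));
            byte[] utf8 = single.getBytes(StandardCharsets.UTF_8);
            sb.append(String.format("U+%04X  %s  xml10=%b",
                    cp, toHex(utf8), isXml10Char(cp)));
            sb.append('\n');
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three characters from the hex dump, built from code points so
        // that the encoding of this source file does not matter.
        String text = new StringBuilder()
                .appendCodePoint(0x16980)
                .appendCodePoint(0x16990)
                .appendCodePoint(0xFFFD)
                .toString();
        System.out.print(describe(text));
    }
}
```

Iterating with codePoints() rather than over char values matters here: the two characters outside the Basic Multilingual Plane are each stored as a surrogate pair in a Java String, and a single surrogate on its own is not a legal XML 1.0 character even though the full code point is.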
