If you're using the default settings, the serialization used is "XMI", which does indeed require that the text data being serialized consist of valid XML characters. I can see from the backtrace that this is what's being used.
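For reference, here is a minimal sketch of what "valid XML characters" means in practice, assuming the XML 1.0 Char production (tab, LF, CR, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF). The class and method names (XmlCharSanitizer, isValidXml10, sanitize) are hypothetical and not part of UIMA. The sketch replaces each offending character with a same-width substitute rather than deleting it, on the assumption that you want character offsets into the text to stay unchanged.

```java
// Hypothetical helper, not part of UIMA: detects and replaces characters
// that are not allowed by the XML 1.0 Char production.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** Returns true if the code point is a legal XML 1.0 character. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Replaces every non-XML-1.0 character with the given replacement.
     * One replacement char is emitted per original char, so the result
     * has the same length as the input.
     */
    static String sanitize(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);          // handles surrogate pairs
            int width = Character.charCount(cp);   // 1 or 2 Java chars
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                for (int k = 0; k < width; k++) {
                    out.append(replacement);
                }
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x19 is the character reported in the exception from the backtrace.
        String raw = "note\u0019text";
        System.out.println(sanitize(raw, ' '));
    }
}
```

Something along these lines could be applied to the document text in the collection reader before it is set on the CAS. Whether a space is an acceptable substitute depends on your data.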
If you need to use UIMA-AS with invalid chars, you can do one of several things:

1) change the type of the data holding these from String to some form of byte sequence.

2) change the way serialization is done among UIMA-AS components. There's a "binary" serialization which might avoid this issue (it's faster, too, but it has the drawback that the "client" and the "service" must have exactly the same type system).

-Marshall

On 10/21/2011 1:58 PM, Charles Bearden wrote:
> I created a simple UIMA-AS pipeline comprising a collection reader and an
> aggregate AE, which I ran simply like so:
>
>   runRemoteAsyncAE.sh tcp://localhost:61616 CollectionReader \
>     -d <deployment descriptor> \
>     -c <collection reader descriptor> \
>
> Evidently, the content I wish to process has some non-XML characters in it,
> because a certain bit of data raises an exception, the heart of which appears
> to be:
>
>   Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0
>   character: , 0x19
>
> The complete exception is here:
> <http://pastebin.com/rMPyAhqP>
>
> The point in my code at which the exception enters the picture
> (NoteLinesFromDBReader.java:139) is the point in the .getNext() method where I
> get the next CAS:
>   jcas = aCAS.getJCas();
>
> I don't run into this problem when I use the old-fashioned CPE, so my thinking
> is that the CAS from the CR is being serialized before being put into the
> queue. Is the expectation in UIMA AS that I sanitize text artifacts of non-XML
> characters before the CR gets them? Or am I doing something else wrong
> perhaps?
>
> Thanks for your help,
> Chuck
