Dear Adam,

Thanks very much for your replies. Let me summarize. We have two general options so far:

(1) preprocessing documents outside UIMA, by (a) manually "upgrading" XMIs to XML version 1.1 and/or (b) manually stripping offending character sequences; or
(2) processing the input docs within UIMA, by (a) making the XMI CAS serializer work with XML 1.1, (b) replacing offending sequences with spaces, or (c) storing docs as byte arrays.
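
To make the "replace offending sequences with spaces" option concrete, here is a minimal sketch in plain Java. It is only an illustration, not existing UIMA code: the class and method names (XmlCharSanitizer, isValidXml10, replaceInvalidXmlChars) are hypothetical, and I am assuming that "offending" means any code point outside the XML 1.0 Char production.

```java
// Hypothetical helper, not part of UIMA: replaces characters that are not
// legal in XML 1.0 with spaces before the text is serialized.
class XmlCharSanitizer {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text in which every invalid code point is replaced
    // by spaces. One space is written per UTF-16 unit, so the result has
    // the same length as the input and character offsets stay unchanged.
    static String replaceInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                for (int k = 0; k < units; k++) {
                    out.append(' ');
                }
            }
            i += units;
        }
        return out.toString();
    }
}
```

Calling XmlCharSanitizer.replaceInvalidXmlChars on a document text that contains a control character such as U+001A gives back a string of the same length with a space in that position. Replacing rather than deleting is deliberate in this sketch: annotation begin/end offsets computed on the cleaned text still line up with the original document.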

I would assume that, in general, (2) will be preferred over (1), and within (2) I'd prefer 2b over 2a over 2c. I agree with Adam that XML 1.1, although a nice simple solution, might prove "inconsumable" :) for certain apps, and that converting docs to byte arrays will add more processing. Since it's probably safe to assume that &#26; carries very little information when searching for regexps, and because it is really simple, I'd go for 2b.

I hope this is of some use. Please let me know what you decide. Thanks again for replying so fast.

My best regards,
Leo

--
Leo Ferres, Ph.D.
Human-Oriented Technology Lab
Carleton University, Ottawa, ON, Canada
