Dear Adam,

Thanks very much for your replies. Let me summarize. We have two
general options so far: (1) preprocessing documents outside UIMA
(a. manually "upgrading" the XMIs to XML version 1.1 and/or b. manually
stripping the offending character sequences), or (2) handling the input
docs from within UIMA (a. making the XMI CAS serializer work with XML
1.1, b. replacing the offending sequences with spaces, or c. storing
docs as byte arrays).

I would assume that, in general, (2) will be preferred over (1), and
within (2) I'd prefer 2b over 2a over 2c. I agree with Adam that,
although it is a nice, simple solution, XML 1.1 might prove
"inconsumable" :) for certain apps, and converting docs to byte arrays
will add more processing. Since it's probably safe to assume that &#26;
carries very little information when searching for regexps, and
because it is really simple, I'd go for 2b.
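
To make 2b concrete, here is a rough sketch of what I have in mind. It
is only an illustration under my own assumptions: the class and method
names are made up and are not part of UIMA, and the validity check
simply follows the Char production of the XML 1.0 spec. Each offending
UTF-16 code unit is replaced by exactly one space, so the text keeps
its length and any character offsets computed on it should stay
aligned.

```java
// Hypothetical helper, not a UIMA class: replaces characters that are
// not legal in XML 1.0 with spaces, preserving the string length.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String replaceInvalidWithSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int width = Character.charCount(cp); // 1 or 2 UTF-16 units
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                // One space per code unit, so the output length matches
                // the input length (unpaired surrogates land here too).
                for (int k = 0; k < width; k++) {
                    out.append(' ');
                }
            }
            i += width;
        }
        return out.toString();
    }
}
```

The idea would be to call replaceInvalidWithSpaces on the document text
before it is serialized, so that a stray character 26 becomes a plain
space and everything else passes through untouched.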

I hope this is of some use. Please let me know what you decide.

Thanks again for replying so fast.

My best regards,

Leo

--
Leo Ferres, Ph.D.
Human-Oriented Technology Lab
Carleton University,
Ottawa, ON, Canada
