On Thu, 23 Sep 2004, Jesse Pelton wrote:
> I think the blame belongs with your clients' authoring tools, which
> should help them produce well-formed documents. On the other hand, if
> you want to work around the presence of certain illegal characters, you
> could (as a service) translate them into character entities before
> handing them off to a parser. (I wouldn't, though. You'll come to
> regret it, perhaps when you have to accept a UTF-16 document, and a byte
> of data is no longer even roughly equivalent to a character.)

I agree that your best chance for success is to filter and normalize the input before feeding it to Xerces, because then the way the documents get "fixed" is under your control.

Let's say magic pixies add "fuzziness" to the next Xerces release. Maybe the way it fixes the data won't be the way you want it fixed. Maybe it won't handle one or more special cases you need. Maybe the way you would like the data fixed is contrary to the way someone else would like it fixed. Maybe the users will find new, fun, and interesting ways to break the formatting, and now you have to wait for the next release for the magic pixies to add that new feature.

If you do some simple sed/awk/perl/whatever scripting before passing the input to Xerces, you have complete control over the process, and it doesn't affect any other users.

----------------------------------------------------------------------------
David Kramer [EMAIL PROTECTED] http://thekramers.net
"Music expresses that which cannot be said
and on which it is impossible to be silent."
- Victor Hugo
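
As a rough illustration of the pre-filtering step suggested above, here is a minimal sketch in Python rather than sed/awk/perl. It is not anything Xerces provides: the function name `strip_illegal_xml_chars` and the `replacement` parameter are made up for this example. It drops (or substitutes) every character outside the XML 1.0 `Char` production, and it assumes the input has already been decoded to text, so it works on characters rather than bytes and the UTF-16 concern raised in the quoted message does not apply to the filter itself.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str, replacement: str = "") -> str:
    """Return text with characters that XML 1.0 forbids removed or replaced.

    Hypothetical helper for illustration only. The caller is assumed to have
    decoded the document to str already, using whatever encoding it declares.
    """
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    dirty = "<note>form\x0bfeed and NUL\x00 inside</note>"
    print(strip_illegal_xml_chars(dirty))
```

Whether to delete the offending characters, replace them with a visible marker, or reject the document outright is exactly the kind of policy choice that stays in your hands when the cleanup runs before the parser.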