On Thu, 23 Sep 2004, Jesse Pelton wrote:
> I think the blame belongs with your clients' authoring tools, which
> should help them produce well-formed documents.  On the other hand, if
> you want to work around the presence of certain illegal characters, you
> could (as a service) translate them into character entities before
> handing them off to a parser.  (I wouldn't, though.  You'll come to
> regret it, perhaps when you have to accept a UTF-16 document, and a byte
> of data is no longer even roughly equivalent to a character.)

I agree that your best chance for success is to filter and normalize the
input yourself before feeding it to Xerces, because then you control
exactly how the bad data gets "fixed".

Let's say magic pixies add "fuzziness" to the next Xerces release.  Maybe
the way that it fixes the data won't be the way you want it fixed.  Maybe
they won't handle one or more special cases you need. Maybe the way you
would like the data fixed is contrary to the way someone else would like
the data fixed.  Maybe the users will find new, fun, and interesting ways
to break the formatting.  Now you have to wait for the next release for
the magic pixies to add that new feature.

If you do some simple sed/awk/perl/whatever scripting on the input before
passing it to Xerces, you have complete control over the process, and it
doesn't affect any other users.
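
For what it's worth, here is a rough sketch of what such a pre-filter might
look like, in Python rather than sed/awk/perl.  The function names and the
policy of substituting U+FFFD for each bad character are my own assumptions,
not anything Xerces provides; pick whatever policy suits your clients.  It
works on decoded text rather than raw bytes, so a UTF-16 document is handled
the same way as a UTF-8 one.  Note that it only deals with characters that
XML 1.0 does not allow at all; it does not escape stray "&" or "<".

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(text, replacement="\uFFFD"):
    """Replace every character that XML 1.0 forbids (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


def sanitize_xml_bytes(data, encoding="utf-8"):
    """Decode, sanitize, and re-encode a document (hypothetical helper).

    Bytes that are not valid in the given encoding also become U+FFFD.
    """
    text = data.decode(encoding, errors="replace")
    return sanitize_xml_text(text).encode(encoding)


if __name__ == "__main__":
    raw = b"<note>bell\x07 and backspace\x08 characters</note>"
    print(sanitize_xml_bytes(raw).decode("utf-8"))
```

Pass replacement="" if you would rather drop the offending characters than
mark where they were.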

----------------------------------------------------------------------------
DDDD   David Kramer         [EMAIL PROTECTED]       http://thekramers.net
DK KD  
DKK D  "Music expresses that which cannot be said 
DK KD  and on which it is impossible to be silent."
DDDD                                                           - Victor Hugo

