On Fri, 7 Dec 2012 17:48:12 -0800
Buck Golemon <[email protected]> wrote:

> > If you already have existing data in 1252 or a variation (and can’t
> > tell them apart), then nothing’s gained by making NEW requirements for
> > 1252 which the old data won’t conform to.
> 
> 
> Old latin1 documents can contain 0x81 and still be valid.
> All browsers decode latin1 documents with cp1252.
> In all cases, such a document would decode with a U+0081 character,
> with no error.

Are there *valid* Latin-1 documents with 0x81?  0x81 looks more like a
bit of mojibake.  Surely what's more at issue is finding the least bad
handling of partially corrupt text, e.g. with a view to correcting
errors, just as we don't discard emails with grammatical errors in the
text.
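
To make the "least bad handling" point concrete, here is a minimal sketch
in Python. It assumes the browser behaviour described above (bytes that
cp1252 leaves unassigned, such as 0x81, pass through as the C1 control
with the same code point). Python's own "cp1252" codec rejects 0x81, so
the fallback is written by hand; the helper name decode_cp1252_lenient is
hypothetical and not part of any library.

```python
def decode_cp1252_lenient(data: bytes) -> str:
    """Decode bytes as cp1252, keeping unassigned bytes instead of failing.

    Assumption: unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped
    to the code point with the same value, as the quoted message says
    browsers do. Nothing is discarded, so the text can be corrected later.
    """
    chars = []
    for byte in data:
        try:
            # Normal case: the byte has a cp1252 assignment.
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Unassigned in Python's cp1252 table: fall back to Latin-1,
            # which maps every byte to the same-numbered code point.
            chars.append(chr(byte))
    return "".join(chars)


if __name__ == "__main__":
    sample = b"caf\xe9 \x81 \x93quoted\x94"
    print(repr(decode_cp1252_lenient(sample)))
```

A strict decode of the same bytes with Python's "cp1252" codec raises
UnicodeDecodeError at the 0x81, which is the "discard the email" outcome
rather than the "read it and fix the typo" one.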

As for Shawn Steele's recommendation to create new data in UTF-8,
there are 8-bit channels that corrupt UTF-8, such as replies via the
Yahoo Groups web interface, which irrecoverably mangles some
continuation bytes.

Richard.
