On Fri, 7 Dec 2012 17:48:12 -0800 Buck Golemon <[email protected]> wrote:
> > If you already have existing data in 1252 or a variation (and can’t
> > tell them apart), then nothing’s gained by making NEW requirements
> > for 1252 which the old data won’t conform to.
>
> Old latin1 documents can contain 0x81 and still be valid.
> All browsers decode latin1 documents with cp1252.
> In all cases, such a document would decode with a U+0081 character,
> with no error.

Are there *valid* Latin-1 documents with 0x81?  0x81 looks more like a
bit of mojibake.  Surely what's more at issue is finding the least bad
handling of partially corrupt text, e.g. with a view to correcting
errors, just as we don't discard emails with grammatical errors in the
text.

As for Shawn Steele's recommendation to create new data in UTF-8, there
are 8-bit channels that corrupt UTF-8, such as replies via the Yahoo
groups web interface, which irrecoverably mangles some continuation
bytes.

Richard.
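
For illustration, the 0x81 case quoted above can be sketched in Python.
The sample bytes and the names c1_fallback and decode_like_browser are
hypothetical. The fallback assumes the browser-style windows-1252
mapping, in which the five bytes left unassigned in the cp1252 table
(0x81, 0x8D, 0x8F, 0x90, 0x9D) decode to the C1 controls of the same
value. Python's own strict cp1252 codec rejects those bytes instead.

```python
import codecs

# Hypothetical sample: Latin-1 text containing a stray 0x81 byte.
raw = b"caf\xe9 \x81"

# ISO-8859-1 maps every byte to the code point of the same value,
# so 0x81 becomes the C1 control character U+0081 with no error.
as_latin1 = raw.decode("latin-1")

# Python's strict cp1252 codec treats 0x81 as unassigned and raises.
try:
    raw.decode("cp1252")
    strict_cp1252_ok = True
except UnicodeDecodeError:
    strict_cp1252_ok = False


def _c1_fallback(exc):
    """Error handler: map each undecodable byte to the same-valued code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end


# Register the handler under a hypothetical name for use with bytes.decode().
codecs.register_error("c1_fallback", _c1_fallback)


def decode_like_browser(data: bytes) -> str:
    """Decode as cp1252, falling back to C1 controls for unassigned bytes
    (an assumption modelled on the behaviour described in the quoted text)."""
    return data.decode("cp1252", errors="c1_fallback")


print(repr(as_latin1))
print(strict_cp1252_ok)
print(repr(decode_like_browser(raw)))
```

With this fallback, assigned cp1252 bytes such as 0x80 still decode to
their Windows meanings (here the euro sign), while 0x81 survives as
U+0081 so that it can be inspected or corrected later rather than
causing the whole document to be rejected.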

