I find that the unicode.org cp1252 mapping file leaves those bytes undefined as well, so the issue stems from there.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally non-semantic U+0081? This would allow systems that follow the HTML5 standard and use cp1252 in place of latin1 to remain binary-faithful and reversible.

On Fri, Nov 16, 2012 at 3:38 PM, Buck Golemon <[email protected]> wrote:
> cp1252 (aka windows-1252) defines 27 characters which iso-8859-1 does not.
> This leaves five bytes with undefined semantics.
>
> Currently the python cp1252 decoder allows us to ignore/replace/error on
> these bytes, but there's no facility for allowing these unknown bytes to
> round-trip through the codec, as the latin1 codec does.
>
> I'd like to get this "fixed" but I will have a very hard time convincing
> anyone that it's wrong.
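
To make the round-trip problem concrete, here is a rough Python 3 sketch of the behaviour I have in mind. The handler name `c1_passthrough` and the constant `UNDEFINED_CP1252` are my own inventions for illustration and are not part of the stdlib; the only real API used is `codecs.register_error`. It maps each of the five undefined bytes to the code point with the same number, and back again on encode.

```python
import codecs

# The five bytes that CP1252.TXT leaves undefined.
UNDEFINED_CP1252 = b"\x81\x8d\x8f\x90\x9d"


def c1_passthrough(exc):
    """Hypothetical handler: pass undefined cp1252 bytes through as U+0081 etc."""
    if isinstance(exc, UnicodeDecodeError):
        # Undecodable byte(s) -> code points with the same numeric value.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Only reverse the mapping for the five code points we produced above;
        # anything else is a genuine encoding error.
        bad = exc.object[exc.start:exc.end]
        if all(ord(ch) in UNDEFINED_CP1252 for ch in bad):
            return bytes(ord(ch) for ch in bad), exc.end
    raise exc


codecs.register_error("c1_passthrough", c1_passthrough)

all_bytes = bytes(range(256))

# latin1 round-trips every byte value.
latin1_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Strict cp1252 rejects the undefined bytes.
try:
    b"\x81".decode("cp1252")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# With the hypothetical handler, cp1252 round-trips as well.
text = all_bytes.decode("cp1252", errors="c1_passthrough")
cp1252_ok = text.encode("cp1252", errors="c1_passthrough") == all_bytes
```

This is only a workaround at the error-handler level; what I am actually asking is whether the codec itself could behave this way by default.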

