Re: The Encoding Warning (Again) - some data

David E. Wheeler Tue, 28 Aug 2012 09:59:35 -0700

On Aug 27, 2012, at 9:03 PM, Grant McLean wrote:

> My opinion is that the warning is useful.  However to be pragmatic, now
> that the encoding detection heuristic has been implemented, adding
> =encoding declarations to any of those 1215 distributions will have no
> practical effect other than silencing the warning.  That's the best
> argument I can come up with in favour of changing the status quo.


Well, that, and you ensure that the proper encoding is selected (assuming the
developer knows how to encode things properly).

> On a tangentially related note, I was pondering whether the heuristic
> should actually fall back to CP1252 rather than ISO8859-1 - after all
> that's what the W3C recommend:
> 
>  http://www.w3.org/TR/html5/parsing.html#character-encodings-0
> 
> However my statistics show that only 44 files in current releases were
> detected as Latin-1 but actually contained CP1252 (typically "smart
> quote" symbols in the \x80-\x9F range).  So it doesn't seem worth
> pursuing that change.

You could assume CP1252 if any bytes are found in that range, and Latin-1
otherwise.
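
Something like the following, as a rough sketch only. It is in Python rather than Perl, it is not the actual detection code, and the function name guess_encoding is made up for illustration. It assumes the heuristic tries UTF-8 first and only then has to choose between the two single-byte encodings:

```python
def guess_encoding(data: bytes) -> str:
    """Guess an encoding name for undeclared bytes (illustrative sketch)."""
    # Assumption: valid UTF-8 wins outright. Pure ASCII also lands here.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 0x80-0x9F are C1 control codes in Latin-1 but printable characters
    # (smart quotes, dashes, etc.) in CP1252, so any byte in that range
    # suggests the file was really written as CP1252.
    if any(0x80 <= byte <= 0x9F for byte in data):
        return "cp1252"

    # No bytes in the ambiguous range: the two encodings agree on every
    # byte present, so Latin-1 is a safe answer.
    return "iso-8859-1"
```

This only returns a name. A few bytes in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in CP1252, so a strict decoder may still reject some input that this labels as CP1252.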

> Finally I searched for files which were detected as UTF-8 but actually
> contained characters from the CP1252 range.  There was only one and it
> wasn't an error in the detection, the source file contains a
> double-encoded character. It was a mangled attempt to name a contributor
> - Slaven Resić   :-)

Coincidence, I’m sure!

Best,

David
