Helleu, Pod pals!
Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252
where technically you were expecting Latin-1.
~~
Long version:
I don't normally pipe up about (or keep up with anything about) Pod
stuff, because it's yall's language now-- but since an issue of my
original intent has come up, and it shunted into my normal inbox, I'll
jump in:
On 01/05/2015 10:58 PM, David E. Wheeler wrote:
> [...] Pod Peeps:
> if the first highbit byte sequence in the file seems valid as a UTF-8
> sequence, or otherwise as Latin-1.
> [...] I suggest we switch from Latin-1 to CP1252. [...]
I agree completely, go for it!
Yes:
* assume that input is CP1252 in the absence of any encoding being
declared
* assume that input is CP1252 if the declared encoding is Latin-1
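For concreteness, here is a minimal Python sketch of those two rules combined with the "first highbit byte sequence" test quoted above. It is an illustration only, not how Pod::Simple actually does it; the names choose_encoding, first_highbit_run, and LATIN1_LABELS are made up for this example, and the list of Latin-1 spellings is an assumption.

```python
from typing import Optional

# Assumed spellings of "Latin-1" that get construed as CP1252 (illustrative list).
LATIN1_LABELS = {"latin1", "latin-1", "iso-8859-1", "iso8859-1"}

def first_highbit_run(data: bytes) -> bytes:
    """Return the first run of consecutive bytes >= 0x80, or b'' if there is none."""
    start = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if start is None:
        return b""
    end = start
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return data[start:end]

def choose_encoding(data: bytes, declared: Optional[str] = None) -> str:
    """Pick a decoding for Pod input under the two rules above."""
    if declared is None:
        run = first_highbit_run(data)
        if run:
            try:
                run.decode("utf-8")
                return "utf-8"       # first highbit sequence is valid UTF-8
            except UnicodeDecodeError:
                pass
        return "cp1252"              # rule 1: nothing declared, assume CP1252
    label = declared.strip().lower()
    if label in LATIN1_LABELS:
        return "cp1252"              # rule 2: declared Latin-1, construe as CP1252
    return label                     # any other declared encoding is left alone
```

So choose_encoding(b"caf\xe9") gives "cp1252", and the same word encoded as UTF-8 gives "utf-8".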
As far as I know, that amicable bait-and-switch (i.e., construing
Latin-1 to actually mean the superset CP1252) means in practice that
everybody wins, and nobody loses, and DWIM prevails yet again.
Moreover, this construal of Latin-1 as CP1252 has significant precedent:
«Most modern web browsers and e-mail clients treat the MIME charset
ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is
now standard behavior in the draft HTML 5 specification, which
requires that documents advertised as ISO-8859-1 actually be parsed
with the Windows-1252 encoding.»
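To make the practical difference concrete, here is a small Python sketch (the sample bytes are invented for illustration). The two encodings agree on bytes 0xA0-0xFF; they differ only in 0x80-0x9F, where Latin-1 yields C1 control characters and CP1252 yields printable punctuation.

```python
# Sample bytes as a Windows editor might write them: curly quotes,
# an en dash, and a euro sign, all in the 0x80-0x9F range.
data = b"\x93DWIM\x94 \x96 50\x80"

as_latin1 = data.decode("latin-1")
as_cp1252 = data.decode("cp1252")

# Under Latin-1 the high bytes become C1 control characters (U+0080-U+009F).
print([hex(ord(c)) for c in as_latin1 if ord(c) >= 0x80])

# Under CP1252 the same bytes become the characters the author meant.
print(as_cp1252)

# Bytes 0xA0-0xFF decode identically under both encodings.
high = bytes(range(0xA0, 0x100))
assert high.decode("latin-1") == high.decode("cp1252")
```

One caveat: Python's cp1252 codec rejects the five bytes that CP1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a real implementation would need to decide what to do with those.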
And it obeys Postel's law:
"Be conservative in what you do; be liberal in what you accept from
others."
And...
http://www.w3.org/TR/encoding/#names-and-labels
even seems to tolerate more things, to a point, if I'm reading it
right. Dunno. On this point, it's up to you folks.
BTW: I think many people would appreciate having "=encoding ansi"
tolerated as a synonym for "=encoding win-1252"... because some
systems simply call it that-- and I can never remember 1252 vs 1250 vs
my own zipcode vs last four digits of my Antarctican passport, etc.
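A tolerant label lookup could be as simple as the Python sketch below. The table ENCODING_ALIASES and the function normalize_label are hypothetical names for this example, and which spellings to accept is exactly the part that's up to you folks.

```python
# Hypothetical alias table: every key is a label an author might write
# after "=encoding"; every value is the codec actually used to decode.
ENCODING_ALIASES = {
    "ansi": "cp1252",
    "win-1252": "cp1252",
    "win1252": "cp1252",
    "windows-1252": "cp1252",
    "cp1252": "cp1252",
    "latin-1": "cp1252",     # the Latin-1-means-CP1252 construal
    "latin1": "cp1252",
    "iso-8859-1": "cp1252",
}

def normalize_label(label: str) -> str:
    """Map a declared encoding label to a codec name, tolerating case,
    surrounding whitespace, and underscores in place of hyphens."""
    key = label.strip().lower().replace("_", "-")
    return ENCODING_ALIASES.get(key, key)
```

Unknown labels pass through unchanged (lowercased), so normalize_label("shiftjis") still returns "shiftjis".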
Incidentally, you presumably might want to expand the
%Latin1Code_to_fallback table in Pod::Escapes.
(...which reminds me to push out some more versions of Unidecode,
notably one that covers the symbol for the now very eventful ruble.)
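As a rough idea of what the extra fallback entries might cover, here is a Python sketch keyed by the Unicode code points that CP1252 puts in 0x80-0x9F. The table CP1252_FALLBACK, the function to_ascii, and the particular ASCII stand-ins are all assumptions for illustration; I have not matched them to the real table's format in Pod::Escapes.

```python
# Hypothetical ASCII fallbacks for some characters CP1252 adds over Latin-1.
CP1252_FALLBACK = {
    0x20AC: "EUR",    # euro sign (byte 0x80)
    0x2026: "...",    # horizontal ellipsis (byte 0x85)
    0x2018: "'",      # left single quotation mark (byte 0x91)
    0x2019: "'",      # right single quotation mark (byte 0x92)
    0x201C: '"',      # left double quotation mark (byte 0x93)
    0x201D: '"',      # right double quotation mark (byte 0x94)
    0x2022: "*",      # bullet (byte 0x95)
    0x2013: "-",      # en dash (byte 0x96)
    0x2014: "--",     # em dash (byte 0x97)
    0x2122: "(TM)",   # trade mark sign (byte 0x99)
}

def to_ascii(text: str) -> str:
    """Replace each non-ASCII character with its fallback, or '?' if none is listed."""
    return "".join(
        ch if ord(ch) < 0x80 else CP1252_FALLBACK.get(ord(ch), "?")
        for ch in text
    )
```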
Now, there's two issues that may or may not be already seen as separate:
* assuming that input is CP1252 in the absence of any encoding being
declared
* assuming that input is CP1252 if the declared encoding is Latin-1
I suggest doing both (like HTML5)-- but at least the first definitely!
If anyone wants extreme S&M, maybe throw a note in WARNINGS about "I
expected this to be in Latin-1 but it looks like maybe you should
probably have a '=encoding win1252' line."
But that seems a case of pointless and even onerous obtuseness,
instead of unproblematic DWIM. I think.
> I’ve discussed this with Sean Burke in the last couple years, and IIRC he said
> he probably should have assumed CP1252 instead of Latin-1 when he wrote it.
True enough!
> But not if there are flaws with the plan. Thoughts? Should we make this change?
> Seems like a win overall to me, but I miss details all the time. Let me know
> your thoughts.
As to possible flaws, I see two, both on the very edge of remote
possibility.
But, for the sake of completeness, I'll note them:
* I think using characters 0x80-0x9F might just conceivably screw up
some crazy text editors' "what encoding is this?" guesswork-- with
what consequences I don't know.
But, ya know, as Paul F. Tompkins says: "We are living in a year with
a TWO IN FRONT OF IT!", so any editor that silently guesses that way,
and somehow silently makes bad things happen, should have already been
pushed out an airlock at least a decade ago.
* And, speaking of heuristics: I think the recognition heuristics in
Unix's file(1) might... remotely, conceivably... change file(1)'s
opinion of what a pure-Pod input file is, from yes to no, if it
construes a file that has 0x80-0x9F but also has "=encoding latin-1"
as a paradox that means something not-Pod. Hypothetically.
But that is far beyond any sense that file(1) can be expected to
*reliably* have (or maybe can even express in its recognition rules).
Already file(1) is just catastrophically dumb at anything other than
answering things like "is this extensionless file a GIF?", because
beyond that, it already guesses wrong more often than right.
I've just now run it on Pod/Simple.pod and it said
"C source, ASCII text"
Boioiooing.
And I've just now run it on a s2763_sjis.pod I had lying around, which
has two kanji in the first 64 bytes-- and with a "=encoding shiftjis"
being the second line in the file!, and file(1) said:
"Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line
terminators"
So... Don't overthink why file(1) does what it does-- *it* certainly
doesn't overthink it.
I hope this message has helped.
REESE'S PIECES OUT.