Re: Assume CP1252

Sean Burke Sat, 10 Jan 2015 17:48:46 -0800

Helleu, Pod pals!

Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.


 ~~

Long version:

I don't normally pipe up about (or keep up with anything about) Pod stuff, because it's yall's language now-- but since an issue of my original intent has come up, and it shunted into my normal inbox, I'll jump in:


On 01/05/2015 10:58 PM, David E. Wheeler wrote:

[...] Pod Peeps:
 if the first highbit byte sequence in the file seems valid as a UTF-8
 sequence, or otherwise as Latin-1.
[...]  I suggest we switch from Latin-1 to CP1252. [...]


I agree completely, go for it!

Yes:

* assume that input is CP1252 in the absence of any encoding being declared

* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.
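
As an illustration only, the two rules plus the quoted first-highbit heuristic could be sketched like this in Python. This is not Pod::Simple's code; the function names and the set of accepted Latin-1 labels are assumptions made for the example. Note that Python's own cp1252 codec rejects the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D), so the sketch passes those through as C1 controls, which is what Latin-1 would have produced.

```python
import re

# Hypothetical set of labels treated as "declared Latin-1".
LATIN1_NAMES = {"latin1", "latin-1", "iso-8859-1", "iso8859-1"}

_HIGHBIT = re.compile(rb"[\x80-\xff]")


def _utf8_length(lead):
    """Length of the UTF-8 sequence a lead byte announces, or 0 if invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def guess_encoding(data, declared=None):
    """Pick a codec name for raw Pod bytes.

    Declared Latin-1 is construed as CP1252. With nothing declared, the
    first high-bit byte sequence decides: valid UTF-8, otherwise CP1252.
    """
    if declared is not None:
        name = declared.strip().lower()
        return "cp1252" if name in LATIN1_NAMES else name
    match = _HIGHBIT.search(data)
    if match is None:
        return "cp1252"  # pure ASCII decodes the same either way
    start = match.start()
    length = _utf8_length(data[start])
    if length:
        try:
            data[start:start + length].decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
    return "cp1252"


def decode_cp1252_lenient(data):
    """Decode as CP1252, passing unassigned bytes through as C1 controls."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)
```

With this, bytes 0x93 and 0x94 come out as curly quotes rather than as invisible control characters, and genuinely Latin-1 text (0xA0-0xFF) decodes exactly as before.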


Moreover, this construal of Latin-1 as CP1252 has significant precedent:

«Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the draft HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.»


And it obeys Postel's law:

"Be conservative in what you do; be liberal in what you accept from others."


And...
  http://www.w3.org/TR/encoding/#names-and-labels

even seems to tolerate more things, to a point, if I'm reading it right. Dunno. On this point, it's up to you folks.

BTW: I think many people would appreciate having "=encoding ansi" tolerated as a synonym for "=encoding win-1252"... because some systems simply call it that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four digits of my Antarctican passport, etc.
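
A tolerant label lookup of that kind might look like the following Python sketch. The alias table is hypothetical: which spellings a Pod parser should actually accept is a decision for the list, and only "ansi", "win-1252", "win1252", and Latin-1 come from this thread.

```python
# Hypothetical alias table: every key is a label someone might write
# after "=encoding"; every value is the codec actually used.
ENCODING_ALIASES = {
    "ansi": "cp1252",
    "win1252": "cp1252",
    "win-1252": "cp1252",
    "windows-1252": "cp1252",
    "cp1252": "cp1252",
    # Declared Latin-1 is construed as its superset CP1252.
    "latin1": "cp1252",
    "latin-1": "cp1252",
    "iso-8859-1": "cp1252",
}


def canonical_encoding(label):
    """Map a declared label to a codec name, ignoring case and padding.

    Unknown labels are returned normalized but otherwise unchanged.
    """
    key = label.strip().lower()
    return ENCODING_ALIASES.get(key, key)
```

So `canonical_encoding("ANSI")` and `canonical_encoding("win-1252")` both give `"cp1252"`, and nobody has to remember the number.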

Incidentally, you presumably might want to expand the %Latin1Code_to_fallback table in Pod::Escapes.
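
To show the kind of entries such an expansion would need, here is a small Python sketch of ASCII fallbacks for a few characters that CP1252 places in 0x80-0x9F. It does not reproduce the real Pod::Escapes table; the entries, the choice of fallback strings, and the function name are all assumptions for illustration.

```python
# Illustrative fallbacks, keyed by Unicode code point. The comments give
# the CP1252 byte that decodes to each character.
CP1252_FALLBACKS = {
    0x20AC: "EUR",   # 0x80 euro sign
    0x2026: "...",   # 0x85 horizontal ellipsis
    0x2018: "'",     # 0x91 left single quote
    0x2019: "'",     # 0x92 right single quote
    0x201C: '"',     # 0x93 left double quote
    0x201D: '"',     # 0x94 right double quote
    0x2013: "-",     # 0x96 en dash
    0x2014: "--",    # 0x97 em dash
    0x2122: "(TM)",  # 0x99 trade mark sign
}


def to_ascii(text, fallbacks=CP1252_FALLBACKS):
    """Replace non-ASCII characters with a fallback, or "?" if none is known."""
    pieces = []
    for char in text:
        code = ord(char)
        pieces.append(char if code < 0x80 else fallbacks.get(code, "?"))
    return "".join(pieces)
```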

(...which reminds me to push out some more versions of Unidecode, notably one that covers the symbol for the now very eventful ruble.)



Now, there's two issues that may or may not be already seen as separate:

* assuming that input is CP1252 in the absence of any encoding being declared

* assuming that input is CP1252 if the declared encoding is Latin-1

I suggest doing both (like HTML5)-- but at least the first definitely!

If anyone wants extreme S&M, maybe throw a note in WARNINGS about "I expected this to be in Latin-1 but it looks like maybe you should probably have a '=encoding win1252' line."
But that seems a case of pointless and even onerous obtuseness, instead of unproblematic DWIM. I think.

I’ve discussed this with Sean Burke in the last couple years, and IIRC he said 
he probably should have assumed CP1252 instead of Latin-1 when he wrote it.


True enough!

But not if there are flaws with the plan. Thoughts? Should we make this change? 
Seems like a win overall to me, but I miss details all the time. Let me know 
your thoughts.

As to possible flaws, I see two that are on the very edge of remote possibility.

But, for sake of completeness, I'll note:

* I think using characters 0x80-0x9F might just conceivably screw up some crazy text editors' "what encoding is this?" guesswork-- with what consequences I don't know.
But, ya know, as Paul F. Tompkins says: "We are living in a year with a TWO IN FRONT OF IT!", so any editor that silently guesses that way, and somehow silently makes bad things happen, should have already been pushed out an airlock at least a decade ago.

* And, speaking of heuristics: I think the recognition heuristics in Unix's file(1) might... remotely, conceivably... change file(1)'s opinion of what a pure-Pod input file is, from yes to no, if it construes a file that has 0x80-0x9F but also has "=encoding latin-1" as a paradox that means something not-Pod. Hypothetically.
But that is far beyond any sense that file(1) can be expected to *reliably* have (or maybe can even express in its recognition rules).
Already file(1) is just catastrophically dumb at anything other than answering things like "is this extensionless file a GIF?", because beyond that, it already guesses wrong more often than right.


I've just now run it on Pod/Simple.pod and it said
"C source, ASCII text"
Boioiooing.

And I've just now run it on a s2763_sjis.pod I had lying around, which has two kanji in the first 64 bytes-- and with a "=encoding shiftjis" being the second line in the file!, and file(1) said:
"Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line terminators"

So... Don't overthink why file(1) does what it does-- *it* certainly doesn't overthink it.



I hope this message has helped.
REESE'S PIECES OUT.
