Re: Assume CP1252
On Jan 13, 2015, at 6:07 PM, David E. Wheeler da...@justatheory.com wrote:

> Pod::Simple 3.29 is on its way to CPAN now. I’m going to apply the change proposed in the "Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns" thread now, and once you have the EBCDIC and CP1252 stuff done, we can do a test release to let it smoke with those changes.

And now 3.29_1 is on CPAN with EBCDIC support and CP1252 as the default, thanks to Karl Williamson. Details here:

http://theory.pm/2015/02/11/please-test-pod-simple-3-dot-29-3/

Please test!

Best,

David
Re: Assume CP1252
On 01/12/2015 01:27 PM, Karl Williamson wrote:

> On 01/12/2015 12:49 PM, David E. Wheeler wrote:
>> On Jan 12, 2015, at 11:46 AM, Karl Williamson pub...@khwilliamson.com wrote:
>>> I ran across this link, but didn't see what action was taken on it: http://www.w3.org/TR/newline
>>
>> Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be a mistake on EBCDIC?
>>
>> David
>
> Yes, that's essentially what I meant when I said in an earlier email that NEL is THE new-line character on os390, which generally runs using EBCDIC. The code point for NEL in cp1252 is a horizontal ellipsis, and not a next line, but on some platforms, like os390, it means next line. This is a conflict. However, now that I think about it, when I look at os390 runs, I rarely see NELs. Maybe there is a filter that translates them to \n before the pod sees it, but sometimes, I do see NEL all over the place but no \n. I'll ask on the perl-mvs list about this.

tl;dr: I was wrong to think there was a problem in s/latin1/cp1252/ for EBCDIC.

In researching the issue in order to create an intelligent posting, I found the answer. It is an undocumented subtlety of Perl's EBCDIC implementation that I was surprised I didn't know, as I've been pretty deep into that implementation. And it's interesting (at least to me), so I'll document it here (as well as make corrections to perlebcdic.pod).

As many of you know, ASCII has both CR and LF characters that are used variously as line termination characters. Old Apple used CR, and Windows uses the combination CR-LF. Perl handled the Apple issue by swapping the meanings of \r and \n there; it handles CR-LF by having an I/O layer that makes CR-LF appear as a single \n internally, so the gotchas are hidden from most applications. In addition, Unicode defines the NEL (next line) character, which is another alternative line terminator. Its code point is the one that CP1252 uses instead to mean a horizontal ellipsis.

It turns out that NEL is the character that os390 uses as its line terminator, not CR nor LF. It is called NL in EBCDIC. (NL is unfortunately a synonym for LF in ASCII and Unicode terminology.) What Perl does to handle this is to simply swap the NEL and LF code points. That makes \n mean NEL instead of LF. Apparently LF is unused in EBCDIC applications, so it works. There is official support for this swap, as Unicode's definition of how to get UTF-8 to work on EBCDIC platforms says to do the swap.

It does mean that NL doesn't mean the character that a native EBCDIC speaker would think. But the bottom line is that because of this character swapping, the NEL characters in EBCDIC appear as \n, so aren't a problem for CP1252.
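
For readers who want to see the conflict and the swap side by side, here is a minimal Python sketch. It is not Perl's actual EBCDIC code; the helper name swap_nel_lf is made up for illustration and only mimics the exchange of the NEL and LF code points described above.

```python
# Byte 0x85 means different things depending on the declared encoding.
NEL = "\u0085"       # NEXT LINE, a C1 control character in Unicode/Latin-1
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS

as_latin1 = b"\x85".decode("latin-1")  # the NEL control character
as_cp1252 = b"\x85".decode("cp1252")   # the ellipsis character

# Hypothetical helper: exchange the NEL and LF code points, so that text
# whose native line terminator is NEL shows up with "\n" internally.
_SWAP = {0x85: 0x0A, 0x0A: 0x85}


def swap_nel_lf(text: str) -> str:
    """Return text with every NEL replaced by LF and every LF by NEL."""
    return text.translate(_SWAP)


native = "line one\u0085line two\u0085"
print(repr(swap_nel_lf(native)))
```

Once the swap has happened, the pod parser only ever sees \n as the line terminator, which is why the 0x85 slot is free to be read as CP1252's ellipsis.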
Re: Assume CP1252
On Jan 13, 2015, at 10:31 AM, Karl Williamson pub...@khwilliamson.com wrote:

> What Perl does to handle this is to simply swap the NEL and LF code points. That makes \n mean NEL instead of LF. Apparently LF is unused in EBCDIC applications, so it works. There is official support for this swap, as Unicode's definition of how to get UTF-8 to work on EBCDIC platforms says to do the swap.

Huh. Good to know (and have it documented now!).

> It does mean that NL doesn't mean the character that a native EBCDIC speaker would think. But the bottom line is that because of this character swapping, the NEL characters in EBCDIC appear as \n, so aren't a problem for CP1252.

Nice. So should we then adopt the same pattern as the HTML 5 spec? And I wonder if that W3 spec issue you pointed to the other day could use a comment to this effect.

Best,

David
Re: Assume CP1252
On 01/12/2015 06:25 AM, Shawn H Corey wrote:

> On Sun, 11 Jan 2015 20:57:26 -0700 Karl Williamson pub...@khwilliamson.com wrote:
>> To be clear, I think that assuming 1252 when there is no =encoding line is a good idea. But I'm leery of overriding an actual =encoding line.
>
> Agreed.

I could possibly be persuaded, if someone wants to make it, by the argument that 'latin1' is kind of colloquial, and someone using it may very well not be familiar with the possibility that they really mean cp1252. But, if so, there needs to be a way for someone to say "I really mean it" and not be overridden by us. Perhaps that could be =encoding ISO-8859-1.

> Q: What if there is more than one =encoding line? Does it switch encoding part way thru a POD?

Error while formatting with Pod::Perldoc::ToMan: Nested processed encoding. at /usr/share/perl/5.18/Pod/Simple/BlackBox.pm line 380.
Re: Assume CP1252
On Jan 12, 2015, at 11:18 AM, Karl Williamson pub...@khwilliamson.com wrote:

>>> To be clear, I think that assuming 1252 when there is no =encoding line is a good idea. But I'm leery of overriding an actual =encoding line.
>> Agreed.

I’m okay with this.

> I could possibly be persuaded, if someone wants to make it, by the argument that 'latin1' is kind of colloquial, and someone using it may very well not be familiar with the possibility that they really mean cp1252. But, if so, there needs to be a way for someone to say "I really mean it" and not be overridden by us. Perhaps that could be =encoding ISO-8859-1.

If we *were* to assume CP1252 for Latin-1, I would want it to be consistent with the precedent set by the W3C. Sean supplied this link:

http://www.w3.org/TR/encoding/#names-and-labels

Here’s the list of labels that they translate to Windows-1252:

  ansi_x3.4-1968
  ascii
  cp1252
  cp819
  csisolatin1
  ibm819
  iso-8859-1
  iso-ir-100
  iso8859-1
  iso88591
  iso_8859-1
  iso_8859-1:1987
  l1
  latin1
  us-ascii
  windows-1252
  x-cp1252

In their interpretation, no label ever resolves to iso-8859-1. Pretty interesting.

>> Q: What if there is more than one =encoding line? Does it switch encoding part way thru a POD?
> Error while formatting with Pod::Perldoc::ToMan: Nested processed encoding. at /usr/share/perl/5.18/Pod/Simple/BlackBox.pm line 380.

I recently changed this error, because that was a pretty useless message. The new message is "Cannot have multiple =encoding directives". Also, it is no longer fatal, but is passed to scream(), which means it would be a failure for Test::Pod, but won’t break tools that generate docs.

http://github.com/theory/pod-simple/commit/cb884b5

Best,

David
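
As an illustration of what following the W3C precedent would amount to, here is a small Python sketch of the label lookup. The label set is copied from the list in this message; the function name resolve_label and the choice to return None for any label outside that list are assumptions of the sketch, not anything Pod::Simple or Encode does.

```python
from typing import Optional

# Labels that the W3C Encoding spec resolves to windows-1252,
# as listed in the message above.
WINDOWS_1252_LABELS = frozenset({
    "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1",
    "ibm819", "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591",
    "iso_8859-1", "iso_8859-1:1987", "l1", "latin1", "us-ascii",
    "windows-1252", "x-cp1252",
})


def resolve_label(label: str) -> Optional[str]:
    """Hypothetical helper: map an =encoding label to 'windows-1252'.

    Labels are compared after trimming whitespace and lowercasing.
    Returns None for labels this sketch does not know about.
    """
    normalized = label.strip().lower()
    if normalized in WINDOWS_1252_LABELS:
        return "windows-1252"
    return None


print(resolve_label("Latin1"))
print(resolve_label("ISO-8859-1"))
```

Note that under this table even an explicit ISO-8859-1 resolves to windows-1252, which is exactly the behavior being debated in this thread.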
Re: Assume CP1252
On 01/12/2015 12:37 PM, David E. Wheeler wrote:

> On Jan 12, 2015, at 11:18 AM, Karl Williamson pub...@khwilliamson.com wrote:
>>>> To be clear, I think that assuming 1252 when there is no =encoding line is a good idea. But I'm leery of overriding an actual =encoding line.
>>> Agreed.
>
> I’m okay with this.
>
>> I could possibly be persuaded, if someone wants to make it, by the argument that 'latin1' is kind of colloquial, and someone using it may very well not be familiar with the possibility that they really mean cp1252. But, if so, there needs to be a way for someone to say "I really mean it" and not be overridden by us. Perhaps that could be =encoding ISO-8859-1.
>
> If we *were* to assume CP1252 for Latin-1, I would want it to be consistent with the precedent set by the W3C.

That sounds reasonable.

> Sean supplied this link:
>
> http://www.w3.org/TR/encoding/#names-and-labels
>
> Here’s the list of labels that they translate to Windows-1252:
>
>   ansi_x3.4-1968
>   ascii
>   cp1252
>   cp819
>   csisolatin1
>   ibm819
>   iso-8859-1
>   iso-ir-100
>   iso8859-1
>   iso88591
>   iso_8859-1
>   iso_8859-1:1987
>   l1
>   latin1
>   us-ascii
>   windows-1252
>   x-cp1252
>
> In their interpretation, no label ever resolves to iso-8859-1. Pretty interesting.

I ran across this link, but didn't see what action was taken on it: http://www.w3.org/TR/newline

>>> Q: What if there is more than one =encoding line? Does it switch encoding part way thru a POD?
>> Error while formatting with Pod::Perldoc::ToMan: Nested processed encoding. at /usr/share/perl/5.18/Pod/Simple/BlackBox.pm line 380.
>
> I recently changed this error, because that was a pretty useless message. The new message is "Cannot have multiple =encoding directives". Also, it is no longer fatal, but is passed to scream(), which means it would be a failure for Test::Pod, but won’t break tools that generate docs.
>
> http://github.com/theory/pod-simple/commit/cb884b5
>
> Best,
>
> David
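
The non-fatal handling of a repeated =encoding directive quoted above can be pictured with a rough Python sketch. This is not Pod::Simple's code: the function name scan_encoding_directives is hypothetical, and the "first directive wins" policy is an assumption made for the sketch. The thread only states that the message changed and that the condition is reported rather than fatal.

```python
import re

# Matches a pod "=encoding NAME" directive at the start of a line.
_ENCODING_RE = re.compile(r"^=encoding\s+(\S+)")


def scan_encoding_directives(pod_text):
    """Return (encoding, complaints) for a pod document.

    The first =encoding directive is kept (an assumption of this sketch);
    each later one is recorded as a complaint instead of raising.
    """
    encoding = None
    complaints = []
    for lineno, line in enumerate(pod_text.splitlines(), start=1):
        match = _ENCODING_RE.match(line)
        if match is None:
            continue
        if encoding is None:
            encoding = match.group(1)
        else:
            complaints.append(
                f"line {lineno}: Cannot have multiple =encoding directives"
            )
    return encoding, complaints


sample = "=pod\n\n=encoding latin1\n\nSome text.\n\n=encoding utf8\n"
print(scan_encoding_directives(sample))
```

Collecting complaints and carrying on is what lets a doc generator keep working while a checker in the style of Test::Pod can still treat any complaint as a failure.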
Re: Assume CP1252
On 01/12/2015 12:49 PM, David E. Wheeler wrote:

> On Jan 12, 2015, at 11:46 AM, Karl Williamson pub...@khwilliamson.com wrote:
>> I ran across this link, but didn't see what action was taken on it: http://www.w3.org/TR/newline
>
> Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be a mistake on EBCDIC?
>
> David

Yes, that's essentially what I meant when I said in an earlier email that NEL is THE new-line character on os390, which generally runs using EBCDIC. The code point for NEL in cp1252 is a horizontal ellipsis, and not a next line, but on some platforms, like os390, it means next line. This is a conflict.

However, now that I think about it, when I look at os390 runs, I rarely see NELs. Maybe there is a filter that translates them to \n before the pod sees it, but sometimes, I do see NEL all over the place but no \n. I'll ask on the perl-mvs list about this.
Re: Assume CP1252
On 01/10/2015 11:35 PM, David E. Wheeler wrote:

> On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:
>> Helleu, Pod pals! Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.
>
> Thanks for chiming in, Sean.
>
>> I agree completely, go for it! Yes:
>> * assume that input is CP1252 in the absence of any encoding being declared
>> * assume that input is CP1252 if the declared encoding is Latin-1
>> As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.
>
> Right, I vaguely remember you telling me this before. I forgot about #2 (and the HTML 5 precedent).

I think I oppose overruling someone's =encoding line. The reason that 1252 is effectively a superset of latin1 is that it reuses the C1 controls to mean something else, and we don't expect those controls to actually appear in a pod document. That is quite likely, except for one, NEL, U+0085, which is the usual line separator on some platforms, notably os390 (that code point is the horizontal ellipsis in 1252).

It strikes me as wrong anyway to say we know better than the coder. There needs to be a way for a coder to specify the coding and not have that specification ignored by us. We do not have the foresight to know the possible circumstances where Latin1 is the correct value and 1252 is not. We could be wrong, and we should provide an easy workaround for our wrongness. The most straightforward approach, and the one that will lead to the least resentment against us when we are wrong, is to simply not second-guess what the coder has said.

os390 is proof that there is at least one platform that Perl runs on where 1252 is not a superset of Latin1. There could be special casing for that platform. But if we're wrong there, we could be wrong elsewhere. It just seems a bad idea to think we know better than the coder.
Re: Assume CP1252
On 01/11/2015 11:01 AM, Karl Williamson wrote:

> On 01/10/2015 11:35 PM, David E. Wheeler wrote:
>> On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:
>>> Helleu, Pod pals! Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.
>>
>> Thanks for chiming in, Sean.
>>
>>> I agree completely, go for it! Yes:
>>> * assume that input is CP1252 in the absence of any encoding being declared
>>> * assume that input is CP1252 if the declared encoding is Latin-1
>>> As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.
>>
>> Right, I vaguely remember you telling me this before. I forgot about #2 (and the HTML 5 precedent).
>
> I think I oppose overruling someone's =encoding line. The reason that 1252 is effectively a superset of latin1 is that it reuses the C1 controls to mean something else, and we don't expect those controls to actually appear in a pod document. That is quite likely, except for one, NEL, U+0085, which is the usual line separator on some platforms, notably os390 (that code point is the horizontal ellipsis in 1252).
>
> It strikes me as wrong anyway to say we know better than the coder. There needs to be a way for a coder to specify the coding and not have that specification ignored by us. We do not have the foresight to know the possible circumstances where Latin1 is the correct value and 1252 is not. We could be wrong, and we should provide an easy workaround for our wrongness. The most straightforward approach, and the one that will lead to the least resentment against us when we are wrong, is to simply not second-guess what the coder has said.
>
> os390 is proof that there is at least one platform that Perl runs on where 1252 is not a superset of Latin1. There could be special casing for that platform. But if we're wrong there, we could be wrong elsewhere. It just seems a bad idea to think we know better than the coder.

To be clear, I think that assuming 1252 when there is no =encoding line is a good idea. But I'm leery of overriding an actual =encoding line.
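
The "effectively a superset" argument can be checked mechanically. The following Python sketch is only an illustration (the helper name uses_c1_range is made up): it shows that the two encodings agree on every byte outside 0x80-0x9F, so a document is affected by the choice only if it contains a byte in that range.

```python
def uses_c1_range(data: bytes) -> bool:
    """Hypothetical helper: True if any byte falls in 0x80-0x9F, the only
    range where Latin-1 (C1 controls) and CP1252 (printable characters)
    disagree."""
    return any(0x80 <= byte <= 0x9F for byte in data)


# Outside the C1 range, both encodings decode every byte identically.
low = bytes(range(0x00, 0x80))
high = bytes(range(0xA0, 0x100))
assert low.decode("latin-1") == low.decode("cp1252")
assert high.decode("latin-1") == high.decode("cp1252")

accented = b"caf\xe9"          # 0xE9 is e-acute in both encodings
with_0x85 = b"and so on\x85"   # 0x85 is NEL in Latin-1, an ellipsis in CP1252

print(uses_c1_range(accented), uses_c1_range(with_0x85))
```

A check like this is also one way a tool could honor an explicit =encoding line: it could at least notice when the declared Latin-1 and the assumed CP1252 would actually produce different text.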
Re: Assume CP1252
Helleu, Pod pals!

Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.

~~

Long version:

I don't normally pipe up about (or keep up with anything about) Pod stuff, because it's yall's language now-- but since an issue of my original intent has come up, and it shunted into my normal inbox, I'll jump in:

On 01/05/2015 10:58 PM, David E. Wheeler wrote:

> [...] Pod Peeps: if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise as Latin-1. [...] I suggest we switch from Latin-1 to CP1252. [...]

I agree completely, go for it! Yes:

* assume that input is CP1252 in the absence of any encoding being declared
* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.

Moreover, this construal of Latin-1 as CP1252 has significant precedent:

«Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the draft HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.»

And it obeys Postel's law: "Be conservative in what you do; be liberal in what you accept from others."

And... http://www.w3.org/TR/encoding/#names-and-labels even seems to tolerate more things, to a point, if I'm reading it right. Dunno. On this point, it's up to you folks.

BTW: I think many people would appreciate having =encoding ansi tolerated as a synonym for =encoding win-1252... because some systems simply call it that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four digits of my Antarctican passport, etc.

Incidentally, you presumably might want to expand the %Latin1Code_to_fallback table in Pod::Escapes. (...which reminds me to push out some more versions of Unidecode, notably one that covers the symbol for the now very eventful ruble.)

Now, there's two issues that may or may not be already seen as separate:

* assuming that input is CP1252 in the absence of any encoding being declared
* assuming that input is CP1252 if the declared encoding is Latin-1

I suggest doing both (like HTML5)-- but at least the first definitely!

If anyone wants extreme SM, maybe throw a note in WARNINGS about "I expected this to be in Latin-1 but it looks like maybe you should probably have a '=encoding win1252' line". But that seems a case of pointless and even onerous obtuseness, instead of unproblematic DWIM. I think.

> I’ve discussed this with Sean Burke in the last couple years, and IIRC he said he probably should have assumed CP1252 instead of Latin-1 when he wrote it.

True enough!

> But not if there are flaws with the plan. Thoughts? Should we make this change? Seems like a win overall to me, but I miss details all the time. Let me know your thoughts.

As to possible flaws, I see two that are on the very edge of remote possibility. But, for sake of completeness, I'll note:

* I think using characters 0x80-0x9F might just conceivably screw up some crazy text editors' "what encoding is this?" guesswork-- with what consequences I don't know. But, ya know, as Paul F. Tompkins says: "We are living in a year with a TWO IN FRONT OF IT!", so any editor that silently guesses that way, and somehow silently makes bad things happen, should have already been pushed out an airlock at least a decade ago.

* And, speaking of heuristics: I think the recognition heuristics in Unix's file(1) might... remotely, conceivably... change file(1)'s opinion of what a pure-Pod input file is, from yes to no, if it construes a file that has 0x80-0x9F but also has =encoding latin-1 as a paradox that means something not-Pod. Hypothetically. But that is far beyond any sense that file(1) can be expected to *reliably* have (or maybe can even express in its recognition rules).

Already file(1) is just catastrophically dumb at anything other than answering things like "is this extensionless file a GIF?", because beyond that, it already guesses wrong more often than right. I've just now run it on Pod/Simple.pod and it said "C source, ASCII text". Boioiooing. And I've just now run it on a s2763_sjis.pod I had lying around, which has two kanji in the first 64 bytes-- and with a =encoding shiftjis being the second line in the file!, and file(1) said: "Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line terminators".

So... Don't overthink why file(1) does what it does-- *it* certainly doesn't overthink it.

I hope this message has helped. REESE'S PIECES OUT.
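
For anyone unfamiliar with the guessing rule under discussion, here is one possible reading of it as a Python sketch, with CP1252 substituted for Latin-1 as the fallback. The quoted description is elided, so treating "the first highbit byte sequence" as the first run of consecutive bytes >= 0x80 is an assumption of this sketch, and guess_encoding is a made-up name, not a Pod::Simple function.

```python
def guess_encoding(data: bytes) -> str:
    """Guess an encoding for pod source that has no =encoding line.

    Looks at the first run of high-bit bytes: if it is valid UTF-8,
    guess UTF-8; otherwise fall back to CP1252.
    """
    # Find the first byte with the high bit set.
    start = next((i for i, byte in enumerate(data) if byte >= 0x80), None)
    if start is None:
        return "ascii"  # nothing to guess about

    # Extend to the end of that run of high-bit bytes.
    end = start
    while end < len(data) and data[end] >= 0x80:
        end += 1

    try:
        data[start:end].decode("utf-8")
    except UnicodeDecodeError:
        return "cp1252"
    return "utf-8"


print(guess_encoding("café au lait".encode("utf-8")))
print(guess_encoding("café au lait".encode("cp1252")))
```

Like any heuristic of this kind, it can be fooled: CP1252 text whose first high-bit run happens to form a valid UTF-8 sequence would be guessed as UTF-8.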
Re: Assume CP1252
On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:

> Helleu, Pod pals! Short version about "Re: Assume CP1252"-- I advise: yes, assume CP1252 where technically you were expecting Latin-1.

Thanks for chiming in, Sean.

> I agree completely, go for it! Yes:
> * assume that input is CP1252 in the absence of any encoding being declared
> * assume that input is CP1252 if the declared encoding is Latin-1
> As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to actually mean the superset CP1252) means in practice that everybody wins, and nobody loses, and DWIM prevails yet again.

Right, I vaguely remember you telling me this before. I forgot about #2 (and the HTML 5 precedent).

> BTW: I think many people would appreciate having =encoding ansi tolerated as a synonym for =encoding win-1252... because some systems simply call it that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four digits of my Antarctican passport, etc.

ansi == cp1252?? I think Encode determines aliases.

> Incidentally, you presumably might want to expand the %Latin1Code_to_fallback table in Pod::Escapes.

Paging Neil Bowers.

> Now, there's two issues that may or may not be already seen as separate:
> * assuming that input is CP1252 in the absence of any encoding being declared
> * assuming that input is CP1252 if the declared encoding is Latin-1
> I suggest doing both (like HTML5)-- but at least the first definitely!

+1

> If anyone wants extreme SM, maybe throw a note in WARNINGS about "I expected this to be in Latin-1 but it looks like maybe you should probably have a '=encoding win1252' line". But that seems a case of pointless and even onerous obtuseness, instead of unproblematic DWIM. I think.

Meh. I'm thinking, however, of adding a note to the ChangeLog for the next release that this change will be in the following release. I’ve already added a note that support for Perls 5.5 will be dropped.

> As to possible flaws, I see two that are on the very edge of remote possibility. But, for sake of completeness, I'll note:

Pretty obscure!

> I hope this message has helped. REESE'S PIECES OUT.

Thanks again!

Best,

David
Re: Assume CP1252
* Grant McLean gr...@mclean.net.nz [2015-01-07T18:47:49]

> I also agree this is a good idea. None of the Latin-1 control characters that CP1252 replaces with printable characters should be appearing in POD anyway.

Seems safe, I think. At first, I thought, "They're disjunct!!" but then I realized that this is only true on codepoints that nobody is going to use in their Latin-1 POD.

-- 
rjbs