Re: Assume CP1252

2015-02-13 Thread David E. Wheeler
On Jan 13, 2015, at 6:07 PM, David E. Wheeler da...@justatheory.com wrote:

 Pod::Simple 3.29 is on its way to CPAN now. I’m going to apply the change 
 proposed in the "Pod::Simple can treat binary as pod due to 
 liberal/inconsistent regexp patterns" thread now, and once you have the 
 EBCDIC and CP1252 stuff done, we can do a test release to let it smoke with 
 those changes.

And now 3.29_1 is on CPAN with EBCDIC support and CP1252 as the default thanks 
to Karl Williamson. Details here:

  http://theory.pm/2015/02/11/please-test-pod-simple-3-dot-29-3/

Please test!

Best,

David





Re: Assume CP1252

2015-01-13 Thread Karl Williamson

On 01/12/2015 01:27 PM, Karl Williamson wrote:

On 01/12/2015 12:49 PM, David E. Wheeler wrote:

On Jan 12, 2015, at 11:46 AM, Karl Williamson
pub...@khwilliamson.com wrote:


I ran across this link, but didn't see what action was taken on it:
http://www.w3.org/TR/newline


Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be
a mistake on EBCDIC?

David



Yes, that's essentially what I meant when I said in an earlier email
that NEL is THE new-line character on os390, which generally runs using
EBCDIC.  The code point for NEL in cp1252 is a horizontal ellipsis, and
not a next line, but on some platforms, like os390, it means next
line.   This is a conflict.

However, now that I think about it, when I look at os390 runs, I rarely
see NELs.  Maybe there is a filter that translates them to \n before the
pod sees it, but sometimes, I do see NEL all over the place but no \n.
I'll ask on the perl-mvs list about this.



tl;dr:  I was wrong to think there was a problem in s/latin1/cp1252/ for 
EBCDIC.


In researching the issue in order to create an intelligent posting, I 
found the answer.


It is an undocumented subtlety with Perl's EBCDIC implementation, that I 
was surprised I didn't know, as I've been pretty deep into that 
implementation.


And it's interesting (at least to me), so I'll document it here (as well 
as make corrections to perlebcdic.pod).


As many of you know, ASCII has both CR and LF characters that are used 
variously as line termination characters.  Old Apple used CR, and 
Windows uses the combination CR-LF.  Perl handled the Apple issue by 
swapping the meanings of \r and \n there; it handles CR-LF by having an 
I/O layer that makes CR-LF appear as a single \n internally so the 
gotchas are hidden from most applications.


In addition, Unicode defines the NEL (next line) character, which is 
another alternative line terminator.  Its code point, U+0085, is the one 
that CP1252 uses instead to mean a horizontal ellipsis.


It turns out that NEL is the character that os390 uses as its line 
terminator, not CR nor LF.  It is called NL in EBCDIC.  (NL is 
unfortunately a synonym for LF in ASCII and Unicode terminology.)


What Perl does to handle this is to simply swap the NEL and LF code 
points.  That makes \n mean NEL instead of LF.  Apparently LF is unused 
in EBCDIC applications, so it works.  There is official support for this 
swap, as Unicode's definition of how to get UTF-8 to work on EBCDIC 
platforms says to do the swap.


It does mean that NL doesn't mean the character that a native EBCDIC 
speaker would think.


But the bottom line is that because of this character swapping, the NEL 
characters in EBCDIC appear as \n, so aren't a problem for CP1252.
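
The swap described above can be modelled in a few lines. This is only an illustration written in Python, not Perl's actual EBCDIC implementation; the names LF, NEL and swap_nel_lf are made up for the sketch. It shows why text whose native line terminator is NEL ends up looking like ordinary \n-terminated text once the two code points are exchanged.

```python
LF = "\n"      # U+000A, line feed
NEL = "\x85"   # U+0085, next line (called NL in EBCDIC)

# Exchange the two code points, modelling the swap described in the message.
_SWAP = str.maketrans({LF: NEL, NEL: LF})


def swap_nel_lf(text):
    """Return text with every NEL replaced by LF and vice versa."""
    return text.translate(_SWAP)


# Hypothetical pod text whose lines end in NEL, as on a NEL-terminated platform.
native = "=head1 NAME" + NEL + NEL + "Some::Module" + NEL
seen_by_parser = swap_nel_lf(native)

# After the swap the parser only ever sees "\n" as the line terminator.
print(seen_by_parser.split(LF))
```

Because the swap is its own inverse, applying it a second time restores the original text.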


Re: Assume CP1252

2015-01-13 Thread David E. Wheeler
On Jan 13, 2015, at 10:31 AM, Karl Williamson pub...@khwilliamson.com wrote:

 What Perl does to handle this is to simply swap the NEL and LF code points.  
 That makes \n mean NEL instead of LF.  Apparently LF is unused in EBCDIC 
 applications, so it works.  There is official support for this swap, as 
 Unicode's definition of how to get UTF-8 to work on EBCDIC platforms says to 
 do the swap.

Huh. Good to know (and have it documented now!).

 It does mean that NL doesn't mean the character that a native EBCDIC speaker 
 would think.
 
 But the bottom line is that because of this character swapping, the NEL 
 characters in EBCDIC appear as \n, so aren't a problem for CP1252.

Nice. So should we then adopt the same pattern as the HTML 5 spec?

And I wonder if that W3 spec issue you pointed to the other day could use a 
comment to this effect.

Best,

David





Re: Assume CP1252

2015-01-12 Thread Karl Williamson

On 01/12/2015 06:25 AM, Shawn H Corey wrote:

On Sun, 11 Jan 2015 20:57:26 -0700
Karl Williamson pub...@khwilliamson.com wrote:


To be clear, I think that assuming 1252 when there is no =encoding
line is a good idea.  But I'm leery of overriding an actual =encoding
line.


Agreed.


I could possibly be persuaded, if someone wants to make it, by the 
argument that 'latin1' is kind of colloquial, and someone using it may 
very well not be familiar with the possibility that they really mean 
cp1252.  But, if so, there needs to be a way for someone to say "I 
really mean it" and not be overridden by us.  Perhaps that could be 
=encoding ISO-8859-1.



Q: What if there is more than one =encoding line? Does it switch
encoding part way thru a POD?




Error while formatting with Pod::Perldoc::ToMan:
 Nested processed encoding. at 
/usr/share/perl/5.18/Pod/Simple/BlackBox.pm line 380.




Re: Assume CP1252

2015-01-12 Thread David E. Wheeler
On Jan 12, 2015, at 11:18 AM, Karl Williamson pub...@khwilliamson.com wrote:

 To be clear, I think that assuming 1252 when there is no =encoding
 line is a good idea.  But I'm leery of overriding an actual =encoding
 line.
 
 Agreed.

I’m okay with this.

 I could possibly be persuaded, if someone wants to make it, by the argument 
 that 'latin1' is kind of colloquial, and someone using it may very well not 
 be familiar with the possibility that they really mean cp1252.  But, if so, 
 there needs to be a way for someone to say "I really mean it" and not be 
 overridden by us.  Perhaps that could be =encoding ISO-8859-1.

If we *were* to assume CP1252 for Latin-1, I would want it to be consistent 
with the precedent set by the W3C. Sean supplied this link:

  http://www.w3.org/TR/encoding/#names-and-labels

Here’s the list of labels that they translate to Windows-1252:


ansi_x3.4-1968
ascii
cp1252
cp819
csisolatin1
ibm819
iso-8859-1
iso-ir-100
iso8859-1
iso88591
iso_8859-1
iso_8859-1:1987
l1
latin1
us-ascii
windows-1252
x-cp1252

In their interpretation, no label ever resolves to iso-8859-1. Pretty 
interesting.
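
A minimal sketch of what following that precedent could look like, written in Python for illustration only. The label set is copied from the list above; the function name resolve_label and the choice to return None for unknown labels are assumptions of this sketch, not anything Pod::Simple or Encode provides.

```python
# Labels that the W3C Encoding spec maps to windows-1252 (copied from the list above).
WINDOWS_1252_LABELS = frozenset({
    "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819",
    "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1",
    "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252",
})


def resolve_label(label):
    """Return "windows-1252" if the label is one of its aliases, else None.

    Matching trims surrounding ASCII whitespace and ignores case.
    """
    key = label.strip(" \t\n\f\r").lower()
    return "windows-1252" if key in WINDOWS_1252_LABELS else None


print(resolve_label("Latin1"))
print(resolve_label(" ISO-8859-1 "))
print(resolve_label("ansi"))
```

Note that a bare "ansi" is not in the list, so it would need its own alias if it were to be tolerated.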

 Q: What if there is more than one =encoding line? Does it switch
 encoding part way thru a POD?
 
 
 
 Error while formatting with Pod::Perldoc::ToMan:
 Nested processed encoding. at /usr/share/perl/5.18/Pod/Simple/BlackBox.pm 
 line 380.

I recently changed this error, because that was a pretty useless message. The 
new message is "Cannot have multiple =encoding directives." Also, it is no 
longer fatal, but is passed to scream(), which means it would be a failure for 
Test::Pod, but won’t break tools that generate docs.

  http://github.com/theory/pod-simple/commit/cb884b5
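
To make the behaviour concrete, here is a rough Python model of a non-fatal check for repeated =encoding lines. It is not the code from the commit above: find_encoding is a hypothetical helper, and keeping the first directive while only warning about later ones is an assumption of this sketch.

```python
def find_encoding(pod_lines):
    """Return (encoding, warnings) for an iterable of pod source lines.

    The first =encoding directive wins (an assumption of this sketch);
    each later one adds a warning instead of raising an exception.
    """
    encoding = None
    warnings = []
    for lineno, line in enumerate(pod_lines, start=1):
        parts = line.split()
        if len(parts) < 2 or parts[0] != "=encoding":
            continue
        if encoding is None:
            encoding = parts[1]
        else:
            warnings.append(
                f"line {lineno}: Cannot have multiple =encoding directives"
            )
    return encoding, warnings


pod = ["=encoding latin1", "", "=head1 NAME", "", "=encoding utf8", ""]
encoding, warnings = find_encoding(pod)
print(encoding, warnings)
```

A doc generator can ignore the warnings list and carry on, while a test harness can treat a non-empty list as a failure.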

Best,

David





Re: Assume CP1252

2015-01-12 Thread Karl Williamson

On 01/12/2015 12:37 PM, David E. Wheeler wrote:

On Jan 12, 2015, at 11:18 AM, Karl Williamson pub...@khwilliamson.com wrote:


To be clear, I think that assuming 1252 when there is no =encoding
line is a good idea.  But I'm leery of overriding an actual =encoding
line.


Agreed.


I’m okay with this.


I could possibly be persuaded, if someone wants to make it, by the argument that 'latin1' 
is kind of colloquial, and someone using it may very well not be familiar with the 
possibility that they really mean cp1252.  But, if so, there needs to be a way for 
someone to say "I really mean it" and not be overridden by us.  Perhaps
that could be =encoding ISO-8859-1.


If we *were* to assume CP1252 for Latin-1, I would want it to be consistent 
with the precedent set by the W3C.


That sounds reasonable.


 Sean supplied this link:


   http://www.w3.org/TR/encoding/#names-and-labels

Here’s the list of labels that they translate to Windows-1252:


ansi_x3.4-1968
ascii
cp1252
cp819
csisolatin1
ibm819
iso-8859-1
iso-ir-100
iso8859-1
iso88591
iso_8859-1
iso_8859-1:1987
l1
latin1
us-ascii
windows-1252
x-cp1252

In their interpretation, no label ever resolves to iso-8859-1. Pretty 
interesting.


I ran across this link, but didn't see what action was taken on it:
http://www.w3.org/TR/newline






Q: What if there is more than one =encoding line? Does it switch
encoding part way thru a POD?




Error while formatting with Pod::Perldoc::ToMan:
Nested processed encoding. at /usr/share/perl/5.18/Pod/Simple/BlackBox.pm line 
380.


I recently changed this error, because that was a pretty useless message. The new message 
is "Cannot have multiple =encoding directives." Also, it is no longer fatal, 
but is passed to scream(), which means it would be a failure for Test::Pod, but won’t 
break tools that generate docs.

   http://github.com/theory/pod-simple/commit/cb884b5

Best,

David





Re: Assume CP1252

2015-01-12 Thread Karl Williamson

On 01/12/2015 12:49 PM, David E. Wheeler wrote:

On Jan 12, 2015, at 11:46 AM, Karl Williamson pub...@khwilliamson.com wrote:


I ran across this link, but didn't see what action was taken on it:
http://www.w3.org/TR/newline


Pardon my ignorance. Does that mean that `s/Latin-1/CP1252/g` could be a 
mistake on EBCDIC?

David



Yes, that's essentially what I meant when I said in an earlier email 
that NEL is THE new-line character on os390, which generally runs using 
EBCDIC.  The code point for NEL in cp1252 is a horizontal ellipsis, and 
not a next line, but on some platforms, like os390, it means next 
line.   This is a conflict.


However, now that I think about it, when I look at os390 runs, I rarely 
see NELs.  Maybe there is a filter that translates them to \n before the 
pod sees it, but sometimes, I do see NEL all over the place but no \n. 
I'll ask on the perl-mvs list about this.


Re: Assume CP1252

2015-01-11 Thread Karl Williamson

On 01/10/2015 11:35 PM, David E. Wheeler wrote:

On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:


Helleu, Pod pals!
Short version about Re: Assume CP1252-- I advise: yes, assume CP1252 where 
technically you were expecting Latin-1.


Thanks for chiming in, Sean.


I agree completely, go for it!

Yes:
* assume that input is CP1252 in the absence of any encoding being declared
* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to 
actually mean the superset CP1252) means in practice that everybody wins, and 
nobody loses, and DWIM prevails yet again.


Right, I vaguely remember you telling me this before. I forgot about #2 (and 
the HTML 5 precedent).


I think I oppose overruling someone's =encoding line.  The reason that 
1252 is effectively a superset of latin1 is because it reuses the C1 
controls to mean something else, and we don't expect those controls to 
actually appear in a pod document.  That expectation is quite likely 
to hold, except for one, NEL, U+0085, which is the usual line separator 
on some platforms, notably os390 (that code point is the horizontal 
ellipsis in 1252).


It strikes me as wrong anyway to say we know better than the coder. 
There needs to be a way for a coder to specify the coding and not have 
that specification ignored by us.  We do not have the foresight to know 
the possible circumstances where Latin1 is the correct value and 1252 is 
not.  We could be wrong, and we should provide an easy workaround for 
our wrongness.  The most straightforward approach, which will lead to 
the least resentment against us when we are wrong, is to simply not 
second-guess what the coder has said.


os390 is proof that there is at least one platform that Perl runs on 
where 1252 is not a superset of Latin1.  There could be special casing 
for that platform.  But if we're wrong there, we could be wrong 
elsewhere.  It just seems a bad idea to think we know better than the 
coder.
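
The "superset except for the C1 controls" point can be checked mechanically. The sketch below uses Python's built-in latin-1 and cp1252 codecs purely as an illustration; differing_bytes is a made-up helper name. Python's cp1252 leaves five byte values undefined, and the sketch counts those as differences.

```python
def differing_bytes(a="latin-1", b="cp1252"):
    """Return every byte value that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        try:
            same = raw.decode(a) == raw.decode(b)
        except UnicodeDecodeError:
            # Python's cp1252 has no mapping for 0x81, 0x8D, 0x8F, 0x90, 0x9D.
            same = False
        if not same:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print(f"{len(diffs)} differing bytes: 0x{diffs[0]:02X}-0x{diffs[-1]:02X}")

# The byte at issue in this thread: NEL in Latin-1, an ellipsis in CP1252.
print(repr(b"\x85".decode("latin-1")), repr(b"\x85".decode("cp1252")))
```

Every difference falls in 0x80-0x9F, which is why the substitution is harmless unless a document really contains C1 controls such as NEL.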




Re: Assume CP1252

2015-01-11 Thread Karl Williamson

On 01/11/2015 11:01 AM, Karl Williamson wrote:

On 01/10/2015 11:35 PM, David E. Wheeler wrote:

On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:


Helleu, Pod pals!
Short version about Re: Assume CP1252-- I advise: yes, assume
CP1252 where technically you were expecting Latin-1.


Thanks for chiming in, Sean.


I agree completely, go for it!

Yes:
* assume that input is CP1252 in the absence of any encoding being
declared
* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing
Latin-1 to actually mean the superset CP1252) means in practice that
everybody wins, and nobody loses, and DWIM prevails yet again.


Right, I vaguely remember you telling me this before. I forgot about
#2 (and the HTML 5 precedent).


I think I oppose overruling someone's =encoding line.  The reason that
1252 is effectively a superset of latin1 is because it reuses the C1
controls to mean something else, and we don't expect those controls to
actually appear in a pod document.  That expectation is quite likely
to hold, except for one, NEL, U+0085, which is the usual line separator
on some platforms, notably os390 (that code point is the horizontal
ellipsis in 1252).

It strikes me as wrong anyway to say we know better than the coder.
There needs to be a way for a coder to specify the coding and not have
that specification ignored by us.  We do not have the foresight to know
the possible circumstances where Latin1 is the correct value and 1252 is
not.  We could be wrong, and we should provide an easy workaround for
our wrongness.  The most straightforward approach, which will lead to
the least resentment against us when we are wrong, is to simply not
second-guess what the coder has said.

os390 is proof that there is at least one platform that Perl runs on
where 1252 is not a superset of Latin1.  There could be special casing
for that platform.  But if we're wrong there, we could be wrong
elsewhere.  It just seems a bad idea to think we know better than the
coder.



To be clear, I think that assuming 1252 when there is no =encoding line 
is a good idea.  But I'm leery of overriding an actual =encoding line.




Re: Assume CP1252

2015-01-10 Thread Sean Burke

Helleu, Pod pals!
Short version about Re: Assume CP1252-- I advise: yes, assume CP1252 
where technically you were expecting Latin-1.


 ~~

Long version:

I don't normally pipe up about (or keep up with anything about) Pod 
stuff, because it's yall's language now-- but since an issue of my 
original intent has come up, and it shunted into my normal inbox, I'll 
jump in:


On 01/05/2015 10:58 PM, David E. Wheeler wrote:


[...] Pod Peeps:
 if the first highbit byte sequence in the file seems valid as a UTF-8
 sequence, or otherwise as Latin-1.
[...]  I suggest we switch from Latin-1 to CP1252. [...]


I agree completely, go for it!

Yes:
* assume that input is CP1252 in the absence of any encoding being 
declared

* assume that input is CP1252 if the declared encoding is Latin-1

As far as I know, that amicable bait-and-switch (i.e., construing 
Latin-1 to actually mean the superset CP1252) means in practice that 
everybody wins, and nobody loses, and DWIM prevails yet again.


Moreover, this construal of Latin-1 as CP1252 has significant precedent:

«Most modern web browsers and e-mail clients treat the MIME charset 
ISO-8859-1 as Windows-1252 to accommodate such mislabeling.  This is 
now standard behavior in the draft HTML 5 specification, which 
requires that documents advertised as ISO-8859-1 actually be parsed 
with the Windows-1252 encoding.»


And it obeys Postel's law:
Be conservative in what you do; be liberal in what you accept from 
others.


And...
  http://www.w3.org/TR/encoding/#names-and-labels
even seems to tolerate more things, to a point, if I'm reading it 
right.  Dunno.  On this point, it's up to you folks.



BTW: I think many people would appreciate having "=encoding ansi" 
tolerated as a synonym for "=encoding win-1252"... because some 
systems simply call it that-- and I can never remember 1252 vs 1250 vs 
my own zipcode vs last four digits of my Antarctican passport, etc.



Incidentally, you presumably might want to expand the 
%Latin1Code_to_fallback table in Pod::Escapes.


(...which reminds me to push out some more versions of Unidecode, 
notably one that covers the symbol for the now very eventful ruble.)



Now, there's two issues that may or may not be already seen as separate:
* assuming that input is CP1252 in the absence of any encoding being 
declared

* assuming that input is CP1252 if the declared encoding is Latin-1
I suggest doing both (like HTML5)-- but at least the first definitely!
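
For the first of those two cases (no declared encoding), the guess quoted earlier in this message, "if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise" the single-byte fallback, might look roughly like this. It is a Python sketch under stated assumptions, not Pod::Simple's code: guess_encoding is a hypothetical name, and the fallback is set to cp1252 as proposed in this thread.

```python
def guess_encoding(data, fallback="cp1252"):
    """Guess the encoding of undeclared pod bytes from the first high-bit byte.

    If the first byte >= 0x80 starts a well-formed UTF-8 sequence, answer
    "utf-8"; otherwise answer the single-byte fallback.
    """
    for i, byte in enumerate(data):
        if byte < 0x80:
            continue
        # A multi-byte UTF-8 sequence is 2 to 4 bytes long; try each length.
        for length in (2, 3, 4):
            try:
                data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            return "utf-8"
        return fallback
    # Pure ASCII decodes identically under either choice.
    return fallback


print(guess_encoding("café".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
print(guess_encoding(b"wait\x85 for it"))
```

Only the first high-bit sequence is inspected, so a file that mixes encodings later on is not detected by this sketch.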


If anyone wants extreme SM, maybe throw a note in WARNINGS about "I 
expected this to be in Latin-1 but it looks like maybe you should 
probably have a '=encoding win1252' line."
But that seems a case of pointless and even onerous obtuseness, 
instead of unproblematic DWIM.  I think.




I’ve discussed this with Sean Burke in the last couple years, and IIRC he said 
he probably should have assumed CP1252 instead of Latin-1 when he wrote it.


True enough!


But not if there are flaws with the plan. Thoughts? Should we make this change? 
Seems like a win overall to me, but I miss details all the time. Let me know 
your thoughts.



As to possible flaws, I see two that are on the very edge of remote 
possibility.

But, for sake of completeness, I'll note:

* I think using characters 0x80-0x9F might just conceivably screw up 
some crazy text editors' "what encoding is this?" guesswork-- with 
what consequences I don't know.
But, ya know, as Paul F. Tompkins says: "We are living in a year with 
a TWO IN FRONT OF IT!", so any editor that silently guesses that way, 
and somehow silently makes bad things happen, should have already been 
pushed out an airlock at least a decade ago.


* And, speaking of heuristics: I think the recognition heuristics in 
Unix's file(1) might... remotely, conceivably... change file(1)'s 
opinion of what a pure-Pod input file is, from "yes" to "no", if it 
construes a file that has 0x80-0x9F but also has "=encoding latin-1" 
as a paradox that means something not-Pod.  Hypothetically.
But that is far beyond any sense that file(1) can be expected to 
*reliably* have (or maybe can even express in its recognition rules).
Already file(1) is just catastrophically dumb at anything other than 
answering things like "is this extensionless file a GIF?", because 
beyond that, it already guesses wrong more often than right.


I've just now run it on Pod/Simple.pod and it said
  C source, ASCII text
Boioiooing.

And I've just now run it on a s2763_sjis.pod I had lying around, which 
has two kanji in the first 64 bytes-- and with a "=encoding shiftjis" 
being the second line in the file!, and file(1) said:
  Perl POD document, Non-ISO extended-ASCII text, with CRLF, NEL line 
  terminators


So... Don't overthink why file(1) does what it does--  *it* certainly 
doesn't overthink it.



I hope this message has helped.
REESE'S PIECES OUT.



Re: Assume CP1252

2015-01-10 Thread David E. Wheeler
On Jan 10, 2015, at 5:48 PM, Sean Burke sbu...@cpan.org wrote:

 Helleu, Pod pals!
 Short version about Re: Assume CP1252-- I advise: yes, assume CP1252 where 
 technically you were expecting Latin-1.

Thanks for chiming in, Sean.

 I agree completely, go for it!
 
 Yes:
 * assume that input is CP1252 in the absence of any encoding being declared
 * assume that input is CP1252 if the declared encoding is Latin-1
 
 As far as I know, that amicable bait-and-switch (i.e., construing Latin-1 to 
 actually mean the superset CP1252) means in practice that everybody wins, and 
 nobody loses, and DWIM prevails yet again.

Right, I vaguely remember you telling me this before. I forgot about #2 (and 
the HTML 5 precedent).

 BTW: I think many people would appreciate having "=encoding ansi" tolerated 
 as a synonym for "=encoding win-1252"... because some systems simply call it 
 that-- and I can never remember 1252 vs 1250 vs my own zipcode vs last four 
 digits of my Antarctican passport, etc.

ansi == cp1252??

I think Encode determines aliases.

 Incidentally, you presumably might want to expand the %Latin1Code_to_fallback 
 table in Pod::Escapes.

Paging Neil Bowers.

 Now, there's two issues that may or may not be already seen as separate:
 * assuming that input is CP1252 in the absence of any encoding being declared
 * assuming that input is CP1252 if the declared encoding is Latin-1
 I suggest doing both (like HTML5)-- but at least the first definitely!

+1

 If anyone wants extreme SM, maybe throw a note in WARNINGS about "I 
 expected this to be in Latin-1 but it looks like maybe you should probably 
 have a '=encoding win1252' line."
 But that seems a case of pointless and even onerous obtuseness, instead of 
 unproblematic DWIM.  I think.

Meh. I'm thinking, however, of adding a note to the ChangeLog for the next 
release that this change will be in the following release. I’ve already added a 
note that support for Perls  5.5 will be dropped.

 As to possible flaws, I see two that are on the very edge of remote 
 possibility.
 But, for sake of completeness, I'll note:

Pretty obscure!

 I hope this message has helped.
 REESE'S PIECES OUT.

Thanks again!

Best,

David






Re: Assume CP1252

2015-01-08 Thread Ricardo Signes
* Grant McLean gr...@mclean.net.nz [2015-01-07T18:47:49]
 I also agree this is a good idea.  None of the Latin-1 control
 characters that CP1252 replaces with printable characters should be
 appearing in POD anyway.

Seems safe, I think.  At first, I thought, "They're disjunct!!" but then I
realized that this is only true on codepoints that nobody is going to use in
their Latin-1 POD.

-- 
rjbs

