Re: The Encoding Warning (Again) - some data

2012-08-29 Thread Will Coleda
On Wed, Aug 29, 2012 at 8:17 AM, Shawn H Corey wrote:
> On Tue, 28 Aug 2012 22:16:20 -0600
> "Sean M. Burke" wrote:
>
>> But I think we should also hold pod parsers to a high standard of
>> keeping quiet when simple heuristics are unproblematically applied.
>
> I'm not sure about that. One of the reasons there's so much trashy
> HTML out there is that browsers accept it without complaint. I mean,
> look at the problems of parsing this:
>
> =item 1. An Example
>
> When they really meant:
>
> =item 1.
>
> B<An Example>
>
> As the old saying goes: it's the squeaky wheel that gets fixed.

This was a concern before HTML5, but now the trend is to work with the
HTML you have and handle errors, not to force the HTML to be more
strict.

http://en.wikipedia.org/wiki/HTML5#Error_handling

So, even if we don't call those documents valid POD, it's OK if we
handle the ways they deviate from the spec helpfully.

Regards.




-- 
Will "Coke" Coleda


Re: The Encoding Warning (Again) - some data

2012-08-29 Thread Shawn H Corey
On Tue, 28 Aug 2012 22:16:20 -0600
"Sean M. Burke" wrote:

> But I think we should also hold pod parsers to a high standard of 
> keeping quiet when simple heuristics are unproblematically applied.

I'm not sure about that. One of the reasons there's so much trashy
HTML out there is that browsers accept it without complaint. I mean,
look at the problems of parsing this:

=item 1. An Example

When they really meant:

=item 1.

B<An Example>

As the old saying goes: it's the squeaky wheel that gets fixed.


-- 
Just my 0.0002 million dollars worth,
  Shawn

Programming is as much about organization and communication
as it is about coding.

_Perl links_
official site   : http://www.perl.org/
beginners' help : http://learn.perl.org/faq/beginners.html
advance help: http://perlmonks.org/
documentation   : http://perldoc.perl.org/
news: http://perlsphere.net/
repository  : http://www.cpan.org/
blog: http://blogs.perl.org/
regional groups : http://www.pm.org/


Re: The Encoding Warning (Again) - some data

2012-08-28 Thread Sean M. Burke

Hi all!
I'm still on this pod-people list, but I very rarely read it-- and 
usually just a glance at the subject lines.
But today I looked, saw i18n stuff going on, and I gave things a quick 
skim.  Well,...



First-and-foremost: perlpodspec is no longer mine.

But I can weigh in on my original intent in bits and pieces of it, for 
whatever that is worth.  My intent was:
  For files with no encoding declarations, when the parser hits 8-bit 
stuff, it should politely DWIM when it's sensible and possible.


And if I were writing the spec over again, I would spell that out: if 
that heuristic is applied and it is unproblematic (i.e., not 
contradicted later on, nor contradicting a BOM), then keep quiet.



I know we should hold CPAN module authors to a high standard of 
dotting their i's and crossing their t's ...and declaring their encodings.


But I think we should also hold pod parsers to a high standard of 
keeping quiet when simple heuristics are unproblematically applied.


(Pod parsers already do some silent heuristics,... well, I put the 
"heuristics" into the spec itself, just to make sure that they would 
get universally and silently implemented.  Example: numeric item 
points, "=item 3", are actually anything that matches

  m/\A=item\s+\d+\.?\s*\z/
...That's a moderately forgiving regexp.)
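
To make the "moderately forgiving" part concrete, here is a minimal sketch of that pattern. It is a Python transliteration written only so it can be run standalone; it is not code from any Pod parser, and the helper name is_numeric_item is made up for illustration.

```python
import re

# Transliteration of the Perl pattern quoted above. Perl's \z (absolute
# end of string) is spelled \Z in Python's re module.
NUMERIC_ITEM = re.compile(r"\A=item\s+\d+\.?\s*\Z")

def is_numeric_item(line: str) -> bool:
    """True if the line is a bare numbered item such as '=item 3'."""
    return NUMERIC_ITEM.match(line) is not None

if __name__ == "__main__":
    for sample in ["=item 3", "=item 3.", "=item  42.  ",
                   "=item 1. An Example", "=item *"]:
        print(repr(sample), is_numeric_item(sample))
```

The optional dot and the trailing whitespace are what is forgiven; "=item 1. An Example" does not match, so under this pattern it would not count as a numbered item.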

And now I consider that perl itself is *thick* with silent heuristics 
(such as: the pattern /goo$/ has nothing to do with the variable $/ 
...to say nothing of what z/8/g can do, depending on prototypes).
And considering those things, I would change a "should" to a "must", 
and I would say:


A pod parser must obey that encoding-guessing heuristic, and it must 
do so silently--


* Unless the result of the heuristic is contradicted by a different 
kind of byte sequence later, or by a contradictory "=encoding" line. 
(In that case, yes, complain!)


* Or unless the parser is asked to emit a warning (otherwise: stay quiet).
And the warning should be asked for probably just in the situation of 
the code being parsed as part of some "critique my code in EVERY way 
possible!" program-- not in the context of a user simply trying to 
install a module.
(Did I say *a* module?  Worse situation: when the user is simply 
trying to install one module, then the installer reports that that 
module has 5 dependencies, hey let's install them all, but dependency 
#3 is the one causing the discomfortingly baffling "Highbit but no 
=encoding!?" warning to fly by.)
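
A rough sketch of that policy, in Python so it runs standalone. It assumes the guess has already been made elsewhere and is passed in; the function name diagnostics, the chatty flag and the message texts are all invented here, and none of this is Pod::Simple's actual behaviour or wording.

```python
import codecs

def diagnostics(data: bytes, guessed: str, declared=None, chatty=False):
    """Return warning strings for an encoding guess; an empty list means stay quiet."""
    messages = []
    # A later byte sequence that the guessed encoding cannot decode
    # contradicts the guess, so complain.
    try:
        data.decode(guessed)
    except UnicodeDecodeError as err:
        messages.append(f"byte at offset {err.start} contradicts guessed {guessed}")
    # An explicit =encoding line naming a different encoding also contradicts it.
    # codecs.lookup() normalises aliases such as "utf8" and "UTF-8".
    if declared is not None and codecs.lookup(declared).name != codecs.lookup(guessed).name:
        messages.append(f"=encoding {declared} contradicts guessed {guessed}")
    # Otherwise stay quiet unless the caller explicitly asked for the warning.
    if chatty and declared is None and not messages:
        messages.append(f"no =encoding line; guessed {guessed}")
    return messages
```

With the default chatty=False, a file whose guess is never contradicted produces no messages at all, which is the install-time case described above; a "critique everything" tool would pass chatty=True.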


Users' and authors' attention is better spent elsewhere than chasing 
why a previously all-OK module now throws warnings because Garcia 
changed into García. (And that difference might be invisible in our 
fun scrunched up programmerese fonts!)



But the above was just my intent.  Since then, there's been a decade 
for people to find trouble with possible "utf8 or latin1?" heuristics, 
or to see complications that need dodging,...

So it's up to you all, and I won't debate anyone's conclusions.


Oh, and I hope I can help here:

On 08/27/2012 10:03 PM, Grant McLean wrote:

> On a tangentially related note, I was pondering whether the heuristic
> should actually fall back to CP1252 rather than ISO8859-1 - after all
> that's what the W3C recommend:
>
>    http://www.w3.org/TR/html5/parsing.html#character-encodings-0


I say go for it.  Because: CP1252 simply hadn't occurred to me.
If it had, I would have declared CP1252 the fallback, instead of Latin-1.
(I would have done this especially if I had seen that W3C 
recommendation!  They do not do such things lightly.)


So wherever you see Latin-1, in the spec or in this message, I 
retcon-wise intended CP1252!


Ya know, also, maybe if we're in Latin-1 mode even from an explicit 
"=encoding latin-1", maybe 0x80-0x9F sequences should be forgiven as 
being in CP1252.  It would be something that I, just as a *user*, 
would appreciate.  I would never turn on true Latin-1 for the purposes 
of constraining me from using characters that CP1252 has in 0x80-0x9F. 
 But you guys figure it out.



And incidentally, it's up to you all to figure out whether \d in the 
=item thing should be changed to [0-9], now that \d can be all 
Unicodey and can be digits like ३٢๔༩၇ etc.,... Or I dunno, maybe 
someone writing pod Thai documentation does want "=item ๖", in which 
case the current regexp is... correct? I guess?
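
The difference between the two spellings can be shown with a small Python sketch. Python's \d on text strings also matches any Unicode decimal digit, so it stands in here for the Unicode-aware Perl behaviour described above; the names below are illustrative only.

```python
import re

# \d matches any Unicode decimal digit; [0-9] matches ASCII digits only.
UNICODE_DIGITS = re.compile(r"\A=item\s+\d+\.?\s*\Z")
ASCII_DIGITS = re.compile(r"\A=item\s+[0-9]+\.?\s*\Z")

def classify(line: str):
    """Return (matches with \\d, matches with [0-9]) for one =item line."""
    return (bool(UNICODE_DIGITS.match(line)), bool(ASCII_DIGITS.match(line)))

if __name__ == "__main__":
    print(classify("=item 3"))        # ASCII digit
    print(classify("=item \u0e56"))   # U+0E56 THAI DIGIT SIX
```

An "=item" line with a Thai digit is a numbered item under the \d version and not under the [0-9] version, so the choice comes down to which of those outcomes the spec wants.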



BTW: "pod Thai" GET IT?  PAD THAI.. Like the NOODLES! HAAAH!


Ten years I have waited for that.  Ten.  Ten.



Re: The Encoding Warning (Again) - some data

2012-08-28 Thread David E. Wheeler
On Aug 27, 2012, at 9:03 PM, Grant McLean wrote:

> My opinion is that the warning is useful.  However to be pragmatic, now
> that the encoding detection heuristic has been implemented, adding
> =encoding declarations to any of those 1215 distributions will have no
> practical effect other than silencing the warning.  That's the best
> argument I can come up with in favour of changing the status quo.

Well, that and you ensure that the proper encoding is selected (assuming the 
developer knows how to encode things properly).

> On a tangentially related note, I was pondering whether the heuristic
> should actually fall back to CP1252 rather than ISO8859-1 - after all
> that's what the W3C recommend:
> 
>  http://www.w3.org/TR/html5/parsing.html#character-encodings-0
> 
> However my statistics show that only 44 files in current releases were
> detected as Latin-1 but actually contained CP1252 (typically "smart
> quote" symbols in the \x80-\x9F range).  So it doesn't seem worth
> pursuing that change.

You could assume CP1252 if characters are found in that range and Latin-1 
otherwise.
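
A minimal sketch of that suggestion, in Python so it runs standalone. The function names are hypothetical, and this covers only the single-byte fallback, i.e. the case where the UTF-8 guess has already been ruled out.

```python
def pick_single_byte_fallback(data: bytes) -> str:
    """Return 'cp1252' if any byte falls in 0x80-0x9F, else 'latin-1'."""
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin-1"

def decode_fallback(data: bytes) -> str:
    # errors="replace" because Python's cp1252 codec leaves five byte
    # values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined.
    return data.decode(pick_single_byte_fallback(data), errors="replace")

if __name__ == "__main__":
    print(decode_fallback(b"\x93smart quotes\x94"))  # decoded as CP1252
    print(decode_fallback(b"Garc\xeda"))             # decoded as Latin-1
```

Since CP1252 and Latin-1 agree everywhere outside 0x80-0x9F, the two branches only ever differ for bytes in that range.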

> Finally I searched for files which were detected as UTF-8 but actually
> contained characters from the CP1252 range.  There was only one and it
> wasn't an error in the detection, the source file contains a
> double-encoded character. It was a mangled attempt to name a contributor
> - Slaven Resić   :-)

Coincidence I’m sure!

Best,

David



Re: The Encoding Warning (Again) - some data

2012-08-27 Thread Grant McLean
On Mon, 2012-08-27 at 10:17 -0700, David E. Wheeler wrote:
> Pod People,
> 
> In https://rt.cpan.org/Ticket/Display.html?id=79232, Slaven Rezic writes:
> 
> > Pod::Simple currently (e.g. with version 3.23) complains if a Pod
> > document has latin-1 characters in it but no =encoding command
> > specified. I think this is incorrect, both perlpod.pod and
> > perlpodspec.pod specify that a document without =encoding command is in
> > latin-1:

When I kicked this process off in April the issue I was trying to fix
was that UTF-8 documents did not render correctly on metacpan.org.  I
proposed two changes: implementing the encoding heuristic and adding the
warning.  There was a small amount of discussion and both proposals were
considered sane.

  http://www.nntp.perl.org/group/perl.pod-people/2012/04/msg1789.html

At the time I had no data on how many distributions were affected (I
only knew I saw mangled characters quite frequently).  Now that the
patch is in and generating the warning, I am able to get that data.  So
today I rendered all the POD from all current distributions in my
minicpan and collected stats on how often the warning was generated.

From a total of 5157 distributions, files in 1215 distributions generated
the warning (i.e.: contained non-ASCII characters in POD with no
=encoding declaration).

The split was roughly 50-50 with 1187 files being detected as Latin-1
and 1131 as UTF-8.

So there are currently 1131 files which are now able to render correctly
on metacpan.org, which was my goal.

There are also 1187 files (from 570 distributions) which rendered
perfectly fine before but will now include the new warning in places
where rendering of parser errors is enabled.  This is approximately 11%
of current releases on CPAN - probably a higher number than I would have
anticipated.
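
For anyone wanting to check the percentages, the arithmetic on the counts above can be reproduced like this (Python, using only the numbers quoted in this message; the pct helper is just for illustration):

```python
TOTAL_DISTS = 5157     # current distributions in the minicpan
WARNED_DISTS = 1215    # distributions with at least one warning
LATIN1_FILES = 1187    # files detected as Latin-1
LATIN1_DISTS = 570     # distributions those Latin-1 files come from
UTF8_FILES = 1131      # files detected as UTF-8

def pct(part: int, whole: int) -> float:
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

if __name__ == "__main__":
    print(pct(WARNED_DISTS, TOTAL_DISTS))                # 23.6
    print(pct(LATIN1_DISTS, TOTAL_DISTS))                # 11.1
    print(pct(LATIN1_FILES, LATIN1_FILES + UTF8_FILES))  # 51.2
```

So roughly 24% of distributions trigger the warning, about 11% do so for files that rendered fine before, and the Latin-1/UTF-8 split is about 51/49.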

Some portion of that 11% will be using Test::Pod and will now have a
test failure where none existed before.  (Sorry I don't have the
statistics on what proportion use Test::Pod and don't limit it to
'author' tests).

So if anyone's opinion is likely to be swayed by data - there's some
data.


My opinion is that the warning is useful.  However to be pragmatic, now
that the encoding detection heuristic has been implemented, adding
=encoding declarations to any of those 1215 distributions will have no
practical effect other than silencing the warning.  That's the best
argument I can come up with in favour of changing the status quo.

If we decided to turn off that warning then I would like to see an
option to allow people to turn it back on if they want.


On a tangentially related note, I was pondering whether the heuristic
should actually fall back to CP1252 rather than ISO8859-1 - after all
that's what the W3C recommend:

  http://www.w3.org/TR/html5/parsing.html#character-encodings-0

However my statistics show that only 44 files in current releases were
detected as Latin-1 but actually contained CP1252 (typically "smart
quote" symbols in the \x80-\x9F range).  So it doesn't seem worth
pursuing that change.

Finally I searched for files which were detected as UTF-8 but actually
contained characters from the CP1252 range.  There was only one and it
wasn't an error in the detection, the source file contains a
double-encoded character. It was a mangled attempt to name a contributor
- Slaven Resić   :-)
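
As an illustration of how a double-encoded character ends up with code points in the \x80-\x9F range, here is a small Python sketch. It assumes the intermediate misreading was Latin-1, which is a guess; how the actual file was mangled is not known from this thread.

```python
def has_c1_range(text: str) -> bool:
    """True if the text contains any code point in U+0080-U+009F."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)

original = "\u0107"                      # LATIN SMALL LETTER C WITH ACUTE
once = original.encode("utf-8")          # b'\xc4\x87'
# Misreading the UTF-8 bytes as Latin-1 and encoding again double-encodes them.
twice = once.decode("latin-1").encode("utf-8")   # b'\xc3\x84\xc2\x87'
decoded = twice.decode("utf-8")          # two characters: U+00C4, U+0087

if __name__ == "__main__":
    print(once, twice)
    print(has_c1_range(original), has_c1_range(decoded))
```

The double-encoded bytes are still valid UTF-8, so detecting the file as UTF-8 is correct; the stray U+0087 is what shows that the source itself was mangled.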

Regards
Grant





The Encoding Warning (Again)

2012-08-27 Thread David E. Wheeler
Pod People,

In https://rt.cpan.org/Ticket/Display.html?id=79232, Slaven Rezic writes:

> Pod::Simple currently (e.g. with version 3.23) complains if a Pod
> document has latin-1 characters in it but no =encoding command
> specified. I think this is incorrect, both perlpod.pod and
> perlpodspec.pod specify that a document without =encoding command is in
> latin-1:
> 
> perlpod.pod (as of commit 684c7e375f581ccd114b4c6b4e8ea730402b50f3 in
> perl):
> 
> =encoding encoding
> 
> ... Most users won't need this; but if your encoding isn't US-ASCII
> or Latin-1 ...
> 
> perlpodspec.pod (as of commit c85e9b4c9684bc896847f5a80e9e91b478c2fc59
> in perl):
> 
> ... Otherwise, the character encoding should be understood as being
> UTF-8 if the first highbit byte sequence in the file seems valid as
> a UTF-8 sequence, or otherwise as Latin-1.
> 
> (Unfortunately perlpodspec.pod isn't quite clear about this when
> explaining encoding; the quoted paragraph is from the "Notes on
> Implementing Pod Processors" section.)

None of this says anything about warnings, and it does seem that, as of 3.23, 
this is exactly how Pod::Simple behaves (or will once 
https://github.com/theory/pod-simple/pull/40 is merged).
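
For reference, the rule quoted from perlpodspec can be sketched roughly as follows. This is a Python illustration of the rule as worded, not Pod::Simple's code; the name guess_encoding and the choice to treat the first unbroken run of high-bit bytes as "the first highbit byte sequence" are assumptions.

```python
import re
from typing import Optional

# First unbroken run of bytes with the high bit set.
HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")

def guess_encoding(data: bytes) -> Optional[str]:
    """Guess 'utf-8' or 'latin-1' from the first high-bit byte sequence only."""
    match = HIGHBIT_RUN.search(data)
    if match is None:
        return None                      # pure ASCII: nothing to guess
    try:
        match.group().decode("utf-8")    # does the run look like valid UTF-8?
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"

if __name__ == "__main__":
    print(guess_encoding(b"plain ASCII"))
    print(guess_encoding(b"Garc\xc3\xada"))   # UTF-8 bytes for the accented i
    print(guess_encoding(b"Garc\xeda"))       # Latin-1 byte for the accented i
```

Only the first run is examined, so a file that starts with valid UTF-8 and later contains a stray Latin-1 byte is still guessed as UTF-8; whether and when to warn is a separate question.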

Thoughts?

David