Re: Non-ASCII data in POD

2012-05-03 Thread Johan Vromans
Grant McLean gr...@mclean.net.nz writes:

 OK, so I went ahead and implemented both the warning and the heuristic
 to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
 resulting patch is here:

   https://github.com/theory/pod-simple/pull/26

This patch forces authors to add an =encoding UTF-8 line to
specify that the doc is, indeed, UTF-8 encoded.

Wouldn't it be far better to consider all POD documents to be UTF-8
encoded Unicode and fall back to Latin-1 if invalid UTF-8 sequences are
detected? In other words, do not require the author to add =encoding
UTF-8, since that would be the default, and only require =encoding
ISO-8859-1 for Latin-1 encoded documents?
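
A minimal sketch of that fallback, assuming the parser has the raw bytes
of the document in hand. The helper name decode_pod_bytes is hypothetical
and is not part of Pod::Simple or of the patch; only the core Encode
module is used.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not from Pod::Simple): try strict UTF-8 first and
# fall back to Latin-1 when the bytes are not well-formed UTF-8.
# Returns the decoded text and the name of the encoding that was used.
sub decode_pod_bytes {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes untouched so it can be decoded again below.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    # Every byte sequence is valid Latin-1, so this step cannot fail.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}
```

Pure ASCII input decodes identically under both encodings, which is why
existing ASCII-only documents would see no change.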

Since most POD documents currently are ASCII, they won't be affected.

POD docs that are Latin-1 or something similar must get an explicit
encoding line added. These are precisely the documents affected by your
patch.

-- Johan


Re: Non-ASCII data in POD

2012-05-03 Thread Grant McLean
On Mon, 2012-04-30 at 14:24 +0200, Johan Vromans wrote:
 Grant McLean gr...@mclean.net.nz writes:
 
  OK, so I went ahead and implemented both the warning and the heuristic
  to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
  resulting patch is here:
 
    https://github.com/theory/pod-simple/pull/26
 
 This patch forces authors to add an =encoding UTF-8 line to
 specify that the doc is, indeed, UTF-8 encoded.

Not exactly.  It generates a warning during the parsing process which
will be visible in the output of any formatter that has error output
enabled.  It's not a fatal error so it doesn't exactly enforce
anything.

The aim is to help people comply with the spec for POD as it is
currently written.  And that spec says that if there are non-ASCII
characters there must be an =encoding declaration.
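
To make that rule concrete, here is a rough sketch of such a check. It is
not the code from the pull request: the function name and the message
text are made up for illustration, and it simplifies by accepting an
=encoding line anywhere in the document.

```perl
use strict;
use warnings;

# Hypothetical check (not the patch itself): return a warning string when
# the raw POD bytes contain a non-ASCII byte but no =encoding line, and
# return nothing otherwise.
sub missing_encoding_warning {
    my ($bytes) = @_;

    # Simplification: an =encoding line anywhere counts as a declaration.
    return if $bytes =~ /^=encoding\s+\S+/m;

    # Nothing to report for pure ASCII input.
    return unless $bytes =~ /([^\x00-\x7F])/;

    # $-[1] is the byte offset of the first non-ASCII byte; count the
    # newlines before it to get a 1-based line number.
    my $before = substr($bytes, 0, $-[1]);
    my $line   = 1 + ($before =~ tr/\n//);

    return "non-ASCII byte at line $line but no =encoding declaration";
}
```

The result is only a message, not an exception, which matches the point
that the warning does not stop parsing.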

 Wouldn't it be far better to consider all POD documents to be UTF-8
 encoded Unicode and fall back to Latin-1 if invalid UTF-8 sequences are
 detected?

You won't get any argument from me that UTF-8 would be a better default,
but that's not how the spec is currently written.

If your Perl source code includes UTF-8 characters, you must say:

  use utf8;

If your POD includes UTF-8 characters, you must say:

  =encoding utf8

 In other words, do not require the author to add =encoding
 UTF-8, since that would be the default, and only require =encoding
 ISO-8859-1 for Latin-1 encoded documents?

The patch also implements the heuristic recommended in perlpodspec,
which allows either Latin-1 or UTF-8 to work in spite of the missing
declaration (the default is ASCII).  This will be a win for sites like
metacpan.org, which currently don't display UTF-8 correctly from POD
that lacks an =encoding declaration.
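
As I read perlpodspec, the heuristic looks at the first high-bit byte
sequence in the file and treats the document as UTF-8 if that sequence
looks like valid UTF-8. The sketch below shows the idea only; it is not
the code from the pull request, the name guess_pod_encoding is
hypothetical, and the UTF-8 pattern is deliberately simplified.

```perl
use strict;
use warnings;

# Hypothetical helper (not the patch itself): guess the encoding of
# undeclared POD bytes by inspecting only the first run of high-bit bytes.
sub guess_pod_encoding {
    my ($bytes) = @_;

    # No byte above 0x7F means the document is plain ASCII.
    return 'ASCII' unless $bytes =~ /([\x80-\xFF]+)/;
    my $run = $1;

    # Simplified shape of one UTF-8 encoded character: a lead byte
    # followed by the matching number of continuation bytes. Overlong
    # forms and surrogates are not rejected here.
    my $utf8_char = qr/ [\xC2-\xDF][\x80-\xBF]
                      | [\xE0-\xEF][\x80-\xBF]{2}
                      | [\xF0-\xF4][\x80-\xBF]{3} /x;

    return $run =~ /\A(?:$utf8_char)+\z/ ? 'UTF-8' : 'ISO-8859-1';
}
```

Because only the first run is examined, a document that mixes encodings
later on can still be guessed wrongly, which is one more reason to keep
the warning about the missing declaration.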

Any formatter that has error display disabled will see better rendering
of UTF-8 with this patch.

Additionally, if errors are displayed, the non-compliance with
perlpodspec will be reported.

Regards
Grant