Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2013-06-06 Thread Dominic Hargreaves
Tags: -1 - patch

On Mon, May 23, 2011 at 04:05:11PM -0700, Russ Allbery wrote:
 Niko Tyni nt...@debian.org writes:
 
  It's clearly still true, and I can't see any fix for it other than
  adding =encoding utf8 lines in the POD files where necessary.
 
  However, I think all the documents that are rendered incorrectly with
  --utf8 are already rendered incorrectly now, albeit in a different
  way. See below.
 
 Yes, without the --utf8 option, pod2man assumes that it can only use 7-bit
 ASCII, and hence mangles non-ASCII characters pretty badly.  This is
 required for completely portable *roff output, since high-bit characters
 can even cause segfaults on some really old, broken *roff implementations.
 But this is probably now too conservative.
 
 I think the default, if --utf8 is not given, should probably be to just
 encode output in whatever the default local locale is and assume that
 people will do something else if they have to generate *roff that works on
 old, broken systems.  I'm not sure what to do if that locale is C, though.

Niko's patch to use pod2man --utf8 was applied (and then the code was
rewritten...). As we have seen during the perl 5.18 rebuild testing,
missing =encoding is now a fatal error.

I think these points mean that this bug is essentially fixed in Debian
(experimental) and should be closed. I will aim to verify this using the
test case provided by the original submitter before closing this bug
(I don't have access to a suitable test system at the moment, but I
wanted to record this on the bug report whilst at least some of the
details were in my head).

-- 
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2011-05-27 Thread Dominic Hargreaves
forwarded 492037 http://rt.perl.org/rt3/Public/Bug/Display.html?id=78332
thanks

On Sun, May 22, 2011 at 09:01:23AM +0300, Niko Tyni wrote:
 On Sat, May 21, 2011 at 03:56:16PM +0100, Dominic Hargreaves wrote:
 
  As far as I can see, pod2man --utf8 now exists, but will not render
  all documents correctly - possibly =encoding UTF8 is needed for this
  to work.
  
  Is this statement still true, or has any progress happened since the
  last message on this bug which I've missed?
 
 It's clearly still true, and I can't see any fix for it other than adding
 =encoding utf8 lines in the POD files where necessary.
 
 However, I think all the documents that are rendered incorrectly with
 --utf8 are already rendered incorrectly now, albeit in a different
 way. See below.

[snip]

 A quick check [1] on my system gives 26 files in /usr/share/perl5 that
 use UTF-8 characters in the POD part but don't declare an =encoding
 utf-8. All of them that I checked have broken manpages already (except
 Spiffy.pm which has been fixed with a hack, see #441828.)
 
 The proposed change of using --utf8 by default would just break these
 in a different way AFAICS.

Okay, so the patch is still okay to propose upstream, at least.

 (This looks like something lintian could detect.)
 
 [1]  find . -name '*.pm' -o -name '*.pod' | while read i; do if ! podselect $i | perl -ne '$e++ if /^=encoding/; exit 1 if /[\200-\377]/ && !$e' && iconv -f utf8 -t utf8 $i >/dev/null 2>&1; then echo $i; fi; done
 
  Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000
  still has the patch from Niko with no further comment, so once we
  understand the current situation it would probably make sense to
  comment on that bug, to avoid anyone taking that and repeating work.
 
 I see Porting/Maintainers.pl says blead is upstream for Pod-Perldoc,
 so I seem to have filed the above ticket in a wrong place.

Okay, moved.
Dominic.

-- 
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)






Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2011-05-23 Thread Russ Allbery
Niko Tyni nt...@debian.org writes:

 It's clearly still true, and I can't see any fix for it other than
 adding =encoding utf8 lines in the POD files where necessary.

 However, I think all the documents that are rendered incorrectly with
 --utf8 are already rendered incorrectly now, albeit in a different
 way. See below.

Yes, without the --utf8 option, pod2man assumes that it can only use 7-bit
ASCII, and hence mangles non-ASCII characters pretty badly.  This is
required for completely portable *roff output, since high-bit characters
can even cause segfaults on some really old, broken *roff implementations.
But this is probably now too conservative.

I think the default, if --utf8 is not given, should probably be to just
encode output in whatever the default local locale is and assume that
people will do something else if they have to generate *roff that works on
old, broken systems.  I'm not sure what to do if that locale is C, though.

-- 
Russ Allbery (r...@stanford.edu) http://www.eyrie.org/~eagle/






Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2011-05-22 Thread Niko Tyni
On Sat, May 21, 2011 at 03:56:16PM +0100, Dominic Hargreaves wrote:

 As far as I can see, pod2man --utf8 now exists, but will not render
 all documents correctly - possibly =encoding UTF8 is needed for this
 to work.
 
 Is this statement still true, or has any progress happened since the
 last message on this bug which I've missed?

It's clearly still true, and I can't see any fix for it other than adding
=encoding utf8 lines in the POD files where necessary.

However, I think all the documents that are rendered incorrectly with
--utf8 are already rendered incorrectly now, albeit in a different
way. See below.

Incorrect (double encoded) output with a missing =encoding utf8:

 perl -CO -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'
 .SH Ã¤

Correct output:

 perl -CO -Mcharnames=:full -E 'say qq(=encoding utf8\n\n=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'

 .SH ä

Current behaviour for UTF-8 with a missing =encoding utf8 is just as broken:

 perl -CO -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man | grep '^\.SH'
 .SH A\*~X

and pure latin1 without an =encoding works with both of course:

 perl -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man | grep '^\.SH'
 .SH a\*:

 perl -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'
 .SH ä
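
The double encoding in the first example can be made concrete at the byte
level. The following is only a minimal sketch: it assumes iconv and od are
installed, and the helper name double_encode is invented for illustration.
U+00E4 is the two bytes 303 244 (octal) in UTF-8; reading those bytes as
Latin-1 and encoding them to UTF-8 again is effectively what happens to
undeclared UTF-8 input under --utf8.

```shell
# Hypothetical helper, not part of pod2man: push the UTF-8 bytes of
# U+00E4 through a Latin-1 -> UTF-8 conversion and print them in octal.
double_encode() {
    # Unquoted on purpose: word splitting normalizes od's spacing.
    set -- $(printf '\303\244' | iconv -f ISO-8859-1 -t UTF-8 | od -An -to1)
    echo "$*"
}

double_encode   # prints: 303 203 302 244
```

The result is four bytes, i.e. the two characters U+00C3 and U+00A4, which
a UTF-8 terminal displays as two junk characters instead of one letter.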
 
A quick check [1] on my system gives 26 files in /usr/share/perl5 that
use UTF-8 characters in the POD part but don't declare an =encoding
utf-8. All of them that I checked have broken manpages already (except
Spiffy.pm which has been fixed with a hack, see #441828.)

The proposed change of using --utf8 by default would just break these
in a different way AFAICS.

(This looks like something lintian could detect.)

[1]  find . -name '*.pm' -o -name '*.pod' | while read i; do if ! podselect $i | perl -ne '$e++ if /^=encoding/; exit 1 if /[\200-\377]/ && !$e' && iconv -f utf8 -t utf8 $i >/dev/null 2>&1; then echo $i; fi; done
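
The check in [1] can be approximated as a small shell function, which may
be a more convenient starting point for a lintian-style test. This is a
simplified sketch under stated assumptions, not the command above: the name
pod_missing_encoding is hypothetical, it needs iconv, and it looks at the
whole file instead of only the POD part that podselect would extract.

```shell
# Hypothetical helper: succeed (exit 0) if FILE contains non-ASCII bytes,
# is valid UTF-8, and has no =encoding line.
pod_missing_encoding() {
    file=$1
    # Any =encoding declaration means the file is not a candidate.
    grep -q '^=encoding' "$file" && return 1
    # Count the bytes outside the 7-bit ASCII range.
    high=$(LC_ALL=C tr -d '\000-\177' < "$file" | wc -c)
    [ "$high" -gt 0 ] || return 1
    # iconv fails on input that is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1
}
```

Inside a loop over the files found, a call such as
pod_missing_encoding "$i" && echo "$i" would list the candidates.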

 Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000
 still has the patch from Niko with no further comment, so once we
 understand the current situation it would probably make sense to
 comment on that bug, to avoid anyone taking that and repeating work.

I see Porting/Maintainers.pl says blead is upstream for Pod-Perldoc,
so I seem to have filed the above ticket in a wrong place.
-- 
Niko Tyni   nt...@debian.org






Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2011-05-21 Thread Dominic Hargreaves
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
 Niko Tyni nt...@debian.org writes:
 
  Any estimate on how widespread this POD problem is? Is the hardcoded
  'pod2man --utf8' in the Lenny perldoc going to cause more grief than
  it's worth?
 
  I'm leaning on reverting that and reopening #492037 until the issue is
  sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
  the '--utf8' option on the perldoc command line is one possibility,
  but it might as well cause even further trouble if upstream chooses a
  different implementation.
 
 I looked at this some more, and there's a deeper problem.  If you run the
 current pod2man with --utf8 on an input POD file that doesn't declare an
 =encoding of UTF-8, any use of S<> in that POD file will result in invalid
 UTF-8, even if there's no use of high-bit characters in the input POD at
 all.
 
 I think the core problem was that Pod::Man is responsible for the output
 through the file handle and was missing an encoding layer.  The problem is
 that we can't just call encode() on the output, since that breaks if
 PERL_UNICODE is set or if an encoding was manually set on the file handle.
 You get double-encoding.  I think the least bad option is for Pod::Man and
 Pod::Text to force the encoding on their output file handles to UTF-8 when
 --utf8 is given.
 
 The problem with this fix is that this now really will break pod2man
 --utf8 if POD documents don't have their encoding declared properly, since
 it will end up double-encoding the UTF-8 given that, without =encoding,
 Pod::Simple is treating the input as ISO 8859-15.  I think it's correct
 according to the specifications, but existing POD text that doesn't
 declare an encoding will get double-encoded output.  I can work around
 this by not setting a UTF-8 output encoding unless the input encoding is
 detected as UTF-8, but that's not really correct.  You *should* be able to
 have an input POD document with =encoding ISO-8859-1 and run it through
 pod2man --utf8 and get UTF-8 output.  But a POD document with no
 =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.
 
 Pod::Text has an additional challenge.  pod2man won't produce any
 non-ASCII characters without --utf8 and has been that way since the
 beginning of the Pod::Simple implementation.  pod2text, on the other hand,
 always passed through whatever it got.  I could just leave it alone, but
 if you feed the current pod2text a document that *does* have =encoding
 UTF-8 in it, you get Perl warnings about wide characters on output.  I
 think the best solution here is to force the output file handle to have an
 encoding matching what Pod::Simple believes the input encoding is.  This
 comes the closest to preserving the traditional pass-through behavior.
 
 I think that for lenny you may want to back out of the --utf8 change and
 give it some time to settle.

[the --utf8 change being the change to have perldoc run pod2man with
the --utf8 option by default].

I've spent a bit of time reading through #492037 (this bug) and
#480997 (which was resolved) trying to figure out how to progress
this issue.

As far as I can see, pod2man --utf8 now exists, but will not render
all documents correctly - possibly =encoding UTF8 is needed for this
to work.

Is this statement still true, or has any progress happened since the
last message on this bug which I've missed?

Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000
still has the patch from Niko with no further comment, so once we
understand the current situation it would probably make sense to
comment on that bug, to avoid anyone taking that and repeating work.

Dominic.

-- 
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-05 Thread Niko Tyni
On Fri, Oct 03, 2008 at 01:49:16PM +0100, Colin Watson wrote:

 Due to groff's inability to take Unicode input in most cases at the
 moment, man needs to know the language of the manual page in order to
 recode it back to a legacy encoding for formatting by groff. It does
 this either by relying on it being in a directory structure that looks
 like that used for translated manual pages, or else by guessing based on
 the locale.

Ah, thanks and sorry for not doing my homework.

It's all looking good to me now, so I intend to upload the Pod::Man
binmode() fix as 5.10.0-16.
-- 
Niko Tyni   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-05 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

 OK, it's your call of course.

 Both patches have the one important property: they can't break anything
 when the utf8 option isn't used. I don't see any point in diverging,
 so I think the attached patch (based on yours but cleansed of the
 unrelated stopword changes) is the best choice for Lenny.

As mentioned in the debian-release response, it looks good to me.

I've been thinking about how best to fix this in the long run, and I think
that there are ways of handling this so that the caller doesn't have to be
aware of the encoding of the output file handle but also without
overwriting global state.  I just haven't had a chance to implement it
yet, but I think I can preserve the API presented by setting the encoding
on the file handle in future versions.

The basic idea will be to use PerlIO to probe the encoding layers on the
output file handle.  If PerlIO isn't available, or if PerlIO reveals no
encoding layers, then Pod::Man and Pod::Text can just call encode before
printing the output.  If PerlIO is available with an encoding layer, but
the encoding layer is UTF-8, we can continue without doing any
encoding (and this will catch the PERL_UNICODE case).  If the encoding
layer is something incompatible, I'll probably throw an error in the event
that utf8 is set and otherwise trust it.

Not sure when I'll get to implementing this, since I have a vacation
coming up (and it's too late for lenny anyway), but definitely for squeeze
we should be able to clean this up further.

Thank you very much for all of your work on this.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-03 Thread Niko Tyni
On Thu, Oct 02, 2008 at 03:59:45PM -0700, Russ Allbery wrote:
 Niko Tyni [EMAIL PROTECTED] writes:

  So the output is ISO-8859-1 where possible and UTF-8 elsewhere.

  Russ, I think the binmode($output, ":utf8") really belongs in pod2man
  instead of Pod::Man.
 
 It turns out, at least based on the experiments that I did, that you never
 want to use an encoding of :utf8.  What this does is tell Perl to just
 dump its internal encoding to the file handle rather than applying any
 encoding.  The only supported thing you can do with that byte stream is to
 read it back in via another file handle using the :utf8 encoding.  It is
 *not* necessarily valid UTF-8, and in practice I was getting all sorts of
 really strange things from it when looking at it via something other than
 Perl.
 
 You always want to use :encoding(utf-8) instead if the output is for
 anything other than Perl.

I see. The Perl internal encoding is UTF-8, but there are ways to get
invalid UTF-8 in there, for example by using :utf8 on binary input.
This invalid UTF-8 will then be output as-is if :utf8 is set
on output.

I can't really think of a case where setting :encoding(utf-8) on output
does the right thing but :utf8 doesn't. It does turn the output into valid
UTF-8, but do you have an example where the content is not gibberish?

On the input side, :encoding(utf-8) is indeed probably the better
choice because it will croak when it encounters invalid bytes.

  Users of Pod::Man should do that themselves for their output file handle
  when they use the 'utf8' option. (This needs documentation, of course.)
 
 I'm not sure I like this as an interface since Pod::Man's supported
 interface involves opening the files itself.  This would mean that anyone
 who wants Unicode output can't use the API of Pod::Man and Pod::Text that
 have been supported for years.  I'd really rather try to transparently
 support Unicode using the existing API, even if it means messing with the
 state of provided output file handles.

How about providing your own parse_from_file() wrapper in Pod::Man that
knows about the utf8 option, does the open() and then sets the binmode?
I don't think there's any need to touch the filehandles of people using
parse_file().

  However, pod2man currently uses the parse_from_file() method, which is
  just a compatibility wrapper in Pod::Simple that does the open() and
  output_fh() calls. I suppose this should go in pod2man itself.
  Something like the attached patch might do, although I see there's some
  deeper magic in Pod::Simple.
 
 This patch looks fine to me as a workaround, although I think my previous
 patch is the better long-term fix.

OK. I'll use this (with :encoding(utf-8)) for lenny if no further
showstoppers come up.
 
 Note that Pod::Text has related issues; try running pod2text on your same
 sample POD file and you'll see that it produces warnings about wide
 characters as well.  I'm not sure if that's worth trying to tackle for
 lenny, though (it affects perldoc -t).

Yes, I think we should leave that alone at this point.
-- 
Niko Tyni   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-03 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

 I see. The Perl internal encoding is UTF-8, but there are ways to get
 invalid UTF-8 in there, for example by using :utf8 on binary input.
 This invalid UTF-8 will then be output as-is if :utf8 is set on
 output.

 I can't really think of a case where setting :encoding(utf-8) on output
 does the right thing but :utf8 doesn't. It does turn the output into valid
 UTF-8, but do you have an example where the content is not gibberish?

I can't easily duplicate what I was seeing now, but I was getting output
that was not UTF-8 while using that output encoding in combination with
Pod::Simple.  I'm not quite sure what was going on.  It's possible that I
had made some mistake in the middle of my testing, though.

 How about providing your own parse_from_file() wrapper in Pod::Man that
 knows about the utf8 option, does the open() and then sets the binmode?

I guess I could do that, but I think I disagree with this:

 I don't think there's any need to touch the filehandles of people using
 parse_file().

I would prefer not to touch the filehandle, but I don't think it's
acceptable to say that if you're using the utf8 option, you still have to
set up output encodings yourself.  Maybe I'm overreacting to how difficult
I found this area of Perl to understand, but I'd really rather that
Pod::Man and Pod::Text do the right thing without requiring people
to understand Perl's very strange Unicode handling.

Pod::Text also has to do something more complex in order to preserve its
traditional encoding agnosticism if utf8 is not given.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-03 Thread Niko Tyni
On Fri, Oct 03, 2008 at 02:19:52AM -0700, Russ Allbery wrote:
 Niko Tyni [EMAIL PROTECTED] writes:

  I don't think there's any need to touch the filehandles of people using
  parse_file().
 
 I would prefer not to touch the filehandle, but I don't think it's
 acceptable to say that if you're using the utf8 option, you still have to
 set up output encodings yourself.  Maybe I'm overreacting to how difficult
 I found this area of Perl to understand, but I'd really rather that
 Pod::Man and Pod::Text do the right thing without requiring people
 to understand Perl's very strange Unicode handling.

OK, it's your call of course.

Both patches have the one important property: they can't break anything
when the utf8 option isn't used. I don't see any point in diverging,
so I think the attached patch (based on yours but cleansed of the
unrelated stopword changes) is the best choice for Lenny.

I have also dropped the documentation change proposed earlier because
it doesn't apply any more. This and reverting the perldoc --utf8
change are the only differences from -15.

Russ, please let me know if shipping this in Lenny would be OK by you.
Note that I set $Pod::Man::VERSION to 2.18_01 to emphasize that
this isn't any of the official versions.

Colin, I can't really get man to work with cyrillic documents.  As an
example, the attached ru.pod from #492037 looks fine after 'pod2man
--utf8', but 'man -l ru.man' just drops all the cyrillic characters.
Any ideas? Is this supposed to work at all?

The same applies to debconf.ru.1.pod (which needs an =encoding koi8-r
at the top first.)
-- 
Niko Tyni   [EMAIL PROTECTED]
Make Pod::Man use the PerlIO UTF-8 output layer when used with the --utf8 
option.

Modified from upstream patch in Debian bug #500210.
diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 961171b..d87cb8c 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -36,7 +36,8 @@ use POSIX qw(strftime);
 
 @ISA = qw(Pod::Simple);
 
-$VERSION = '2.18';
+# Custom Debian version, see http://bugs.debian.org/500210
+$VERSION = '2.18_01';
 
 # Set the debugging level.  If someone has inserted a debug function into this
 # class already, use that.  Otherwise, use any Pod::Simple debug function
@@ -731,6 +732,19 @@ sub start_document {
         return;
     }
 
+    # If we were given the utf8 option, set an output encoding on our file
+    # handle.  Wrap in an eval in case we're using a version of Perl too old
+    # to understand this.
+    #
+    # This is evil because it changes the global state of a file handle that
+    # we may not own.  However, we can't just blindly encode all output, since
+    # there may be a pre-applied output encoding (such as from PERL_UNICODE)
+    # and then we would double-encode.  This seems to be the least bad
+    # approach.
+    if ($$self{utf8}) {
+        eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+    }
+
     # Determine information for the preamble and then output it.
     my ($name, $section);
     if (defined $$self{name}) {
@@ -1592,6 +1606,12 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =back
 
 The standard Pod::Simple method parse_file() takes one argument naming the
@@ -1627,6 +1647,12 @@ invalid.  A quote specification must be one, two, or four characters long.
 
 =head1 BUGS
 
+Encoding handling assumes that PerlIO is available and does not work
+properly if it isn't since encode and decode do not work well in
+combination with PerlIO encoding layers.  It's very unclear how to
+correctly handle this without PerlIO encoding layers.  The C<utf8> option
+is therefore not supported unless Perl is built with PerlIO support.
+
 There is currently no way to turn off the guesswork that tries to format
 unmarked text appropriately, and sometimes it isn't wanted (particularly
 when using POD to document something other than Perl).  Most of the work
@@ -1652,6 +1678,13 @@ Pod::Man is excessively slow.
 
 =head1 CAVEATS
 
+If Pod::Man is given the C<utf8> option, the encoding of its output file
+handle will be forced to UTF-8 if possible, overriding any existing
+encoding.  This will be done even if the file handle is not created by
+Pod::Man and was passed in from outside.  This seems to be the only way to
+consistently enforce UTF-8-encoded output regardless of PERL_UNICODE and
+other settings.
+
 The handling of hyphens and em dashes is somewhat fragile, and one may get
 the wrong one under some circumstances.  This should only matter for
 B<troff> output.
diff --git a/pod/pod2man.PL b/pod/pod2man.PL
index 

Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-03 Thread Colin Watson
On Fri, Oct 03, 2008 at 03:28:28PM +0300, Niko Tyni wrote:
 Colin, I can't really get man to work with cyrillic documents.  As an
 example, the attached ru.pod from #492037 looks fine after 'pod2man
 --utf8', but 'man -l ru.man' just drops all the cyrillic characters.
 Any ideas? Is this supposed to work at all?

Due to groff's inability to take Unicode input in most cases at the
moment, man needs to know the language of the manual page in order to
recode it back to a legacy encoding for formatting by groff. It does
this either by relying on it being in a directory structure that looks
like that used for translated manual pages, or else by guessing based on
the locale.

Thus, you can either:

  mkdir -p man/ru/man1
  mv ru.man man/ru/man1/
  man -l man/ru/man1/ru.man

or:

  # make sure the ru_RU.UTF-8 locale is generated
  LC_ALL=ru_RU.UTF-8 man -l ru.man

If groff Unicode support ever gets finished (it's been getting
asymptotically close, with only one major special-purpose piece left in
order to avoid important regressions for Japanese users) then this
should go away.

 The same applies to debconf.ru.1.pod (which needs an =encoding koi8-r
 at the top first.)

Actually I was just going to recode all that to UTF-8, as is done in
debconf's Subversion repository.

Cheers,

-- 
Colin Watson   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-02 Thread Niko Tyni
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote:
 Niko Tyni [EMAIL PROTECTED] writes:
 
  I think that for lenny you may want to back out of the --utf8 change and
  give it some time to settle.
 
  Are you referring to backing out the whole Pod::Man update (#480997)
  or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?
 
 Sorry, I meant only the pod2man --utf8 change in perldoc.  I think that
 the behavior of pod2man, while not ideal, is still basically okay for
 lenny, although I'll be releasing a new version of podlators that will
 implement the changes described in my previous mail.

Thanks, we're on the same page then. I'll revert the perldoc change.

Does this documentation patch for lenny look OK to you?
(I suppose it should be duplicated in pod2man.PL too.)


diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 961171b..a3eac5a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -1592,6 +1592,13 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that using this option currently only works properly on UTF-8
+encoded POD files that use the C<=encoding> POD command.  If the option
+is enabled on an input POD file that doesn't declare an =encoding of
+UTF-8, any use of S<> in that POD file will result in invalid UTF-8,
+even if there's no use of high-bit characters in the input POD at all.
+This is a bug that will be fixed in later versions.
+
 =back
 
 The standard Pod::Simple method parse_file() takes one argument naming the
@@ -1627,6 +1634,9 @@ invalid.  A quote specification must be one, two, or four characters long.
 
 =head1 BUGS
 
+As mentioned earlier in this document, the C<utf8> option is currently
+broken on non-UTF-8 input.
+
 There is currently no way to turn off the guesswork that tries to format
 unmarked text appropriately, and sometimes it isn't wanted (particularly
 when using POD to document something other than Perl).  Most of the work

-- 
Niko Tyni   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-02 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

 Thanks, we're on the same page then. I'll revert the perldoc change.

 Does this documentation patch for lenny look OK to you?

Yup, this looks good to me.

 (I suppose it should be duplicated in pod2man.PL too.)

Yes.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-02 Thread Niko Tyni
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote:
 Niko Tyni [EMAIL PROTECTED] writes:
 
  I think that for lenny you may want to back out of the --utf8 change and
  give it some time to settle.
 
  Are you referring to backing out the whole Pod::Man update (#480997)
  or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?
 
 Sorry, I meant only the pod2man --utf8 change in perldoc.  I think that
 the behavior of pod2man, while not ideal, is still basically okay for
 lenny, although I'll be releasing a new version of podlators that will
 implement the changes described in my previous mail.

Hm, this is looking worse the more I stare at it.

I've been testing pod2man with the attached .pod file that does have
'=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man
--utf8' gives these results:

- the Finnish a with two dots, i.e. LATIN SMALL LETTER A WITH DIAERESIS,
  is output as its ISO-8859-1 representation (octal 344)

- the Russian letter n, CYRILLIC SMALL LETTER EN, is output in UTF-8: 
  octal 320+275. However, there's a warning:

Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717.

- S<one two> gets the ISO-8859-1 NO-BREAK SPACE in between

So the output is ISO-8859-1 where possible and UTF-8 elsewhere.

I really don't think this is acceptable. The pod2man output will almost
never be valid UTF-8.

Russ, I think the binmode($output, ":utf8") really belongs in pod2man
instead of Pod::Man. Users of Pod::Man should do that themselves for
their output file handle when they use the 'utf8' option. (This needs
documentation, of course.)

However, pod2man currently uses the parse_from_file() method, which
is just a compatibility wrapper in Pod::Simple that does the open()
and output_fh() calls. I suppose this should go in pod2man itself.
Something like the attached patch might do, although I see there's some
deeper magic in Pod::Simple.

This still doesn't break anything not explicitly using the '--utf8'
option, so I suppose we could get it in lenny...

Comments welcome.
-- 
Niko Tyni   [EMAIL PROTECTED]
=encoding UTF-8

=head1 a with two dots

ä

=head1 russian letter n

н

=head1 non-breaking spaces

S<one two>

diff --git a/pod/pod2man.PL b/pod/pod2man.PL
index 3abb658..a9b5b67 100644
--- a/pod/pod2man.PL
+++ b/pod/pod2man.PL
@@ -89,7 +89,22 @@ my @files;
 do {
     @files = splice (@ARGV, 0, 2);
     print "  $files[1]\n" if $verbose;
-    $parser->parse_from_file (@files);
+    if ($options{utf8}) {
+        my ($in, $out) = (*STDIN, *STDOUT);
+        $in = $files[0] if @files;
+        if (@files == 2) {
+            open($out, ">", $files[1])
+                or die("open $files[1] for writing: $!");
+        } else {
+            $out = *STDOUT;
+        }
+        binmode($out, ":utf8");
+        $parser->output_fh($out);
+        $parser->parse_file($in);
+        close $out;
+    } else {
+        $parser->parse_from_file (@files);
+    }
 } while (@ARGV);
 
 __END__


Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-02 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

> Hm, this is looking worse the more I stare at it.

I spent four and a half hours on this the other night before producing the
patch that was in my previous message, so I'm sympathetic.  :)  It gets to
be more and more of a headache the more you work through it.

> I've been testing pod2man with the attached .pod file that does have
> '=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man
> --utf8' gives these results:
>
> - the Finnish a with two dots, i.e. LATIN SMALL LETTER A WITH DIAERESIS,
>   is output as its ISO-8859-1 representation (octal 344)
>
> - the Russian letter n, CYRILLIC SMALL LETTER EN, is output in UTF-8:
>   octal 320+275. However, there's a warning:
>
> Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717.
>
> - S<one two> gets the ISO-8859-1 NO-BREAK SPACE in between
>
> So the output is ISO-8859-1 where possible and UTF-8 elsewhere.

I was afraid of this.

The problem is that the version of Pod::Man that you have at the moment
doesn't understand anything about output encoding.  It therefore prints
out whatever Pod::Simple hands it.  This is, in Perl's Unicode world,
basically unsupported behavior.  What you get can be fairly random.  It
works in some cases but doesn't work in others.

This is the reason why I thought I needed to do things like remap the
non-breaking space.  The output is very confused, and I didn't understand
at first what I was seeing.

If one is dealing with Unicode in Perl, one is *required* to decode all
input and encode all output.  Nothing else works.  Pod::Simple does decode
input *if* =encoding is used, but doesn't encode output.  Pod::Man (and
Pod::Text for that matter) therefore have to encode output in order to
work properly.  The patch I sent previously does implement that, with some
other consequences.
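
As a minimal sketch of that decode-in, encode-out rule, using the core
Encode module (the helper name is made up for illustration and is not part
of Pod::Simple or podlators):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode bytes on the way in, work on characters,
# and encode again on the way out.
sub count_and_reencode {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);        # input bytes -> characters
    my $count = length $chars;                  # counts characters, not bytes
    return ($count, encode('UTF-8', $chars));   # characters -> output bytes
}

my ($count, $out) = count_and_reencode("\xD0\xBD");   # UTF-8 bytes for U+043D
printf "%d character(s), bytes %vX\n", $count, $out;
```

The two input bytes are treated as a single character in the middle and
come back out as the same two bytes.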

One of the problems that makes this unnecessarily hard is that Perl
doesn't keep track of whether it's *already* encoded output, so if you set
an output encoding with binmode and also call encode() directly, you get
double-encoded output.  This basically means that, in practice, encode()
is unusable if you want to support the PERL_UNICODE environment variable,
since setting PERL_UNICODE silently adds output encodings to all your file
handles which will then happily double-encode the results of encode().
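
A small sketch of that double-encoding effect, assuming an in-memory file
handle stands in for a handle that already has an output encoding (the
helper name is hypothetical):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Write one character through a handle that already has an encoding layer,
# optionally calling encode() first, and return the raw bytes produced.
sub bytes_written {
    my ($char, $also_encode) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open: $!";
    print {$fh} ($also_encode ? encode('UTF-8', $char) : $char);
    close($fh) or die "close: $!";
    return $buffer;
}

printf "layer only:       %vX\n", bytes_written("\x{E4}", 0);
printf "layer + encode(): %vX\n", bytes_written("\x{E4}", 1);
```

With the layer alone, U+00E4 comes out as the two bytes C3 A4.  With
encode() on top, each of those bytes is encoded again, giving C3 83 C2 A4.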

> Russ, I think the binmode($output, ":utf8") really belongs in pod2man
> instead of Pod::Man.

It turns out, at least based on the experiments that I did, that you never
want to use an encoding of :utf8.  What this does is tell Perl to just
dump its internal encoding to the file handle rather than applying any
encoding.  The only supported thing you can do with that byte stream is to
read it back in via another file handle using the :utf8 encoding.  It is
*not* necessarily valid UTF-8, and in practice I was getting all sorts of
really strange things from it when looking at it via something other than
Perl.

You always want to use :encoding(utf-8) instead if the output is for
anything other than Perl.
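
For illustration only, a sketch of the :encoding(UTF-8) form applied with
binmode to an already-open handle (an in-memory handle here, and a
hypothetical helper name), printing the resulting bytes in octal:

```perl
use strict;
use warnings;

# Hypothetical helper: push text through a handle whose encoding is set
# with binmode after opening, and return the bytes that were written.
sub encode_via_layer {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "open: $!";
    binmode($fh, ':encoding(UTF-8)') or die "binmode: $!";
    print {$fh} $text;
    close($fh) or die "close: $!";
    return $buffer;
}

my $bytes = encode_via_layer("\x{43D}");    # CYRILLIC SMALL LETTER EN
print join(' ', map { sprintf('%03o', ord($_)) } split(//, $bytes)), "\n";
```

This prints 320 275, the same two octets reported for the Russian letter
in the test file.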

> Users of Pod::Man should do that themselves for their output file handle
> when they use the 'utf8' option. (This needs documentation, of course.)

I'm not sure I like this as an interface since Pod::Man's supported
interface involves opening the files itself.  This would mean that anyone
who wants Unicode output can't use the API of Pod::Man and Pod::Text that
have been supported for years.  I'd really rather try to transparently
support Unicode using the existing API, even if it means messing with the
state of provided output file handles.

> However, pod2man currently uses the parse_from_file() method, which is
> just a compatibility wrapper in Pod::Simple that does the open() and
> output_fh() calls. I suppose this should go in pod2man itself.
> Something like the attached patch might do, although I see there's some
> deeper magic in Pod::Simple.

This patch looks fine to me as a workaround, although I think my previous
patch is the better long-term fix.

Note that Pod::Text has related issues; try running pod2text on your same
sample POD file and you'll see that it produces warnings about wide
characters as well.  I'm not sure if that's worth trying to tackle for
lenny, though (it affects perldoc -t).

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Niko Tyni
On Tue, Sep 30, 2008 at 08:22:35PM -0700, Russ Allbery wrote:

> You got it exactly right.  Basically, podlators has been papering over
> this bug incorrectly, but in a way that happens to do the right thing with
> a common POD problem.
>
> So if you're using UTF-8, starting the POD with:
>
> =encoding UTF-8
>
> is required.  If you add that, the current version of Pod::Man (and
> previous versions, as it turns out, mostly by chance) will do the right
> thing.

Any estimate on how widespread this POD problem is? Is the hardcoded
'pod2man --utf8' in the Lenny perldoc going to cause more grief than
it's worth?

I'm leaning towards reverting that and reopening #492037 until the issue is
sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
the '--utf8' option on the perldoc command line is one possibility,
but it might well cause even further trouble if upstream chooses a
different implementation.
-- 
Niko Tyni   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

> Any estimate on how widespread this POD problem is? Is the hardcoded
> 'pod2man --utf8' in the Lenny perldoc going to cause more grief than
> it's worth?
>
> I'm leaning towards reverting that and reopening #492037 until the issue is
> sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
> the '--utf8' option on the perldoc command line is one possibility,
> but it might well cause even further trouble if upstream chooses a
> different implementation.

I looked at this some more, and there's a deeper problem.  If you run the
current pod2man with --utf8 on an input POD file that doesn't declare an
=encoding of UTF-8, any use of S<> in that POD file will result in invalid
UTF-8, even if there's no use of high-bit characters in the input POD at
all.

I think the core problem was that Pod::Man is responsible for the output
through the file handle and was missing an encoding layer.  The problem is
that we can't just call encode() on the output, since that breaks if
PERL_UNICODE is set or if an encoding was manually set on the file handle.
You get double-encoding.  I think the least bad option is for Pod::Man and
Pod::Text to force the encoding on their output file handles to UTF-8 when
--utf8 is given.

The problem with this fix is that this now really will break pod2man
--utf8 if POD documents don't have their encoding declared properly, since
it will end up double-encoding the UTF-8 given that, without =encoding,
Pod::Simple is treating the input as ISO 8859-1.  I think it's correct
according to the specifications, but existing POD text that doesn't
declare an encoding will get double-encoded output.  I can work around
this by not setting a UTF-8 output encoding unless the input encoding is
detected as UTF-8, but that's not really correct.  You *should* be able to
have an input POD document with =encoding ISO-8859-1 and run it through
pod2man --utf8 and get UTF-8 output.  But a POD document with no
=encoding according to perlpodspec has an implicit =encoding ISO-8859-1.
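
A rough sketch of the failure mode, outside of Pod::Simple entirely: the
same UTF-8 input bytes are converted to UTF-8 output under two different
declared input encodings (the helper name is hypothetical).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode the input according to the declared
# encoding, then encode the characters as UTF-8 for output.
sub convert_to_utf8 {
    my ($bytes, $declared) = @_;
    return encode('UTF-8', decode($declared, $bytes));
}

my $input = "\xC3\xA4";    # UTF-8 bytes for LATIN SMALL LETTER A WITH DIAERESIS

printf "declared UTF-8:      %vX\n", convert_to_utf8($input, 'UTF-8');
printf "declared ISO-8859-1: %vX\n", convert_to_utf8($input, 'ISO-8859-1');
```

With the correct declaration the bytes pass through unchanged (C3 A4).
Treated as Latin-1, each input byte becomes its own character and the
output is the double-encoded C3 83 C2 A4.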

Pod::Text has an additional challenge.  pod2man won't produce any
non-ASCII characters without --utf8 and has been that way since the
beginning of the Pod::Simple implementation.  pod2text, on the other hand,
always passed through whatever it got.  I could just leave it alone, but
if you feed the current pod2text a document that *does* have =encoding
UTF-8 in it, you get Perl warnings about wide characters on output.  I
think the best solution here is to force the output file handle to have an
encoding matching what Pod::Simple believes the input encoding is.  This
comes the closest to preserving the traditional pass-through behavior.

I think that for lenny you may want to back out of the --utf8 change and
give it some time to settle.

Here's the patch that I'm planning on including in the next podlators
release, for reference.

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 48fe20e..b5aceef 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -36,7 +36,7 @@ use POSIX qw(strftime);
 
 @ISA = qw(Pod::Simple);
 
-$VERSION = '2.20';
+$VERSION = '2.21';
 
 # Set the debugging level.  If someone has inserted a debug function into this
 # class already, use that.  Otherwise, use any Pod::Simple debug function
@@ -736,6 +736,19 @@ sub start_document {
 return;
 }
 
+# If we were given the utf8 option, set an output encoding on our file
+# handle.  Wrap in an eval in case we're using a version of Perl too old
+# to understand this.
+#
+# This is evil because it changes the global state of a file handle that
+# we may not own.  However, we can't just blindly encode all output, since
+# there may be a pre-applied output encoding (such as from PERL_UNICODE)
+# and then we would double-encode.  This seems to be the least bad
+# approach.
+if ($$self{utf8}) {
+eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+}
+
 # Determine information for the preamble and then output it.
 my ($name, $section);
 if (defined $$self{name}) {
@@ -1450,8 +1463,8 @@ Pod::Man - Convert POD data to formatted *roff input
 
 =for stopwords
 en em ALLCAPS teeny fixedbold fixeditalic fixedbolditalic stderr utf8
-UTF-8 Allbery Sean Burke Ossanna Solaris formatters troff uppercased
-Christiansen
+UTF-8 UTF-8-encoded Allbery Sean Burke Ossanna Solaris formatters troff
+uppercased Christiansen
 
 =head1 SYNOPSIS
 
@@ -1608,6 +1621,12 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =back
 
 The 

Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Colin Watson
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
> > Any estimate on how widespread this POD problem is? Is the hardcoded
> > 'pod2man --utf8' in the Lenny perldoc going to cause more grief than
> > it's worth?
>
> > I'm leaning towards reverting that and reopening #492037 until the issue is
> > sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
> > the '--utf8' option on the perldoc command line is one possibility,
> > but it might well cause even further trouble if upstream chooses a
> > different implementation.
>
> I looked at this some more, and there's a deeper problem.  If you run the
> current pod2man with --utf8 on an input POD file that doesn't declare an
> =encoding of UTF-8, any use of S<> in that POD file will result in invalid
> UTF-8, even if there's no use of high-bit characters in the input POD at
> all.

Thanks for pointing out =encoding to me; I completely missed that in the
documentation.

> I think the core problem was that Pod::Man is responsible for the output
> through the file handle and was missing an encoding layer.  The problem is
> that we can't just call encode() on the output, since that breaks if
> PERL_UNICODE is set or if an encoding was manually set on the file handle.
> You get double-encoding.  I think the least bad option is for Pod::Man and
> Pod::Text to force the encoding on their output file handles to UTF-8 when
> --utf8 is given.
>
> The problem with this fix is that this now really will break pod2man
> --utf8 if POD documents don't have their encoding declared properly, since
> it will end up double-encoding the UTF-8 given that, without =encoding,
> Pod::Simple is treating the input as ISO 8859-1.  I think it's correct
> according to the specifications, but existing POD text that doesn't
> declare an encoding will get double-encoded output.  I can work around
> this by not setting a UTF-8 output encoding unless the input encoding is
> detected as UTF-8, but that's not really correct.  You *should* be able to
> have an input POD document with =encoding ISO-8859-1 and run it through
> pod2man --utf8 and get UTF-8 output.  But a POD document with no
> =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.

While this is certainly something extra that people have to bear in mind
when using pod2man --utf8, it *is* an option people have to enable
manually (well, except for in perldoc; I suppose I'm more worried about
generated manual pages), and it doesn't seem too unreasonable to just
say that you have to specify =encoding when doing so. If that were
mentioned explicitly in the pod2man manual page then I think that would
be good enough.

Assuming that your intent is to run with UTF-8 across the board, then
just sticking =encoding UTF-8 at the top of all POD files before
passing them to pod2man is sufficient, and that's not too hard. The diff
to debconf looks like this:

Index: doc/Makefile
===
--- doc/Makefile	(revision 2310)
+++ doc/Makefile	(working copy)
@@ -4,6 +4,9 @@
 pod2man=pod2man -c Debconf -r '' --utf8
 manpages:
 	cd man && po4a po4a/po4a.cfg
+	for pod in man/*.pod; do \
+		perl -pi -e 'if (not $$seen and /^=head1/) { print "=encoding UTF-8\n\n"; $$seen = 1; }' $$pod; \
+	done
 	install -d man/gen
 	for num in 1 3 8; do \
 		find man -maxdepth 1 -type f -name "*.$$num.pod" -printf '%P\n' | \

I'd prefer to do this with a po4a addendum, but it turns out to be an
absolute pain. Also this would break if any of the source documents
contained S<>. Maybe I should just change all the source documents
instead.
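
For reference, the perl -pi one-liner in the Makefile diff above could be
sketched as a standalone function along these lines.  The function name is
hypothetical, and the check for an existing =encoding line is an addition
that the one-liner does not have:

```perl
use strict;
use warnings;

# Hypothetical helper: put "=encoding UTF-8" in front of the first =head1
# of a POD document held in a string.
sub add_encoding_line {
    my ($pod) = @_;
    return $pod if $pod =~ /^=encoding\b/m;    # already declared, leave alone
    # No /g modifier, so only the first =head1 is affected.
    $pod =~ s/^(=head1\b)/=encoding UTF-8\n\n$1/m;
    return $pod;
}

print add_encoding_line("=head1 NOM\n\ndebconf\n\n=head1 SYNOPSIS\n");
```

Running it twice on the same text leaves the second result unchanged, which
makes it safer to call from a build rule that may be re-run.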

Perhaps it would be helpful if po4a inserted an =encoding paragraph?
After all, it understands POD and it knows the encoding.

> I think that for lenny you may want to back out of the --utf8 change and
> give it some time to settle.

Hmm, this would be a shame. With your most recent patch it's now finally
possible for debconf to generate working manual pages for Russian and
French at the same time. I understand the perldoc problem though ...

-- 
Colin Watson   [EMAIL PROTECTED]






Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Niko Tyni
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
>
> > Any estimate on how widespread this POD problem is? Is the hardcoded
> > 'pod2man --utf8' in the Lenny perldoc going to cause more grief than
> > it's worth?

> I looked at this some more, and there's a deeper problem.  If you run the
> current pod2man with --utf8 on an input POD file that doesn't declare an
> =encoding of UTF-8, any use of S<> in that POD file will result in invalid
> UTF-8, even if there's no use of high-bit characters in the input POD at
> all.

Thanks for the follow-up. I see the problem.

 sid% pod2man --utf8 /usr/share/perl/5.10/pod/perlrun.pod|iconv --from utf8 --to latin1 >/dev/null
 iconv: illegal input sequence at position 2097

> I think that for lenny you may want to back out of the --utf8 change and
> give it some time to settle.

Are you referring to backing out the whole Pod::Man update (#480997)
or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?

It looks to me like having the 'pod2man --utf8' option available in
lenny, even if it's broken without '=encoding utf8', is OK, as long
as it's not on by default. The pod2man manual page should probably be
updated to note the issue.

I think reverting just the perldoc change is the least disruptive choice 
for Lenny.
-- 
Niko Tyni   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Russ Allbery
Niko Tyni [EMAIL PROTECTED] writes:

>> I think that for lenny you may want to back out of the --utf8 change and
>> give it some time to settle.

> Are you referring to backing out the whole Pod::Man update (#480997)
> or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?

Sorry, I meant only the pod2man --utf8 change in perldoc.  I think that
the behavior of pod2man, while not ideal, is still basically okay for
lenny, although I'll be releasing a new version of podlators that will
implement the changes described in my previous mail.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-10-01 Thread Colin Watson
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
> > Are you referring to backing out the whole Pod::Man update (#480997)
> > or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?
>
> Sorry, I meant only the pod2man --utf8 change in perldoc.  I think that
> the behavior of pod2man, while not ideal, is still basically okay for
> lenny, although I'll be releasing a new version of podlators that will
> implement the changes described in my previous mail.

I'd support that - after all, for many purposes people can just use man
instead on Debian, and it's not like it's a regression otherwise.

-- 
Colin Watson   [EMAIL PROTECTED]






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-09-30 Thread Colin Watson
found 500210 5.10.0-15
thanks

On Fri, Sep 26, 2008 at 02:04:55PM +0300, Niko Tyni wrote:
> On Thu, Sep 25, 2008 at 11:37:21PM +0200, Gerfried Fuchs wrote:
> > Package: perl-doc
> > Version: 5.10.0-14
> > Severity: normal
>
> >  When running perldoc perlrun I have strange characters in the output of
> > it, and I managed to pin it down to a short POD snippet like this:
>
> > =head1 SYNOPSIS
> >
> > B<perl> S<[ B<-sTtUWX> ]>
>
> Thanks for noticing this. It's fixed in podlators-2.1.3:
>
> * lib/Pod/Man.pm (format_text): Stop remapping the code point for
> non-breaking space.  This should not be necessary and was wrong
> when the string from Pod::Simple was a character string and not a
> byte string.  It was papering over a bug in setting the encoding
> of an input POD file.
>
> Patch from upstream git attached.  I'd certainly like to fix this for
> Lenny, we'll see what the release team thinks.

> diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
> index 38c4e3d..203ef4a 100644
> --- a/lib/Pod/Man.pm
> +++ b/lib/Pod/Man.pm
> @@ -362,13 +362,6 @@ sub format_text {
>  $text =~ s/([^\x00-\x7F])/$ESCAPES{ord ($1)} || "X"/eg;
>  }
>  
> -# For Unicode output, unconditionally remap ISO 8859-1 non-breaking spaces
> -# to the correct code point.  This is really a bug in Pod::Simple to be
> -# embedding ISO 8859-1 characters in the output stream that we see.
> -if ($$self{utf8} && ASCII) {
> -$text =~ s/\xA0/\xC2\xA0/g;
> -}
> -
>  # Ensure that *roff doesn't convert literal quotes to UTF-8 single quotes,
>  # but don't mess up our accept escapes.
>  if ($literal) {

For me, this fixed the case where a 0xA0 byte is embedded essentially
accidentally in the middle of a UTF-8 stream (as happened with debconf's
Russian translations), but it broke the case where 0xA0 is actually
being used as a non-breaking space. Note that I'm using the new 'pod2man
--utf8' option, although presumably so is Gerfried since perldoc now
uses that option automatically.

I've attached debconf.fr.1.pod, which reproduces this problem. Run
'pod2man -c Debconf -r '' --utf8 --section=1 debconf.fr.1.pod', and look
carefully at the line matching purge. It looks like this:

  soient bons et pour que les commandes «?purge?» et «?unregister?» soient

The two characters marked as ? here are the byte 0xA0. The characters
around it are encoded in UTF-8. 0xA0 doesn't decode as UTF-8 so man
assumes that this page must be ISO-8859-1, which means the whole page
comes out misencoded.
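
A quick way to check a line like that for stray bytes, sketched with the
core Encode module (the helper name is hypothetical, and the two sample
strings are hand-built to mimic the line above rather than copied from the
generated page):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK argument may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $good = "\xC2\xAB\xC2\xA0purge\xC2\xA0\xC2\xBB";   # NBSP encoded as C2 A0
my $bad  = "\xC2\xAB\xA0purge\xA0\xC2\xBB";           # bare 0xA0 bytes

printf "good: %d, bad: %d\n", is_valid_utf8($good), is_valid_utf8($bad);
```

The second string fails because 0xA0 on its own is a continuation byte
with no lead byte in front of it.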

Is this because Pod::Man hasn't been told about the encoding of the
input data, perhaps? The input files pretty much have to be in UTF-8 if
you're using --utf8, so do we have to tell perl that with binmode?

I think I've got about as far as I can with this, so CCing Russ for
help. :-)

-- 
Colin Watson   [EMAIL PROTECTED]

*
*   GENERATED FILE, DO NOT EDIT * 
* THIS IS NO SOURCE FILE, BUT RESULT OF COMPILATION *
*

This file was generated by po4a(7). Do not store it (in cvs, for example),
but store the po file used as source file by po4a-translate. 

In fact, consider this as a binary, and the po file as a regular .c file:
If the po get lost, keeping this translation up-to-date will be harder.

=head1 NOM

debconf - Exécuter un programme utilisant debconf

=head1 SYNOPSIS

 debconf [options] commande [args]

=head1 DESCRIPTION

Debconf est un système de configuration pour les paquets Debian. Pour faire
un tour d'horizon de debconf et pour obtenir de la documentation pour les
administrateurs système, veuillez consulter L<debconf(7)>.

Le programme B<debconf> exécute un programme sous contrôle de debconf, en le
configurant pour communiquer avec debconf sur l'entrée et la sortie
standard. La sortie du programme sera l'une des commandes du protocole
debconf, et les codes résultants seront lus sur l'entrée standard. Pour plus
de détails sur le protocole de debconf, veuillez consulter
L<debconf-devel(7)>.

La commande à exécuter depuis debconf doit être spécifiée de manière à ce
qu'elle soit trouvée dans votre PATH.

Cette commande ne reflète pas l'utilisation habituelle de debconf. Il est
courant pour debconf d'être appellé via L<dpkg-preconfigure(8)> ou
L<dpkg-reconfigure(8)>.

=head1 OPTIONS

=over 4

=item B<-o>I<paquet>, B<--owner=>I<paquet>

Indique à debconf à quel paquet appartient la commande exécutée. C'est
nécessaire pour que les droits de propriété des questions enregistrées
soient bons et pour que les commandes S<« purge »> et S<« unregister »> soient
gérées correctement.

=item B<-f>I<type>, B<--frontend=>I<type>

Sélectionner l'interface à utiliser.

=item B<-p>I<valeur>, B<--priority=>I<valeur>

Spécifier la priorité minimale des questions qui vont être posées.

=back

=head1 EXEMPLES

Pour déboguer un script shell qui utilise debconf, vous devriez S<utiliser :>

 DEBCONF_DEBUG=developer debconf 

Bug#500210: perldoc perlrun spits out junk in synopsis

2008-09-30 Thread Russ Allbery
Colin Watson [EMAIL PROTECTED] writes:

> For me, this fixed the case where a 0xA0 byte is embedded essentially
> accidentally in the middle of a UTF-8 stream (as happened with debconf's
> Russian translations), but it broke the case where 0xA0 is actually
> being used as a non-breaking space. Note that I'm using the new 'pod2man
> --utf8' option, although presumably so is Gerfried since perldoc now
> uses that option automatically.
>
> I've attached debconf.fr.1.pod, which reproduces this problem. Run
> 'pod2man -c Debconf -r '' --utf8 --section=1 debconf.fr.1.pod', and look
> carefully at the line matching purge. It looks like this:
>
>   soient bons et pour que les commandes «?purge?» et «?unregister?» soient
>
> The two characters marked as ? here are the byte 0xA0. The characters
> around it are encoded in UTF-8. 0xA0 doesn't decode as UTF-8 so man
> assumes that this page must be ISO-8859-1, which means the whole page
> comes out misencoded.
>
> Is this because Pod::Man hasn't been told about the encoding of the
> input data, perhaps? The input files pretty much have to be in UTF-8 if
> you're using --utf8, so do we have to tell perl that with binmode?

Hi Colin,

You got it exactly right.  Basically, podlators has been papering over
this bug incorrectly, but in a way that happens to do the right thing with
a common POD problem.

Most POD authors from the pre-Unicode days of Perl don't realize this, but
if you use Unicode characters in POD, you have to declare the input
encoding in the POD in order for the results to be reliable and
consistent.  This is actually mentioned in perlpod, but if you were like
me, you haven't read that recently.  :)  I just discovered this myself.

   =encoding encodingname
   This command is used for declaring the encoding of a document.
   Most users won’t need this; but if your encoding isn’t US-ASCII or
   Latin-1, then put a =encoding encodingname command early in the
   document so that pod formatters will know how to decode the
   document.  For encodingname, use a name recognized by the
   Encode::Supported module.

So if you're using UTF-8, starting the POD with:

=encoding UTF-8

is required.  If you add that, the current version of Pod::Man (and
previous versions, as it turns out, mostly by chance) will do the right
thing.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-09-26 Thread Gerfried Fuchs
Package: perl-doc
Version: 5.10.0-14
Severity: normal

Hi!

 When running perldoc perlrun I have strange characters in the output of
it, and I managed to pin it down to a short POD snippet like this:

#v+
=head1 SYNOPSIS

B<perl> S<[ B<-sTtUWX> ]>
#v-

 The S<[ ]> does strange stuff with the spaces it has in there. For
LC_ALL=C it puts a [C2] in front of the space, which turns into a LATIN
CAPITAL LETTER A WITH CIRCUMFLEX in my usual utf8 locale.
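
 One way a stray C2 can appear, as a sketch only (this is not a trace of
what perldoc actually did, and the helper name is made up): NO-BREAK SPACE
is the two bytes C2 A0 in UTF-8, and reading those bytes as ISO-8859-1
gives an extra character in front of the space.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($chars) = @_;
    return join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $chars));
}

my $nbsp_bytes = encode('UTF-8', "\x{A0}");           # NO-BREAK SPACE as UTF-8
my $misread    = decode('ISO-8859-1', $nbsp_bytes);   # same bytes read as Latin-1

printf "bytes:   %vX\n", $nbsp_bytes;
print  "misread: ", code_points($misread), "\n";
```

U+00C2 is LATIN CAPITAL LETTER A WITH CIRCUMFLEX, followed by the
non-breaking space itself at U+00A0.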

 Hope this can be fixed easily; it really looks ugly.
Rhonda

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: powerpc (ppc)

Kernel: Linux 2.6.26-1-powerpc
Locale: LANG=de_AT.UTF-8, LC_CTYPE=de_AT.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages perl-doc depends on:
ii  perl  5.10.0-14  Larry Wall's Practical Extraction 

perl-doc recommends no packages.

Versions of packages perl-doc suggests:
pn  groff   none   (no description available)
ii  konqueror [man-browser] 4:3.5.9.dfsg.1-5 KDE's advanced file manager, web b
ii  man-db [man-browser]2.5.2-3  on-line manual pager

-- no debconf information






Bug#500210: perldoc perlrun spits out junk in synopsis

2008-09-26 Thread Niko Tyni
tag 500210 patch fixed-upstream
thanks

On Thu, Sep 25, 2008 at 11:37:21PM +0200, Gerfried Fuchs wrote:
> Package: perl-doc
> Version: 5.10.0-14
> Severity: normal
>
>  When running perldoc perlrun I have strange characters in the output of
> it, and I managed to pin it down to a short POD snippet like this:
>
> =head1 SYNOPSIS
>
> B<perl> S<[ B<-sTtUWX> ]>

Thanks for noticing this. It's fixed in podlators-2.1.3:

* lib/Pod/Man.pm (format_text): Stop remapping the code point for
non-breaking space.  This should not be necessary and was wrong
when the string from Pod::Simple was a character string and not a
byte string.  It was papering over a bug in setting the encoding
of an input POD file.
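
To illustrate the ChangeLog entry, here is a sketch (not the Pod::Man code
itself; the helper name is hypothetical) of what the old substitution does
when it is handed a character string that is later encoded as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Apply (or skip) the old byte-oriented remap to a *character* string,
# then encode the result as UTF-8, as an output layer would.
sub remap_then_encode {
    my ($chars, $remap) = @_;
    $chars =~ s/\xA0/\xC2\xA0/g if $remap;
    return encode('UTF-8', $chars);
}

my $text = "one\x{A0}two";    # "one", NO-BREAK SPACE, "two"

printf "without remap: %vX\n", remap_then_encode($text, 0);
printf "with remap:    %vX\n", remap_then_encode($text, 1);
```

Without the remap the non-breaking space is encoded as C2 A0.  With it, the
inserted \xC2 is itself a character (U+00C2) and gets encoded too, so the
output contains C3 82 C2 A0.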

Patch from upstream git attached.  I'd certainly like to fix this for
Lenny, we'll see what the release team thinks.
-- 
Niko Tyni   [EMAIL PROTECTED]
diff --git a/ChangeLog b/ChangeLog
index f0c727e..25850bc 100644
diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 38c4e3d..203ef4a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -362,13 +362,6 @@ sub format_text {
 $text =~ s/([^\x00-\x7F])/$ESCAPES{ord ($1)} || "X"/eg;
 }
 
-# For Unicode output, unconditionally remap ISO 8859-1 non-breaking spaces
-# to the correct code point.  This is really a bug in Pod::Simple to be
-# embedding ISO 8859-1 characters in the output stream that we see.
-if ($$self{utf8} && ASCII) {
-$text =~ s/\xA0/\xC2\xA0/g;
-}
-
 # Ensure that *roff doesn't convert literal quotes to UTF-8 single quotes,
 # but don't mess up our accept escapes.
 if ($literal) {