Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
Tags: -1 - patch

On Mon, May 23, 2011 at 04:05:11PM -0700, Russ Allbery wrote:
> Niko Tyni nt...@debian.org writes:
>
> > It's clearly still true, and I can't see any fix for it other than adding =encoding utf8 lines in the POD files where necessary. However, I think all the documents that are rendered incorrectly with --utf8 are already rendered incorrectly now, albeit in a different way. See below.
>
> Yes, without the --utf8 option, pod2man assumes that it can only use 7-bit ASCII, and hence mangles non-ASCII characters pretty badly. This is required for completely portable *roff output, since high-bit characters can even cause segfaults on some really old, broken *roff implementations.
>
> But this is probably now too conservative. I think the default, if --utf8 is not given, should probably be to just encode output in whatever the default local locale is and assume that people will do something else if they have to generate *roff that works on old, broken systems. I'm not sure what to do if that locale is C, though.

Niko's patch to use pod2man --utf8 was applied (and then the code was rewritten...). As we have seen during the perl 5.18 rebuild testing, missing =encoding is now a fatal error.

I think these points mean that this bug is essentially fixed with Debian (experimental) and should be closed. I will aim to verify this using the test case provided by the original submitter before closing this bug (I don't have access to a suitable test system at the moment, but I wanted to record this on the bug report whilst at least some of the details were in my head).

--
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
forwarded 492037 http://rt.perl.org/rt3/Public/Bug/Display.html?id=78332
thanks

On Sun, May 22, 2011 at 09:01:23AM +0300, Niko Tyni wrote:
> On Sat, May 21, 2011 at 03:56:16PM +0100, Dominic Hargreaves wrote:
>
> > As far as I can see, pod2man --utf8 now exists, but will not render all documents correctly - possibly =encoding UTF8 is needed for this to work. Is this statement still true, or has any progress happened since the last message on this bug which I've missed?
>
> It's clearly still true, and I can't see any fix for it other than adding =encoding utf8 lines in the POD files where necessary. However, I think all the documents that are rendered incorrectly with --utf8 are already rendered incorrectly now, albeit in a different way. See below.

[snip]

> A quick check [1] on my system gives 26 files in /usr/share/perl5 that use UTF-8 characters in the POD part but don't declare an =encoding utf-8. All of them that I checked have broken manpages already (except Spiffy.pm which has been fixed with a hack, see #441828.) The proposed change of using --utf8 by default would just break these in a different way AFAICS.

Okay, so the patch is still okay to propose upstream, at least.

> (This looks like something lintian could detect.)
>
> [1] find . -name '*.pm' -o -name '*.pod' | while read i; do if ! podselect $i | perl -ne '$e++ if /^=encoding/; exit 1 if /[\200-\377]/ && !$e' && iconv -f utf8 -t utf8 $i >/dev/null 2>&1; then echo $i; fi; done
>
> > Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000 still has the patch from Niko with no further comment, so once we understand the current situation it would probably make sense to comment on that bug, to avoid anyone taking that and repeating work.
>
> I see Porting/Maintainers.pl says blead is upstream for Pod-Perldoc, so I seem to have filed the above ticket in a wrong place.

Okay, moved.

Dominic.
--
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni nt...@debian.org writes:

> It's clearly still true, and I can't see any fix for it other than adding =encoding utf8 lines in the POD files where necessary. However, I think all the documents that are rendered incorrectly with --utf8 are already rendered incorrectly now, albeit in a different way. See below.

Yes, without the --utf8 option, pod2man assumes that it can only use 7-bit ASCII, and hence mangles non-ASCII characters pretty badly. This is required for completely portable *roff output, since high-bit characters can even cause segfaults on some really old, broken *roff implementations.

But this is probably now too conservative. I think the default, if --utf8 is not given, should probably be to just encode output in whatever the default local locale is and assume that people will do something else if they have to generate *roff that works on old, broken systems. I'm not sure what to do if that locale is C, though.

--
Russ Allbery (r...@stanford.edu) http://www.eyrie.org/~eagle/
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
On Sat, May 21, 2011 at 03:56:16PM +0100, Dominic Hargreaves wrote:

> As far as I can see, pod2man --utf8 now exists, but will not render all documents correctly - possibly =encoding UTF8 is needed for this to work. Is this statement still true, or has any progress happened since the last message on this bug which I've missed?

It's clearly still true, and I can't see any fix for it other than adding =encoding utf8 lines in the POD files where necessary. However, I think all the documents that are rendered incorrectly with --utf8 are already rendered incorrectly now, albeit in a different way. See below.

Incorrect (double encoded) output with a missing =encoding utf8:

  perl -CO -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'
  .SH ä

Correct output:

  perl -CO -Mcharnames=:full -E 'say qq(=encoding utf8\n\n=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'
  .SH ä

Current behaviour for UTF-8 with a missing =encoding utf8 is just as broken:

  perl -CO -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man | grep '^\.SH'
  .SH A\*~X

and pure latin1 without an =encoding works with both of course:

  perl -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man | grep '^\.SH'
  .SH a\*:

  perl -Mcharnames=:full -E 'say qq(=head1 \N{LATIN SMALL LETTER A WITH DIAERESIS}\n)' | pod2man --utf8 | grep '^\.SH'
  .SH ä

A quick check [1] on my system gives 26 files in /usr/share/perl5 that use UTF-8 characters in the POD part but don't declare an =encoding utf-8. All of them that I checked have broken manpages already (except Spiffy.pm which has been fixed with a hack, see #441828.) The proposed change of using --utf8 by default would just break these in a different way AFAICS.

(This looks like something lintian could detect.)

[1] find . -name '*.pm' -o -name '*.pod' | while read i; do if ! podselect $i | perl -ne '$e++ if /^=encoding/; exit 1 if /[\200-\377]/ && !$e' && iconv -f utf8 -t utf8 $i >/dev/null 2>&1; then echo $i; fi; done

> Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000 still has the patch from Niko with no further comment, so once we understand the current situation it would probably make sense to comment on that bug, to avoid anyone taking that and repeating work.

I see Porting/Maintainers.pl says blead is upstream for Pod-Perldoc, so I seem to have filed the above ticket in a wrong place.

--
Niko Tyni nt...@debian.org
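The double encoding reported in this message can be reproduced at the byte level without pod2man. The following is a minimal sketch, assuming a POSIX shell with iconv and od available; the variable names are illustrative only. It takes the UTF-8 bytes of U+00E4, treats them as ISO-8859-1 (which is how POD input is read when no =encoding is declared), and encodes them to UTF-8 a second time.

```shell
# UTF-8 encoding of U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS): bytes C3 A4.
correct=$(printf '\303\244' | od -An -tx1 | tr -d ' \n')

# Misread those two bytes as ISO-8859-1 characters and encode them as UTF-8
# again: each input byte becomes a two-byte sequence.
double=$(printf '\303\244' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "encoded once:  $correct"
echo "encoded twice: $double"
```

The four-byte result is what a UTF-8 terminal displays as two unrelated Latin-1 characters in place of the intended letter.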
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
> Niko Tyni nt...@debian.org writes:
>
> > Any estimate on how widespread this POD problem is? Is the hardcoded 'pod2man --utf8' in the Lenny perldoc going to cause more grief than it's worth? I'm leaning on reverting that and reopening #492037 until the issue is sorted out in Pod-Perldoc upstream. Adding a way to enable or disable the '--utf8' option on the perldoc command line is one possibility, but it might as well cause even further trouble if upstream chooses a different implementation.
>
> I looked at this some more, and there's a deeper problem. If you run the current pod2man with --utf8 on an input POD file that doesn't declare an =encoding of UTF-8, any use of S<> in that POD file will result in invalid UTF-8, even if there's no use of high-bit characters in the input POD at all.
>
> I think the core problem was that Pod::Man is responsible for the output through the file handle and was missing an encoding layer. The problem is that we can't just call encode() on the output, since that breaks if PERL_UNICODE is set or if an encoding was manually set on the file handle. You get double-encoding. I think the least bad option is for Pod::Man and Pod::Text to force the encoding on their output file handles to UTF-8 when --utf8 is given.
>
> The problem with this fix is that this now really will break pod2man --utf8 if POD documents don't have their encoding declared properly, since it will end up double-encoding the UTF-8 given that, without =encoding, Pod::Simple is treating the input as ISO 8859-1. I think it's correct according to the specifications, but existing POD text that doesn't declare an encoding will get double-encoded output.
>
> I can work around this by not setting a UTF-8 output encoding unless the input encoding is detected as UTF-8, but that's not really correct. You *should* be able to have an input POD document with =encoding ISO-8859-1 and run it through pod2man --utf8 and get UTF-8 output. But a POD document with no =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.
>
> Pod::Text has an additional challenge. pod2man won't produce any non-ASCII characters without --utf8 and has been that way since the beginning of the Pod::Simple implementation. pod2text, on the other hand, always passed through whatever it got. I could just leave it alone, but if you feed the current pod2text a document that *does* have =encoding UTF-8 in it, you get Perl warnings about wide characters on output. I think the best solution here is to force the output file handle to have an encoding matching what Pod::Simple believes the input encoding is. This comes the closest to preserving the traditional pass-through behavior.
>
> I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.

[the --utf8 change being the change to have perldoc run pod2man with the --utf8 option by default].

I've spent a bit of time reading through #492037 (this bug) and #480997 (which was resolved) trying to figure out how to progress this issue.

As far as I can see, pod2man --utf8 now exists, but will not render all documents correctly - possibly =encoding UTF8 is needed for this to work. Is this statement still true, or has any progress happened since the last message on this bug which I've missed?

Note that http://rt.cpan.org/Public/Bug/Display.html?id=39000 still has the patch from Niko with no further comment, so once we understand the current situation it would probably make sense to comment on that bug, to avoid anyone taking that and repeating work.

Dominic.

--
Dominic Hargreaves | http://www.larted.org.uk/~dom/
PGP key 5178E2A5 from the.earth.li (keyserver,web,email)
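The point about S<> producing invalid UTF-8 even for pure ASCII input can be illustrated with a small sketch. It assumes, as reported elsewhere in this thread, that the space inside S<> is emitted as a NO-BREAK SPACE written as the single Latin-1 byte 0xA0; iconv is used here only as a convenient UTF-8 validator, and the variable names are illustrative.

```shell
# "one<NBSP>two" with the no-break space as the single byte 0xA0 (octal 240).
# A lone 0xA0 is a continuation byte with no lead byte, so this is not UTF-8.
if printf 'one\240two\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    latin1_nbsp=valid
else
    latin1_nbsp=invalid
fi

# The same text with U+00A0 properly encoded as UTF-8 (bytes C2 A0).
if printf 'one\302\240two\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    utf8_nbsp=valid
else
    utf8_nbsp=invalid
fi

echo "NBSP as Latin-1 byte: $latin1_nbsp"
echo "NBSP as UTF-8 bytes:  $utf8_nbsp"
```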
Bug#500210: perldoc perlrun spits out junk in synopsis
On Fri, Oct 03, 2008 at 01:49:16PM +0100, Colin Watson wrote:

> Due to groff's inability to take Unicode input in most cases at the moment, man needs to know the language of the manual page in order to recode it back to a legacy encoding for formatting by groff. It does this either by relying on it being in a directory structure that looks like that used for translated manual pages, or else by guessing based on the locale.

Ah, thanks and sorry for not doing my homework.

It's all looking good to me now, so I intend to upload the Pod::Man binmode() fix as 5.10.0-16.

--
Niko Tyni [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes:

> OK, it's your call of course. Both patches have the one important property: they can't break anything when the utf8 option isn't used. I don't see any point in diverging, so I think the attached patch (based on yours but cleansed of the unrelated stopword changes) is the best choice for Lenny.

As mentioned in the debian-release response, it looks good to me.

I've been thinking about how best to fix this in the long run, and I think that there are ways of handling this so that the caller doesn't have to be aware of the encoding of the output file handle but also without overwriting global state. I just haven't had a chance to implement it yet, but I think I can preserve the API presented by setting the encoding on the file handle in future versions.

The basic idea will be to use PerlIO to probe the encoding layers on the output file handle. If PerlIO isn't available, or if PerlIO reveals no encoding layers, then Pod::Man and Pod::Text can just call encode before printing the output. If PerlIO is available with an encoding layer, and the encoding layer is UTF-8, we can continue without doing any encoding ourselves (and this will catch the PERL_UNICODE case). If the encoding layer is something incompatible, I'll probably throw an error in the event that utf8 is set and otherwise trust it.

Not sure when I'll get to implementing this, since I have a vacation coming up (and it's too late for lenny anyway), but definitely for squeeze we should be able to clean this up further.

Thank you very much for all of your work on this.

--
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Bug#500210: perldoc perlrun spits out junk in synopsis
On Thu, Oct 02, 2008 at 03:59:45PM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
>
> > So the output is ISO-8859-1 where possible and UTF-8 elsewhere.
>
> > Russ, I think the binmode($output, ":utf8") really belongs in pod2man instead of Pod::Man.
>
> It turns out, at least based on the experiments that I did, that you never want to use an encoding of :utf8. What this does is tell Perl to just dump its internal encoding to the file handle rather than applying any encoding. The only supported thing you can do with that byte stream is to read it back in via another file handle using the :utf8 encoding. It is *not* necessarily valid UTF-8, and in practice I was getting all sorts of really strange things from it when looking at it via something other than Perl. You always want to use :encoding(utf-8) instead if the output is for anything other than Perl.

I see. The Perl internal encoding is UTF-8, but there are ways to get invalid UTF-8 in there, for example by using :utf8 on binary input. This invalid UTF-8 will then be output as-is if :utf8 is set on output.

I can't really think of a case where setting :encoding(utf-8) on output does the right thing but :utf8 doesn't. It does turn the output into valid UTF-8, but do you have an example where the content is not gibberish?

On the input side, :encoding(utf-8) is indeed probably the better choice because it will croak when it encounters invalid bytes.

> > Users of Pod::Man should do that themselves for their output file handle when they use the 'utf8' option. (This needs documentation, of course.)
>
> I'm not sure I like this as an interface since Pod::Man's supported interface involves opening the files itself. This would mean that anyone who wants Unicode output can't use the API of Pod::Man and Pod::Text that have been supported for years. I'd really rather try to transparently support Unicode using the existing API, even if it means messing with the state of provided output file handles.

How about providing your own parse_from_file() wrapper in Pod::Man that knows about the utf8 option, does the open() and then sets the binmode? I don't think there's any need to touch the filehandles of people using parse_file().

> > However, pod2man currently uses the parse_from_file() method, which is just a compatibility wrapper in Pod::Simple that does the open() and output_fh() calls. I suppose this should go in pod2man itself. Something like the attached patch might do, although I see there's some deeper magic in Pod::Simple.
>
> This patch looks fine to me as a workaround, although I think my previous patch is the better long-term fix.

OK. I'll use this (with :encoding(utf-8)) for lenny if no further showstoppers come up.

> Note that Pod::Text has related issues; try running pod2text on your same sample POD file and you'll see that it produces warnings about wide characters as well. I'm not sure if that's worth trying to tackle for lenny, though (it affects perldoc -t).

Yes, I think we should leave that alone at this point.

--
Niko Tyni [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes:

> I see. The Perl internal encoding is UTF-8, but there are ways to get invalid UTF-8 in there, for example by using :utf8 on binary input. This invalid UTF-8 will then be output as-is if :utf8 is set on output.

> I can't really think of a case where setting :encoding(utf-8) on output does the right thing but :utf8 doesn't. It does turn the output into valid UTF-8, but do you have an example where the content is not gibberish?

I can't easily duplicate what I was seeing now, but I was getting output that was not UTF-8 while using that output encoding in combination with Pod::Simple. I'm not quite sure what was going on. It's possible that I had made some mistake in the middle of my testing, though.

> How about providing your own parse_from_file() wrapper in Pod::Man that knows about the utf8 option, does the open() and then sets the binmode?

I guess I could do that, but I think I disagree with this:

> I don't think there's any need to touch the filehandles of people using parse_file().

I would prefer not to touch the filehandle, but I don't think it's acceptable to say that if you're using the utf8 option, you still have to set up output encodings yourself. Maybe I'm overreacting to how difficult I found this area of Perl to understand, but I'd really rather that Pod::Man and Pod::Text do the right thing without requiring people understand Perl's very strange Unicode handling.

Pod::Text also has to do something more complex in order to preserve its traditional encoding agnosticism if utf8 is not given.

--
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Bug#500210: perldoc perlrun spits out junk in synopsis
On Fri, Oct 03, 2008 at 02:19:52AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
>
> > I don't think there's any need to touch the filehandles of people using parse_file().
>
> I would prefer not to touch the filehandle, but I don't think it's acceptable to say that if you're using the utf8 option, you still have to set up output encodings yourself. Maybe I'm overreacting to how difficult I found this area of Perl to understand, but I'd really rather that Pod::Man and Pod::Text do the right thing without requiring people understand Perl's very strange Unicode handling.

OK, it's your call of course. Both patches have the one important property: they can't break anything when the utf8 option isn't used. I don't see any point in diverging, so I think the attached patch (based on yours but cleansed of the unrelated stopword changes) is the best choice for Lenny.

I have also dropped the documentation change proposed earlier because it doesn't apply any more. This and reverting the perldoc --utf8 change are the only differences from -15.

Russ, please let me know if shipping this in Lenny would be OK by you. Note that I set $Pod::Man::VERSION to 2.18_01 to emphasize that this isn't any of the official versions.

Colin, I can't really get man to work with Cyrillic documents. As an example, the attached ru.pod from #492037 looks fine after 'pod2man --utf8', but 'man -l ru.man' just drops all the Cyrillic characters. Any ideas? Is this supposed to work at all?

The same applies to debconf.ru.1.pod (which needs an =encoding koi8-r at the top first.)

--
Niko Tyni [EMAIL PROTECTED]

Make Pod::Man use the PerlIO UTF-8 output layer when used with the --utf8 option.

Modified from upstream patch in Debian bug #500210.
diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 961171b..d87cb8c 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -36,7 +36,8 @@ use POSIX qw(strftime);
 
 @ISA = qw(Pod::Simple);
 
-$VERSION = '2.18';
+# Custom Debian version, see http://bugs.debian.org/500210
+$VERSION = '2.18_01';
 
 # Set the debugging level. If someone has inserted a debug function into this
 # class already, use that. Otherwise, use any Pod::Simple debug function
@@ -731,6 +732,19 @@ sub start_document {
         return;
     }
 
+    # If we were given the utf8 option, set an output encoding on our file
+    # handle. Wrap in an eval in case we're using a version of Perl too old
+    # to understand this.
+    #
+    # This is evil because it changes the global state of a file handle that
+    # we may not own. However, we can't just blindly encode all output, since
+    # there may be a pre-applied output encoding (such as from PERL_UNICODE)
+    # and then we would double-encode. This seems to be the least bad
+    # approach.
+    if ($$self{utf8}) {
+        eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+    }
+
     # Determine information for the preamble and then output it.
     my ($name, $section);
     if (defined $$self{name}) {
@@ -1592,6 +1606,12 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1. POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded. See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =back
 
 The standard Pod::Simple method parse_file() takes one argument naming the
@@ -1627,6 +1647,12 @@ invalid. A quote specification must be one, two, or four characters long.
 
 =head1 BUGS
 
+Encoding handling assumes that PerlIO is available and does not work
+properly if it isn't since encode and decode do not work well in
+combination with PerlIO encoding layers. It's very unclear how to
+correctly handle this without PerlIO encoding layers. The C<utf8> option
+is therefore not supported unless Perl is built with PerlIO support.
+
 There is currently no way to turn off the guesswork that tries to format
 unmarked text appropriately, and sometimes it isn't wanted (particularly
 when using POD to document something other than Perl). Most of the work
@@ -1652,6 +1678,13 @@ Pod::Man is excessively slow.
 
 =head1 CAVEATS
 
+If Pod::Man is given the C<utf8> option, the encoding of its output file
+handle will be forced to UTF-8 if possible, overriding any existing
+encoding. This will be done even if the file handle is not created by
+Pod::Man and was passed in from outside. This seems to be the only way to
+consistently enforce UTF-8-encoded output regardless of PERL_UNICODE and
+other settings.
+
 The handling of hyphens and em dashes is somewhat fragile, and one may get
 the wrong one under some circumstances. This should only matter for
 B<troff> output.
diff --git a/pod/pod2man.PL b/pod/pod2man.PL
index
Bug#500210: perldoc perlrun spits out junk in synopsis
On Fri, Oct 03, 2008 at 03:28:28PM +0300, Niko Tyni wrote:

> Colin, I can't really get man to work with cyrillic documents. As an example, the attached ru.pod from #492037 looks fine after 'pod2man --utf8', but 'man -l ru.man' just drops all the cyrillic characters. Any ideas? Is this supposed to work at all?

Due to groff's inability to take Unicode input in most cases at the moment, man needs to know the language of the manual page in order to recode it back to a legacy encoding for formatting by groff. It does this either by relying on it being in a directory structure that looks like that used for translated manual pages, or else by guessing based on the locale. Thus, you can either:

  mkdir -p man/ru/man1
  mv ru.man man/ru/man1/
  man -l man/ru/man1/ru.man

or:

  # make sure the ru_RU.UTF-8 locale is generated
  LC_ALL=ru_RU.UTF-8 man -l ru.man

If groff Unicode support ever gets finished (it's been getting asymptotically close, with only one major special-purpose piece left in order to avoid important regressions for Japanese users) then this should go away.

> The same applies to debconf.ru.1.pod (which needs an =encoding koi8-r at the top first.)

Actually I was just going to recode all that to UTF-8, as is done in debconf's Subversion repository.

Cheers,

--
Colin Watson [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
>
> > > I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.
>
> > Are you referring to backing out the whole Pod::Man update (#480997) or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?
>
> Sorry, I meant only the pod2man --utf8 change in perldoc. I think that the behavior of pod2man, while not ideal, is still basically okay for lenny, although I'll be releasing a new version of podlators that will implement the changes described in my previous mail.

Thanks, we're on the same page then. I'll revert the perldoc change.

Does this documentation patch for lenny look OK to you? (I suppose it should be duplicated in pod2man.PL too.)

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 961171b..a3eac5a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -1592,6 +1592,13 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that using this option currently only works properly on UTF-8
+encoded POD files that use the C<=encoding> POD command. If the option
+is enabled on an input POD file that doesn't declare an =encoding of
+UTF-8, any use of S<> in that POD file will result in invalid UTF-8,
+even if there's no use of high-bit characters in the input POD at all.
+This is a bug that will be fixed in later versions.
+
 =back
 
 The standard Pod::Simple method parse_file() takes one argument naming the
@@ -1627,6 +1634,9 @@ invalid. A quote specification must be one, two, or four characters long.
 
 =head1 BUGS
 
+As mentioned earlier in this document, the C<utf8> option is currently
+broken on non-UTF-8 input.
+
 There is currently no way to turn off the guesswork that tries to format
 unmarked text appropriately, and sometimes it isn't wanted (particularly
 when using POD to document something other than Perl). Most of the work

--
Niko Tyni [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes:

> Thanks, we're on the same page then. I'll revert the perldoc change.

> Does this documentation patch for lenny look OK to you?

Yup, this looks good to me.

> (I suppose it should be duplicated in pod2man.PL too.)

Yes.

--
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote:
> Niko Tyni [EMAIL PROTECTED] writes:
>
> > > I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.
>
> > Are you referring to backing out the whole Pod::Man update (#480997) or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?
>
> Sorry, I meant only the pod2man --utf8 change in perldoc. I think that the behavior of pod2man, while not ideal, is still basically okay for lenny, although I'll be releasing a new version of podlators that will implement the changes described in my previous mail.

Hm, this is looking worse the more I stare at it.

I've been testing pod2man with the attached .pod file that does have '=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man --utf8' gives these results:

- the Finnish a with two dots, i.e. LATIN SMALL LETTER A WITH DIAERESIS, is output as its ISO-8859-1 representation (octal 344)
- the Russian letter n, CYRILLIC SMALL LETTER EN, is output in UTF-8: octal 320+275. However, there's a warning: Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717.
- S<one two> gets the ISO-8859-1 NO-BREAK SPACE in between

So the output is ISO-8859-1 where possible and UTF-8 elsewhere. I really don't think this is acceptable. The pod2man output will almost never be valid UTF-8.

Russ, I think the binmode($output, ":utf8") really belongs in pod2man instead of Pod::Man. Users of Pod::Man should do that themselves for their output file handle when they use the 'utf8' option. (This needs documentation, of course.)

However, pod2man currently uses the parse_from_file() method, which is just a compatibility wrapper in Pod::Simple that does the open() and output_fh() calls. I suppose this should go in pod2man itself. Something like the attached patch might do, although I see there's some deeper magic in Pod::Simple.

This still doesn't break anything not explicitly using the '--utf8' option, so I suppose we could get it in lenny...

Comments welcome.

--
Niko Tyni [EMAIL PROTECTED]

=encoding UTF-8

=head1 a with two dots

ä

=head1 russian letter n

н

=head1 non-breaking spaces

S<one two>

diff --git a/pod/pod2man.PL b/pod/pod2man.PL
index 3abb658..a9b5b67 100644
--- a/pod/pod2man.PL
+++ b/pod/pod2man.PL
@@ -89,7 +89,22 @@ my @files;
 do {
     @files = splice (@ARGV, 0, 2);
     print "  $files[1]\n" if $verbose;
-    $parser->parse_from_file (@files);
+    if ($options{utf8}) {
+        my ($in, $out) = (*STDIN, *STDOUT);
+        $in = $files[0] if @files;
+        if (@files == 2) {
+            open($out, ">", $files[1])
+                or die("open $files[1] for writing: $!");
+        } else {
+            $out = *STDOUT;
+        }
+        binmode($out, ":utf8");
+        $parser->output_fh($out);
+        $parser->parse_file($in);
+        close $out;
+    } else {
+        $parser->parse_from_file (@files);
+    }
 } while (@ARGV);
 
 __END__
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes: Hm, this is looking worse the more I stare at it. I spent four and a half hours on this the other night before producing the patch that was in my previous message, so I'm sympathetic. :) It gets to be more and more of a headache the more you work through it. I've been testing pod2man with the attached .pod file that does have '=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man --utf8' gives these results: - the Finnish a with two dots, i.e. LATIN SMALL LETTER A WITH DIAERESIS, is output as its ISO-8859-1 representation (octal 344) - the Russian letter n, CYRILLIC SMALL LETTER EN, is output in UTF-8: octal 320+275. However, there's a warning: Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717. - Sone two gets the ISO-8859-1 NO-BREAK SPACE in between So the output is ISO-8859-1 where possible and UTF-8 elsewhere. I was afraid of this. The problem is that the version of Pod::Man that you have at the moment doesn't understand anything about output encoding. It therefore prints out whatever Pod::Simple hands it. This is, in Perl's Unicode world, basically unsupported behavior. What you get can be fairly random. It works in some cases but doesn't work in others. This is the reason why I thought I needed to do things like remap the non-breaking space. The output is very confused, and I didn't understand at first what I was seeing. If one is dealing with Unicode in Perl, one is *required* to decode all input and encode all output. Nothing else works. Pod::Simple does decode input *if* =encoding is used, but doesn't encode output. Pod::Man (and Pod::Text for that matter) therefore have to encode output in order to work properly. The patch I sent previously does implement that, with some other consequences. 
One of the problems that makes this unnecessarily hard is that Perl doesn't keep track of whether it's *already* encoded output, so if you set an output encoding with binmode and also call encode() directly, you get double-encoded output. This basically means that, in practice, encode() is unusable if you want to support the PERL_UNICODE environment variable, since setting PERL_UNICODE silently adds output encodings to all your file handles which will then happily double-encode the results of encode().

Russ, I think the binmode($output, ":utf8") really belongs in pod2man instead of Pod::Man.

It turns out, at least based on the experiments that I did, that you never want to use an encoding of :utf8. What this does is tell Perl to just dump its internal encoding to the file handle rather than applying any encoding. The only supported thing you can do with that byte stream is to read it back in via another file handle using the :utf8 encoding. It is *not* necessarily valid UTF-8, and in practice I was getting all sorts of really strange things from it when looking at it via something other than Perl. You always want to use :encoding(utf-8) instead if the output is for anything other than Perl.

Users of Pod::Man should do that themselves for their output file handle when they use the 'utf8' option. (This needs documentation, of course.)

I'm not sure I like this as an interface since Pod::Man's supported interface involves opening the files itself. This would mean that anyone who wants Unicode output can't use the API of Pod::Man and Pod::Text that have been supported for years. I'd really rather try to transparently support Unicode using the existing API, even if it means messing with the state of provided output file handles.

However, pod2man currently uses the parse_from_file() method, which is just a compatibility wrapper in Pod::Simple that does the open() and output_fh() calls. I suppose this should go in pod2man itself.
Something like the attached patch might do, although I see there's some deeper magic in Pod::Simple. This patch looks fine to me as a workaround, although I think my previous patch is the better long-term fix. Note that Pod::Text has related issues; try running pod2text on your same sample POD file and you'll see that it produces warnings about wide characters as well. I'm not sure if that's worth trying to tackle for lenny, though (it affects perldoc -t). -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/ -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
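The double-encoding problem described in this message can be reproduced in a few lines. The following is a minimal sketch, not code from podlators: the helper names bytes_written and hex_bytes are made up for illustration, and an in-memory file handle stands in for a real output file. It writes a NO-BREAK SPACE once through an :encoding(UTF-8) layer, and once after also calling encode() on it first.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Write $string to an in-memory file handle opened with the given layer
# and return the raw bytes that ended up in the "file".
# (Hypothetical helper, for illustration only.)
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";
    return $buffer;
}

# Render a byte string as space-separated hex for easy comparison.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $nbsp = "\x{A0}";    # NO-BREAK SPACE as a one-character string

# Encoded once, by the handle's layer: the correct UTF-8 form.
my $once = bytes_written(':encoding(UTF-8)', $nbsp);

# Encoded by encode() and then again by the layer: double-encoded.
my $twice = bytes_written(':encoding(UTF-8)', encode('UTF-8', $nbsp));

print 'once:  ', hex_bytes($once),  "\n";    # once:  C2 A0
print 'twice: ', hex_bytes($twice), "\n";    # twice: C3 82 C2 A0
```

The second result is what a module would produce if it called encode() itself while the caller (or PERL_UNICODE) had already put an encoding layer on the handle, which is the reason given above for setting the layer rather than encoding the output by hand.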
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
On Tue, Sep 30, 2008 at 08:22:35PM -0700, Russ Allbery wrote: You got it exactly right. Basically, podlators has been papering over this bug incorrectly, but in a way that happens to do the right thing with a common POD problem. So if you're using UTF-8, starting the POD with: =encoding UTF-8 is required. If you add that, the current version of Pod::Man (and previous versions, as it turns out, mostly by chance) will do the right thing. Any estimate on how widespread this POD problem is? Is the hardcoded 'pod2man --utf8' in the Lenny perldoc going to cause more grief than it's worth? I'm leaning on reverting that and reopening #492037 until the issue is sorted out in Pod-Perldoc upstream. Adding a way to enable or disable the '--utf8' option on the perldoc command line is one possibility, but it might as well cause even further trouble if upstream chooses a different implementation. -- Niko Tyni [EMAIL PROTECTED] -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes:

Any estimate on how widespread this POD problem is? Is the hardcoded 'pod2man --utf8' in the Lenny perldoc going to cause more grief than it's worth? I'm leaning on reverting that and reopening #492037 until the issue is sorted out in Pod-Perldoc upstream. Adding a way to enable or disable the '--utf8' option on the perldoc command line is one possibility, but it might as well cause even further trouble if upstream chooses a different implementation.

I looked at this some more, and there's a deeper problem. If you run the current pod2man with --utf8 on an input POD file that doesn't declare an =encoding of UTF-8, any use of S<> in that POD file will result in invalid UTF-8, even if there's no use of high-bit characters in the input POD at all.

I think the core problem was that Pod::Man is responsible for the output through the file handle and was missing an encoding layer. The problem is that we can't just call encode() on the output, since that breaks if PERL_UNICODE is set or if an encoding was manually set on the file handle. You get double-encoding.

I think the least bad option is for Pod::Man and Pod::Text to force the encoding on their output file handles to UTF-8 when --utf8 is given.

The problem with this fix is that this now really will break pod2man --utf8 if POD documents don't have their encoding declared properly, since it will end up double-encoding the UTF-8 given that, without =encoding, Pod::Simple is treating the input as ISO 8859-15. I think it's correct according to the specifications, but existing POD text that doesn't declare an encoding will get double-encoded output.

I can work around this by not setting a UTF-8 output encoding unless the input encoding is detected as UTF-8, but that's not really correct. You *should* be able to have an input POD document with =encoding ISO-8859-1 and run it through pod2man --utf8 and get UTF-8 output. But a POD document with no =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.

Pod::Text has an additional challenge. pod2man won't produce any non-ASCII characters without --utf8 and has been that way since the beginning of the Pod::Simple implementation. pod2text, on the other hand, always passed through whatever it got. I could just leave it alone, but if you feed the current pod2text a document that *does* have =encoding UTF-8 in it, you get Perl warnings about wide characters on output.

I think the best solution here is to force the output file handle to have an encoding matching what Pod::Simple believes the input encoding is. This comes the closest to preserving the traditional pass-through behavior.

I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.

Here's the patch that I'm planning on including in the next podlators release, for reference.

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 48fe20e..b5aceef 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -36,7 +36,7 @@ use POSIX qw(strftime);
 
 @ISA = qw(Pod::Simple);
 
-$VERSION = '2.20';
+$VERSION = '2.21';
 
 # Set the debugging level.  If someone has inserted a debug function into this
 # class already, use that.  Otherwise, use any Pod::Simple debug function
@@ -736,6 +736,19 @@ sub start_document {
         return;
     }
 
+    # If we were given the utf8 option, set an output encoding on our file
+    # handle.  Wrap in an eval in case we're using a version of Perl too old
+    # to understand this.
+    #
+    # This is evil because it changes the global state of a file handle that
+    # we may not own.  However, we can't just blindly encode all output, since
+    # there may be a pre-applied output encoding (such as from PERL_UNICODE)
+    # and then we would double-encode.  This seems to be the least bad
+    # approach.
+    if ($$self{utf8}) {
+        eval { binmode ($$self{output_fh}, ':encoding(UTF-8)') };
+    }
+
     # Determine information for the preamble and then output it.
     my ($name, $section);
     if (defined $$self{name}) {
@@ -1450,8 +1463,8 @@ Pod::Man - Convert POD data to formatted *roff input
 
 =for stopwords
 en em ALLCAPS teeny fixedbold fixeditalic fixedbolditalic stderr utf8
-UTF-8 Allbery Sean Burke Ossanna Solaris formatters troff uppercased
-Christiansen
+UTF-8 UTF-8-encoded Allbery Sean Burke Ossanna Solaris formatters troff
+uppercased Christiansen
 
 =head1 SYNOPSIS
 
@@ -1608,6 +1621,12 @@ be warned that *roff source with literal UTF-8 characters is not supported
 by many implementations and may even result in segfaults and other bad
 behavior.
 
+Be aware that, when using this option, the input encoding of your POD
+source must be properly declared unless it is US-ASCII or Latin-1.  POD
+input without an C<=encoding> command will be assumed to be in Latin-1,
+and if it's actually in UTF-8, the output will be double-encoded.  See
+L<perlpod(1)> for more information on the C<=encoding> command.
+
 =back
 
 The
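The input-side half of the problem (UTF-8 POD without =encoding being read as Latin-1 and then encoded to UTF-8 on output) can be modelled without Pod::Man at all. The following is only a sketch of that pipeline under the assumption stated in the message above, namely that undeclared input is decoded as ISO-8859-1; pod_to_utf8 and hex_bytes are hypothetical names and this is not how Pod::Simple is actually implemented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Model of the pipeline: decode the input bytes using the declared (or
# implicit) input encoding, then encode the characters as UTF-8 output.
# (Hypothetical helper; the real work is split between Pod::Simple and
# the output file handle's encoding layer.)
sub pod_to_utf8 {
    my ($input_bytes, $declared_encoding) = @_;
    my $input_encoding = defined $declared_encoding
        ? $declared_encoding
        : 'ISO-8859-1';    # implicit default, as described in the thread
    my $characters = decode($input_encoding, $input_bytes);
    return encode('UTF-8', $characters);
}

# Render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# UTF-8 bytes for LATIN SMALL LETTER A WITH DIAERESIS.
my $a_umlaut = "\xC3\xA4";

# With =encoding UTF-8 the bytes round-trip unchanged.
print 'declared UTF-8: ', hex_bytes(pod_to_utf8($a_umlaut, 'UTF-8')), "\n";

# Without =encoding each input byte is treated as a Latin-1 character
# and encoded again, giving double-encoded output.
print 'no =encoding:   ', hex_bytes(pod_to_utf8($a_umlaut, undef)), "\n";
```

The same model also shows why the case Russ wants to keep working does work: a Latin-1 input byte E4 with =encoding ISO-8859-1 comes out as the correct UTF-8 sequence C3 A4.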
Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote: Niko Tyni [EMAIL PROTECTED] writes: Any estimate on how widespread this POD problem is? Is the hardcoded 'pod2man --utf8' in the Lenny perldoc going to cause more grief than it's worth? I'm leaning on reverting that and reopening #492037 until the issue is sorted out in Pod-Perldoc upstream. Adding a way to enable or disable the '--utf8' option on the perldoc command line is one possibility, but it might as well cause even further trouble if upstream chooses a different implementation. I looked at this some more, and there's a deeper problem. If you run the current pod2man with --utf8 on an input POD file that doesn't declare an =encoding of UTF-8, any use of S in that POD file will result in invalid UTF-8, even if there's no use of high-bit characters in the input POD at all. Thanks for pointing out =encoding to me; I completely missed that in the documentation. I think the core problem was that Pod::Man is responsible for the output through the file handle and was missing an encoding layer. The problem is that we can't just call encode() on the output, since that breaks if PERL_UNICODE is set or if an encoding was manually set on the file handle. You get double-encoding. I think the least bad option is for Pod::Man and Pod::Text to force the encoding on their output file handles to UTF-8 when --utf8 is given. The problem with this fix is that this now really will break pod2man --utf8 if POD documents don't have their encoding declared properly, since it will end up double-encoding the UTF-8 given that, without =encoding, Pod::Simple is treating the input as ISO 8859-15. I think it's correct according to the specifications, but existing POD text that doesn't declare an encoding will get double-encoded output. I can work around this by not setting a UTF-8 output encoding unless the input encoding is detected as UTF-8, but that's not really correct. 
You *should* be able to have an input POD document with =encoding ISO-8859-1 and run it through pod2man --utf8 and get UTF-8 output. But a POD document with no =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.

While this is certainly something extra that people have to bear in mind when using pod2man --utf8, it *is* an option people have to enable manually (well, except for in perldoc; I suppose I'm more worried about generated manual pages), and it doesn't seem too unreasonable to just say that you have to specify =encoding when doing so. If that were mentioned explicitly in the pod2man manual page then I think that would be good enough.

Assuming that your intent is to run with UTF-8 across the board, then just sticking =encoding UTF-8 at the top of all POD files before passing them to pod2man is sufficient, and that's not too hard. The diff to debconf looks like this:

Index: doc/Makefile
===================================================================
--- doc/Makefile	(revision 2310)
+++ doc/Makefile	(working copy)
@@ -4,6 +4,9 @@ pod2man=pod2man -c Debconf -r '' --utf8
 
 manpages:
 	cd man && po4a po4a/po4a.cfg
+	for pod in man/*.pod; do \
+		perl -pi -e 'if (not $$seen and /^=head1/) { print "=encoding UTF-8\n\n"; $$seen = 1; }' $$pod; \
+	done
 	install -d man/gen
 	for num in 1 3 8; do \
 		find man -maxdepth 1 -type f -name "*.$$num.pod" -printf '%P\n' | \

I'd prefer to do this with a po4a addendum, but it turns out to be an absolute pain. Also this would break if any of the source documents contained S<>. Maybe I should just change all the source documents instead.

Perhaps it would be helpful if po4a inserted an =encoding paragraph? After all, it understands POD and it knows the encoding.

I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.

Hmm, this would be a shame. With your most recent patch it's now finally possible for debconf to generate working manual pages for Russian and French at the same time. I understand the perldoc problem though ...
-- Colin Watson [EMAIL PROTECTED]
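The Makefile one-liner above inserts =encoding UTF-8 before the first =head1 unconditionally. A slightly more defensive variant, sketched here as a plain function on the POD text, leaves documents alone if they already declare an encoding. The name add_encoding is hypothetical and this is not part of debconf or po4a.

```perl
use strict;
use warnings;

# Return the POD text with "=encoding UTF-8" inserted before the first
# =head1, unless the text already contains an =encoding command.
# (Hypothetical helper, for illustration only.)
sub add_encoding {
    my ($pod) = @_;

    # Already declared: return the text untouched so the call is idempotent.
    return $pod if $pod =~ /^=encoding\b/m;

    # Insert the declaration, followed by a blank line, at the start of
    # the first line that begins with =head1.  Without /g only the first
    # match is replaced.
    $pod =~ s/^(?==head1\b)/=encoding UTF-8\n\n/m;
    return $pod;
}

print add_encoding("=head1 NOM\n\ndebconf - Ex\xC3\xA9cuter un programme\n");
```

Applied to a file, the function would be wrapped in a read, transform and write-back step working on raw bytes, so that no I/O layer re-encodes the content on the way through.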
Bug#492037: Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
Niko Tyni [EMAIL PROTECTED] writes:

Any estimate on how widespread this POD problem is? Is the hardcoded 'pod2man --utf8' in the Lenny perldoc going to cause more grief than it's worth?

I looked at this some more, and there's a deeper problem. If you run the current pod2man with --utf8 on an input POD file that doesn't declare an =encoding of UTF-8, any use of S<> in that POD file will result in invalid UTF-8, even if there's no use of high-bit characters in the input POD at all.

Thanks for the follow-up. I see the problem.

sid% pod2man --utf8 /usr/share/perl/5.10/pod/perlrun.pod|iconv --from utf8 --to latin1 >/dev/null
iconv: illegal input sequence at position 2097

I think that for lenny you may want to back out of the --utf8 change and give it some time to settle.

Are you referring to backing out the whole Pod::Man update (#480997) or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ?

It looks to me like having the 'pod2man --utf8' option available in lenny, even if it's broken without '=encoding utf8', is OK, as long as it's not on by default. The pod2man manual page should probably be updated to note the issue.

I think reverting just the perldoc change is the least disruptive choice for Lenny.
--
Niko Tyni [EMAIL PROTECTED]
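The iconv check above can also be done from Perl, which is convenient when testing formatter output inside a test script. A minimal sketch, assuming only the core Encode module; the function name is_valid_utf8 is made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;

    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy = $bytes;

    # FB_CROAK makes decode() die on the first malformed sequence.
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# NO-BREAK SPACE correctly encoded as UTF-8, and as a bare Latin-1 byte.
print is_valid_utf8("\xC2\xA0") ? "valid\n" : "invalid\n";
print is_valid_utf8("\xA0")     ? "valid\n" : "invalid\n";
```

A lone 0xA0 byte in otherwise UTF-8 output is the situation that iconv reports above as an illegal input sequence.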
Bug#500210: perldoc perlrun spits out junk in synopsis
Niko Tyni [EMAIL PROTECTED] writes: I think that for lenny you may want to back out of the --utf8 change and give it some time to settle. Are you referring to backing out the whole Pod::Man update (#480997) or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ? Sorry, I meant only the pod2man --utf8 change in perldoc. I think that the behavior of pod2man, while not ideal, is still basically okay for lenny, although I'll be releasing a new version of podlators that will implement the changes described in my previous mail. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/ -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
On Wed, Oct 01, 2008 at 11:26:23AM -0700, Russ Allbery wrote: Niko Tyni [EMAIL PROTECTED] writes: Are you referring to backing out the whole Pod::Man update (#480997) or just the hardcoded 'pod2man --utf8' in perldoc (#492037) ? Sorry, I meant only the pod2man --utf8 change in perldoc. I think that the behavior of pod2man, while not ideal, is still basically okay for lenny, although I'll be releasing a new version of podlators that will implement the changes described in my previous mail. I'd support that - after all, for many purposes people can just use man instead on Debian, and it's not like it's a regression otherwise. -- Colin Watson [EMAIL PROTECTED] -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#500210: perldoc perlrun spits out junk in synopsis
found 500210 5.10.0-15
thanks

On Fri, Sep 26, 2008 at 02:04:55PM +0300, Niko Tyni wrote:
On Thu, Sep 25, 2008 at 11:37:21PM +0200, Gerfried Fuchs wrote:

Package: perl-doc
Version: 5.10.0-14
Severity: normal

When running perldoc perlrun I have strange characters in the output of it, and I managed to pin it down to a short POD snippet like this:

=head1 SYNOPSIS

B<perl> S<[ B<-sTtUWX> ]>

Thanks for noticing this. It's fixed in podlators-2.1.3:

* lib/Pod/Man.pm (format_text): Stop remapping the code point for non-breaking space. This should not be necessary and was wrong when the string from Pod::Simple was a character string and not a byte string. It was papering over a bug in setting the encoding of an input POD file.

Patch from upstream git attached. I'd certainly like to fix this for Lenny, we'll see what the release team thinks.

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 38c4e3d..203ef4a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -362,13 +362,6 @@ sub format_text {
         $text =~ s/([^\x00-\x7F])/$ESCAPES{ord ($1)} || "X"/eg;
     }
 
-    # For Unicode output, unconditionally remap ISO 8859-1 non-breaking spaces
-    # to the correct code point.  This is really a bug in Pod::Simple to be
-    # embedding ISO 8859-1 characters in the output stream that we see.
-    if ($$self{utf8} && ASCII) {
-        $text =~ s/\xA0/\xC2\xA0/g;
-    }
-
     # Ensure that *roff doesn't convert literal quotes to UTF-8 single quotes,
     # but don't mess up our accept escapes.
     if ($literal) {

For me, this fixed the case where a 0xA0 byte is embedded essentially accidentally in the middle of a UTF-8 stream (as happened with debconf's Russian translations), but it broke the case where 0xA0 is actually being used as a non-breaking space. Note that I'm using the new 'pod2man --utf8' option, although presumably so is Gerfried since perldoc now uses that option automatically.

I've attached debconf.fr.1.pod, which reproduces this problem.
Run 'pod2man -c Debconf -r '' --utf8 --section=1 debconf.fr.1.pod', and look carefully at the line matching purge. It looks like this: soient bons et pour que les commandes «?purge?» et «?unregister?» soient The two characters marked as ? here are the byte 0xA0. The characters around it are encoded in UTF-8. 0xA0 doesn't decode as UTF-8 so man assumes that this page must be ISO-8859-1, which means the whole page comes out misencoded. Is this because Pod::Man hasn't been told about the encoding of the input data, perhaps? The input files pretty much have to be in UTF-8 if you're using --utf8, so do we have to tell perl that with binmode? I think I've got about as far as I can with this, so CCing Russ for help. :-) -- Colin Watson [EMAIL PROTECTED] * * GENERATED FILE, DO NOT EDIT * * THIS IS NO SOURCE FILE, BUT RESULT OF COMPILATION * * This file was generated by po4a(7). Do not store it (in cvs, for example), but store the po file used as source file by po4a-translate. In fact, consider this as a binary, and the po file as a regular .c file: If the po get lost, keeping this translation up-to-date will be harder. =head1 NOM debconf - Exécuter un programme utilisant debconf =head1 SYNOPSIS debconf [options] commande [args] =head1 DESCRIPTION Debconf est un système de configuration pour les paquets Debian. Pour faire un tour d'horizon de debconf et pour obtenir de la documentation pour les administrateurs système, veuillez consulter Ldebconf(7). Le programme Bdebconf exécute un programme sous contrôle de debconf, en le configurant pour communiquer avec debconf sur l'entrée et la sortie standard. La sortie du programme sera l'une des commandes du protocole debconf, et les codes résultants seront lus sur l'entrée standard. Pour plus de détails sur le protocole de debconf, veuillez consulter Ldebconf-devel(7). La commande à exécuter depuis debconf doit être spécifiée de manière à ce qu'elle soit trouvée dans votre PATH. 
Cette commande ne reflète pas l'utilisation habituelle de debconf. Il est courant pour debconf d'être appellé via Ldpkg-preconfigure(8) ou Ldpkg-reconfigure(8). =head1 OPTIONS =over 4 =item B-oIpaquet, B--owner=Ipaquet Indique à debconf à quel paquet appartient la commande exécutée. C'est nécessaire pour que les droits de propriété des questions enregistrées soient bons et pour que les commandes S« purge » et S« unregister » soient gérées correctement. =item B-fItype, B--frontend=Itype Sélectionner l'interface à utiliser. =item B-pIvaleur, B--priority=Ivaleur Spécifier la priorité minimale des questions qui vont être posées. =back =head1 EXEMPLES Pour déboguer un script shell qui utilise debconf, vous devriez Sutiliser : DEBCONF_DEBUG=developer debconf
Bug#500210: perldoc perlrun spits out junk in synopsis
Colin Watson [EMAIL PROTECTED] writes: For me, this fixed the case where a 0xA0 byte is embedded essentially accidentally in the middle of a UTF-8 stream (as happened with debconf's Russian translations), but it broke the case where 0xA0 is actually being used as a non-breaking space. Note that I'm using the new 'pod2man --utf8' option, although presumably so is Gerfried since perldoc now uses that option automatically. I've attached debconf.fr.1.pod, which reproduces this problem. Run 'pod2man -c Debconf -r '' --utf8 --section=1 debconf.fr.1.pod', and look carefully at the line matching purge. It looks like this: soient bons et pour que les commandes «?purge?» et «?unregister?» soient The two characters marked as ? here are the byte 0xA0. The characters around it are encoded in UTF-8. 0xA0 doesn't decode as UTF-8 so man assumes that this page must be ISO-8859-1, which means the whole page comes out misencoded. Is this because Pod::Man hasn't been told about the encoding of the input data, perhaps? The input files pretty much have to be in UTF-8 if you're using --utf8, so do we have to tell perl that with binmode? Hi Colin, You got it exactly right. Basically, podlators has been papering over this bug incorrectly, but in a way that happens to do the right thing with a common POD problem. Most POD authors from the pre-Unicode days of Perl don't realize this, but if you use Unicode characters in POD, you have to declare the input encoding in the POD in order for the results to be reliable and consistent. This is actually mentioned in perlpod, but if you were like me, you haven't read that recently. :) I just discovered this myself. =encoding encodingname This command is used for declaring the encoding of a document. Most users won’t need this; but if your encoding isn’t US-ASCII or Latin-1, then put a =encoding encodingname command early in the document so that pod formatters will know how to decode the document. 
For encodingname, use a name recognized by the Encode::Supported module. So if you're using UTF-8, starting the POD with: =encoding UTF-8 is required. If you add that, the current version of Pod::Man (and previous versions, as it turns out, mostly by chance) will do the right thing. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/ -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
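Finding POD that needs an =encoding line can be automated. The following is a rough sketch only: it approximates POD extraction with a simple line scan instead of using podselect or Pod::Simple, and the function name needs_encoding is hypothetical. It reports a problem when a non-ASCII byte appears inside a POD section before any =encoding command has been seen.

```perl
use strict;
use warnings;

# Given the raw bytes of a .pm or .pod file, return 1 if its POD contains
# a non-ASCII byte that is not preceded by an =encoding command, else 0.
# (Crude approximation: POD starts at any "=word" line and ends at =cut.)
sub needs_encoding {
    my ($source) = @_;
    my ($in_pod, $declared) = (0, 0);

    for my $line (split /^/, $source) {
        if ($line =~ /^=cut\b/) {
            $in_pod = 0;
            next;
        }
        $in_pod = 1 if $line =~ /^=[a-zA-Z]/;
        next unless $in_pod;

        $declared = 1 if $line =~ /^=encoding\b/;
        return 1 if !$declared && $line =~ /[^\x00-\x7F]/;
    }
    return 0;
}

my $pod = "=head1 NOM\n\ndebconf - Ex\xC3\xA9cuter un programme\n";
print needs_encoding($pod) ? "missing =encoding\n" : "ok\n";
```

Non-ASCII bytes in the Perl code outside POD sections are deliberately ignored, since only the POD is decoded according to =encoding.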
Bug#500210: perldoc perlrun spits out junk in synopsis
Package: perl-doc
Version: 5.10.0-14
Severity: normal

Hi!

When running perldoc perlrun I have strange characters in the output of it, and I managed to pin it down to a short POD snippet like this:

#v+
=head1 SYNOPSIS

B<perl> S<[ B<-sTtUWX> ]>
#v-

The S<[ ]> does strange stuff with the spaces it has in there. For LC_ALL=C it puts a [C2] in front of the space, which turns into a LATIN CAPITAL LETTER A WITH CIRCUMFLEX in my usual utf8 locale.

Hope this can get easy fixed, it really looks ugly.
Rhonda

-- System Information:
Debian Release: lenny/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: powerpc (ppc)

Kernel: Linux 2.6.26-1-powerpc
Locale: LANG=de_AT.UTF-8, LC_CTYPE=de_AT.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages perl-doc depends on:
ii  perl                     5.10.0-14         Larry Wall's Practical Extraction

perl-doc recommends no packages.

Versions of packages perl-doc suggests:
pn  groff                    none              (no description available)
ii  konqueror [man-browser]  4:3.5.9.dfsg.1-5  KDE's advanced file manager, web b
ii  man-db [man-browser]     2.5.2-3           on-line manual pager

-- no debconf information
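The stray character in this report is what a UTF-8 encoded NO-BREAK SPACE looks like when its two bytes are read as two separate ISO-8859-1 characters. A minimal sketch using the core Encode and charnames modules; the function name misread_as_latin1 is made up for illustration:

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode encode);

# Encode one character as UTF-8, then (wrongly) read the resulting bytes
# back as ISO-8859-1 and return the Unicode names of what comes out.
sub misread_as_latin1 {
    my ($character) = @_;
    my $utf8_bytes = encode('UTF-8', $character);
    my $misread    = decode('ISO-8859-1', $utf8_bytes);
    return map { charnames::viacode(ord) } split //, $misread;
}

# U+00A0 is what S<> uses in place of ordinary spaces.
print "$_\n" for misread_as_latin1("\x{A0}");
# LATIN CAPITAL LETTER A WITH CIRCUMFLEX
# NO-BREAK SPACE
```

The first byte, 0xC2, is the [C2] seen under LC_ALL=C, and the same interpretation rendered again in a UTF-8 locale gives the circumflexed A described in the report.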
Bug#500210: perldoc perlrun spits out junk in synopsis
tag 500210 patch fixed-upstream
thanks

On Thu, Sep 25, 2008 at 11:37:21PM +0200, Gerfried Fuchs wrote:

Package: perl-doc
Version: 5.10.0-14
Severity: normal

When running perldoc perlrun I have strange characters in the output of it, and I managed to pin it down to a short POD snippet like this:

=head1 SYNOPSIS

B<perl> S<[ B<-sTtUWX> ]>

Thanks for noticing this. It's fixed in podlators-2.1.3:

* lib/Pod/Man.pm (format_text): Stop remapping the code point for non-breaking space. This should not be necessary and was wrong when the string from Pod::Simple was a character string and not a byte string. It was papering over a bug in setting the encoding of an input POD file.

Patch from upstream git attached. I'd certainly like to fix this for Lenny, we'll see what the release team thinks.
--
Niko Tyni [EMAIL PROTECTED]

diff --git a/ChangeLog b/ChangeLog
index f0c727e..25850bc 100644
diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 38c4e3d..203ef4a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -362,13 +362,6 @@ sub format_text {
         $text =~ s/([^\x00-\x7F])/$ESCAPES{ord ($1)} || "X"/eg;
     }
 
-    # For Unicode output, unconditionally remap ISO 8859-1 non-breaking spaces
-    # to the correct code point.  This is really a bug in Pod::Simple to be
-    # embedding ISO 8859-1 characters in the output stream that we see.
-    if ($$self{utf8} && ASCII) {
-        $text =~ s/\xA0/\xC2\xA0/g;
-    }
-
     # Ensure that *roff doesn't convert literal quotes to UTF-8 single quotes,
     # but don't mess up our accept escapes.
     if ($literal) {