Bug#768127: Fails to build the index when invalid UTF-8 is met
Control: severity 773485 normal

On Thu, Dec 18, 2014 at 11:19:31PM +0100, gregor herrmann wrote:
> Probably the severity in the cloned bug in ruby-debian should be lowered; the problem might not be present if a non-UTF-8 file is not opened as UTF-8 ...

yes

-- 
Antonio Terceiro terce...@debian.org
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Fri, 19 Dec 2014 07:57:15 +0200, Yavor Doganov wrote:
> > Right, cloning+reassigning to ruby-debian might make sense. Let's do this :) (And close the original bug since it does fix another problem.)
> Thanks. I used dpkg --update-avail to get rid of that ancient package and now dhelp works as expected. (Thought dpkg was doing this automatically these days...)

Great!

> > Probably the severity in the cloned bug in ruby-debian should be lowered; the problem might not be present if a non-UTF-8 file is not opened as UTF-8 ...
> You're probably right; it's up to the ruby-debian maintainers.

Already done by Antonio in the meantime, thanks.

> Thanks to both of you for your efforts.

Thanks for your patience :)

Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer - http://www.debian.org/
 `. `'  Member of VIBE!AT SPI, fellow of the Free Software Foundation Europe
   `-
Bug#768127: Fails to build the index when invalid UTF-8 is met
reopen 768127
notfixed 768127 0.6.21+nmu6
thanks

Thanks for your work, but unfortunately I experience exactly the same problem with the new version.

$ isutf8 /var/lib/doc-base/documents/*
$ echo $?
0
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, 18 Dec 2014 12:01:02 +0200, Yavor Doganov wrote:
> Thanks for your work, but unfortunately I experience exactly the same problem with the new version.
>
> $ isutf8 /var/lib/doc-base/documents/*
> $ echo $?
> 0

Ouch. I'm sorry to hear this. Could you please provide a bit more information? I guess it would be helpful if you could try to
- add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184 (see Santiago's message #24);
- copy the output of `locale' and `/etc/cron.weekly/dhelp';
- maybe try `LC_ALL=x LANG=x /etc/cron.weekly/dhelp' for different values of x.

(Hm, maybe we should set LANG/LC_ALL in the cron script. But before that I'd like to see where it fails for Yavor.)

Thanks in advance,
gregor
Bug#768127: Fails to build the index when invalid UTF-8 is met
gregor herrmann wrote: On Thu, 18 Dec 2014 12:01:02 +0200, Yavor Doganov wrote: Thanks for your work, but unfortunately I experience exactly the same problem with the new version. I guess it would be helpful if you could try to - add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184 (see Santiago's message #24); - copy the output of `locale' and `/etc/cron.weekly/dhelp'; $ locale LANG=bg_BG.UTF-8 LANGUAGE=bg:en_GB LC_CTYPE=bg_BG.UTF-8 LC_NUMERIC=bg_BG.UTF-8 LC_TIME=bg_BG.UTF-8 LC_COLLATE=bg_BG.UTF-8 LC_MONETARY=bg_BG.UTF-8 LC_MESSAGES=bg_BG.UTF-8 LC_PAPER=bg_BG.UTF-8 LC_NAME=bg_BG.UTF-8 LC_ADDRESS=bg_BG.UTF-8 LC_TELEPHONE=bg_BG.UTF-8 LC_MEASUREMENT=bg_BG.UTF-8 LC_IDENTIFICATION=bg_BG.UTF-8 LC_ALL= [ Sorry for the long output. ] /var/lib/doc-base/documents/dc /var/lib/doc-base/documents/abi-compliance-checker /var/lib/doc-base/documents/aptitude-doc-en /var/lib/doc-base/documents/autoconf /var/lib/doc-base/documents/automake-1.14 /var/lib/doc-base/documents/bash /var/lib/doc-base/documents/bashref /var/lib/doc-base/documents/bc /var/lib/doc-base/documents/bzip2 /var/lib/doc-base/documents/bzr /var/lib/doc-base/documents/bzr-builddeb /var/lib/doc-base/documents/comerr-manual /var/lib/doc-base/documents/copyright-format-1.0 /var/lib/doc-base/documents/cpp-4.9 /var/lib/doc-base/documents/cppinternals-4.9 /var/lib/doc-base/documents/cvs-doc /var/lib/doc-base/documents/cvs-doc-client /var/lib/doc-base/documents/cvs-doc-faq /var/lib/doc-base/documents/cvs-doc-intro /var/lib/doc-base/documents/cvs-doc-paper /var/lib/doc-base/documents/cvs-doc-rcsfiles /var/lib/doc-base/documents/dbuskit-api-docs /var/lib/doc-base/documents/dbuskit-manual /var/lib/doc-base/documents/debconf-spec /var/lib/doc-base/documents/debian-constitution-text /var/lib/doc-base/documents/debian-faq /var/lib/doc-base/documents/debian-mailing-lists /var/lib/doc-base/documents/debian-manifesto /var/lib/doc-base/documents/debian-menu-policy /var/lib/doc-base/documents/debian-perl-policy 
/var/lib/doc-base/documents/debian-policy /var/lib/doc-base/documents/debian-reporting-bugs /var/lib/doc-base/documents/debian-social-contract /var/lib/doc-base/documents/debian-tex-policy /var/lib/doc-base/documents/developers-reference /var/lib/doc-base/documents/doc-base /var/lib/doc-base/documents/docbook-xsl-doc /var/lib/doc-base/documents/everyday-git /var/lib/doc-base/documents/exim4-filter-txt /var/lib/doc-base/documents/exim4-readme-debian /var/lib/doc-base/documents/exim4-spec-txt /var/lib/doc-base/documents/expat /var/lib/doc-base/documents/feynmf-manual /var/lib/doc-base/documents/fhs /var/lib/doc-base/documents/findutils /var/lib/doc-base/documents/fontconfig-devel /var/lib/doc-base/documents/fontconfig-user /var/lib/doc-base/documents/gawk-doc-gawk /var/lib/doc-base/documents/gawk-doc-gawkinet /var/lib/doc-base/documents/gcc-4.9 /var/lib/doc-base/documents/gccint-4.9 /var/lib/doc-base/documents/gccintro /var/lib/doc-base/documents/gdl2api /var/lib/doc-base/documents/gdl2intro /var/lib/doc-base/documents/git-api /var/lib/doc-base/documents/git-bisect-lk2009 /var/lib/doc-base/documents/git-buildpackage /var/lib/doc-base/documents/git-howtos /var/lib/doc-base/documents/git-index-format /var/lib/doc-base/documents/git-pack-format /var/lib/doc-base/documents/git-protocol /var/lib/doc-base/documents/git-reference-manual /var/lib/doc-base/documents/git-shallow-clone-design /var/lib/doc-base/documents/git-tools /var/lib/doc-base/documents/git-trivial-merge-rules /var/lib/doc-base/documents/git-user-manual /var/lib/doc-base/documents/glibc-manual /var/lib/doc-base/documents/gnu-coding-standards /var/lib/doc-base/documents/gnu-maintainers-information /var/lib/doc-base/documents/gnustep-base-additions /var/lib/doc-base/documents/gnustep-base-programming-manual /var/lib/doc-base/documents/gnustep-base-reference /var/lib/doc-base/documents/gnustep-base-tools /var/lib/doc-base/documents/gnustep-coding-standards /var/lib/doc-base/documents/gnustep-gui-additions 
/var/lib/doc-base/documents/gnustep-gui-programming-manual /var/lib/doc-base/documents/gnustep-gui-reference /var/lib/doc-base/documents/gnustep-make-manual /var/lib/doc-base/documents/gnustep-netclasses-docs /var/lib/doc-base/documents/gnustep-performance /var/lib/doc-base/documents/gnustep-sqlclient /var/lib/doc-base/documents/gorm.app /var/lib/doc-base/documents/initramfs-maintainer /var/lib/doc-base/documents/install-docs-man /var/lib/doc-base/documents/jade /var/lib/doc-base/documents/kbd-font-formats /var/lib/doc-base/documents/libao /var/lib/doc-base/documents/libexif-api /var/lib/doc-base/documents/libffi /var/lib/doc-base/documents/libfreetype6-dev /var/lib/doc-base/documents/libidn11 /var/lib/doc-base/documents/libio-stringy-perl /var/lib/doc-base/documents/libpng12 /var/lib/doc-base/documents/libsdl1.2-dev /var/lib/doc-base/documents/libsndfile /var/lib/doc-base/documents/libtasn1 /var/lib/doc-base/documents/libvorbis /var/lib/doc-base/documents/libxml-parser-perl
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, 18 Dec 2014 18:24:53 +0200, Yavor Doganov wrote:
> > I guess it would be helpful if you could try to
> > - add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184 (see Santiago's message #24);
> > - copy the output of `locale' and `/etc/cron.weekly/dhelp';

Thanks!

> $ locale
> LANG=bg_BG.UTF-8
> LANGUAGE=bg:en_GB
> LC_CTYPE=bg_BG.UTF-8
> LC_NUMERIC=bg_BG.UTF-8
> LC_TIME=bg_BG.UTF-8
> LC_COLLATE=bg_BG.UTF-8
> LC_MONETARY=bg_BG.UTF-8
> LC_MESSAGES=bg_BG.UTF-8
> LC_PAPER=bg_BG.UTF-8
> LC_NAME=bg_BG.UTF-8
> LC_ADDRESS=bg_BG.UTF-8
> LC_TELEPHONE=bg_BG.UTF-8
> LC_MEASUREMENT=bg_BG.UTF-8
> LC_IDENTIFICATION=bg_BG.UTF-8
> LC_ALL=

Ok.

> [ Sorry for the long output. ]

No worries, that was expected :)

> [..]
> /var/lib/doc-base/documents/xinetd-faq
> /var/lib/doc-base/documents/xterm-ctlseqs
> /var/lib/doc-base/documents/xterm-faq
> ArgumentError: invalid byte sequence in UTF-8 (/usr/lib/ruby/vendor_ruby/debian.rb:914:in `block in initialize'

So according to the previous findings, my guess is that /var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8. Which is interesting for two reasons:
- First, I have the file installed and don't have any problems with dhelp (although not with a Bulgarian locale, but UTF-8 should be UTF-8?!)
- Second, at least on my system, it's plain ASCII:

% file -i /var/lib/doc-base/documents/xterm-faq
/var/lib/doc-base/documents/xterm-faq: text/plain; charset=us-ascii

And the file itself looks completely innocent:

#v+
Document: xterm-faq
Section: Terminal Emulators
Title: XTerm Frequently Asked Questions (FAQ)
Author: Thomas Dickey
Abstract: This document provides answers to frequently asked questions about the XTerm terminal emulator as it ships with the X.Org distribution of the X Window System.

Format: HTML
Index: /usr/share/doc/xterm/xterm.faq.html
Files: /usr/share/doc/xterm/xterm.faq.html

Format: text
Files: /usr/share/doc/xterm/xterm.faq.gz
#v-

This is getting slightly mysterious.
Could you please try `file -i /var/lib/doc-base/documents/xterm-faq' as well and open the file in a pager/editor to see if it's the same on your machine?

If I'm understanding this correctly, the origin is /usr/share/doc-base/xterm-faq which has, according to /var/lib/dpkg/info/xterm.md5sums, an md5sum of 4f81e4dd965c918abc250beeb54131fb. Confirmed locally:

% md5sum /usr/share/doc-base/xterm-faq
4f81e4dd965c918abc250beeb54131fb  /usr/share/doc-base/xterm-faq

Maybe you could check this out as well to rule out a corrupted file?

> > - maybe try `LC_ALL=x LANG=x /etc/cron.weekly/dhelp' for different versions of x.
> There is no difference.

Ok, thanks again!

Cheers,
gregor
Bug#768127: Fails to build the index when invalid UTF-8 is met
gregor herrmann wrote:
> > /var/lib/doc-base/documents/xterm-ctlseqs
> > /var/lib/doc-base/documents/xterm-faq
> > ArgumentError: invalid byte sequence in UTF-8 (/usr/lib/ruby/vendor_ruby/debian.rb:914:in `block in initialize'
>
> So according to the previous findings, my guess is that /var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8.

I don't think this has anything to do with xterm-faq, it's just that this is the last file in alphabetical order. If I move the file away I get the same failure -- it chokes on xterm-ctlseqs then.

> Could you please try `file -i /var/lib/doc-base/documents/xterm-faq' as well and open the file in a pager/editor to see if it's the same on your machine?

Yes, it's the same:

$ file -i /var/lib/doc-base/documents/xterm-faq
/var/lib/doc-base/documents/xterm-faq: text/plain; charset=us-ascii
$ md5sum /usr/share/doc-base/xterm-faq
4f81e4dd965c918abc250beeb54131fb  /usr/share/doc-base/xterm-faq

Too bad I don't know Ruby and I'm completely clueless.

It seems that there are two different problems -- Santiago's failure that he posted on the bug log is at dhelp.rb:185 while mine is at debian.rb:914 (which is why I suggested it might be a ruby-debian issue). If you and Daniel have reproduced and fixed Santiago's bug it is not surprising that the NMU does not address the bug I am observing. (At least that is how I explain the mystery for the time being, with my limited knowledge.)
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, Dec 18, 2014 at 4:30 PM, Yavor Doganov ya...@gnu.org wrote:
> It seems that there are two different problems -- Santiago's failure that he posted on the bug log is at dhelp.rb:185 while mine is at debian.rb:914 (which is why I suggested it might be a ruby-debian issue). If you and Daniel have reproduced and fixed Santiago's bug it is not surprising that the NMU does not address the bug I am observing. (At least that is how I explain the mystery for the time being, with my limited knowledge.)

Yes, that's my understanding, too.

Can you run with the attached patch to debian.rb, and see if it will show which entry of which file triggers the error?

--- /usr/lib/ruby/vendor_ruby/debian.rb.orig 2014-12-18 19:01:03.233496178 -0100
+++ /usr/lib/ruby/vendor_ruby/debian.rb.debug 2014-12-18 19:00:26.229041877 -0100
@@ -911,7 +911,14 @@
       @provides = {}
       @file = [file]
       @lists = Archives.parseArchiveFile(file) {|info|
-        info =~ /Package:\s(.*)$/;
+        begin
+          info =~ /Package:\s(.*)$/;
+        rescue => e
+          puts "Error parsing file #{file}"
+          puts "Contents of info:"
+          puts info
+          raise e
+        end
         if pkgs.empty? || pkgs.include?($1)
           d = Deb.new(info,fields)
           add_provides(d)
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, 18 Dec 2014 19:30:32 +0200, Yavor Doganov wrote:
> > So according to the previous findings, my guess is that /var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8.
> I don't think this has anything to do with xterm-faq, it's just that this is the last file in alphabetical order. If I move the file away I get the same failure -- it chokes on xterm-ctlseqs then.

D'oh. Then my guesses were really off track.

> Too bad I don't know Ruby and I'm completely clueless.

Same here.

> It seems that there are two different problems -- Santiago's failure that he posted on the bug log is at dhelp.rb:185 while mine is at debian.rb:914 (which is why I suggested it might be a ruby-debian issue).

Right, sorry for missing this. Let's hope that Daniel's debugging idea helps ...

Cheers,
gregor
Bug#768127: Fails to build the index when invalid UTF-8 is met
Daniel Getz wrote:
> Can you run with the attached patch to debian.rb, and see if it will show which entry of which file triggers the error?

Thanks; here's the output:

Error parsing file /var/lib/dpkg/available
Contents of info:
Package: ayuda
Priority: extra
Section: misc
Installed-Size: 204
Maintainer: Javier Vi�uales Guti�rrez v...@matrio.com
Architecture: all
Version: 0.1-4
Suggests: manpages-es, doc-linux-es, doc-debian-es
Filename: pool/main/a/ayuda/ayuda_0.1-4_all.deb
Size: 31710
MD5sum: 79c8ded94cce4b054ad883cb139500d7
Description: Help for spanish-speakers
 This package contains a help program called 'ayuda' useful for users that speak spanish, and are new to the world of Debian GNU/Linux.
 .
 The help provided covers many topics from administration to daily use.

(Ruby backtrace follows.)

The maintainer name is not valid UTF-8:

$ isutf8 /var/lib/dpkg/available
/var/lib/dpkg/available: line 17427, char 1, byte offset 22: invalid UTF-8 code
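For readers who want to repeat the isutf8 check from Ruby, the following is a minimal sketch under stated assumptions. The method name `invalid_utf8_lines` is hypothetical and is not part of dhelp or ruby-debian; unlike isutf8, it reports only line numbers, not character or byte offsets.

```ruby
# Report the 1-based numbers of all lines in a file that are not valid UTF-8.
# Hypothetical helper for illustration only.
def invalid_utf8_lines(path)
  bad = []
  # Read raw bytes so that the locale's default encoding plays no role.
  File.binread(path).each_line.with_index(1) do |line, number|
    # Tag the bytes as UTF-8 and ask Ruby whether they form valid UTF-8.
    bad << number unless line.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  bad
end
```

Calling `invalid_utf8_lines('/var/lib/dpkg/available')` on a system like the one in this report would be expected to list the line holding the Latin-1 maintainer name.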
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, 18 Dec 2014 23:25:06 +0200, Yavor Doganov wrote:
> Daniel Getz wrote:
> > Can you run with the attached patch to debian.rb, and see if it will show which entry of which file triggers the error?
> Thanks; here's the output:
> Error parsing file /var/lib/dpkg/available

Wow, that's an interesting finding.

> Contents of info:
> Package: ayuda
> Priority: extra
> Section: misc

There is no package ayuda in Debian (anymore; it was removed in 2005, according to https://packages.qa.debian.org/a/ayuda.html -- which also shows the encoding problems :))

Ok, so what are we doing now?

While I would like dhelp to handle this situation a bit more gracefully, I suggest downgrading the severity of the bug, since it shouldn't affect anyone running packages contained in recent and upcoming Debian releases, and keeping it out of jessie for these corner cases seems a bit too strong to me.

(Of course a fix would be best :))

Cheers,
gregor
Bug#768127: Fails to build the index when invalid UTF-8 is met
On Thu, Dec 18, 2014 at 8:49 PM, gregor herrmann gre...@debian.org wrote:
> Ok, so what are we doing now?
>
> While I would like dhelp to handle this situation a bit more gracefully, I suggest downgrading the severity of the bug, since it shouldn't affect anyone running packages contained in recent and upcoming Debian releases, and keeping it out of jessie for these corner cases seems a bit too strong to me.
>
> (Of course a fix would be best :))

In terms of fixing dhelp, we could in theory catch the UTF-8 error, log a warning somehow, and continue on to the next package description. One missing package from the documentation index isn't the end of the world (and in my experience, dhelp doesn't index all my HTML documentation anyway).

However, the code in question is in ruby-debian, which is a separate library used by other packages. Skipping entries might not be the correct behavior for other users of the library?
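The "catch the error, warn, and continue" idea could look roughly like the sketch below. It is an illustration only: `package_names` and its `log:` parameter are hypothetical, and ruby-debian's real parsing code is structured differently. The explicit `valid_encoding?` check makes the skip deterministic; the `rescue` also covers the ArgumentError that the regexp match itself raised in this thread.

```ruby
# Collect package names from control-file stanzas, skipping any stanza whose
# bytes are invalid in the encoding it is tagged with.
# Hypothetical helper; not the actual ruby-debian implementation.
def package_names(stanzas, log: $stderr)
  names = []
  stanzas.each do |info|
    begin
      unless info.valid_encoding?
        raise ArgumentError, "invalid byte sequence in #{info.encoding}"
      end
      names << Regexp.last_match(1) if info =~ /Package:\s(.*)$/
    rescue ArgumentError => e
      # Warn and move on to the next stanza instead of aborting the whole run.
      log.puts "warning: skipping stanza: #{e.message}"
    end
  end
  names
end
```

Whether silently dropping an entry is acceptable is exactly the open question raised above for other users of the library, so a real change would probably need to make this behavior opt-in.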
Bug#768127: Fails to build the index when invalid UTF-8 is met
Control: clone -1 -2
Control: reassign -2 ruby-debian
Control: affects -2 dhelp
Control: close -1 768127 0.6.21+nmu6

On Thu, 18 Dec 2014 21:02:58 -0100, Daniel Getz wrote:
> > While I would like dhelp to handle this situation a bit more gracefully, I suggest downgrading the severity of the bug, since it shouldn't affect anyone running packages contained in recent and upcoming Debian releases, and keeping it out of jessie for these corner cases seems a bit too strong to me.
> In terms of fixing dhelp, we could in theory catch the UTF-8 error, log a warning somehow, and continue on to the next package description. One missing package from the documentation index isn't the end of the world (and in my experience, dhelp doesn't index all my HTML documentation anyway).
>
> However, the code in question is in ruby-debian, which is a separate library used by other packages. Might not be correct behavior for other users of the library?

Right, cloning+reassigning to ruby-debian might make sense. Let's do this :) (And close the original bug since it does fix another problem.)

Probably the severity in the cloned bug in ruby-debian should be lowered; the problem might not be present if a non-UTF-8 file is not opened as UTF-8 ...

Cheers,
gregor
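The last remark, that the problem might not appear if a non-UTF-8 file is not opened as UTF-8, can be illustrated with a small sketch. It assumes the data is handled as raw bytes (for a file, `File.binread(path)` would provide that); `package_names_binary` is a hypothetical name, not a ruby-debian function.

```ruby
# Extract package names from raw control data without assuming UTF-8.
# String#b returns a copy tagged ASCII-8BIT (binary), so a Latin-1 byte in
# a maintainer name cannot make the ASCII-only pattern fail.
# Hypothetical helper for illustration only.
def package_names_binary(data)
  data.b.scan(/^Package:\s(.*)$/).flatten
end
```

The returned names are tagged as binary, so a caller that displays or stores them would still have to decide which encoding to assume.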
Bug#768127: Fails to build the index when invalid UTF-8 is met
gregor herrmann wrote:
> On Thu, 18 Dec 2014 21:02:58 -0100, Daniel Getz wrote:
> > However, the code in question is in ruby-debian, which is a separate library used by other packages. Might not be correct behavior for other users of the library?
> Right, cloning+reassigning to ruby-debian might make sense. Let's do this :) (And close the original bug since it does fix another problem.)

Thanks. I used dpkg --update-avail to get rid of that ancient package and now dhelp works as expected. (Thought dpkg was doing this automatically these days...)

> Probably the severity in the cloned bug in ruby-debian should be lowered; the problem might not be present if a non-UTF-8 file is not opened as UTF-8 ...

You're probably right; it's up to the ruby-debian maintainers.

Thanks to both of you for your efforts.
Bug#768127: Fails to build the index when invalid UTF-8 is met
Control: tag -1 - moreinfo
Control: tag -1 + confirmed

On Sat, 06 Dec 2014 01:33:58 -0100, Daniel Getz wrote:
> I can reproduce the problem with
> LC_ALL=C LANG=C /etc/cron.weekly/dhelp
>
> Attached is a diff with a change to dhelp_parse.rb which sets Encoding.default_external explicitly, so that even if LANG=C, it uses UTF-8 instead of US-ASCII as the default for opening files. By my (limited) understanding of Encoding.default_external, this should have the same effect on opening files as replacing LANG=C with LANG=xx_XX.UTF-8 would.
>
> On my machine, without the patch, I see the same errors with LANG=C as the others here. With the patch, I do not.

Works for me as well.

Since I don't speak any ruby I'm a bit hesitant to upload; maybe some ruby speaker who knows Encoding.default_external can confirm that's the correct way forward?

(And: Are we sure all doc-base files are us-ascii or utf-8 encoded? At least on my machine they are, so maybe that's a non-concern.)

Cheers,
gregor
Bug#768127: Fails to build the index when invalid UTF-8 is met
UTF-8 should be the right format for doc-base files, according to https://lintian.debian.org/tags/doc-base-file-uses-obsolete-national-encoding.html

I also don't know ruby, but from my research, setting Encoding.default_external is considered the wrong thing to do; the right way is to pass -E UTF-8 as an option to ruby via the command line, or via the environment variable RUBYOPT. I had to explicitly silence a warning because of this. See http://docs.ruby-lang.org/en/2.1.0/Encoding.html#method-c-default_external-3D

However, neither of those right ways to set the encoding works well when a ruby file is used directly as a script. (Is ruby not intended to be used in scripts?!)

The ruby docs say the problem arises if code gets run before the change to the encoding. That's avoidable, and I believe I avoided it in my patch by placing the encoding change before any require imports.

An alternative is to explicitly set the encoding to UTF-8 each time a file is opened. If someone feels that's a better way, I'm willing to do that and create a new patch.

But like I said, I don't know ruby, so I can't guarantee correctness beyond trying it and seeing that it works.

- Dan

On Sun, Dec 7, 2014 at 2:06 PM, gregor herrmann gre...@debian.org wrote:
> Control: tag -1 - moreinfo
> Control: tag -1 + confirmed
>
> On Sat, 06 Dec 2014 01:33:58 -0100, Daniel Getz wrote:
> > I can reproduce the problem with
> > LC_ALL=C LANG=C /etc/cron.weekly/dhelp
> >
> > Attached is a diff with a change to dhelp_parse.rb which sets Encoding.default_external explicitly, so that even if LANG=C, it uses UTF-8 instead of US-ASCII as the default for opening files. By my (limited) understanding of Encoding.default_external, this should have the same effect on opening files as replacing LANG=C with LANG=xx_XX.UTF-8 would.
> >
> > On my machine, without the patch, I see the same errors with LANG=C as the others here. With the patch, I do not.
>
> Works for me as well.
> Since I don't speak any ruby I'm a bit hesitant to upload; maybe some ruby speaker knowing Encoding.default_external can confirm that's the correct way forwards?
>
> (And: Are we sure all doc-base files are us-ascii or utf-8 encoded? At least on my machine they are, so maybe that's a non-concern.)
>
> Cheers,
> gregor
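The alternative Dan mentions, setting the encoding explicitly each time a file is opened, might look like the following minimal sketch. The helper name `read_utf8` is hypothetical; the thread does not show how dhelp actually opens its files, so this only illustrates the `'r:UTF-8'` mode string of Ruby's `File.open`.

```ruby
# Read a file as UTF-8 regardless of LANG / LC_ALL.
# The "r:UTF-8" mode sets the external encoding for this one file handle,
# leaving Encoding.default_external untouched.
# Hypothetical helper for illustration only.
def read_utf8(path)
  File.open(path, 'r:UTF-8') { |f| f.read }
end
```

This keeps the change local to each read, at the cost of having to touch every place that opens a file. It does not help with data that is not UTF-8 in the first place, such as the old /var/lib/dpkg/available entry found later in the thread.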
Bug#768127: Fails to build the index when invalid UTF-8 is met
Attached is a diff with a change to dhelp_parse.rb which sets Encoding.default_external explicitly, so that even if LANG=C, it uses UTF-8 instead of US-ASCII as the default for opening files. By my (limited) understanding of Encoding.default_external, this should have the same effect on opening files as replacing LANG=C with LANG=xx_XX.UTF-8 would.

On my machine, without the patch, I see the same errors with LANG=C as the others here. With the patch, I do not.

Hope to help,
- Dan Getz

diff -Nru dhelp-0.6.21+nmu5/debian/changelog dhelp-0.6.21+nmu6/debian/changelog
--- dhelp-0.6.21+nmu5/debian/changelog 2014-10-15 06:35:28.0 -0100
+++ dhelp-0.6.21+nmu6/debian/changelog 2014-12-06 01:05:28.0 -0100
@@ -1,3 +1,10 @@
+dhelp (0.6.21+nmu6) UNRELEASED; urgency=medium
+
+  * Non-maintainer upload.
+  * Load files as UTF-8, regardless of $LANG
+
+ -- Dan Getz tank...@gmail.com  Sat, 06 Dec 2014 00:41:01 -0100
+
 dhelp (0.6.21+nmu5) unstable; urgency=medium
 
   * Non-maintainer upload.
diff -Nru dhelp-0.6.21+nmu5/src/dhelp_parse.rb dhelp-0.6.21+nmu6/src/dhelp_parse.rb
--- dhelp-0.6.21+nmu5/src/dhelp_parse.rb 2014-10-15 06:12:27.0 -0100
+++ dhelp-0.6.21+nmu6/src/dhelp_parse.rb 2014-12-06 01:05:04.0 -0100
@@ -24,6 +24,11 @@
 PREFIX = '/usr'
 DEFAULT_INDEX_ROOT = "#{PREFIX}/share/doc/HTML"
 
+# Set default file format as UTF-8, without printing a warning
+old_verbose, $VERBOSE = $VERBOSE, false
+Encoding.default_external = "UTF-8"
+$VERBOSE = old_verbose
+
 require 'dhelp'
 require 'dhelp/exporter/html'
 include Dhelp
Bug#768127: Fails to build the index when invalid UTF-8 is met
Package: dhelp
Version: 0.6.21+nmu5
Followup-For: Bug #768127

I don't know if it helps, but I got a similar error from the weekly cron task.

LANG=C sudo /etc/cron.weekly/dhelp
ArgumentError: invalid byte sequence in US-ASCII
(/usr/lib/ruby/vendor_ruby/dhelp.rb:185:in `==='
/usr/lib/ruby/vendor_ruby/dhelp.rb:185:in `block in initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:309:in `new'
/usr/lib/ruby/vendor_ruby/dhelp.rb:309:in `block (2 levels) in each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:306:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:306:in `block in each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:305:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:305:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:456:in `_register_docs'
/usr/lib/ruby/vendor_ruby/dhelp.rb:387:in `rebuild'
/usr/sbin/dhelp_parse:204:in `main'
/usr/sbin/dhelp_parse:216:in `main')

It seems to work fine with my default locale:

LANG=es_CO.utf8 sudo /etc/cron.weekly/dhelp

I added a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184 and it shows this:

...
/var/lib/doc-base/documents/bogofilter-bogotune-faq
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
ArgumentError: invalid byte sequence in US-ASCII
(/usr/lib/ruby/vendor_ruby/dhelp.rb:186:in `==='
/usr/lib/ruby/vendor_ruby/dhelp.rb:186:in `block in initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `each'
...

The problem arises when it parses the fourth line in developers-reference:

Author: Adam Di Carlo, Josip Rodin, Raphaël Hertzog, et al

It doesn't complain when I use fr_FR.UTF-8! (which is not consistent :P)

Cheers,

Santiago

-- System Information:
Debian Release: 8.0
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages dhelp depends on:
ii  doc-base                      0.10.6
ii  libdata-page-perl             2.02-1
ii  libhtml-parser-perl           3.71-1+b3
ii  liblocale-gettext-perl        1.05-8+b1
ii  libtemplate-perl              2.24-1.2+b1
ii  liburi-perl                   1.64-1
ii  perl-modules                  5.20.1-3
ii  poppler-utils                 0.26.5-2
ii  pstotext                      1.9-6+b1
ii  ruby                          1:2.1.0.4
ii  ruby-bdb                      0.6.6-1+b2
ii  ruby-debian                   0.3.9
ii  ruby-gettext                  3.1.2-1
ii  ruby1.8 [ruby-interpreter]    1.8.7.358-7.1+deb7u1
ii  ruby1.9.1 [ruby-interpreter]  1.9.3.194-8.1+deb7u2
ii  ruby2.1 [ruby-interpreter]    2.1.5-1
ii  swish++                       6.1.5-2.2
ii  ucf                           3.0030

Versions of packages dhelp recommends:
ii  chromium [www-browser]   38.0.2125.101-3
ii  html2text                1.3.2a-18
ii  iceweasel [www-browser]  31.2.0esr-3
ii  opera [www-browser]      12.16.1860
ii  w3m [www-browser]        0.5.3-19

Versions of packages dhelp suggests:
ii  apache2 [httpd-cgi]              2.4.10-8
ii  apache2-mpm-prefork [httpd-cgi]  2.4.10-8
pn  catdvi                           none
pn  info2www                         none
pn  man2html                         none

-- no debconf information
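The difference between the LANG=C run and the UTF-8 locales can be sketched outside of dhelp. The snippet below is an illustration under assumptions: it tags the same bytes with two encodings via `force_encoding` instead of reading a real doc-base file, and `author_field` is a hypothetical helper, not dhelp code. On the Ruby versions discussed in this report, the US-ASCII call is expected to end in the same ArgumentError as the backtrace above.

```ruby
# "\u00EB" is the e-diaeresis in "Raphael"; in UTF-8 it is two non-ASCII bytes.
AUTHOR_LINE = "Author: Adam Di Carlo, Josip Rodin, Rapha\u00EBl Hertzog, et al"

# Return the captured author text, nil if the line does not match, or the
# error message if the regexp match raises because the bytes are invalid
# in the given encoding. Hypothetical helper for illustration only.
def author_field(line, encoding)
  tagged = line.dup.force_encoding(encoding)
  tagged =~ /^Author:\s*(.*)$/ ? Regexp.last_match(1) : nil
rescue ArgumentError => e
  e.message
end

puts author_field(AUTHOR_LINE, 'UTF-8')
puts author_field(AUTHOR_LINE, 'US-ASCII')
```

With LANG=C, Ruby's default external encoding is US-ASCII, so files read without an explicit encoding end up tagged like the second call.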