Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-19 Thread Antonio Terceiro
Control: severity 773485 normal

On Thu, Dec 18, 2014 at 11:19:31PM +0100, gregor herrmann wrote:
 Probably the severity in the cloned bug in ruby-debian should be
 lowered; the problem might not be present if a non-UTF-8 file is not
 opened as UTF-8 ...

yes

-- 
Antonio Terceiro terce...@debian.org


signature.asc
Description: Digital signature


Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-19 Thread gregor herrmann
On Fri, 19 Dec 2014 07:57:15 +0200, Yavor Doganov wrote:

  Right, cloning+reassigning to ruby-debian might make sense.
  Let's do this :)
  (And close the original bug since it does fix another problem.)
 Thanks.  I used dpkg --update-avail to get rid of that ancient
 package and now dhelp works as expected.  (Thought dpkg was doing this
 automatically these days...)

Great!
 
  Probably the severity in the cloned bug in ruby-debian should be
  lowered; the problem might not be present if a non-UTF-8 file is not
  opened as UTF-8 ...
 You're probably right; it's up to the ruby-debian maintainers.

Already done by Antonio in the meantime, thanks.
 
 Thanks to both of you for your efforts.

Thanks for your patience :)


Cheers,
gregor 

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Yavor Doganov
reopen 768127
notfixed 768127 0.6.21+nmu6
thanks

Thanks for your work, but unfortunately I experience exactly the same
problem with the new version.

$ isutf8 /var/lib/doc-base/documents/*
$ echo $?
0


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread gregor herrmann
On Thu, 18 Dec 2014 12:01:02 +0200, Yavor Doganov wrote:

 Thanks for your work, but unfortunately I experience exactly the same
 problem with the new version.
 
 $ isutf8 /var/lib/doc-base/documents/*
 $ echo $?
 0

Ouch. I'm sorry to hear this.

Could you please provide a bit more information? I guess it would be
helpful if you could try to
- add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184
  (see Santiago's message #24);
- copy the output of `locale' and `/etc/cron.weekly/dhelp';
- maybe try `LC_ALL=x LANG=x /etc/cron.weekly/dhelp'
  for different versions of x.
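
To see what Ruby itself makes of a given locale, a small diagnostic like the following could be run under each LC_ALL/LANG combination. This is only a sketch: `encoding_report` is a made-up helper, not part of dhelp, and it reports nothing beyond what `Encoding.locale_charmap` and `Encoding.default_external` return.

```ruby
# Hypothetical helper: report the encodings Ruby derived from the environment.
def encoding_report
  {
    "LANG"             => ENV["LANG"],
    "LC_ALL"           => ENV["LC_ALL"],
    "locale_charmap"   => Encoding.locale_charmap,         # charmap of the current locale
    "default_external" => Encoding.default_external.name,  # default for data read from files
    "default_internal" => Encoding.default_internal&.name  # nil unless set explicitly
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Saved under a name such as encoding_report.rb (hypothetical), it could be run as `LC_ALL=C LANG=C ruby encoding_report.rb` and again with a UTF-8 locale to compare the two reports.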


(Hm, maybe we should set LANG/LC_ALL in the cron script. But before
that I'd like to see where it fails for Yavor.)


Thanks in advance,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Various Artists: Green Fields of France




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Yavor Doganov
gregor herrmann wrote:
 On Thu, 18 Dec 2014 12:01:02 +0200, Yavor Doganov wrote:
  Thanks for your work, but unfortunately I experience exactly the
  same problem with the new version.
 
 I guess it would be helpful if you could try to
 - add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184
   (see Santiago's message #24);
 - copy the output of `locale' and `/etc/cron.weekly/dhelp';

$ locale
LANG=bg_BG.UTF-8
LANGUAGE=bg:en_GB
LC_CTYPE=bg_BG.UTF-8
LC_NUMERIC=bg_BG.UTF-8
LC_TIME=bg_BG.UTF-8
LC_COLLATE=bg_BG.UTF-8
LC_MONETARY=bg_BG.UTF-8
LC_MESSAGES=bg_BG.UTF-8
LC_PAPER=bg_BG.UTF-8
LC_NAME=bg_BG.UTF-8
LC_ADDRESS=bg_BG.UTF-8
LC_TELEPHONE=bg_BG.UTF-8
LC_MEASUREMENT=bg_BG.UTF-8
LC_IDENTIFICATION=bg_BG.UTF-8
LC_ALL=

[ Sorry for the long output. ]

/var/lib/doc-base/documents/dc
/var/lib/doc-base/documents/abi-compliance-checker
/var/lib/doc-base/documents/aptitude-doc-en
/var/lib/doc-base/documents/autoconf
/var/lib/doc-base/documents/automake-1.14
/var/lib/doc-base/documents/bash
/var/lib/doc-base/documents/bashref
/var/lib/doc-base/documents/bc
/var/lib/doc-base/documents/bzip2
/var/lib/doc-base/documents/bzr
/var/lib/doc-base/documents/bzr-builddeb
/var/lib/doc-base/documents/comerr-manual
/var/lib/doc-base/documents/copyright-format-1.0
/var/lib/doc-base/documents/cpp-4.9
/var/lib/doc-base/documents/cppinternals-4.9
/var/lib/doc-base/documents/cvs-doc
/var/lib/doc-base/documents/cvs-doc-client
/var/lib/doc-base/documents/cvs-doc-faq
/var/lib/doc-base/documents/cvs-doc-intro
/var/lib/doc-base/documents/cvs-doc-paper
/var/lib/doc-base/documents/cvs-doc-rcsfiles
/var/lib/doc-base/documents/dbuskit-api-docs
/var/lib/doc-base/documents/dbuskit-manual
/var/lib/doc-base/documents/debconf-spec
/var/lib/doc-base/documents/debian-constitution-text
/var/lib/doc-base/documents/debian-faq
/var/lib/doc-base/documents/debian-mailing-lists
/var/lib/doc-base/documents/debian-manifesto
/var/lib/doc-base/documents/debian-menu-policy
/var/lib/doc-base/documents/debian-perl-policy
/var/lib/doc-base/documents/debian-policy
/var/lib/doc-base/documents/debian-reporting-bugs
/var/lib/doc-base/documents/debian-social-contract
/var/lib/doc-base/documents/debian-tex-policy
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/doc-base
/var/lib/doc-base/documents/docbook-xsl-doc
/var/lib/doc-base/documents/everyday-git
/var/lib/doc-base/documents/exim4-filter-txt
/var/lib/doc-base/documents/exim4-readme-debian
/var/lib/doc-base/documents/exim4-spec-txt
/var/lib/doc-base/documents/expat
/var/lib/doc-base/documents/feynmf-manual
/var/lib/doc-base/documents/fhs
/var/lib/doc-base/documents/findutils
/var/lib/doc-base/documents/fontconfig-devel
/var/lib/doc-base/documents/fontconfig-user
/var/lib/doc-base/documents/gawk-doc-gawk
/var/lib/doc-base/documents/gawk-doc-gawkinet
/var/lib/doc-base/documents/gcc-4.9
/var/lib/doc-base/documents/gccint-4.9
/var/lib/doc-base/documents/gccintro
/var/lib/doc-base/documents/gdl2api
/var/lib/doc-base/documents/gdl2intro
/var/lib/doc-base/documents/git-api
/var/lib/doc-base/documents/git-bisect-lk2009
/var/lib/doc-base/documents/git-buildpackage
/var/lib/doc-base/documents/git-howtos
/var/lib/doc-base/documents/git-index-format
/var/lib/doc-base/documents/git-pack-format
/var/lib/doc-base/documents/git-protocol
/var/lib/doc-base/documents/git-reference-manual
/var/lib/doc-base/documents/git-shallow-clone-design
/var/lib/doc-base/documents/git-tools
/var/lib/doc-base/documents/git-trivial-merge-rules
/var/lib/doc-base/documents/git-user-manual
/var/lib/doc-base/documents/glibc-manual
/var/lib/doc-base/documents/gnu-coding-standards
/var/lib/doc-base/documents/gnu-maintainers-information
/var/lib/doc-base/documents/gnustep-base-additions
/var/lib/doc-base/documents/gnustep-base-programming-manual
/var/lib/doc-base/documents/gnustep-base-reference
/var/lib/doc-base/documents/gnustep-base-tools
/var/lib/doc-base/documents/gnustep-coding-standards
/var/lib/doc-base/documents/gnustep-gui-additions
/var/lib/doc-base/documents/gnustep-gui-programming-manual
/var/lib/doc-base/documents/gnustep-gui-reference
/var/lib/doc-base/documents/gnustep-make-manual
/var/lib/doc-base/documents/gnustep-netclasses-docs
/var/lib/doc-base/documents/gnustep-performance
/var/lib/doc-base/documents/gnustep-sqlclient
/var/lib/doc-base/documents/gorm.app
/var/lib/doc-base/documents/initramfs-maintainer
/var/lib/doc-base/documents/install-docs-man
/var/lib/doc-base/documents/jade
/var/lib/doc-base/documents/kbd-font-formats
/var/lib/doc-base/documents/libao
/var/lib/doc-base/documents/libexif-api
/var/lib/doc-base/documents/libffi
/var/lib/doc-base/documents/libfreetype6-dev
/var/lib/doc-base/documents/libidn11
/var/lib/doc-base/documents/libio-stringy-perl
/var/lib/doc-base/documents/libpng12
/var/lib/doc-base/documents/libsdl1.2-dev
/var/lib/doc-base/documents/libsndfile
/var/lib/doc-base/documents/libtasn1
/var/lib/doc-base/documents/libvorbis
/var/lib/doc-base/documents/libxml-parser-perl

Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread gregor herrmann
On Thu, 18 Dec 2014 18:24:53 +0200, Yavor Doganov wrote:

  I guess it would be helpful if you could try to
  - add a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184
    (see Santiago's message #24);
  - copy the output of `locale' and `/etc/cron.weekly/dhelp';

Thanks!
 
 $ locale
 LANG=bg_BG.UTF-8
 LANGUAGE=bg:en_GB
 LC_CTYPE=bg_BG.UTF-8
 LC_NUMERIC=bg_BG.UTF-8
 LC_TIME=bg_BG.UTF-8
 LC_COLLATE=bg_BG.UTF-8
 LC_MONETARY=bg_BG.UTF-8
 LC_MESSAGES=bg_BG.UTF-8
 LC_PAPER=bg_BG.UTF-8
 LC_NAME=bg_BG.UTF-8
 LC_ADDRESS=bg_BG.UTF-8
 LC_TELEPHONE=bg_BG.UTF-8
 LC_MEASUREMENT=bg_BG.UTF-8
 LC_IDENTIFICATION=bg_BG.UTF-8
 LC_ALL=

Ok.
 
 [ Sorry for the long output. ]

No worries, that was expected :)
 
[..]
 /var/lib/doc-base/documents/xinetd-faq
 /var/lib/doc-base/documents/xterm-ctlseqs
 /var/lib/doc-base/documents/xterm-faq
 ArgumentError: invalid byte sequence in UTF-8 
 (/usr/lib/ruby/vendor_ruby/debian.rb:914:in `block in initialize'

So according to the previous findings, my guess is that
/var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8.

Which is interesting for two reasons:
- First, I have the file installed and don't have any problems with
  dhelp (although not with a Bulgarian locale but UTF-8 should be
  UTF-8?!)
- Second, at least on my system, it's plain ASCII:

% file -i /var/lib/doc-base/documents/xterm-faq
/var/lib/doc-base/documents/xterm-faq: text/plain; charset=us-ascii

And the file itself looks completely innocent:

#v+
Document: xterm-faq
Section: Terminal Emulators
Title: XTerm Frequently Asked Questions (FAQ)
Author: Thomas Dickey
Abstract: This document provides answers to frequently asked questions
 about the XTerm terminal emulator as it ships with the X.Org distribution
 of the X Window System.

Format: HTML
Index: /usr/share/doc/xterm/xterm.faq.html
Files: /usr/share/doc/xterm/xterm.faq.html

Format: text
Files: /usr/share/doc/xterm/xterm.faq.gz
#v-

This is getting slightly mysterious.

Could you please try `file -i /var/lib/doc-base/documents/xterm-faq'
as well and open the file in a pager/editor to see if it's the same
on your machine?

If I'm understanding this correctly, the origin is
/usr/share/doc-base/xterm-faq which has, according to
/var/lib/dpkg/info/xterm.md5sums, a md5sum of
4f81e4dd965c918abc250beeb54131fb. Confirmed locally:

% md5sum /usr/share/doc-base/xterm-faq
4f81e4dd965c918abc250beeb54131fb  /usr/share/doc-base/xterm-faq

Maybe you could check this out as well to rule out a corrupted file?
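
The same comparison can be scripted. A minimal sketch using Ruby's standard digest/md5 library; `md5_matches?` is a hypothetical helper, and the path and checksum are simply the ones quoted above.

```ruby
require "digest/md5"

# Hypothetical helper: compare a file's MD5 digest with an expected hex string.
def md5_matches?(path, expected_hex)
  Digest::MD5.file(path).hexdigest == expected_hex.downcase
end

path = "/usr/share/doc-base/xterm-faq"
expected = "4f81e4dd965c918abc250beeb54131fb"  # value from xterm.md5sums, as quoted above

if File.exist?(path)
  puts md5_matches?(path, expected) ? "matches the packaged checksum" : "differs from the packaged checksum"
else
  puts "#{path} is not installed on this machine"
end
```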

  - maybe try `LC_ALL=x LANG=x /etc/cron.weekly/dhelp'
    for different versions of x.
 There is no difference.

Ok, thanks again!


Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Nick Drake: Hanging on a Star




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Yavor Doganov
gregor herrmann wrote:
  /var/lib/doc-base/documents/xterm-ctlseqs
  /var/lib/doc-base/documents/xterm-faq
  ArgumentError: invalid byte sequence in UTF-8 
  (/usr/lib/ruby/vendor_ruby/debian.rb:914:in `block in initialize'
 
 So according to the previous findings, my guess is that
 /var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8.

I don't think this has anything to do with xterm-faq, it's just that
this is the last file in alphabetical order.  If I move the file away
I get the same failure -- it chokes on xterm-ctlseqs then.

 Could you please try `file -i /var/lib/doc-base/documents/xterm-faq'
 as well and open the file in a pager/editor to see if it's the same
 on your machine?

Yes, it's the same:

$ file -i /var/lib/doc-base/documents/xterm-faq 
/var/lib/doc-base/documents/xterm-faq: text/plain; charset=us-ascii
$ md5sum /usr/share/doc-base/xterm-faq 
4f81e4dd965c918abc250beeb54131fb  /usr/share/doc-base/xterm-faq

Too bad I don't know Ruby and I'm completely clueless.

It seems that there are two different problems -- Santiago's failure
that he posted on the bug log is at dhelp.rb:185 while mine is at
debian.rb:914 (which is why I suggested it might be a ruby-debian
issue).  If you and Daniel have reproduced and fixed Santiago's bug it
is not surprising that the NMU does not address the bug I am
observing.  (At least that is how I explain the mystery for the time
being, with my limited knowledge.)





Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Daniel Getz
On Thu, Dec 18, 2014 at 4:30 PM, Yavor Doganov ya...@gnu.org wrote:

 It seems that there are two different problems -- Santiago's failure
 that he posted on the bug log is at dhelp.rb:185 while mine is at
 debian.rb:914 (which is why I suggested it might be a ruby-debian
 issue).  If you and Daniel have reproduced and fixed Santiago's bug it
 is not surprising that the NMU does not address the bug I am
 observing.  (At least that is how I explain the mystery for the time
 being, with my limited knowledge.)

Yes, that's my understanding, too.

Can you run with the attached patch to debian.rb, and see if it will show
which entry of which file triggers the error?
--- /usr/lib/ruby/vendor_ruby/debian.rb.orig	2014-12-18 19:01:03.233496178 -0100
+++ /usr/lib/ruby/vendor_ruby/debian.rb.debug	2014-12-18 19:00:26.229041877 -0100
@@ -911,7 +911,14 @@
   @provides = {}
   @file = [file]
   @lists = Archives.parseArchiveFile(file) {|info|
-	info =~ /Package:\s(.*)$/;
+	begin
+	  info =~ /Package:\s(.*)$/;
+	rescue => e
+	  puts "Error parsing file #{file}"
+	  puts "Contents of info:"
+	  puts info
+	  raise e
+	end
 	if pkgs.empty? || pkgs.include?($1)
 	  d = Deb.new(info,fields)
 	  add_provides(d)


Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread gregor herrmann
On Thu, 18 Dec 2014 19:30:32 +0200, Yavor Doganov wrote:

  So according to the previous findings, my guess is that
  /var/lib/doc-base/documents/xterm-faq can't be interpreted as UTF-8.
 I don't think this has anything to do with xterm-faq, it's just that
 this is the last file in alphabetical order.  If I move the file away
 I get the same failure -- it chokes on xterm-ctlseqs then.


D'oh. Then my guesses were really off track.
 
 Too bad I don't know Ruby and I'm completely clueless.

Same here.
 
 It seems that there are two different problems -- Santiago's failure
 that he posted on the bug log is at dhelp.rb:185 while mine is at
 debian.rb:914 (which is why I suggested it might be a ruby-debian
 issue).

Right, sorry for missing this.
Let's hope that Daniel's debugging idea helps ...


Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Buffy St. Marie: The Universal Soldier




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Yavor Doganov
Daniel Getz wrote:
 Can you run with the attached patch to debian.rb, and see if it will
 show which entry of which file triggers the error?

Thanks; here's the output:

Error parsing file /var/lib/dpkg/available
Contents of info:
Package: ayuda
Priority: extra
Section: misc
Installed-Size: 204
Maintainer: Javier Vi�uales Guti�rrez v...@matrio.com
Architecture: all
Version: 0.1-4
Suggests: manpages-es, doc-linux-es, doc-debian-es
Filename: pool/main/a/ayuda/ayuda_0.1-4_all.deb
Size: 31710
MD5sum: 79c8ded94cce4b054ad883cb139500d7
Description: Help for spanish-speakers
 This package contains a help program called 'ayuda' useful
 for users that speak spanish, and are new to the world of
 Debian GNU/Linux.
 .
 The help provided covers many topics from administration to daily use.

(Ruby backtrace follows.)

The maintainer name is not valid UTF-8:

$ isutf8 /var/lib/dpkg/available
/var/lib/dpkg/available: line 17427, char 1, byte offset 22: invalid UTF-8 code
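
Roughly the same check as isutf8 can be sketched in Ruby with `String#valid_encoding?`. The helper name `invalid_utf8_lines` is made up, and the `\xF1` byte in the sample is an assumption (a Latin-1 "ñ") used only for illustration; the actual bytes in the available file are not shown in this thread.

```ruby
require "tempfile"

# Hypothetical helper: return [line_number, raw_line] pairs for every line
# of the file that is not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # Binary mode: reading never tries to interpret the bytes.
  File.open(path, "rb") do |f|
    f.each_line.with_index(1) do |line, number|
      candidate = line.dup.force_encoding(Encoding::UTF_8)
      bad << [number, line] unless candidate.valid_encoding?
    end
  end
  bad
end

# Demonstration on a temporary file with one assumed Latin-1 byte.
Tempfile.create("available") do |f|
  File.binwrite(f.path, "Package: ayuda\nMaintainer: Javier Vi\xF1uales\nVersion: 0.1-4\n".b)
  invalid_utf8_lines(f.path).each do |number, line|
    puts "line #{number}: #{line.inspect}"
  end
end
```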





Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread gregor herrmann
On Thu, 18 Dec 2014 23:25:06 +0200, Yavor Doganov wrote:

 Daniel Getz wrote:
  Can you run with the attached patch to debian.rb, and see if it will
  show which entry of which file triggers the error?
 Thanks; here's the output:

 Error parsing file /var/lib/dpkg/available

Wow, that's an interesting finding.

 Contents of info:
 Package: ayuda
 Priority: extra
 Section: misc

There is no package ayuda in Debian (anymore; it was removed in 2005,
according to https://packages.qa.debian.org/a/ayuda.html -- which
also shows the encoding problems :))


Ok, so what are we doing now?

While I would like dhelp to handle this situation a bit more
gracefully, I suggest to downgrade the severity of the bug since it
shouldn't affect anyone running packages contained in recent and
upcoming Debian releases, and keeping it out of jessie for these
corner cases seems a bit too strong for me.

(Of course a fix would be best :))

Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Ludwig Hirsch: Die Spur im Schnee




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Daniel Getz
On Thu, Dec 18, 2014 at 8:49 PM, gregor herrmann gre...@debian.org wrote:

 Ok, so what are we doing now?

 While I would like dhelp to handle this situation a bit more
 gracefully, I suggest to downgrade the severity of the bug since it
 shouldn't affect anyone running packages contained in recent and
 upcoming Debian releases, and keeping it out of jessie for these
 corner cases seems a bit too strong for me.

 (Of course a fix would be best :))

In terms of fixing dhelp, we could in theory catch the UTF-8 error, log a
warning somehow, and continue on to the next package description. One
missing package from the documentation index isn't the end of the world
(and in my experience, dhelp doesn't index all my HTML documentation
anyway.) However, the code in question is in ruby-debian, which is a
separate library used by other packages. Might not be correct behavior for
other users of the library?
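
The idea of catching the problem, warning, and moving on could look roughly like this. It is a sketch under stated assumptions, not ruby-debian's real parsing code: `valid_stanzas` is a hypothetical helper, stanzas are assumed to be separated by blank lines, and the `\xF1` byte merely stands in for the kind of non-UTF-8 byte found in the available file.

```ruby
# Hypothetical helper: split dpkg-style control text into stanzas, keep the
# ones that are valid UTF-8, and warn about the rest instead of raising.
def valid_stanzas(text, source: "(input)")
  # Work on a binary copy so that splitting cannot raise an encoding error.
  text.b.split(/\n{2,}/).each_with_object([]) do |raw, kept|
    stanza = raw.dup.force_encoding(Encoding::UTF_8)
    if stanza.valid_encoding?
      kept << stanza
    else
      # The name is read from the binary copy, which is always safe to match.
      name = raw[/^Package:\s*(\S+)/, 1] || "unknown"
      warn "#{source}: skipping package #{name}: invalid UTF-8"
    end
  end
end

sample = "Package: ayuda\nMaintainer: Vi\xF1uales\n\nPackage: dhelp\nMaintainer: Example\n"
valid_stanzas(sample, source: "sample").each do |stanza|
  puts stanza[/^Package:\s*(\S+)/, 1]
end
```

Whether silently dropping a stanza is acceptable is exactly the open question for other users of the library; a caller could just as well collect the skipped names and decide for itself.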


Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread gregor herrmann
Control: clone -1 -2
Control: reassign -2 ruby-debian
Control: affects -2 dhelp
Control: close -1 768127 0.6.21+nmu6

On Thu, 18 Dec 2014 21:02:58 -0100, Daniel Getz wrote:

  While I would like dhelp to handle this situation a bit more
  gracefully, I suggest to downgrade the severity of the bug since it
  shouldn't affect anyone running packages contained in recent and
  upcoming Debian releases, and keeping it out of jessie for these
  corner cases seems a bit too strong for me.
 In terms of fixing dhelp, we could in theory catch the UTF-8 error, log a
 warning somehow, and continue on to the next package description. One
 missing package from the documentation index isn't the end of the world
 (and in my experience, dhelp doesn't index all my HTML documentation
 anyway.) However, the code in question is in ruby-debian, which is a
 separate library used by other packages. Might not be correct behavior for
 other users of the library?

Right, cloning+reassigning to ruby-debian might make sense.
Let's do this :)
(And close the original bug since it does fix another problem.)

Probably the severity in the cloned bug in ruby-debian should be
lowered; the problem might not be present if a non-UTF-8 file is not
opened as UTF-8 ...
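
One way to picture this: the same bytes raise when tagged as UTF-8 but match without complaint when treated as binary. A sketch only; `try_match` is a made-up helper, the regexp is the one from the debian.rb patch quoted above, and the `\xF1` byte is an assumed stand-in for the offending character.

```ruby
PACKAGE_RE = /Package:\s(.*)$/  # pattern taken from the debian.rb patch in this thread

# Hypothetical helper: return the captured package name, nil when nothing
# matches, or a description of the error when matching raises.
def try_match(info)
  info =~ PACKAGE_RE ? $1 : nil
rescue ArgumentError => e
  "#{e.class}: #{e.message}"
end

bytes = "Package: ayuda\nMaintainer: Vi\xF1uales\n".b  # \xF1 is an assumption

puts try_match(bytes.dup.force_encoding(Encoding::UTF_8))  # invalid as UTF-8, error is reported
puts try_match(bytes)                                       # binary string, the match succeeds
```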

Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Rolling Stones: Lucky




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-18 Thread Yavor Doganov
gregor herrmann wrote:
 On Thu, 18 Dec 2014 21:02:58 -0100, Daniel Getz wrote:
  However, the code in question is in ruby-debian, which is a
  separate library used by other packages. Might not be correct
  behavior for other users of the library?
 
 Right, cloning+reassigning to ruby-debian might make sense.
 Let's do this :)
 (And close the original bug since it does fix another problem.)

Thanks.  I used dpkg --update-avail to get rid of that ancient
package and now dhelp works as expected.  (Thought dpkg was doing this
automatically these days...)

 Probably the severity in the cloned bug in ruby-debian should be
 lowered; the problem might not be present if a non-UTF-8 file is not
 opened as UTF-8 ...

You're probably right; it's up to the ruby-debian maintainers.

Thanks to both of you for your efforts.





Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-07 Thread gregor herrmann
Control: tag -1 - moreinfo
Control: tag -1 + confirmed

On Sat, 06 Dec 2014 01:33:58 -0100, Daniel Getz wrote:

I can reproduce the problem with
LC_ALL=C LANG=C /etc/cron.weekly/dhelp

 Attached is a diff with a change to dhelp_parse.rb which sets
 Encoding.default_external explicitly, so that even if LANG=C, it uses UTF-8
 instead of US-ASCII as the default for opening files. By my (limited)
 understanding of Encoding.default_external, this should have the same
 effect on opening files as replacing LANG=C with LANG=xx_XX.UTF-8 would.
 
 On my machine, without the patch, I see the same errors with LANG=C as the
 others here. With the patch, I do not.

Works for me as well.
 

Since I don't speak any ruby I'm a bit hesitant to upload; maybe some
ruby speaker knowing Encoding.default_external can confirm that's the
correct way forwards?

(And: Are we sure all doc-base files are us-ascii or utf-8 encoded?
At least on my machine they are, so maybe that's a non-concern.)


Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT  SPI, fellow of the Free Software Foundation Europe
   `-   NP: Treibhaus: Yellowman Jamaica




Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-07 Thread Daniel Getz
UTF-8 should be the right format for doc-base files, according to
https://lintian.debian.org/tags/doc-base-file-uses-obsolete-national-encoding.html

I also don't know ruby, but from my research setting Encoding.default_external
is considered the wrong thing to do, the right way being to pass -E
UTF-8 as an option to ruby via the command line, or the environment
variable RUBYOPT. I had to explicitly silence a warning because of this.
See
http://docs.ruby-lang.org/en/2.1.0/Encoding.html#method-c-default_external-3D

However, neither of those recommended ways of setting the encoding works
well when a ruby file is run directly as a script. (Is ruby not intended to
be used in scripts?!) The ruby docs say the problem arises if code gets run
before the encoding is changed. That's avoidable, and I believe I avoided it
in my patch by placing the encoding change before any require statements.

An alternative is to explicitly set the encoding to UTF-8 each time a file
is opened. If someone feels that's a better way, I'm willing to do that and
create a new patch. But like I said, I don't know ruby, so I can't
guarantee correctness beyond trying it and seeing that it works.

- Dan
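
The per-file alternative mentioned above could look like the sketch below. `read_utf8` is a hypothetical helper, not code from dhelp; `File.read` with `encoding:` (or an open mode such as "r:UTF-8") is standard Ruby.

```ruby
require "tempfile"

# Hypothetical helper: read a file as UTF-8 regardless of the locale or of
# Encoding.default_external.
def read_utf8(path)
  File.read(path, encoding: "UTF-8")
end

Tempfile.create("doc-base") do |f|
  File.binwrite(f.path, "Author: Rapha\u00EBl Hertzog\n")
  text = read_utf8(f.path)
  puts text.encoding         # UTF-8, also under LANG=C
  puts text.valid_encoding?
end
```

The trade-off is that every place that opens a file has to remember the option, whereas the default_external change is made once.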





Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-05 Thread Daniel Getz
Attached is a diff with a change to dhelp_parse.rb which sets
Encoding.default_external explicitly, so that even if LANG=C, it uses UTF-8
instead of US-ASCII as the default for opening files. By my (limited)
understanding of Encoding.default_external, this should have the same
effect on opening files as replacing LANG=C with LANG=xx_XX.UTF-8 would.

On my machine, without the patch, I see the same errors with LANG=C as the
others here. With the patch, I do not.
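
The effect described here can be observed in isolation. In this sketch, `encoding_when_default_is` is a made-up helper that temporarily changes `Encoding.default_external`, reads a file, and restores the old value; it is not part of the attached patch.

```ruby
require "tempfile"

# Hypothetical helper: read a file while a given default external encoding
# is in effect and return the encoding Ruby tags the resulting string with.
def encoding_when_default_is(name, path)
  saved = Encoding.default_external
  old_verbose, $VERBOSE = $VERBOSE, nil  # the assignment below may warn when warnings are on
  Encoding.default_external = name
  File.read(path).encoding
ensure
  Encoding.default_external = saved
  $VERBOSE = old_verbose
end

Tempfile.create("doc-base") do |f|
  File.binwrite(f.path, "Author: Rapha\u00EBl Hertzog\n")
  puts encoding_when_default_is("US-ASCII", f.path)  # the LANG=C situation described above
  puts encoding_when_default_is("UTF-8", f.path)     # what the patch arranges
end
```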

Hope to help,

- Dan Getz
diff -Nru dhelp-0.6.21+nmu5/debian/changelog dhelp-0.6.21+nmu6/debian/changelog
--- dhelp-0.6.21+nmu5/debian/changelog  2014-10-15 06:35:28.0 -0100
+++ dhelp-0.6.21+nmu6/debian/changelog  2014-12-06 01:05:28.0 -0100
@@ -1,3 +1,10 @@
+dhelp (0.6.21+nmu6) UNRELEASED; urgency=medium
+
+  * Non-maintainer upload.
+  * Load files as UTF-8, regardless of $LANG
+
+ -- Dan Getz tank...@gmail.com  Sat, 06 Dec 2014 00:41:01 -0100
+
 dhelp (0.6.21+nmu5) unstable; urgency=medium
 
   * Non-maintainer upload.
diff -Nru dhelp-0.6.21+nmu5/src/dhelp_parse.rb dhelp-0.6.21+nmu6/src/dhelp_parse.rb
--- dhelp-0.6.21+nmu5/src/dhelp_parse.rb  2014-10-15 06:12:27.0 -0100
+++ dhelp-0.6.21+nmu6/src/dhelp_parse.rb  2014-12-06 01:05:04.0 -0100
@@ -24,6 +24,11 @@
 PREFIX = '/usr'
 DEFAULT_INDEX_ROOT = "#{PREFIX}/share/doc/HTML"
 
+# Set default file format as UTF-8, without printing a warning
+old_verbose, $VERBOSE = $VERBOSE, false
+Encoding.default_external = "UTF-8"
+$VERBOSE = old_verbose
+
 require 'dhelp'
 require 'dhelp/exporter/html'
 include Dhelp


Bug#768127: Fails to build the index when invalid UTF-8 is met

2014-12-04 Thread Santiago
Package: dhelp
Version: 0.6.21+nmu5
Followup-For: Bug #768127

I don't know if it helps, but I got a similar error from the weekly cron
task.

LANG=C sudo /etc/cron.weekly/dhelp 
ArgumentError: invalid byte sequence in US-ASCII
(/usr/lib/ruby/vendor_ruby/dhelp.rb:185:in `==='
/usr/lib/ruby/vendor_ruby/dhelp.rb:185:in `block in initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:309:in `new'
/usr/lib/ruby/vendor_ruby/dhelp.rb:309:in `block (2 levels) in each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:306:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:306:in `block in each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:305:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:305:in `each'
/usr/lib/ruby/vendor_ruby/dhelp.rb:456:in `_register_docs'
/usr/lib/ruby/vendor_ruby/dhelp.rb:387:in `rebuild'
/usr/sbin/dhelp_parse:204:in `main'
/usr/sbin/dhelp_parse:216:in `main')

It seems to work fine with my default locale:

LANG=es_CO.utf8 sudo /etc/cron.weekly/dhelp

I added a puts @path in /usr/lib/ruby/vendor_ruby/dhelp.rb:184 and it
shows this:

...
/var/lib/doc-base/documents/bogofilter-bogotune-faq
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
/var/lib/doc-base/documents/developers-reference
ArgumentError: invalid byte sequence in US-ASCII
(/usr/lib/ruby/vendor_ruby/dhelp.rb:186:in `==='
/usr/lib/ruby/vendor_ruby/dhelp.rb:186:in `block in initialize'
/usr/lib/ruby/vendor_ruby/dhelp.rb:183:in `each'
...

The problem arises when it parses the fourth line in
developers-reference:

Author: Adam Di Carlo, Josip Rodin, Raphaël Hertzog, et al

It doesn't complain when I use fr_FR.UTF-8! (which is not consistent :P)
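
The failure can be reproduced without dhelp. In the sketch below, `classify` is a hypothetical stand-in for the case/when around dhelp.rb:185 (whose source is not shown in this thread), and tagging the line as US-ASCII imitates what reading the file under LANG=C produced here.

```ruby
# Hypothetical stand-in for a case/when that dispatches on regexps.
# Returns a symbol, or the error message when matching raises.
def classify(line)
  case line
  when /^Author:/ then :author
  when /^Title:/  then :title
  else :other
  end
rescue ArgumentError => e
  e.message
end

raw = "Author: Adam Di Carlo, Josip Rodin, Rapha\u00EBl Hertzog, et al\n"

as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)
as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)  # same bytes, different tag

p classify(as_utf8)   # matches normally
p classify(as_ascii)  # the bytes of the accented letter are invalid as US-ASCII
```

The bytes are identical in both cases; only the encoding Ruby assumes for them differs, which is why any UTF-8 locale makes the error disappear.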

Cheers,

Santiago

-- System Information:
Debian Release: 8.0
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages dhelp depends on:
ii  doc-base                      0.10.6
ii  libdata-page-perl             2.02-1
ii  libhtml-parser-perl           3.71-1+b3
ii  liblocale-gettext-perl        1.05-8+b1
ii  libtemplate-perl              2.24-1.2+b1
ii  liburi-perl                   1.64-1
ii  perl-modules                  5.20.1-3
ii  poppler-utils                 0.26.5-2
ii  pstotext                      1.9-6+b1
ii  ruby                          1:2.1.0.4
ii  ruby-bdb                      0.6.6-1+b2
ii  ruby-debian                   0.3.9
ii  ruby-gettext                  3.1.2-1
ii  ruby1.8 [ruby-interpreter]    1.8.7.358-7.1+deb7u1
ii  ruby1.9.1 [ruby-interpreter]  1.9.3.194-8.1+deb7u2
ii  ruby2.1 [ruby-interpreter]    2.1.5-1
ii  swish++                       6.1.5-2.2
ii  ucf                           3.0030

Versions of packages dhelp recommends:
ii  chromium [www-browser]   38.0.2125.101-3
ii  html2text                1.3.2a-18
ii  iceweasel [www-browser]  31.2.0esr-3
ii  opera [www-browser]      12.16.1860
ii  w3m [www-browser]        0.5.3-19

Versions of packages dhelp suggests:
ii  apache2 [httpd-cgi]              2.4.10-8
ii  apache2-mpm-prefork [httpd-cgi]  2.4.10-8
pn  catdvi                           <none>
pn  info2www                         <none>
pn  man2html                         <none>

-- no debconf information

