Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-30 Thread Axel Beckert
Control: tag -1 - unreproducible + confirmed upstream
Control: retitle -1 links: No option to specify charset on command-line
Control: severity -1 wishlist

Hi Julian,

Julian Gilbey wrote:
> > Your file, at least how it arrived by mail here, contains an
> > ISO-Latin-1 character, which shows as circled question mark on an
> > UTF-8 using terminal if you just do a cat a.html. (Can you confirm
> > that for your terminals?)
>
> Ah, so that is presumably why you didn't see the same as me: it was
> garbled in transit.  I'm attaching a gzipped version; hopefully this
> will reach you intact: it should be UTF-8 encoded.

Much better. I can now reproduce this issue.

> And maybe this is what links is then doing: it is trying to
> interpret both bytes of the UTF-8 file separately. (In the context
> in which I was originally using it, the file was a MIME attachment,
> and the MIME headers specified the UTF-8 encoding.)
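
The hypothesis quoted above can be illustrated with a minimal Python sketch. It assumes the non-breaking space is stored in the file as the two UTF-8 bytes 0xC2 0xA0 and that a reader with no declared charset treats each byte as one ISO-8859-1 character. Whether links does exactly this is not confirmed in the thread, and `ascii_approximation` is a hypothetical stand-in for a text-mode dump, not links' own code.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE

# In UTF-8 the non-breaking space is a two-byte sequence.
utf8_bytes = NBSP.encode("utf-8")

# Assumption: with no declared charset, each byte is read as one
# ISO-8859-1 character, so two bytes become two characters.
misread = utf8_bytes.decode("iso-8859-1")


def ascii_approximation(text: str) -> str:
    """Hypothetical stand-in for an ASCII dump: decompose, keep only ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


print(utf8_bytes)                                  # b'\xc2\xa0'
print([unicodedata.name(ch) for ch in misread])    # A WITH CIRCUMFLEX, NO-BREAK SPACE
print(repr(ascii_approximation(misread)))          # 'A '
```

Under these assumptions the first byte turns into an accented capital A, which an ASCII-only rendering reduces to a plain A followed by a space, matching the reported "[A ]".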

Hrm. Indeed. But the issue is gone again if I add the following lines
after <html>:

<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
</head>

> So if links can handle UTF-8 encoded files, it would be very useful to
> also have a command-line flag to specify the encoding.

That's the actual issue. There seems no chance to pass the charset on
the commandline. I'll forward this to upstream.

Regards, Axel
-- 
 ,''`.  |  Axel Beckert a...@debian.org, http://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
  `-|  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-30 Thread Axel Beckert
Control: tag -1 - upstream confirmed

Hi again,

Axel Beckert wrote:
> Control: tag -1 - unreproducible + confirmed upstream
> Control: retitle -1 links: No option to specify charset on command-line
> Control: severity -1 wishlist
[...]
> > So if links can handle UTF-8 encoded files, it would be very useful to
> > also have a command-line flag to specify the encoding.
>
> That's the actual issue. There seems no chance to pass the charset on
> the commandline. I'll forward this to upstream.

I was too quick with replying: There _is_ a commandline switch for that:

  links -dump -html-assume-codepage utf-8 /tmp/a.html

works for me. I'd close the bug report if that works for you, too.

The point was that I initially just looked for "charset" and "encoding",
but you need to look for "codepage". I found it because I started to
look for "iso" and "utf", too.

I can imagine you had the same issue. :-)
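
For scripted use, the same invocation can be wrapped in a few lines of Python. This is only a sketch: the helper names `links_dump_command` and `dump_html` are made up for illustration, the only links options used are the two quoted above, and decoding the output as UTF-8 is an assumption about the terminal setup rather than documented links behaviour.

```python
import shutil
import subprocess


def links_dump_command(path, codepage="utf-8"):
    """Build the argument list for the invocation quoted in this mail."""
    return ["links", "-dump", "-html-assume-codepage", codepage, path]


def dump_html(path, codepage="utf-8"):
    """Run links and return its text output, or None if links is missing."""
    if shutil.which("links") is None:
        return None
    result = subprocess.run(
        links_dump_command(path, codepage),
        capture_output=True,
        check=True,
    )
    # Assumption: the dump is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `dump_html("/tmp/a.html")` would then correspond to the command line shown above.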

Regards, Axel



Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-30 Thread Julian Gilbey
On Wed, Apr 30, 2014 at 09:30:28AM +0200, Axel Beckert wrote:
> Control: tag -1 - upstream confirmed
>
> Hi again,
>
> Axel Beckert wrote:
> > Control: tag -1 - unreproducible + confirmed upstream
> > Control: retitle -1 links: No option to specify charset on command-line
> > Control: severity -1 wishlist
> [...]
> > > So if links can handle UTF-8 encoded files, it would be very useful to
> > > also have a command-line flag to specify the encoding.
> >
> > That's the actual issue. There seems no chance to pass the charset on
> > the commandline. I'll forward this to upstream.
>
> I was too quick with replying: There _is_ a commandline switch for that:
>
>   links -dump -html-assume-codepage utf-8 /tmp/a.html
>
> works for me. I'd close the bug report if that works for you, too.
>
> The point was that I initially just looked for "charset" and "encoding",
> but you need to look for "codepage". I found it because I started to
> look for "iso" and "utf", too.
>
> I can imagine you had the same issue. :-)

Awesome!  That does it, thanks!

Please feel free to close this bug (or to suggest to upstream that
they include the words charset and encoding in the manpage ;-)

   Julian





Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-30 Thread Axel Beckert
Control: retitle -1 links: man-page does not mention encoding or charset near codepage commandline options
Control: severity -1 minor
Control: tag -1 + upstream confirmed

Hi Julian,

Julian Gilbey wrote:
> >   links -dump -html-assume-codepage utf-8 /tmp/a.html
[...]
> Awesome!  That does it, thanks!
>
> Please feel free to close this bug (or to suggest to upstream that
> they include the words charset and encoding in the manpage ;-)

I'll do the latter. Retitling, severity and tag change again. :-)

Regards, Axel



Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-29 Thread Julian Gilbey
Package: links
Version: 2.8-1+b1

The attached document, when dumped with links -dump /tmp/a.html
renders as:

   Hello!
   This is a non-breaking space character:[A ]

The A is extraneous.

   Julian


Hello!
This is a non-breaking space character:[ ]




Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-29 Thread Axel Beckert
Control: tag -1 + unreproducible moreinfo

Hi,

Julian Gilbey wrote:
> The attached document, when dumped with links -dump /tmp/a.html
> renders as:
>
>    Hello!
>    This is a non-breaking space character:[A ]

I'm sorry, I can't reproduce this:

$ links -dump /tmp/a.html
   Hello!
   This is a non-breaking space character:[ ]

Looks perfect to me.

And this despite the HTML code using 8-bit characters without declaring
a character set! (0xa0 is a non-breaking space in ISO-Latin-1 IIRC.)

Please provide more information about e.g. the terminal type you are
using and its character set.

Regards, Axel



Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-29 Thread Julian Gilbey
On Tue, Apr 29, 2014 at 11:00:18PM +0200, Axel Beckert wrote:
> Control: tag -1 + unreproducible moreinfo
>
> Hi,
>
> Julian Gilbey wrote:
> > The attached document, when dumped with links -dump /tmp/a.html
> > renders as:
> >
> >    Hello!
> >    This is a non-breaking space character:[A ]
>
> I'm sorry, I can't reproduce this:
>
> $ links -dump /tmp/a.html
>    Hello!
>    This is a non-breaking space character:[ ]
>
> Looks perfect to me.
>
> And this despite the HTML code using 8-bit characters without declaring
> a character set! (0xa0 is a non-breaking space in ISO-Latin-1 IIRC.)
>
> Please provide more information about e.g. the terminal type you are
> using and its character set.

That's bizarre.

I've tried this in an xterm (xfce4-terminal) and in a console window
(tty1), both with my default locale (en_GB.UTF-8) and in the C locale,
and the same happens with all of these combinations.  I'm not sure
how I would determine the character set I'm using, though.

   Julian





Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-29 Thread Axel Beckert
Control: tag -1 - moreinfo

Hi Julian,

thanks for the prompt feedback!

Julian Gilbey wrote:
> I've tried this in an xterm (xfce4-terminal) and in a console window
> (tty1), both with my default locale (en_GB.UTF-8) and in the C locale,
> and the same happens with all of these combinations.  I'm not sure
> how I would determine the character set I'm using, though.

Sometimes the terminal emulator lets you set this. I used an uxterm
and now also tried xfce4-terminal from Wheezy (and then ssh'ed into
the Sid machine for testing), which both use UTF-8 as character set by
default.

Your file, at least how it arrived by mail here, contains an
ISO-Latin-1 character, which shows as circled question mark on an
UTF-8 using terminal if you just do a cat a.html. (Can you confirm
that for your terminals?)
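
The circled question mark would be consistent with a lone 0xA0 byte reaching a UTF-8 terminal. A minimal Python sketch, assuming the mailed copy contained the single ISO-8859-1 byte 0xA0 (the actual bytes are not shown in this thread):

```python
LATIN1_NBSP = b"\xa0"  # assumed content of the mailed file at that position

# Read as ISO-8859-1, the byte is a valid non-breaking space.
as_latin1 = LATIN1_NBSP.decode("iso-8859-1")

# Read as UTF-8, a lone 0xA0 is not a valid sequence; a lenient decoder
# substitutes U+FFFD REPLACEMENT CHARACTER instead.
as_utf8 = LATIN1_NBSP.decode("utf-8", errors="replace")

print(repr(as_latin1))  # '\xa0'
print(repr(as_utf8))    # '\ufffd'
```

Terminals commonly draw U+FFFD as a question mark inside a diamond or circle, which fits the description above.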

If the A was actually an A with a tilde (e.g. Ã), I could imagine
that it happened on a non-UTF-8 terminal (e.g. an xterm started with
env LANG=C xterm) and links was converting the character to UTF-8
for some reason, but I wasn't able to reproduce it in such a setup,
even though my prompt, which contains UTF-8 characters, then showed a
lower-case a with a tilde (ã).

So even this did not reproduce it for me:

env LANG=C xterm → ssh otherhost → env LANG=C links -dump /tmp/a.html

Regards, Axel



Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A

2014-04-29 Thread Julian Gilbey
On Wed, Apr 30, 2014 at 12:11:22AM +0200, Axel Beckert wrote:
> Control: tag -1 - moreinfo
>
> Hi Julian,
>
> thanks for the prompt feedback!

And yours! :-)

> Julian Gilbey wrote:
> > I've tried this in an xterm (xfce4-terminal) and in a console window
> > (tty1), both with my default locale (en_GB.UTF-8) and in the C locale,
> > and the same happens with all of these combinations.  I'm not sure
> > how I would determine the character set I'm using, though.
>
> Sometimes the terminal emulator lets you set this. I used an uxterm
> and now also tried xfce4-terminal from Wheezy (and then ssh'ed into
> the Sid machine for testing), which both use UTF-8 as character set by
> default.
>
> Your file, at least how it arrived by mail here, contains an
> ISO-Latin-1 character, which shows as circled question mark on an
> UTF-8 using terminal if you just do a cat a.html. (Can you confirm
> that for your terminals?)

Ah, so that is presumably why you didn't see the same as me: it was
garbled in transit.  I'm attaching a gzipped version; hopefully this
will reach you intact: it should be UTF-8 encoded.  And maybe this is
what links is then doing: it is trying to interpret both bytes of the
UTF-8 file separately.  (In the context in which I was originally
using it, the file was a MIME attachment, and the MIME headers
specified the UTF-8 encoding.)

So if links can handle UTF-8 encoded files, it would be very useful to
also have a command-line flag to specify the encoding.

   Julian


a.html.gz
Description: Binary data