Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Control: tag -1 - unreproducible + confirmed upstream
Control: retitle -1 links: No option to specify charset on command-line
Control: severity -1 wishlist

Hi Julian,

Julian Gilbey wrote:
> > Your file, at least how it arrived by mail here, contains an ISO-Latin-1 character, which shows as circled question mark on an UTF-8 using terminal if you just do a cat a.html. (Can you confirm that for your terminals?)
>
> Ah, so that is presumably why you didn't see the same as me: it was garbled in transit. I'm attaching a gzipped version; hopefully this will reach you intact: it should be UTF-8 encoded.

Much better. I can now reproduce this issue.

> And maybe this is what links is then doing: it is trying to interpret both bytes of the UTF-8 file separately. (In the context in which I was originally using it, the file was a MIME attachment, and the MIME headers specified the UTF-8 encoding.)

Hrm. Indeed. But the issue is gone again if I add the following lines after <html>:

  <head>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
  </head>

> So if links can handle UTF-8 encoded files, it would be very useful to also have a command-line flag to specify the encoding.

That's the actual issue. There seems to be no way to pass the charset on the command line. I'll forward this to upstream.

Regards, Axel
--
 ,''`.  |  Axel Beckert a...@debian.org, http://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE
  `-    |  4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5

-- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Control: tag -1 - upstream confirmed

Hi again,

Axel Beckert wrote:
> Control: tag -1 - unreproducible + confirmed upstream
> Control: retitle -1 links: No option to specify charset on command-line
> Control: severity -1 wishlist
[...]
> > So if links can handle UTF-8 encoded files, it would be very useful to also have a command-line flag to specify the encoding.
>
> That's the actual issue. There seems no chance to pass the charset on the commandline. I'll forward this to upstream.

I was too quick with replying: There _is_ a command-line switch for that:

  links -dump -html-assume-codepage utf-8 /tmp/a.html

works for me. I'd close the bug report if that works for you, too.

The point is that I initially just looked for "charset" and "encoding", but you need to look for "codepage". I found it because I started to look for "iso" and "utf", too. I can imagine you had the same issue. :-)

Regards, Axel
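For anyone scripting this, the invocation above can be wrapped in a few lines of Python. This is only a sketch: the helper name build_dump_command is made up for illustration, and the only links options used are the two quoted in the message (-dump and -html-assume-codepage). The command is run only if a links binary and the input file are actually present.

```python
import os
import shutil
import subprocess


def build_dump_command(path, codepage="utf-8"):
    """Return the argv list for the links call quoted in the thread.

    Hypothetical helper; only -dump and -html-assume-codepage are used,
    exactly as they appear in the message above.
    """
    return ["links", "-dump", "-html-assume-codepage", codepage, path]


cmd = build_dump_command("/tmp/a.html")
print(" ".join(cmd))

# Run the command only when links and the test file both exist.
if shutil.which("links") is not None and os.path.exists("/tmp/a.html"):
    result = subprocess.run(cmd, capture_output=True)
    print(result.stdout.decode("utf-8", errors="replace"))
```

Passing the arguments as a list avoids shell quoting problems if the file name contains spaces.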
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
On Wed, Apr 30, 2014 at 09:30:28AM +0200, Axel Beckert wrote:
> Control: tag -1 - upstream confirmed
>
> Hi again,
>
> Axel Beckert wrote:
> > Control: tag -1 - unreproducible + confirmed upstream
> > Control: retitle -1 links: No option to specify charset on command-line
> > Control: severity -1 wishlist
> [...]
> > > So if links can handle UTF-8 encoded files, it would be very useful to also have a command-line flag to specify the encoding.
> >
> > That's the actual issue. There seems no chance to pass the charset on the commandline. I'll forward this to upstream.
>
> I was too quick with replying: There _is_ a command-line switch for that:
>
>   links -dump -html-assume-codepage utf-8 /tmp/a.html
>
> works for me. I'd close the bug report if that works for you, too.
>
> Point was that I initially just looked for "charset" and "encoding", but you need to look for "codepage". I found it because I started to look for "iso" and "utf", too. I can imagine you had the same issue. :-)

Awesome! That does it, thanks! Please feel free to close this bug (or to suggest to upstream that they include the words "charset" and "encoding" in the manpage ;-)

   Julian
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Control: retitle -1 links: man-page does not mention encoding or charset near codepage commandline options
Control: severity -1 minor
Control: tag -1 + upstream confirmed

Hi Julian,

Julian Gilbey wrote:
> > links -dump -html-assume-codepage utf-8 /tmp/a.html
> [...]
> Awesome! That does it, thanks! Please feel free to close this bug (or to suggest to upstream that they include the words "charset" and "encoding" in the manpage ;-)

I'll do the latter. Retitling, severity and tag change again. :-)

Regards, Axel
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Package: links
Version: 2.8-1+b1

The attached document, when dumped with

  links -dump /tmp/a.html

renders as:

  Hello! This is a non-breaking space character:[A ]

The "A" is extraneous.

   Julian

Hello! This is a non-breaking space character:[ ]
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Control: tag -1 + unreproducible moreinfo

Hi,

Julian Gilbey wrote:
> The attached document, when dumped with links -dump /tmp/a.html renders as:
>
>   Hello! This is a non-breaking space character:[A ]

I'm sorry, I can't reproduce this:

  $ links -dump /tmp/a.html
  Hello! This is a non-breaking space character:[ ]

Looks perfect to me. And this even though the HTML code uses 8-bit characters without declaring a character set! (0xa0 is a non-breaking space in ISO-Latin-1 IIRC.)

Please provide more information about e.g. the terminal type you are using and its character set.

Regards, Axel
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
On Tue, Apr 29, 2014 at 11:00:18PM +0200, Axel Beckert wrote:
> Control: tag -1 + unreproducible moreinfo
>
> Hi,
>
> Julian Gilbey wrote:
> > The attached document, when dumped with links -dump /tmp/a.html renders as:
> >
> >   Hello! This is a non-breaking space character:[A ]
>
> I'm sorry, I can't reproduce this:
>
>   $ links -dump /tmp/a.html
>   Hello! This is a non-breaking space character:[ ]
>
> Looks perfect to me. And this despite the HTML code uses 8-bit characters without declaring a character set! (0xa0 is a non-breaking space in ISO-Latin-1 IIRC.)
>
> Please provide more information about e.g. the terminal type you are using and its character set.

That's bizarre. I've tried this in an xterm (xfce4-terminal) and in a console window (tty1), both with my default locale (en_GB.UTF-8) and in the C locale, and the same happens with all of these combinations. I'm not sure how I would determine the character set I'm using, though.

   Julian
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
Control: tag -1 - moreinfo

Hi Julian,

thanks for the prompt feedback!

Julian Gilbey wrote:
> I've tried this in an xterm (xfce4-terminal) and in a console window (tty1), both with my default locale (en_GB.UTF-8) and in the C locale, and the same happens with all of these combinations. I'm not sure how I would determine the character set I'm using, though.

Sometimes the terminal emulator lets you set this. I used a uxterm and have now also tried xfce4-terminal from Wheezy (and then ssh'ed into the Sid machine for testing); both use UTF-8 as their character set by default.

Your file, at least as it arrived here by mail, contains an ISO-Latin-1 character, which shows as a circled question mark on a UTF-8 terminal if you just do a cat a.html. (Can you confirm that for your terminals?)

If the A was actually an A with a tilde (e.g. Ã), I could imagine that it happened on a non-UTF-8 terminal (e.g. an xterm started with env LANG=C xterm) and that links was converting the character to UTF-8 for some reason. But I wasn't able to reproduce it in such a setup, even though my prompt, which contains UTF-8, then showed a lower-case a with a tilde (ã).

So even this did not reproduce it for me:

  env LANG=C xterm → ssh otherhost → env LANG=C links -dump /tmp/a.html

Regards, Axel
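The circled question mark described above matches how a lone 0xA0 byte behaves under the two encodings in question. The following Python sketch is purely illustrative: it checks the byte with Python's own codecs and says nothing about what links or any particular terminal does internally.

```python
# A lone 0xA0 byte: valid in ISO-8859-1 (Latin-1), invalid on its own in UTF-8.
raw = bytes([0xA0])

# In ISO-8859-1 the byte maps directly to U+00A0 NO-BREAK SPACE.
as_latin1 = raw.decode("iso-8859-1")
print(f"ISO-8859-1: U+{ord(as_latin1):04X}")

# Strict UTF-8 decoding rejects the byte, because 0xA0 can only appear
# as a continuation byte inside a multi-byte sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("valid UTF-8 on its own:", strict_ok)

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER, which many
# terminals draw as a question mark inside a diamond or circle.
as_utf8_lenient = raw.decode("utf-8", errors="replace")
print(f"UTF-8 (lenient): U+{ord(as_utf8_lenient):04X}")
```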
Bug#746386: links: incorrectly renders non-breaking space char (0xa0) as A
On Wed, Apr 30, 2014 at 12:11:22AM +0200, Axel Beckert wrote:
> Control: tag -1 - moreinfo
>
> Hi Julian,
>
> thanks for the prompt feedback!

And yours! :-)

> Julian Gilbey wrote:
> > I've tried this in an xterm (xfce4-terminal) and in a console window (tty1), both with my default locale (en_GB.UTF-8) and in the C locale, and the same happens with all of these combinations. I'm not sure how I would determine the character set I'm using, though.
>
> Sometimes the terminal emulator lets you set this. I used an uxterm and now also tried xfce4-terminal from Wheezy (and then ssh'ed into the Sid machine for testing), which both use UTF-8 as character set by default.
>
> Your file, at least how it arrived by mail here, contains an ISO-Latin-1 character, which shows as circled question mark on an UTF-8 using terminal if you just do a cat a.html. (Can you confirm that for your terminals?)

Ah, so that is presumably why you didn't see the same as me: it was garbled in transit. I'm attaching a gzipped version; hopefully this will reach you intact: it should be UTF-8 encoded.

And maybe this is what links is then doing: it is trying to interpret both bytes of the UTF-8 file separately. (In the context in which I was originally using it, the file was a MIME attachment, and the MIME headers specified the UTF-8 encoding.)

So if links can handle UTF-8 encoded files, it would be very useful to also have a command-line flag to specify the encoding.

   Julian

a.html.gz
Description: Binary data
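Julian's guess, that the two bytes of the UTF-8 sequence are interpreted separately, can be illustrated with a short Python sketch. It shows only the byte-level effect. That links falls back to a single-byte codepage such as ISO-8859-1, and that the resulting accented letter is then reduced to a plain "A" in the dump, are assumptions; the thread does not confirm either. The NFD step below is just one way such a reduction could come about, not a description of the links source.

```python
import unicodedata

# U+00A0 NO-BREAK SPACE takes two bytes in UTF-8.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # the two-byte sequence 0xC2 0xA0

# Reading those two bytes as ISO-8859-1 yields two separate characters.
misread = utf8_bytes.decode("iso-8859-1")
for ch in misread:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Illustration only: decomposing the first character and keeping its base
# letter gives a plain "A", which would match the reported output "[A ]".
base_letter = unicodedata.normalize("NFD", misread[0])[0]
print(base_letter)
```

Under that reading the first byte (0xC2) becomes an accented capital A and the second (0xA0) is still a non-breaking space, which fits the extra letter in front of the space in the original report.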