Bug#310495: par: Does not handle UTF-8 multibyte characters properly
Hello,

I received a response from the upstream author indicating that he is working on a new version of par that will handle UTF-8.

Regards,

Kapil.
Bug#310495: par: Does not handle UTF-8 multibyte characters properly
tag 310495 forwarded wishlist
thanks

On Thu, 19 Jan 2006, Teemu Likonen wrote:
> I don't mind moving it to the wishlist. I don't use par anymore - I can't. But, this is kind of becoming a bug because Linux distributions have moved towards UTF-8 locales and there aren't many languages that can be written with ASCII codes 0 - $7f. In Unicode's UTF-8 encoding all the other codes ($80 - $10FFFF) need 2 to 4 bytes. So, as par is mainly for reformatting text in human languages, it has become pretty useless now that Unicode and UTF-8 have come.

I agree that migrating programs from plain char handling to UTF-8 character handling would be a good thing. So I sent the enclosed mail to the upstream author. However, I don't hold out too much hope, since the upstream package has not changed in about four years!

Regards,

Kapil.

-- message sent to upstream author --

Hello,

Thanks for the program par, which I have been using for a while now via its Debian package.

Recently a bug was filed against the Debian package of par for its inability to handle multi-byte characters (see http://bugs.debian.org/310495 for details). Specifically, in the UTF-8 encoding several bytes can make up a single Unicode character, and this throws off par's count of word lengths while right-justifying text.

Do you have plans to incorporate the handling of such text in par at some point? Alternatively, do you think that this is a feasible/worthwhile addition for someone else to work on (I might be able to find someone, maybe even myself)?

Thanks and regards,

Kapil.
Bug#310495: par: Does not handle UTF-8 multibyte characters properly
Hello,

On Wednesday 18 January 2006 13:58, you wrote:
> I have not been able to find any program that does UTF-8 multibyte character left and right justification for text files.

I have not either, sorry.

> If you can point me to some source where I can find information on how this can be handled then perhaps I can try to figure out a patch to fix this.

I'm not a programmer, but I guess one just has to understand how UTF-8 encoding works. The old way was to count strings byte by byte, but that doesn't work anymore with UTF-8. The manual page UTF-8(7) is probably a good start, and of course there are the Unicode Consortium's official definitions:

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

> According to the release notes par is OK with 8-bit characters but not multibyte so this is not a bug in the program vis-a-vis its documentation. Would it be OK with you if this bug was downgraded to wishlist?

I don't mind moving it to the wishlist. I don't use par anymore - I can't. But, this is kind of becoming a bug because Linux distributions have moved towards UTF-8 locales and there aren't many languages that can be written with ASCII codes 0 - $7f. In Unicode's UTF-8 encoding all the other codes ($80 - $10FFFF) need 2 to 4 bytes. So, as par is mainly for reformatting text in human languages, it has become pretty useless now that Unicode and UTF-8 have come.

> Thanks and regards,

Thank you too. A UTF-8 patch would be really nice.

- TL
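The point about byte-by-byte counting can be made concrete with a small sketch. This is only an illustration in Python, not code from par (which is a C program); the helper name `utf8_len` and the sample word are chosen here for the example. It shows that characters outside the ASCII range take 2 to 4 bytes in UTF-8, so a byte count overstates the length of a word containing them.

```python
import unicodedata


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character per UTF-8 length class.
for ch in ["a", "\u00e4", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")

# A word from the test sentence in the report, normalized so that each
# "ä" is a single precomposed code point (an assumption of this sketch).
word = unicodedata.normalize("NFC", "tyhjänpäiväinen")

print(len(word))                  # number of code points
print(len(word.encode("utf-8")))  # number of bytes, larger than the above
```

A formatter that measures this word in bytes would treat it as three units longer than it appears on screen, one extra unit per "ä".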
Bug#310495: par: Does not handle UTF-8 multibyte characters properly
Hello,

On Tue, 24 May 2005, Teemu Likonen wrote:
> Package: par
> Version: 1.51-1
> Severity: important
>
> Par does not handle UTF-8 multibyte characters properly. To demonstrate this problem I made a UTF-8 encoded text file of nonsense Finnish text with some multibyte characters (ä's and ö's).

I am not maintaining par (yet) but just trying to help out :)

I have not been able to find any program that does UTF-8 multibyte character left and right justification for text files.

If you can point me to some source where I can find information on how this can be handled then perhaps I can try to figure out a patch to fix this. Meanwhile, the upstream source for par has not changed in a while, so such a fix will probably have to be made for the Debian version only.

According to the release notes par is OK with 8-bit characters but not multibyte so this is not a bug in the program vis-a-vis its documentation. Would it be OK with you if this bug was downgraded to wishlist?

Thanks and regards,

Kapil.
Bug#310495: par: Does not handle UTF-8 multibyte characters properly
Package: par
Version: 1.51-1
Severity: important

Par does not handle UTF-8 multibyte characters properly. To demonstrate this problem I made a UTF-8 encoded text file of nonsense Finnish text with some multibyte characters (ä's and ö's).

$ cat teksti.txt | par j
Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa.

If I convert this text to ISO-8859-15, use par to justify it, and then convert it back to UTF-8 (just to make it display correctly), I get correct justification. This works because every character then takes one byte.

$ cat teksti.txt | iconv -f utf-8 -t iso-8859-15 | par j | iconv -f \
  iso-8859-15 -t utf-8
Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa.

I think this problem affects not only left-right justification but every case where par calculates the lengths of words and lines that contain UTF-8 multibyte characters. Therefore even the left justification that I mostly use does not quite work as it should (although it's still usable).

The conversion I showed above may be a workaround of some kind, but of course not all Unicode characters can be converted to an 8-bit charset. So I guess this can be counted as a bug.

Thanks

- TL

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (850, 'testing'), (800, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=en_IE.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages par depends on:
ii  libc6  2.3.2.ds1-21  GNU C Library: Shared libraries an

-- no debconf information
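The effect described in the report, where lines come out too short when lengths are computed on UTF-8 bytes, can be reproduced with a toy justifier. This is a minimal sketch under stated assumptions, not par's actual algorithm: the functions `justify` and `byte_len` are hypothetical names invented here, and the sketch assumes every code point occupies one display column (true for the Finnish sample, not for wide or combining characters).

```python
def byte_len(s: str) -> int:
    """Length of s in UTF-8 bytes, the way a byte-oriented tool sees it."""
    return len(s.encode("utf-8"))


def justify(words, width, measure=len):
    """Spread spaces between words so the line is `width` units long.

    `measure` decides what a unit is: code points (len) or bytes (byte_len).
    This is a toy model for illustration only.
    """
    if len(words) < 2:
        return " ".join(words)
    gaps = len(words) - 1
    text_len = sum(measure(w) for w in words)
    spaces = max(width - text_len, gaps)  # at least one space per gap
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, w in enumerate(words[:-1]):
        # The first `extra` gaps get one additional space.
        parts.append(w + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


words = "kuinka ääkköset toimivat".split()

by_chars = justify(words, 30)            # measured in code points
by_bytes = justify(words, 30, byte_len)  # measured in bytes

print(repr(by_chars), len(by_chars))  # 30 code points wide
print(repr(by_bytes), len(by_bytes))  # 30 bytes, but fewer than 30 columns
```

Measuring in bytes fills the line to 30 bytes, but the multibyte letters make it fewer than 30 characters wide, so its right edge no longer lines up with lines that contain only ASCII. Converting to ISO-8859-15 first, as in the workaround above, sidesteps this because bytes and characters then coincide.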