Bug#310495: par: Does not handle UTF-8 multibyte characters properly

2006-02-07 Thread Kapil Hari Paranjape
Hello,

I received a response from the upstream author indicating
that he is working on a new version of par that will handle UTF-8.

Regards,

Kapil.
-- 





Bug#310495: par: Does not handle UTF-8 multibyte characters properly

2006-01-20 Thread Kapil Hari Paranjape
tag 310495 forwarded wishlist
thanks

On Thu, 19 Jan 2006, Teemu Likonen wrote:
 I don't mind moving it to the wishlist. I don't use par anymore - I
 can't. But, this is kind of becoming a bug because Linux distributions
 have moved towards UTF-8 locales and there aren't many languages that
 can be written with ASCII codes 0 - $7f. In Unicode's UTF-8 encoding
 all the other codes ($80 - $10ffff) need 2 to 4 bytes. So, as par is
 mainly for reformatting text in human languages, it has become pretty
 useless now that Unicode and UTF-8 have arrived.

I agree that migrating programs from plain char handling to
UTF-8-aware character handling would be a good thing.

So I sent the enclosed mail to the upstream author. However, I don't
hold out too much hope since the upstream packages have not changed in
about four years!

Regards,

Kapil.
--

message sent to upstream author
Hello,
 
Thanks for the program par which I have been using for a while now
via its Debian package. Recently a bug has been filed against the
Debian package of par for its inability to handle multi-byte
characters (http://bugs.debian.org/310495 for details). Specifically,
in UTF-8 encoding several bytes can make up a single Unicode character and
this disturbs par's count of word length while right-justifying
text.

Do you have plans to incorporate the handling of such text in par
at some point? Alternatively, do you think that this is a
feasible/worthwhile addition for someone else to work on (I might be
able to find someone---maybe even myself)?
 
Thanks and regards,
 
Kapil.
--

===




Bug#310495: par: Does not handle UTF-8 multibyte characters properly

2006-01-19 Thread Teemu Likonen
Hello,

On Wednesday 18 January 2006 13:58, you wrote:
 I have not been able to find any program that does UTF-8 multibyte
 character left and right justification for text files.

I have not either, sorry.

 If you can 
 point me to some source where I can find information on how this can
 be handled then perhaps I can try to figure out a patch to fix this.

I'm not a programmer but I guess one just has to understand how UTF-8
encoding works. The old way was to measure strings byte by byte, but
that no longer works with UTF-8. The manual page utf-8(7) is probably a
good start, and of course there are the Unicode Consortium's official
definitions:

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
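A minimal sketch of the difference between counting bytes and counting characters, assuming the input is valid UTF-8. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so counting only the bytes that do not match that pattern gives the number of encoded characters. The function name utf8_char_count is made up for this illustration and is not part of par.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Continuation bytes look like 10xxxxxx, so every byte that does
    not match that pattern starts a new character.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# Sample word chosen for illustration: it contains three two-byte characters.
word = "tyhjänpäiväinen".encode("utf-8")
print(len(word))              # number of bytes
print(utf8_char_count(word))  # number of characters
```

A byte-oriented program sees the first number; a reader looking at the screen sees the second.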

 According to the release notes par is OK with 8-bit characters but
 not multibyte so this is not a bug in the program vis-a-vis its
 documentation. Would it be OK with you if this bug was downgraded to
 wishlist?

I don't mind moving it to the wishlist. I don't use par anymore - I
can't. But, this is kind of becoming a bug because Linux distributions
have moved towards UTF-8 locales and there aren't many languages that
can be written with ASCII codes 0 - $7f. In Unicode's UTF-8 encoding
all the other codes ($80 - $10ffff) need 2 to 4 bytes. So, as par is
mainly for reformatting text in human languages, it has become pretty
useless now that Unicode and UTF-8 have arrived.
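As a rough illustration of the byte counts mentioned above, the following sketch maps a Unicode code point to the number of bytes its UTF-8 encoding needs. The function name utf8_length is hypothetical, and the sketch ignores the surrogate range, which has no UTF-8 encoding.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1  # ASCII
    if code_point < 0x800:
        return 2  # includes the Latin letters with diacritics
    if code_point < 0x10000:
        return 3  # rest of the Basic Multilingual Plane
    if code_point <= 0x10FFFF:
        return 4  # supplementary planes
    raise ValueError("not a Unicode code point")


for ch in ("a", "ä", "€", "\U0001F600"):
    # Cross-check the table against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} needs {utf8_length(ord(ch))} byte(s)")
```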

 Thanks and regards,

Thank you too. A UTF-8 patch would be really nice.

 - TL




Bug#310495: par: Does not handle UTF-8 multibyte characters properly

2006-01-18 Thread Kapil Hari Paranjape
Hello,

On Tue, 24 May 2005, Teemu Likonen wrote:
 Package: par
 Version: 1.51-1
 Severity: important
 
 
 Par does not handle UTF-8 multibyte characters properly. To demonstrate
 this problem I made a UTF-8 encoded text file of nonsense Finnish text
 with some multibyte characters (ä's and ö's).

I am not maintaining par (yet) but just trying to help out :)

I have not been able to find any program that does UTF-8 multibyte
character left and right justification for text files. If you can
point me to some source where I can find information on how this can
be handled then perhaps I can try to figure out a patch to fix this.

Meanwhile, the upstream source for par has not changed in a while so
probably such a fix will have to be made for the Debian version only.

According to the release notes par is OK with 8-bit characters but
not multibyte so this is not a bug in the program vis-a-vis its
documentation. Would it be OK with you if this bug was downgraded to
wishlist?

Thanks and regards,

Kapil.
--





Bug#310495: par: Does not handle UTF-8 multibyte characters properly

2005-05-23 Thread Teemu Likonen
Package: par
Version: 1.51-1
Severity: important


Par does not handle UTF-8 multibyte characters properly. To demonstrate
this problem I made a UTF-8 encoded text file of nonsense Finnish text
with some multibyte characters (ä's and ö's).

$ cat teksti.txt | par j

Tässä   tyhjänpäiväinen  virke,   jonka   avulla  testaan,   kuinka
ääkköset  ja UTF-8  -koodaus  toimivat par-työkalun  kanssa. Tässä
tyhjänpäiväinen virke,  jonka avulla  testaan, kuinka  ääkköset ja
UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen
virke,  jonka  avulla  testaan,  kuinka ääkköset  ja  UTF-8  -koodaus
toimivat par-työkalun  kanssa. Tässä tyhjänpäiväinen  virke, jonka
avulla   testaan,  kuinka   ääkköset  ja   UTF-8  -koodaus   toimivat
par-työkalun  kanssa. Tässä  tyhjänpäiväinen virke,  jonka  avulla
testaan,  kuinka ääkköset  ja UTF-8  -koodaus toimivat  par-työkalun
kanssa.  Tässä tyhjänpäiväinen virke,  jonka avulla testaan, kuinka
ääkköset  ja UTF-8  -koodaus  toimivat par-työkalun  kanssa. Tässä
tyhjänpäiväinen virke,  jonka avulla  testaan, kuinka  ääkköset ja
UTF-8 -koodaus toimivat par-työkalun kanssa. Tässä tyhjänpäiväinen
virke,  jonka  avulla  testaan,  kuinka ääkköset  ja  UTF-8  -koodaus
toimivat par-työkalun  kanssa. Tässä tyhjänpäiväinen  virke, jonka
avulla   testaan,  kuinka   ääkköset  ja   UTF-8  -koodaus   toimivat
par-työkalun  kanssa. Tässä  tyhjänpäiväinen virke,  jonka  avulla
testaan,  kuinka ääkköset  ja UTF-8  -koodaus toimivat  par-työkalun
kanssa.  Tässä tyhjänpäiväinen virke,  jonka avulla testaan, kuinka
ääkköset ja UTF-8 -koodaus toimivat par-työkalun kanssa.
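The ragged right edge above is what one would expect when padding is computed from byte counts. The sketch below is not par's actual algorithm; it is a simple space-distributing justifier, written only to show the effect of the length function. The Finnish words are taken to be the first line of the test text with the ä characters spelled out, which is an assumption, as are the names justify and byte_length.

```python
def justify(words, width, length=len):
    """Pad the gaps between words until the line measures `width`
    according to `length` (characters by default)."""
    gaps = len(words) - 1
    if gaps < 1:
        return " ".join(words)
    spaces = width - sum(length(w) for w in words)
    if spaces < gaps:
        raise ValueError("words do not fit in the requested width")
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word)
        # The leftmost gaps absorb the remainder, one extra space each.
        parts.append(" " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def byte_length(word):
    """Measure a word the way a byte-oriented program would."""
    return len(word.encode("utf-8"))


words = "Tässä tyhjänpäiväinen virke, jonka avulla testaan, kuinka".split()
by_bytes = justify(words, 72, byte_length)
by_chars = justify(words, 72)
print(len(by_bytes), len(by_chars))  # prints "67 72"
```

The byte-based line is 72 bytes long but only 67 characters wide, so it ends five columns short of the margin. Note that len() counts code points; combining marks and double-width characters would need a real display-width function.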


If I convert this text to ISO-8859-15, use par to justify it, and then
convert it back to UTF-8 (just to make it show right), I get correct
justification. This is because all characters take one byte.

$ cat teksti.txt | iconv -f utf-8 -t iso-8859-15 | par j | iconv -f \
iso-8859-15 -t utf-8

Tässä  tyhjänpäiväinen  virke,  jonka avulla  testaan,  kuinka  ääkköset
ja  UTF-8 -koodaus  toimivat par-työkalun  kanssa. Tässä tyhjänpäiväinen
virke, jonka avulla testaan, kuinka  ääkköset ja UTF-8 -koodaus toimivat
par-työkalun kanssa. Tässä tyhjänpäiväinen  virke, jonka avulla testaan,
kuinka ääkköset  ja UTF-8  -koodaus toimivat  par-työkalun kanssa. Tässä
tyhjänpäiväinen virke,  jonka avulla  testaan, kuinka ääkköset  ja UTF-8
-koodaus  toimivat  par-työkalun  kanssa. Tässä  tyhjänpäiväinen  virke,
jonka  avulla  testaan,  kuinka  ääkköset  ja  UTF-8  -koodaus  toimivat
par-työkalun kanssa.  Tässä tyhjänpäiväinen virke, jonka avulla testaan,
kuinka ääkköset  ja UTF-8  -koodaus toimivat  par-työkalun kanssa. Tässä
tyhjänpäiväinen virke,  jonka avulla  testaan, kuinka ääkköset  ja UTF-8
-koodaus  toimivat  par-työkalun  kanssa. Tässä  tyhjänpäiväinen  virke,
jonka  avulla  testaan,  kuinka  ääkköset  ja  UTF-8  -koodaus  toimivat
par-työkalun kanssa. Tässä tyhjänpäiväinen  virke, jonka avulla testaan,
kuinka ääkköset  ja UTF-8  -koodaus toimivat  par-työkalun kanssa. Tässä
tyhjänpäiväinen virke,  jonka avulla  testaan, kuinka ääkköset  ja UTF-8
-koodaus  toimivat par-työkalun  kanssa.   Tässä tyhjänpäiväinen  virke,
jonka  avulla  testaan,  kuinka  ääkköset  ja  UTF-8  -koodaus  toimivat
par-työkalun kanssa.


I think this problem is not limited to left-right justification but
arises whenever par calculates the lengths of words and lines
which contain UTF-8 multibyte characters. Therefore even the left
justification that I mostly use does not quite work as it should
(although it's still usable). The conversion I showed above may be a
workaround of some kind, but of course not all Unicode characters can
be converted to an 8-bit charset. So I guess this can be counted as a
bug.
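The limit of the conversion workaround can be checked before running it. This is a small sketch, assuming Python's built-in iso-8859-15 codec as a stand-in for the iconv step; the function name fits_in_charset and the sample strings are made up for illustration.

```python
def fits_in_charset(text, charset="iso-8859-15"):
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


print(fits_in_charset("ääkköset"))  # True: ä and ö exist in ISO-8859-15
print(fits_in_charset("текст"))     # False: Cyrillic letters do not

# In the 8-bit charset one character is one byte, which is why the
# round trip gives a byte-counting program the right lengths.
assert len("ääkköset".encode("iso-8859-15")) == len("ääkköset")
```

Text for which the check returns False cannot take the round trip without losing characters, so it needs a justifier that measures characters itself.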

Thanks

 - TL


-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (850, 'testing'), (800, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=en_IE.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages par depends on:
ii  libc6   2.3.2.ds1-21 GNU C Library: Shared libraries an

-- no debconf information

