python 3 and Unicode line breaking

2011-01-13 Thread leoboiko
Hi,

Is there an equivalent to the textwrap module that knows about the
Unicode line breaking algorithm (UAX #14,
http://unicode.org/reports/tr14/)?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: python 3 and Unicode line breaking

2011-01-14 Thread leoboiko
Of course I searched for one and couldn’t find one; that goes without
saying.  Otherwise I wouldn’t have bothered writing a message, would
I?  I disagree that people should clutter their messages with details
about how they failed to find information, as that is unrelated to the
question at hand and serves only to pollute people’s mailboxes.

I also see no reason to reply to a simple question with such
discourtesy, and cannot understand why someone would be so aggressive
to a stranger.


Re: python 3 and Unicode line breaking

2011-01-14 Thread leoboiko
On Jan 14, 11:48 am, Stefan Behnel wrote:
> Sadly, the OP did not clearly state that the required feature
> is really not supported by "textwrap" and in what way textwrap
> behaves differently. That would have helped in answering.

Oh, textwrap doesn’t work for arbitrary Unicode text at all.  For
example, it separates combining sequences:

>>> import textwrap, unicodedata
>>> s = unicodedata.normalize("NFC", "tiếng Việt") # precomposed
>>> len(s)
10
>>> s = unicodedata.normalize("NFD", s) # combining
>>> len(s) # number of Unicode characters; ≠ line length
14
>>> print(textwrap.fill(s, width=4)) # breaks sequences
tiê
ng
Viê
t
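
A minimal sketch of one way to keep combining sequences intact, assuming it is good enough to treat a base character plus the combining marks that follow it as one unit (this is only an approximation, not the full Unicode grapheme cluster rules). The helper names `clusters` and `hard_wrap` are hypothetical and not part of textwrap:

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # unicodedata.combining() is non-zero for most combining marks
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def hard_wrap(text, width):
    """Cut text every `width` clusters, never inside a cluster."""
    units = clusters(text)
    return ["".join(units[i:i + width]) for i in range(0, len(units), width)]
```

With the decomposed string above, `clusters(s)` has 10 items even though `len(s)` is 14, so no line produced by `hard_wrap` starts with an orphaned accent.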

It also doesn’t know about double-width characters:

>>> s1 = "日本語のテキト"
>>> s2 = "12345678901234" # both s1 and s2 use 14 columns
>>> print(textwrap.fill(s1, width=7))
日本語のテキト
>>> print(textwrap.fill(s2, width=7))
1234567
8901234
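
As a rough illustration of what "columns" means here, the following sketch estimates display width from the East Asian Width property exposed by the stdlib `unicodedata` module. The function name `display_width` is hypothetical, and the rule used (wide and fullwidth characters take two columns, combining marks take none, everything else takes one) is a simplifying assumption that ignores ambiguous-width characters and control characters:

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        # 'W' (wide) and 'F' (fullwidth) characters take two columns
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

Under these assumptions `display_width(s1)` and `display_width(s2)` both come out as 14, whereas `len(s1)` is only 7, which is why textwrap believes s1 fits in 7 columns.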

It doesn’t know about non-ASCII punctuation:

>>> print(textwrap.fill("abc-def", width=5)) # ASCII hyphen-minus U+002D
abc-
def
>>> print(textwrap.fill("abc‐def", width=5)) # true hyphen U+2010
abc‐d
ef
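
A small sketch of the behaviour I would expect instead, assuming we only care about a single unspaced word and only about two hyphen characters (U+002D and U+2010). UAX #14 defines many more break opportunities than this. The function name `fill_hyphen_aware` is hypothetical:

```python
import re

# A chunk is a run of non-hyphens plus an optional trailing hyphen,
# so a break is only ever offered *after* a hyphen.
_CHUNK_RE = re.compile(r"[^\-\u2010]+[\-\u2010]?|[\-\u2010]")

def fill_hyphen_aware(word, width):
    """Greedily pack hyphen-delimited chunks into lines of at most `width`."""
    lines = []
    current = ""
    for chunk in _CHUNK_RE.findall(word):
        if current and len(current) + len(chunk) > width:
            lines.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        lines.append(current)
    return "\n".join(lines)
```

With this, "abc‐def" at width 5 breaks after the U+2010 hyphen, the same way the ASCII version does. A chunk longer than the width is left unbroken in this sketch.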

It doesn’t know East Asian filling rules (though this is
perhaps pushing it a bit beyond textwrap’s goals):

>>> print(textwrap.fill("日本語、中国語", width=3))
日本語
、中国 # should avoid linebreak before CJK punctuation
語
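
To make the expected behaviour concrete, here is a minimal sketch of one such rule: never start a line with closing punctuation, and when that would happen, push the preceding character down to the next line with it. The set `NO_LINE_START` is a small hand-picked sample rather than the real UAX #14 classes, the function name `wrap_cjk` is hypothetical, and width is counted in characters to match the example above:

```python
# Illustrative sample only; the real algorithm uses line-breaking classes.
NO_LINE_START = set("、。，．）」』】")

def wrap_cjk(text, width):
    """Break after every `width` characters, but not before closing punctuation."""
    lines = []
    current = ""
    for ch in text:
        if len(current) < width:
            current += ch
        elif ch in NO_LINE_START and len(current) > 1:
            # Move the last character down so `ch` does not start a line.
            lines.append(current[:-1])
            current = current[-1] + ch
        else:
            lines.append(current)
            current = ch
    if current:
        lines.append(current)
    return lines
```

For "日本語、中国語" at width 3 this gives 日本 / 語、中 / 国語. It does not handle runs of consecutive punctuation or characters that must not end a line.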


And it generally doesn’t try to pick good places to break lines
at all; it simply assumes that 1 character = 1 column and that
breaking on ASCII whitespace and hyphens is enough.  We can’t
really blame textwrap for that: it is a very simple module, and
Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm).  It’s just that,
with Python 3’s emphasis on Unicode support, I was surprised not
to be able to find a UAX #14 implementation.  I thought someone
would surely have written one and I simply couldn’t find it, so I
asked precisely that.


Re: python 3 and Unicode line breaking

2011-01-14 Thread leoboiko
On Jan 14, 8:10 pm, Steven D'Aprano wrote:
> The only other person I can see who has attempted to actually help the OP
> is Stefan Behnel, who tried to get more information about the problem
> being solved in order to better answer the question. The OP has, so far
> as I can see, not responded, although he has taken the time to write to
> me in private to argue further.

I have written in private because I really feel this discussion is
out of place here.  This thread is already on the first page of Google
results for “python unicode line breaking”, “python uax #14” etc.  I
feel it would be good to use this place to discuss Unicode line
breaking, not best practices for asking questions, or how
disappointingly impolite the Internet has become.  (Briefly: as a tech
support professional myself, I prefer direct, concise questions to
crufty ones; and I try to ask questions in the most direct manner
precisely _because_ I don’t want to waste the time of kind volunteers
with my problems.)


As for taking the time to provide information, I wonder if there was
any technical problem that prevented you from seeing my reply to
Stefan, sent Jan 14, 12:29 pm? He asked how exactly the stdlib module
“textwrap” differs from the Unicode algorithm, so I provided some
commented examples.



Re: python 3 and Unicode line breaking

2011-01-17 Thread leoboiko
On Jan 14, 11:28 pm, Steven D'Aprano wrote:
> Does this help?
>
> http://packages.python.org/kitchen/api-text-display.html

Ooh, it doesn’t appear to be a full line-breaking
implementation but it certainly helps for what I want to do
in my project! Thanks much!

(There’s also the alternative of using something like PyICU
to access a C library, something I had forgotten about
entirely.)

Antoine wrote:
> If you're willing to help on that matter (or some aspects of them,
> textwrap-specific or not), you can open an issue on
> http://bugs.python.org and propose a patch.

I’m not sure my poor coding is good enough to contribute, but I’ll
keep this in mind if I find myself implementing the algorithm or
wanting to patch textwrap.  Thanks.
