python 3 and Unicode line breaking
Hi, Is there an equivalent to the textwrap module that knows about the Unicode line breaking algorithm (UAX #14, http://unicode.org/reports/tr14/ )? -- http://mail.python.org/mailman/listinfo/python-list
Re: python 3 and Unicode line breaking
Of course I searched for one and couldn’t find it; that goes without saying. Otherwise I wouldn’t even bother writing a message, would I? I disagree that people should cruft their messages with details about how they failed to find information, as that is unrelated to the question at hand and has no point other than polluting people’s mailboxes. I also see no reason to reply to a simple question with such discourtesy, and cannot understand why someone would be so aggressive to a stranger.
Re: python 3 and Unicode line breaking
On Jan 14, 11:48 am, Stefan Behnel wrote:
> Sadly, the OP did not clearly state that the required feature
> is really not supported by "textwrap" and in what way textwrap
> behaves differently. That would have helped in answering.

Oh, textwrap doesn’t work for arbitrary Unicode text at all. For example, it separates combining sequences:

    >>> s = "tiếng Việt" # precomposed
    >>> len(s)
    10
    >>> s = "tiếng Việt" # combining
    >>> len(s) # number of unicode characters; ≠ line length
    14
    >>> print(textwrap.fill(s, width=4)) # breaks sequences
    tiê
    ng
    Viê
    t

It also doesn’t know about double-width characters:

    >>> s1 = "日本語のテキト"
    >>> s2 = "12345678901234" # both s1 and s2 use 14 columns
    >>> print(textwrap.fill(s1, width=7))
    日本語のテキト
    >>> print(textwrap.fill(s2, width=7))
    1234567
    8901234

It doesn’t know about non-ASCII punctuation:

    >>> print(textwrap.fill("abc-def", width=5)) # ASCII hyphen-minus
    abc-
    def
    >>> print(textwrap.fill("abc‐def", width=5)) # true hyphen U+2010
    abc‐d
    ef

It doesn’t know East Asian filling rules (though this is perhaps pushing it a bit beyond textwrap’s goals):

    >>> print(textwrap.fill("日本語、中国語", width=3))
    日本語
    、中国 # should avoid linebreak before CJK punctuation
    語

And it generally doesn’t try to pick good places to break lines at all; it simply assumes that 1 character = 1 column and that breaking on ASCII whitespace and hyphens is enough.

We can’t really blame textwrap for that; it is a very simple module, and Unicode line breaking gets complex fast (that’s why the consortium provides a ready-made algorithm). It’s just that, with python3’s emphasis on Unicode support, I was surprised not to be able to find a UAX #14 implementation. I thought someone would surely have written one and I simply couldn’t find it, so I asked precisely that.
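To make the first two points concrete, here is a minimal stdlib-only sketch of counting columns instead of code points. It is not UAX #14 and not a textwrap replacement; the helper names (clusters, columns, wrap_columns) are made up for illustration. It assumes that a combining mark (nonzero canonical combining class) takes no column of its own, and that East Asian Wide/Fullwidth characters take two columns, which is a common terminal convention rather than a guarantee.

```python
import unicodedata


def clusters(text):
    """Split text into base characters plus any combining marks that follow.

    Approximation only: real grapheme clusters are defined by UAX #29.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch  # attach the mark to the preceding base character
        else:
            result.append(ch)
    return result


def columns(cluster):
    """Estimate display columns: 2 for East Asian Wide/Fullwidth, else 1."""
    return 2 if unicodedata.east_asian_width(cluster[0]) in ("W", "F") else 1


def wrap_columns(text, width):
    """Greedy wrap by display columns, never splitting a combining sequence.

    It breaks only where the column budget runs out; it does not look for
    whitespace, hyphens, or UAX #14 break opportunities.
    """
    lines, current, used = [], "", 0
    for cluster in clusters(text):
        w = columns(cluster)
        if current and used + w > width:
            lines.append(current)
            current, used = "", 0
        current += cluster
        used += w
    if current:
        lines.append(current)
    return lines
```

With this, the decomposed "tiếng Việt" counts as 10 clusters rather than 14 code points, and a 7-column budget holds only three of the CJK characters per line. Choosing where a break is actually allowed (spaces, U+2010, not before "、") is the part that still needs the real algorithm.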
Re: python 3 and Unicode line breaking
On Jan 14, 8:10 pm, Steven D'Aprano wrote:
> The only other person I can see who has attempted to actually help the OP
> is Stefan Behnel, who tried to get more information about the problem
> being solved in order to better answer the question. The OP has, so far
> as I can see, not responded, although he has taken the time to write to
> me in private to argue further.

I have written in private because I really feel this discussion is out of place here. This thread is already on the first page of google results for “python unicode line breaking”, “python uax #14” etc. I feel it would be good to use this place to discuss Unicode line breaking, not best practices for asking questions, or how disappointingly impolite the Internet has become.

(Briefly: As a tech support professional myself, I prefer direct, concise questions to crufty ones; and I try to ask questions in the most direct manner precisely _because_ I don’t want to waste the time of kind volunteers with my problems.)

As for taking the time to provide information, I wonder if there was any technical problem that prevented you from seeing my reply to Stefan, sent Jan 14, 12:29PM? He asked how exactly the stdlib module “textwrap” differs from the Unicode algorithm, so I provided some commented examples.
Re: python 3 and Unicode line breaking
On Jan 14, 11:28 pm, Steven D'Aprano wrote:
> Does this help?
>
> http://packages.python.org/kitchen/api-text-display.html

Ooh, it doesn’t appear to be a full line-breaking implementation but it certainly helps for what I want to do in my project! Thanks much!

(There’s also the alternative of using something like PyICU to access a C library, something I had forgotten about entirely.)

Antoine wrote:
> If you're willing to help on that matter (or some aspects of them,
> textwrap-specific or not), you can open an issue on
> http://bugs.python.org and propose a patch.

I’m not sure my poor coding is good enough to contribute, but I’ll keep this in mind if I find myself implementing the algorithm or wanting to patch textwrap. Thanks.