[issue30717] str.center() is not unicode aware

2017-08-02 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hi,

Are you guys still interested? I haven't heard from you in a while.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30717] str.center() is not unicode aware

2017-07-23 Thread Socob

Changes by Socob <206a8...@opayq.com>:


--
nosy: +Socob




[issue30717] str.center() is not unicode aware

2017-07-15 Thread Christian Heimes

Changes by Christian Heimes:


--
assignee: christian.heimes -> 
components: +Interpreter Core -SSL, Tests, Tkinter
nosy:  -christian.heimes




[issue30717] str.center() is not unicode aware

2017-07-13 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I think it is at least plausible that we should add implementations of some of
the Unicode standard's algorithms.  Victor and Serhiy, as two of the active core
devs most involved with Unicode issues, what do you think?

--
nosy: +haypo, serhiy.storchaka, terry.reedy




[issue30717] str.center() is not unicode aware

2017-07-13 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello Steven!

Thanks for your quick response!

unicodedata.grapheme_cluster_break() takes a Unicode code point as an argument
and returns its GraphemeBreakProperty as a string. Possible values are listed
here: http://www.unicode.org/reports/tr29/#CR

help(unicodedata.grapheme_cluster_break) says:

grapheme_cluster_break(chr, /)
    Returns the GraphemeBreakProperty assigned to the character chr as string.



unicodedata.break_graphemes() takes a Unicode string as an argument and returns
a GraphemeClusterIterator that yields consecutive grapheme clusters.

help(unicodedata.break_graphemes) says:

break_graphemes(unistr, /)
    Returns an iterator to iterate over grapheme clusters in unistr.

    It uses extended grapheme cluster rules from TR29.


Is there anything else you would like to know? Don't hesitate to ask :)

Thank you for your time!

--
assignee:  -> christian.heimes
components: +SSL, Tests, Tkinter -Library (Lib)
nosy: +christian.heimes




[issue30717] str.center() is not unicode aware

2017-07-13 Thread Steven D'Aprano

Steven D'Aprano added the comment:

Thank you, but I cannot review your C code.

Can you start by telling us what the two functions:

unicodedata.grapheme_cluster_break()
unicodedata.break_graphemes()

take as arguments, and what they return? If we were to call 
help(function), what would we see?

--




[issue30717] str.center() is not unicode aware

2017-07-13 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello,

I implemented unicodedata.break_graphemes(), which returns an iterator that
yields consecutive graphemes.

This is a "test" implementation meant to see what doesn't fit Python's style
and design, and to discuss naming and implementation details.

https://github.com/python/cpython/pull/2673

Thanks for your time and interest

--




[issue30717] str.center() is not unicode aware

2017-07-11 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello to all of you, sorry for the delay. Been busy.

I added the base code needed to build the grapheme cluster break algorithm. We
now have the GraphemeBreakProperty available via
unicodedata.grapheme_cluster_break()

Can you check that the implementation correctly fits the design? I was not sure
about adding that prop to unicodedata_db or unicodectype_db, tbh.

If it's all correct, I'll move forward with the automaton and the grapheme 
cluster breaking algorithm.

Thanks!

--




[issue30717] str.center() is not unicode aware

2017-07-11 Thread Roundup Robot

Changes by Roundup Robot:


--
pull_requests: +2741




[issue30717] str.center() is not unicode aware

2017-07-01 Thread R. David Murray

R. David Murray added the comment:

See also issue 12568.

--
nosy: +r.david.murray




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Mariatta Wijaya

Changes by Mariatta Wijaya:


--
stage:  -> needs patch
type:  -> enhancement




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Steven D'Aprano

Steven D'Aprano added the comment:

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

talks about *grapheme clusters*, not "graphemes" alone, and it seems clear to 
me that they are language dependent. For example, it says:

The Unicode Standard provides default algorithms for determining grapheme 
cluster boundaries, with two variants: legacy grapheme clusters and extended 
grapheme clusters. The most appropriate variant depends on the language and 
operation involved. ... These algorithms can be adapted to produce tailored 
grapheme clusters for specific locales...


Nevertheless, even just a basic API to either the *legacy grapheme cluster* or 
the *extended grapheme cluster* algorithms would be a good start.

Can I suggest that the unicodedata module might be the right place for it?

And thank you for volunteering to do the work on this!

--




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Thanks for all those interesting cases you brought here! I didn't think of that 
at all!

I'm using the word "grapheme" as per the definition given in UAX TR29 which is
*not* language/locale dependent [1].

This annex is very specific and precise about where to break "grapheme
clusters", i.e. where a character starts and ends. Sadly, it's a bit more
complex than just accumulating code points based on the Combining property.
The annex gives a set of rules to implement, based on the
Grapheme_Cluster_Break property. Those rules may naively be implemented by
comparing adjacent pairs of code points, but that is wrong; they can be
implemented correctly and efficiently as an automaton. My code [2] passes all
tests from GraphemeBreakTest.txt (provided by Unicode).
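
To illustrate why a plain comparison of adjacent pairs is not enough, here is a
minimal sketch of a state machine. It is an illustration only, not the code
from [2] and not the full TR29 rule set: it handles just two cases, combining
marks attaching to what precedes them and regional indicator symbols (the
letters used for flags) grouping in pairs. The names break_clusters and
is_regional_indicator are hypothetical. Whether a regional indicator joins the
previous one depends on how many came before it, which a pairwise check cannot
know.

```python
import unicodedata


def is_regional_indicator(ch):
    # REGIONAL INDICATOR SYMBOL LETTER A..Z occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def break_clusters(string):
    """Toy cluster breaker (hypothetical, far from the complete TR29 rules)."""
    current = ""
    ri_run = 0  # state: length of the run of regional indicators ending `current`
    for ch in string:
        if not current:
            join = False
        elif unicodedata.combining(ch):
            join = True   # a combining mark never starts a cluster here
        elif is_regional_indicator(ch) and ri_run % 2 == 1:
            join = True   # completes a pair of regional indicators
        else:
            join = False
        if join:
            current += ch
        else:
            if current:
                yield current
            current = ch
            ri_run = 0
        # Update the state after consuming the code point.
        ri_run = ri_run + 1 if is_regional_indicator(ch) else 0
    if current:
        yield current


# Four regional indicators (F, R, D, E) form two clusters, not one or four.
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
print(len(list(break_clusters(flags))))  # 2
```

The only state kept here is the parity of the current run of regional
indicators; a complete implementation needs more states than this.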

We can definitely do a generator like you propose, or rather do it in the C 
layer to gain more efficiency and coherence since the other string / Unicode 
operations are in the C layer (upper, lower, casefold, etc)

Let me know what you guys think, what (and if) I should contribute :)

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] 
https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31

--




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Steven D'Aprano

Steven D'Aprano added the comment:

I don't think graphemes is the right term here. Graphemes are language 
dependent, for instance "dž" may be considered a grapheme in Croatian.

https://en.wikipedia.org/wiki/D%C5%BE
http://www.unicode.org/glossary/#grapheme

I believe you are referring to combining characters:

http://www.unicode.org/faq/char_combmark.html

It is unfortunate that Python's string methods are naive about combining 
characters, and just count code points, but I'm not sure what the alternative 
is. For example the human reader may be surprised that these give two different 
results:

py> len("naïve")
5
py> len("naïve")
6

I'm not sure if the effect will survive copying and pasting, but the first 
string uses 

U+00EF LATIN SMALL LETTER I WITH DIAERESIS

while the second uses:

U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
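
In case the two literals above did not survive copying and pasting, the same
effect can be reproduced with explicit escapes. This is a small sketch using
only len() and unicodedata.normalize(); the variable names are just for
illustration.

```python
import unicodedata

composed = "na\u00efve"      # U+00EF LATIN SMALL LETTER I WITH DIAERESIS
decomposed = "nai\u0308ve"   # U+0069 followed by U+0308 COMBINING DIAERESIS

print(len(composed))    # 5
print(len(decomposed))  # 6

# The two strings render the same but do not compare equal as code points.
print(composed == decomposed)  # False

# Normalizing to a common form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalization helps for this particular pair, but not every base character
plus combining mark has a precomposed equivalent.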

And check out this surprising result:

py> "xïoz"[::-1]
'zöix'


It seems to me that it would be great if Python was fully aware of combining
characters, it's not so great if it is naïve, but it would be simply terrible if
only a few methods were aware and the rest naïve.

I don't have a good solution to this, but perhaps an iterator over (base 
character + combining marks) would be a good first step. Something like this?

import unicodedata

def chars(string):
    accum = []
    for c in string:
        cat = unicodedata.category(c)
        if cat == 'Mn':
            accum.append(c)
        else:
            if accum:
                yield accum
                accum = []
            accum.append(c)
    if accum:
        yield accum
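
As a rough sketch of how such an iterator might be used, the block below
builds a cluster-aware centre and reverse on top of a similar helper. All three
function names are hypothetical, and the grouping relies on
unicodedata.combining() rather than on the TR29 rules, so it is only an
approximation.

```python
import unicodedata


def clusters(string):
    """Yield each base character together with the combining marks after it."""
    current = ""
    for ch in string:
        if current and unicodedata.combining(ch):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def center_clusters(string, width, fillchar=" "):
    """Hypothetical analogue of str.center() that counts clusters."""
    pad = width - sum(1 for _ in clusters(string))
    if pad <= 0:
        return string
    # Split of odd padding chosen to agree with str.center() in the checks below.
    left = pad // 2 + (pad & width & 1)
    return fillchar * left + string + fillchar * (pad - left)


def reverse_clusters(string):
    """Reverse a string without detaching marks from their base character."""
    return "".join(reversed(list(clusters(string))))


s = "xi\u0308oz"  # the diaeresis belongs to the 'i'
print(ascii(s[::-1]))               # 'zo\u0308ix'  (mark moved to the 'o')
print(ascii(reverse_clusters(s)))   # 'zoi\u0308x'  (mark stays on the 'i')
print(ascii(center_clusters("a\u0308", 5, ".")))  # '..a\u0308..'
```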

--
nosy: +steven.daprano




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Obviously, I'm talking about str.center() here, but every function that relies
on a count of graphemes has the same problem.

I can fix that and add the corresponding function, or an iterator over 
graphemes, or whatever seems right :)

--




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

New submission from Guillaume Sanchez:

"a⃑".center(5, ".")
produces
'..a⃑.' instead of '..a⃑..'

The reason is that "a⃑" is composed of two code points (2 UCS4 chars): one 'a'
and one combining code point (an arrow above). str.center() takes the length of
the string and pads it on both sides with `fillchar` until the length reaches
`width`. However, that length is surely intended to be the number of characters,
not the number of code points.
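
The behaviour can be reproduced with explicit escapes, as in the short sketch
below. The exact combining mark is an assumption (U+20D7 COMBINING RIGHT ARROW
ABOVE is used here); any combining mark shows the same effect.

```python
# Assumption: U+20D7 stands in for the combining arrow in the report.
s = "a\u20d7"

print(len(s))  # 2: str counts code points, not displayed characters

padded = s.center(5, ".")
print(ascii(padded))  # '..a\u20d7.'

# Only three fill characters are added, although a reader sees one
# character and would expect four.
print(padded.count("."))  # 3
```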

The correct way to count characters is to use the grapheme clustering algorithm 
from UAX TR29.

It turns out I have already implemented this myself, and I could submit a PR if
asked, with a little help on the C <-> Python glue.

Thanks for your time.

--
components: Library (Lib)
messages: 296478
nosy: Guillaume Sanchez
priority: normal
severity: normal
status: open
title: str.center() is not unicode aware
versions: Python 3.7
