[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2017-08-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Issue30717 has a patch.

--
resolution:  -> duplicate
stage: needs patch -> resolved
status: open -> closed
superseder:  -> Add unicode grapheme cluster break algorithm

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2017-07-23 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +serhiy.storchaka
versions: +Python 3.7 -Python 3.4, Python 3.5




[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2017-07-23 Thread Socob

Changes by Socob <206a8...@opayq.com>:


--
nosy: +Socob




[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2013-07-09 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

It may be useful to also add the start position of the grapheme to the iterator 
output.
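
A minimal sketch of what that could look like, assuming the clusters are already available from some grapheme iterator. The name with_start_positions is hypothetical and is not part of any proposed API; it only tracks code point offsets by adding up cluster lengths:

```python
def with_start_positions(clusters):
    """Pair each grapheme cluster with its start index in the source string.

    'clusters' is any iterable of consecutive substrings of one string,
    for example the output of a grapheme iterator (hypothetical here).
    """
    start = 0
    for cluster in clusters:
        yield start, cluster
        # The next cluster begins right after the current one ends.
        start += len(cluster)
```

The offsets are only meaningful if the clusters are contiguous and cover the string from its beginning.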

Related to this, please also see this pre-PEP I once wrote for a Unicode 
indexing module:

http://mail.python.org/pipermail/python-dev/2001-July/015938.html

--
components: +Unicode
nosy: +lemburg

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18406
___



[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2013-07-09 Thread Matthew Barnett

Matthew Barnett added the comment:

This is basically what the regex module does, written in Python:

def get_grapheme_cluster_break(codepoint):
    """Gets the Grapheme Cluster Break property of a codepoint.

    The properties are defined here:

    http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
    """
    # The return value is one of:
    #
    # Other
    # CR
    # LF
    # Control
    # Extend
    # Prepend
    # Regional_Indicator
    # SpacingMark
    # L
    # V
    # T
    # LV
    # LVT
    ...

def at_grapheme_boundary(string, index):
    """Checks whether the codepoint at 'index' is on a grapheme boundary.

    The rules are defined here:

    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
    """
    # Break at the start and end of the text.
    if index == 0 or index == len(string):
        return True

    prop = get_grapheme_cluster_break(string[index])
    prop_m1 = get_grapheme_cluster_break(string[index - 1])

    # Don't break within CRLF.
    if prop_m1 == CR and prop == LF:
        return False

    # Otherwise break before and after controls (including CR and LF).
    if prop_m1 in (Control, CR, LF) or prop in (Control, CR, LF):
        return True

    # Don't break Hangul syllable sequences.
    if prop_m1 == L and prop in (L, V, LV, LVT):
        return False
    if prop_m1 in (LV, V) and prop in (V, T):
        return False
    if prop_m1 in (LVT, T) and prop == T:
        return False

    # Don't break between regional indicator symbols.
    if prop_m1 == Regional_Indicator and prop == Regional_Indicator:
        return False

    # Don't break just before Extend characters.
    if prop == Extend:
        return False

    # Don't break before SpacingMarks, or after Prepend characters.
    if prop == SpacingMark:
        return False

    if prop_m1 == Prepend:
        return False

    # Otherwise, break everywhere.
    return True
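
For illustration only, a boundary predicate with this signature could drive an iterator along the following lines. This is a sketch under stated assumptions, not what the regex module does: itergraphemes and simple_boundary are hypothetical names, and simple_boundary is a deliberately reduced stand-in that knows only about CRLF and combining marks, because the standard unicodedata module does not expose the Grapheme_Cluster_Break property.

```python
import unicodedata


def simple_boundary(string, index):
    """Reduced stand-in for a full boundary check (assumption).

    Handles only the start/end of text, CRLF, and combining marks.
    """
    if index == 0 or index == len(string):
        return True
    # Don't break within CRLF.
    if string[index - 1] == "\r" and string[index] == "\n":
        return False
    # Don't break before a character with a nonzero combining class.
    return unicodedata.combining(string[index]) == 0


def itergraphemes(string, at_boundary=simple_boundary):
    """Yield substrings of 'string' delimited by the boundary predicate."""
    start = 0
    for index in range(1, len(string) + 1):
        if at_boundary(string, index):
            yield string[start:index]
            start = index
```

Passing a complete implementation of the rules as at_boundary would replace the stand-in without changing the loop.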

--
nosy: +mrabarnett




[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2013-07-08 Thread David P. Kendal

New submission from David P. Kendal:

On python-ideas I proposed the addition of a way to iterate over the graphemes 
of a string, either as part of the unicodedata library or as a method on the 
built-in str type. 
http://mail.python.org/pipermail/python-ideas/2013-July/021916.html

I provided a sample implementation, but MRAB pointed out that my definition 
of a grapheme is slightly wrong; it's a little more complex than just 
character followed by combiners. 
http://mail.python.org/pipermail/python-ideas/2013-July/021917.html

M.-A. Lemburg asked me to open this issue. 
http://mail.python.org/pipermail/python-ideas/2013-July/021929.html
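
To make the "character followed by combiners" definition concrete, here is a hedged reconstruction of that naive approach. It is not the sample implementation from the python-ideas post; naive_graphemes is a hypothetical name, and the code relies only on unicodedata.combining:

```python
import unicodedata


def naive_graphemes(text):
    """Yield one base character plus any trailing combining marks."""
    cluster = ""
    for ch in text:
        # combining() returns a nonzero canonical combining class
        # for combining marks and 0 for everything else.
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster
```

This groups "e" with a following U+0301, but it splits "\r\n" into two pieces, whereas the UAX #29 rules keep CRLF together. That is one of the cases where the simple definition falls short.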

--
messages: 192684
nosy: dpk
priority: normal
severity: normal
status: open
title: unicodedata.itergraphemes / str.itergraphemes / str.graphemes
type: enhancement
versions: Python 3.4, Python 3.5




[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2013-07-08 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +benjamin.peterson, ezio.melotti, loewis
stage:  -> needs patch




[issue18406] unicodedata.itergraphemes / str.itergraphemes / str.graphemes

2013-07-08 Thread Chris Rebert

Changes by Chris Rebert pyb...@rebertia.com:


--
nosy: +cvrebert
