[issue30717] Add unicode grapheme cluster break algorithm

2021-06-29 Thread Jakub Wilk


Change by Jakub Wilk:


--
nosy: +jwilk

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30717] Add unicode grapheme cluster break algorithm

2020-01-07 Thread Manish


Manish added the comment:

> Does `unicode-segmentation` support all platforms that CPython supports?

It's no-std, so it supports everything the base Rust compiler supports (which 
is basically everything llvm supports).

And yeah, if there's something that doesn't match with the support matrix this 
isn't going to work. 


However, I suggested this more for the potential PyPI package. If you're 
working this into CPython you'd have to figure out how best to include Rust 
stuff in your build system, which seems like a giant chunk of scope creep :)



For including in CPython I'd suggest looking through unicode-segmentation and 
writing a C version of it. We use a Python script[1] to generate the data 
tables; this might be something y'all can use. Swift's UAX 29 implementation is 
also quite interesting; however, it's baked deeply into the language, so it's 
less useful as a starting point.


 [1]: 
https://github.com/unicode-rs/unicode-segmentation/blob/master/scripts/unicode.py

--




[issue30717] Add unicode grapheme cluster break algorithm

2020-01-06 Thread Paul Ganssle


Paul Ganssle added the comment:

> Oh, also, if y'all are fine with binding to Rust (through a C ABI) I'd love 
> to help y'all use unicode-segmentation, which is much less work than pulling 
> in ICU. Otherwise if y'all have implementation questions I can answer them. 
> This spec is kinda tricky to implement efficiently, but it's not super hard.

Is the idea here that we'd take on a new dependency on the compiled 
`unicode-segmentation` binary, rather than adding Rust into our build system? 
Does `unicode-segmentation` support all platforms that CPython supports? I was 
under the impression that Rust requires llvm and llvm doesn't necessarily have 
the same support matrix as CPython (I'd love to be corrected if I'm wrong on 
this).

(Note: I don't actually know what the process is for taking on new dependencies 
like this, just trying to point at one possible stumbling block.)

--




[issue30717] Add unicode grapheme cluster break algorithm

2020-01-06 Thread Manish


Manish added the comment:

> one never needs to look at more than two adjacent code points to tell 
> whether or not a grapheme break will occur between them, so this ought 
> to be pretty efficient. 


That note is outdated (and has been outdated since Unicode 9). The regional 
indicator rules (GB12 and GB13) and the emoji rule (GB11) require arbitrary 
lookbehind (though thankfully not arbitrary lookahead).
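
To illustrate the lookbehind point, here is a minimal sketch covering only the 
regional indicator rules (GB12/GB13). The function names are hypothetical, and 
the sketch assumes that both string[i-1] and string[i] are already known to be 
regional indicators. It is not a full UAX #29 implementation.

```python
def is_regional_indicator(ch: str) -> bool:
    """True for the regional indicator symbols U+1F1E6..U+1F1FF."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def break_between_regional_indicators(s: str, i: int) -> bool:
    """Decide whether a break falls before s[i] (GB12/GB13 only).

    Assumes s[i-1] and s[i] are both regional indicators.
    """
    # Count the unbroken run of regional indicators ending just before s[i].
    run = 0
    j = i - 1
    while j >= 0 and is_regional_indicator(s[j]):
        run += 1
        j -= 1
    # An odd run means s[i-1] is still unpaired, so s[i] completes its flag
    # and there is no break. An even run means s[i] starts a new flag.
    return run % 2 == 0


flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"  # FR followed by DE
print([break_between_regional_indicators(flags, i) for i in (1, 2, 3)])
# prints [False, True, False]
```

The loop may walk back over an arbitrarily long run of regional indicators, 
which is why looking at two adjacent code points is not enough.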

I think the ideal API surface is an iterator and nothing else. Everything else 
can be derived from the iterator. It's theoretically possible to expose an 
is_grapheme_break that's faster than just iterating -- look at the code in 
unicode-segmentation's _reverse_ iterator to see how -- but it's going to be 
tricky to get right. Building the iterator on top of is_grapheme_break is not a 
good idea.

--




[issue30717] Add unicode grapheme cluster break algorithm

2020-01-06 Thread Steven D'Aprano


Steven D'Aprano added the comment:

> I think it would be a mistake to make the stdlib use this for most 
> notions of what a "character" is, as I said this notion is also 
> inaccurate. Having an iterator library somewhere that you can use and 
> compose is great, changing the internal workings of string operations 
> would be a major change, and not entirely productive.

Agreed. 

I won't pretend to be able to predict what Python 5.0 will bring *wink* 
but there's too much history around the "code point = character" notion 
for the language to change now.

If the language can expose a grapheme iterator, then people can 
experiment with grapheme-based APIs in libraries.

(By grapheme I mean "extended grapheme cluster", but that's a mouthful. 
Sorry linguists.)

What do you think of these as a set of grapheme primitives?

(1) is_grapheme_break(string, i)

Return True if a grapheme break would occur *before* string[i].

(2) graphemes(string, start=0, end=len(string))

Iterate over graphemes in string[start:end].

(3) graphemes_reversed(string, start=0, end=len(string))

Iterate over graphemes in reverse order.

I *think* is_grapheme_break would be enough for people to implement 
their own versions of graphemes and graphemes_reversed. Here's an 
untested version:

def graphemes(string, start, end):
    cluster = []
    for i in range(start, end):
        c = string[i]
        if is_grapheme_break(string, i):
            if i != start:
                # don't yield the empty cluster at Start Of Text
                yield ''.join(cluster)
            cluster = [c]
        else:
            cluster.append(c)
    if cluster:
        yield ''.join(cluster)
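
As a rough way to try the generator out, the sketch below pairs it with a toy 
is_grapheme_break that only keeps combining marks attached to the preceding 
code point. That predicate is an assumption made for illustration and is far 
from the real UAX #29 rules. The end=None default is also an assumption, 
since end=len(string) cannot be written as a default argument.

```python
import unicodedata


def is_grapheme_break(string, i):
    """Toy stand-in: break before string[i] unless it is a combining mark.

    NOT the UAX #29 algorithm: it ignores Hangul, emoji sequences,
    regional indicators, CR LF and everything else.
    """
    if i == 0:
        return True
    return unicodedata.combining(string[i]) == 0


def graphemes(string, start=0, end=None):
    """Yield clusters of string[start:end], using is_grapheme_break."""
    if end is None:
        end = len(string)
    cluster = []
    for i in range(start, end):
        c = string[i]
        if is_grapheme_break(string, i):
            if i != start:
                # don't yield the empty cluster at Start Of Text
                yield ''.join(cluster)
            cluster = [c]
        else:
            cluster.append(c)
    if cluster:
        yield ''.join(cluster)


# "e" + combining acute accent, "a" + a combining mark, then "b"
print(list(graphemes("e\u0301a\u20d1b")))
```

With this toy predicate the five code points come out as three clusters.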

Regarding is_grapheme_break, if I understand the note here:

https://www.unicode.org/reports/tr29/#Testing

one never needs to look at more than two adjacent code points to tell 
whether or not a grapheme break will occur between them, so this ought 
to be pretty efficient. At worst, it needs to look at string[i-1] and 
string[i], if they exist.

--




[issue30717] Add unicode grapheme cluster break algorithm

2020-01-05 Thread Manish


Manish added the comment:

Oh, also, if y'all are fine with binding to Rust (through a C ABI) I'd love to 
help y'all use unicode-segmentation, which is much less work than pulling in 
ICU. Otherwise if y'all have implementation questions I can answer them. This 
spec is kinda tricky to implement efficiently, but it's not super hard.

--




[issue30717] Add unicode grapheme cluster break algorithm

2020-01-05 Thread Manish

Manish added the comment:

Hi,

Unicodey person here, I'm involved in Unicode itself and also maintain an 
implementation of this particular spec[1].


So, firstly,

> "a⃑".center(width=5, fillchar=".")

If you're trying to do terminal width stuff, extended grapheme clusters *will 
not* solve the problem for you. There is no algorithm specified in Unicode that 
does this, because this is font dependent. Extended grapheme clusters are 
better than code points for this, however, and will not ever produce *worse* 
results.
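
For reference, a small sketch of what str.center does today with such a 
string. It assumes the example is "a" followed by a single combining mark 
(U+20D1 is used here as a stand-in), and it only counts code points. How wide 
the result looks depends on the font and terminal.

```python
s = "a\u20d1"  # "a" followed by one combining mark (assumed stand-in)
padded = s.center(5, ".")  # str.center accepts positional arguments only

print(len(s))             # 2 code points, usually rendered in one column
print(len(padded))        # 5 code points after padding
print(padded.count("."))  # 3 fill characters, so about 4 visible columns
```

Padding by extended grapheme cluster count would add a fourth fill character 
here, which is closer to the intent but still not a guarantee of display width.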


It's fine to expose this, but it's worth adding caveats.

Also, yes, please do not expose a direct indexing function. Aside from almost 
all Unicode algorithms being streaming algorithms and thus inefficient to index 
directly, needing to directly index a grapheme cluster is almost always a sign 
that you are making a mistake.

> Yes. I clearly don't want this PR to be interpreted as "we're needing ICU". 
> ICU provides much much more than what I'm willing to provide. My goal here is 
> just to "fix" how Python's standard library iterates over characters. 
> Nothing more, nothing less.

I think it would be a mistake to make the stdlib use this for most notions of 
what a "character" is; as I said, this notion is also inaccurate. Having an 
iterator library somewhere that you can use and compose is great, but changing 
the internal workings of string operations would be a major change, and not 
entirely productive.

There's only one language I can think of that uses extended grapheme clusters 
as its default notion of "character": Swift. Swift is largely designed for UI 
stuff, and it makes sense in this context. This is also baked in very deeply to 
the language (e.g. their Character class is a thin wrapper around String, since 
grapheme clusters can be arbitrarily large). You'd need a pretty major paradigm 
shift for Python to make a similar change, and it doesn't make as much sense 
for Python in the first place.

Starting off with a library published to pypi makes sense to me.


 [1]: https://github.com/unicode-rs/unicode-segmentation/

--
nosy: +Manishearth




[issue30717] Add unicode grapheme cluster break algorithm

2019-02-19 Thread Bert JW Regeer


Change by Bert JW Regeer:


--
nosy: +Bert JW Regeer




[issue30717] Add unicode grapheme cluster break algorithm

2019-02-19 Thread Jens Troeger


Change by Jens Troeger:


--
nosy: +_savage




[issue30717] Add unicode grapheme cluster break algorithm

2018-09-09 Thread Matej Cepl


Change by Matej Cepl:


--
nosy: +mcepl




[issue30717] Add unicode grapheme cluster break algorithm

2018-08-23 Thread Paul Ganssle


Change by Paul Ganssle:


--
nosy: +p-ganssle




[issue30717] Add unicode grapheme cluster break algorithm

2018-08-19 Thread Xiang Zhang


Change by Xiang Zhang:


--
nosy: +xiang.zhang




[issue30717] Add unicode grapheme cluster break algorithm

2018-08-18 Thread Bian Jiaping


Change by Bian Jiaping:


--
nosy: +bianjp




[issue30717] Add unicode grapheme cluster break algorithm

2018-02-12 Thread INADA Naoki

INADA Naoki added the comment:

We missed the 3.7 train.
I'm sorry that I couldn't review it. There are many shiny features
I want in 3.7, and I have no time to review them all.
In particular, I would need to understand TR29, which is a hard job for me.

I think publishing this (and any other functions relating to Unicode)
to PyPI is better than waiting for 3.8.
That makes it possible to discuss the API design with working code, and to get 
it "battle tested" before adding it to the standard library.

--
nosy: +inada.naoki




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-07 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

> I don't think unicodedata is the right place

I do agree with that. A new module sounds good. Would it be a problem if that 
module contained very few functions at first?

> Can we mark this as having a Provisional API to give us time to decide on the 
> best API before locking it in permanently?

I'm not sure it's my call to make, but I would gladly consider that option.

> we should go through a PEP.

Why not. I may need a bit of guidance though.

> If you want state keeping for iterating over multiple  parts of 
> the string, you can use an iterator.

Sure thing. It just wasn't specified like this in the proto-PEP.

> The APIs were inspired by the standard string.find() APIs, that's why they 
> work on indexes and don't return Unicode strings. As such, they serve a 
> different use case than an iterator.

I personally like having a generator returning slice objects, as suggested 
above. What would be some good objections to this?

> Wouldn't this be a typical case where we'd expect a module to evolve and gain 
> usage on PyPI first, before adding it to the stdlib? [...] they might give 
> inspiration for a suitable API design

I'll give it a look.

> The well known library for Unicode support in C++ and Java is ICU

Yes. I clearly don't want this PR to be interpreted as "we're needing ICU". ICU 
provides much much more than what I'm willing to provide. My goal here is just 
to "fix" how Python's standard library iterates over characters. Nothing 
more, nothing less.

One might think that splitlines() should be "fixed" too, and there is clearly 
something to discuss here. Same for word splitting. However, I do not intend to 
bring normalization, which you already have, collation, locale-dependent 
upcasing or lowercasing, etc. We might need a wheel, but we don't have to take 
the whole truck.

How do we discuss all of this? Who's in charge of making decisions? How long 
should we debate? This is my first time contributing to Python and I'm new to 
all of that.

Thanks for your time.

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The well known library for Unicode support in C++ and Java is ICU 
(International Components for Unicode). There is a Python wrapper [1].

This is a large, complex library that covers many aspects of Unicode support. 
Its interface looks more Javaic than Pythonic. Some parts of it are already 
covered by other parts of the stdlib (the str class, the codecs and locale 
modules).

[1] https://pypi.python.org/pypi/PyICU/

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Stefan Behnel

Stefan Behnel added the comment:

Wouldn't this be a typical case where we'd expect a module to evolve and gain 
usage on PyPI first, before adding it to the stdlib?

Searching for "grapheme" on PyPI gives some results for me. Even if they do not 
cover what this ticket asks for, they might give inspiration for a suitable API 
design. And I'm probably missing other related packages for lack of a better 
search term.

https://pypi.python.org/pypi?%3Aaction=search&term=grapheme

--
nosy: +scoder




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 03.08.2017 15:05, Guillaume Sanchez wrote:
> 
> Guillaume Sanchez added the comment:
> 
> I have a few criticisms of that proto-PEP
> 
> http://mail.python.org/pipermail/python-dev/2001-July/015938.html
> 
> In particular, the fact that all those functions return an index prevents any 
> state keeping.

If you want state keeping for iterating over multiple 
parts of the string, you can use an iterator.

The APIs were inspired by the standard string.find() APIs, that's
why they work on indexes and don't return Unicode strings. As
such, they serve a different use case than an iterator.

With the APIs, scanning would always start at the given index
in the string and move forward/backward to the start of the next
<indextype>.
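
A minimal sketch of what such an index-based helper could look like for 
graphemes, in the spirit of str.find(). The names next_grapheme and 
toy_is_break are hypothetical, and the break predicate is a toy that only 
keeps combining marks attached to the previous code point. A predicate that 
follows the real rules may need to look further back than one code point.

```python
import unicodedata


def toy_is_break(u, i):
    """Hypothetical predicate: break before u[i] unless it is a combining mark."""
    return i == 0 or unicodedata.combining(u[i]) == 0


def next_grapheme(u, index):
    """Return the index of the first boundary strictly after index.

    Returns len(u) when no further boundary exists.
    """
    i = index + 1
    while i < len(u) and not toy_is_break(u, i):
        i += 1
    return min(i, len(u))


text = "e\u0301a\u20d1b"

# Walk the string by repeatedly asking for the next boundary.
starts = []
pos = 0
while pos < len(text):
    starts.append(pos)
    pos = next_grapheme(text, pos)
print(starts)  # prints [0, 2, 4]
```

Each call rescans from the given index, which fits one-off lookups, whereas an 
iterator can carry its state from one cluster to the next.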

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Steven D'Aprano

Steven D'Aprano added the comment:

On Thu, Aug 03, 2017 at 11:21:38AM +, Serhiy Storchaka wrote:

> Should iterators provide just substrings or their positions?
[...]

I think we're breaking new ground here and I'm not sure what the right 
API should be. Should we follow Perl 6?

https://docs.perl6.org/type/Str

Go has a "norm" package for dealing with normalised "characters" 
(graphemes).

https://blog.golang.org/normalization

http://godoc.org/golang.org/x/text/unicode/norm

Are my comments unacceptable scope creep? We've gone from talking about 
a grapheme cluster break algorithm to me talking about Perl 6 and Go, 
which have rich string APIs based on graphemes.

I'm not even sure of the best place for this:

- unicodedata
- string
- a new module?

I don't think unicodedata is the right place -- that should be for data 
and processing of individual unicode code points, not string handling, 
and it shouldn't become a grab-bag of random unrelated functions just 
because they have something to do with Unicode.

Can we mark this as having a Provisional API to give us time to decide on the 
best API before locking it in permanently?

https://www.python.org/dev/peps/pep-0411/

I'm reluctant to say this, because it's a lot more work, but maybe this 
is complicated enough that we should go through a PEP.

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

I have a few criticisms of that proto-PEP:

http://mail.python.org/pipermail/python-dev/2001-July/015938.html

In particular, the fact that all those functions return an index prevents any 
state keeping.

That's a problem because:

> next_<indextype>(u, index) -> integer

As you've seen, in grapheme clustering (as well as word and line breaking), 
we need an automaton to decide on the breaking point, which means that 
starting at an arbitrary index is not possible.

> prev_<indextype>(u, index) -> integer

Is it really necessary? It means implementing the same logic to go backward. In 
our current case, we'd need a backward grapheme cluster break automaton too.

> <indextype>_start(u, index) -> integer
> <indextype>_end(u, index) -> integer

Not doable in O(1), for the same reason as next_<indextype>(). We need 
context, and the code point itself cannot give enough information to know if 
it's the start/end of a given indextype.

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Thanks for your consideration. I'm currently fixing what's been asked in the 
reviews.

> But it would be useful to provide also word and sentence iterators.

I'll gladly do that as well!

> I think emitting a pair (pos, substring) would be more useful.

Does that mean emitting a pair like ((start, end), substr)? Is it Pythonic to 
return a structure like this?

For what it's worth, I don't like it, but I definitely understand the value of 
it. I'd prefer having two versions. One returning indexes, the other returning 
substrings.

But...

> Alternatively an iterator could emit slice objects.

I really like that. Do we have a clear common agreement or preference on any 
option?

--




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Issue18406 is closed as a duplicate of this issue. There are useful links in 
issue18406. In particular see a proto-PEP of Unicode Indexing Helper Module:

http://mail.python.org/pipermail/python-dev/2001-July/015938.html

I agree that providing a grapheme iterator would be useful. But it would also be 
useful to provide word and sentence iterators.

Should iterators provide just substrings or their positions? I think emitting a 
pair (pos, substring) would be more useful. It is easier to create an iterator 
of substrings from an iterator of pairs than the opposite.

Alternatively an iterator could emit slice objects. Or special objects similar 
to re match objects.
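
The slice-object variant could be sketched roughly as follows. The names 
grapheme_slices and toy_is_break are hypothetical, and the break predicate is 
a toy that only keeps combining marks attached to the previous code point, 
not the UAX #29 rules. The point is only that substrings and (pos, substring) 
pairs both fall out of the slices.

```python
import unicodedata


def toy_is_break(string, i):
    """Hypothetical predicate: break before string[i] unless it is a combining mark."""
    return i == 0 or unicodedata.combining(string[i]) == 0


def grapheme_slices(string, is_break=toy_is_break):
    """Yield one slice object per cluster, in order."""
    start = 0
    for i in range(1, len(string)):
        if is_break(string, i):
            yield slice(start, i)
            start = i
    if string:
        yield slice(start, len(string))


text = "e\u0301a\u20d1b"
slices = list(grapheme_slices(text))

# Both other shapes are one comprehension away from the slices.
substrings = [text[s] for s in slices]
pairs = [(s.start, text[s]) for s in slices]
print(slices)
```

A slice carries both ends of the cluster, so callers that need the end 
position do not have to recompute it from the substring length.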

--
nosy: +mrabarnett




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


--
components: +Unicode
nosy: +benjamin.peterson, ezio.melotti, lemburg, loewis
stage: needs patch -> patch review
title: str.center() is not unicode aware -> Add unicode grapheme cluster break 
algorithm
