[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D. wrote: > > Let's call the styles a tie. Using the SOWPODS scrabble wordlist (no > currency symbols, so False answer): > > >>> unicode_currency = {chr(c) for c in range(0x) if > >>> unicodedata.category(chr(c)) == "Sc"} > >>> wordlist = open('/u

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Let's call the styles a tie. Using the SOWPODS Scrabble wordlist (no currency symbols, so False answer): >>> unicode_currency = {chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Sc"} >>> wordlist = open('/usr/local/share/sowpods').read() >>> len(wordlist) 2707021 >>> %timeit

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D. wrote: > > Yeah... oops. Obviously I typed the version in email. Should have done it in > the shell. But you got the intention of set-ifying the characters in the > large string. Yep. I thought of that as I was originally writing, but absent bench

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string. Yes on lies, damn lies, and benchmarks. On Fri, Jun 2, 2023, 7:29 PM Chris Angelico wrote: > On Sat, 3 Jun 2023 at 08:28, David Mer

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. wrote: > > This is just bar talk at this point. I think we've shown that this is > easy enough to do that programmers can roll their own. > > But as idle chat goes, note that in your code: > >set(unicodedata.category(ch) for ch in s) > > If `s`

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own. But as idle chat goes, note that in your code: set(unicodedata.category(ch) for ch in s) If `s` is a billion characters long, then we make a billion calls to the `.cat
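
The trade-off discussed here (one `category()` call per character versus set-ifying the string first) can be sketched roughly as follows. This is only an illustration, not code from the thread; the function names and the sample text are made up for the example.

```python
import unicodedata


def has_currency_per_char(s: str) -> bool:
    # Calls unicodedata.category() once for every character in s.
    return "Sc" in {unicodedata.category(ch) for ch in s}


def has_currency_per_distinct_char(s: str) -> bool:
    # Builds set(s) first, so category() runs once per distinct character.
    return "Sc" in {unicodedata.category(ch) for ch in set(s)}


# Hypothetical sample input: long, but with few distinct characters.
text = "total: 12 \u20ac plus $3 shipping " * 10_000
print(has_currency_per_char(text), has_currency_per_distinct_char(text))
```

Which variant is faster depends on the string length and on how many distinct characters it contains, so it is worth timing both on representative data.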

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D. wrote: > > Sure. That's fine. With sufficiently long strings my code is faster, but > for "typical" strings yours will be. Really? How? Your code has to build a set of every character in the string; mine builds a set of every category in the stri

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Sure. That's fine. With sufficiently long strings my code is faster, but for "typical" strings yours will be. On Fri, Jun 2, 2023, 5:20 PM Chris Angelico wrote: > On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. > wrote: > > > > def does_string_have_currency_mark(s): > > return bool(set(s)

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. wrote: > > def does_string_have_currency_mark(s): > return bool(set(s) & set(unicode_categories['Sc'])) > > def does_string_have_numeric_digit(s): ... > > ... and so on. Those seem like questions one asks often enough. Not > every day, but more t

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
def does_string_have_currency_mark(s): return bool(set(s) & set(unicode_categories['Sc'])) def does_string_have_numeric_digit(s): ... ... and so on. Those seem like questions one asks often enough. Not every day, but more than never. On Fri, Jun 2, 2023 at 4:59 PM Chris Angelico wrote: > >

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 06:54, David Mertz, Ph.D. wrote: > > If we're talking PyPI, it would be nice to have: > > unicode_categories = {"Zs": [...], "Ll": [...], ...} > > For all the various categories. It would just take one pass through > all the characters to generate it, but then every category

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
If we're talking PyPI, it would be nice to have: unicode_categories = {"Zs": [...], "Ll": [...], ...} For all the various categories. It would just take one pass through all the characters to generate it, but then every category would be fast to access later. On the other hand, it's a few lines
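
A minimal sketch of the one-pass table described above, assuming the standard-library `unicodedata` module is the source of the categories. The helper name `build_unicode_categories` is hypothetical; only the `unicode_categories` name comes from the message.

```python
import sys
import unicodedata
from collections import defaultdict


def build_unicode_categories() -> dict:
    # Single pass over every code point, grouping characters by category.
    categories = defaultdict(list)
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        categories[unicodedata.category(ch)].append(ch)
    return dict(categories)


unicode_categories = build_unicode_categories()
print(sorted(unicode_categories))        # category codes that were found
print(len(unicode_categories["Sc"]))     # number of currency symbols
```

After the one-time pass, a lookup such as `unicode_categories["Zs"]` is a plain dictionary access.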

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Marc-Andre Lemburg
On 01.06.2023 20:06, David Mertz, Ph.D. wrote: I guess this is pretty general for the described need: %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"] Use sys.maxunicode instead of 0x CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms W
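
For illustration, the suggestion to use `sys.maxunicode` might look like the sketch below. Note that `range()` excludes its upper bound, so the sketch adds 1 to cover the last code point; that detail is an addition here, not something stated in the message.

```python
import sys
import unicodedata

# sys.maxunicode is the largest valid code point; +1 makes range() include it.
unicode_whitespace = [
    chr(c)
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Zs"
]

print(hex(sys.maxunicode))
print([f"U+{ord(ch):04X}" for ch in unicode_whitespace])
```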

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Barry
> On 1 Jun 2023, at 19:10, David Mertz, Ph.D. wrote: > > %time unicode_whitespace = [chr(c) for c in range(0x) if > unicodedata.category(chr(c)) == "Zs"] Try 0x10 to get all of unicode. Barry

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Ethan Furman
On 6/1/23 11:06, David Mertz, Ph.D. wrote: > I guess this is pretty general for the described need: > > >>> unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"] Using the module-level `__getattr__` that could be a lazy attribute. -- ~Ethan~
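
A rough sketch of the lazy-attribute idea, under stated assumptions. In a real module one would simply define `def __getattr__(name)` at module top level; to keep this example runnable as a single file, it attaches the hook to a throwaway module object instead. The module name `unicode_demo` and the attribute name `whitespace` are hypothetical.

```python
import sys
import types
import unicodedata

demo = types.ModuleType("unicode_demo")  # stand-in for a real module


def _lazy_getattr(name: str):
    # Called only when normal attribute lookup on the module fails.
    if name == "whitespace":
        value = "".join(
            chr(c)
            for c in range(sys.maxunicode + 1)
            if unicodedata.category(chr(c)) == "Zs"
        )
        vars(demo)[name] = value  # cache so the scan runs only once
        return value
    raise AttributeError(f"module {demo.__name__!r} has no attribute {name!r}")


demo.__getattr__ = _lazy_getattr

print("whitespace" in vars(demo))   # not computed yet
print(len(demo.whitespace))         # first access triggers the scan
print("whitespace" in vars(demo))   # now cached on the module
```

The cost of scanning all code points is then paid only by code that actually touches the attribute.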

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. wrote: > OK, fair enough. What about "has whitespace (including Unicode beyond > ASCII)"? > >>> import re >>> r = re.compile(r'\s', re.U) >>> r.search('ab\u2002cd') ❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd'
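
The regex approach can be tried with a small sketch like this one. The `re.ASCII` comparison is an addition for contrast and is not part of the message; for `str` patterns in Python 3, Unicode matching is already the default, so `re.U` is optional.

```python
import re

unicode_ws = re.compile(r"\s")           # Unicode-aware by default for str patterns
ascii_ws = re.compile(r"\s", re.ASCII)   # restricts \s to ASCII whitespace

sample = "ab\u2002cd"  # U+2002 EN SPACE between "ab" and "cd"

print(unicode_ws.search(sample))  # a match object covering the EN SPACE
print(ascii_ws.search(sample))    # no ASCII whitespace in the sample
```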

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Richard Damon
On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote: I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good. I think that is because ZERO WIDTH NO-BREAK SPACE (aka the BOM) doesn't act like a "Space" character. If used as the BOM, it is intended that it
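
The classification of U+FEFF can be checked directly from Python. This is just a quick inspection using the standard `unicodedata` module, not anything posted in the thread.

```python
import unicodedata

bom = "\ufeff"

print(unicodedata.name(bom))      # official character name
print(unicodedata.category(bom))  # general category code
print(bom.isspace())              # whether str.isspace() treats it as whitespace
```

The general category is "Cf" (format), not "Zs", which is why a filter on "Zs" does not pick it up.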

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
I guess this is pretty general for the described need: >>> %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"] CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms Wall time: 18.7 ms >>> unicode_whitespace [' ', '\xa0', '\u1680', '\u2000', '\u2001'

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Marc-Andre Lemburg
On 01.06.2023 18:18, Paul Moore wrote: On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio <antonio...@gmail.com> wrote: I suggest including a simple str variable in unicodedata module to mirror string.whitespace, so it would contain all characters defined in CPython f

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"? On Thu, Jun 1, 2023 at 1:08 PM Chris Angelico wrote: > > On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D. wrote: > > > > It feels to me like "split on whitespace" or "remove whitespace" are > > quite common operations.

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Chris Angelico
On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D. wrote: > > It feels to me like "split on whitespace" or "remove whitespace" are > quite common operations. I've been frustrated a number of times by > settling for the ASCII whitespace class when I really wanted the > Unicode whitespace class. > Th

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
It feels to me like "split on whitespace" or "remove whitespace" are quite common operations. I've been frustrated a number of times by settling for the ASCII whitespace class when I really wanted the Unicode whitespace class. On Thu, Jun 1, 2023 at 12:20 PM Paul Moore wrote: > > On Thu, 1 Jun 2
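
For the "split on whitespace" and "remove whitespace" cases, it may be worth noting that the built-in `str.split()` (called with no separator) and `str.strip()` already use the Unicode definition of whitespace. A small sketch, with a made-up sample string:

```python
# EN SPACE, IDEOGRAPHIC SPACE and NO-BREAK SPACE as separators (sample data).
text = "alpha\u2002beta\u3000gamma\u00a0delta"

words = text.split()               # no separator: split on any Unicode whitespace
collapsed = "".join(text.split())  # remove all whitespace
trimmed = "\u2002padded\u3000".strip()

print(words)
print(collapsed)
print(trimmed)
```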

[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio < antonio...@gmail.com> wrote: > I suggest including a simple str variable in unicodedata module to mirror > string.whitespace, so it would contain all characters defined in CPython > function [_PyUnicode_IsWhitespace()]( > https://github.
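
One way to approximate the proposed constant today is sketched below. It assumes that `str.isspace()` reflects the same definition as the CPython function named in the proposal. The variable name `unicode_whitespace_all` is hypothetical.

```python
import string
import sys

# Every code point that str.isspace() reports as whitespace, as one str,
# mirroring the shape of string.whitespace.
unicode_whitespace_all = "".join(
    chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()
)

print(len(unicode_whitespace_all))
print(set(string.whitespace) <= set(unicode_whitespace_all))
```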