[issue1524081] logging using the SysLog handler fails if locale is set

2020-08-17 Thread Greg Price


Greg Price  added the comment:

For the record because this issue is mentioned in a comment in 
logging/handlers.py and people are sometimes confused by it today:

> This happens because in that particular locale,
> "INFO".lower() != "info"

Since Python 3, this no longer happens: str.lower() and friends do not depend 
on the current locale.

Specifically, the lower() and similar methods on Unicode strings (now "str", 
previously "unicode") were always independent of the current locale.  The 
corresponding methods on byte strings (now "bytes", previously "str") did have 
this locale-dependent behavior, but that was replaced in commit 
6ccd3f2dbcb98b33a71ffa6eae949deae797c09c, in 2007.
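
For illustration, a minimal Python 3 demo of this (it assumes a Turkish locale
such as tr_TR.UTF-8 is installed -- the classic case where a locale-aware
lower() would produce a dotless 'ı'):

    import locale

    locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")  # assumed installed
    print("INFO".lower() == "info")  # True: str.lower() ignores the locale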

See also #37848, for a 2019 discussion of potentially adding an optional 
parameter to use an *explicit* locale.

--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue1524081>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-09-20 Thread Greg Price


Greg Price  added the comment:

> labeling long-stable code as "evil".

Let me apologize about this bit -- I was thinking of the term in quotes 
(referring to the earlier comment), but I didn't write it clearly that way.  I 
don't think any of this code is evil, past or present, and I regret that I 
carelessly gave that impression.

I think there isn't currently any problem to be solved here, and I hope we can 
all put our energy elsewhere.  (For my part, I just spent the balance of the 
evening implementing an old feature request on unicodedata.)


Victor, I'd still be quite interested in better understanding your thinking (as 
this example relates to macros in general) if you'd like to reply, though this 
thread may not be the right place to continue.  You can find my email in Git, 
and I'm on Zulip and Discourse; and I'd be happy to start or follow a thread in 
a forum you think appropriate.  Or if you'd rather drop it entirely, that's 
fine too.

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue16684] Unicode property value abbreviated names and long names

2019-09-20 Thread Greg Price


Greg Price  added the comment:

I've gone and implemented a version of this that's integrated into 
Tools/unicode/makeunicodedata.py , and into the unicodedata module.  Patch 
attached.  Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
                   # ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
              # ...
 'east_asian_width': {'A': ['Ambiguous'],
                      # ...
                      'W': ['Wide']}}


Note that the values are lists.  That's because a value can have multiple 
aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']


This implementation also provides the reverse mapping, from an alias to the 
"short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
                   # ...


This draft doesn't have tests or docs, but it's otherwise complete. I've posted 
it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts, from 
static tables in the header generated by makeunicodedata.py .  It's not *that* 
much code... but it sure would be more convenient to do in Python instead.

  Should the unicodedata module perhaps have a Python part?  I'd be happy to go 
about that -- rename the existing C module to _unicodedata and add a small 
unicodedata.py wrapper -- if there's a feeling that it'd be a good idea.  Then 
this could go there instead of using the C code I've just written.
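
  For concreteness, such a wrapper would presumably follow the usual stdlib 
pattern -- a hypothetical sketch, not code from the attached patch:

    # Lib/unicodedata.py (hypothetical)
    from _unicodedata import *  # re-export the C implementation

    # Pure-Python additions, e.g. building the alias dicts from tables
    # the C module exposes, would then live here.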


* Is this API the right one?
  * This version has e.g. 
unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .

  * Perhaps make category/bidirectional/east_asian_width into attributes rather 
than keys? So e.g. 
unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .

  * Or: the standard says "loose matching" should be applied to these names, so 
e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. 
To accomplish that, perhaps make it not dicts at all but functions? (A sketch 
of such a matcher follows this list.)

So e.g. unicodedata.property_value_by_alias('decimal number') == 
unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .

  * There's also room for bikeshedding on the names.
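
  As a sketch of that loose matching (following UAX #44's loose-matching rule 
UAX44-LM3, which ignores case, whitespace, hyphens, underscores, and an 
initial "is" prefix; the helper name here is invented):

    def _loose_key(name):
        # UAX44-LM3: ignore case, whitespace, '-', '_', and a leading "is".
        key = name.replace(' ', '').replace('-', '').replace('_', '').lower()
        if key.startswith('is'):
            key = key[2:]
        return key

    # _loose_key() maps 'Decimal_Number', 'decimal number', and
    # 'is-decimal-number' all to 'decimalnumber'.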


* How shall we handle ucd_3_2_0 for this feature?

  This implementation doesn't attempt to record the older version of the data.  
My reasoning is that because the applications of the old data are quite 
specific and they haven't needed this information yet, it seems unlikely anyone 
will ever really want to know from this module just which aliases existed 
already in 3.2.0 and which didn't yet.

  OTOH, as a convenience I've caused e.g. 
unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the 
same object as unicodedata.property_value_by_alias . This allows 
unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata 
module itself, while minimizing the complexity it adds to the implementation.

  Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's 
still easy to get at them -- just get them from the module itself -- and it 
makes it explicit that you're getting current rather than old data.

--
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch

___
Python tracker <https://bugs.python.org/issue16684>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-09-19 Thread Greg Price


Greg Price  added the comment:

I hesitate to come back to this thread, because as Raymond says it's consumed a 
lot of time already.  But I think this point is of broader interest than the 
specific few lines we've been discussing:

> When small integers are disabled at compilation time (NSMALLPOSINTS=0 and 
> NSMALLNEGINTS=0), I suggest to remove code related to small integers. IMHO 
> CHECK_SMALL_INT() macro is a good practical solution for that.

Victor, can you be more specific about the problem this is solving? I think the 
existing code in master really doesn't leave any problems in place that 
CHECK_SMALL_INT solves.

In master (as of 2702638ea), here's the code that CHECK_SMALL_INT would replace:

    if (IS_SMALL_INT(ival)) {
        return get_small_int((sdigit)ival);
    }

When NSMALLPOSINTS=0 and NSMALLNEGINTS=0 , this expands in the preprocessor to 
the equivalent of:

    if (0) {
        return (Py_UNREACHABLE(), NULL);
    }

(Specifically, it expands to whatever that expands to, since Py_UNREACHABLE and 
possibly NULL are also macros.)

A compiler that's attempting to do any significant optimization at all has to 
be able to discard code that's inside `if (0)`; dead-code elimination is a 
necessary primitive for lots of other optimizations.

(And at a quick empirical sanity-check: gcc, clang, and msvc all do so at any 
optimization setting other than "disabled".  In fact gcc and clang do so even 
with optimizations disabled.)


You made the case very nicely above that macros are evil.  The IS_SMALL_INT 
macro is fairly mild, and it has a performance benefit on some platforms to 
justify it.  But CHECK_SMALL_INT causes a return from the enclosing function, 
which seems quite high on the "evil macro" scale.

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-09-18 Thread Greg Price


Greg Price  added the comment:

Thanks Benjamin for reviewing and merging this series!

--

___
Python tracker <https://bugs.python.org/issue37760>



[issue18236] str.isspace should use the Unicode White_Space property

2019-09-18 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15849
stage: needs patch -> patch review
pull_request: https://github.com/python/cpython/pull/16254

___
Python tracker <https://bugs.python.org/issue18236>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-09-17 Thread Greg Price


Greg Price  added the comment:

> We've wasted a lot of dev time on something that never promised much value in 
> the first place.  So, I'll revert and close this tracker issue

OK, so be it.  I'll certainly agree that this series of threads consumed a lot 
more time than I anticipated or hoped, though I think we have different 
analyses of why that was.

(As you can probably guess, my previous message crossed with yours.)

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue38015] inline function generates slightly inefficient machine code

2019-09-17 Thread Greg Price


Greg Price  added the comment:

See followup at https://bugs.python.org/issue38205 and 
https://bugs.python.org/issue37812#msg352670 .

The patch in GH-15710 had a side effect of introducing a call to 
`Py_UNREACHABLE` inside a comma-expression.  A subsequent commit 3ab61473b 
changed `Py_UNREACHABLE` in a way that made that a syntax error; the issue 
wasn't caught then because the code is only compiled if `NSMALLNEGINTS + 
NSMALLPOSINTS <= 0`.  Detailed discussion on those threads.

--

___
Python tracker <https://bugs.python.org/issue38015>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-09-17 Thread Greg Price


Greg Price  added the comment:

> if using a static inline function is causing issues

Separately from whether there was or wasn't such an issue here, I think it's 
interesting to note that the build failure bpo-38205 is an example of exactly 
the opposite!  It was caused by a combination of
(a) using a macro *instead* of a plain old C function;
(b) using avoidable preprocessor conditionals.

And both of them led to that build failure in classic ways.

Specifically, the build failure happened because

(a) this `Py_UNREACHABLE` call was in an unusual syntactic context, squished 
into an expression, in order to use it in a macro.

(b) the call was behind an `#ifdef`/`#else`; and the configuration that 
included it wasn't one that got test-built by the authors of 3ab61473b (which 
modified `Py_UNREACHABLE`), nor by CI.

When conditional preprocessing is kept to a minimum -- here, if the `#if 
NSMALLNEGINTS + NSMALLPOSINTS > 0` conditional enclosed just the `small_ints` 
array that it needs to -- then this kind of build regression on non-default 
configurations can't so easily happen.

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-09-17 Thread Greg Price


Greg Price  added the comment:

Thanks Victor for linking that issue back here.

> A first change converted a macro to a static inline function. The second 
> change converted the static inline function to a macro

Not quite. The first change converted a macro `CHECK_SMALL_INT` to an 
equivalent sequence, including a `return` that had been hidden inside the macro:

    if (is_small_int(ival)) {
        return get_small_int((sdigit)ival);
    }

The second change converted `is_small_int` to a macro `IS_SMALL_INT`.

The second change also changed an `assert(0)` to say `Py_UNREACHABLE()`. I 
don't know why it did that -- it seems quite orthogonal to the rest of the 
change.


> Morever, if using a static inline function is causing issues,

The change that caused the build failure you're seeing (GH-15710) was intended 
for bpo-38015 .  Details there; briefly, common compilers on x86_32 would emit 
some unsightly extra instructions with `is_small_int`, and converting it to the 
macro `IS_SMALL_INT` eliminated those extra instructions.

It's not clear to me if anyone benchmarked to see if the conversion to a macro 
had any measurable performance benefit.  There's some measurement here:
https://bugs.python.org/issue38015#msg351315
which finds no benefit.  I'm not sure if that was measuring the change in 
GH-15710, or if it was an attempt to replicate the findings in msg351255 (which 
it quotes) about a more invasive change.

So I think the short of it is that the static inline function was not causing 
any issue, and an easy fix for the build failure is to revert GH-15710.

Alternatively, another easy fix would be to replace just the `Py_UNREACHABLE()` 
introduced by GH-15710 with the original `assert(0)`.

Either way, I don't think there's any reason here to revert GH-15216.

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue37936] gitignore file is too broad

2019-09-10 Thread Greg Price


Greg Price  added the comment:

Thanks @zach.ware for the review and merge of GH-15451!

That was the "minimal" fix, fixing rules that apply to files we have in the 
repo now.  So `rg` will no longer ignore `PC/pyconfig.h`. :-)

I've just sent GH-15823 with the "more thorough" fix on top of that, which 
tries to make it easy not to fall into the same Git pitfall again.

--
status: closed -> open

___
Python tracker <https://bugs.python.org/issue37936>



[issue37936] gitignore file is too broad

2019-09-10 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15472
pull_request: https://github.com/python/cpython/pull/15823

___
Python tracker <https://bugs.python.org/issue37936>



[issue38079] _PyObject_VAR_SIZE should avoid arithmetic overflow

2019-09-10 Thread Greg Price


Greg Price  added the comment:

(The tracker just linked GH-14838 to this issue because I mentioned it in a 
comment there, but it's not for this issue -- it's that recent fix for an 
11-year-old bug in a callsite's overflow check.)

--

___
Python tracker <https://bugs.python.org/issue38079>



[issue38079] _PyObject_VAR_SIZE should avoid arithmetic overflow

2019-09-10 Thread Greg Price


New submission from Greg Price :

Currently `_PyObject_VAR_SIZE` effectively has a precondition that it must not 
be passed arguments which would overflow in its arithmetic. If that's violated, 
it overflows... which is undefined behavior already, and in fact the likely 
next thing that happens is we allocate a short buffer for a giant array of 
items, then overflow the buffer, bam!

(The arithmetic is on `Py_ssize_t`, a signed integer type; that's why overflow 
is UB.)

It's quite difficult to maintain that precondition at call sites, because it 
effectively means writing a check that duplicates the content of 
`_PyObject_VAR_SIZE` modulo some algebraic rearrangement.  Empirically we don't 
consistently manage that: we just fixed an 11-year-old bug where such a check 
was wrong (luckily in the safe/over-conservative direction), and some other 
call paths don't immediately appear to be checking at all.

So, I think it would be helpful to move that check into `_PyObject_VAR_SIZE` 
itself.

Cc'ing the three folks on the interest list "coverity scan", as that one seems 
the closest match for who'd likely be interested in commenting on this.

Details below, largely from GH-14838 which got me looking at this.

---

In GH-14838, @sir-sigurd fixed an overflow check on a call to 
PyObject_GC_NewVar.  The call site looked like this:

    if ((size_t)size > ((size_t)PY_SSIZE_T_MAX - sizeof(PyTupleObject) -
                        sizeof(PyObject *)) / sizeof(PyObject *)) {
        return (PyTupleObject *)PyErr_NoMemory();
    }
    op = PyObject_GC_NewVar(PyTupleObject, &PyTuple_Type, size);

but the `- sizeof(PyObject *)` had the wrong sign.  Happily the bug made the 
check too conservative, but that's basically a lucky coin-flip we had -- a 
similar bug could just as easily have gone the other direction, making the code 
vulnerable to arithmetic overflow, then memory corruption from overflowing a 
too-small buffer.

The bug was present for 11 years.

Sergey, I am still very curious how you discovered the bug. :-)

I feel like there's an opportunity for a more complete fix as a followup to 
this; that's this thread.

---

My next question on reading that fix was, OK, how can we write this so it's 
hard to get wrong?

I think the point is that `PyObject_GC_NewVar` really has a precondition, that 
the `nitems` it's passed (`size` in this caller) not be so big that the 
allocation will overflow. Specifically `_PyObject_VAR_SIZE(tp, nitems)` needs 
to not overflow. All the trickiness in this overflow check is basically an 
algebraic rearrangement of what `_PyObject_VAR_SIZE(&PyTuple_Type, size)` would 
say, plus substituting in the values actually found at `&PyTuple_Type`.

So, a sort of minimal fix would be to make something with a name like 
`PyObject_GC_NewVar_WouldOverflow` that encodes that. Maybe that in turn would 
just be a macro to delegate to a `_PyObject_VAR_SIZE_WOULD_OVERFLOW`, so that 
each layer only has to know what it actually knows.

I think that immediately raises a next question, though, which is: are other 
call sites of `PyObject_GC_NewVar` checking properly for overflow? At a quick 
grep, I see some that don't appear to be checking.

And: why should they have to? It seems like the best place for this check is 
immediately before the possibly-overflowing computation; or if not there, then 
the closer to it the better.

So, the ideal solution: `_PyObject_VAR_SIZE` itself would be augmented to first 
do this check, and only if it won't overflow then go and actually multiply.
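
A rough sketch of that shape (hypothetical code, not from any patch; it does 
the arithmetic in unsigned size_t, where overflow is well-defined, and it 
ignores the alignment rounding the real macro performs):

    /* Hypothetical checked variant: returns -1 if the size computation
       would exceed PY_SSIZE_T_MAX, so the caller can raise MemoryError. */
    static Py_ssize_t
    var_size_checked(PyTypeObject *tp, Py_ssize_t nitems)
    {
        size_t base = (size_t)tp->tp_basicsize;
        size_t item = (size_t)tp->tp_itemsize;
        if (item != 0
            && (size_t)nitems > ((size_t)PY_SSIZE_T_MAX - base) / item) {
            return -1;
        }
        return (Py_ssize_t)(base + (size_t)nitems * item);
    }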

There are only a handful of call sites of `_PyObject_VAR_SIZE`, so it wouldn't 
even be much of a migration to then take care of each of them.

Then we wouldn't need these prophylactic checks at callers' callers -- which 
(a) this bug demonstrates are hard to consistently get right, and (b) the grep 
mentioned above suggests we may not be consistently doing in the first place.

--
components: Interpreter Core
messages: 351573
nosy: Greg Price, brett.cannon, christian.heimes, sir-sigurd, twouters
priority: normal
severity: normal
status: open
title: _PyObject_VAR_SIZE should avoid arithmetic overflow
versions: Python 3.9

___
Python tracker <https://bugs.python.org/issue38079>



[issue38015] inline function generates slightly inefficient machine code

2019-09-08 Thread Greg Price


Greg Price  added the comment:

(Just to help keep discussions together: some earlier discussion was on 
GH-15216 .)

Because is_small_int / IS_SMALL_INT is so small, there's not much cost in the 
source code to making it a macro (as GH-15710 did).

But I think it'd be a mistake to go a lot farther than that and convert 
significantly larger chunks of code like get_small_int to macros (as GH-15718 
would), unless the payoff were really commensurate.  It'd be better to focus 
optimization efforts where profiling shows a lot of time is really being spent.

--

___
Python tracker <https://bugs.python.org/issue38015>



[issue38043] small cleanups in Unicode normalization code

2019-09-05 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15368
pull_request: https://github.com/python/cpython/pull/15558

___
Python tracker <https://bugs.python.org/issue38043>



[issue38043] small cleanups in Unicode normalization code

2019-09-05 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15367
pull_request: https://github.com/python/cpython/pull/15712

___
Python tracker <https://bugs.python.org/issue38043>



[issue38043] small cleanups in Unicode normalization code

2019-09-05 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15366
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15711

___
Python tracker <https://bugs.python.org/issue38043>



[issue38043] small cleanups in Unicode normalization code

2019-09-05 Thread Greg Price


New submission from Greg Price :

Benjamin noticed in reviewing GH-15558 (for #37966) several points where the 
existing code around Unicode normalization can be improved:

* on the `QuickcheckResult` enum:
  > Maybe `makeunicodedata.py` should output this enum (with better name 
namespacing)

* > merging `test_normalization` into this file [i.e. `test_unicodedata.py`] 
for clarity

* > These "boolean int" parameters could be actual `bool`s. [sc. the `nfc` and 
`k` parameters to `is_normalized_quickcheck`]

None of these are super hard, so good to knock them out while we're thinking of 
them.

--
components: Unicode
messages: 351229
nosy: Greg Price, benjamin.peterson, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: small cleanups in Unicode normalization code
versions: Python 3.9

___
Python tracker <https://bugs.python.org/issue38043>



[issue37837] add internal _PyLong_FromUnsignedChar() function

2019-08-27 Thread Greg Price


Greg Price  added the comment:

Ah OK, that makes sense of it then :)

> But the most important thing is that using PyLong_FromUnsignedLong() instead 
> of _PyLong_FromUnsignedChar() on top of GH-15192 is producing the same 
> results: striter_next() uses small_ints[] directly. However that's not true 
> for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think 
> that's due to how code is profiled so I'm satisfied with these results more 
> or less.

Very interesting! That's a good comparison to have made, too.

I agree with your conclusions.

--

___
Python tracker <https://bugs.python.org/issue37837>



[issue6331] Add unicode script info to the unicode database

2019-08-27 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue6331>



[issue37966] is_normalized is much slower at "no" than the standard's algorithm

2019-08-27 Thread Greg Price


Greg Price  added the comment:

Fix posted, as GH-15558.

Adding cc's for the folks in the thread on #32285, where this function was 
originally added.

--
components: +Unicode
nosy: +Maxime Belanger, benjamin.peterson, ezio.melotti, steven.daprano, 
vstinner
title: is_normalized is much slower than the standard's algorithm -> 
is_normalized is much slower at "no" than the standard's algorithm
versions: +Python 3.9

___
Python tracker <https://bugs.python.org/issue37966>



[issue37966] is_normalized is much slower than the standard's algorithm

2019-08-27 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15231
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15558

___
Python tracker <https://bugs.python.org/issue37966>



[issue37966] is_normalized is much slower than the standard's algorithm

2019-08-27 Thread Greg Price


New submission from Greg Price :

In 3.8 we add a new function `unicodedata.is_normalized`.  The result is 
equivalent to `str == unicodedata.normalize(form, str)`, but the implementation 
uses a version of the "quick check" algorithm from UAX #15 as an optimization 
to try to avoid having to copy the whole string.  This was added in issue 
#32285, commit 2810dd7be.

However, it turns out the code doesn't actually implement the same algorithm as 
UAX #15, and as a result we often miss the optimization and end up having to 
compute the whole normalized string after all.

Here's a quick demo on my desktop.  We pass a long string made entirely out of 
a character for which the quick-check algorithm always says `NO`, it's not 
normalized:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500_000' -- \
'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500_000' -- \
's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop

That's the same 4.4 ms (for a 1 MB string) with or without the attempted 
optimization.

Here it is after a patch that makes the algorithm run as in the standard:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500_000' -- \
'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

The root cause of the issue is that our `is_normalized` static helper, which 
the new function relies on, was never written as a full implementation of the 
quick-check algorithm.  The full algorithm can return YES, MAYBE, or NO; but 
originally this helper's only caller was the implementation of 
`unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO.  So the 
helper often returns MAYBE when the standard algorithm would say NO.

(More precisely, perhaps: it's fine that this helper was never a full 
implementation... but it didn't say that anywhere, even while referring to the 
standard algorithm, and as a result set us up for future confusion.)

That's exactly what's happening in the example above: the standard quick-check 
algorithm would say NO, but our helper says MAYBE.  Which for 
`unicodedata.is_normalized` means it has to go compute the whole normalized 
string.
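
For reference, the standard's quick-check loop (UAX #15, section 9) looks 
roughly like this Python sketch -- the helper names are invented, and the real 
helper is C code in Modules/unicodedata.c:

    def quick_check(s, qc_property, combining_class):
        # Returns 'YES', 'MAYBE', or 'NO' for a given normalization form.
        last_ccc = 0
        result = 'YES'
        for ch in s:
            ccc = combining_class(ch)
            if last_ccc > ccc and ccc != 0:
                return 'NO'  # combining marks out of canonical order
            check = qc_property(ch)  # per-character 'YES'/'MAYBE'/'NO'
            if check == 'NO':
                return 'NO'  # the case our helper conflates with MAYBE
            if check == 'MAYBE':
                result = 'MAYBE'
            last_ccc = ccc
        return result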

--
messages: 350651
nosy: Greg Price
priority: normal
severity: normal
status: open
title: is_normalized is much slower than the standard's algorithm
versions: Python 3.8

___
Python tracker <https://bugs.python.org/issue37966>



[issue37936] gitignore file is too broad

2019-08-26 Thread Greg Price


Greg Price  added the comment:

> I have a minimal fix which takes care of all the files above. I'll post that 
> shortly, and I may also write up a more thorough fix that tries to make it 
> easy not to fall into the same Git pitfall again.

Both now done.
* GH-15451 is that minimal fix.
* Just now I've posted https://github.com/gnprice/cpython/pull/1 , which is the 
more thorough fix. It's written on top of the minimal fix, but I could also 
squash them together as one PR.

Also just posted GH-15542 . This is a small edit, deleting some lines whose 
author evidently never intended them to get committed to the repo.  I'd noticed 
those lines before and been mildly puzzled; writing up the thorough fix caused 
me to look in the history and see how they were added, and resolve the mystery 
:).  I've kept that as a separate PR just to reduce the number of moving parts 
in the main one, because the reasoning for it is quite independent.

--

___
Python tracker <https://bugs.python.org/issue37936>



[issue37936] gitignore file is too broad

2019-08-26 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15219
pull_request: https://github.com/python/cpython/pull/15542

___
Python tracker <https://bugs.python.org/issue37936>



[issue37837] add internal _PyLong_FromUnsignedChar() function

2019-08-25 Thread Greg Price


Greg Price  added the comment:

Very interesting, thanks!

It looks like with LTO enabled, this optimization has no effect at all.

This change adds significant complexity, and it seems like the hoped-for payoff 
is entirely in terms of performance on rather narrowly-focused microbenchmarks. 
 In general I think that sets the bar rather high for making sure the 
performance gains are meaningful enough to justify the increase in complexity 
in the code.

In particular, I expect most anyone running Python and concerned with 
performance to be using LTO.  (It's standard in distro builds of Python, so 
that covers probably most users already.)  That means if the optimization 
doesn't do anything in the presence of LTO, it doesn't really count. ;-)

---

Now, I am surprised at the specifics of the result! I expected that LTO would 
probably pick up the equivalent optimization on its own, so that this change 
didn't have an effect. Instead, it looks like with LTO, this microbenchmark 
performs the same as it does without LTO *before* this change. That suggests 
that LTO may instead be blocking this optimization.

In that case, there may still be an opportunity: if you can work out why the 
change doesn't help under LTO, maybe you can find a way to make this 
optimization happen under LTO after all.

--

___
Python tracker <https://bugs.python.org/issue37837>



[issue18236] str.isspace should use the Unicode White_Space property

2019-08-24 Thread Greg Price


Greg Price  added the comment:

> I've gone and made a patch for this change

Update:

* The preparatory changes in #37760 are now almost all merged; GH-15265 is the 
one piece remaining, and I'd be grateful for a review.

It's a generally straightforward and boring change that converts the main data 
structures of makeunicodedata.py from using length-18 tuples as records to 
using a dataclass (a simplified sketch follows below), which I think makes 
subsequent changes that add features to that script much easier both to write 
and to review.

* I have a slightly updated version of the fix itself, which differs mainly by 
adding a test: https://github.com/gnprice/cpython/commit/9b3bf6739 . Comments 
are welcome there too.
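
(The sketch mentioned above -- a simplified, hypothetical version, with 
invented field names; the real record type in GH-15265 has 18 fields:

    from dataclasses import dataclass

    @dataclass
    class UcdRecord:
        category: str
        combining: int
        bidirectional: str
        # ... 15 more fields, one per column of the UCD data files

so later code can say record.category instead of indexing into an 18-tuple.)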

--

___
Python tracker <https://bugs.python.org/issue18236>



[issue37837] add internal _PyLong_FromUnsignedChar() function

2019-08-24 Thread Greg Price


Greg Price  added the comment:

> Is there a particular reason to specifically call PyLong_FromSize_t? Seems 
> like PyLong_FromLong is the natural default (and what we default to in the 
> rest of the code), and it's what this ends up calling anyway.

Ah I see, the patch is meant to go on top of GH-15192, which makes 
PyLong_FromSize_t apply the small-int optimization itself.

As I just suggested on GH-15192, I'd like to see that PR apply the small-int 
optimization in the more broadly-used PyLong_FromUnsignedLong... and then I 
think the natural thing for this new function to do is to call that.

Still quite curious how LTO does, and also curious what compiler and flags 
you're using in benchmarks.

--

___
Python tracker <https://bugs.python.org/issue37837>



[issue37837] add internal _PyLong_FromUnsignedChar() function

2019-08-24 Thread Greg Price


Greg Price  added the comment:

Oh also:

* What compiler, and what compilation flags, are you using in your 
benchmarking?  That seems relevant :)

--

___
Python tracker <https://bugs.python.org/issue37837>



[issue37837] add internal _PyLong_FromUnsignedChar() function

2019-08-24 Thread Greg Price


Greg Price  added the comment:

Hmm, I'm a bit confused because:

* Your patch at GH-15251 replaces a number of calls to PyLong_FromLong with 
calls to the new _PyLong_FromUnsignedChar.

* That function, in turn, just calls PyLong_FromSize_t.

* And that function begins:

    PyObject *
    PyLong_FromSize_t(size_t ival)
    {
        PyLongObject *v;
        size_t t;
        int ndigits = 0;

        if (ival < PyLong_BASE)
            return PyLong_FromLong((long)ival);
        // ...


* So, it seems like after your patch we still end up calling PyLong_FromLong at 
each of these callsites, just after a couple more indirections than before.

Given the magic of compilers and of hardware branch prediction, it wouldn't at 
all surprise me for those indirections to not make anything slower... but if 
the measurements are coming out *faster*, then I feel like something else must 
be going on. ;-)

Ohhh, I see -- I bet it's that at _PyLong_FromUnsignedChar, the compiler can 
see that `is_small_int(ival)` is always true, so the whole function just turns 
into get_small_int.  Whereas when compiling a call to PyLong_FromLong from some 
other file (other translation unit), it can't see that and can't make the 
optimization.
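
(For reference, from the description above the new function is presumably 
something like this -- a reconstruction, not a quote from the patch:

    static inline PyObject *
    _PyLong_FromUnsignedChar(unsigned char i)
    {
        return PyLong_FromSize_t(i);
    }

With the argument known to be at most 255, an inlining compiler can then 
resolve the small-int test at compile time.)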

Two questions, then:

* How do the measurements look under LTO? I wonder if with LTO the linker is 
able to make the same optimization that this change helps the compiler make.

* Is there a particular reason to specifically call PyLong_FromSize_t? Seems 
like PyLong_FromLong is the natural default (and what we default to in the rest 
of the code), and it's what this ends up calling anyway.

--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue37837>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-24 Thread Greg Price


Greg Price  added the comment:

> May I suggest directing your efforts towards fixing known bugs or 
> implementing requested features.

Well, I would certainly be grateful for a review on my fix to #18236. ;-) 
There's also a small docs bug at GH-15301.

I do think there's significant value in making code easier to read and less 
tricky. If the project continues to be successful for a long time to come, then 
that means the code will be read many, many more times than it's written. But 
one particular spot where it seems our experiences interestingly differ is:

> They are a bit tedious to review and are eating up our time in the back and 
> forth.

As a reviewer I generally find it much less work to review a change when it's 
intended to have no effect on the code's behavior. First, because it's easier 
to confirm no effect than to pin down what the effects are; then because the 
whole set of questions about whether the effects are desirable doesn't arise. 
As a result I often ask contributors (to Zulip, say) to split a change into a 
series of small pure refactors, followed by a very focused diff for the 
behavior change.

So that's certainly background to my sending as many PRs that don't change any 
behavior as PRs that do.

I actually have quite a number of draft changes built up over the last few 
weeks. I've held back on sending them all at once, partly because I've felt I 
have enough open PRs and I wanted to get a better sense of how reviews go. 
Perhaps I'll go pick out a couple more of them that are bugfixes, features, and 
docs to send next.

(You didn't mention docs just now, but given the care I see you take in adding 
to them and in revising What's New, I think we agree that work there is 
valuable.)

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue37936] gitignore file is too broad

2019-08-24 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15143
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15451

___
Python tracker <https://bugs.python.org/issue37936>



[issue37936] gitignore file is too broad

2019-08-24 Thread Greg Price


New submission from Greg Price :

There are a number of files that we track in the repo, but are nevertheless 
covered by `.gitignore`.

This *mostly* doesn't change anything, because Git itself only cares what 
`.gitignore` has to say about files that aren't already tracked.  But:

* It affects any new files someone might add that are covered by the same 
unintentionally-broad patterns.  In that case it'd be likely to cause some 
confused debugging into why Git wasn't seeing the file; or possibly loss of 
work, if the person didn't notice that the file had never been committed to Git.

* More immediately, other tools that aren't Git but consult the Git ignore 
rules don't necessarily implement this wrinkle. In particular this is 
unfortunately a WONTFIX bug in ripgrep / `rg`: 
https://github.com/BurntSushi/ripgrep/issues/1127 .

I learned of the `rg` bug (and, for that matter, refreshed myself on just how 
Git itself handles this case) after some confusion today where I was looking 
with rg for references to a given macro, thought I'd looked at all of them... and 
then later noticed through `git log -p -S` a reference in `PC/pyconfig.h` with 
no subsequent change deleting it. Turned out it was indeed there and I needed 
to take account of it.

Here's the list of affected files:

$ git ls-files -i --exclude-standard 
.gitignore
Doc/Makefile
Lib/test/data/README
Modules/Setup
PC/pyconfig.h
Tools/freeze/test/Makefile
Tools/msi/core/core.wixproj
Tools/msi/core/core.wxs
Tools/msi/core/core_d.wixproj
Tools/msi/core/core_d.wxs
Tools/msi/core/core_en-US.wxl
Tools/msi/core/core_files.wxs
Tools/msi/core/core_pdb.wixproj
Tools/msi/core/core_pdb.wxs
Tools/unicode/Makefile

Fortunately this is not hard to fix.  The semantics of `.gitignore` have a 
couple of gotchas, but once you know them it's not really any more complicated 
to get the behavior exactly right.  And I've previously spent the hour or two 
to read up on it... and when I forget, I just consult my own short notes :), at 
the top of this file:
https://github.com/zulip/zulip/blob/master/.gitignore

I have a minimal fix which takes care of all the files above. I'll post that 
shortly, and I may also write up a more thorough fix that tries to make it easy 
not to fall into the same Git pitfall again.
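
To illustrate the main gotcha -- reconstructing what presumably happened with 
PC/pyconfig.h:

    # A pattern with no slash matches at any depth, so this ignores the
    # tracked file PC/pyconfig.h along with any generated ./pyconfig.h:
    pyconfig.h

    # A leading slash anchors the pattern to the repository root:
    /pyconfig.h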

--
components: Build
messages: 350355
nosy: Greg Price
priority: normal
severity: normal
status: open
title: gitignore file is too broad
versions: Python 3.9

___
Python tracker <https://bugs.python.org/issue37936>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-24 Thread Greg Price


Greg Price  added the comment:

Thanks, Raymond, for the review on GH-15216!

Shortly after posting this issue, I noticed a very similar story in 
CHECK_BINOP.  I've just posted GH-15448 to similarly make returns explicit 
there.  It basically consists of a number of repetitions of

-CHECK_BINOP(self, other);
+if (!PyLong_Check(self) || !PyLong_Check(other)) {
+Py_RETURN_NOTIMPLEMENTED;
+}

with the names `self` and `other` varying from site to site.  Though the 
expanded version isn't literally a `return` statement, I think it functions 
almost as well as one for the purposes that explicitness serves here, and much 
better so than a reference to `CHECK_BINOP`.

--

___
Python tracker <https://bugs.python.org/issue37812>



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-23 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15140
pull_request: https://github.com/python/cpython/pull/15448

___
Python tracker <https://bugs.python.org/issue37812>



[issue36375] PEP 499 implementation: "python -m foo" binds the main module as both __main__ and foo in sys.modules

2019-08-21 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue36375>



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-20 Thread Greg Price


Greg Price  added the comment:

(A bit easy to miss in the way this thread gets displayed, so to highlight in a 
comment: GH-15265 is up, following the 5 other patches which have now all been 
merged.  That's the one that replaces the length-18 tuples with a dataclass.)

--

___
Python tracker <https://bugs.python.org/issue37760>



[issue35518] test_timeout uses blackhole.snakebite.net domain which doesn't exist anymore

2019-08-20 Thread Greg Price


Greg Price  added the comment:

I ran across this test when looking at especially slow files in the test suite: 
it turns out that not only is this service currently down, but the 
snakebite.net domain still exists, and as a result the test can end up waiting 
20-30s before learning that the hosts can't be found and the test gets skipped.

I agree with Benjamin's and Victor's comments -- the best solution would be to 
recreate the test, ideally as something that anyone (anyone with Docker 
installed, perhaps?) can just run locally.

For now I've just sent GH-15349 as a one-line fix to skip the test, with a 
remark pointing at this issue. It's already getting skipped 100% of the time 
thanks to the handy `support.transient_internet` mechanism -- this just makes 
the skip (a) explicit in the source code, and (b) a lot faster. :-)

--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue35518>



[issue35518] test_timeout uses blackhole.snakebite.net domain which doesn't exist anymore

2019-08-20 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15063
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15349

___
Python tracker <https://bugs.python.org/issue35518>



[issue36502] str.isspace() for U+00A0 and U+202F differs from document

2019-08-19 Thread Greg Price


Greg Price  added the comment:

Thanks Victor for the reviews and merges!

(Unmarking 2.7, because https://docs.python.org/2/library/stdtypes.html seems 
to not have this issue.)

--
versions:  -Python 2.7

___
Python tracker <https://bugs.python.org/issue36502>



[issue37872] Move _Py_IDENTIFIER statics in Python/import.c to top of the file

2019-08-16 Thread Greg Price


Change by Greg Price :


--
components: +Interpreter Core

___
Python tracker <https://bugs.python.org/issue37872>



[issue37872] Move _Py_IDENTIFIER statics in Python/import.c to top of the file

2019-08-16 Thread Greg Price


Change by Greg Price :


--
title: Move statics in Python/import.c to top of the file -> Move 
_Py_IDENTIFIER statics in Python/import.c to top of the file

___
Python tracker <https://bugs.python.org/issue37872>



[issue37872] Move statics in Python/import.c to top of the file

2019-08-16 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price

___
Python tracker <https://bugs.python.org/issue37872>



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-14 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15027
pull_request: https://github.com/python/cpython/pull/15302

___
Python tracker <https://bugs.python.org/issue37758>



[issue36502] str.isspace() for U+00A0 and U+202F differs from document

2019-08-14 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15026
pull_request: https://github.com/python/cpython/pull/15301

___
Python tracker <https://bugs.python.org/issue36502>



[issue37864] Correct and deduplicate docs on "printable" characters

2019-08-14 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +15025
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15300

___
Python tracker <https://bugs.python.org/issue37864>



[issue37864] Correct and deduplicate docs on "printable" characters

2019-08-14 Thread Greg Price


New submission from Greg Price :

While working on #36502 and then #18236 about the definition and docs of 
str.isspace(), I looked closely also at its neighbor str.isprintable().

It turned out that we have the definition of what makes a character "printable" 
documented in three places, giving two different definitions.

The definition in the comment on `_PyUnicode_IsPrintable` is inverted, so 
that's an easy small fix.

With that correction, the two definitions turn out to be equivalent -- but to 
confirm that, you have to go look up, or happen to know, that those are the 
only five "Other" categories and only three "Separator" categories in the 
Unicode character database.  That makes it hard for the reader to tell whether 
they really are the same, or if there's some subtle difference in the intended 
semantics.
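
Concretely, the single definition both passages are aiming at is, as a Python 
sketch (the helper name is invented):

    import unicodedata

    def is_printable_char(ch):
        # Printable means: not in a Unicode "Other" category (Cc, Cf, Cs,
        # Co, Cn) or "Separator" category (Zl, Zp, Zs) -- except that the
        # ASCII space, despite being Zs, counts as printable.
        return ch == ' ' or unicodedata.category(ch)[0] not in 'CZ'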

I've taken a crack at writing some improved docs text for a single definition, 
borrowing ideas from the C comment as well as the existing docs text; and then 
pointing there from the other places we'd had definitions. PR coming shortly.

--
components: Unicode
messages: 349792
nosy: Greg Price, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Correct and deduplicate docs on "printable" characters
versions: Python 3.8

___
Python tracker <https://bugs.python.org/issue37864>



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-14 Thread Greg Price


Change by Greg Price :


--
nosy: +vstinner

___
Python tracker <https://bugs.python.org/issue37758>



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price


Greg Price  added the comment:

> About the RSS memory, I'm not sure how Linux accounts the Unicode databases 
> before they are accessed. Is it like read-only memory loaded on demand when 
> accessed?

It stands for "resident set size", as in "resident in memory"; and it only 
counts pages of real physical memory. The intention is to count up pages that 
the process is somehow using.

Where the definition potentially gets fuzzy is if this process and another are 
sharing some memory.  I don't know much about how that kind of edge case is 
handled.  But one thing I think it's pretty consistently good at is not 
counting pages that you've nominally mapped from a file, but haven't actually 
forced to be loaded physically into memory by actually looking at them.

That is: say you ask for a file (or some range of it) to be mapped into memory 
for you.  This means it's now there in the address space, and if the process 
does a load instruction from any of those addresses, the kernel will ensure the 
load instruction works seamlessly.  But: most of it won't be eagerly read from 
disk or loaded physically into RAM.  Rather, the kernel's counting on that load 
instruction causing a page fault; and its page-fault handler will take care of 
reading from the disk and sticking the data physically into RAM.  So until you 
actually execute some loads from those addresses, the data in that mapping 
doesn't contribute to the genuine demand for scarce physical RAM on the 
machine; and it also isn't counted in the RSS number.


Here's a demo!  This 262392 kiB (269 MB) Git packfile is the biggest file lying 
around in my CPython directory:

$ du -k .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack
262392  .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack


Open it for read -- adds 100 kiB, not sure why:

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, mmap
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9968 kB
>>> fd = os.open('.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack', os.O_RDONLY)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10068 kB


Map it into our address space -- RSS doesn't budge:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> m
<mmap.mmap object at 0x...>
>>> len(m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10068 kB


Cause the process to actually look at all the data (this takes about ~10s, 
too)...

>>> sum(len(l) for l in m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:271576 kB

RSS goes way up, by 261508 kiB!  Oddly slightly less (by ~1MB) than the file's 
size.


But wait, there's more. Drop that mapping, and RSS goes right back down (OK, 
keeps 8 kiB extra):

>>> del m
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10076 kB

... and then map the exact same file again, and it's *still* down:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10076 kB

This last step is interesting because it's a certainty that the data is still 
physically in memory -- this is my desktop, with plenty of free RAM.  And it's 
even in our address space.  But because we haven't actually loaded from those 
addresses, it's still in memory only at the kernel's caching whim, and so 
apparently our process doesn't get "charged" or "blamed" for its presence there.


In the case of running an executable with a bunch of data in it, I expect that 
the bulk of the data (and of the code for that matter) winds up treated very 
much like the file contents we mmap'd in.  It's mapped but not eagerly 
physically loaded; so it doesn't contribute to the RSS number, nor to the 
genuine demand for scarce physical RAM on the machine.


That's a bit long :-), but hopefully informative.  In short, I think for us RSS 
should work well as a pretty faithful measure of the real memory consumption 
that we want to be frugal with.

--

___
Python tracker <https://bugs.python.org/issue32771>



[issue36502] str.isspace() for U+00A0 and U+202F differs from document

2019-08-14 Thread Greg Price


Change by Greg Price :


--
pull_requests: +15019
pull_request: https://github.com/python/cpython/pull/15296

___
Python tracker <https://bugs.python.org/issue36502>



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price


Greg Price  added the comment:

(I should add that it was only after doing the reading that produced the OP 
that I had a clear idea what I thought the priority of the issue was -- before 
doing that work I didn't have a clear sense of the scope of what it affects. 
Based on that SpecialCasing.txt file as of Unicode 12.0.0, I believe the 
functionality we don't currently support is entirely about the handling of 
certain versions of the Latin letter I, as treated in Lithuanian, Turkish, and 
Azerbaijani. Though one function of this issue thread is that it would be a 
great place to point out if there's another component to it!)

--

___
Python tracker <https://bugs.python.org/issue37848>



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price


Greg Price  added the comment:

> Maintaining Python is already expensive [...] There are already enough bugs 
> waiting for you to be fixed ;-)

BTW I basically agree with this. I think this is not a high-priority issue, and 
I have my eye on some of those bugs. :-)

I think the fact that it's per-*language* (despite my inaccurate phrasing in 
the OP), not per-locale, simplifies it some -- for example the whole `.UTF-8` 
vs `.utf8` thing doesn't appear. And in particular I think if/when someone 
decides to sit down and make an implementation of this, then if they take the 
time to carefully read and absorb the relevant pages of the standard... this is 
a feature where it's pretty feasible for the implementation to be a 
self-contained and relatively stable and low-bugs piece of code.

And in general I think even if nobody implements it soon, it's useful to have 
an issue that can be pointed to for this feature, and especially so if the 
discussion clearly lays out what the feature involves and what different 
people's views are on the API. For example #18236 has been open for 6 years, 
but the discussion there was extremely helpful for me to understand it and work 
up a fix, after just being pointed to it by someone who'd searched the tracker 
on seeing me send in the doc fix GH-15019.

--

___
Python tracker 
<https://bugs.python.org/issue37848>
___



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price

Greg Price  added the comment:

> I believe that all locale specific things should be in the locale module, not 
> in the str class.

The locale module is all about doing things with the current process-global 
Unix locale. I don't think that'd be an appropriate interface for this -- if 
it's worth doing, it's worth doing in such a way that the same web server 
process can handle requests for Turkish-, Lithuanian-, and Spanish-speaking 
users without having to reset a global variable for each one.

> If a locale specific mapping is requested, this should be done
> explicitly by e.g. providing a parameter to str.lower() / upper() /
> title().

I like this design.

I said "locale" above, but that wasn't quite right, I think -- the file says 
e.g. `tr`, not `tr_TR` and `tr_CY`, and it describes the identifiers as 
"language IDs".  So perhaps

str.lower(*, lang=None)

?  And then

"I".lower(lang="tr") == "ı" == "\N{Latin small letter dotless I}"

--

___
Python tracker 
<https://bugs.python.org/issue37848>
___



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price


Greg Price  added the comment:

OK, I forked off the discussion of case-mapping as #37848. I think it's 
probably good to first sort out what we want, before returning to how to 
implement it (if it's agreed that changes are desired.)

Are there other areas of functionality that would be good to add in the core, 
and require data that's currently only in unicodedata?

--

___
Python tracker 
<https://bugs.python.org/issue32771>
___



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price


Greg Price  added the comment:

Another previous discussion is #4610.

--

___
Python tracker 
<https://bugs.python.org/issue37848>
___



[issue37848] More fully implement Unicode's case mappings

2019-08-13 Thread Greg Price

New submission from Greg Price :

Splitting this out from #32771 for more specific discussion. Benjamin writes 
there that it would be good to:

> implement the locale-specific case mappings of 
> https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of 
> the Unicode 12 standard in str.lower/upper/casefold.

and adds that an implementation would require having available in the core the 
data on canonical combining classes, which is currently only in the unicodedata 
module.

---

First, I'd like to better understand what functionality we have now and what 
else the standard describes.  Reading 
https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt , I see
* a bunch of rules that aren't language-specific
* some other rules that are.

I also see in makeunicodedata.py that we don't even parse the language-specific 
rules.

Here's, IIUC, a demo of us correctly implementing the language-independent 
rules.  One line in the data file reads:

FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings 
respectively:

$ unicode --brief "$(./python -c \
    's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper())))')"
ff U+FB00 LATIN SMALL LIGATURE FF
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
f U+0066 LATIN SMALL LETTER F
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
F U+0046 LATIN CAPITAL LETTER F

OK, great.

---


Then here's something we don't implement. Another line in the file reads:

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.:

i U+0069 LATIN SMALL LETTER I
 ̇ U+0307 COMBINING DOT ABOVE
 ́ U+0301 COMBINING ACUTE ACCENT

... but only in a Lithuanian (`lt`) locale.
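
For concreteness, today we just apply the unconditional mapping from
UnicodeData.txt, so the gap looks like this:

>>> "\u00CD".lower() == "\u00ED"  # LATIN SMALL LETTER I WITH ACUTE
True
>>> "\u00CD".lower() == "\u0069\u0307\u0301"  # the lt-specific mapping above
False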

One question is: what would the right API for this be? I'm not sure I'd want 
`str.lower`'s results to depend on the process's current Unix locale... and I 
definitely wouldn't want to get that without some way of instead telling it 
what locale to use. (Either to use a single constant locale, or to use a 
per-user locale in e.g. a web application.)  Perhaps `str.lower` and friends 
would take a keyword argument `locale`?


Oh, one more link for reference: the said section of the standard is in this 
PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf , near the end.


And a related previous issue: #12736.

--
components: Unicode
messages: 349646
nosy: Greg Price, benjamin.peterson, ezio.melotti, lemburg, vstinner
priority: normal
severity: normal
status: open
title: More fully implement Unicode's case mappings
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue37848>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14985
pull_request: https://github.com/python/cpython/pull/15265

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Greg Price  added the comment:

> From my perspective, the main problem with using type annotations is that 
> there's nothing checking them in CI.

Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning 
everyone not to do that in their codebases, because what you really don't want 
is a bunch of annotations that have gradually accumulated falsehoods as the 
code has changed around them.

Still, I think from "some annotations + no checking" the good equilibrium to 
land in is "some annotations + checking", not "no annotations + no checking". (I 
do mean "some" -- I don't predict we'll ever go sweeping all over adding them.) 
And I think the highest-probability way to get there is to let them continue to 
accumulate where people occasionally add them in new/revised code... because 
that holds a door open for someone to step up to start checking them, and then 
to do the work to make that part of CI. (That someone might even be me! But I 
can think of plenty of other likely folks to do it.)

If we accumulated quite a lot of them and nobody had yet stepped up to make 
checking happen, I'd worry.  But with the smattering we currently have, I think 
that point is far off.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price


Greg Price  added the comment:

Speaking of improving functionality:

> Having unicodedata readily accessible to the str type would also permit 
> higher a fidelity unicode implementation. For example, implementing 
> language-tailored str.lower() requires having canonical combining class of a 
> character available. This data lives only in unicodedata currently.

Benjamin, can you say more about the behavior you have in mind here? I don't 
entirely follow. (Is or should there be an issue for it?)

--
versions: +Python 3.9 -Python 3.8

___
Python tracker 
<https://bugs.python.org/issue32771>
___



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price

Greg Price  added the comment:

> Loading it dynamically reduces the memory footprint.

Ah, this is a good question to ask!

First, FWIW on my Debian buster desktop I get a smaller figure for `import 
unicodedata`: only 64 kiB.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9888 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9952 kB

But whether 64 kiB or 160 kiB, it's much smaller than the 1.1 MiB of the whole 
module.  Which makes sense -- there's no need to bring the whole thing into 
memory when we only import it, or generally to bring into memory the parts we 
aren't using.  I wouldn't expect that to change materially if the tables and 
algorithms were built in.

Here's another experiment: suppose we load everything that ast.c needs in order 
to handle non-ASCII identifiers.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9800 kB

>>> là = 3
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9864 kB

So that also comes to 64 kiB.

We wouldn't want to add 64 kiB to our memory use for no reason; but I think 64 
or 160 kiB is well within the range that's an acceptable cost if it gets us a 
significant simplification or improvement to core functionality, like Unicode.

--

___
Python tracker 
<https://bugs.python.org/issue32771>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Greg Price  added the comment:

> This is good. But the title mentioned dataclasses, and they are 3.7+.

Ahh, sorry, I think now I understand you. :-)

Indeed, when I switch to the branch with that change 
(https://github.com/gnprice/cpython/commit/2b4aec4dd -- it comes after the 
patch that's GH-15248, so I haven't yet sent it as a PR), then `python3.6 
Tools/unicode/makeunicodedata.py` no longer works.

I think this is fine. Most of all that's because this always works:

./python Tools/unicode/makeunicodedata.py

Anyone who's going to be running that script will want to build a `./python` 
right afterward, in order to at least run the tests. So it doesn't seem like 
much trouble to do the build first and then run the script (and then a quick 
rebuild for the handful of changed files), if indeed the person doesn't already 
have a `./python` lying around.

In fact `./python` is exactly what I used most of the time to run this script 
when I was developing these changes, simply because it seemed like the natural 
thing to do.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Greg Price  added the comment:

> What is the minimal Python version for developing CPython? The system Python 
> 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

Ah, I think my previous message had an ambiguous parse: the earliest that 
*uses* of the typing module appeared in the stdlib was 3.7. The typing module 
has been around longer than that.

I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, 
both at master and with GH-15248.

I think it would be OK for doing development on CPython to require the latest 
minor version (i.e. 3.7) -- after all, if you're doing development, you're 
already building it, so you can always get a newer version than your system 
provides if needed.  But happily the question is moot here, so I guess the 
place to discuss that further would be a new thread.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Greg Price  added the comment:

> BTW: Since when do we use type annotations in Python's stdlib ?

Hmm, interesting question!

At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, 
importlib. The earliest it appeared was in 3.7.0a4.

It's in more places in the test suite, which I think is a closer parallel to 
this maintainer script in Tools/.

The typing module itself is in the stdlib, so I don't see any obstacle to using 
it more widely. I imagine the main reason it doesn't appear more widely already 
is simply that it's new, and most of the stdlib is quite stable.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Greg Price  added the comment:

> I like to run pyflakes time to time on the Python code base. Please avoid 
> "import *" since it prevents pyflakes (and other code analyzers) to find bugs.

Ah fair enough, thanks!

Pushed that change to the next/current PR, GH-15248.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price

___
Python tracker 
<https://bugs.python.org/issue32771>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-13 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14969
pull_request: https://github.com/python/cpython/pull/15248

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-11 Thread Greg Price


Greg Price  added the comment:

> On the other hand, if it is tricky and requires something more than minor 
> surgery, that would be a hint that it isn't worth it.  There is some value in 
> code that is stable and known to be working just fine.

Definitely true!

I think this change turned out quite clean, though. Just sent, as GH-15216 -- 
please take a look.

--

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-11 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14944
pull_request: https://github.com/python/cpython/pull/15216

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-11 Thread Greg Price


Greg Price  added the comment:

> Sometimes for testing we turn it off in order to identify identity test
> bugs.

Interesting! Well, if the option is useful for testing, that's certainly a
good reason to keep it.

> Also, eveonline was controlling this to save memory.

Also interesting. I feel like there must be something about their setup
that I'm not imagining, because this is a pretty tiny amount of memory (9
kB) on the scale of a typical Python server process in my experience.

Or do you mean they were increasing the numbers? One thought I had looking
at this was actually that it would be interesting to do some measurements
and try to pick new (default) values for these parameters - it looks like
they're the same today as in 1993, and it's likely that with all that's
changed in typical hardware since then (plus the switch from Python ints +
longs to only what we used to call longs), the optimum values today are
different from the optimum in 1993. And certainly one of the effects of
this optimization when its hit rate is good is reducing memory consumption
- you have only one 17 instead of however many you make.
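
That sharing is easy to observe from Python, by the way -- a quick sketch,
using int() to defeat constant folding (and of course this is a CPython
implementation detail, not a language guarantee):

>>> a, b = int("256"), int("256")
>>> a is b  # both names refer to the one cached object
True
>>> a, b = int("257"), int("257")
>>> a is b  # past NSMALLPOSINTS, two distinct objects
False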

I'll rework the explicit-return patch to send without this change. It'll be
a little trickier but should be entirely doable.

On Sat, Aug 10, 2019, 23:19 Raymond Hettinger 
wrote:

>
> Raymond Hettinger  added the comment:
>
> P.S.  Making the returns explicit is a reasonable change to make, but the
> core #ifdef logic should remain.
>
> --
>
> ___
> Python tracker 
> <https://bugs.python.org/issue37812>
> ___
>

--

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue37802] micro-optimization of PyLong_FromSize_t()

2019-08-10 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price

___
Python tracker 
<https://bugs.python.org/issue37802>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-10 Thread Greg Price


Greg Price  added the comment:

I've just sent GH-15203 which is the first of two patches for this. It's quite 
small.

The second patch, which completes the change, is also small:
https://github.com/gnprice/cpython/commit/c6b905104

It depends on the first one, so I think the easiest is for me to send it as a 
PR after the first is merged? Happy to do another workflow if that'd make 
review easier. (I don't see guidance in the devguide on handling a sequence of 
patches, so I'm making this up.)

--
nosy: +aeros167, mark.dickinson, sir-sigurd

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-10 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +14932
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15203

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)

2019-08-10 Thread Greg Price


New submission from Greg Price :

In longobject.c we have the following usage a few times:

PyObject *
PyLong_FromLong(long ival)
{
    PyLongObject *v;
    // ... more locals ...

    CHECK_SMALL_INT(ival);

    if (ival < 0) {
        /* negate: can't write this as abs_ival = -ival since that
           invokes undefined behaviour when ival is LONG_MIN */
        abs_ival = 0U-(unsigned long)ival;
        sign = -1;
    }
    else {
        // ... etc. etc.

The CHECK_SMALL_INT macro contains a `return`, so the function can actually 
return before it reaches any of the other code.

#define CHECK_SMALL_INT(ival) \
    do if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) { \
        return get_small_int((sdigit)ival); \
    } while(0)

That's not even an error condition -- in fact it's the fast, hopefully 
reasonably-common, path.

An implicit return like this is pretty surprising for the reader. And it only 
takes one more line (plus a close-brace) to make it explicit:

if (IS_SMALL_INT(ival)) {
    return get_small_int((sdigit)ival);
}

so that seems like a much better trade.

Patch written, will post shortly.

--
components: Interpreter Core
messages: 349359
nosy: Greg Price
priority: normal
severity: normal
status: open
title: Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue37812>
___



[issue18236] str.isspace should use the Unicode White_Space property

2019-08-08 Thread Greg Price


Greg Price  added the comment:

Good question!  With the patch:

>>> import re
>>> re.match(r'\s', '\x1e')
>>> 

In other words, the definition of the regexp r'\s' follows along.  Good to know.

--

___
Python tracker 
<https://bugs.python.org/issue18236>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-07 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14903
pull_request: https://github.com/python/cpython/pull/15171

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue18236] str.isspace should use the Unicode White_Space property

2019-08-07 Thread Greg Price


Greg Price  added the comment:

I've gone and made a patch for this change:
https://github.com/gnprice/cpython/commit/7dab9d879

Most of the work happens in the script Tools/unicode/makeunicodedata.py, and along 
the way I made several changes there that I found made it somewhat nicer to 
work on, and I think will help other people reading that script too.  So I'd 
like to try to merge those improvements first.

I've filed #37760 for those preparatory changes, and posted several PRs 
(GH-15128, GH-15129, GH-15130) as bite-sized pieces.  These PRs can go in in 
any order.

Please take a look! Reviews appreciated.

--
nosy: +Greg Price
versions: +Python 3.9 -Python 3.5

___
Python tracker 
<https://bugs.python.org/issue18236>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-04 Thread Greg Price


Greg Price  added the comment:

Just posted three PRs:
 * GH-15128 and GH-15129 are both quite small
 * GH-15130 is the first of two patches factoring out common parsing logic.

Two remaining patches go on top of GH-15130.  Here are drafts, in case they're 
helpful for reference:
* Patch 2/2 factoring out common parsing logic: 
https://github.com/gnprice/cpython/commit/0a32a7111
* Patch converting the big tuples to a dataclass: 
https://github.com/gnprice/cpython/commit/6d8103bbc

I figure they may be easiest to review after the PR they depend on is merged, 
so my plan is to send PRs for them each in turn.

--

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-04 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14870
pull_request: https://github.com/python/cpython/pull/15130

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-04 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14869
pull_request: https://github.com/python/cpython/pull/15129

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-04 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +14868
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15128

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass

2019-08-04 Thread Greg Price


New submission from Greg Price :

I spent some time yesterday on #18236, and I have a patch for it.

Most of that work happens in the script Tools/unicode/makeunicodedata.py, and 
along the way I made several changes there that I found made it somewhat nicer 
to work on, and I think will help other people reading that script too.  I'd 
like to try to merge those improvements first.

The main changes are:

 * As the script has grown over the years, it's gained many copies and 
reimplementations of logic to parse the standard format of the Unicode 
character database.  I factored those out into a single place, which makes the 
parsing code shorter and the interesting parts stand out more easily.

 * The main per-character record type in the script's data structures is a 
length-18 tuple.  Using the magic of dataclasses, I converted this so that e.g. 
the code says `record.numeric_value` instead of `record[8]`.
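
For illustration, the shape of that change is roughly this (a sketch only --
the field names are abridged, not the script's actual 18-column layout):

from dataclasses import dataclass

@dataclass
class UcdRecord:
    # One attribute per semicolon-separated field of UnicodeData.txt.
    name: str
    general_category: str
    numeric_value: str

rec = UcdRecord("DIGIT FOUR", "Nd", "4")
assert rec.numeric_value == "4"  # rather than rec[8]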

There's no radical restructuring or rewrite here; this script has served us 
well.  I've kept these changes focused where there's a high ratio of value, in 
future ease of development, to cost, in a reviewer's effort as well as mine.

I'll send PRs of my changes shortly.

--
components: Unicode
messages: 349020
nosy: Greg Price, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Refactor makeunicodedata.py: dedupe parsing, use dataclass
type: enhancement
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue37760>
___



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-04 Thread Greg Price


Greg Price  added the comment:

Sent two small PRs!

The first one, GH-15125, makes the substantive test change I described above.

The second one, GH-15126, is a small pure refactor to that test file, just 
cleaning out some bits that made sense when it was first written (as a script) 
but are confusing now that it's a `unittest` test module.  Took me a couple of 
minutes to sort those out when I first dug into this file, and I figure it'd be 
kind to the next person to save them the same effort.

--

___
Python tracker 
<https://bugs.python.org/issue37758>
___



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-04 Thread Greg Price


Change by Greg Price :


--
pull_requests: +14866
pull_request: https://github.com/python/cpython/pull/15126

___
Python tracker 
<https://bugs.python.org/issue37758>
___



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-04 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +14865
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15125

___
Python tracker 
<https://bugs.python.org/issue37758>
___



[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints

2019-08-04 Thread Greg Price


New submission from Greg Price :

The unicodedata module has two test cases which run through the database and 
make a hash of its visible outputs for all codepoints, comparing the hash 
against a checksum.  These are helpful regression tests for making sure the 
behavior isn't changed by patches that didn't intend to change it.

But Unicode has grown since Python first gained support for it, when Unicode 
itself was still rather new.  These test cases were added in commit 6a20ee7de 
back in 2000, and they haven't needed to change much since then... but they 
should be changed to look beyond the Basic Multilingual Plane 
(`range(0x10000)`) and cover all 17 planes of Unicode's final form.
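
The shape of the fix is just to widen the range the tests sweep -- roughly
this (a sketch; the real tests hash more properties, and compare against
precomputed checksums):

import hashlib, unicodedata

h = hashlib.sha1()
for cp in range(0x110000):  # all 17 planes, not just the BMP's range(0x10000)
    c = chr(cp)
    h.update(f"{unicodedata.category(c)};{unicodedata.bidirectional(c)}".encode())
print(h.hexdigest())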

Spotted in discussion on GH-15019 
(https://github.com/python/cpython/pull/15019#discussion_r308947884 ).  I have 
a patch for this which I'll send shortly.

--
components: Tests
messages: 349014
nosy: Greg Price
priority: normal
severity: normal
status: open
title: unicodedata checksum-tests only test 1/17th of Unicode's codepoints
type: enhancement
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue37758>
___



[issue36502] str.isspace() for U+00A0 and U+202F differs from document

2019-08-03 Thread Greg Price


Greg Price  added the comment:

The actual behavior turns out to match that comment. See attached PR, which 
adds a test confirming that and also corrects the documentation.

(A related issue is #18236 -- we should probably adjust the definition to match 
the one Unicode now provides. But meanwhile we'll want to correct the docs.)
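
For reference, the behavior being pinned down -- both of these are True,
which is what the corrected docs now describe:

>>> "\u00a0".isspace(), "\u202f".isspace()
(True, True)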

--
nosy: +Greg Price

___
Python tracker 
<https://bugs.python.org/issue36502>
___



[issue36502] str.isspace() for U+00A0 and U+202F differs from document

2019-08-03 Thread Greg Price


Change by Greg Price :


--
keywords: +patch
pull_requests: +14836
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15019

___
Python tracker 
<https://bugs.python.org/issue36502>
___



[issue26452] Wrong line number attributed to comprehension expressions

2016-02-27 Thread Greg Price

New submission from Greg Price:

In a multi-line list comprehension (or dict or set comprehension), the code for 
the main expression of the comprehension is wrongly attributed to the *last* 
line of the comprehension, which might be several lines later.

This makes for quite baffling tracebacks when an exception occurs -- for 
example this program:
```
def f():
    return [j
            for i in range(3)
            if i]

f()
```
produces (with CPython from current `default`):
```
Traceback (most recent call last):
  File "foo.py", line 15, in <module>
    f()
  File "foo.py", line 3, in f
    for i in range(3)
  File "foo.py", line 4, in <listcomp>
    if i]
NameError: name 'j' is not defined
```
showing the line `if i]`, which has nothing to do with the error and gives very 
little hint as to where the exception is being raised.

Disassembly confirms that the line numbers on the code object are wrong:
```
  2           0 BUILD_LIST               0
              3 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                18 (to 27)

  3           9 STORE_FAST               1 (i)

  4          12 LOAD_FAST                1 (i)
             15 POP_JUMP_IF_FALSE        6
             18 LOAD_GLOBAL              0 (j)
             21 LIST_APPEND              2
             24 JUMP_ABSOLUTE            6
        >>   27 RETURN_VALUE
```
The `LOAD_GLOBAL` instruction for `j` is attributed to line 4, when it should 
be line 2.

A similar issue affects multi-line function calls, which get attributed to a 
line in the last argument.  That's rarely as seriously confusing, because the 
function called is right there as the next frame down on the stack; but it's 
much more common, and it makes the traceback look a little off -- I've 
noticed this as a minor annoyance for years, before the more serious 
comprehension issue got my attention.

Historically, line numbers were constrained to be wrong in these ways because 
the line-number table `co_lnotab` on a code object required its line numbers to 
increase monotonically -- and the code for the main expression of a 
comprehension comes after all the `for` and `if` clauses, so it can't get a 
line number earlier than theirs.  Victor Stinner's recent work in 
https://hg.python.org/cpython/rev/775b74e0e103 lifted that restriction in the 
`co_lnotab` data structure, so it's now just a matter of actually entering the 
correct line numbers there.
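
For anyone who wants to poke at the line numbers directly, the
offset-to-line pairs derived from `co_lnotab` are easy to inspect -- a
sketch:

import dis

src = (
    "def f():\n"
    "    return [j\n"
    "            for i in range(3)\n"
    "            if i]\n"
)
ns = {}
exec(compile(src, "foo.py", "exec"), ns)
comp = next(c for c in ns["f"].__code__.co_consts if hasattr(c, "co_code"))
print(list(dis.findlinestarts(comp)))  # (offset, lineno) pairs for <listcomp>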

I have a draft patch to do this, attached here.  It fixes the issue both for 
comprehensions and function calls, and includes tests.  Things I'd still like 
to do before considering the patch ready:
* There are a couple of bits of logic that I knock out that can probably be 
simplified further.
* While I'm looking at this, there are several other forms of expression and 
statement that have or probably have similar issues, and I'll want to go and 
review them too to either fix or determine that they're fine.  The ones I've 
thought of are included in the draft test file, either as actual tests (with 
their current answers) or TODO comments for me to investigate.

Comments very welcome on the issue and my draft patch, and meanwhile I'll 
continue with the further steps mentioned above.

Thanks to Benjamin Peterson for helping diagnose this issue with me when we ran 
into a confusing traceback that ran through a comprehension.

--
components: Interpreter Core
files: lines.diff
keywords: patch
messages: 260966
nosy: Greg Price
priority: normal
severity: normal
status: open
title: Wrong line number attributed to comprehension expressions
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file42042/lines.diff

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26452>
___