[issue1524081] logging using the SysLog handler fails if locale is set
Greg Price added the comment:

For the record, because this issue is mentioned in a comment in logging/handlers.py and people are sometimes confused by it today:

> This happens because in that particular locale,
> "INFO".lower() != "info"

Since Python 3, this no longer happens: str.lower() and friends do not depend on the current locale. Specifically, the lower() and similar methods on Unicode strings (now "str", previously "unicode") were always independent of the current locale. The corresponding methods on byte strings (now "bytes", previously "str") did have this locale-dependent behavior, but that was replaced in commit 6ccd3f2dbcb98b33a71ffa6eae949deae797c09c, in 2007.

See also #37848, for a 2019 discussion of potentially adding an optional parameter to use an *explicit* locale.

--
nosy: +Greg Price

___ Python tracker <https://bugs.python.org/issue1524081> ___
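A quick way to see this locale-independence from a Python 3 prompt (a minimal sketch; it assumes a Turkish locale such as tr_TR.UTF-8 is installed, since Turkish is the classic case where locale-aware lowercasing of "I" differs):

    import locale

    # str.lower() never consults the process locale in Python 3:
    locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")  # assumes this locale exists on the system
    assert "INFO".lower() == "info"                 # holds under any locale setting

    # (In Python 2 the hazard was on byte strings, whose lower()/upper()
    # went through the C library's locale-dependent character tables.)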
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: > labeling long-stable code as "evil". Let me apologize about this bit -- I was thinking of the term in quotes (referring to the earlier comment), but I didn't write it clearly that way. I don't think any of this code is evil, past or present, and I regret that I carelessly gave that impression. I think there isn't currently any problem to be solved here, and I hope we can all put our energy elsewhere. (For my part, I just spent the balance of the evening implementing an old feature request on unicodedata.) Victor, I'd still be quite interested in better understanding your thinking (as this example relates to macros in general) if you'd like to reply, though this thread may not be the right place to continue. You can find my email in Git, and I'm on Zulip and Discourse; and I'd be happy to start or follow a thread in a forum you think appropriate. Or if you'd rather drop it entirely, that's fine too. -- ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue16684] Unicode property value abbreviated names and long names
Greg Price added the comment:

I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py, and into the unicodedata module. Patch attached. Demo:

    >>> import unicodedata, pprint
    >>> pprint.pprint(unicodedata.property_value_aliases)
    {'bidirectional': {'AL': ['Arabic_Letter'],
                       # ...
                       'WS': ['White_Space']},
     'category': {'C': ['Other'],
                  # ...
     'east_asian_width': {'A': ['Ambiguous'],
                          # ...
                          'W': ['Wide']}}

Note that the values are lists. That's because a value can have multiple aliases in addition to its "short name":

    >>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
    ['Decimal_Number', 'digit']

This implementation also provides the reverse mapping, from an alias to the "short name":

    >>> pprint.pprint(unicodedata.property_value_by_alias)
    {'bidirectional': {'Arabic_Letter': 'AL',
                       # ...

This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts, from static tables in the header generated by makeunicodedata.py. It's not *that* much code... but it sure would be more convenient to do in Python instead. Should the unicodedata module perhaps have a Python part? I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea. Then this could go there instead of using the C code I've just written.

* Is this API the right one?
  * This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd'.
  * Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd'.
  * Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions? So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd'.
  * There's also room for bikeshedding on the names.

* How shall we handle ucd_3_2_0 for this feature? This implementation doesn't attempt to record the older version of the data. My reasoning is that because the applications of the old data are quite specific and they haven't needed this information yet, it seems unlikely anyone will ever really want to know from this module just which aliases existed already in 3.2.0 and which didn't yet. OTOH, as a convenience I've caused e.g. unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the same object as unicodedata.property_value_by_alias. This allows unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata module itself, while minimizing the complexity it adds to the implementation. Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.

--
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch

___ Python tracker <https://bugs.python.org/issue16684> ___
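On the "loose matching" question: UAX #44 (rule UAX44-LM3) says matching of these names should ignore case, whitespace, hyphens, underscores, and an initial "is" prefix. A rough Python sketch of that normalization, under one common reading of the rule (the function name here is illustrative, not the patch's API):

    import re

    def loose_property_value_key(name):
        # UAX44-LM3 loose matching: ignore case, whitespace, '-' and '_',
        # and an initial "is" prefix on symbolic property values.
        key = re.sub(r"[\s_-]+", "", name).lower()
        if key.startswith("is"):
            key = key[2:]
        return key

    # All of these reduce to the same key, 'decimalnumber':
    assert (loose_property_value_key("Decimal_Number")
            == loose_property_value_key("decimal number")
            == loose_property_value_key("is-decimal-number"))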
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment:

I hesitate to come back to this thread, because as Raymond says it's consumed a lot of time already. But I think this point is of broader interest than the specific few lines we've been discussing:

> When small integer are disabled at compilation time (NSMALLPOSINTS=0 and a
> NSMALLNEGINTS=0), I suggest to remove code related to small integers. IMHO
> CHECK_SMALL_INT() macro is a good practical solution for that.

Victor, can you be more specific about the problem this is solving? I think the existing code in master really doesn't leave any problems in place that CHECK_SMALL_INT solves.

In master (as of 2702638ea), here's the code that CHECK_SMALL_INT would replace:

    if (IS_SMALL_INT(ival)) {
        return get_small_int((sdigit)ival);
    }

When NSMALLPOSINTS=0 and NSMALLNEGINTS=0, this expands in the preprocessor to the equivalent of:

    if (0) {
        return (Py_UNREACHABLE(), NULL);
    }

(Specifically, it expands to whatever that expands to, since Py_UNREACHABLE and possibly NULL are also macros.)

A compiler that's attempting to do any significant optimization at all has to be able to discard code that's inside `if (0)`; dead-code elimination is a necessary primitive for lots of other optimizations. (And at a quick empirical sanity-check: gcc, clang, and msvc all do so at any optimization setting other than "disabled". In fact gcc and clang do so even with optimizations disabled.)

You made the case very nicely above that macros are evil. The IS_SMALL_INT macro is fairly mild, and it has a performance benefit on some platforms to justify it. But CHECK_SMALL_INT causes a return from the enclosing function, which seems quite high on the "evil macro" scale.

--

___ Python tracker <https://bugs.python.org/issue37812> ___
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: Thanks Benjamin for reviewing and merging this series! -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18236] str.isspace should use the Unicode White_Space property
Change by Greg Price : -- keywords: +patch pull_requests: +15849 stage: needs patch -> patch review pull_request: https://github.com/python/cpython/pull/16254 ___ Python tracker <https://bugs.python.org/issue18236> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment:

> We're wasted a lot of dev time on something that never promised much value in
> the first place. So, I'll revert and close this tracker issue

OK, so be it. I'll certainly agree that this series of threads consumed a lot more time than I anticipated or hoped, though I think we have different analyses of why that was.

(As you can probably guess, my previous message crossed with yours.)

--

___ Python tracker <https://bugs.python.org/issue37812> ___
[issue38015] inline function generates slightly inefficient machine code
Greg Price added the comment: See followup at https://bugs.python.org/issue38205 and https://bugs.python.org/issue37812#msg352670 . The patch in GH-15710 had a side effect of introducing a call to `Py_UNREACHABLE` inside a comma-expression. A subsequent commit 3ab61473b changed `Py_UNREACHABLE` in a way that made that a syntax error; the issue wasn't caught then because the code is only compiled if `NSMALLNEGINTS + NSMALLPOSINTS <= 0`. Detailed discussion on those threads. -- ___ Python tracker <https://bugs.python.org/issue38015> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: > if using a static inline function is causing issues Separately from whether there was or wasn't such an issue here, I think it's interesting to note that the build failure bpo-38205 is an example of exactly the opposite! It was caused by a combination of (a) using a macro *instead* of a plain old C function; (b) using avoidable preprocessor conditionals. And both of them led to that build failure in classic ways. Specifically, the build failure happened because (a) this `Py_UNREACHABLE` call was in an unusual syntactic context, squished into an expression, in order to use it in a macro. (b) the call was behind an `#ifdef`/`#else`; and the configuration that included it wasn't one that got test-built by the authors of 3ab61473b (which modified `Py_UNREACHABLE`), nor by CI. When conditional preprocessing is kept to a minimum -- here, if the `#if NSMALLNEGINTS + NSMALLPOSINTS > 0` conditional enclosed just the `small_ints` array that it needs to -- then this kind of build regression on non-default configurations can't so easily happen. -- ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment:

Thanks Victor for linking that issue back here.

> A first change converted a macro to a static inline function. The second
> change converted the static inline function to a macro

Not quite. The first change converted a macro `CHECK_SMALL_INT` to an equivalent sequence, including a `return` that had been hidden inside the macro:

    if (is_small_int(ival)) {
        return get_small_int((sdigit)ival);
    }

The second change converted `is_small_int` to a macro `IS_SMALL_INT`. The second change also changed an `assert(0)` to say `Py_UNREACHABLE()`. I don't know why it did that -- it seems quite orthogonal to the rest of the change.

> Moreover, if using a static inline function is causing issues,

The change that caused the build failure you're seeing (GH-15710) was intended for bpo-38015. Details there; briefly, common compilers on x86_32 would emit some unsightly extra instructions with `is_small_int`, and converting it to the macro `IS_SMALL_INT` eliminated those extra instructions.

It's not clear to me if anyone benchmarked to see if the conversion to a macro had any measurable performance benefit. There's some measurement here: https://bugs.python.org/issue38015#msg351315 which finds no benefit. I'm not sure if that was measuring the change in GH-15710, or if it was an attempt to replicate the findings in msg351255 (which it quotes) about a more invasive change.

So I think the short of it is that the static inline function was not causing any issue, and an easy fix for the build failure is to revert GH-15710. Alternatively, another easy fix would be to replace just the `Py_UNREACHABLE()` introduced by GH-15710 with the original `assert(0)`.

Either way, I don't think there's any reason here to revert GH-15216.

--

___ Python tracker <https://bugs.python.org/issue37812> ___
[issue37936] gitignore file is too broad
Greg Price added the comment: Thanks @zach.ware for the review and merge of GH-15451! That was the "minimal" fix, fixing rules that apply to files we have in the repo now. So `rg` will no longer ignore `PC/pyconfig.h`. :-) I've just sent GH-15823 with the "more thorough" fix on top of that, which tries to make it easy not to fall into the same Git pitfall again. -- status: closed -> open ___ Python tracker <https://bugs.python.org/issue37936> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37936] gitignore file is too broad
Change by Greg Price : -- pull_requests: +15472 pull_request: https://github.com/python/cpython/pull/15823 ___ Python tracker <https://bugs.python.org/issue37936> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38079] _PyObject_VAR_SIZE should avoid arithmetic overflow
Greg Price added the comment: (The tracker just linked GH-14838 to this issue because I mentioned it in a comment there, but it's not for this issue -- it's that recent fix for an 11-year-old bug in a callsite's overflow check.) -- ___ Python tracker <https://bugs.python.org/issue38079> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38079] _PyObject_VAR_SIZE should avoid arithmetic overflow
New submission from Greg Price:

Currently `_PyObject_VAR_SIZE` effectively has a precondition that it must not be passed arguments which would overflow in its arithmetic.

If that's violated, it overflows... which is undefined behavior already, and in fact the likely next thing that happens is we allocate a short buffer for a giant array of items, then overflow the buffer, bam! (The arithmetic is on `Py_ssize_t`, a signed integer type; that's why overflow is UB.)

It's quite difficult to maintain that precondition at call sites, because it effectively means writing a check that duplicates the content of `_PyObject_VAR_SIZE` modulo some algebraic rearrangement. Empirically we don't consistently manage that: we just fixed an 11-year-old bug where such a check was wrong (luckily in the safe/over-conservative direction), and some other call paths don't immediately appear to be checking at all.

So, I think it would be helpful to move that check into `_PyObject_VAR_SIZE` itself.

Cc'ing the three folks on the interest list "coverity scan", as that one seems the closest match for who'd likely be interested in commenting on this.

Details below, largely from GH-14838 which got me looking at this.

---

In GH-14838, @sir-sigurd fixed an overflow check on a call to PyObject_GC_NewVar. The call site looked like this:

    if ((size_t)size > ((size_t)PY_SSIZE_T_MAX - sizeof(PyTupleObject) -
                        sizeof(PyObject *)) / sizeof(PyObject *)) {
        return (PyTupleObject *)PyErr_NoMemory();
    }
    op = PyObject_GC_NewVar(PyTupleObject, &PyTuple_Type, size);

but the `- sizeof(PyObject *)` had the wrong sign. Happily the bug made the check too conservative, but that's basically a lucky coin-flip we had -- a similar bug could just as easily have gone the other direction, making the code vulnerable to arithmetic overflow, then memory corruption from overflowing a too-small buffer. The bug was present for 11 years. Sergey, I am still very curious how you discovered the bug. :-)

I feel like there's an opportunity for a more complete fix as a followup to this; that's this thread.

---

My next question on reading that fix was, OK, how can we write this so it's hard to get wrong?

I think the point is that `PyObject_GC_NewVar` really has a precondition, that the `nitems` it's passed (`size` in this caller) not be so big that the allocation will overflow. Specifically `_PyObject_VAR_SIZE(tp, nitems)` needs to not overflow. All the trickiness in this overflow check is basically an algebraic rearrangement of what `_PyObject_VAR_SIZE(&PyTuple_Type, size)` would say, plus substituting in the values actually found at `&PyTuple_Type`.

So, a sort of minimal fix would be to make something with a name like `PyObject_GC_NewVar_WouldOverflow` that encodes that. Maybe that in turn would just be a macro to delegate to a `_PyObject_VAR_SIZE_WOULD_OVERFLOW`, so that each layer only has to know what it actually knows.

I think that immediately raises a next question, though, which is: are other call sites of `PyObject_GC_NewVar` checking properly for overflow? At a quick grep, I see some that don't appear to be checking. And: why should they have to?

It seems like the best place for this check is immediately before the possibly-overflowing computation; or if not there, then the closer to it the better. So, the ideal solution: `_PyObject_VAR_SIZE` itself would be augmented to first do this check, and only if it won't overflow then go and actually multiply.

There are only a handful of call sites of `_PyObject_VAR_SIZE`, so it wouldn't even be much of a migration to then take care of each of them. Then we wouldn't need these prophylactic checks at callers' callers -- which (a) this bug demonstrates are hard to consistently get right, and (b) the grep mentioned above suggests we may not be consistently doing in the first place.

--
components: Interpreter Core
messages: 351573
nosy: Greg Price, brett.cannon, christian.heimes, sir-sigurd, twouters
priority: normal
severity: normal
status: open
title: _PyObject_VAR_SIZE should avoid arithmetic overflow
versions: Python 3.9

___ Python tracker <https://bugs.python.org/issue38079> ___
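To make the algebra concrete, here is a rough Python model of the precondition under discussion (illustrative only: the real check has to live in C and use Py_ssize_t arithmetic; the constants below assume a typical 64-bit build and that _PyObject_VAR_SIZE rounds up to SIZEOF_VOID_P):

    # Python model of the C arithmetic -- names and constants are illustrative.
    PY_SSIZE_T_MAX = 2**63 - 1   # a typical 64-bit build
    SIZEOF_VOID_P = 8            # assumed rounding granularity of _PyObject_VAR_SIZE

    def var_size(basicsize, itemsize, nitems):
        # What _PyObject_VAR_SIZE computes: basicsize + nitems*itemsize,
        # rounded up to a multiple of SIZEOF_VOID_P.
        raw = basicsize + nitems * itemsize
        return (raw + SIZEOF_VOID_P - 1) & ~(SIZEOF_VOID_P - 1)

    def var_size_would_overflow(basicsize, itemsize, nitems):
        # The precondition each caller currently has to re-derive by hand
        # (assumes itemsize > 0): the rounded-up total must fit in Py_ssize_t.
        return nitems > (PY_SSIZE_T_MAX - basicsize - (SIZEOF_VOID_P - 1)) // itemsize

    print(var_size_would_overflow(32, 8, 10))                   # False (placeholder sizes)
    print(var_size_would_overflow(32, 8, PY_SSIZE_T_MAX // 8))  # True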
[issue38015] inline function generates slightly inefficient machine code
Greg Price added the comment: (Just to help keep discussions together: some earlier discussion was on GH-15216 .) Because is_small_int / IS_SMALL_INT is so small, there's not much cost in the source code to making it a macro (as GH-15710 did). But I think it'd be a mistake to go a lot farther than that and convert significantly larger chunks of code like get_small_int to macros (as GH-15718 would), unless the payoff were really commensurate. It'd be better to focus optimization efforts where profiling shows a lot of time is really being spent. -- ___ Python tracker <https://bugs.python.org/issue38015> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38043] small cleanups in Unicode normalization code
Change by Greg Price : -- pull_requests: +15368 pull_request: https://github.com/python/cpython/pull/15558 ___ Python tracker <https://bugs.python.org/issue38043> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38043] small cleanups in Unicode normalization code
Change by Greg Price : -- pull_requests: +15367 pull_request: https://github.com/python/cpython/pull/15712 ___ Python tracker <https://bugs.python.org/issue38043> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38043] small cleanups in Unicode normalization code
Change by Greg Price : -- keywords: +patch pull_requests: +15366 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15711 ___ Python tracker <https://bugs.python.org/issue38043> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38043] small cleanups in Unicode normalization code
New submission from Greg Price:

Benjamin noticed in reviewing GH-15558 (for #37966) several points where the existing code around Unicode normalization can be improved:

* on the `QuickcheckResult` enum:
  > Maybe `makeunicodedata.py` should output this enum (with better name namespacing)

* > merging `test_normalization` into this file [i.e. `test_unicodedata.py`] for clarity

* > These "boolean int" parameters could be actual `bool`s.
  [sc. the `nfc` and `k` parameters to `is_normalized_quickcheck`]

None of these are super hard, so good to knock them out while we're thinking of them.

--
components: Unicode
messages: 351229
nosy: Greg Price, benjamin.peterson, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: small cleanups in Unicode normalization code
versions: Python 3.9

___ Python tracker <https://bugs.python.org/issue38043> ___
[issue37837] add internal _PyLong_FromUnsignedChar() function
Greg Price added the comment:

Ah OK, that makes sense of it then :)

> But the most important thing is that using PyLong_FromUnsignedLong() instead
> of _PyLong_FromUnsignedChar() on top of GH-15192 is producing the same
> results: striter_next() uses small_ints[] directly. However that's not true
> for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think
> that's due to how code is profiled so I'm satisfied with these results more
> or less.

Very interesting! That's a good comparison to have made, too. I agree with your conclusions.

--

___ Python tracker <https://bugs.python.org/issue37837> ___
[issue6331] Add unicode script info to the unicode database
Change by Greg Price : -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue6331> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37966] is_normalized is much slower at "no" than the standard's algorithm
Greg Price added the comment: Fix posted, as GH-15558. Adding cc's for the folks in the thread on #32285, where this function was originally added. -- components: +Unicode nosy: +Maxime Belanger, benjamin.peterson, ezio.melotti, steven.daprano, vstinner title: is_normalized is much slower than the standard's algorithm -> is_normalized is much slower at "no" than the standard's algorithm versions: +Python 3.9 ___ Python tracker <https://bugs.python.org/issue37966> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37966] is_normalized is much slower than the standard's algorithm
Change by Greg Price : -- keywords: +patch pull_requests: +15231 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15558 ___ Python tracker <https://bugs.python.org/issue37966> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37966] is_normalized is much slower than the standard's algorithm
New submission from Greg Price:

In 3.8 we add a new function `unicodedata.is_normalized`. The result is equivalent to `str == unicodedata.normalize(form, str)`, but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string. This was added in issue #32285, commit 2810dd7be.

However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says `NO`, it's not normalized:

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        's == unicodedata.normalize("NFD", s)'
    50 loops, best of 5: 4.41 msec per loop

That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization. Here it is after a patch that makes the algorithm run as in the standard:

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        'unicodedata.is_normalized("NFD", s)'
    500 loops, best of 5: 58.2 nsec per loop

Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

The root cause of the issue is that our `is_normalized` static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm. The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of `unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO. So the helper often returns MAYBE when the standard algorithm would say NO.

(More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.)

That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE. Which for `unicodedata.is_normalized` means it has to go compute the whole normalized string.

--
messages: 350651
nosy: Greg Price
priority: normal
severity: normal
status: open
title: is_normalized is much slower than the standard's algorithm
versions: Python 3.8

___ Python tracker <https://bugs.python.org/issue37966> ___
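For reference, the standard's quick-check loop is short; here is a Python sketch of it (assuming helpers exposing each character's canonical combining class and its per-form Quick_Check property, which in CPython live in the generated unicodedata tables):

    from enum import Enum

    class QuickcheckResult(Enum):
        YES = 1
        MAYBE = 2
        NO = 3

    def quick_check(s, combining_class, quick_check_prop):
        # UAX #15 quick-check: one pass, tracking canonical combining classes
        # and the per-character Quick_Check value for the chosen form.
        # `combining_class(ch)` and `quick_check_prop(ch)` are assumed helpers.
        result = QuickcheckResult.YES
        last_ccc = 0
        for ch in s:
            ccc = combining_class(ch)
            if last_ccc > ccc and ccc != 0:
                return QuickcheckResult.NO     # out of canonical order
            check = quick_check_prop(ch)
            if check is QuickcheckResult.NO:
                return QuickcheckResult.NO     # the helper described above stops at MAYBE here
            if check is QuickcheckResult.MAYBE:
                result = QuickcheckResult.MAYBE
            last_ccc = ccc
        return result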
[issue37936] gitignore file is too broad
Greg Price added the comment:

> I have a minimal fix which takes care of all the files above. I'll post that
> shortly, and I may also write up a more thorough fix that tries to make it
> easy not to fall into the same Git pitfall again.

Both now done.

* GH-15451 is that minimal fix.
* Just now I've posted https://github.com/gnprice/cpython/pull/1 , which is the more thorough fix. It's written on top of the minimal fix, but I could also squash them together as one PR.

Also just posted GH-15542. This is a small edit, deleting some lines whose author evidently never intended them to get committed to the repo. I'd noticed those lines before and been mildly puzzled; writing up the thorough fix caused me to look in the history and see how they were added, and resolve the mystery :). I've kept that as a separate PR just to reduce the number of moving parts in the main one, because the reasoning for it is quite independent.

--

___ Python tracker <https://bugs.python.org/issue37936> ___
[issue37936] gitignore file is too broad
Change by Greg Price : -- pull_requests: +15219 pull_request: https://github.com/python/cpython/pull/15542 ___ Python tracker <https://bugs.python.org/issue37936> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37837] add internal _PyLong_FromUnsignedChar() function
Greg Price added the comment: Very interesting, thanks! It looks like with LTO enabled, this optimization has no effect at all. This change adds significant complexity, and it seems like the hoped-for payoff is entirely in terms of performance on rather narrowly-focused microbenchmarks. In general I think that sets the bar rather high for making sure the performance gains are meaningful enough to justify the increase in complexity in the code. In particular, I expect most anyone running Python and concerned with performance to be using LTO. (It's standard in distro builds of Python, so that covers probably most users already.) That means if the optimization doesn't do anything in the presence of LTO, it doesn't really count. ;-) --- Now, I am surprised at the specifics of the result! I expected that LTO would probably pick up the equivalent optimization on its own, so that this change didn't have an effect. Instead, it looks like with LTO, this microbenchmark performs the same as it does without LTO *before* this change. That suggests that LTO may instead be blocking this optimization. In that case, there may still be an opportunity: if you can work out why the change doesn't help under LTO, maybe you can find a way to make this optimization happen under LTO after all. -- ___ Python tracker <https://bugs.python.org/issue37837> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
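For context, the kind of narrowly-focused microbenchmark in question is easy to reproduce; a sketch (assuming, as in the PR discussion, that iterating over a bytes object is the path _PyLong_FromUnsignedChar is meant to speed up):

    import timeit

    # Iterating bytes produces one small int per byte, so this loop spends
    # most of its time in the int-from-byte path discussed above.
    setup = "s = bytes(range(256)) * 100"
    stmt = "for b in s: pass"
    print(timeit.timeit(stmt, setup=setup, number=1000))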
[issue18236] str.isspace should use the Unicode White_Space property
Greg Price added the comment:

> I've gone and made a patch for this change

Update:

* The preparatory changes in #37760 are now almost all merged; GH-15265 is the one piece remaining, and I'd be grateful for a review. It's a generally straightforward and boring change that converts the main data structures of makeunicodedata.py from using length-18 tuples as records to using a dataclass, which I think makes subsequent changes that add features to that script much easier both to write and to review.

* I have a slightly updated version of the fix itself, which differs mainly by adding a test:
  https://github.com/gnprice/cpython/commit/9b3bf6739
  Comments welcome there too.

--

___ Python tracker <https://bugs.python.org/issue18236> ___
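To illustrate the shape of that refactor (a sketch only; the field names below are invented for illustration and are not the actual record layout in GH-15265):

    from dataclasses import dataclass

    @dataclass
    class UcdRecord:
        # One row of the per-codepoint data makeunicodedata.py assembles;
        # named fields replace indexing into a length-18 tuple.
        category: str
        combining: int
        bidirectional: str
        decomposition: str
        east_asian_width: str

    # record.category reads much better at call sites than record[0],
    # and adding a new field no longer renumbers every other column.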
[issue37837] add internal _PyLong_FromUnsignedChar() function
Greg Price added the comment:

> Is there a particular reason to specifically call PyLong_FromSize_t? Seems
> like PyLong_FromLong is the natural default (and what we default to in the
> rest of the code), and it's what this ends up calling anyway.

Ah I see, the patch is meant to go on top of GH-15192, which makes PyLong_FromSize_t apply the small-int optimization itself.

As I just suggested on GH-15192, I'd like to see that PR apply the small-int optimization in the more broadly-used PyLong_FromUnsignedLong... and then I think the natural thing for this new function to do is to call that.

Still quite curious how LTO does, and also curious what compiler and flags you're using in benchmarks.

--

___ Python tracker <https://bugs.python.org/issue37837> ___
[issue37837] add internal _PyLong_FromUnsignedChar() function
Greg Price added the comment: Oh also: * What compiler, and what compilation flags, are you using in your benchmarking? That seems relevant :) -- ___ Python tracker <https://bugs.python.org/issue37837> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37837] add internal _PyLong_FromUnsignedChar() function
Greg Price added the comment:

Hmm, I'm a bit confused because:

* Your patch at GH-15251 replaces a number of calls to PyLong_FromLong with calls to the new _PyLong_FromUnsignedChar.

* That function, in turn, just calls PyLong_FromSize_t.

* And that function begins:

    PyObject *
    PyLong_FromSize_t(size_t ival)
    {
        PyLongObject *v;
        size_t t;
        int ndigits = 0;

        if (ival < PyLong_BASE)
            return PyLong_FromLong((long)ival);
        // ...

* So, it seems like after your patch we still end up calling PyLong_FromLong at each of these callsites, just after a couple more indirections than before.

Given the magic of compilers and of hardware branch prediction, it wouldn't at all surprise me for those indirections to not make anything slower... but if the measurements are coming out *faster*, then I feel like something else must be going on. ;-)

Ohhh, I see -- I bet it's that at _PyLong_FromUnsignedChar, the compiler can see that `is_small_int(ival)` is always true, so the whole function just turns into get_small_int. Whereas when compiling a call to PyLong_FromLong from some other file (other translation unit), it can't see that and can't make the optimization.

Two questions, then:

* How do the measurements look under LTO? I wonder if with LTO the linker is able to make the same optimization that this change helps the compiler make.

* Is there a particular reason to specifically call PyLong_FromSize_t? Seems like PyLong_FromLong is the natural default (and what we default to in the rest of the code), and it's what this ends up calling anyway.

--
nosy: +Greg Price

___ Python tracker <https://bugs.python.org/issue37837> ___
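The small-int cache that get_small_int serves is easy to observe from Python, which shows what the fast path being discussed buys (a CPython implementation detail, not a language guarantee):

    def add(a, b):
        return a + b

    # Iterating bytes yields ints 0..255, all of which are pre-allocated small
    # ints, so the iterator can hand back shared objects with no allocation:
    it = iter(bytes([7, 7]))
    print(next(it) is next(it))        # True -- both are the cached int 7

    # Results outside the cached range (-5..256 in CPython) are fresh objects:
    print(add(300, 1) is add(300, 1))  # False -- 301 is allocated twice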
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: > May I suggest directing your efforts towards fixing known bugs or > implementing requested features. Well, I would certainly be grateful for a review on my fix to #18236. ;-) There's also a small docs bug at GH-15301. I do think there's significant value in making code easier to read and less tricky. If the project continues to be successful for a long time to come, then that means the code will be read many, many more times than it's written. But one particular spot where it seems our experiences interestingly differ is: > They are a bit tedious to review and are eating up our time in the back and > forth. As a reviewer I generally find it much less work to review a change when it's intended to have no effect on the code's behavior. First, because it's easier to confirm no effect than to pin down what the effects are; then because the whole set of questions about whether the effects are desirable doesn't arise. As a result I often ask contributors (to Zulip, say) to split a change into a series of small pure refactors, followed by a very focused diff for the behavior change. So that's certainly background to my sending as many PRs that don't change any behavior as PRs that do. I actually have quite a number of draft changes built up over the last few weeks. I've held back on sending them all at once, partly because I've felt I have enough open PRs and I wanted to get a better sense of how reviews go. Perhaps I'll go pick out a couple more of them that are bugfixes, features, and docs to send next. (You didn't mention docs just now, but given the care I see you take in adding to them and in revising What's New, I think we agree that work there is valuable.) -- ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37936] gitignore file is too broad
Change by Greg Price : -- keywords: +patch pull_requests: +15143 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15451 ___ Python tracker <https://bugs.python.org/issue37936> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37936] gitignore file is too broad
New submission from Greg Price:

There are a number of files that we track in the repo, but are nevertheless covered by `.gitignore`.

This *mostly* doesn't change anything, because Git itself only cares what `.gitignore` has to say about files that aren't already tracked. But:

* It affects any new files someone might add that are covered by the same unintentionally-broad patterns. In that case it'd be likely to cause some confused debugging into why Git wasn't seeing the file; or possibly loss of work, if the person didn't notice that the file had never been committed to Git.

* More immediately, other tools that aren't Git but consult the Git ignore rules don't necessarily implement this wrinkle. In particular this is unfortunately a WONTFIX bug in ripgrep / `rg`: https://github.com/BurntSushi/ripgrep/issues/1127 .

I learned of the `rg` bug (and, for that matter, refreshed myself on just how Git itself handles this case) after some confusion today where I was looking with `rg` for references to a given macro, thought I'd looked at all of them... and then later noticed through `git log -p -S` a reference in `PC/pyconfig.h` with no subsequent change deleting it. Turned out it was indeed there and I needed to take account of it.

Here's the list of affected files:

    $ git ls-files -i --exclude-standard .gitignore
    Doc/Makefile
    Lib/test/data/README
    Modules/Setup
    PC/pyconfig.h
    Tools/freeze/test/Makefile
    Tools/msi/core/core.wixproj
    Tools/msi/core/core.wxs
    Tools/msi/core/core_d.wixproj
    Tools/msi/core/core_d.wxs
    Tools/msi/core/core_en-US.wxl
    Tools/msi/core/core_files.wxs
    Tools/msi/core/core_pdb.wixproj
    Tools/msi/core/core_pdb.wxs
    Tools/unicode/Makefile

Fortunately this is not hard to fix. The semantics of `.gitignore` have a couple of gotchas, but once you know them it's not really any more complicated to get the behavior exactly right. And I've previously spent the hour or two to read up on it... and when I forget, I just consult my own short notes :), at the top of this file:
https://github.com/zulip/zulip/blob/master/.gitignore

I have a minimal fix which takes care of all the files above. I'll post that shortly, and I may also write up a more thorough fix that tries to make it easy not to fall into the same Git pitfall again.

--
components: Build
messages: 350355
nosy: Greg Price
priority: normal
severity: normal
status: open
title: gitignore file is too broad
versions: Python 3.9

___ Python tracker <https://bugs.python.org/issue37936> ___
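The central gotcha is that a .gitignore pattern containing no slash matches at any depth in the tree. An illustrative example of the kind of rule involved (not the actual CPython .gitignore contents):

    # An unanchored pattern ignores the generated pyconfig.h at the build root...
    pyconfig.h
    # ...but it also covers the tracked file PC/pyconfig.h, which is what
    # confuses tools like rg that apply ignore rules to tracked files too.

    # Anchoring the pattern with a leading slash limits it to the repo root:
    /pyconfig.h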
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment:

Thanks, Raymond, for the review on GH-15216!

Shortly after posting this issue, I noticed a very similar story in CHECK_BINOP. I've just posted GH-15448 to similarly make returns explicit there. It basically consists of a number of repetitions of

    -    CHECK_BINOP(self, other);
    +    if (!PyLong_Check(self) || !PyLong_Check(other)) {
    +        Py_RETURN_NOTIMPLEMENTED;
    +    }

with the names `self` and `other` varying from site to site.

Though the expanded version isn't literally a `return` statement, I think it functions almost as well as one for the purposes that explicitness serves here, and much better so than a reference to `CHECK_BINOP`.

--

___ Python tracker <https://bugs.python.org/issue37812> ___
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Change by Greg Price : -- pull_requests: +15140 pull_request: https://github.com/python/cpython/pull/15448 ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue36375] PEP 499 implementation: "python -m foo" binds the main module as both __main__ and foo in sys.modules
Change by Greg Price : -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue36375> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: (A bit easy to miss in the way this thread gets displayed, so to highlight in a comment: GH-15265 is up, following the 5 other patches which have now all been merged. That's the one that replaces the length-18 tuples with a dataclass.) -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35518] test_timeout uses blackhole.snakebite.net domain which doesn't exist anymore
Greg Price added the comment: I ran across this test when looking at especially slow files in the test suite: it turns out that not only is this service currently down, but the snakebite.net domain still exists, and as a result the test can end up waiting 20-30s before learning that the hosts can't be found and the test gets skipped. I agree with Benjamin's and Victor's comments -- the best solution would be to recreate the test, ideally as something that anyone (anyone with Docker installed, perhaps?) can just run locally. For now I've just sent GH-15349 as a one-line fix to skip the test, with a remark pointing at this issue. It's already getting skipped 100% of the time thanks to the handy `support.transient_internet` mechanism -- this just makes the skip (a) explicit in the source code, and (b) a lot faster. :-) -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue35518> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35518] test_timeout uses blackhole.snakebite.net domain which doesn't exist anymore
Change by Greg Price : -- keywords: +patch pull_requests: +15063 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15349 ___ Python tracker <https://bugs.python.org/issue35518> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue36502] str.isspace() for U+00A0 and U+202F differs from document
Greg Price added the comment: Thanks Victor for the reviews and merges! (Unmarking 2.7, because https://docs.python.org/2/library/stdtypes.html seems to not have this issue.) -- versions: -Python 2.7 ___ Python tracker <https://bugs.python.org/issue36502> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37872] Move _Py_IDENTIFIER statics in Python/import.c to top of the file
Change by Greg Price : -- components: +Interpreter Core ___ Python tracker <https://bugs.python.org/issue37872> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37872] Move _Py_IDENTIFIER statics in Python/import.c to top of the file
Change by Greg Price : -- title: Move statics in Python/import.c to top of the file -> Move _Py_IDENTIFIER statics in Python/import.c to top of the file ___ Python tracker <https://bugs.python.org/issue37872> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37872] Move statics in Python/import.c to top of the file
Change by Greg Price : -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue37872> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Change by Greg Price : -- pull_requests: +15027 pull_request: https://github.com/python/cpython/pull/15302 ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue36502] str.isspace() for U+00A0 and U+202F differs from document
Change by Greg Price : -- pull_requests: +15026 pull_request: https://github.com/python/cpython/pull/15301 ___ Python tracker <https://bugs.python.org/issue36502> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37864] Correct and deduplicate docs on "printable" characters
Change by Greg Price : -- keywords: +patch pull_requests: +15025 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15300 ___ Python tracker <https://bugs.python.org/issue37864> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37864] Correct and deduplicate docs on "printable" characters
New submission from Greg Price : While working on #36502 and then #18236 about the definition and docs of str.isspace(), I looked closely also at its neighbor str.isprintable(). It turned out that we have the definition of what makes a character "printable" documented in three places, giving two different definitions. The definition in the comment on `_PyUnicode_IsPrintable` is inverted, so that's an easy small fix. With that correction, the two definitions turn out to be equivalent -- but to confirm that, you have to go look up, or happen to know, that those are the only five "Other" categories and only three "Separator" categories in the Unicode character database. That makes it hard for the reader to tell whether they really are the same, or if there's some subtle difference in the intended semantics. I've taken a crack at writing some improved docs text for a single definition, borrowing ideas from the C comment as well as the existing docs text; and then pointing there from the other places we'd had definitions. PR coming shortly. -- components: Unicode messages: 349792 nosy: Greg Price, ezio.melotti, vstinner priority: normal severity: normal status: open title: Correct and deduplicate docs on "printable" characters versions: Python 3.8 ___ Python tracker <https://bugs.python.org/issue37864> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
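For reference, the single definition being converged on can be spelled out directly against the character database (a sketch; str.isprintable itself is implemented in C against the same category data):

    import unicodedata

    def is_printable_char(ch):
        # A character is printable unless it is in an "Other" category
        # (Cc, Cf, Cs, Co, Cn) or a "Separator" category (Zl, Zp, Zs),
        # with the single exception of the ASCII space.
        if ch == " ":
            return True
        return unicodedata.category(ch) not in {
            "Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs",
        }

    assert is_printable_char("A") and "A".isprintable()
    assert not is_printable_char("\u00a0") and not "\u00a0".isprintable()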
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Change by Greg Price : -- nosy: +vstinner ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue32771] merge the underlying data stores of unicodedata and the str type
Greg Price added the comment:

> About the RSS memory, I'm not sure how Linux accounts the Unicode databases
> before they are accessed. Is it like read-only memory loaded on demand when
> accessed?

It stands for "resident set size", as in "resident in memory"; and it only counts pages of real physical memory. The intention is to count up pages that the process is somehow using.

Where the definition potentially gets fuzzy is if this process and another are sharing some memory. I don't know much about how that kind of edge case is handled. But one thing I think it's pretty consistently good at is not counting pages that you've nominally mapped from a file, but haven't actually forced to be loaded physically into memory by actually looking at them.

That is: say you ask for a file (or some range of it) to be mapped into memory for you. This means it's now there in the address space, and if the process does a load instruction from any of those addresses, the kernel will ensure the load instruction works seamlessly. But: most of it won't be eagerly read from disk or loaded physically into RAM. Rather, the kernel's counting on that load instruction causing a page fault; and its page-fault handler will take care of reading from the disk and sticking the data physically into RAM.

So until you actually execute some loads from those addresses, the data in that mapping doesn't contribute to the genuine demand for scarce physical RAM on the machine; and it also isn't counted in the RSS number.

Here's a demo! This 262392 kiB (269 MB) Git packfile is the biggest file lying around in my CPython directory:

    $ du -k .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack
    262392	.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack

Open it for read -- adds 100 kiB, not sure why:

    $ python
    Python 3.7.3 (default, Apr 3 2019, 05:39:12)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os, mmap
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:      9968 kB
    >>> fd = os.open('.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack', os.O_RDONLY)
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:     10068 kB

Map it into our address space -- RSS doesn't budge:

    >>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    >>> len(m)
    268684419
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:     10068 kB

Cause the process to actually look at all the data (this takes about ~10s, too)...

    >>> sum(len(l) for l in m)
    268684419
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:    271576 kB

RSS goes way up, by 261508 kiB! Oddly slightly less (by ~1MB) than the file's size.

But wait, there's more. Drop that mapping, and RSS goes right back down (OK, keeps 8 kiB extra):

    >>> del m
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:     10076 kB

... and then map the exact same file again, and it's *still* down:

    >>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    >>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
    VmRSS:     10076 kB

This last step is interesting because it's a certainty that the data is still physically in memory -- this is my desktop, with plenty of free RAM. And it's even in our address space. But because we haven't actually loaded from those addresses, it's still in memory only at the kernel's caching whim, and so apparently our process doesn't get "charged" or "blamed" for its presence there.

In the case of running an executable with a bunch of data in it, I expect that the bulk of the data (and of the code for that matter) winds up treated very much like the file contents we mmap'd in. It's mapped but not eagerly physically loaded; so it doesn't contribute to the RSS number, nor to the genuine demand for scarce physical RAM on the machine.

That's a bit long :-), but hopefully informative. In short, I think for us RSS should work well as a pretty faithful measure of the real memory consumption that we want to be frugal with.

--

___ Python tracker <https://bugs.python.org/issue32771> ___
[issue36502] str.isspace() for U+00A0 and U+202F differs from document
Change by Greg Price : -- pull_requests: +15019 pull_request: https://github.com/python/cpython/pull/15296 ___ Python tracker <https://bugs.python.org/issue36502> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37848] More fully implement Unicode's case mappings
Greg Price added the comment: (I should add that it was only after doing the reading that produced the OP that I had a clear idea what I thought the priority of the issue was -- before doing that work I didn't have a clear sense of the scope of what it affects. Based on that SpecialCasing.txt file as of Unicode 12.0.0, I believe the functionality we don't currently support is entirely about the handling of certain versions of the Latin letter I, as treated in Lithuanian, Turkish, and Azerbaijani. Though one function of this issue thread is that it would be a great place to point out if there's another component to it!) -- ___ Python tracker <https://bugs.python.org/issue37848> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37848] More fully implement Unicode's case mappings
Greg Price added the comment: > Maintaining Python is already expensive [...] There are already enough bugs > waiting for you to be fixed ;-) BTW I basically agree with this. I think this is not a high-priority issue, and I have my eye on some of those bugs. :-) I think the fact that it's per-*language* (despite my inaccurate phrasing in the OP), not per-locale, simplifies it some -- for example the whole `.UTF-8` vs `.utf8` thing doesn't appear. And in particular I think if/when someone decides to sit down and make an implementation of this, then if they take the time to carefully read and absorb the relevant pages of the standard... this is a feature where it's pretty feasible for the implementation to be a self-contained and relatively stable and low-bugs piece of code. And in general I think even if nobody implements it soon, it's useful to have an issue that can be pointed to for this feature, and especially so if the discussion clearly lays out what the feature involves and what different people's views are on the API. For example #18236 has been open for 6 years, but the discussion there was extremely helpful for me to understand it and work up a fix, after just being pointed to it by someone who'd searched the tracker on seeing me send in the doc fix GH-15019. -- ___ Python tracker <https://bugs.python.org/issue37848> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37848] More fully implement Unicode's case mappings
Greg Price added the comment:

> I believe that all locale specific things should be in the locale module, not
> in the str class.

The locale module is all about doing things with the current process-global Unix locale. I don't think that'd be an appropriate interface for this -- if it's worth doing, it's worth doing in such a way that the same web server process can handle requests for Turkish-, Lithuanian-, and Spanish-speaking users without having to reset a global variable for each one.

> If a locale specific mapping is requested, this should be done
> explicitly by e.g. providing a parameter to str.lower() / upper() / title().

I like this design.

I said "locale" above, but that wasn't quite right, I think -- the file says e.g. `tr`, not `tr_TR` and `tr_CY`, and it describes the identifiers as "language IDs". So perhaps

    str.lower(*, lang=None)

? And then

    "I".lower(lang="tr") == "ı" == "\N{Latin small letter dotless I}"

--

___ Python tracker <https://bugs.python.org/issue37848> ___
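A rough sketch of what the tailored behavior amounts to for the Turkish rules, written today as an explicit helper (everything here is hypothetical illustration; no such parameter exists in CPython, and a real implementation would follow all the conditions in SpecialCasing.txt):

    def lower_tr(s):
        # Turkish/Azerbaijani tailoring from SpecialCasing.txt, simplified:
        # I (U+0049) lowercases to dotless i (U+0131), and
        # İ (U+0130) lowercases to plain i (U+0069).
        return s.replace("I", "\u0131").replace("\u0130", "i").lower()

    print(lower_tr("DİYARBAKIR"))   # 'diyarbakır'
    # A lang parameter as sketched above might then be spelled:
    #   "DİYARBAKIR".lower(lang="tr")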
[issue32771] merge the underlying data stores of unicodedata and the str type
Greg Price added the comment: OK, I forked off the discussion of case-mapping as #37848. I think it's probably good to first sort out what we want, before returning to how to implement it (if it's agreed that changes are desired.) Are there other areas of functionality that would be good to add in the core, and require data that's currently only in unicodedata? -- ___ Python tracker <https://bugs.python.org/issue32771> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37848] More fully implement Unicode's case mappings
Greg Price added the comment: Another previous discussion is #4610. -- ___ Python tracker <https://bugs.python.org/issue37848> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37848] More fully implement Unicode's case mappings
New submission from Greg Price:

Splitting this out from #32771 for more specific discussion. Benjamin writes there that it would be good to:

> implement the locale-specific case mappings of
> https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of
> the Unicode 12 standard in str.lower/upper/casefold.

and adds that an implementation would require having available in the core the data on canonical combining classes, which is currently only in the unicodedata module.

---

First, I'd like to better understand what functionality we have now and what else the standard describes.

Reading https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt , I see

* a bunch of rules that aren't language-specific
* some other rules that are.

I also see in makeunicodedata.py that we don't even parse the language-specific rules.

Here's, IIUC, a demo of us correctly implementing the language-independent rules. One line in the data file reads:

    FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings respectively:

    $ unicode --brief "$(./python -c \
        's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper())))')"
    ﬀ U+FB00 LATIN SMALL LIGATURE FF
      U+0020 SPACE
    F U+0046 LATIN CAPITAL LETTER F
    f U+0066 LATIN SMALL LETTER F
      U+0020 SPACE
    F U+0046 LATIN CAPITAL LETTER F
    F U+0046 LATIN CAPITAL LETTER F

OK, great.

---

Then here's something we don't implement. Another line in the file reads:

    00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.:

    i U+0069 LATIN SMALL LETTER I
    ̇ U+0307 COMBINING DOT ABOVE
    ́ U+0301 COMBINING ACUTE ACCENT

... but only in a Lithuanian (`lt`) locale.

One question is: what would the right API for this be? I'm not sure I'd want `str.lower`'s results to depend on the process's current Unix locale... and I definitely wouldn't want to get that without some way of instead telling it what locale to use. (Either to use a single constant locale, or to use a per-user locale in e.g. a web application.) Perhaps `str.lower` and friends would take a keyword argument `locale`?

Oh, one more link for reference: the said section of the standard is in this PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf , near the end. And a related previous issue: #12736.

--
components: Unicode
messages: 349646
nosy: Greg Price, benjamin.peterson, ezio.melotti, lemburg, vstinner
priority: normal
severity: normal
status: open
title: More fully implement Unicode's case mappings
versions: Python 3.9

___ Python tracker <https://bugs.python.org/issue37848> ___
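The gap is easy to see from a Python prompt; the expected value below comes straight from the SpecialCasing.txt line just quoted:

    # What CPython does today: the unconditional mapping, no Lithuanian tailoring.
    print("\u00cd".lower() == "\u00ed")       # True -- í, the one-character lowercase

    # What a Lithuanian-tailored lower() should produce per SpecialCasing.txt:
    expected_lt = "\u0069\u0307\u0301"        # i + combining dot above + combining acute
    print("\u00cd".lower() == expected_lt)    # False -- the lt rule isn't applied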
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- pull_requests: +14985 pull_request: https://github.com/python/cpython/pull/15265 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: > From my perspective, the main problem with using type annotations is that > there's nothing checking them in CI.

Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning everyone not to do that in their codebases, because what you really don't want is a bunch of annotations that have gradually accumulated falsehoods as the code has changed around them.

Still, I think from "some annotations + no checking" the good equilibrium to land in is "some annotations + checking", not "no annotations + no checking". (I do mean "some" -- I don't predict we'll ever go sweeping all over adding them.) And I think the highest-probability way to get there is to let them continue to accumulate where people occasionally add them in new/revised code... because that holds a door open for someone to step up to start checking them, and then to do the work to make that part of CI. (That someone might even be me! But I can think of plenty of other likely folks to do it.)

If we accumulated quite a lot of them and nobody had yet stepped up to make checking happen, I'd worry. But with the smattering we currently have, I think that point is far off.

-- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue32771] merge the underlying data stores of unicodedata and the str type
Greg Price added the comment: Speaking of improving functionality: > Having unicodedata readily accessible to the str type would also permit > a higher-fidelity unicode implementation. For example, implementing > language-tailored str.lower() requires having the canonical combining class of a > character available. This data lives only in unicodedata currently. Benjamin, can you say more about the behavior you have in mind here? I don't entirely follow. (Is or should there be an issue for it?) -- versions: +Python 3.9 -Python 3.8 ___ Python tracker <https://bugs.python.org/issue32771> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue32771] merge the underlying data stores of unicodedata and the str type
Greg Price added the comment: > Loading it dynamically reduces the memory footprint.

Ah, this is a good question to ask! First, FWIW on my Debian buster desktop I get a smaller figure for `import unicodedata`: only 64 kiB.

$ python
Python 3.7.3 (default, Apr 3 2019, 05:39:12) [GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:      9888 kB
>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:      9952 kB

But whether 64 kiB or 160 kiB, it's much smaller than the 1.1 MiB of the whole module. Which makes sense -- there's no need to bring the whole thing in memory when we only import it, or generally to bring into memory the parts we aren't using. I wouldn't expect that to change materially if the tables and algorithms were built in.

Here's another experiment: suppose we load everything that ast.c needs in order to handle non-ASCII identifiers.

$ python
Python 3.7.3 (default, Apr 3 2019, 05:39:12) [GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:      9800 kB
>>> là = 3
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:      9864 kB

So that also comes to 64 kiB. We wouldn't want to add 64 kiB to our memory use for no reason; but I think 64 or 160 kiB is well within the range that's an acceptable cost if it gets us a significant simplification or improvement to core functionality, like Unicode.

-- ___ Python tracker <https://bugs.python.org/issue32771> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: > This is good. But the title mentioned dataclasses, and they are 3.7+. Ahh, sorry, I think now I understand you. :-) Indeed, when I switch to the branch with that change (https://github.com/gnprice/cpython/commit/2b4aec4dd -- it comes after the patch that's GH-15248, so I haven't yet sent it as a PR), then `python3.6 Tools/unicode/makeunicodedata.py` no longer works. I think this is fine. Most of all that's because this always works: ./python Tools/unicode/makeunicodedata.py Anyone who's going to be running that script will want to build a `./python` right afterward, in order to at least run the tests. So it doesn't seem like much trouble to do the build first and then run the script (and then a quick rebuild for the handful of changed files), if indeed the person doesn't already have a `./python` lying around. In fact `./python` is exactly what I used most of the time to run this script when I was developing these changes, simply because it seemed like the natural thing to do. -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: > What is the minimal Python version for developing CPython? The system Python > 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger. Ah, I think my previous message had an ambiguous parse: the earliest that *uses* of the typing module appeared in the stdlib was 3.7. The typing module has been around longer than that. I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, both at master and with GH-15248. I think it would be OK for doing development on CPython to require the latest minor version (i.e. 3.7) -- after all, if you're doing development, you're already building it, so you can always get a newer version than your system provides if needed. But happily the question is moot here, so I guess the place to discuss that further would be a new thread. -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: > BTW: Since when do we use type annotations in Python's stdlib ? Hmm, interesting question! At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, importlib. The earliest it appeared was in 3.7.0a4. It's in more places in the test suite, which I think is a closer parallel to this maintainer script in Tools/. The typing module itself is in the stdlib, so I don't see any obstacle to using it more widely. I imagine the main reason it doesn't appear more widely already is simply that it's new, and most of the stdlib is quite stable. -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: > I like to run pyflakes time to time on the Python code base. Please avoid > "import *" since it prevents pyflakes (and other code analyzers) to find bugs. Ah fair enough, thanks! Pushed that change to the next/current PR, GH-15248. -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue32771] merge the underlying data stores of unicodedata and the str type
Change by Greg Price : -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue32771> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- pull_requests: +14969 pull_request: https://github.com/python/cpython/pull/15248 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: > On the other hand, if it is tricky and requires something more than minor > surgery, that would be a hint that it isn't worth it. There is some value in > code that is stable and known to be working just fine. Definitely true! I think this change turned out quite clean, though. Just sent, as GH-15216 -- please take a look. -- ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Change by Greg Price : -- pull_requests: +14944 pull_request: https://github.com/python/cpython/pull/15216 ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: > Sometimes for testing we turn it off in order to identify identity test bugs. Interesting! Well, if the option is useful for testing, that's certainly a good reason to keep it. > Also, eveonline was controlling this to save memory. Also interesting. I feel like there must be something about their setup that I'm not imagining, because this is a pretty tiny amount of memory (9 kB) on the scale of a typical Python server process in my experience. Or do you mean they were increasing the numbers? One thought I had looking at this was actually that it would be interesting to do some measurements and try to pick new (default) values for these parameters - it looks like they're the same today as in 1993, and it's likely that with all that's changed in typical hardware since then (plus the switch from Python ints + longs to only what we used to call longs), the optimum values today are different from the optimum in 1993. And certainly one of the effects of this optimization when its hit rate is good is reducing memory consumption - you have only one 17 instead of however many you make. I'll rework the explicit-return patch to send without this change. It'll be a little trickier but should be entirely doable. On Sat, Aug 10, 2019, 23:19 Raymond Hettinger wrote: > > Raymond Hettinger added the comment: > > P.S. Making the returns explicit is a reasonable change to make, but the > core #ifdef logic should remain. > > -- > > ___ > Python tracker > <https://bugs.python.org/issue37812> > ___ > -- ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
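To make the identity effect concrete (this is long-standing CPython behavior with the default settings, where the cache covers -5 through 256), here is what it looks like at the interactive prompt, where each line is compiled separately:

>>> a = 256
>>> b = 256
>>> a is b   # both names point at the one cached 256 object
True
>>> a = 257
>>> b = 257
>>> a is b   # 257 is outside the cache, so these are distinct objects
False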
[issue37802] micro-optimization of PyLong_FromSize_t()
Change by Greg Price : -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue37802> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Greg Price added the comment: I've just sent GH-15203 which is the first of two patches for this. It's quite small. The second patch, which completes the change, is also small: https://github.com/gnprice/cpython/commit/c6b905104 It depends on the first one, so I think the easiest is for me to send it as a PR after the first is merged? Happy to do another workflow if that'd make review easier. (I don't see guidance in the devguide on handling a sequence of patches, so I'm making this up.) -- nosy: +aeros167, mark.dickinson, sir-sigurd ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
Change by Greg Price : -- keywords: +patch pull_requests: +14932 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15203 ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37812] Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT)
New submission from Greg Price : In longobject.c we have the following usage a few times:

PyObject *
PyLong_FromLong(long ival)
{
    PyLongObject *v;
    // ... more locals ...

    CHECK_SMALL_INT(ival);

    if (ival < 0) {
        /* negate: can't write this as abs_ival = -ival since that
           invokes undefined behaviour when ival is LONG_MIN */
        abs_ival = 0U-(unsigned long)ival;
        sign = -1;
    }
    else {
        // ... etc. etc.

The CHECK_SMALL_INT macro contains a `return`, so the function can actually return before it reaches any of the other code.

#define CHECK_SMALL_INT(ival) \
    do if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) { \
        return get_small_int((sdigit)ival); \
    } while(0)

That's not even an error condition -- in fact it's the fast, hopefully reasonably-common, path. An implicit return like this is pretty surprising for the reader. And it only takes one more line (plus a close-brace) to make it explicit:

    if (IS_SMALL_INT(ival)) {
        return get_small_int((sdigit)ival);
    }

so that seems like a much better trade. Patch written, will post shortly. -- components: Interpreter Core messages: 349359 nosy: Greg Price priority: normal severity: normal status: open title: Make implicit returns explicit in longobject.c (in CHECK_SMALL_INT) versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue37812> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
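For concreteness, the predicate could be just the condition lifted out of the existing macro -- something along these lines (a sketch; the exact spelling in the patch may differ):

#define IS_SMALL_INT(ival) \
    (-NSMALLNEGINTS <= (ival) && (ival) < NSMALLPOSINTS)

/* A plain expression with no hidden return, so each call site writes the
   early return out explicitly, as in the snippet above. */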
[issue18236] str.isspace should use the Unicode White_Space property
Greg Price added the comment: Good question! With the patch: >>> import re >>> re.match(r'\s', '\x1e') >>> In other words, the definition of the regexp r'\s' follows along. Good to know. -- ___ Python tracker <https://bugs.python.org/issue18236> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- pull_requests: +14903 pull_request: https://github.com/python/cpython/pull/15171 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18236] str.isspace should use the Unicode White_Space property
Greg Price added the comment: I've gone and made a patch for this change: https://github.com/gnprice/cpython/commit/7dab9d879 Most of the work happens in the script Tools/unicode/makeunicodedata.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. So I'd like to try to merge those improvements first. I've filed #37760 for those preparatory changes, and posted several PRs (GH-15128, GH-15129, GH-15130) as bite-sized pieces. These PRs can go in in any order. Please take a look! Reviews appreciated. -- nosy: +Greg Price versions: +Python 3.9 -Python 3.5 ___ Python tracker <https://bugs.python.org/issue18236> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Greg Price added the comment: Just posted three PRs: * GH-15128 and GH-15129 are both quite small * GH-15130 is the first of two patches factoring out common parsing logic. Two remaining patches go on top of GH-15130. Here are drafts, in case they're helpful for reference: * Patch 2/2 factoring out common parsing logic: https://github.com/gnprice/cpython/commit/0a32a7111 * Patch converting the big tuples to a dataclass: https://github.com/gnprice/cpython/commit/6d8103bbc I figure they may be easiest to review after the PR they depend on is merged, so my plan is to send PRs for them each in turn. -- ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- pull_requests: +14870 pull_request: https://github.com/python/cpython/pull/15130 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- pull_requests: +14869 pull_request: https://github.com/python/cpython/pull/15129 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
Change by Greg Price : -- keywords: +patch pull_requests: +14868 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15128 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37760] Refactor makeunicodedata.py: dedupe parsing, use dataclass
New submission from Greg Price : I spent some time yesterday on #18236, and I have a patch for it. Most of that work happens in the script Tools/unicode/makeunicodedata.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. I'd like to try to merge those improvements first. The main changes are:

* As the script has grown over the years, it's gained many copies and reimplementations of logic to parse the standard format of the Unicode character database. I factored those out into a single place, which makes the parsing code shorter and the interesting parts stand out more easily.

* The main per-character record type in the script's data structures is a length-18 tuple. Using the magic of dataclasses, I converted this so that e.g. the code says `record.numeric_value` instead of `record[8]` (see the sketch below).

There's no radical restructuring or rewrite here; this script has served us well. I've kept these changes focused where there's a high ratio of value, in future ease of development, to cost, in a reviewer's effort as well as mine. I'll send PRs of my changes shortly.

-- components: Unicode messages: 349020 nosy: Greg Price, ezio.melotti, vstinner priority: normal severity: normal status: open title: Refactor makeunicodedata.py: dedupe parsing, use dataclass type: enhancement versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue37760> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
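As a rough sketch of what the dataclass change buys (the field names here are illustrative; the real record mirrors the columns of UnicodeData.txt in file order):

from dataclasses import dataclass

# Illustrative only: a few of the UnicodeData.txt columns, so code can say
# record.numeric_value instead of indexing a bare tuple like record[8].
@dataclass
class UcdRecord:
    name: str
    general_category: str
    canonical_combining_class: str
    bidi_class: str
    numeric_value: str
    # ... remaining fields omitted; the real record has 18, in file order.

rec = UcdRecord("DIGIT FOUR", "Nd", "0", "EN", "4")
print(rec.numeric_value)   # "4", where the tuple version needed rec[8]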
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Greg Price added the comment: Sent two small PRs! The first one, GH-15125, makes the substantive test change I described above. The second one, GH-15126, is a small pure refactor to that test file, just cleaning out some bits that made sense when it was first written (as a script) but are confusing now that it's a `unittest` test module. Took me a couple of minutes to sort those out when I first dug into this file, and I figure it'd be kind to the next person to save them the same effort. -- ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Change by Greg Price : -- pull_requests: +14866 pull_request: https://github.com/python/cpython/pull/15126 ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Change by Greg Price : -- keywords: +patch pull_requests: +14865 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15125 ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37758] unicodedata checksum-tests only test 1/17th of Unicode's codepoints
New submission from Greg Price : The unicodedata module has two test cases which run through the database and make a hash of its visible outputs for all codepoints, comparing the hash against a checksum. These are helpful regression tests for making sure the behavior isn't changed by patches that didn't intend to change it. But Unicode has grown since Python first gained support for it, when Unicode itself was still rather new. These test cases were added in commit 6a20ee7de back in 2000, and they haven't needed to change much since then... but they should be changed to look beyond the Basic Multilingual Plane (`range(0x10000)`) and cover all 17 planes of Unicode's final form. Spotted in discussion on GH-15019 (https://github.com/python/cpython/pull/15019#discussion_r308947884 ). I have a patch for this which I'll send shortly. -- components: Tests messages: 349014 nosy: Greg Price priority: normal severity: normal status: open title: unicodedata checksum-tests only test 1/17th of Unicode's codepoints type: enhancement versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue37758> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
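Roughly, the shape of such a checksum test looks like the following -- a simplified sketch for illustration, not the actual code in Lib/test/test_unicodedata.py:

import hashlib
import unicodedata

def database_checksum(limit):
    # Hash the visible outputs for every codepoint below `limit`, so any
    # unintended change to the database shows up as a digest mismatch.
    h = hashlib.sha1()
    for cp in range(limit):
        ch = chr(cp)
        h.update(";".join([
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            str(unicodedata.combining(ch)),
            unicodedata.decomposition(ch),
        ]).encode("ascii"))
    return h.hexdigest()

# The old tests stopped at the BMP; covering every plane means 0x110000.
print(database_checksum(0x10000))
print(database_checksum(0x110000))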
[issue36502] str.isspace() for U+00A0 and U+202F differs from document
Greg Price added the comment: The actual behavior turns out to match that comment. See attached PR, which adds a test confirming that and also corrects the documentation. (A related issue is #18236 -- we should probably adjust the definition to match the one Unicode now provides. But meanwhile we'll want to correct the docs.) -- nosy: +Greg Price ___ Python tracker <https://bugs.python.org/issue36502> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
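For reference, the behavior in question, checked on a current CPython (both characters are in category Zs, and both are reported as whitespace):

>>> "\u00a0".isspace(), "\u202f".isspace()   # NO-BREAK SPACE, NARROW NO-BREAK SPACE
(True, True)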
[issue36502] str.isspace() for U+00A0 and U+202F differs from document
Change by Greg Price : -- keywords: +patch pull_requests: +14836 stage: -> patch review pull_request: https://github.com/python/cpython/pull/15019 ___ Python tracker <https://bugs.python.org/issue36502> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26452] Wrong line number attributed to comprehension expressions
New submission from Greg Price: In a multi-line list comprehension (or dict or set comprehension), the code for the main expression of the comprehension is wrongly attributed to the *last* line of the comprehension, which might be several lines later. This makes for quite baffling tracebacks when an exception occurs -- for example this program:

```
def f():
    return [j
            for i in range(3)
            if i]

f()
```

produces (with CPython from current `default`):

```
Traceback (most recent call last):
  File "foo.py", line 15, in <module>
    f()
  File "foo.py", line 3, in f
    for i in range(3)
  File "foo.py", line 4, in <listcomp>
    if i]
NameError: name 'j' is not defined
```

showing the line `if i]`, which has nothing to do with the error and gives very little hint as to where the exception is being raised.

Disassembly confirms that the line numbers on the code object are wrong:

```
  2           0 BUILD_LIST               0
              3 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                18 (to 27)

  3           9 STORE_FAST               1 (i)

  4          12 LOAD_FAST                1 (i)
             15 POP_JUMP_IF_FALSE        6
             18 LOAD_GLOBAL              0 (j)
             21 LIST_APPEND              2
             24 JUMP_ABSOLUTE            6
        >>   27 RETURN_VALUE
```

The `LOAD_GLOBAL` instruction for `j` is attributed to line 4, when it should be line 2.

A similar issue affects multi-line function calls, which get attributed to a line in the last argument. This is less often so seriously confusing because the function called is right there as the next frame down on the stack, but it's much more common and it makes the traceback look a little off -- I've noticed this as a minor annoyance for years, before the more serious comprehension issue got my attention.

Historically, line numbers were constrained to be wrong in these ways because the line-number table `co_lnotab` on a code object required its line numbers to increase monotonically -- and the code for the main expression of a comprehension comes after all the `for` and `if` clauses, so it can't get a line number earlier than theirs. Victor Stinner's recent work in https://hg.python.org/cpython/rev/775b74e0e103 lifted that restriction in the `co_lnotab` data structure, so it's now just a matter of actually entering the correct line numbers there.

I have a draft patch to do this, attached here. It fixes the issue both for comprehensions and function calls, and includes tests. Things I'd still like to do before considering the patch ready:

* There are a couple of bits of logic that I knock out that can probably be simplified further.
* While I'm looking at this, there are several other forms of expression and statement that have or probably have similar issues, and I'll want to go and review them too to either fix or determine that they're fine. The ones I've thought of are included in the draft test file, either as actual tests (with their current answers) or TODO comments for me to investigate.

Comments very welcome on the issue and my draft patch, and meanwhile I'll continue with the further steps mentioned above. Thanks to Benjamin Peterson for helping diagnose this issue with me when we ran into a confusing traceback that ran through a comprehension.
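As an aside for anyone who wants to reproduce this: the attributed line numbers can be listed directly with the standard dis module. This is a sketch, assuming a CPython where the comprehension still compiles to its own code object (newer versions inline comprehensions and have reworked line-number tables):

```
import dis

def f():
    return [j
            for i in range(3)
            if i]

# Look for the comprehension's own code object among f's constants.
comp = next((c for c in f.__code__.co_consts if hasattr(c, "co_code")), None)

if comp is not None:
    # findlinestarts() yields (bytecode offset, line number) pairs.  On an
    # affected interpreter, the LOAD_GLOBAL for `j` falls under the last
    # line of the comprehension (`if i]`) rather than the line containing `j`.
    for offset, lineno in dis.findlinestarts(comp):
        print(offset, lineno)
else:
    # CPython 3.12+ inlines comprehensions; just disassemble f directly.
    dis.dis(f)
```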
-- components: Interpreter Core files: lines.diff keywords: patch messages: 260966 nosy: Greg Price priority: normal severity: normal status: open title: Wrong line number attributed to comprehension expressions type: behavior versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6 Added file: http://bugs.python.org/file42042/lines.diff ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26452> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com