[Python-Dev] Re: How about using modern C++ in development of CPython ?

2022-01-16 Thread Denis Kotov
Stephen J. Turnbull wrote:
> You take a hunk of the standard library (in this case it would have to
> be an accelerator written in C since you want to compare C++ vs. C) or
> interpreter code, and translate it to the new syntax.
> Now, *INCREF and friends are frequently cited as annoyances or even
> warts, so your suggestion of std::shared _ptr<> seemed plausible to
> me.  But Antoine peeked at it and points out it's not that easy, for
> performance reasons you can't use it "as is".  It's true that you
> could reimplement it, but then it's not std::shared_ptr<> any more and
> the only benefit left is that it looks familiar to C++ programmers
> (and may mislead *them* into thinking about it as if it were exactly
> std::shared_ptr<>).

Re-implementing of std::shared _ptr<> for single thread is smallest of the 
problems

> And that's why you need to do more work than arguing that in principle
> C++ is just a better language than C.  We've been hearing that for 4
> decades now (at least we greybeards have), and we've discovered that
> for many existing applications, C++ may be better but the cost of
> converting large swaths of C code to equivalent C++ that passes all
> tests is too big.  Python may very well be one of them.
> So if you're not going to do the work to demonstrate big wins from
> using C++ instead of C in actual Python implementation code, I think
> you're wasting your posts.

I thought about it, but will CPython accept the PR for this changes if it would 
show benefits ?
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HHXVWQ4GJSBQFYBVW6IH42DE75BHUXVR/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: How about using modern C++ in development of CPython ?

2022-01-16 Thread Denis Kotov
Yes, each compiler implement its own compiler ABI, but for me it is not a 
problem at all, just provide build with 3 major compilers gcc/clang (due to 
binary compatability) and Visual Studio
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YJF3TJARY3BRTQ5WVPCZIWQ4XWXVB45Y/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Tim Peters
[Guido]
> I don't think there's a way to do a PGO build from Visual Studio; but
> a command prompt in the repo can do it using `PCbuild\build.bat --pgo`.
> Just be patient with it.

Thanks! That worked, and was easy, and gave me an executable that runs
"// 10" at supernatural speed.

Alas, Visual Studio will not show a disassembly window unless the
debugger is running, and there appears to be no way to convince it to
run its debugger without it first recompiling the source file you're
staring at. Which it recomplies without benefit of PGO.

So, in the end, it was an elaborate way to reproduce the ;non-PGO
optimized machine code I already saw :-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/D47HHYYASRHS56AZL6SEK4Y7K5FNSJOP/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Gregory P. Smith
On Sun, Jan 16, 2022 at 1:51 PM Mark Dickinson  wrote:

> On Sun, Jan 16, 2022 at 9:28 PM Guido van Rossum  wrote:
>
>> Does the optimization for //10 actually help in the real world? [...]
>>
>
> Yep, I don't know. If 10 is *not* the most common small divisor in real
> world code, it must at least rank in the top five. I might hazard a guess
> that division by 2 would be more common, but I've no idea how one would go
> about establishing that.
>

All of the int constants relating to time and date calculations show up
frequently as well.  But I'd assume -fprofile-values isn't likely to pick
many to specialize on to avoid adding branches so maybe 10 is ironically
it.  --enable-optimizations with clang doesn't trigger value specialization
(I'm pretty sure they support the concept, but I've never looked at how).

>
> The reason that the divisor of 10 is turning up from the PGO isn't a
> particularly convincing one - it looks as though it's a result of our
> testing the builtin int-to-decimal-string conversion by comparing with an
> obviously-correct repeated-division-by-10 algorithm.
>
> Then again I'm not sure what's *lost* even if this optimization is
>> pointless -- surely it doesn't slow other divisions down enough to be
>> measurable.
>>
>
> Agreed. That at least is testable. I can run some timings (but not
> tonight).
>

BTW, I am able to convince clang 11 and higher to produce a 64:32 divide
instruction with a modified version of the code. Basically just taking your
assembly divl variant as an example and writing that explicitly as the
operations in C:

https://godbolt.org/z/63eWPczjx

Taking that code and turning it into an actual test within CPython itself,
it appears to deliver the desired speedup in gcc9 as well.
https://github.com/python/cpython/pull/30626 for
https://bugs.python.org/issue46406.

20% faster microbenchmarking with x//1 or x//17 or other non-specialized
divide values.  similar speedup even in --enable-optimizations builds.
with both gcc9 and clang13.

The compilers seem happier optimizing that code.

-gps
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FJRKRXSXPF24C3NHGYZMVPB3ZZPCBI6A/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Guido van Rossum
On Sun, Jan 16, 2022 at 2:21 PM Tim Peters  wrote:

> I have to believe the same is true under Visual Studio 2019, but
> offhand don't know how to prove that. I understand Steve uses PGO to
> build the python.org Windows release, but I've never done that - the
> "Release build" configuration I get from Github does not use PGO, and
> the code I get from my own "release builds" is equally slow for all
> divisors (and the generated assembler is obviously not trying to
> special-case anything).
>

I don't think there's a way to do a PGO build from Visual Studio; but a
command prompt in the repo can do it using `PCbuild\build.bat --pgo`. Just
be patient with it.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/6QENWEMEUMOHVA4OO3JZK75T3K2FSELD/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Tim Peters
[Tim, incidentally notes that passing 10 as the divisor to inplace_divrem1()
 is "impossibly fast" on Windows, consuming less than a third the time as
 when passing seemingly any other divisor]

[Mark Dickinson, discovers much the same is true under other, but not all,
 Linux-y builds, due to the generated code doing a runtime(!) check for
 10 and using a scaled integer reciprocal multiplication trick, instead of
 HW division, in that specific case]

And there's a picture of this in the dictionary, next to the "dubious
optimization" entry ;-)

I have to believe the same is true under Visual Studio 2019, but
offhand don't know how to prove that. I understand Steve uses PGO to
build the python.org Windows release, but I've never done that - the
"Release build" configuration I get from Github does not use PGO, and
the code I get from my own "release builds" is equally slow for all
divisors (and the generated assembler is obviously not trying to
special-case anything).

You later gave PGO the credit, which made sense from the start. But I
_assumed_ PGO was latching on to // 10 because all sorts of programs
use that all the time simply as a consequence of converting Python
integers to strings. But that's not it, The base-conversion code
doesn't call inplace_divrem1() - it does its own division loops with
compile-time constants.

So, as you already said ;-), it appears to be mostly an accident due
to the specific tests we feed to PGO builds.

More interesting to me now is the "over 3x faster" empirical
observation. The paper I referenced earlier has a cheaper way to use
scaled integer reciprocals, requiring nothing worse than 32x32->64 int
mult, and, to compute the fudged reciprocal, nothing worse than
64x32-:>32 int divide.

OTOH, libdivide already has working code ready to use, and - unlike
the paper I referenced - does not require "normalizing" the divisor
(shifting left to ensure it's >= base/2).

libdivide needs more cycles to compute its 64-bit reciprocal, but in
the context of bigints that's amortized over the number of digits in
the numerator. A win is a win ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Mark Dickinson
On Sun, Jan 16, 2022 at 9:28 PM Guido van Rossum  wrote:

> Does the optimization for //10 actually help in the real world? [...]
>

Yep, I don't know. If 10 is *not* the most common small divisor in real
world code, it must at least rank in the top five. I might hazard a guess
that division by 2 would be more common, but I've no idea how one would go
about establishing that.

The reason that the divisor of 10 is turning up from the PGO isn't a
particularly convincing one - it looks as though it's a result of our
testing the builtin int-to-decimal-string conversion by comparing with an
obviously-correct repeated-division-by-10 algorithm.

Then again I'm not sure what's *lost* even if this optimization is
> pointless -- surely it doesn't slow other divisions down enough to be
> measurable.
>

Agreed. That at least is testable. I can run some timings (but not tonight).

-- 
Mark
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OMZI5SQ2SQ7SYN4PCDKIXQQIKGXVJTO5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Guido van Rossum
As a bystander, this is all fascinating (I had actually anticipated that
the //10 optimization came from PGO).

Does the optimization for //10 actually help in the real world? It would if
people did a lot of manual conversion to decimal, which is easiest
expressed using //10. But presumably for that people mostly end up using
str() or repr(), which has its own custom code,
long_to_decimal_string_internal().

Then again I'm not sure what's *lost* even if this optimization is
pointless -- surely it doesn't slow other divisions down enough to be
measurable.

On Sun, Jan 16, 2022 at 12:35 PM Mark Dickinson  wrote:

> On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson 
> wrote:
>
>> So gcc is anticipating divisions by 10 and introducing special-case
>> divide-by-reciprocal-multiply code for that case, and presumably the
>> profile generated for the PGO backs up this being a common enough case, so
>> we end up with the above code in the final compilation.
>>
>
> Nope, that's not what's happening. This analysis is backwards, and
> unfairly attributes to GCC the apparently arbitrary choice to
> optimise division by 10. But it's not GCC's fault; it's ours. What's
> *actually* happening is that GCC is simply recording values for n used in
> calls to divrem1 (via the -fprofile-values option, which is implied by
> -fprofile-generate, which is used as a result of the --enable-optimizations
> configure script option). It's then noticing that in our profile task
> (which consists of a selection of Lib/test/test_*.py test files) we most
> often do divisions by 10, and so it optimizes that case.
>
> To test this hypothesis I added a large number of tests for division by 17
> in test_long.py, and then recompiled from scratch (again with
> --enable-optimizations). Here are the results:
>
> root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
> "x=10**1000; y=10" "x//y"
>
> 100 loops, best of 5: 1.14 usec per loop
>
> root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
> "x=10**1000; y=17" "x//y"
>
> 100 loops, best of 5: 306 nsec per loop
>
> root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
> "x=10**1000; y=1" "x//y"
>
> 100 loops, best of 5: 1.14 usec per loop
>
> root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
> "x=10**1000; y=2" "x//y"
>
> 100 loops, best of 5: 1.15 usec per loop
>
> As expected, division by 17 is now optimised; division by 10 is as slow as
> division by other small scalars.
>
> --
> Mark
>
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/2MOQCVMEQBV7PATT47GUYHS42QIJHTRK/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/67DXYX3YLMDC5R4X6FI3NMRT2TGZDZHC/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Mark Dickinson
On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson  wrote:

> So gcc is anticipating divisions by 10 and introducing special-case
> divide-by-reciprocal-multiply code for that case, and presumably the
> profile generated for the PGO backs up this being a common enough case, so
> we end up with the above code in the final compilation.
>

Nope, that's not what's happening. This analysis is backwards, and unfairly
attributes to GCC the apparently arbitrary choice to optimise division by
10. But it's not GCC's fault; it's ours. What's *actually* happening is
that GCC is simply recording values for n used in calls to divrem1 (via the
-fprofile-values option, which is implied by -fprofile-generate, which is
used as a result of the --enable-optimizations configure script option).
It's then noticing that in our profile task (which consists of a selection
of Lib/test/test_*.py test files) we most often do divisions by 10, and so
it optimizes that case.

To test this hypothesis I added a large number of tests for division by 17
in test_long.py, and then recompiled from scratch (again with
--enable-optimizations). Here are the results:

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
"x=10**1000; y=10" "x//y"

100 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
"x=10**1000; y=17" "x//y"

100 loops, best of 5: 306 nsec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
"x=10**1000; y=1" "x//y"

100 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 100 -s
"x=10**1000; y=2" "x//y"

100 loops, best of 5: 1.15 usec per loop

As expected, division by 17 is now optimised; division by 10 is as slow as
division by other small scalars.

-- 
Mark
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/2MOQCVMEQBV7PATT47GUYHS42QIJHTRK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Mark Dickinson
On Sun, Jan 16, 2022 at 4:11 PM Terry Reedy  wrote:

>
>
> https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi
>
> and
>
>
> https://stackoverflow.com/questions/30790184/perform-integer-division-using-multiplication
>
> have multiple discussions of the technique for machine division
> invariant (small) ints and GCC's use thereof (only suppressed with -0s?).
>

Yes, it's an old and well-known technique, and compilers have been using it
for division by a known-at-compile-time constant for many decades. What's
surprising here is the use by GCC in a situation where the divisor is
*not* known
at compile time - that GCC essentially guesses that a divisor of 10 is
common enough to justify special-casing.

There's also the libdivide library[1], which caters to situations where you
have a divisor not known at compile time but you know you're going to be
using it often enough to compensate for the cost of computing the magic
multiplier dynamically at run time.

[1] https://libdivide.com

-- 
Mark
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/PPF6TOGH6QJXGKYTYVVAQC4D3D3HT7R4/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Terry Reedy

On 1/16/2022 7:08 AM, Mark Dickinson wrote:

Now /that/ I certainly wasn't expecting. I don't see the same effect on 
macOS / Clang, whether compiling with --enable-optimizations or not; 
this appears to be a GCC innovation. And indeed, as Tim suggested, it 
turns out that there's no division instruction present in the loop for 
the division-by-10 case - we're doing division via multiplication by the 
reciprocal. In Python terms, we're computing `x // 10` as `(x * 
0xcccd) >> 67`.


https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi

and

https://stackoverflow.com/questions/30790184/perform-integer-division-using-multiplication

have multiple discussions of the technique for machine division 
invariant (small) ints and GCC's use thereof (only suppressed with -0s?).


--
Terry Jan Reedy
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/PY3QL2XXHGGFGCA3M3JNQ6CULDIKO622/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?

2022-01-16 Thread Mark Dickinson
On Sat, Jan 15, 2022 at 8:12 PM Tim Peters  wrote:

> Something is missing here, but can't guess what without seeing the
> generated machine code.But I trust Mark will do that.
>

Welp, there goes my weekend. :-)

 $ python -m timeit -n 150 -s "x = 10**1000" "x//10"

150 loops, best of 5: 376 nsec per loop
>
> Which actually makes little sense to me. [...] Under 4 nsec per iteration
> seems

close to impossibly fast on a 3.8GHz box, given the presence of any
> division instruction.







However, dividing by 10 is not a worst case on this box. Dividing by
> 100 is over 3x slower:
>
> $ python -m timeit -n 150 -s "x = 10**1000" "x//100"
> 150 loops, best of 5: 1.25 usec per loop


Now *that* I certainly wasn't expecting. I don't see the same effect on
macOS / Clang, whether compiling with --enable-optimizations or not; this
appears to be a GCC innovation. And indeed, as Tim suggested, it turns out
that there's no division instruction present in the loop for the
division-by-10 case - we're doing division via multiplication by the
reciprocal. In Python terms, we're computing `x // 10` as `(x *
0xcccd) >> 67`. Here's the tell-tale snippet of the assembly
output from the second compilation (the one that makes use of the generated
profile information) of longobject.c at commit
09087b8519316608b85131ee7455b664c00c38d2

on
a Linux box, with GCC 11.2.0. I added a couple of comments, but it's
otherwise unaltered

.loc 1 1632 36 view .LVU12309
movl %r13d, %r11d
salq $2, %rbp
cmpl $10, %r13d # compare divisor 'n' with 10, and
jne .L2797  # go to the slow version if n != 10
leaq 1(%r10), %r9 # from here on, the divisor is 10
addq %rbp, %r8
.LVL3442:
.loc 1 1632 36 view .LVU12310
addq %rbp, %rdi
.LVL3443:
.loc 1 1632 36 view .LVU12311
.LBE8049:
.loc 1 1624 15 view .LVU12312
xorl %r13d, %r13d
.LVL3444:
.loc 1 1624 15 view .LVU12313
movabsq $-3689348814741910323, %r11 # magic constant 0xcccd for
division by 10

and then a few lines later:

.loc 1 1630 9 is_stmt 1 view .LVU12316
.loc 1 1631 9 view .LVU12317
.loc 1 1631 39 is_stmt 0 view .LVU12318
movl (%r8,%r10,4), %r14d # move top digit of divisor into the low word of
r14
.LVL3446:
.loc 1 1632 9 is_stmt 1 view .LVU12319
movq %r14, %rax # set up for division: top digit is now in rax
.loc 1 1633 13 is_stmt 0 view .LVU12320
movq %r14, %r13
mulq %r11 # here's the division by 10: multiply by the magic constant
shrq $3, %rdx # and divide by 8 (via a shift)

and then it all gets a bit repetitive and boring - there's a lot of loop
unrolling going on.

So gcc is anticipating divisions by 10 and introducing special-case
divide-by-reciprocal-multiply code for that case, and presumably the
profile generated for the PGO backs up this being a common enough case, so
we end up with the above code in the final compilation.

TIL ...

-- 
Mark
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/VDII5EBMXLNO4U3BSSNWAW2ETLNG6YUN/
Code of Conduct: http://python.org/psf/codeofconduct/