[Python-Dev] Re: v3.8b1 Breaks PyQt on Windows (Issue 36085/os.add_dll_directory())

2019-06-23 Thread Phil Thompson

Carol,

I'm "happy" with Steve's position. Fundamentally I am at fault for 
assuming that a combination of the stable ABI and Python's deprecation 
policy meant that I could assume that a wheel for Python v3.x would 
continue to work for v3.x+1. For this particular change I don't see how 
a normal deprecation warning could have been implemented. The only 
alternative would have been to delay the implementation for v3.9 and 
have loud warnings in the v3.8 docs about the upcoming change.
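
For reference, the replacement mechanism that packages now have to opt 
into looks roughly like this; a minimal sketch, and the Qt location 
below is hypothetical:

import os
import sys

# Python 3.8+ on Windows no longer consults PATH when resolving the
# DLL dependencies of extension modules, so packages must register
# their DLL directories explicitly before the extension is imported.
if sys.platform == "win32" and hasattr(os, "add_dll_directory"):
    os.add_dll_directory(r"C:\Qt\5.12\bin")  # hypothetical Qt DLL dir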


Phil

On 23/06/2019 00:06, Carol Willing wrote:

Hi Phil,

Thanks for trying the beta. Please file this as an issue at
bugs.python.org. Doing so would be helpful for folks who can look into
the issue.

Thanks,

Carol

On 6/22/19 2:04 PM, Phil Thompson wrote:
The implementation of issue 36085 breaks PyQt on Windows, because PyQt 
relies on PATH to find the Qt DLLs. The problem is that PyQt is built 
using the stable ABI, and a single wheel is supposed to support all 
versions of Python starting with v3.5. On the assumption (perhaps 
naive) that using the stable ABI would avoid future compatibility 
issues, the existing PyPI wheels have long included cp38 among their 
tags.


Was this issue considered at the time? What is the official view?

Thanks,
Phil
___
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/YFNKFRJGNM25VUGDJ5PVCQM4WPLZU6J7/

___
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/VWIWASHPBMLQDS7PQX3LXILT56Q2KCPO/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Antoine Pitrou
On Fri, 21 Jun 2019 17:18:18 -0500, Tim Peters wrote:
> 
> > And what would be an efficient way of detecting allocations punted to
> > malloc, if not address_in_range?  
> 
> _The_ most efficient way is the one almost all allocators used long
> ago:  use some "hidden" bits right before the address returned to the
> user to store info about the block being returned.

There's a fundamental problem here: you can't be sure that all
allocators reserve such space.  If some allocator doesn't, it can
return a pointer just at the very start of the page, and if obmalloc
tries to look at "a few bits before" that address, it could very well
page-fault.

> Neil Schemenauer takes a different approach in the recent "radix tree
> arena map for obmalloc" thread here.  We exchanged ideas on that until
> it got to the point that the tree levels only need to trace out
> prefixes of obmalloc arena addresses.  That is, the new space burden
> of the radix tree appears quite reasonably small.

One open question is the cache efficiency of the two approaches.
Intuitively, address_in_range() will often look at exactly the same
cache line (since a significant number of allocations will share the
same "page prefix").  Apparently, this benefit may be offset by cache
aliasing issues.  Cache aliasing can also be mitigated by the fact that
CPU caches are most of the time N-way with N > 1 (but N generally
remains small, from 2 to 8, for L1 and L2 caches).
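
For reference, here is a small runnable Python model of the check under
discussion (my sketch of the idea, not the real C macro): every address
inside a 4 KiB pool maps to the same pool header word, whose arena
index is then validated against the arena table.

ARENA_SIZE = 256 * 1024   # obmalloc's historical arena size
POOL_SIZE = 4 * 1024      # one pool per 4 KiB page in the status quo

arenas = [0x7f12_0000_0000]   # base addresses of live arenas
pool_headers = {}             # pool base -> arena index ("pool->arenaindex")

def carve_pool(arena_index, pool_number):
    base = arenas[arena_index] + pool_number * POOL_SIZE
    pool_headers[base] = arena_index   # written when the pool is carved
    return base

def address_in_range(p):
    # Mask the low bits: every allocation in the page hits the same
    # header word - hence the shared cache line mentioned above.
    pool_base = p & ~(POOL_SIZE - 1)
    # In C this read stays inside p's own page so it never faults, but
    # for foreign memory it may see garbage - which is why the index
    # must be validated against the arena table below.
    idx = pool_headers.get(pool_base, -1)
    return (0 <= idx < len(arenas) and
            0 <= p - arenas[idx] < ARENA_SIZE)

p = carve_pool(0, 3) + 64
assert address_in_range(p)
assert not address_in_range(0x1234_5678)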

So I guess the a priori answer is "it's complicated" :-)

I must also thank both you and Neil for running these experiments, even
though sometimes I may disagree about the conclusions :-)

Regards

Antoine.

___
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/RUQVG2KOYVMUIIX5HIZKNVN4AUXKKURM/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Thomas]
>>> And what would be an efficient way of detecting allocations punted to
>>> malloc, if not address_in_range?

[Tim]
>> _The_ most efficient way is the one almost all allocators used long
>> ago:  use some "hidden" bits right before the address returned to the
>> user to store info about the block being returned.

[Antoine]
> There's a fundamental problem here: you can't be sure that all
> allocators reserve such space.  If some allocator doesn't, it can
> return a pointer just at the very start of the page, and if obmalloc
> tries to look at "a few bits before" that address, it could very well
> page-fault.

I snipped some technical but crucial context in my reply to Thomas:
this was assuming users are following the documented requirement to
never mix memory calls from different families.

What you describe certainly could happen in "illegal" code that, e.g.,
obtained a block from the system malloc, but passed it to obmalloc to
free.  Which, in reality, works fine today, although the docs forbid
it.  (I'm not sure, but there _may_ be some mode of running today that
actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc
malloc/realloc spelling is called, and it needs to punt to a
different allocator, no problem:  first it boosts the size request so
it has room to store "the bit" it needs before the address it actually
returns to the client.  Then it's "legal" only to free that memory
with an obmalloc spelling of free() later - obmalloc reads "the bit",
sees "oh - that's not my memory!", and adjusts the pointer accordingly
on _its_ call to the spelling of free() corresponding to the memory
family obmalloc() used to get the memory to begin with.
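
A toy model of that dance, under simplifying assumptions of my own (a
flat bytearray stands in for memory and integers for pointers, so the
hidden tag really does sit just before the client address):

HEAP = bytearray(1 << 20)
ALIGN = 16            # room for the tag; keeps client pointers aligned
_next = ALIGN         # naive bump allocator over HEAP

def sys_malloc(n):    # stand-in for the system allocator
    global _next
    addr = _next
    _next += -(-n // ALIGN) * ALIGN   # round the request up
    return addr

def ob_malloc(n, small_threshold=512):
    addr = sys_malloc(n + ALIGN)      # boost the request for the tag
    # Tag 0 = "mine" (pretend it came from a pool); tag 1 = punted.
    HEAP[addr] = 0 if n <= small_threshold else 1
    return addr + ALIGN

def ob_free(p):
    if HEAP[p - ALIGN]:  # read "the bit" just before the client address
        pass             # in C: adjust back and free(p - ALIGN) via the
                         # family the memory actually came from
    else:
        pass             # in C: return the block to its obmalloc pool

p = ob_malloc(10_000)    # big request: punted to the "system" allocator
assert HEAP[p - ALIGN] == 1
ob_free(p)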

>> Neil Schemenauer takes a different approach in the recent "radix tree
>> arena map for obmalloc" thread here. ...

> One open question is the cache efficiency of the two approaches.
> Intuitively, address_in_range() will often look at exactly the same
> cache line (since a significant number of allocations will share the
> same "page prefix").

I believe we haven't seen a program yet that used more than one node
at the tree's top level :-)  But who knows?  mmap() and VirtualAlloc()
don't appear to make any _guarantees_ that the high-order bits of
returned addresses aren't entirely random.  In real life so far, they
always appear to be zeroes.

While x86 has a 64-bit virtual address space, current hardware "only"
implements 48 bits of it, and I haven't seen a virtual address yet
where any of the top 16 bits are set.

AMD requires that the top 16 bits of virtual addresses be copies of
bit 47 (the "canonical address" rule).
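
A concrete rendering of that rule (my illustration):

def is_canonical(addr):
    # Bits 63-48 must all equal bit 47, i.e. the top 17 bits of a
    # 64-bit address are either all clear or all set.
    top17 = addr >> 47
    return top17 in (0, 0x1_FFFF)

assert is_canonical(0x0000_7FFF_FFFF_FFFF)   # highest user-space address
assert is_canonical(0xFFFF_8000_0000_0000)   # lowest kernel-half address
assert not is_canonical(0x0001_0000_0000_0000)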

Blah blah blah - for the foreseeable future, the top level of the tree
has a very easy job.

And Neil keenly observed that the top level of the tree can be
_declared_ as very broad (and so suck up a lot of the leading
bits), because it's a file static and is effectively just an address
space reservation (at least on Linux) until nodes in it actually get
used.
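
Here is a rough sketch of the arena-map idea (my reconstruction, with
dicts standing in for the C node arrays and arena-aligned bases
assumed - not Neil's actual code):

ARENA_BITS = 18             # 256 KiB arenas
LEVELS = (16, 15, 15)       # split the 46 address bits above the arena
                            # offset; a broad top level is cheap, per
                            # the observation above

def chunks(addr):
    key = addr >> ARENA_BITS            # drop the within-arena offset
    out = []
    for width in reversed(LEVELS):      # peel chunks from the low end
        out.append(key & ((1 << width) - 1))
        key >>= width
    return tuple(reversed(out))         # (top, mid, leaf) indices

root = {}

def mark_arena(base):
    top, mid, leaf = chunks(base)
    root.setdefault(top, {}).setdefault(mid, set()).add(leaf)

def in_obmalloc_arena(addr):
    top, mid, leaf = chunks(addr)
    return leaf in root.get(top, {}).get(mid, set())

base = 0x7f3a_4000_0000     # a 256 KiB-aligned arena base
mark_arena(base)
assert in_obmalloc_arena(base + 12_345)
assert not in_obmalloc_arena(0x5555_0000_0000)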

>  Apparently, this benefit may be offset by cache
> aliasing issues.  Cache aliasing can also be mitigated by the fact that
> CPU caches are most of the time N-way with N > 1 (but N generally
> remains small, from 2 to 8, for L1 and L2 caches).
>
> So I guess the a priori answer is "it's complicated" :-)

Indeed it is.

> I must also thank both you and Neil for running these experiments, even
> though sometimes I may disagree about the conclusions :-)

Well, there aren't any conclusions yet - just seeing the same things repeatedly.

Over the weekend, Neil ran many variations of a longish-running "real
job" related to his work, which goes through phases of processing bulk
database operations and "trying to" release the space (details are
complicated).

Arena recycling was essentially non-existent in either branch (my PR
or the radix tree).

In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but
on closer examination they were virtually all of the "useless arena
thrashing" flavor.  The number of arenas in use was almost always
within one of the high-water mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB.

Which is just another instance of the "256 KiB arenas are already way
too fat for effective arena recycling unless the program is
exceptionally friendly in its dynamic memory use patterns"
observation.

We haven't seen a case yet where 1 MiB arenas do materially worse than
256 KiB ones in this respect.

Speed is generally a wash between the branches, although they
consistently appear to be faster (by a little, not a lot) than 3.7.3.

The radix tree generally appears to be a little more memory-frugal
than my PR (presumably because my PR needs to break "big pools" into
4 KiB chunks while the tree branch doesn't, which buys the tree more
space to actually store objects than the new tree costs).

We need more "real jobs", and especially ones where arena recycling is
actually doing good in 3.7.3 (wh

[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Tim]
> The radix tree generally appears to be a little more memory-frugal
> than my PR (presumably because my PR needs to break "big pools" into
> 4 KiB chunks while the tree branch doesn't, which buys the tree more
> space to actually store objects than the new tree costs).

It depends a whole lot on the size classes of the most popular
objects.  A program to compute it all appears below.  For a 64-bit box using
3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
SQ 1012  1.2%
PR 1018  0.6%
RT 1021  0.3%

SQ is the status quo:  we have to use four separate 4 KiB pools, and
each burns 48 bytes for a pool header.

PR:  my PR.  There's one pool spanning 4 pages, with 48 bytes for a
pool header in the first page, and 16 bytes to store the arena index
in each of the other 3 pages.

RT:  the radix tree.  One 16 KiB block that only "wastes" 48 bytes for
the pool header.

The first number on each line is the number of objects we can get from
a "big pool" under that scheme.  The second number is the % of total
pool space that cannot be used for client objects.

So, in the above, PR is a substantial improvement over SQ, and RT a
less substantial improvement over PR.  Regardless of size class, PR
never does worse than SQ, and RT never worse than PR.
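
For instance, here is how the size-16 lines above fall out of those
definitions (my arithmetic): SQ gets (4096 - 48) // 16 = 253 objects
per page, times 4 pages = 1012; PR gets 253 from the first page plus
(4096 - 16) // 16 = 255 from each of the other 3 pages, so 253 + 765 =
1018; RT gets (16384 - 48) // 16 = 1021 from the single big pool.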

The most dramatic difference is in the largest size class:

size 512
SQ   28 12.5%
PR   28 12.5%
RT   31  3.1%

RT is a huge win there.  And while it's generally true that RT's
advantages are more striking in the larger size classes, it doesn't
always crush.  For example, in the 2nd-largest size class, it doesn't
matter at all which scheme is used:

size 496
SQ   32  3.1%
PR   32  3.1%
RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
SQ   32  6.2%
PR   32  6.2%
RT   34  0.4%

I trust the general principle at work here is too obvious to need
explanation ;-)

Anyway, these differences can really add up when there are millions of
objects passed out from a size class where RT has an advantage.
That's very attractive to me.

On the other hand, this _effectively_ makes arenas even larger (they
can contain more objects), which apparently makes it even less likely
that arenas can eventually be returned to the system.

But, on the third hand, I've seen no evidence yet that increasing
arena size matters much at all to that after hitting 128 KiB arenas
(smaller than what we already use).  "Uncooperative" programs
essentially recycle nothing regardless, and "happy" programs
essentially recycle almost as many arena bytes with 1 MiB arenas as
with 8 KiB arenas.

Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16  # change to 8 for < Python 3.8

PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size) +
            from_block(size, PAGE_SIZE, ALIGNMENT)
            * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)

for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                  ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"{tag} {nobj:4} {waste/POOL_SIZE:5.1%}")
___
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/S5KMU6M6GZACRNFCF3TNPE7NKDCMQD5E/