[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Tim Peters
[Tim]
> - For truly effective RAM releasing, we would almost certainly need to
> make major changes, to release RAM at an OS page level.   256K arenas
> were already too fat a granularity.

We can approximate that closely right now by using 4K pools _and_ 4K
arenas:  one pool per arena, and mmap()/munmap() are then called on
one page at a time.

[Don't try this at home ;-)  There are subtle assumptions in the code
that there are at least two pools in an arena, and those have to be
overcome first.]

For memcrunch.py, using 200x the original initial objects, this works
quite well!  Note that this still uses our current release-arenas
heuristic:  the only substantive change from the status quo is setting
ARENA_SIZE to POOL_SIZE (both 4 KiB - one page).

# arenas allocated total   =  873,034
# arenas reclaimed =  344,380
# arenas highwater mark=  867,540
# arenas allocated current =  528,654
528654 arenas * 4096 bytes/arena   =2,165,366,784

# bytes in allocated blocks=1,968,234,336
# bytes in available blocks=  141,719,280
5349 unused pools * 4096 bytes =   21,909,504
# bytes lost to pool headers   =   25,118,640
# bytes lost to quantization   =8,385,024
# bytes lost to arena alignment=0
Total  =2,165,366,784

So, at the end, space utilization is over 90%:

1,968,234,336 / 2,165,366,784 = 0.90896117

OTOH, an even nastier version of the other program I posted isn't
helped much at all, ending like so after phase 10:

# arenas allocated total   =1,025,106
# arenas reclaimed =   30,539
# arenas highwater mark=1,025,098
# arenas allocated current =  994,567
994567 arenas * 4096 bytes/arena   =4,073,746,432

# bytes in allocated blocks=  232,861,440
# bytes in available blocks=2,064,665,008
424741 unused pools * 4096 bytes   =1,739,739,136
# bytes lost to pool headers   =   27,351,648
# bytes lost to quantization   =9,129,200
# bytes lost to arena alignment=0
Total  =4,073,746,432

So space utilization is under 6%:

232,861,440 / 4,073,746,432 = 0.0571615

Believe it or not, that's slightly (but _only_ slightly) better than
when using the current 256K/4K arena/pool mix, which released no
arenas at all and ended with

232,861,440 / 4,199,022,592 = 0.05545611

utilization.

So:

- There's substantial room for improvement in releasing RAM by
tracking it at OS page level.

- But the current code design is (very!) poorly suited for that.

- In some non-contrived cases it wouldn't really help anyway.

A natural question is how much arena size affects final space
utilization for memcrunch.py.  Every successive increase over one pool
hurts, but eventually it stops mattering much.  Here are the possible
power-of-2 arena sizes, using 4K pools, ending with the smallest for
which no arenas get reclaimed:

1,968,234,336 / 2,165,366,784 = 0.90896117
528654 arenas * 4096 bytes/arena   =2,165,366,784
# bytes in allocated blocks=1,968,234,336

1,968,234,336 / 2,265,399,296 = 0.86882447
276538 arenas * 8192 bytes/arena   =2,265,399,296
# bytes in allocated blocks=1,968,234,336

1,968,235,360 / 2,441,314,304 = 0.80621957
149006 arenas * 16384 bytes/arena  =2,441,314,304
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 2,623,799,296 = 0.75014707
80072 arenas * 32768 bytes/arena   =2,623,799,296
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 2,924,216,320 = 0.67308131
44620 arenas * 65536 bytes/arena   =2,924,216,320
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,299,475,456 = 0.59652978
25173 arenas * 131072 bytes/arena  =3,299,475,456
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,505,913,856 = 0.56140437
13374 arenas * 262144 bytes/arena  =3,505,913,856
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,552,051,200 = 0.55411233
6775 arenas * 524288 bytes/arena   =3,552,051,200
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,553,624,064 = 0.55386707
3389 arenas * 1048576 bytes/arena  =3,553,624,064
# bytes in allocated blocks=1,968,235,360

Most of the damage was done by the time we reached 128K arenas, and
"almost all" when reaching 256K.

I expect that's why I'm not seeing much of any effect (on arena
recycling effectiveness) moving from the current 256K/4K to the PR's
1M/16K.  256K/4K already required "friendly" allocation/deallocation
patterns for the status quo to do real good, and 256K 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Tim Peters
And one more random clue.

The memcrunch.py attached to the earlier-mentioned bug report does
benefit a lot from changing to a "use the arena with the smallest
address" heuristic, leaving 86.6% of allocated bytes in use by objects
at the end (this is with the arena-thrashing fix, and the current
256K/4K arena/pool sizes).  Which isn't surprising, else the person
who opened that report wouldn't have attached that specific script ;-)

However, when I increase its number of starting objects by a factor of
200, changing the heuristic barely helped at all.  It managed to
reclaim a few dozen more arenas by the end (out of 13,556 total arenas
allocated), but left only 56.1% of arena space in use by objects.

For the later program I posted, which by accident-rather-than-design
is even worse for arena cycling, changing the heuristic didn't help at
all.  Either way, no arenas were released (although one empty arena is
retained in the usable_arenas list), and less than 10% of arena space
was in use at the end of phase 10.

A few things to note, then:

- For truly effective RAM releasing, we would almost certainly need to
make major changes, to release RAM at an OS page level.   256K arenas
were already too fat a granularity.

- Whether a program benefits from what we do now appears to depend,
by far, mostly on its patterns of allocation/deallocation, rather
than on the arena-releasing heuristic OR on the arena size.

- While the by-lowest-address heuristic is obviously more effective in
one known artificial program so far ;-) , it's more expensive than
what we do now.  That's a full-blown sorting problem.  Ordering by
number of free pools is not:  that's essentially a bucket-distribution
problem, since the number of _possible_ values is small.  The data
structure changes I already checked in truly take constant time, in
all cases, to restore sorted order when an arena's number of free
pools changes (see the sketch after this list).  When sorting by
address, the best that can be done in
general is to take O(log(A)) time to insert a new arena (where A is
the number of arenas).  And a linked list is no good for that either
(unless, e.g., it's made fancier, like a skip list).
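
To make the bucket-distribution point concrete, here's a minimal
Python model of the idea - purely illustrative, with made-up names,
not obmalloc's actual C code or data layout:

POOLS_PER_ARENA = 64   # e.g. a 256 KiB arena holding 4 KiB pools

class Arena:
    def __init__(self, address):
        self.address = address
        self.nfreepools = 0

class UsableArenas:
    def __init__(self):
        # buckets[n] holds the arenas with exactly n free pools
        self.buckets = [set() for _ in range(POOLS_PER_ARENA + 1)]

    def add(self, arena):
        self.buckets[arena.nfreepools].add(arena)

    def change_free_pools(self, arena, delta):
        # constant time:  just move the arena between two buckets
        self.buckets[arena.nfreepools].discard(arena)
        arena.nfreepools += delta
        self.buckets[arena.nfreepools].add(arena)

    def pick_most_populated(self):
        # "allocate from the most heavily populated arena":  scan a
        # small, fixed number of buckets, from fewest free pools up
        for bucket in self.buckets:
            if bucket:
                return next(iter(bucket))
        return None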

The quick experiment I tried used one-at-a-time search to put an
arena in the by-address-ordered list (as did the submitted patch),
and we
already know from experience that can become a timing disaster when
hundreds of thousands of arenas are in use.  OTOH, that only needs to
be done once each time an arena is added to usable_arenas, not every
time an arena's number of free pools changes.

So, overall:

- I'm not inclined to pursue changing the arena-freeing heuristic.

- Changes to keep track of OS page-level use would likely require
starting over from scratch.  But would also make using far larger
pools practical (on 64-bit boxes).

- I still haven't found a program where changing from 256K to 1M
arenas makes a lick of appreciable difference in arena recycling
effectiveness:  for a given program chewing on a given problem size,
"does pretty well" or "does horribly" has been the outcome either way.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Antoine Pitrou
On Mon, 17 Jun 2019 13:44:29 -0500
Tim Peters  wrote:
> 
> To illustrate, I reverted that change in my PR and ran exactly the same
> thing.  Wow - _then_ the 1M-arena-16K-pool PR reclaimed 1135(!) arenas
> instead of none.  Almost all worse than uselessly.  The only one that
> "paid" was the last:  the run ended with 3361 arenas still in use
> instead of 3362.  Because with the change, one entirely empty arena
> remained on the usable_arenas list.
> 
> So, word to the wise:  when looking at _debugmallocstats() output, like:
> 
> # arenas allocated total   =4,496
> # arenas reclaimed =1,135
> # arenas highwater mark=3,362
> # arenas allocated current =3,361
> 3361 arenas * 1048576 bytes/arena  =3,524,263,936
> 
> the number "reclaimed" isn't really telling us much:  before 3.9, it
> may be telling us only how many times obmalloc wasted time on useless
> arena thrashing.

That's a very nice improvement :-)

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Tim]
> ...
> Now under 3.7.3.  First when phase 10 is done building:
>
> phase 10 adding 9953410
> phase 10 has 16743920 objects
>
> # arenas allocated total   =   14,485
> # arenas reclaimed =2,020
> # arenas highwater mark=   12,465
> # arenas allocated current =   12,465
> 12465 arenas * 262144 bytes/arena  =3,267,624,960
>
> # bytes in allocated blocks=3,216,219,656
>
> Space utilization is again excellent.  A significant number of arenas
> were reclaimed - but usefully?

Nope!  Digging through all the output, all the arena recycling
happened in the "_add_ millions of objects" stages, never in the
"delete millions of objects" stages.  So it was just more accidental
arena thrashing (which was recently repaired in bpo-37257).


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
Heh.  I wasn't intending to be nasty, but this program makes our arena
recycling look _much_ worse than memcrunch.py does.  It cycles through
phases.  In each phase, it first creates a large randomish number of
objects, then deletes half of all objects in existence.  Except that
every 10th phase, it deletes 90% instead.  It's written to go through
100 phases, but I killed it after 10 because it was obviously going to
keep on growing without bound.

Note 1:  to do anything deterministic with obmalloc stats these days
appears to require setting the envar PYTHONHASHSEED to 0 before
running (else stats vary even by the time you get to an interactive
prompt).
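
For example, via a driver along these lines (hypothetical - the only
point is that the envar must be set before the interpreter starts):

import os
import subprocess
import sys

# Re-run a test script with deterministic hashing; PYTHONHASHSEED
# must be in the environment before the child interpreter starts.
env = dict(os.environ, PYTHONHASHSEED="0")
subprocess.run([sys.executable, "memcrunch.py"], env=env, check=True)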

Note 2:  there are 3 heavily used size classes here, for ints,
2-tuples, and class instances, of byte sizes 32, 64, and 96 on 64-bit
boxes, under my PR and under released 3.7.3.

First with my branch, after phase 10 finishes building objects:

phase 10 adding 9953410
phase 10 has 16743920 objects

# arenas allocated total   =3,114
# arenas reclaimed =0
# arenas highwater mark=3,114
# arenas allocated current =3,114
3114 arenas * 1048576 bytes/arena  =3,265,265,664

# bytes in allocated blocks=3,216,332,784

No arenas have ever been reclaimed, but space utilization is excellent
(about 98.5% of arenas are being used by objects).

Then after phase 10 deletes 90% of everything still alive:

phase 10 deleting factor 90% 15069528
phase 10 done deleting

# arenas allocated total   =3,114
# arenas reclaimed =0
# arenas highwater mark=3,114
# arenas allocated current =3,114
3114 arenas * 1048576 bytes/arena  =3,265,265,664

# bytes in allocated blocks=  323,111,488

Still no arenas have been released, and space utilization is horrid.
A bit less than 10% of allocated space is being used for objects.

Now under 3.7.3.  First when phase 10 is done building:

phase 10 adding 9953410
phase 10 has 16743920 objects

# arenas allocated total   =   14,485
# arenas reclaimed =2,020
# arenas highwater mark=   12,465
# arenas allocated current =   12,465
12465 arenas * 262144 bytes/arena  =3,267,624,960

# bytes in allocated blocks=3,216,219,656

Space utilization is again excellent.  A significant number of arenas
were reclaimed - but usefully?  Let's see how things turn out after
phase 10 ends deleting 90% of the objects:

phase 10 deleting factor 90% 15069528
phase 10 done deleting

# arenas allocated total   =   14,485
# arenas reclaimed =2,020
# arenas highwater mark=   12,465
# arenas allocated current =   12,465
12465 arenas * 262144 bytes/arena  =3,267,624,960

# bytes in allocated blocks=  322,998,360

Didn't manage to reclaim anything!  Space utilization is again
horrid, and it's actually consuming slightly more arena bytes than
when running under the PR.

Which is just more of what I've been seeing over & over:  3.7.3 and
the PR both do a fine job of recycling arenas, or a horrid job,
depending on the program.

For excellent recycling, change this program to use a dict instead of a set.  So

data = {}

at the start, fill it with

data[serial] = Stuff()

and change

data.pop()

to use .popitem().

The difference is that set elements still appear in pseudo-random
order, but dicts are in insertion-time order.  So data.popitem() loses
the most recently added dict entry, and the program is then just
modeling stack allocation/deallocation.
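
A quick way to see the difference (under CPython 3.7+, where
dict.popitem() is documented to be LIFO):

d = {i: None for i in range(5)}
print(d.popitem())   # (4, None) - the most recently added entry
s = set(range(5))
print(s.pop())       # an arbitrary element, in hash-table order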

def doit():
    import random
    from random import randrange
    import sys

    class Stuff:
        # add cruft so it takes 96 bytes under 3.7 and 3.8
        __slots__ = tuple("abcdefg")

        def __hash__(self):
            return hash(id(self))

    LO = 5_000_000
    HI = LO * 2
    data = set()
    serial = 0
    random.seed(42)

    for phase in range(1, 101):
        # build: add a random number of (int, Stuff) 2-tuples
        toadd = randrange(LO, HI)
        print("phase", phase, "adding", toadd)
        for _ in range(toadd):
            data.add((serial, Stuff()))
            serial += 1
        print("phase", phase, "has", len(data), "objects")
        sys._debugmallocstats()
        # tear down: delete half of everything, or 90% every 10th phase
        factor = 0.5 if phase % 10 else 0.9
        todelete = int(len(data) * factor)
        print(f"phase {phase} deleting factor {factor:.0%} {todelete}")
        for _ in range(todelete):
            data.pop()
        print("phase", phase, "done deleting")
        sys._debugmallocstats()

doit()

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Tim]
> ...
> Here are some stats from running [memcrunch.py] under
> my PR, but using 200 times the initial number of objects
> as the original script:
>
> n = 2000 #number of things
>
> At the end, with 1M arena and 16K pool:
>
> 3362 arenas * 1048576 bytes/arena  =3,525,312,512
> # bytes in allocated blocks=1,968,233,888
> ...
>  With the larger arenas, none [arenas] were ever released.

...

> BTW, anyone keen to complicate the mmap management should first take
> this recent change into account::
>
> https://bugs.python.org/issue37257
>
> That appears to have killed off _the_ most overwhelmingly common cause
> of obmalloc counter-productively releasing an arena only to create a
> new one again milliseconds later.
>
> My branch, and Neil's, both contain that change, which makes it much
> harder to compare our branches' obmalloc arena stats with 3.7.  It
> turns out that a whole lot of "released arenas" under 3.7 (and will
> still be so in 3.8) were due to that worse-than-useless arena
> thrashing.

To illustrate, I reverted that change in my PR and ran exactly the same
thing.  Wow - _then_ the 1M-arena-16K-pool PR reclaimed 1135(!) arenas
instead of none.  Almost all worse than uselessly.  The only one that
"paid" was the last:  the run ended with 3361 arenas still in use
instead of 3362.  Because with the change, one entirely empty arena
remained on the usable_arenas list.

So, word to the wise:  when looking at _debugmallocstats() output, like:

# arenas allocated total   =4,496
# arenas reclaimed =1,135
# arenas highwater mark=3,362
# arenas allocated current =3,361
3361 arenas * 1048576 bytes/arena  =3,524,263,936

the number "reclaimed" isn't really telling us much:  before 3.9, it
may be telling us only how many times obmalloc wasted time on useless
arena thrashing.

The _important_ bit is the difference between "highwater mark" and
"allocated current".  That's how much peak arena address reservation
declined.  In this run, it only managed to release one empty arena
from the peak (which the actual PR does not release, because bpo-37257
changed this to keep (at most) one empty arena available for reuse).


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Inada Naoki]
>> Increasing pool size is one obvious way to fix these problems.
>> I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
>> a sweet spot for recent web servers (typically, about 32 threads, and
>> 64GiB), but there is no evidence about it.

[Antoine]
> Note that the OS won't give a huge page automatically, because memory
> management becomes much more inflexible then.
>
> For example, the Linux madvise() man page has this to say about
> MADV_HUGEPAGE:
>
>   This feature is primarily aimed at  applications  that
>   use large mappings of data and access large regions of
>   that memory at a time  (e.g.,  virtualization  systems
>   such as QEMU).  It can very easily waste memory (e.g.,
>   a 2 MB mapping that only ever  accesses  1  byte  will
>   result  in  2 MB  of  wired memory instead of one 4 KB
>   page).  See the Linux kernel  source  file  Documenta‐
>   tion/vm/transhuge.txt for more details.
>
> I'm not sure a small objects allocator falls into the right use case
> for huge pages.

The SuperMalloc paper I recently pointed at notes that it uses huge
pages only for "huge" requests.  Not for "small", "medium", or "large"
requests.

But it carves up 2 MiB chunks, aligned at 2 MiB addresses, for each
size class anyway (which use 4K pages).

There are a mix of reasons for that.  Partly for the same reasons I
want bigger pools and arenas:  to stay in the fastest code paths.
Hitting page/arena/chunk boundaries costs cycles for computation and
conditional branches, and clobbers cache lines to access & mutate
bookkeeping info that the fast paths don't touch.

Also to reduce the fraction of allocator space "wasted" on bookkeeping
info.  48 header bytes out of a 4K pool is a bigger percentage hit
than, say, two 4K pages (to hold fancier allocator bookkeeping data
structures) out of a 2M chunk.
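
For concreteness, the two percentages just mentioned (plain
arithmetic, not figures from the paper):

print(f"{48 / 4096:.2%}")             # pool header: 1.17% of a 4 KiB pool
print(f"{2 * 4096 / (2 << 20):.2%}")  # two 4 KiB pages: 0.39% of 2 MiB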

And partly for the same reason Neil is keen for bigger arenas in his
branch:  to reduce the size of data structures to keep track of other
bookkeeping info (in Neil's case, a radix tree, which can effectively
shift away the lowest ARENA_BITS bits of addresses it needs to store).

Which hints at much of why it wants "huge" chunks, but doesn't explain
why it doesn't want huge pages except to satisfy huge requests.
That's because it strives to be able to release physical RAM back to
the system on a page basis (which is also part of why it needs fancier
bookkeeping data structures to manage its chunks - it needs to keep
track of which pages are in use, and apply page-based heuristics to
push toward freeing pages).

So that combines very much larger "pools" (2M v 4K) with better
chances of actually returning no-longer-used pages to the system (on a
4K basis rather than a 256K basis).  But it's built on piles of
platform-specific code, and isn't suitable at all for 32-bit boxes
(it relies on virtual address space being an abundant resource on
64-bit boxes - reserving 2M of address space is close to trivial, and
could potentially be done millions of times without getting in
trouble).
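
Rough numbers behind "close to trivial", assuming the common 47-bit
user virtual address space on x86-64:

TiB = 1 << 40
print(2 * (1 << 20) * 1_000_000 / TiB)   # a million 2 MiB chunks: ~1.9 TiB
print((1 << 47) / TiB)                   # user address space: 128.0 TiB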


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Inada Naoki
On Mon, Jun 17, 2019 at 6:14 PM Antoine Pitrou  wrote:
> But it's not enabled by default...  And there are reasons for that (see
> the manpage I quoted).

Uh, then if people want to use huge pages, they need to enable them
system-wide, or we need to add madvise() calls in obmalloc.c.

> > In web applications, it's common to one Python worker process
> > use 50~100+ MiB RSS.  2MB arena seems reasonable for those
> > applications.
>
> Perhaps, but what is the problem you are trying to solve?  Do you have
> evidence that memory management of those 50-100 MB is costly?
>

I just meant we may be able to utilize THP if we provide a large-arena option.
In other words, *if* we provide a configure option to increase the arena &
pool sizes, a 2MB arena seems reasonable to me.  That's all I wanted to say
here.

I didn't mean utilizing THP is the main motivation to increase the arena
size.  People who want to use huge pages may have a problem that huge
pages solve, but I don't have one.

Regards,
-- 
Inada Naoki  


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Antoine Pitrou

Le 17/06/2019 à 10:55, Inada Naoki a écrit :
> On Mon, Jun 17, 2019 at 5:18 PM Antoine Pitrou  wrote:
>>
>> On Mon, 17 Jun 2019 11:15:29 +0900
>> Inada Naoki  wrote:
>>>
>>> Increasing pool size is one obvious way to fix these problems.
>>> I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
>>> a sweet spot for recent web servers (typically, about 32 threads, and
>>> 64GiB), but there is no evidence about it.
>>
>> Note that the OS won't give a huge page automatically, because memory
>> management becomes much more inflexible then.
> 
> When we use contiguous 2MB pages, Linux may use huge pages
> implicitly when Transparent Huge Page is enabled.

But it's not enabled by default...  And there are reasons for that (see
the manpage I quoted).

> In web applications, it's common for one Python worker process to
> use 50~100+ MiB RSS.  A 2MB arena seems reasonable for those
> applications.

Perhaps, but what is the problem you are trying to solve?  Do you have
evidence that memory management of those 50-100 MB is costly?

Perhaps those 50-100 MB are allocated at worker startup (module
initialization structures, docstrings...) and only get deallocated at
the end, so they aren't really a problem for arena allocation cost.

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Inada Naoki
On Mon, Jun 17, 2019 at 5:18 PM Antoine Pitrou  wrote:
>
> On Mon, 17 Jun 2019 11:15:29 +0900
> Inada Naoki  wrote:
> >
> > Increasing pool size is one obvious way to fix these problems.
> > I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
> > a sweet spot for recent web servers (typically, about 32 threads, and
> > 64GiB), but there is no evidence about it.
>
> Note that the OS won't give a huge page automatically, because memory
> management becomes much more inflexible then.

When we use contiguous 2MB pages, Linux may use huge pages
implicitly when Transparent Huge Page is enabled.

Then, if we munmap one page of the 2MB, the kernel will split
the huge page into many small pages again.
I don't know whether that really happens in Python applications,
but using a 2MB arena would reduce the risk of the performance
penalty of page splitting.

In web applications, it's common for one Python worker process to
use 50~100+ MiB RSS.  A 2MB arena seems reasonable for those
applications.

I am not proposing making this the default.
I just meant this may be a reasonable configuration for many Python
users who create medium-to-large web apps.

Regards,

-- 
Inada Naoki  


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Antoine Pitrou
On Mon, 17 Jun 2019 11:15:29 +0900
Inada Naoki  wrote:
> 
> Increasing pool size is one obvious way to fix these problems.
> I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
> a sweet spot for recent web servers (typically, about 32 threads, and
> 64GiB), but there is no evidence about it.

Note that the OS won't give a huge page automatically, because memory
management becomes much more inflexible then.

For example, the Linux madvise() man page has this to say about
MADV_HUGEPAGE:

  This feature is primarily aimed at  applications  that
  use large mappings of data and access large regions of
  that memory at a time  (e.g.,  virtualization  systems
  such as QEMU).  It can very easily waste memory (e.g.,
  a 2 MB mapping that only ever  accesses  1  byte  will
  result  in  2 MB  of  wired memory instead of one 4 KB
  page).  See the Linux kernel  source  file  Documenta‐
  tion/vm/transhuge.txt for more details.

I'm not sure a small objects allocator falls into the right use case
for huge pages.

Regards

Antoine.



[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Inada Naoki ]
> obmalloc is very nice at allocating small (~224 bytes) memory blocks.
> But it seems current SMALL_REQUEST_THRESHOLD (512) is too large to me.

For the "unavoidable memory waste" reasons you spell out here,
Vladimir deliberately set the threshold to 256 at the start.  As
things turned out, though, dicts drove the decision to boost it to 512
anyway.  A comment in the code says:

 * Note: a size threshold of 512 guarantees that newly created dictionaries
 * will be allocated from preallocated memory pools on 64-bit.

That was probably before "compact dicts", but, as-is, obmalloc can
handle small dicts now even if they're not empty.  And small dicts are
common.  Another relevant one pushing dicts to use obmalloc:

https://bugs.python.org/issue23601

> ...
> Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
> to 256 and trusting malloc to work well for medium-size memory blocks.

Alas, it doesn't ;-)  Well, everyone who ever timed it found obmalloc
was significantly faster than the system malloc for every size class
obmalloc handles.  That's partly because obmalloc runs under
protection of the GIL, while the system malloc has to worry about
thread safety.

But, regardless, obmalloc is just plain hard to beat so long as it can
stay in its fast paths (which increasing pool and arena sizes aims at
helping it do).  Its current pool and arena sizes are tiny compared to
comparable structures in the high-profile allocators often mentioned
in these discussions.  However, the latter generally use
platform-specific ways to spell "free this OS page" rather than
waiting for an entire arena-like structure to become empty.

For example, "SuperMalloc"[1] effectively uses 2 MiB pools, in the
sense of "want 16 bytes? Cool, we'll start by devoting 2 MiB to
holding _only_ 16-byte objects.  Now you want 388?  Cool, we'll
reserve another 2 MiB for holding only 388-byte objects.  Etc."

[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf


> ```
> >>> pool_size = 4096 - 48  # 48 is pool header size
> >>> for bs in range(16, 513, 16):
> ... n,r = pool_size//bs, pool_size%bs + 48
> ... print(bs, n, r, 100*r/4096)
> ...
> 16 253 48 1.171875
> 32 126 64 1.5625
> 48 84 64 1.5625
> 64 63 64 1.5625
> 80 50 96 2.34375
> 96 42 64 1.5625
> 112 36 64 1.5625
> 128 31 128 3.125
> 144 28 64 1.5625
> 160 25 96 2.34375
> 176 23 48 1.171875
> 192 21 64 1.5625
> 208 19 144 3.515625
> 224 18 64 1.5625
> 240 16 256 6.25
> 256 15 256 6.25
> 272 14 288 7.03125
> 288 14 64 1.5625
> 304 13 144 3.515625
> 320 12 256 6.25
> 336 12 64 1.5625
> 352 11 224 5.46875
> 368 11 48 1.171875
> 384 10 256 6.25
> 400 10 96 2.34375
> 416 9 352 8.59375
> 432 9 208 5.078125
> 448 9 64 1.5625
> 464 8 384 9.375
> 480 8 256 6.25
> 496 8 128 3.125
> 512 7 512 12.5
> ```
>
> There are two problems here.
>
> First, pool overhead is at most about 3.5% up to 224 bytes.
> But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
> bytes, and 12.5% at 512 bytes.

Note that in sys._debugmallocstats() output, the total of the
"unusable for this reason" space is called "bytes lost to
quantization".  It's rarely a big deal in my experience, but certainly
tends to account for a higher percentage of arena space as the average
size of used objects increases.


> Second, some size classes have the same number of memory blocks.
> Classes 272 and 288 have 14 blocks; 320 and 336 have 12 blocks.
> It reduces utilization of pools.  This problem becomes bigger on
> 32-bit platforms.

I don't understand the "32bit platform" comment.  On 32-bit boxes,
alignment is to 8 byte boundaries (not 16 as above), and there are 64
size classes (not 32 as above).  Which tends to make space efficiency
better, not worse, than on 64-bit Python (16-byte alignment is a waste
for the vast majority of objects Python programs use, which rarely
require better than pointer alignment (on 64-bit boxes) or double (on
32-bit ones)).


> Increasing pool size is one obvious way to fix these problems.

Yes & no ;-)  In Neil's radix-tree branch your code above works fine
when the pool is increased.  But my PR doesn't introduce a new tree,
building on what obmalloc already does:  it absolutely needs to
"steal" 4 bytes at the start of every 4 KiB page - which turns into 16
bytes on 64-bit boxes to preserve 16-byte alignment.  The number of usable
blocks in a pool is computed by this function in the PR:

/* Return total number of blocks in pool of size index `i`, as a uint. */
static uint
NUMBLOCKS(int i)
{
    assert(0 <= i && i < NB_SMALL_SIZE_CLASSES);
    /* The first page burns space for a pool header, and remaining pages
     * burn ALIGNMENT bytes for the arena index.
     */
    const uint size = INDEX2SIZE(i);
    uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD;
    uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT;
    return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1);
}

For a 16-KiB pool, this allows us to (e.g.) squeeze out 6 more 16-byte
objects than we can 
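
A direct Python translation of NUMBLOCKS(), to check the counts it
produces (assuming the PR's 64-bit constants: 4 KiB system pages, a
48-byte pool header, ALIGNMENT of 16, and a 16 KiB pool, so 4 pages
per pool):

SYSTEM_PAGE_SIZE = 4096
POOL_OVERHEAD = 48
ALIGNMENT = 16
PAGES_PER_POOL = 4   # 16 KiB pool / 4 KiB pages

def numblocks(size):
    # the first page loses the pool header; later pages lose
    # ALIGNMENT bytes for the arena index
    usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD
    usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT
    return usable1 // size + (usable2 // size) * (PAGES_PER_POOL - 1)

print(numblocks(16))   # -> 1018 16-byte blocks in a 16 KiB pool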

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Inada Naoki
obmalloc is very nice at allocating small (~224 bytes) memory blocks.
But it seems current SMALL_REQUEST_THRESHOLD (512) is too large to me.

```
>>> pool_size = 4096 - 48  # 48 is pool header size
>>> for bs in range(16, 513, 16):
... n,r = pool_size//bs, pool_size%bs + 48
... print(bs, n, r, 100*r/4096)
...
16 253 48 1.171875
32 126 64 1.5625
48 84 64 1.5625
64 63 64 1.5625
80 50 96 2.34375
96 42 64 1.5625
112 36 64 1.5625
128 31 128 3.125
144 28 64 1.5625
160 25 96 2.34375
176 23 48 1.171875
192 21 64 1.5625
208 19 144 3.515625
224 18 64 1.5625
240 16 256 6.25
256 15 256 6.25
272 14 288 7.03125
288 14 64 1.5625
304 13 144 3.515625
320 12 256 6.25
336 12 64 1.5625
352 11 224 5.46875
368 11 48 1.171875
384 10 256 6.25
400 10 96 2.34375
416 9 352 8.59375
432 9 208 5.078125
448 9 64 1.5625
464 8 384 9.375
480 8 256 6.25
496 8 128 3.125
512 7 512 12.5
```

There are two problems here.

First, pool overhead is at most about 3.5% up to 224 bytes.
But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
bytes, and 12.5% at 512 bytes.

Second, some size classes have the same number of memory blocks.
Classes 272 and 288 have 14 blocks; 320 and 336 have 12 blocks.
It reduces utilization of pools.  This problem becomes bigger on
32-bit platforms.

Increasing pool size is one obvious way to fix these problems.
I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
a sweet spot for recent web servers (typically, about 32 threads, and
64GiB), but there is no evidence about it.
We need a reference application and scenario to benchmark.
pyperformance is not good for measuring memory usage of complex
applications.

```
>>> header_size = 48
>>> pool_size = 16*1024
>>> for bs in range(16, 513, 16):
... n = (pool_size - header_size) // bs
... r = (pool_size - header_size) % bs + header_size
... print(bs, n, r, 100 * r / pool_size)
...
16 1021 48 0.29296875
32 510 64 0.390625
48 340 64 0.390625
64 255 64 0.390625
80 204 64 0.390625
96 170 64 0.390625
112 145 144 0.87890625
128 127 128 0.78125
144 113 112 0.68359375
160 102 64 0.390625
176 92 192 1.171875
192 85 64 0.390625
208 78 160 0.9765625
224 72 256 1.5625
240 68 64 0.390625
256 63 256 1.5625
272 60 64 0.390625
288 56 256 1.5625
304 53 272 1.66015625
320 51 64 0.390625
336 48 256 1.5625
352 46 192 1.171875
368 44 192 1.171875
384 42 256 1.5625
400 40 384 2.34375
416 39 160 0.9765625
432 37 400 2.44140625
448 36 256 1.5625
464 35 144 0.87890625
480 34 64 0.390625
496 32 512 3.125
512 31 512 3.125
```

Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
to 256 and trusting malloc to work well for medium-size memory blocks.

-- 
Inada Naoki  


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Antoine]
> We moved from malloc() to mmap() for allocating arenas because of user
> requests to release memory more deterministically:
>
> https://bugs.python.org/issue11849

Which was a good change!  As was using VirtualAlloc() on Windows.
None of that is being disputed.  The change under discussion isn't
about mmap - mmap only incidentally gets sucked in here because it's
part of _the_ very slowest of obmalloc's paths.  Again, I'm aiming at
_all_ of obmalloc's slower paths, on all platforms, at once.  mmap
isn't my focus.

That doesn't preclude anyone who cares lots about mmap from adding
more complication to cater specifically to it.

> And given the number of people who use Python for long-running
> processes nowadays, I'm sure that they would notice (and be annoyed) if
> Python did not release memory after memory consumption spikes.

The PR changes nothing about the "release arenas" heuristic or
implementation.  There's just unquantified speculation that boosting
arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less
trivial 1 MiB, may be disastrous.  The only evidence for that so far
is repetition of the speculation ;-)

>> ...
>> We haven't been especially
>> pro-active about giant machines, and are suffering from it:
>>
>> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

> So you're definitely trying to solve a problem, right?

By my definition of "a problem", no.  I have no quantified goals or
any criteria to say "and now it's solved".  I have open-ended concerns
about how Python will fare on giant machines slinging billions of
objects, and want to make things better _before_ users are provoked to
complain.  Which they'll do anyway, of course.  I'm not trying to
change human nature ;-)

> So the question becomes: does the improvement increasing the pool
> and arena size have a negative outcome on *other* use cases?

Which is why Neil & I have been running benchmarks, tests, Python
programs we run all the time, Python programs we care about ... and
actively soliciting feedback from people actually trying the code on
programs _they_ care about.

> Not everyone has giant machines.  Actually a frequent usage model is to
> have many small VMs or containers on a medium-size machine.

Something I've never done, so am wholly unqualified to judge.  I don't
even know what "many", "small", or "medium-size" might mean in this
context.  And I don't have a giant machine either, but spent most of
my professional career in the "supercomputer" business so at least
understand how those kinds of people think ;-)


>> For example, it has to allocate at least 56 bytes of separate
>> bookkeeping info for each arena.  Nobody cares when they have 100
>> arenas, but when there
>> are a million arenas (which I've seen), that adds up.

> In relative terms, assuming that arenas are 50% full on average
> (probably a pessimistic assumption?), that overhead is 0.08% per arena
> memory used.  What point is worrying about that?

You're only looking at one cost.  Those bytes aren't just address
reservations, they consume actual physical RAM.  The bookkeeping info
is periodically read, and mutated, over time.  In aggregate, that's
more than enough bytes to blow away an entire L3 cache.  The less of
that stuff needs to be read and written (i.e., the fewer arenas there
are), the less pressure that puts on the faster layers (caches) of the
memory hierarchy.

That bookkeeping info is also immortal:  all the bookkeeping
arena_object structs live a single contiguously allocated vector.  It
may move over time (via realloc()), but can never shrink, only grow.
Asking the platform malloc/realloc for a 50 MB chunk sucks on the face
of it :-(

So while the 50M is first-order trivial when a million arenas are in
use, if the program enters a new phase releasing almost all of it,
leaving (say) only 10 arenas still active, the 50 M is still there,
effectively wasting 200 arenas' worth of "invisible" (to
_debugmallocstats()) space forever.

About typical arena usage, I expect, but don't know, that 50% is quite
pessimistic.  It's a failure of project management (starting with me)
that _every_ step in the "free arenas" evolution was driven by a user
complaining about their program, and that nothing was ever checked in
to even ensure their problem remained solved, let alone to help find
out how effective arena-releasing is in an arbitrary program.  We've
been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and
have generally been pleasantly surprised at how much of allocated
space is being put to good use.  80% seems more typical than 50% to me
based on that.

It's easy enough to contrive programs tending toward only 16 bytes in
use per arena, but nobody has ever reported anything like that.  The
aforementioned memcrunch.py program wasn't particularly contrived, but
was added to a bug report to "prove" that obmalloc's arena releasing
strategy 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Antoine Pitrou
On Sat, 15 Jun 2019 22:02:35 -0600
Neil Schemenauer  wrote:
> On 2019-06-15, Antoine Pitrou wrote:
> > We should evaluate what problem we are trying to solve here, instead
> > of staring at micro-benchmark numbers on an idle system.  
> 
> I think a change to obmalloc is not going to get accepted unless we
> can show it doesn't hurt these micro-benchmarks.  To displace the
> status quo, it has to give other advantages as well.  I don't have
> any agenda or "problem to solve".  After Tim made a PR to allow
> obmalloc to use larger pools, I thought it would be interesting to
> see if an arena mapping scheme based on radix trees would be
> performance competitive.  I'm not proposing any changes to CPython
> at this point.  I'm sharing the results of an experiment.  I thought
> it was interesting.  I guess you don't.

I'm not saying it's not interesting.  I'm just saying that you can't
validate memory allocator changes only on short-running
micro-benchmarks with well-behaved allocation patterns.

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Antoine Pitrou
On Sat, 15 Jun 2019 19:56:58 -0500
Tim Peters  wrote:
> 
> At the start, obmalloc never returned arenas to the system.  The vast
> majority of users were fine with that.  A relative few weren't.  Evan
> Jones wrote all the (considerable!) code to change that, and I
> massaged it and checked it in - not because there was "scientific
> proof" that it was more beneficial than harmful (it certainly added
> new expenses!) overall, but because it seemed like a right thing to
> do, _anticipating_ that the issue would become more important in
> coming years.
> 
> I'm still glad it was done, but no tests were checked in to _quantify_
> its presumed benefits - or even to verify that it ever returned arenas
> to the system.  Best I can tell, nobody actually has any informed idea
> how well it does.  Evan stared at programs that were important to him,
> and fiddled things until he was "happy enough".

We moved from malloc() to mmap() for allocating arenas because of user
requests to release memory more deterministically:
https://bugs.python.org/issue11849

And given the number of people who use Python for long-running
processes nowadays, I'm sure that they would notice (and be annoyed) if
Python did not release memory after memory consumption spikes.

> I've looked at obmalloc stats in other programs at various stages, and
> saw nothing concerning.  memcrunch.py appears to model object lifetimes
> as coming from a uniform distribution, but in real life they appear to
> be strongly multi-modal (with high peaks at the "way less than an eye
> blink" and "effectively immortal" ends).

I agree they will certainly be multi-modal, with the number of modes,
their respective weights and their temporal distance widely dependent on
use cases.

(the fact that they're multi-modal is the reason why generational GC is
useful, btw)

> We haven't been especially
> pro-active about giant machines, and are suffering from it:
> 
> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

So you're definitely trying to solve a problem, right?

> Fixing the underlying cause put giant machines on my radar, and
> getting rid of obmalloc's pool size limit was the next obvious thing
> that would help them (although not in the same universe as cutting
> quadratic time to linear).

"Not in the same universe", indeed.  So the question becomes: does the
improvement increasing the pool and arena size have a negative outcome
on *other* use cases?

Not everyone has giant machines.  Actually a frequent usage model is to
have many small VMs or containers on a medium-size machine.

> For example,
> it has to allocate at least 56 bytes of separate bookkeeping info for
> each arena.  Nobody cares when they have 100 arenas, but when there
> are a million arenas (which I've seen), that adds up.

In relative terms, assuming that arenas are 50% full on average
(probably a pessimistic assumption?), that overhead is 0.08% per arena
memory used.  What point is worrying about that?

> > If the problem is the cost of mmap() and munmap() calls, then the
> > solution more or less exists at the system level: jemalloc and other
> > allocators use madvise() with MADV_FREE (see here:
> > https://lwn.net/Articles/593564/).
> >
> > A possible design is a free list of arenas on which you use MADV_FREE
> > to let the system know the memory *can* be reclaimed.  When the free
> > list overflows, call munmap() on extraneous arenas.  
> 
> People can certainly pursue that if they like.  I'm not interested in
> adding more complication that helps only one of obmalloc's slowest
> paths on only one platform.

MADV_FREE is available on multiple platforms (at least Linux, macOS,
FreeBSD).  Windows seems to offer similar facilities:
https://devblogs.microsoft.com/oldnewthing/20170113-00/?p=95185

> The dead obvious, dead simple, way to reduce mmap() expense is to call
> it less often, which just requires changing a compile-time constant -
> which will also call VirtualAlloc() equally less often on Windows.

That's assuming the dominating term in mmap() cost is O(1) rather than
O(size).  That's not a given.  The system call cost is certainly O(1),
but the cost of reserving and mapping HW pages, and zeroing them out is
most certainly O(# pages).

Regards

Antoine.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Tim Peters
[Tim]
>> At the start, obmalloc never returned arenas to the system.  The vast
>> majority of users were fine with that.

[Neil]
> Yeah, I was totally fine with that back in the day.  However, I
> wonder now if there is a stronger reason to try to free memory back
> to the OS.  Years ago, people would typically have swap space that
> was as large or larger than their real RAM.  So, if the OS had to
> swap out unused pages, it wasn't a big deal.  Now that disks are
> relatively so much slower and RAM is larger, people don't have as
> much swap.  Some Linux systems get setup without any.  Freeing
> arenas seems more important than it used to be.

See my earlier "right thing to do ... anticipating" ;-)  It was clear
enough then that processor speeds were going to continue to increase
much faster than disk speeds, although the _primary_ driver at the
time was the then-small but increasing number of long-running Python
programs running big (for the time) data-crunching problems that
repeatedly went through stages of low and high memory need.


> OTOH, I don't think obmalloc should try too hard. The whole point of
> the small object allocator is to be really fast.  Anti-fragmentation
> heuristics are going to slow it down.  As far as I'm concerned, it
> works well enough as it is.

Freeing arenas _already_ slowed it down.  All the cruft to keep arenas
sorted by number of free pools was far from free.  Before that,  free
pools for a given size class were singly-linked together directly
(regardless of which arenas they were in), and we just used the head
of that list.   Arenas weren't involved at all.  Indeed, pools didn't
even point back to their arenas.

So major code and data changes were needed to implement the "take from
the non-full pool in the most heavily populated arena" heuristic.  The
other heuristic worth considering is "take from the non-full pool with
the smallest arena address", which directly strives to let the
higher-addressed arenas free up.

Since they're both just poke-&-hope, it's hard to out-think them.  But
it's possible one would consistently work better than the other, and
that even _which_ one could change depending on arena and pool sizes.
And should note that "poke-&-hope" is the best that can be hoped for,
since it's impossible to predict the lifetime of an object when space
for it is requested, and we can't move the object after its address is
returned to the caller.

More amusing:  address_in_range() didn't exist either at first.
Instead a pool header contained its own address and a "magic"
constant.  So, e.g.,

if (pool->pooladdr != pool || pool->magic != POOL_MAGIC) {
    // This ain't ours!  Pass it to the system malloc family instead.
}

That check was unlikely to believe something that _had_ come from the
system malloc really belonged to obmalloc, but proved very easy to
trick into making that error, even from pure Python code.

Good times ;-)


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Neil Schemenauer
On 2019-06-15, Tim Peters wrote:
> At the start, obmalloc never returned arenas to the system.  The vast
> majority of users were fine with that.

Yeah, I was totally fine with that back in the day.  However, I
wonder now if there is a stronger reason to try to free memory back
to the OS.  Years ago, people would typically have swap space that
was as large or larger than their real RAM.  So, if the OS had to
swap out unused pages, it wasn't a big deal.  Now that disks are
relatively so much slower and RAM is larger, people don't have as
much swap.  Some Linux systems get set up without any.  Freeing
arenas seems more important than it used to be.

OTOH, I don't think obmalloc should try too hard. The whole point of
the small object allocator is to be really fast.  Anti-fragmentation
heuristics are going to slow it down.  As far as I'm concerned, it
works well enough as it is.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Neil Schemenauer
On 2019-06-15, Antoine Pitrou wrote:
> We should evaluate what problem we are trying to solve here, instead
> of staring at micro-benchmark numbers on an idle system.

I think a change to obmalloc is not going to get accepted unless we
can show it doesn't hurt these micro-benchmarks.  To displace the
status quo, it has to give other advantages as well.  I don't have
any agenda or "problem to solve".  After Tim made a PR to allow
obmalloc to use larger pools, I thought it would be interesting to
see if an arena mapping scheme based on radix trees would be
performance competitive.  I'm not proposing any changes to CPython
at this point.  I'm sharing the results of an experiment.  I thought
it was interesting.  I guess you don't.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Tim Peters
[Tim. to Neil]
>> Moving to bigger pools and bigger arenas are pretty much no-brainers
>> for us, [...]

[Antoine]
> Why "no-brainers"?

We're running tests, benchmarks, the Python programs we always run,
Python programs that are important to us, staring at obmalloc stats
... and seeing nothing bad, nothing mysterious, only neutral-to-good
results.  So what's to think about? ;-)  These are 64-bit boxes, with
terabytes of virtual address space.  Asking for a meg of that is just
reserving less than a thousandth of a thousandth of that range of
integers, not actual physical RAM.

>  Bigger pools sound ok,

They're not necessarily independent choices.  Increase pool size
without increasing arena size, and the number of pools per arena
falls.  At the extreme, if pools were made the same size as arenas,
we'd need at least 32 arenas just to start Python (which uses at least
one pool of _every_ possible size class before you hit the prompt -
note that, on 64-bit boxes, the number of possible size classes is
falling from 64 (3.7) to 32 (3.8), due to some desire to make
everything aligned to 16 bytes - which I've already seen account for
some programs needing 100s of MB more RAM).

> but bigger arenas will make Python less likely to return memory to the system.

I know that gets repeated a lot, and I usually play along - but why do
people believe that?  Where's the evidence?

At the start, obmalloc never returned arenas to the system.  The vast
majority of users were fine with that.  A relative few weren't.  Evan
Jones wrote all the (considerable!) code to change that, and I
massaged it and checked it in - not because there was "scientific
proof" that it was more beneficial than harmful (it certainly added
new expenses!) overall, but because it seemed like a right thing to
do, _anticipating_ that the issue would become more important in
coming years.

I'm still glad it was done, but no tests were checked in to _quantify_
its presumed benefits - or even to verify that it ever returned arenas
to the system.  Best I can tell, nobody actually has any informed idea
how well it does.  Evan stared at programs that were important to him,
and fiddled things until he was "happy enough".

Not everyone was.  About five years ago, Kristján Valur Jónsson opened
this report:

https://bugs.python.org/issue21220

suggesting a very different heuristic to try to free arenas.  The
"memcrunch.py" in his patch is the only time I've ever seen anyone
write code trying to measure whether obmalloc's arena-freeing is
effective.

I can verify that if you increase the number of objects in his script
by a factor of 100, my PR _never_ returns an arena to the system.  But
it's clear as mud to me whether his heuristic would, either (with the
100x smaller number of objects in the original script, the PR branch
does recycle arenas).

So that's the only objective evidence I have :-)

I've looked at obmalloc stats in other programs at various stages, and
saw nothing concerning.  memcrunch.py appears to model object lifetimes
as coming from a uniform distribution, but in real life they appear to
be strongly multi-modal (with high peaks at the "way less than an eye
blink" and "effectively immortal" ends).


> We should evaluate what problem we are trying to solve here,

I'm not trying to solve a problem.  This is a "right thing to do"
thing, anticipating that slinging a massive number of objects on
massive machines will become ever more important, and that changing
20-year-old constants will allow obmalloc to spend more time in its
fastest paths instead of its slowest.  We haven't been especially
pro-active about giant machines, and are suffering from it:

https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/
"""
Update: Here is a “fun” Python performance bug that I ran into the
other day: deleting a set of 1 billion strings takes >12 hours.
Obviously, this particular instance can be fixed, but this exactly the
sort of thing that I would never have done a few years ago. A billion
strings seemed like a lot back then, but now we regularly discuss
multiple Terabytes of input data as “not a big deal”. This may not
apply for your settings, but it does for mine.
"""

That was _probably_ due to obmalloc's move-one-at-a-time way of
keeping its usable arenas list sorted, which sat undiagnosed for over
a year.

https://bugs.python.org/issue32846

Fixing the underlying cause put giant machines on my radar, and
getting rid of obmalloc's pool size limit was the next obvious thing
that would help them (although not in the same universe as cutting
quadratic time to linear).

> instead of staring at micro-benchmark numbers on an idle system.

My only interest in those is that they're not slowing down, because
that's important too.  The aim here is much more to make life better
for programs slinging millions - even billions - of objects.
obmalloc's internal overheads are frugal, but not free.  For example,
it has to allocate at least 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Antoine Pitrou
On Sat, 15 Jun 2019 01:15:11 -0500
Tim Peters  wrote:
> 
> > ...
> > My feeling right now is that Tim's obmalloc-big-pool is the best
> > design at this point.  Using 8 KB or 16 KB pools seems to be better
> > than 4 KB.  The extra complexity added by Tim's change is not so
> > nice.  obmalloc is already extremely subtle and obmalloc-big-pool
> > makes it more so.  
> 
> Moving to bigger pools and bigger arenas are pretty much no-brainers
> for us, [...]

Why "no-brainers"?  Bigger pools sound ok, but bigger arenas will make
Python less likely to return memory to the system.

We should evaluate what problem we are trying to solve here, instead
of staring at micro-benchmark numbers on an idle system.
Micro-benchmarks don't tell you what happens on a loaded system with
many processes, lots of I/O happening.

If the problem is the cost of mmap() and munmap() calls, then the
solution more or less exists at the system level: jemalloc and other
allocators use madvise() with MADV_FREE (see here:
https://lwn.net/Articles/593564/).

A possible design is a free list of arenas on which you use MADV_FREE
to let the system know the memory *can* be reclaimed.  When the free
list overflows, call munmap() on extraneous arenas.
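
For illustration, a minimal sketch of that design in C - the
FREELIST_CAP, the arena struct, and the helper name are made up here,
and the reuse path (popping arenas off the list at allocation time) is
omitted:

    #include <stddef.h>
    #include <sys/mman.h>

    #define ARENA_SIZE   (256u * 1024u)
    #define FREELIST_CAP 16            /* illustrative cap */

    typedef struct arena { struct arena *next; void *base; } arena_t;

    static arena_t *free_arenas;
    static size_t nfree;

    static void
    release_arena(arena_t *a)
    {
        if (nfree < FREELIST_CAP) {
            /* Keep the mapping, but tell the kernel it *can* reclaim
               the pages under pressure (MADV_FREE: Linux 4.5+). */
            madvise(a->base, ARENA_SIZE, MADV_FREE);
            a->next = free_arenas;
            free_arenas = a;
            nfree++;
        }
        else {
            /* Free list full: really unmap the extraneous arena. */
            munmap(a->base, ARENA_SIZE);
        }
    }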

Regards

Antoine.



[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Tim Peters
[Neil Schemenauer ]
> ...
> BTW, the current radix tree doesn't even require that pools are
> aligned to POOL_SIZE.  We probably want to keep pools aligned
> because other parts of obmalloc rely on that.

obmalloc relies on it heavily.  Another radix tree could map block
addresses to all the necessary info in a current pool header, but that
could become gigantic.  For example, even a current "teensy" 256 KiB
arena can be carved into on the order of 16K 16-byte blocks (the
smallest size class).  And finding a pool header now is unbeatably
cheap:  just clear the last 12 address bits, and you're done.
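
In sketch form, assuming 4 KiB pools (this mirrors obmalloc's
POOL_ADDR idea):

    #include <stdint.h>

    #define POOL_BITS 12                 /* 4 KiB pools */
    #define POOL_SIZE (1u << POOL_BITS)

    /* The pool header lives at the pool-aligned base, so finding it
       from any block address is one mask - no loads at all. */
    static void *
    pool_header_of(const void *block)
    {
        return (void *)((uintptr_t)block & ~(uintptr_t)(POOL_SIZE - 1));
    }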


> Here is the matchup of the radix tree vs the current
> address_in_range() approach.
>
> - nearly the same in terms of performance.  It might depend on OS
>   and workload but based on my testing on Linux, they are very
>   close.  Would be good to do more testing but I think the radix
>   tree is not going to be faster, only slower.

People should understand that the only point to these things is to
determine whether a pointer passed to a free() or realloc() spelling
was obtained from an obmalloc pool or from the system malloc family.
So they're invoked once near the very starts of those two functions,
and that's all.

Both ways are much more expensive than finding a pool header (which is
just clearing some trailing address bits).

The current way reads an "arena index" out of the pool header, uses
that to index the file static `arenas` vector to get a pointer to an
arena descriptor, then reads the arena base address out of the
descriptor.  That's used to determine whether the original address is
contained in the arena.  The primary change my PR makes is to read the
arena index from the start of the _page_ the address belongs to
instead (the pool header address is irrelevant to this, apart from
that a pool header is aligned to the first page in a pool).
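
Condensed into a sketch (simplified declarations added so it stands
alone; the struct layouts here are illustrative, not the real ones):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define ARENA_SIZE (256u << 10)      /* current 256 KiB arenas */
    #define POOL_SIZE  4096u

    typedef unsigned int uint;
    typedef struct pool_header { uint arenaindex; /* ... */ } *poolp;
    struct arena_object { uintptr_t address; /* ... */ };
    static struct arena_object *arenas;  /* file static vector */
    static uint maxarenas;

    /* Read a (possibly garbage) arena index from the pool header,
       then test whether p really lies inside that arena.  Unsigned
       underflow makes the subtraction a containment check. */
    static bool
    address_in_range(void *p, poolp pool)
    {
        uint arenaindex = *((volatile uint *)&pool->arenaindex);
        return arenaindex < maxarenas &&
               (uintptr_t)p - arenas[arenaindex].address < ARENA_SIZE &&
               arenas[arenaindex].address != 0;
    }

    /* Where it sits: near the top of the free()/realloc() spellings. */
    static void
    sketch_free(void *p)
    {
        poolp pool = (poolp)((uintptr_t)p & ~(uintptr_t)(POOL_SIZE - 1));
        if (address_in_range(p, pool)) {
            /* ... return the block to its obmalloc pool ... */
        }
        else
            free(p);                     /* the system allocator's */
    }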

The radix tree picks bits out of the address three times to index into
a 3-level (but potentially broad) tree, ending with a node containing
info about the only two possible arenas the original address may
belong to.  Then that info is used to check.
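
For illustration only - the level widths, node layout, and names below
are invented for the sketch (64-bit addresses assumed), not the
constants in Neil's branch:

    #include <stddef.h>
    #include <stdint.h>

    #define B1 12         /* top-level index bits */
    #define B2 12         /* mid-level index bits */
    #define B3 16         /* leaf-level index bits */
    #define IGNORE 24     /* low bits: offset within a mapped chunk */

    /* Each leaf names the (at most) two arenas that may cover its
       address range; the final step checks p against those. */
    typedef struct { int32_t arena1, arena2; } leaf_t;
    typedef struct { leaf_t   leaves[1u << B3]; } node3_t;
    typedef struct { node3_t *kids[1u << B2]; } node2_t;
    static node2_t *root[1u << B1];

    static const leaf_t *
    arena_map_lookup(const void *p)
    {
        uintptr_t a = (uintptr_t)p;
        unsigned i1 = (unsigned)(a >> (B2 + B3 + IGNORE)) & ((1u << B1) - 1);
        unsigned i2 = (unsigned)(a >> (B3 + IGNORE)) & ((1u << B2) - 1);
        unsigned i3 = (unsigned)(a >> IGNORE) & ((1u << B3) - 1);
        node2_t *n2 = root[i1];
        node3_t *n3 = n2 ? n2->kids[i2] : NULL;
        return n3 ? &n3->leaves[i3] : NULL;
    }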

The number of loads is essentially the same, but the multiple levels
of indexing in the tree are a teensy bit more expensive because they
require more bit-fiddling.  I spent hours, in all, dreaming up a way
to make the _seemingly_ more complex final "so is the address in one
of those two arenas or not?" check about as cheap as the current way.
But Neil didn't see any significant timing difference after making
that change, which was mildly disappointing but not really surprising:
arithmetic is just plain cheap compared to reading up memory.

> - radix tree uses a bit more memory overhead.  Maybe 1 or 2 MiB on a
>   64-bit OS.  The radix tree uses more as memory use goes up but it
>   is a small fraction of total used memory.  The extra memory use is
>   the main downside at this point, I think.

I'd call it the only downside.  But nobody yet has quantified how bad
it can get.


> - the radix tree doesn't read uninitialized memory.  The current
>   address_in_range() approach has worked very well but is relying on
>   some assumptions about the OS (how it maps pages into the program
>   address space).  This is the only aspect where the radix tree is
>   clearly better.  I'm not sure this matters enough to offset the
>   extra memory use.

I'm not worried about that.  The only real assumption here is that if
an OS supports some notion of "pages" at all, then for any address for
which the program has read/write access (which are the only kinds of
addresses that can be sanely passed to free/realloc), the OS allows
the same access to the entire page containing that address.

In two decades we haven't seen an exception to that yet, right?  It's
hard to imagine a HW designer thinking "I know!  Let's piss away more
transistors on finer-grained control nobody has asked for, and slow
down memory operations even more checking for that." ;-)


> - IMHO, the radix tree code is a bit simpler than Tim's
>   obmalloc-big-pool code.

Absolutely so.  There's another way to look at this:  if Vladimir
Marangozov (obmalloc's original author) had used an arena radix tree
from the start, would someone now get anywhere proposing a patch to
change it to the current scheme?  I'd expect a blizzard of -1 votes,
starting with mine ;-)

> ...
> My feeling right now is that Tim's obmalloc-big-pool is the best
> design at this point.  Using 8 KB or 16 KB pools seems to be better
> than 4 KB.  The extra complexity added by Tim's change is not so
> nice.  obmalloc is already extremely subtle and obmalloc-big-pool
> makes it more so.

Moving to bigger pools and bigger arenas are pretty much no-brainers
for us, but unless pool size is increased there's no particular reason
to pursue either approach - "ain't broke, don't fix".

Larry Hastings started a "The untuned tunable parameter ARENA_SIZE"
thread here about two years ago, where he got a blizzard of 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Neil Schemenauer
Here are benchmark results for 64 MB arenas and 16 kB pools.  I ran
without the --fast option and on a Linux machine in single user
mode.  The "base" column is the obmalloc-big-pools branch with
ARENA_SIZE = 64 MB and POOL_SIZE = 16 kB.  The "radix" column is
obmalloc_radix_tree (commit 5e00f6041) with the same arena and pool
sizes.

+-+-+-+
| Benchmark   | base (16kB/64MB)| radix (16KB/64MB)   |
+=+=+=+
| 2to3| 290 ms  | 292 ms: 1.00x slower (+0%)  |
+-+-+-+
| crypto_pyaes| 114 ms  | 116 ms: 1.02x slower (+2%)  |
+-+-+-+
| django_template | 109 ms  | 106 ms: 1.03x faster (-3%)  |
+-+-+-+
| dulwich_log | 75.2 ms | 74.5 ms: 1.01x faster (-1%) |
+-+-+-+
| fannkuch| 454 ms  | 449 ms: 1.01x faster (-1%)  |
+-+-+-+
| float   | 113 ms  | 111 ms: 1.01x faster (-1%)  |
+-+-+-+
| hexiom  | 9.45 ms | 9.47 ms: 1.00x slower (+0%) |
+-+-+-+
| json_dumps  | 10.6 ms | 11.1 ms: 1.04x slower (+4%) |
+-+-+-+
| json_loads  | 24.4 us | 25.2 us: 1.03x slower (+3%) |
+-+-+-+
| logging_simple  | 8.19 us | 8.37 us: 1.02x slower (+2%) |
+-+-+-+
| mako| 15.1 ms | 15.1 ms: 1.01x slower (+1%) |
+-+-+-+
| meteor_contest  | 98.3 ms | 97.1 ms: 1.01x faster (-1%) |
+-+-+-+
| nbody   | 142 ms  | 140 ms: 1.02x faster (-2%)  |
+-+-+-+
| nqueens | 93.8 ms | 93.0 ms: 1.01x faster (-1%) |
+-+-+-+
| pickle  | 8.89 us | 8.85 us: 1.01x faster (-0%) |
+-+-+-+
| pickle_dict | 17.9 us | 18.2 us: 1.01x slower (+1%) |
+-+-+-+
| pickle_list | 2.68 us | 2.64 us: 1.01x faster (-1%) |
+-+-+-+
| pidigits| 182 ms  | 184 ms: 1.01x slower (+1%)  |
+-+-+-+
| python_startup_no_site  | 5.31 ms | 5.33 ms: 1.00x slower (+0%) |
+-+-+-+
| raytrace| 483 ms  | 476 ms: 1.02x faster (-1%)  |
+-+-+-+
| regex_compile   | 167 ms  | 169 ms: 1.01x slower (+1%)  |
+-+-+-+
| regex_dna   | 170 ms  | 171 ms: 1.01x slower (+1%)  |
+-+-+-+
| regex_effbot| 2.70 ms | 2.75 ms: 1.02x slower (+2%) |
+-+-+-+
| regex_v8| 21.1 ms | 21.3 ms: 1.01x slower (+1%) |
+-+-+-+
| scimark_fft | 368 ms  | 371 ms: 1.01x slower (+1%)  |
+-+-+-+
| scimark_monte_carlo | 103 ms  | 101 ms: 1.02x faster (-2%)  |
+-+-+-+
| scimark_sparse_mat_mult | 4.31 ms | 4.27 ms: 1.01x faster (-1%) |
+-+-+-+
| spectral_norm   | 131 ms  | 135 ms: 1.03x slower (+3%)  |

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Neil Schemenauer
On 2019-06-14, Tim Peters wrote:
> However, last I looked there Neil was still using 4 KiB obmalloc
> pools, all page-aligned.  But using much larger arenas (16 MiB, 16
> times bigger than my branch, and 64 times bigger than Python currently
> uses).

I was testing it versus your obmalloc-big-pool branch and trying to
make it a fair comparison.  You are correct: 4 KiB pools and 16 MiB
arenas.  Maybe I should test with 16 KiB pools and 16 MiB arenas.
That seems a more optimized setting for current machines and
workloads.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Neil Schemenauer
On 2019-06-15, Inada Naoki wrote:
> Oh, do you mean your branch doesn't have headers in each page?

That's right.  Each pool still has a header but pools can be larger
than the page size.  Tim's obmalloc-big-pool idea writes something
to the head of each page within a pool.  The radix tree doesn't need
that and actually doesn't care about OS page size.

BTW, the current radix tree doesn't even require that pools are
aligned to POOL_SIZE.  We probably want to keep pools aligned
because other parts of obmalloc rely on that.

Here is the matchup of the radix tree vs the current
address_in_range() approach.

- nearly the same in terms of performance.  It might depend on OS
  and workload but based on my testing on Linux, they are very
  close.  Would be good to do more testing but I think the radix
  tree is not going to be faster, only slower.

- radix tree uses a bit more memory overhead.  Maybe 1 or 2 MiB on a
  64-bit OS.  The radix tree uses more as memory use goes up but it
  is a small fraction of total used memory.  The extra memory use is
  the main downside at this point, I think.

- the radix tree doesn't read uninitialized memory.  The current
  address_in_range() approach has worked very well but is relying on
  some assumptions about the OS (how it maps pages into the program
  address space).  This is the only aspect where the radix tree is
  clearly better.  I'm not sure this matters enough to offset the
  extra memory use.

- IMHO, the radix tree code is a bit simpler than Tim's
  obmalloc-big-pool code.  That's not a big deal though as long as
  the code works and is well commented (which Tim's code is).

My feeling right now is that Tim's obmalloc-big-pool is the best
design at this point.  Using 8 KB or 16 KB pools seems to be better
than 4 KB.  The extra complexity added by Tim's change is not so
nice.  obmalloc is already extremely subtle and obmalloc-big-pool
makes it more so.

Regards,

Neil


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Tim Peters
[Inada Naoki, to Neil S]
> Oh, do you mean your branch doesn't have headers in each page?

That's probably right ;-)  Neil is using a new data structure, a radix
tree implementing a sparse set of arena addresses.  Within obmalloc
pools, which can be of any multiple-of-4KiB (on a 64-bit box) size,
every byte beyond the pool header is usable for user data.  In my
patch, there is no new data structure, but it needs to store an "arena
index" at the start of every page (every 4K bytes) within a pool.

I certainly _like_ Neil's code better.  It's clean rather than
excruciatingly tricky.  The question is whether that's enough to
justify the memory burden of an additional data structure (which can
potentially grow very large).  So I've been working with Neil to see
whether it's possible to make it faster than my branch, to give it
another selling point people actually care about ;-)

Should also note that Neil's approach never needs to read
uninitialized memory, so we could throw away decades of (e.g.)
valgrind pain caused by the current approach (which my patch builds
on).


> https://bugs.python.org/issue32846
>
> As far as I remember, this bug was caused by cache thrashing (page
> header is aligned by 4K, so cache line can conflict often.)
> Or this bug can be caused by O(N) free() which is fixed already.

I doubt that report is relevant here, but anyone is free to try it
with Neil's branch.

https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

However, last I looked there Neil was still using 4 KiB obmalloc
pools, all page-aligned.  But using much larger arenas (16 MiB, 16
times bigger than my branch, and 64 times bigger than Python currently
uses).

But the `O(N) free()` fix may very well be relevant.  To my eyes,
while there was plenty of speculation in that bug report, nobody
actually dug in deep to nail a specific cause.  A quick try just now
on my branch (which includes the `O(N) free()` fix) on Terry Reedy's
simple code in that report shows much improved behavior, until I run
out of RAM.

For example, roughly 4.3 seconds to delete 40 million strings in a
set, and 9.1 to delete 80 million.  Doubling the strings took about
2.1x the time - not really linear, but very far from the ~4x that
quadratic behavior would predict.  In contrast, Terry saw nearly a
quadrupling of delete time when moving from 32M to 64M strings.

So more than one thing was going on there, but looks likely that the
major pain was caused by quadratic-time arena list sorting.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Inada Naoki
Oh, do you mean your branch doesn't have headers in each page?

https://bugs.python.org/issue32846

As far as I remember, this bug was caused by cache thrashing (the
page header is aligned to 4K, so cache lines can conflict often).
Or this bug may have been caused by the O(N) free() which is fixed
already.

I'll look into it next week.

On Sat, Jun 15, 2019 at 3:54 AM Neil Schemenauer  wrote:
>
> I've been working on this idea for a couple of days.  Tim Peters has
> being helping me out and I think it has come far enough to get some
> more feedback.  It is not yet a good replacement for the current
> address_in_range() test.  However, performance wise, it is very
> close.  Tim figures we are not done optimizing it yet so maybe it
> will get better.
>
> Code is available on my github branch:
>
> https://github.com/nascheme/cpython/tree/obmalloc_radix_tree
>
> Tim's "obmalloc-big-pools" is what I have been comparing it to.  It
> seems 8 KB pools are faster than 4 KB.  I applied Tim's arena
> trashing fix (bpo-37257) to both branches.  Some rough (--fast)
> pyperformance benchmark results are below.
>
>
> +-+-+-+
> | Benchmark   | obmalloc-big-pools  | obmalloc_radix  |
> +=+=+=+
> | crypto_pyaes| 168 ms  | 170 ms: 1.01x slower (+1%) |
> +-+-+-+
> | hexiom  | 13.7 ms | 13.6 ms: 1.01x faster (-1%) |
> +-+-+-+
> | json_dumps  | 15.9 ms | 15.6 ms: 1.02x faster (-2%) |
> +-+-+-+
> | json_loads  | 36.9 us | 37.1 us: 1.01x slower (+1%) |
> +-+-+-+
> | meteor_contest  | 141 ms  | 139 ms: 1.02x faster (-2%) |
> +-+-+-+
> | nqueens | 137 ms  | 140 ms: 1.02x slower (+2%) |
> +-+-+-+
> | pickle_dict | 26.2 us | 25.9 us: 1.01x faster (-1%) |
> +-+-+-+
> | pickle_list | 3.91 us | 3.94 us: 1.01x slower (+1%) |
> +-+-+-+
> | python_startup_no_site  | 8.00 ms | 7.78 ms: 1.03x faster (-3%) |
> +-+-+-+
> | regex_dna   | 246 ms  | 241 ms: 1.02x faster (-2%) |
> +-+-+-+
> | regex_v8| 29.6 ms | 30.0 ms: 1.01x slower (+1%) |
> +-+-+-+
> | richards| 93.9 ms | 92.7 ms: 1.01x faster (-1%) |
> +-+-+-+
> | scimark_fft | 525 ms  | 531 ms: 1.01x slower (+1%) |
> +-+-+-+
> | scimark_sparse_mat_mult | 6.32 ms | 6.24 ms: 1.01x faster (-1%) |
> +-+-+-+
> | spectral_norm   | 195 ms  | 198 ms: 1.02x slower (+2%) |
> +-+-+-+
> | sqlalchemy_imperative   | 49.5 ms | 50.5 ms: 1.02x slower (+2%) |
> +-+-+-+
> | sympy_expand| 691 ms  | 695 ms: 1.01x slower (+1%) |
> +-+-+-+
> | unpickle_list   | 5.09 us | 5.32 us: 1.04x slower (+4%) |
> +-+-+-+
> | xml_etree_parse | 213 ms  | 215 ms: 1.01x slower (+1%) |
> +-+-+-+
> | xml_etree_generate  | 134 ms  | 136 ms: 1.01x slower (+1%) |
> +-+-+-+
> | xml_etree_process   | 103 ms  | 104 ms: 1.01x slower (+1%) |
> +-+-+-+
>
> Not significant (34): 2to3; chameleon; chaos; deltablue;
> django_template; dulwich_log; fannkuch; float; go; html5lib;
>