Re: shrinking and growing reallocs: a theoretical? bad case for performance
On Tue, Sep 01, 2020 at 11:56:36AM +0100, Stuart Henderson wrote: > On 2020/08/31 08:39, Otto Moerbeek wrote: > > A question from Theo made me think about realloc and come up with a > > particular bad case for performance. I do not know if it happens in > > practice, but it was easy to create a test program to hit the case. > > Not very scientific testing (a single attempt at building one port), but > this seems to help quite a lot when compiling programs written in rust. > I encourage others to test the diff :-) > It turned out this particular case was a fluke. But I'm still very interested in cases where it does matter and tests in general as well. -Otto
Re: shrinking and growing reallocs: a theoretical? bad case for performance
On 2020/08/31 08:39, Otto Moerbeek wrote: > A question from Theo made me think about realloc and come up with a > particular bad case for performance. I do not know if it happens in > practice, but it was easy to create a test program to hit the case. Not very scientific testing (a single attempt at building one port), but this seems to help quite a lot when compiling programs written in rust. I encourage others to test the diff :-)
Re: shrinking and growing reallocs: a theoretical? bad case for performance
On Mon, Aug 31, 2020 at 11:25:51AM -0600, Theo de Raadt wrote: > > Taking advantage of the sparse address space is smart and as 64-bit > > is now the norm, that space is even sparser. > > Fundamentally this is moving various forms of pressure to the kernel, > which does not do the best job yet. This effect is reduced by making small shrinks a no-op. > > The pivot code in mmap for new mappings isn't entirely bug-free so we've > avoided it turning it on. The idea of that code is be random as > neccessary -- creating "unknowable addresses", but in doing so avoid > fragmenting the address space excessively. Excessive fragmentation in turn > fragmentations allocation in multi-level page-tables, and that in turn > results in excessive TLB pressure. Which is difficult to gauge since things > keep working, but brings in a big performance cost. > > Basically we were brave to do very high amounts of randomization early on. > At a cost. But our work to improve the cost isn't finished.
Re: shrinking and growing reallocs: a theoretical? bad case for performance
> Taking advantage of the sparse address space is smart and as 64-bit > is now the norm, that space is even sparser. Fundamentally this is moving various forms of pressure to the kernel, which does not do the best job yet. The pivot code in mmap for new mappings isn't entirely bug-free so we've avoided it turning it on. The idea of that code is be random as neccessary -- creating "unknowable addresses", but in doing so avoid fragmenting the address space excessively. Excessive fragmentation in turn fragmentations allocation in multi-level page-tables, and that in turn results in excessive TLB pressure. Which is difficult to gauge since things keep working, but brings in a big performance cost. Basically we were brave to do very high amounts of randomization early on. At a cost. But our work to improve the cost isn't finished.
Re: shrinking and growing reallocs: a theoretical? bad case for performance
On Mon, 31 Aug 2020 08:39:56 +0200, Otto Moerbeek wrote: > 1. I do not put high pages of shrinking reallocs into to cache, but > directly unmap. > > 2. For small shrinking reallocs realloc become a no-op. Pro: no > syscalls at all, cons: the actual allocation is larger, so less > overflow detection. So I do not do this if guard pages are active or > the reduction is larger than the cache size. I think this strikes a good balance. It is a common pattern to preallocate an estimated amount and then use realloc() to shrink it once the actual data has been stored. If that buffer is later enlarged, you can get the kind of behavior your test program exhibits. Taking advantage of the sparse address space is smart and as 64-bit is now the norm, that space is even sparser. - todd
Re: shrinking and growing reallocs: a theoretical? bad case for performance
> If I am interpreting this correctly, realloc could be used to groom/shape > the heap such that future allocations are less random and more predictable? Traditionally the word "heap" refers to the zone of allocations in a sbrk allocator, meaning things are packed tightly together in a known space, and ordering of the objects inside that produces very low variability. I recommend against using the word heap, especially when today we are using large-address space systems. Additionally I think this phrasing forgets there are many many objects in play, not just the ones being realloc'd. Those objects disrupt the available space by being allocated and freed. Object allocation isn't entirely controlled by the (small) malloc cache. I guess the theory is that an attacker will succeed because a few realloc'd objects don't 'relocate' as much as expected. I don't believe this is likely. I think we have placed a reasonable number of hurdles at various levels with an eye on compute cost... we recognize if we reject standard compsci "caching strategies" too much, then perforance still stink.
Re: shrinking and growing reallocs: a theoretical? bad case for performance
On Mon, Aug 31, 2020 at 08:28:25AM -0400, David Higgs wrote: > On Mon, Aug 31, 2020 at 2:41 AM Otto Moerbeek wrote: > > > Hi, > > > > A question from Theo made me think about realloc and come up with a > > particular bad case for performance. I do not know if it happens in > > practice, but it was easy to create a test program to hit the case. > > > > We're talking allocation >= a page here. Smaller allocation follow > > different rules. > > > > If an allocation is grown by realloc, I first tries to extend the > > allocation by mapping pages next to the existing allocation. Since > > the location of pages is randomized, chanches are high that next to an > > allocation there are unmapped pages so the grow will work out. > > > > If realloc needs to shrink the allocation it puts the high pages no > > longer needed in the malloc cache. There they can be re-used by other > > allocations. But if that happens, next a grow of first allocation will > > fail: the pages are already mapped. So realloc needs to do a new > > allocation followed by a copy and a cleanup of the original. > > > > So this strategy of a shrinking realloc to of put unneeded pages into > > the cache can work against us, plus it has the consequence that use of > > realloc leads to allocations close to each other: no free guard pages. > > > > If I am interpreting this correctly, realloc could be used to groom/shape > the heap such that future allocations are less random and more predictable? > > --david In a way yes, but that's a consequence of caching pages: new allocations will come from the cache if possible. But with this diff there are less possibilities. Also note that malloc option S disables the cache. -Otto
Re: shrinking and growing reallocs: a theoretical? bad case for performance
On Mon, Aug 31, 2020 at 2:41 AM Otto Moerbeek wrote: > Hi, > > A question from Theo made me think about realloc and come up with a > particular bad case for performance. I do not know if it happens in > practice, but it was easy to create a test program to hit the case. > > We're talking allocation >= a page here. Smaller allocation follow > different rules. > > If an allocation is grown by realloc, I first tries to extend the > allocation by mapping pages next to the existing allocation. Since > the location of pages is randomized, chanches are high that next to an > allocation there are unmapped pages so the grow will work out. > > If realloc needs to shrink the allocation it puts the high pages no > longer needed in the malloc cache. There they can be re-used by other > allocations. But if that happens, next a grow of first allocation will > fail: the pages are already mapped. So realloc needs to do a new > allocation followed by a copy and a cleanup of the original. > > So this strategy of a shrinking realloc to of put unneeded pages into > the cache can work against us, plus it has the consequence that use of > realloc leads to allocations close to each other: no free guard pages. > If I am interpreting this correctly, realloc could be used to groom/shape the heap such that future allocations are less random and more predictable? --david
shrinking and growing reallocs: a theoretical? bad case for performance
Hi, A question from Theo made me think about realloc and come up with a particular bad case for performance. I do not know if it happens in practice, but it was easy to create a test program to hit the case. We're talking allocation >= a page here. Smaller allocation follow different rules. If an allocation is grown by realloc, I first tries to extend the allocation by mapping pages next to the existing allocation. Since the location of pages is randomized, chanches are high that next to an allocation there are unmapped pages so the grow will work out. If realloc needs to shrink the allocation it puts the high pages no longer needed in the malloc cache. There they can be re-used by other allocations. But if that happens, next a grow of first allocation will fail: the pages are already mapped. So realloc needs to do a new allocation followed by a copy and a cleanup of the original. So this strategy of a shrinking realloc to of put unneeded pages into the cache can work against us, plus it has the consequence that use of realloc leads to allocations close to each other: no free guard pages. The program below tests this scenario and runs awfully slow. The diff fixes this by applying two strategies. The first already makes a huge difference, but the second strategy will also reduce the total number of syscalls at the cost of some more memory use. 1. I do not put high pages of shrinking reallocs into to cache, but directly unmap. 2. For small shrinking reallocs realloc become a no-op. Pro: no syscalls at all, cons: the actual allocation is larger, so less overflow detection. So I do not do this if guard pages are active or the reduction is larger than the cache size. Some stats, First run is -current, second one is with (an earlier version of) the diff on an armv7 machine. Other systems also show huge differences. [otto@wand:19]$ time ./a.out 0m31.68s real 0m10.02s user 0m21.65s system [otto@wand:33]$ time ./a.out 0m00.16s real 0m00.12s user 0m00.03s system I do not see any diffference for builds. But I cna imagine real-life programs hitting the case. -Otto #include #include #include void *p; size_t psz; #define E(x) if ((x) == NULL) err(1, NULL) void f(void) { int i; void *s[64]; p = realloc(p, 1023*psz); E(p); for (i = 0; i < 64; i++) { s[i] = malloc(psz); E(s[i]); } p = realloc(p, 1024*psz); E(p); for (i = 0; i < 64; i++) free(s[i]); } int main() { int i; psz = getpagesize(); p = malloc(1024*psz); E(p); for (i = 0; i < 1000; i++) f(); } Index: malloc.c === RCS file: /cvs/src/lib/libc/stdlib/malloc.c,v retrieving revision 1.262 diff -u -p -r1.262 malloc.c --- malloc.c28 Jun 2019 13:32:42 - 1.262 +++ malloc.c31 Aug 2020 06:01:40 - @@ -728,28 +728,8 @@ unmap(struct dir_info *d, void *p, size_ wrterror(d, "malloc cache overflow"); } -static void -zapcacheregion(struct dir_info *d, void *p, size_t len) -{ - u_int i; - struct region_info *r; - size_t rsz; - - for (i = 0; i < d->malloc_cache; i++) { - r = >free_regions[i]; - if (r->p >= p && r->p <= (void *)((char *)p + len)) { - rsz = r->size << MALLOC_PAGESHIFT; - if (munmap(r->p, rsz)) - wrterror(d, "munmap %p", r->p); - r->p = NULL; - d->free_regions_size -= r->size; - STATS_SUB(d->malloc_used, rsz); - } - } -} - static void * -map(struct dir_info *d, void *hint, size_t sz, int zero_fill) +map(struct dir_info *d, size_t sz, int zero_fill) { size_t psz = sz >> MALLOC_PAGESHIFT; struct region_info *r, *big = NULL; @@ -762,7 +742,7 @@ map(struct dir_info *d, void *hint, size if (sz != PAGEROUND(sz)) wrterror(d, "map round"); - if (hint == NULL && psz > d->free_regions_size) { + if (psz > d->free_regions_size) { _MALLOC_LEAVE(d); p = MMAP(sz, d->mmap_flag); _MALLOC_ENTER(d); @@ -774,8 +754,6 @@ map(struct dir_info *d, void *hint, size for (i = 0; i < d->malloc_cache; i++) { r = >free_regions[(i + d->rotor) & (d->malloc_cache - 1)]; if (r->p != NULL) { - if (hint != NULL && r->p != hint) - continue; if (r->size == psz) { p = r->p; r->p = NULL; @@ -807,8 +785,6 @@ map(struct dir_info *d, void *hint, size memset(p, SOME_FREEJUNK, sz); return p; } - if (hint != NULL) - return