Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-09-02 Thread Otto Moerbeek
On Tue, Sep 01, 2020 at 11:56:36AM +0100, Stuart Henderson wrote:

> On 2020/08/31 08:39, Otto Moerbeek wrote:
> > A question from Theo made me think about realloc and come up with a
> > particular bad case for performance. I do not know if it happens in
> > practice, but it was easy to create a test program to hit the case.
> 
> Not very scientific testing (a single attempt at building one port), but
> this seems to help quite a lot when compiling programs written in rust.
> I encourage others to test the diff :-)
> 

It turned out this particular case was a fluke. But I'm still very
interested in cases where it does matter and tests in general as well.

-Otto



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-09-01 Thread Stuart Henderson
On 2020/08/31 08:39, Otto Moerbeek wrote:
> A question from Theo made me think about realloc and come up with a
> particular bad case for performance. I do not know if it happens in
> practice, but it was easy to create a test program to hit the case.

Not very scientific testing (a single attempt at building one port), but
this seems to help quite a lot when compiling programs written in rust.
I encourage others to test the diff :-)



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Otto Moerbeek
On Mon, Aug 31, 2020 at 11:25:51AM -0600, Theo de Raadt wrote:

> > Taking advantage of the sparse address space is smart and as 64-bit
> > is now the norm, that space is even sparser.
> 
> Fundamentally this is moving various forms of pressure to the kernel,
> which does not do the best job yet.

This effect is reduced by making small shrinks a no-op.

> 
> The pivot code in mmap for new mappings isn't entirely bug-free so we've
> avoided it turning it on.  The idea of that code is be random as
> neccessary -- creating "unknowable addresses", but in doing so avoid
> fragmenting the address space excessively.  Excessive fragmentation in turn
> fragmentations allocation in multi-level page-tables, and that in turn
> results in excessive TLB pressure.  Which is difficult to gauge since things
> keep working, but brings in a big performance cost.
> 
> Basically we were brave to do very high amounts of randomization early on.
> At a cost.  But our work to improve the cost isn't finished.



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Theo de Raadt
> Taking advantage of the sparse address space is smart and as 64-bit
> is now the norm, that space is even sparser.

Fundamentally this is moving various forms of pressure to the kernel,
which does not do the best job yet.

The pivot code in mmap for new mappings isn't entirely bug-free so we've
avoided it turning it on.  The idea of that code is be random as
neccessary -- creating "unknowable addresses", but in doing so avoid
fragmenting the address space excessively.  Excessive fragmentation in turn
fragmentations allocation in multi-level page-tables, and that in turn
results in excessive TLB pressure.  Which is difficult to gauge since things
keep working, but brings in a big performance cost.

Basically we were brave to do very high amounts of randomization early on.
At a cost.  But our work to improve the cost isn't finished.



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Todd C . Miller
On Mon, 31 Aug 2020 08:39:56 +0200, Otto Moerbeek wrote:

> 1. I do not put high pages of shrinking reallocs into to cache, but
> directly unmap.
>
> 2. For small shrinking reallocs realloc become a no-op. Pro: no
> syscalls at all, cons: the actual allocation is larger, so less
> overflow detection. So I do not do this if guard pages are active or
> the reduction is larger than the cache size.

I think this strikes a good balance.  It is a common pattern to
preallocate an estimated amount and then use realloc() to shrink
it once the actual data has been stored.  If that buffer is later
enlarged, you can get the kind of behavior your test program exhibits.

Taking advantage of the sparse address space is smart and as 64-bit
is now the norm, that space is even sparser.

 - todd



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Theo de Raadt
> If I am interpreting this correctly, realloc could be used to groom/shape
> the heap such that future allocations are less random and more predictable?

Traditionally the word "heap" refers to the zone of allocations in a
sbrk allocator, meaning things are packed tightly together in a known
space, and ordering of the objects inside that produces very low
variability.  I recommend against using the word heap, especially when
today we are using large-address space systems.

Additionally I think this phrasing forgets there are many many objects
in play, not just the ones being realloc'd.  Those objects disrupt the
available space by being allocated and freed.  Object allocation isn't
entirely controlled by the (small) malloc cache.

I guess the theory is that an attacker will succeed because a few realloc'd
objects don't 'relocate' as much as expected.

I don't believe this is likely.  I think we have placed a reasonable
number of hurdles at various levels with an eye on compute cost... we
recognize if we reject standard compsci "caching strategies" too much,
then perforance still stink.



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Otto Moerbeek
On Mon, Aug 31, 2020 at 08:28:25AM -0400, David Higgs wrote:

> On Mon, Aug 31, 2020 at 2:41 AM Otto Moerbeek  wrote:
> 
> > Hi,
> >
> > A question from Theo made me think about realloc and come up with a
> > particular bad case for performance. I do not know if it happens in
> > practice, but it was easy to create a test program to hit the case.
> >
> > We're talking allocation >= a page here. Smaller allocation follow
> > different rules.
> >
> > If an allocation is grown by realloc, I first tries to extend the
> > allocation by mapping pages next to the existing allocation. Since
> > the location of pages is randomized, chanches are high that next to an
> > allocation there are unmapped pages so the grow will work out.
> >
> > If realloc needs to shrink the allocation it puts the high pages no
> > longer needed in the malloc cache. There they can be re-used by other
> > allocations. But if that happens, next a grow of first allocation will
> > fail: the pages are already mapped. So realloc needs to do a new
> > allocation followed by a copy and a cleanup of the original.
> >
> > So this strategy of a shrinking realloc to of put unneeded pages into
> > the cache can work against us, plus it has the consequence that use of
> > realloc leads to allocations close to each other: no free guard pages.
> >
> 
> If I am interpreting this correctly, realloc could be used to groom/shape
> the heap such that future allocations are less random and more predictable?
> 
> --david

In a way yes, but that's a consequence of caching pages: new
allocations will come from the cache if possible. But with this diff
there are less possibilities. Also note that malloc option S disables
the cache. 

-Otto



Re: shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread David Higgs
On Mon, Aug 31, 2020 at 2:41 AM Otto Moerbeek  wrote:

> Hi,
>
> A question from Theo made me think about realloc and come up with a
> particular bad case for performance. I do not know if it happens in
> practice, but it was easy to create a test program to hit the case.
>
> We're talking allocation >= a page here. Smaller allocation follow
> different rules.
>
> If an allocation is grown by realloc, I first tries to extend the
> allocation by mapping pages next to the existing allocation. Since
> the location of pages is randomized, chanches are high that next to an
> allocation there are unmapped pages so the grow will work out.
>
> If realloc needs to shrink the allocation it puts the high pages no
> longer needed in the malloc cache. There they can be re-used by other
> allocations. But if that happens, next a grow of first allocation will
> fail: the pages are already mapped. So realloc needs to do a new
> allocation followed by a copy and a cleanup of the original.
>
> So this strategy of a shrinking realloc to of put unneeded pages into
> the cache can work against us, plus it has the consequence that use of
> realloc leads to allocations close to each other: no free guard pages.
>

If I am interpreting this correctly, realloc could be used to groom/shape
the heap such that future allocations are less random and more predictable?

--david


shrinking and growing reallocs: a theoretical? bad case for performance

2020-08-31 Thread Otto Moerbeek
Hi,

A question from Theo made me think about realloc and come up with a
particular bad case for performance. I do not know if it happens in
practice, but it was easy to create a test program to hit the case.

We're talking allocation >= a page here. Smaller allocation follow
different rules.

If an allocation is grown by realloc, I first tries to extend the
allocation by mapping pages next to the existing allocation. Since
the location of pages is randomized, chanches are high that next to an
allocation there are unmapped pages so the grow will work out.

If realloc needs to shrink the allocation it puts the high pages no
longer needed in the malloc cache. There they can be re-used by other
allocations. But if that happens, next a grow of first allocation will
fail: the pages are already mapped. So realloc needs to do a new
allocation followed by a copy and a cleanup of the original.

So this strategy of a shrinking realloc to of put unneeded pages into
the cache can work against us, plus it has the consequence that use of
realloc leads to allocations close to each other: no free guard pages.

The program below tests this scenario and runs awfully slow. The diff
fixes this by applying two strategies. The first already makes a huge
difference, but the second strategy will also reduce the total number
of syscalls at the cost of some more memory use.

1. I do not put high pages of shrinking reallocs into to cache, but
directly unmap.

2. For small shrinking reallocs realloc become a no-op. Pro: no
syscalls at all, cons: the actual allocation is larger, so less
overflow detection. So I do not do this if guard pages are active or
the reduction is larger than the cache size.

Some stats, First run is -current, second one is with (an earlier
version of) the diff on an armv7 machine. Other systems also show huge
differences.

[otto@wand:19]$ time ./a.out
0m31.68s real 0m10.02s user 0m21.65s system

[otto@wand:33]$ time ./a.out
0m00.16s real 0m00.12s user 0m00.03s system

I do not see any diffference for builds. But I cna imagine real-life
programs hitting the case.

-Otto



#include 
#include 
#include 

void *p;
size_t psz;
#define E(x) if ((x) == NULL) err(1, NULL)

void f(void)
{
int i;
void *s[64];

p = realloc(p, 1023*psz);
E(p);
for (i = 0; i < 64; i++) {
s[i] = malloc(psz);
E(s[i]);
}
p = realloc(p, 1024*psz);
E(p);
for (i = 0; i < 64; i++)
free(s[i]);

}

int main()
{
int i;

psz = getpagesize();
p = malloc(1024*psz);
E(p);
for (i = 0; i < 1000; i++)
f();
}


Index: malloc.c
===
RCS file: /cvs/src/lib/libc/stdlib/malloc.c,v
retrieving revision 1.262
diff -u -p -r1.262 malloc.c
--- malloc.c28 Jun 2019 13:32:42 -  1.262
+++ malloc.c31 Aug 2020 06:01:40 -
@@ -728,28 +728,8 @@ unmap(struct dir_info *d, void *p, size_
wrterror(d, "malloc cache overflow");
 }
 
-static void
-zapcacheregion(struct dir_info *d, void *p, size_t len)
-{
-   u_int i;
-   struct region_info *r;
-   size_t rsz;
-
-   for (i = 0; i < d->malloc_cache; i++) {
-   r = >free_regions[i];
-   if (r->p >= p && r->p <= (void *)((char *)p + len)) {
-   rsz = r->size << MALLOC_PAGESHIFT;
-   if (munmap(r->p, rsz))
-   wrterror(d, "munmap %p", r->p);
-   r->p = NULL;
-   d->free_regions_size -= r->size;
-   STATS_SUB(d->malloc_used, rsz);
-   }
-   }
-}
-
 static void *
-map(struct dir_info *d, void *hint, size_t sz, int zero_fill)
+map(struct dir_info *d, size_t sz, int zero_fill)
 {
size_t psz = sz >> MALLOC_PAGESHIFT;
struct region_info *r, *big = NULL;
@@ -762,7 +742,7 @@ map(struct dir_info *d, void *hint, size
if (sz != PAGEROUND(sz))
wrterror(d, "map round");
 
-   if (hint == NULL && psz > d->free_regions_size) {
+   if (psz > d->free_regions_size) {
_MALLOC_LEAVE(d);
p = MMAP(sz, d->mmap_flag);
_MALLOC_ENTER(d);
@@ -774,8 +754,6 @@ map(struct dir_info *d, void *hint, size
for (i = 0; i < d->malloc_cache; i++) {
r = >free_regions[(i + d->rotor) & (d->malloc_cache - 1)];
if (r->p != NULL) {
-   if (hint != NULL && r->p != hint)
-   continue;
if (r->size == psz) {
p = r->p;
r->p = NULL;
@@ -807,8 +785,6 @@ map(struct dir_info *d, void *hint, size
memset(p, SOME_FREEJUNK, sz);
return p;
}
-   if (hint != NULL)
-   return