Re: malloc cleanup and small optimization (step 2)

2017-12-30 Thread Daniel Micay
On 30 December 2017 at 06:44, Otto Moerbeek  wrote:
> On Sat, Dec 30, 2017 at 06:53:44AM +, kshe wrote:
>
>> Hi,
>>
>> Looking at this diff and the previous one, I found some more possible
>> cleanups for malloc.c (the patch below is to be applied after both of
>> them, even if the second one has not been committed yet):
>>
>> 1.  In malloc_bytes(), use ffs(3) instead of manual loops, which on many
>> architectures boils down to merely one or two instructions (for example,
>> see src/lib/libc/arch/amd64/string/ffs.S).
>
> I remember doing some measurements using ffs a long time ago, and it
> turned out that it was slower in some cases. Likely due to the
> function call overhead or maybe a non-optimal implementation. So I'm
> only willing to consider this after seeing benchmarks on a handful of
> architectures.

There's __builtin_ffs but Clang / GCC might not be smart enough to use
it for ffs(3) automatically.
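
To make that concrete, here's a minimal sketch (hypothetical helper, not
the malloc.c code) of the kind of scan ffs(3) collapses into one or two
instructions:

#include <strings.h>

/*
 * Find the index of the first set bit in a bitmap word, where a set
 * bit marks a free chunk. Caller must ensure bits != 0.
 */
static int
first_free_chunk(unsigned int bits)
{
	/*
	 * The manual loop this replaces:
	 *	int k = 0;
	 *	while ((bits & 1) == 0) { bits >>= 1; k++; }
	 *	return k;
	 */
	return ffs(bits) - 1;	/* ffs(3): 1 + index of lowest set bit */
}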



Re: malloc.c: better double free check

2017-09-24 Thread Daniel Micay
> In the end all double frees still will be caught by the actual free
> code, just with a delay. The delayed free buffer double free check is
> just a way of catching it as soon as possible to make debugging
> easier.  That's the reason the original code could just do the check
> on the slot being replaced only.
> 
> The only case that could be missed is when the chunk is given out by
> malloc in between the original free and the double free. But that
> case can never be caught in all circumstances since the delay buffer is of
> finite size.
> 
>   -Otto

True, the delay buffer currently only guarantees allocations are kept
out of circulation for one cycle since the random choice is between
previously freed allocations, never the current one.

It matters more with the other change making half of the quarantine into
a ring buffer to provide a longer guaranteed delay. I think that makes
sense as a trade-off vs. an extra bit of entropy from a 2x larger random
array for a given total quarantine size. It also improves the write-
after-free detection, especially with a configurable quarantine size,
which makes it somewhat like the ASan quarantine but with delayed
detection of write-after-free and only indirect read-after-free
detection via junk filling (i.e. if something ends up crashing /
breaking from reading junk instead of what it expected).
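
A minimal sketch of that layout (hypothetical names and sizes, not the
actual patch): a freed chunk first displaces the oldest FIFO entry,
which then displaces a random slot, and only the chunk evicted from the
random array goes back into circulation:

#include <stddef.h>

#define RING_SLOTS	16	/* guaranteed delay: RING_SLOTS frees */
#define RAND_SLOTS	16	/* unpredictable additional delay */

struct quarantine {
	void	*ring[RING_SLOTS];
	size_t	 head;
	void	*rnd[RAND_SLOTS];
};

/* Returns the chunk leaving quarantine, or NULL while it is filling. */
static void *
quarantine_swap(struct quarantine *q, void *p, unsigned char rbyte)
{
	void *old = q->ring[q->head];

	q->ring[q->head] = p;
	q->head = (q->head + 1) % RING_SLOTS;
	if (old == NULL)
		return NULL;
	p = q->rnd[rbyte % RAND_SLOTS];
	q->rnd[rbyte % RAND_SLOTS] = old;
	return p;
}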



Re: malloc.c: better double free check

2017-09-23 Thread Daniel Micay
On Sat, 2017-09-23 at 09:32 +0200, Otto Moerbeek wrote:
> On Fri, Sep 22, 2017 at 04:35:39PM -0400, Daniel Micay wrote:
> 
> > A linear search works well for the current small quarantine (16) but
> > won't work
> > well if you ever want to have a larger / configurable quarantine
> > size. It would
> > also be nice to make this fast enough to enable by default.
> > 
> > We (CopperheadOS) use an open addressed hash table for this based on
> > the
> > existing hash table since we use a larger quarantine with a FIFO
> > queue
> > alongside the random array and a configurable size. Ideally the
> > code would be
> > shared with the existing hash table but I didn't want to make it
> > into an
> > invasive change downstream.
> > 
> > These are the three downstream patches for OpenBSD malloc in our
> > copy of Bionic
> > (Android libc), so I'd need to port them to the current upstream
> > code to apply
> > cleanly. They're currently applied after other changes and it's a
> > slightly
> > older copy of the base code (after multi-pool support, but before
> > the canary
> > rework since we'll need to adapt that to our needs). Can get the
> > general idea
> > from the patches even though they're not going to apply cleanly.
> > 
> > [1] quarantine double-free detection via hash table
> 
> Thanks for sharing this, I'll take a look soon. 
> 
> Thinking a bit about this: wouldn't a closed hash table be sufficient?
> A collision would then either be a double free or, if not, just replace
> old with new. You'll get O(1) lookup and insert and simpler code.

I wouldn't really want to have a random chance of missing a double-free
even if the chance is small though.

It's a copy of the hash table used to track regions with minor tweaks so
it'd be possible to make the existing code able to handle both but it
would mean turning it into macros. I'm not sure which is the lesser evil
for just 2 uses of a fairly simple data structure.



Re: malloc.c: better double free check

2017-09-22 Thread Daniel Micay
A linear search works well for the current small quarantine (16) but won't work
well if you ever want to have a larger / configurable quarantine size. It would
also be nice to make this fast enough to enable by default.

We (CopperheadOS) use an open addressed hash table for this based on the
existing hash table since we use a larger quarantine with a FIFO queue
alongside the random array and a configurable size. Ideally the code would be
shared with the existing hash table but I didn't want to make it into an
invasive change downstream.

These are the three downstream patches for OpenBSD malloc in our copy of Bionic
(Android libc), so I'd need to port them to the current upstream code to apply
cleanly. They're currently applied after other changes and it's a slightly
older copy of the base code (after multi-pool support, but before the canary
rework since we'll need to adapt that to our needs). Can get the general idea
from the patches even though they're not going to apply cleanly.

[1] quarantine double-free detection via hash table

diff --git a/libc/bionic/omalloc.c b/libc/bionic/omalloc.c
index 23af59501..4f4e1290c 100644
--- a/libc/bionic/omalloc.c
+++ b/libc/bionic/omalloc.c
@@ -157,6 +157,7 @@ struct dir_info {
struct region_info free_regions[MALLOC_MAXCACHE];
/* delayed free chunk slots */
void *delayed_chunks[MALLOC_DELAYED_CHUNK_MASK + 1];
+   void *delayed_chunks_set[(MALLOC_DELAYED_CHUNK_MASK + 1) * 4];
size_t rbytesused;  /* random bytes used */
char *func; /* current function */
int mutex;
@@ -292,6 +293,22 @@ hash(void *p)
return sum;
 }
 
+static inline size_t
+hash_chunk(void *p)
+{
+   size_t sum;
+   uintptr_t u;
+
+   u = (uintptr_t)p >> MALLOC_MINSHIFT;
+   sum = u;
+   sum = (sum << 7) - sum + (u >> 16);
+#ifdef __LP64__
+   sum = (sum << 7) - sum + (u >> 32);
+   sum = (sum << 7) - sum + (u >> 48);
+#endif
+   return sum;
+}
+
 static inline
 struct dir_info *getpool(void)
 {
@@ -983,6 +1000,56 @@ delete(struct dir_info *d, struct region_info *ri)
}
 }
 
+void delayed_chunks_insert(struct dir_info *d, void *p)
+{
+   size_t index;
+   size_t mask = sizeof(d->delayed_chunks_set) / sizeof(void *) - 1;
+   void *q;
+
+   index = hash_chunk(p) & mask;
+   q = d->delayed_chunks_set[index];
+   while (q != NULL) {
+   if (p == q)
+   wrterror(d, "double free", p);
+   index = (index - 1) & mask;
+   q = d->delayed_chunks_set[index];
+   }
+   d->delayed_chunks_set[index] = p;
+}
+
+void delayed_chunks_delete(struct dir_info *d, void *p)
+{
+   size_t mask = sizeof(d->delayed_chunks_set) / sizeof(void *) - 1;
+   size_t i, j, r;
+   void *q;
+
+   i = hash_chunk(p) & mask;
+   q = d->delayed_chunks_set[i];
+   while (q != p) {
+   if (q == NULL)
+   wrterror(d, "pointer missing from address tracking table", p);
+   i = (i - 1) & mask;
+   q = d->delayed_chunks_set[i];
+   }
+
+   for (;;) {
+   d->delayed_chunks_set[i] = NULL;
+   j = i;
+   for (;;) {
+   i = (i - 1) & mask;
+   if (d->delayed_chunks_set[i] == NULL)
+   return;
+   r = hash_chunk(d->delayed_chunks_set[i]) & mask;
+   if ((i <= r && r < j) || (r < j && j < i) ||
+   (j < i && i <= r))
+   continue;
+   d->delayed_chunks_set[j] = d->delayed_chunks_set[i];
+   break;
+   }
+   }
+}
+
+
 /*
  * Allocate a page of chunks
  */
@@ -1474,11 +1541,21 @@ ofree(struct dir_info *argpool, void *p)
if (!mopts.malloc_freenow) {
if (find_chunknum(pool, r, p) == (uint32_t)-1)
goto done;
+
+   if (p == NULL)
+   goto done;
+
+   delayed_chunks_insert(pool, p);
+
i = getrbyte(pool) & MALLOC_DELAYED_CHUNK_MASK;
tmp = p;
p = pool->delayed_chunks[i];
-   if (tmp == p)
-   wrterror(pool, "double free", p);
+
+   if (p == NULL)
+   goto done;
+
+   delayed_chunks_delete(pool, p);
+
if (mopts.malloc_junk)
validate_junk(pool, p);
pool->delayed_chunks[i] = tmp;
@@ -2264,6 +2341,7 @@ malloc_dump(int fd, struct dir_info *pool)
r = find(pool, p);
if (r == NULL)
wrterror(pool, "bogus pointer in 

Re: random malloc junk

2016-09-14 Thread Daniel Micay
On Tue, 2016-09-13 at 13:27 +0200, Otto Moerbeek wrote:
> On Thu, Sep 08, 2016 at 06:42:33PM -0400, Daniel Micay wrote:
> 
> > A bit off-topic: 'J' enables junk-on-init which is for debugging,
> > but it
> > also currently has security improvements for large allocations.
> > There's
> > only partial junk-on-free by default (half a page), and 'U' disables
> > large allocation junk-on-free without 'J'. I think it would make
> > sense
> > to remove those optimizations since it's fine if the cost scales up
> > with
> > larger allocations and losing the guarantee of not leaking data via
> > uninitialized memory with 'U' is not great. Using 'U' is quite
> > expensive
> > regardless, and adds some pathological performance cases for small
> > size
> > allocations which is more important. I ended up removing both of
> > those
> > optimizations for the CopperheadOS port.
> 
> I would prefer to see a diff with this. For me, that should be easier
> to understand than your description.

This is the diff from the CopperheadOS port which won't apply directly
to malloc.c in OpenBSD, but should explain what I mean since it's just a
few lines. Just ignore the part where it removes malloc_junk=2, which is
because junk-on-init is split out (so this obsoleted the extra mode).

The current OpenBSD code only wipes up to MALLOC_MAXCHUNK with junk @ 1,
and it similarly doesn't wipe at all with 'U' (even though junk-on-free
also serves the purpose of preventing information leaks, not just
mitigating use-after-free). IMO, optimizing large allocation perf like
this isn't worthwhile.

diff --git a/libc/bionic/omalloc.c b/libc/bionic/omalloc.c
index e451d79..9277ee7 100644
--- a/libc/bionic/omalloc.c
+++ b/libc/bionic/omalloc.c
@@ -504,7 +504,7 @@ map(struct dir_info *d, void *hint, size_t sz, int zero_fill)
    madvise(p, sz, MADV_NORMAL);
    if (zero_fill)
    memset(p, 0, sz);
-   else if (mopts.malloc_junk == 2 &&
+   else if (mopts.malloc_junk &&
    mopts.malloc_freeunmap)
    memset(p, SOME_FREEJUNK, sz);
    return p;
@@ -524,7 +524,7 @@ map(struct dir_info *d, void *hint, size_t sz, int zero_fill)
    d->free_regions_size -= psz;
    if (zero_fill)
    memset(p, 0, sz);
-   else if (mopts.malloc_junk == 2 && mopts.malloc_freeunmap)
+   else if (mopts.malloc_junk && mopts.malloc_freeunmap)
    memset(p, SOME_FREEJUNK, sz);
    return p;
    }
@@ -603,7 +603,7 @@ omalloc_parseopt(char opt)
    mopts.malloc_junk = 0;
    break;
    case 'J':
-   mopts.malloc_junk = 2;
+   mopts.malloc_junk = 1;
    break;
    case 'i':
    mopts.malloc_junk_init = 0;
@@ -1517,8 +1517,7 @@ ofree(struct dir_info *pool, void *p)
    STATS_SUB(pool->malloc_guarded, mopts.malloc_guard);
    }
    if (mopts.malloc_junk && !mopts.malloc_freeunmap) {
-   size_t amt = mopts.malloc_junk == 1 ? MALLOC_MAXCHUNK :
-   PAGEROUND(sz) - mopts.malloc_guard;
+   size_t amt = PAGEROUND(sz) - mopts.malloc_guard;
    memset(p, SOME_FREEJUNK, amt);
    }
    unmap(pool, p, PAGEROUND(sz));



Re: random malloc junk

2016-09-08 Thread Daniel Micay
A nice security property of 0xdf filling is that a use-after-free of a
pointer is guaranteed to fault in a typical environment since it ends up
pointing outside userspace (I assume that's the case on OpenBSD). A heap
spray could potentially allow exploiting a random pointer. Perhaps it
would be better if only the byte range guaranteeing faults for pointers
was used? Less random, but strictly better than the current situation
rather than losing a nice guarantee.
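
Concretely, a sketch of the guarantee (assuming amd64-style canonical
addressing):

#include <string.h>

/*
 * What junk-on-free does to a stored pointer: every byte becomes 0xdf,
 * so a pointer later loaded from the freed memory has the value
 * 0xdfdfdfdfdfdfdfdf -- a non-canonical address on amd64, far outside
 * userspace -- and dereferencing it traps immediately.
 */
void
junk_fill_example(void)
{
	void *p;

	memset(&p, 0xdf, sizeof(p));	/* the SOME_FREEJUNK fill */
	/* *(volatile char *)p would fault: p == (void *)0xdfdfdfdfdfdfdfdf */
}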



Re: random malloc junk

2016-09-08 Thread Daniel Micay
> BTW, we should revisit canaries and work further on moving them
> closer to requested size. There's a chance this diff wil complicate
> that.

Can switch the canary code to memcpy/memcmp (to handle unaligned
canaries) and could then use the last byte as an index to the start of
the canary. For larger movement, it could have a special value (like
255) meaning that there's a 4 byte length at an 8 byte offset. If you
really wanted you could micro-optimize to lose only 1 canary bit, but
that seems too complicated to save 7 bits of the canary. It could also
sanity check the offsets based on the known minimum size of a chunk in
that size class.
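
A rough sketch of that encoding (hypothetical layout; it ignores the
special-value case for offsets that don't fit in the last byte):

#include <stdint.h>
#include <string.h>

/*
 * The canary sits directly after the requested size, and the chunk's
 * last byte records the distance back to it, so free() can locate the
 * canary without knowing the requested size. memcpy/memcmp handle the
 * unaligned placement. Assumes req + sizeof(canary) <= chunk_size - 1
 * and that the distance fits in a byte.
 */
static void
canary_write(uint8_t *chunk, size_t chunk_size, size_t req, uint64_t canary)
{
	memcpy(chunk + req, &canary, sizeof(canary));
	chunk[chunk_size - 1] = (uint8_t)(chunk_size - 1 - req);
}

static int
canary_ok(const uint8_t *chunk, size_t chunk_size, uint64_t canary)
{
	size_t off = chunk_size - 1 - chunk[chunk_size - 1];

	return memcmp(chunk + off, &canary, sizeof(canary)) == 0;
}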



Re: random malloc junk

2016-09-08 Thread Daniel Micay
On Wed, 2016-09-07 at 18:29 -0400, Ted Unangst wrote:
> Instead of always using a fixed byte pattern, I think malloc should
> use a
> random pattern. Now, this sometimes means it's harder to identify
> exactly
> what's used after free, so we should provide a means to get the old
> 0xdf
> pattern back.
> 
> Since we already have two junk modes, I thought I'd carry on along
> those
> lines. The default junk behavior, for free chunks only, is more of a
> security
> measure. I think this means we want random junk. The second level 'J'
> junk is
> more of a debugging tool, so that retains 0xdf.

A bit off-topic: 'J' enables junk-on-init which is for debugging, but it
also currently has security improvements for large allocations. There's
only partial junk-on-free by default (half a page), and 'U' disables
large allocation junk-on-free without 'J'. I think it would make sense
to remove those optimizations since it's fine if the cost scales up with
larger allocations and losing the guarantee of not leaking data via
uninitialized memory with 'U' is not great. Using 'U' is quite expensive
regardless, and adds some pathological performance cases for small size
allocations which is more important. I ended up removing both of those
optimizations for the CopperheadOS port.



malloc: add full delayed chunk double-free detection

2016-05-10 Thread Daniel Micay
This uses a hash table to maintain a set of delayed allocations, allowing full
and efficient double-free detection. The current code can only catch it when
the two pointers being swapped are equal, so double-frees that could be caught
are missed. A naive loop over every delayed chunk would work fine with the
current 16 slot array, but that would make scaling up the number of delayed
allocations much more costly.

I'm using this with 2 other changes making the size configurable and adding a
FIFO ring buffer in front of the randomized array of the same size, to get a
minimum guaranteed delay of more than the current one free cycle. The randomization
still works fine and it ends up providing a nice balance. It's essentially
equivalent to the quarantine feature in Valgrind/ASan, but suitable for
production and with weaker use-after-free detection. It mixes well with the
earlier change to detect write-after-free via junk validation here.

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index bc328d2..9e8bd16 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -118,6 +118,7 @@ struct dir_info {
struct region_info free_regions[MALLOC_MAXCACHE];
/* delayed free chunk slots */
void *delayed_chunks[MALLOC_DELAYED_CHUNK_MASK + 1];
+   void *delayed_chunks_set[(MALLOC_DELAYED_CHUNK_MASK + 1) * 2];
size_t rbytesused;  /* random bytes used */
char *func; /* current function */
u_char rbytes[32];  /* random bytes */
@@ -249,6 +250,22 @@ hash(void *p)
return sum;
 }
 
+static inline size_t
+hash_chunk(void *p)
+{
+   size_t sum;
+   uintptr_t u;
+
+   u = (uintptr_t)p >> MALLOC_MINSHIFT;
+   sum = u;
+   sum = (sum << 7) - sum + (u >> 16);
+#ifdef __LP64__
+   sum = (sum << 7) - sum + (u >> 32);
+   sum = (sum << 7) - sum + (u >> 48);
+#endif
+   return sum;
+}
+
 static void
 wrterror(struct dir_info *d, char *msg, void *p)
 {
@@ -864,6 +881,58 @@ delete(struct dir_info *d, struct region_info *ri)
}
 }
 
+void delayed_chunks_insert(struct dir_info *d, void *p) {
+   size_t index;
+   size_t mask = sizeof(d->delayed_chunks_set) / sizeof(void *) - 1;
+   void *q;
+
+   index = hash_chunk(p) & mask;
+   q = d->delayed_chunks_set[index];
+   while (q != NULL) {
+   if (p == q) {
+   wrterror(d, "double free", p);
+   return;
+   }
+   index = (index - 1) & mask;
+   q = d->delayed_chunks_set[index];
+   }
+   d->delayed_chunks_set[index] = p;
+}
+
+void delayed_chunks_delete(struct dir_info *d, void *p) {
+   size_t mask = sizeof(d->delayed_chunks_set) / sizeof(void *) - 1;
+   size_t i, j, r;
+   void *q;
+
+   i = hash_chunk(p) & mask;
+   q = d->delayed_chunks_set[i];
+   while (q != p) {
+   if (q == NULL) {
+   wrterror(d, "pointer missing from address tracking table", p);
+   return;
+   }
+   i = (i - 1) & mask;
+   q = d->delayed_chunks_set[i];
+   }
+
+   for (;;) {
+   d->delayed_chunks_set[i] = NULL;
+   j = i;
+   for (;;) {
+   i = (i - 1) & mask;
+   if (d->delayed_chunks_set[i] == NULL)
+   return;
+   r = hash_chunk(d->delayed_chunks_set[i]) & mask;
+   if ((i <= r && r < j) || (r < j && j < i) ||
+   (j < i && i <= r))
+   continue;
+   d->delayed_chunks_set[j] = d->delayed_chunks_set[i];
+   break;
+   }
+   }
+}
+
+
 /*
  * Allocate a page of chunks
  */
@@ -1315,13 +1384,21 @@ ofree(struct dir_info *pool, void *p)
if (!mopts.malloc_freenow) {
if (find_chunknum(pool, r, p) == -1)
return;
+
+   if (p == NULL)
+   return;
+
+   delayed_chunks_insert(pool, p);
+
i = getrbyte(pool) & MALLOC_DELAYED_CHUNK_MASK;
tmp = p;
p = pool->delayed_chunks[i];
-   if (tmp == p) {
-   wrterror(pool, "double free", p);
+
+   if (p == NULL)
return;
-   }
+
+   delayed_chunks_delete(pool, p);
+
if (mopts.malloc_junk)
validate_junk(pool, p);
pool->delayed_chunks[i] = tmp;
@@ -1950,6 +2027,7 @@ malloc_dump(int fd)
wrterror(pool, "bogus pointer in malloc_dump", p);
continue;
}
+ 

Re: malloc: fix setting errno on out-of-memory

2016-05-10 Thread Daniel Micay
Sorry, my mail client was being stupid. Let's try again in good old mutt:

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index bc328d2..aa6f0a3 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -1224,8 +1224,9 @@ malloc(size_t size)
r = omalloc(d, size, 0, CALLER);
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1510,8 +1511,9 @@ realloc(void *ptr, size_t size)
 
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1563,8 +1565,9 @@ calloc(size_t nmemb, size_t size)
 
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1694,10 +1697,9 @@ posix_memalign(void **memptr, size_t alignment, size_t size)
d->active--;
_MALLOC_UNLOCK();
if (r == NULL) {
-   if (mopts.malloc_xmalloc) {
+   if (mopts.malloc_xmalloc)
wrterror(d, "out of memory", NULL);
-   errno = ENOMEM;
-   }
+   errno = ENOMEM;
goto err;
}
errno = saved_errno;



malloc: fix setting errno on out-of-memory

2016-05-10 Thread Daniel Micay
The ENOMEM errno wasn't being set in some code paths where it was gated
behind mopts.malloc_xmalloc:

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index bc328d2..aa6f0a3 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -1224,8 +1224,9 @@ malloc(size_t size)
r = omalloc(d, size, 0, CALLER);
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1510,8 +1511,9 @@ realloc(void *ptr, size_t size)
 
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1563,8 +1565,9 @@ calloc(size_t nmemb, size_t size)
 
d->active--;
_MALLOC_UNLOCK();
-   if (r == NULL && mopts.malloc_xmalloc) {
-   wrterror(d, "out of memory", NULL);
+   if (r == NULL) {
+   if (mopts.malloc_xmalloc)
+   wrterror(d, "out of memory", NULL);
errno = ENOMEM;
}
if (r != NULL)
@@ -1694,10 +1697,9 @@ posix_memalign(void **memptr, size_t alignment, size_t size)
d->active--;
_MALLOC_UNLOCK();
if (r == NULL) {
-   if (mopts.malloc_xmalloc) {
+   if (mopts.malloc_xmalloc)
wrterror(d, "out of memory", NULL);
-   errno = ENOMEM;
-   }
+   errno = ENOMEM;
goto err;
}
errno = saved_errno;
-- 
2.8.2
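
For context (not part of the diff): POSIX requires malloc() to set
errno to ENOMEM on failure, and callers that report errors through
errno rely on that, e.g.:

#include <err.h>
#include <stdlib.h>

void *
xalloc(size_t n)
{
	void *p = malloc(n);

	if (p == NULL)
		err(1, NULL);	/* err() prints strerror(errno): needs ENOMEM */
	return p;
}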



Re: multi-pool malloc wip diff

2016-03-28 Thread Daniel Micay
> - Currently fixed at 4 pools with a fixed thread -> pool mapping.
> - All pools are always initialized, even for single threaded programs,
> where
>   only one pool is used.
> - Especially realloc gets quite a bit uglier.
> - I'm pondering storing the thread -> pool mapping in the thread
>   struct instead of computing it each time from the tcb address.

If you wanted to get rid of the static mapping of threads to pools, it
would make sense to replace the per-pool region tables with a global, 2-
tier table. The first tier would be a table of mutexes paired with the
current style region tables, and it would pick the right one using a
secondary hash. It could be much larger than the number of pools to
avoid it being a bottleneck. This would eliminate the free / realloc
loops.

I don't think it would make sense right now, since it would add an extra
pair of lock / unlock calls to each malloc / free call in the common
case where allocations are freed by the allocating thread. If the thread
-> pool assignment was dynamic, it would be faster due to not needing
those loops. So if you wanted to do something like picking a random pool
each time, this would make a lot of sense. It would be increasingly
better with a larger number of pools.
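
A sketch of that 2-tier layout (hypothetical names; hash() stands in
for the allocator's existing pointer hash):

#include <stddef.h>

struct region_info;			/* existing per-region record */
extern size_t hash(void *);		/* existing pointer hash */

#define BUCKET_COUNT 64	/* much larger than the pool count, to limit contention */

/*
 * First tier: global buckets, each a mutex paired with a current-style
 * region table. A pointer maps to its bucket by a secondary hash,
 * independent of which thread or pool allocated it, so free() and
 * realloc() never loop over pools.
 */
struct region_bucket {
	int			 mutex;
	struct region_info	*regions;
	size_t			 regions_total;
};

static struct region_bucket region_buckets[BUCKET_COUNT];

static struct region_bucket *
bucket_for(void *p)
{
	return &region_buckets[hash(p) % BUCKET_COUNT];
}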



Re: refine canaries

2015-12-09 Thread Daniel Micay
On Wed, 2015-12-09 at 07:47 -0500, Ted Unangst wrote:
> This is a kind of two steps forward, one step back diff.
> 
> I would like for the canary to be placed directly adjacent to the end
> of the
> user specified size. No slack. To accomplish this, we record the
> original size
> of the allocation at the end, then we can walk backwards to find the
> canary.
> Mechanically, this requires passing the real size down a little deeper
> in some
> places.

Makes sense. The goal was reliably catching small overflows but the size
class rounding proved to be a major problem (especially with the coarse
power of two size classes). And it doesn't seem like it would reduce the
potential for using this as a security feature.

> I also changed the canaries to be entirely random, with the correct
> value
> stored in each chunk. This is a security tradeoff, since now an
> attacker can
> overwrite the whole works predictably. But I consider this more of a
> debugging
> feature than a security feature, so I'd prefer to greatly improve the
> ability
> to catch one byte overwrites. A fair trade.

An alternative to this would be storing a random value in the metadata
for the chunk pages. That would avoid needing to spend more memory for
the inline random value and wouldn't have a negative security impact.

It wasn't really meant for protecting against sequential overflows since
it would usually detect them too late, but ideally it could still work
as a defence against them in some cases.

Not quite as good as a random value per canary, but a random value per
page would already be pretty fine-grained and each canary can still be
made unique via something like the current `random ^ hash(address)`.

There also might be a much better way of generating the canaries from
the random value. I just copied the existing pattern and added the hash
call to make it a little less bad. If performance wasn't a concern, it
could simply use SipHash but... it's way too slow. Hiding the seed from
people able to leak the canary value would only require encryption but
ideally it would actually be a proper MAC (that just seems unrealistic).
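
For reference, the construction under discussion is the one from the
canary patch: one random secret XORed with a hash of the canary's own
location, so every canary differs while only a single secret is stored:

#include <stdint.h>

extern uintptr_t chunk_canary_secret;	/* arc4random_buf() once at init */
extern size_t hash(void *);		/* the allocator's pointer hash */

static uintptr_t
canary_for(uintptr_t *slot)
{
	/* same shape as the patch's mopts.malloc_chunk_canary ^ hash(canary) */
	return chunk_canary_secret ^ hash(slot);
}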



Re: malloc canaries and validation

2015-12-06 Thread Daniel Micay
It would be great to land this simple initial implementation and move
from there. I have ideas on how to make these features better but I'm
wary of doing a lot of work out-of-tree. If it lands in some form, that
would go a long way to encouraging further work on it. I basically just
don't want to end up maintaining a whole allocator implementation.

For detecting UAF, the ideal would be having a FIFO ring buffer placed
in front of the randomized delayed chunk array. The delay would then
depend on the amount of memory that's set aside for this. For example,
delaying up to 4M of small allocations (which is a lot, but totally
doable with modern hardware) would go a long way to catching most
writes after free via this junk validation feature. Could potentially
be configurable as the cache size is. It also becomes increasingly more
useful to check the currently delayed allocations at exit rather than
just when they get flushed out of the delayed array (and/or queue) if
there's actually a longer form of delay. It isn't all that useful right
now, but it would be a valuable component of this if it was fancier.

Right now it basically relies on luck to detect anything, unless the
allocation is immediately written to after a free and before the next
free. It's a place where a mix of determinism and non-determinism would
be best, i.e. a guaranteed amount of delay, but without losing the
randomization since that is useful for other reasons (randomized heap
layout).

The malloc canaries mostly just don't play well with the current power
of two size classes. They'd be very cheap if the size classes were
finer grained, i.e. accepting more external fragmentation but reducing
the internal fragmentation (wasted padding). There's a nice technique
used in jemalloc based on spacing classes rather than size classes but
it isn't appropriate in OpenBSD malloc because it would lose the nice
property of stuff being isolated based on size (and it would really be
best to *expand* that property more towards what PartitionAlloc does
rather than losing it).
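
A sketch of the jemalloc-style spacing mentioned above (illustrative,
not jemalloc's exact table; __builtin_clzl assumes GCC/Clang): four
classes per power-of-two group, bounding padding waste at roughly an
eighth of the request instead of up to half:

#include <stddef.h>

static size_t
size_class(size_t sz)
{
	size_t group, step;

	if (sz <= 32)
		return (sz + 15) & ~(size_t)15;		/* 16, 32 */
	group = (sizeof(size_t) * 8 - 1) - __builtin_clzl(sz - 1);
	step = (size_t)1 << (group - 2);	/* quarter of the group size */
	return (sz + step - 1) & ~(step - 1);	/* ... 160, 192, 224, 256, 320 ... */
}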



Re: malloc canaries and validation

2015-12-06 Thread Daniel Micay
On Sun, 2015-12-06 at 17:10 -0700, Theo de Raadt wrote:
> Kept out of circulation?  It sounds like it would be incredibly
> expensive data-structure wise for the kernel to even attempt such a
> guarantee..

I was just expecting it to be a FIFO ring buffer. It would have a limit
on the number of mappings it can delay along with a maximum upper bound
on the total mapping size (and it would flush the oldest mappings out
to stay under that threshold). It would increase virtual memory
fragmentation so it could be expensive.

I was thinking about it from a userspace perspective which is why I had
mentioned purging the delayed mappings with MADV_FREE. It wouldn't make
sense to do it in userspace though. Simply unmapping right away and
approaching it in the kernel is better.

> So what is the purpose of your proposal?  Is it to opportunistically
> find bugs?  Or to resist attack in UAF situations by "guaranteeing"
> the memory is unallocated at time of reuse?

Guaranteeing that the memory is not reused until a given amount of
allocator throughput happens. And reliable immediate detection of the
access since it will be guaranteed to fault.

> Maybe you are on a system with where mmap ASLR isn't as good?  The
> OpenBSD one isn't perfect, because we are trying to maintain a small
> amount of page-table locality (so that N-level page tables don't have
> to be walked fully from root for every access that misses in the
> TLB).

Yeah, the Linux kernel's mmap ASLR only randomizes the base. So I'm not
getting the same security properties as OpenBSD malloc has on OpenBSD.

I could implement fine-grained randomization in the kernel for 64-bit
(Android has no virtual memory to spare on 32-bit) but I thought a
delay might be useful even with that in place as it fully prevents
reuse for a given period of "time" (allocator throughput). It depends
on how aggressive the mmap randomization is and perhaps on the level of
control that an attacker can exert over memory allocation.



Re: malloc canaries and validation

2015-12-06 Thread Daniel Micay
It would also be interesting to try out a more aggressive form of
freeunmap for 64-bit where the allocations are purged with MADV_FREE
and then the virtual memory is kept out of circulation with a similar
FIFO queue approach. Could potentially do it by default when malloc
hints are enabled, so it wouldn't need a new option exposed (but it
would change the MADV_FREE option into something that enhances security
at the expense of more VM fragmentation rather than a performance vs.
memory trade-off so that may not make much sense after all).

It's the same issue as the junk validation feature where there's a need
for a reliable delay to get the most out of the feature. Randomization
does help, but it's not as good as knowing that virtual memory doesn't
go back into circulation until some configured amount of allocator
throughput has occurred.
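
A userspace sketch of that scheme (hypothetical sizes; MADV_FREE and
the FIFO bound are the assumptions here):

#include <stddef.h>
#include <sys/mman.h>

#define DELAY_SLOTS 64

static struct { void *p; size_t sz; } delay_q[DELAY_SLOTS];
static size_t delay_head;

/*
 * Retire a large region: give its pages back lazily, make any stale
 * access fault, and park the address range in a FIFO so it cannot be
 * handed out again until DELAY_SLOTS later retirements have occurred.
 */
static void
delayed_unmap(void *p, size_t sz)
{
	void *old = delay_q[delay_head].p;
	size_t old_sz = delay_q[delay_head].sz;

	madvise(p, sz, MADV_FREE);
	mprotect(p, sz, PROT_NONE);
	delay_q[delay_head].p = p;
	delay_q[delay_head].sz = sz;
	delay_head = (delay_head + 1) % DELAY_SLOTS;
	if (old != NULL)
		munmap(old, old_sz);	/* oldest range finally recirculates */
}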



Re: enhanced use-after-free detection for malloc v2

2015-11-02 Thread Daniel Micay
On 02/11/15 06:40 AM, Theo Buehler wrote:
> Sorry for this rather long mail:
> 
> I have three small comments on the patch itself
> (starting 80 lines below).
> 
> For those who want to try both new features, I attached a patch against
> -current that merges the three parts of Daniel's diff (plus the trivial
> two of the nits below) at the very end of this mail.
> 
> In addition to Daniel's test program, here's a small program that
> in 1000 test runs on amd64 always triggered an abort at the second
> free().
> 
> 
> $ cat heap.c
> #include <err.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> int
> main(void)
> {
>   char *p;
> 
>   p = malloc(1);
>   if (p == NULL)
>   err(1, "malloc");
> 
>   p = strcpy(p, "1234567");
>   printf("%s\n", p);
>   free(p);
> 
>   p = malloc(1);
>   p = strcpy(p, "12345678");
>   printf("%s\n", p);
>   free(p);
> 
>   return 0;
> }
> $ cc -o heap heap.c
> /tmp//ccHH1E1h.o: In function `main': heap.c:(.text+0x6b): warning:
> strcpy() is almost always misused, please use strlcpy()
> $ ./heap
> 1234567
> 12345678
> heap(5992) in free(): error: chunk canary corrupted 0x1f4619f0e9c0
> Abort trap (core dumped)
> $
> 
> 
> The performance impact of MALLOC_OPTIONS='CSV' is very noticeable,
> but 'CV' is negligible, and 'CJV' makes the system feel a bit less
> snappy on my machine, though the actual effect seems almost negligible.
> 
> Here's a rough 'benchmark':
> 
> Building the base system on my i7 Thinkpad T420 with 4G RAM,
> with the various malloc options each with a freshly booted system.
> 
> defaults: 31m04
> CV:   33m16
> CJV:  33m46
> CSV:  53m32
> 
> Unfortunately, I have yet to hit an actual bug with this, one that isn't
> triggered by simple test programs.  Casual browsing with firefox, even
> watching movies with mplayer is possible with MALLOC_OPTIONS='CSV'.

It would help quite a bit to bump up the size of the quarantine. The
default size of 16 doesn't keep allocations around for very long.

A more expensive version of the feature could also check for either zero
or junk data before allocating anything. Haven't decided how to expose
that via the option though.

Most bugs triggering in fairly normal situations with Firefox, etc. have
probably already been found by Valgrind though. The strength of features
like this in malloc is that they can be enabled all the time with little
performance loss, so eventually they'll find new bugs (which has been
the case for us on Android, but perhaps that's mostly because many apps
were never tested at all with ASAN/Valgrind).

It may also be viable as a security feature in some situations, not just
a way of finding bugs. Depends on how far away the UAF is from the free
call since one other free is all that's needed to lose reliable
detection. It might work better with a FIFO ring buffer rather than the
current fully randomized array (perhaps a mix? dunno).

> On Sun, Nov 01, 2015 at 02:56:11PM -0500, Daniel Micay wrote:
>> @@ -560,6 +563,12 @@ omalloc_init(struct dir_info **dp)
>>  case 'J':
>>  mopts.malloc_junk = 2;
>>  break;
>> +case 'v':
>> +mopts.malloc_validate = 0;
>> +break;
>> +case 'V':
>> +mopts.malloc_validate = 1;
>> +break;
>>  case 'n':
>>  case 'N':
>>  break;
> 
> I'd keep the alphabetical order in the switch.
> 
>> @@ -608,6 +617,9 @@ omalloc_init(struct dir_info **dp)
>>  }
>>  }
>>  
>> +if (!mopts.malloc_junk)
>> +mopts.malloc_validate = 0;
>> +
> 
> Wouldn't it make sense to just let 'V' imply 'J' (and 'j' imply 'v') as
> is already done with 'F' and 'U', respectively?

The V option doesn't benefit from J since it's only able to check for
use-after-frees in small allocations. I don't think there's a good
reason to tie them together.

A side note is that it would be nice if junk freeing was split up into
the security feature (full junk fill on free) and the debugging feature
(full junk fill on init). I ended up doing that in CopperheadOS because
it makes J significantly cheaper without giving up the nice security
properties gained by sanitizing everything on free. Might be too late to
change the meaning in OpenBSD though.

>> @@ -1190,6 +1208,35 @@ malloc(size_t size)
>>  /*DEF_STRONG(malloc);*/
>

Re: enhanced use-after-free detection for malloc v2

2015-11-01 Thread Daniel Micay
(without mangling it this time...)

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..c408594 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -182,6 +182,7 @@ struct malloc_readonly {
int malloc_freeunmap;   /* mprotect free pages PROT_NONE? */
int malloc_hint;/* call madvice on free pages?  */
int malloc_junk;/* junk fill? */
+   int malloc_validate;/* validate junk */
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
@@ -218,6 +219,8 @@ static void malloc_exit(void);
 #define CALLER NULL
 #endif
 
+static void validate_delayed_chunks(void);
+
 /* low bits of r->p determine size: 0 means >= page size and p->size holding
  *  real size, otherwise r->size is a shift count, or 1 for malloc(0)
  */
@@ -560,6 +563,12 @@ omalloc_init(struct dir_info **dp)
case 'J':
mopts.malloc_junk = 2;
break;
+   case 'v':
+   mopts.malloc_validate = 0;
+   break;
+   case 'V':
+   mopts.malloc_validate = 1;
+   break;
case 'n':
case 'N':
break;
@@ -608,6 +617,9 @@ omalloc_init(struct dir_info **dp)
}
}
 
+   if (!mopts.malloc_junk)
+   mopts.malloc_validate = 0;
+
 #ifdef MALLOC_STATS
if (mopts.malloc_stats && (atexit(malloc_exit) == -1)) {
static const char q[] = "malloc() warning: atexit(2) failed."
@@ -616,6 +628,12 @@ omalloc_init(struct dir_info **dp)
}
 #endif /* MALLOC_STATS */
 
+   if (mopts.malloc_validate && (atexit(validate_delayed_chunks) == -1)) {
+   static const char q[] = "malloc() warning: atexit(2) failed."
+   " Will not be able to check for use after free\n";
+   write(STDERR_FILENO, q, sizeof(q) - 1);
+   }
+
while ((mopts.malloc_canary = arc4random()) == 0)
;
 
@@ -1190,6 +1208,35 @@ malloc(size_t size)
 /*DEF_STRONG(malloc);*/
 
 static void
+validate_junk(void *p) {
+   struct region_info *r;
+   struct dir_info *pool = getpool();
+   size_t byte, sz;
+   if (p == NULL)
+   return;
+   r = find(pool, p);
+   if (r == NULL)
+   wrterror("bogus pointer in validate_junk", p);
+   REALSIZE(sz, r);
+   for (byte = 0; byte < sz; byte++) {
+   if (((char *)p)[byte] != SOME_FREEJUNK) {
+   wrterror("use after free", p);
+   return;
+   }
+   }
+}
+
+static void
+validate_delayed_chunks(void) {
+   struct dir_info *pool = getpool();
+   int i;
+   if (pool == NULL)
+   return;
+   for (i = 0; i < MALLOC_DELAYED_CHUNK_MASK + 1; i++)
+   validate_junk(pool->delayed_chunks[i]);
+}
+
+static void
 ofree(void *p)
 {
struct dir_info *pool = getpool();
@@ -1253,6 +1300,8 @@ ofree(void *p)
wrterror("double free", p);
return;
}
+   if (mopts.malloc_validate)
+   validate_junk(p);
pool->delayed_chunks[i] = tmp;
}
if (p != NULL) {



Re: enhanced use-after-free detection for malloc v2

2015-10-30 Thread Daniel Micay
On 26/10/15 04:19 PM, Daniel Micay wrote:
> This is an improved revision of my earlier patch.
> 
> It now validates the junk data in the delayed_chunks array in an atexit 
> handler
> too, rather than just when allocations are swapped out.
> 
> It will now catch this simple UAF 100% of the time:
> 
> #include <stddef.h>
> #include <stdlib.h>
> 
> int main(void) {
>   size_t i;
>   char *p;
>   for (i = 0; i < 32; i++) {
> p = malloc(16);
> if (!p) return 1;
>   }
> 
>   p = malloc(16);
>   if (!p) return 1;
>   free(p);
>   *p = 5;
> 
>   for (i = 0; i < 4; i++) {
> p = malloc(16);
> if (!p) return 1;
> free(p);
>   }
>   return 0;
> }
> 
> In general, it depends on the allocation still being in the delayed chunks
> array when the use-after-free happens. This means a larger delayed chunks
> array would improve the detection rate.

That revision was missing a NULL check as it got lost when refactoring
and the test cases weren't short enough to trigger it. Properly working
implementation:

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..c408594 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -182,6 +182,7 @@ struct malloc_readonly {
int malloc_freeunmap;   /* mprotect free pages PROT_NONE? */
int malloc_hint;/* call madvice on free pages?  */
int malloc_junk;/* junk fill? */
+   int malloc_validate;/* validate junk */
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
@@ -218,6 +219,8 @@ static void malloc_exit(void);
 #define CALLER NULL
 #endif

+static void validate_delayed_chunks(void);
+
 /* low bits of r->p determine size: 0 means >= page size and p->size holding
  *  real size, otherwise r->size is a shift count, or 1 for malloc(0)
  */
@@ -560,6 +563,12 @@ omalloc_init(struct dir_info **dp)
case 'J':
mopts.malloc_junk = 2;
break;
+   case 'v':
+   mopts.malloc_validate = 0;
+   break;
+   case 'V':
+   mopts.malloc_validate = 1;
+   break;
case 'n':
case 'N':
break;
@@ -608,6 +617,9 @@ omalloc_init(struct dir_info **dp)
}
}

+   if (!mopts.malloc_junk)
+   mopts.malloc_validate = 0;
+
 #ifdef MALLOC_STATS
if (mopts.malloc_stats && (atexit(malloc_exit) == -1)) {
static const char q[] = "malloc() warning: atexit(2) failed."
@@ -616,6 +628,12 @@ omalloc_init(struct dir_info **dp)
}
 #endif /* MALLOC_STATS */

+   if (mopts.malloc_validate && (atexit(validate_delayed_chunks) == -1)) {
+   static const char q[] = "malloc() warning: atexit(2) failed."
+   " Will not be able to check for use after free\n";
+   write(STDERR_FILENO, q, sizeof(q) - 1);
+   }
+
while ((mopts.malloc_canary = arc4random()) == 0)
;

@@ -1190,6 +1208,35 @@ malloc(size_t size)
 /*DEF_STRONG(malloc);*/

 static void
+validate_junk(void *p) {
+   struct region_info *r;
+   struct dir_info *pool = getpool();
+   size_t byte, sz;
+   if (p == NULL)
+   return;
+   r = find(pool, p);
+   if (r == NULL)
+   wrterror("bogus pointer in validate_junk", p);
+   REALSIZE(sz, r);
+   for (byte = 0; byte < sz; byte++) {
+   if (((char *)p)[byte] != SOME_FREEJUNK) {
+   wrterror("use after free", p);
+   return;
+   }
+   }
+}
+
+static void
+validate_delayed_chunks(void) {
+   struct dir_info *pool = getpool();
+   int i;
+   if (pool == NULL)
+   return;
+   for (i = 0; i < MALLOC_DELAYED_CHUNK_MASK + 1; i++)
+   validate_junk(pool->delayed_chunks[i]);
+}
+
+static void
 ofree(void *p)
 {
struct dir_info *pool = getpool();
@@ -1253,6 +1300,8 @@ ofree(void *p)
wrterror("double free", p);
return;
}
+   if (mopts.malloc_validate)
+   validate_junk(p);
pool->delayed_chunks[i] = tmp;
}
if (p != NULL) {



enhanced use-after-free detection for malloc v2

2015-10-26 Thread Daniel Micay
This is an improved revision of my earlier patch.

It now validates the junk data in the delayed_chunks array in an atexit handler
too, rather than just when allocations are swapped out.

It will now catch this simple UAF 100% of the time:

#include <stddef.h>
#include <stdlib.h>

int main(void) {
  size_t i;
  char *p;
  for (i = 0; i < 32; i++) {
p = malloc(16);
if (!p) return 1;
  }

  p = malloc(16);
  if (!p) return 1;
  free(p);
  *p = 5;

  for (i = 0; i < 4; i++) {
p = malloc(16);
if (!p) return 1;
free(p);
  }
  return 0;
}

In general, it depends on the allocation still being in the delayed chunks
array when the use-after-free happens. This means a larger delayed chunks
array would improve the detection rate.

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..4a635b6 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -182,6 +182,7 @@ struct malloc_readonly {
int malloc_freeunmap;   /* mprotect free pages PROT_NONE? */
int malloc_hint;/* call madvice on free pages?  */
int malloc_junk;/* junk fill? */
+   int malloc_validate;/* validate junk */
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
@@ -218,6 +219,8 @@ static void malloc_exit(void);
 #define CALLER NULL
 #endif
 
+static void validate_delayed_chunks(void);
+
 /* low bits of r->p determine size: 0 means >= page size and p->size holding
  *  real size, otherwise r->size is a shift count, or 1 for malloc(0)
  */
@@ -560,6 +563,12 @@ omalloc_init(struct dir_info **dp)
case 'J':
mopts.malloc_junk = 2;
break;
+   case 'v':
+   mopts.malloc_validate = 0;
+   break;
+   case 'V':
+   mopts.malloc_validate = 1;
+   break;
case 'n':
case 'N':
break;
@@ -608,6 +617,9 @@ omalloc_init(struct dir_info **dp)
}
}
 
+   if (!mopts.malloc_junk)
+   mopts.malloc_validate = 0;
+
 #ifdef MALLOC_STATS
if (mopts.malloc_stats && (atexit(malloc_exit) == -1)) {
static const char q[] = "malloc() warning: atexit(2) failed."
@@ -616,6 +628,12 @@ omalloc_init(struct dir_info **dp)
}
 #endif /* MALLOC_STATS */
 
+   if (mopts.malloc_validate && (atexit(validate_delayed_chunks) == -1)) {
+   static const char q[] = "malloc() warning: atexit(2) failed."
+   " Will not be able to check for use after free\n";
+   write(STDERR_FILENO, q, sizeof(q) - 1);
+   }
+
while ((mopts.malloc_canary = arc4random()) == 0)
;
 
@@ -1190,6 +1208,33 @@ malloc(size_t size)
 /*DEF_STRONG(malloc);*/
 
 static void
+validate_junk(void *p) {
+   struct region_info *r;
+   struct dir_info *pool = getpool();
+   size_t byte, sz;
+   r = find(pool, p);
+   if (r == NULL)
+   wrterror("bogus pointer in validate_junk", p);
+   REALSIZE(sz, r);
+   for (byte = 0; byte < sz; byte++) {
+   if (((char *)p)[byte] != SOME_FREEJUNK) {
+   wrterror("use after free", p);
+   return;
+   }
+   }
+}
+
+static void
+validate_delayed_chunks(void) {
+   struct dir_info *pool = getpool();
+   int i;
+   if (pool == NULL)
+   return;
+   for (i = 0; i < MALLOC_DELAYED_CHUNK_MASK + 1; i++)
+   validate_junk(pool->delayed_chunks[i]);
+}
+
+static void
 ofree(void *p)
 {
struct dir_info *pool = getpool();
@@ -1253,6 +1298,8 @@ ofree(void *p)
wrterror("double free", p);
return;
}
+   if (mopts.malloc_validate)
+   validate_junk(p);
pool->delayed_chunks[i] = tmp;
}
if (p != NULL) {



Re: support for malloc allocation canaries

2015-10-25 Thread Daniel Micay
On 25/10/15 06:20 PM, Ted Unangst wrote:
> Daniel Micay wrote:
>> This patch adds an opt-in malloc configuration option placing canaries after
>> small allocations to detect heap overflows on free(...). It's intended to be
>> used alongside guard pages for large allocations. Since it's essentially
>> adding extra padding to all small allocations, a small heap overflow will be
>> rendered harmless.
> 
> This is all very cool. I'd like to look more at it sometime soon. 

Probably worth looking at the use-after-free detection patch first since it's
simpler. And it's worth noting that the 2 features conflict with each other;
a small change is required to make them compatible:

diff --git a/libc/bionic/omalloc.c b/libc/bionic/omalloc.c
index 91ae559..dd39117 100644
--- a/libc/bionic/omalloc.c
+++ b/libc/bionic/omalloc.c
@@ -1371,6 +1371,8 @@ ofree(void *p)
size_t byte;
r = find(pool, p);
REALSIZE(sz, r);
+   if (sz > 0 && sz <= MALLOC_MAXCHUNK)
+   sz -= mopts.malloc_canaries;
for (byte = 0; byte < sz; byte++) {
if (((char *)p)[byte] != SOME_FREEJUNK) {
wrterror("use after free", p);



Re: enhanced use-after-free detection for malloc

2015-10-23 Thread Daniel Micay
Er, here it is without the screwed up whitespace (whoops):

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..7c33a7a 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -182,6 +182,7 @@ struct malloc_readonly {
int malloc_freeunmap;   /* mprotect free pages PROT_NONE? */
int malloc_hint;/* call madvice on free pages?  */
int malloc_junk;/* junk fill? */
+   int malloc_validate;/* validate junk */
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
@@ -560,6 +561,12 @@ omalloc_init(struct dir_info **dp)
case 'J':
mopts.malloc_junk = 2;
break;
+   case 'v':
+   mopts.malloc_validate = 0;
+   break;
+   case 'V':
+   mopts.malloc_validate = 1;
+   break;
case 'n':
case 'N':
break;
@@ -1253,6 +1260,17 @@ ofree(void *p)
wrterror("double free", p);
return;
}
+   if (mopts.malloc_junk && mopts.malloc_validate && p != NULL) {
+   size_t byte;
+   r = find(pool, p);
+   REALSIZE(sz, r);
+   for (byte = 0; byte < sz; byte++) {
+   if (((char *)p)[byte] != SOME_FREEJUNK) {
+   wrterror("use after free", p);
+   return;
+   }
+   }
+   }
pool->delayed_chunks[i] = tmp;
}
if (p != NULL) {
-- 
2.6.2



Re: support for malloc allocation canaries

2015-10-23 Thread Daniel Micay
On 23/10/15 07:22 AM, Janne Johansson wrote:
> With this patch on -current and
> ln -sf CJ /etc/malloc.conf
> a lot of things stopped working, like "man ls" and "tmux".
> With only C or only J, the system seems to work ok, but with both it
> doesn't.
> Will check the resulting core files.
> ntpd and tmux both bombed out on memset inside realloc if I read the
> dumps ok:

Thanks for testing it. I oversimplified handling the canary padding in
the orealloc malloc_junk == 2 branch while trying to clean up the code.

This should work properly:

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..567e56d 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -185,12 +185,14 @@ struct malloc_readonly {
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
+   size_t  malloc_canaries;/* use canaries after chunks? */
size_t  malloc_guard;   /* use guard pages after allocations? */
u_int   malloc_cache;   /* free pages we cache */
 #ifdef MALLOC_STATS
int malloc_stats;   /* dump statistics at end */
 #endif
u_int32_t malloc_canary;/* Matched against ones in malloc_pool */
+   uintptr_t malloc_chunk_canary;
 };
 
 /* This object is mapped PROT_READ after initialisation to prevent tampering */
@@ -526,6 +528,12 @@ omalloc_init(struct dir_info **dp)
case 'A':
mopts.malloc_abort = 1;
break;
+   case 'c':
+   mopts.malloc_canaries = 0;
+   break;
+   case 'C':
+   mopts.malloc_canaries = sizeof(void *);
+   break;
 #ifdef MALLOC_STATS
case 'd':
mopts.malloc_stats = 0;
@@ -619,6 +627,9 @@ omalloc_init(struct dir_info **dp)
while ((mopts.malloc_canary = arc4random()) == 0)
;
 
+   arc4random_buf(&mopts.malloc_chunk_canary,
+   sizeof(mopts.malloc_chunk_canary));
+
/*
 * Allocate dir_info with a guard page on either side. Also
 * randomise offset inside the page at which the dir_info
@@ -984,8 +995,15 @@ malloc_bytes(struct dir_info *d, size_t size, void *f)
k += (lp - bp->bits) * MALLOC_BITS;
k <<= bp->shift;
 
+   if (mopts.malloc_canaries && bp->size > 0) {
+   char *end = (char *)bp->page + k + bp->size;
+   uintptr_t *canary = (uintptr_t *)(end - mopts.malloc_canaries);
+   *canary = mopts.malloc_chunk_canary ^ hash(canary);
+   }
+
if (mopts.malloc_junk == 2 && bp->size > 0)
-   memset((char *)bp->page + k, SOME_JUNK, bp->size);
+   memset((char *)bp->page + k, SOME_JUNK,
+   bp->size - mopts.malloc_canaries);
return ((char *)bp->page + k);
 }
 
@@ -999,6 +1017,13 @@ find_chunknum(struct dir_info *d, struct region_info *r, void *ptr)
if (info->canary != d->canary1)
wrterror("chunk info corrupted", NULL);
 
+   if (mopts.malloc_canaries && info->size > 0) {
+   char *end = (char *)ptr + info->size;
+   uintptr_t *canary = (uintptr_t *)(end - mopts.malloc_canaries);
+   if (*canary != (mopts.malloc_chunk_canary ^ hash(canary)))
+   wrterror("chunk canary corrupted", ptr);
+   }
+
/* Find the chunk number on the page */
chunknum = ((uintptr_t)ptr & MALLOC_PAGEMASK) >> info->shift;
 
@@ -1121,7 +1146,7 @@ omalloc(size_t sz, int zero_fill, void *f)
/* takes care of SOME_JUNK */
p = malloc_bytes(pool, sz, f);
if (zero_fill && p != NULL && sz > 0)
-   memset(p, 0, sz);
+   memset(p, 0, sz - mopts.malloc_canaries);
}
 
return p;
@@ -1176,6 +1201,8 @@ malloc(size_t size)
malloc_recurse();
return NULL;
}
+   if (size > 0 && size <= MALLOC_MAXCHUNK)
+   size += mopts.malloc_canaries;
r = omalloc(size, 0, CALLER);
malloc_active--;
_MALLOC_UNLOCK();
@@ -1242,7 +1269,7 @@ ofree(void *p)
int i;
 
if (mopts.malloc_junk && sz > 0)
-   memset(p, SOME_FREEJUNK, sz);
+   memset(p, SOME_FREEJUNK, sz - mopts.malloc_canaries);
if (!mopts.malloc_freenow) {
if (find_chunknum(pool, r, p) == -1)
return;
@@ -1386,16 +1413,25 @@ gotit:
}
}
if (newsz <= oldsz && newsz > oldsz / 2 && !mopts.malloc_realloc) {
-   if (mopts.malloc_junk == 2 && newsz > 0)
-   memset((char 

enhanced use-after-free detection for malloc

2015-10-23 Thread Daniel Micay
This patch adds a form of use-after-free detection based on validating that
the junk data is still in place when swapping out an allocation from the
delayed chunk cache. It will probably nearly double the cost of the junk-on-free
feature that's enabled by default, since it needs a whole extra pass over the
data, so it's split out into a separate option instead of always being done.
It can catch UAF issues when Freeguard / Free unmap will not, since those
can only kick in when a whole page is cleared out.

This could be extended to also check that the data is *either* zeroed or junk
when handing out a chunk as a new allocation. That would add an additional
pass over the data, so perhaps this should be given 3 levels like the junk
free feature (like this patch by default, disabled with v, also validating the
data when doing allocation with V).

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..7c33a7a 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -182,6 +182,7 @@ struct malloc_readonly {
int malloc_freeunmap;   /* mprotect free pages PROT_NONE? */
int malloc_hint;/* call madvice on free pages?  */
int malloc_junk;/* junk fill? */
+   int malloc_validate;/* validate junk */
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
@@ -560,6 +561,12 @@ omalloc_init(struct dir_info **dp)
case 'J':
mopts.malloc_junk = 2;
break;
+   case 'v':
+   mopts.malloc_validate = 0;
+   break;
+   case 'V':
+   mopts.malloc_validate = 1;
+   break;
case 'n':
case 'N':
break;
@@ -1253,6 +1260,17 @@ ofree(void *p)
wrterror("double free", p);
return;
}
+   if (mopts.malloc_junk && mopts.malloc_validate && p != NULL) {
+   size_t byte;
+   r = find(pool, p);
+   REALSIZE(sz, r);
+   for (byte = 0; byte < sz; byte++) {
+   if (((char *)p)[byte] != SOME_FREEJUNK) {
+   wrterror("use after free", p);
+   return;
+   }
+   }
+   }
pool->delayed_chunks[i] = tmp;
}
if (p != NULL) {



support for malloc allocation canaries

2015-10-22 Thread Daniel Micay
Hi,

This patch adds an opt-in malloc configuration option placing canaries after
small allocations to detect heap overflows on free(...). It's intended to be
used alongside guard pages for large allocations. Since it's essentially
adding extra padding to all small allocations, a small heap overflow will be
rendered harmless.

The current implementation uses pointer-size canaries, but it could be easily
extended to allow bumping up the size of the canaries by passing the option
multiple times. The entry points into malloc account for the canary size when
it's enabled and then it's generated on allocation and checked on free. Small
allocations without room for a canary are simply turned into large
allocations. Some care needs to be taken to avoid clobbering the canary in the
junk filling code and realloc copying.

The canary is placed at the very end of the memory allocations so there will
often be slack space in between the real allocation and the canary preventing
small overflows from being detected. It would be much better at detecting
corruption with finer-grained size classes. The extreme would be every
multiple of the alignment, but logarithmic growth would be more realistic (see
jemalloc's size classes). Finer-grained size classes would also reduce the
memory overhead caused by allocations being pushed into the next size class by
the canary.

The canaries are currently generated with canary_value ^ hash(canary_address).
It would be best to avoid involving addresses to avoid introducing address
leaks via read overflows where there were none before, but it's the easiest
way to get unique canaries and is a minor issue to improve down the road.

I implemented this feature after porting OpenBSD malloc to Android (in
CopperheadOS) and it has found a few bugs in the app ecosystem. Note that I've
only heavily tested it there, not on OpenBSD itself. I'm not sure if you want
this feature but it seemed worth submitting.

Hopefully you don't mind a patch generated with Git. :)

diff --git a/stdlib/malloc.c b/stdlib/malloc.c
index 424dd77..65b5027 100644
--- a/stdlib/malloc.c
+++ b/stdlib/malloc.c
@@ -185,12 +185,14 @@ struct malloc_readonly {
int malloc_move;/* move allocations to end of page? */
int malloc_realloc; /* always realloc? */
int malloc_xmalloc; /* xmalloc behaviour? */
+   size_t  malloc_canaries;/* use canaries after chunks? */
size_t  malloc_guard;   /* use guard pages after allocations? */
u_int   malloc_cache;   /* free pages we cache */
 #ifdef MALLOC_STATS
int malloc_stats;   /* dump statistics at end */
 #endif
u_int32_t malloc_canary;/* Matched against ones in malloc_pool */
+   uintptr_t malloc_chunk_canary;
 };
 
 /* This object is mapped PROT_READ after initialisation to prevent tampering */
@@ -526,6 +528,12 @@ omalloc_init(struct dir_info **dp)
case 'A':
mopts.malloc_abort = 1;
break;
+   case 'c':
+   mopts.malloc_canaries = 0;
+   break;
+   case 'C':
+   mopts.malloc_canaries = sizeof(void *);
+   break;
 #ifdef MALLOC_STATS
case 'd':
mopts.malloc_stats = 0;
@@ -619,6 +627,9 @@ omalloc_init(struct dir_info **dp)
while ((mopts.malloc_canary = arc4random()) == 0)
;
 
+   arc4random_buf(&mopts.malloc_chunk_canary,
+   sizeof(mopts.malloc_chunk_canary));
+
/*
 * Allocate dir_info with a guard page on either side. Also
 * randomise offset inside the page at which the dir_info
@@ -984,8 +995,15 @@ malloc_bytes(struct dir_info *d, size_t size, void *f)
k += (lp - bp->bits) * MALLOC_BITS;
k <<= bp->shift;
 
+   if (mopts.malloc_canaries && bp->size > 0) {
+   char *end = (char *)bp->page + k + bp->size;
+   uintptr_t *canary = (uintptr_t *)(end - mopts.malloc_canaries);
+   *canary = mopts.malloc_chunk_canary ^ hash(canary);
+   }
+
if (mopts.malloc_junk == 2 && bp->size > 0)
-   memset((char *)bp->page + k, SOME_JUNK, bp->size);
+   memset((char *)bp->page + k, SOME_JUNK,
+   bp->size - mopts.malloc_canaries);
return ((char *)bp->page + k);
 }
 
@@ -999,6 +1017,13 @@ find_chunknum(struct dir_info *d, struct region_info *r, void *ptr)
if (info->canary != d->canary1)
wrterror("chunk info corrupted", NULL);
 
+   if (mopts.malloc_canaries && info->size > 0) {
+   char *end = (char *)ptr + info->size;
+   uintptr_t *canary = (uintptr_t *)(end - mopts.malloc_canaries);
+   if (*canary != (mopts.malloc_chunk_canary ^ hash(canary)))
+