Re: pool page colouring
On 30 Oct 2014, at 07:52, Ted Unangst t...@tedunangst.com wrote:
> On Wed, Oct 29, 2014 at 07:25, David Gwynne wrote:
>> i dunno. im fine with either removing colouring altogether or setting
>> it from something else completely. i just want a decision to be made
>> cos right now ph_color isnt set, which is a bug.
>
> there. i fixed it.

looks like we were both ignorant and wrong. mikeb@ points out this from
the original slab paper:

    4.1. Impact of Buffer Address Distribution on Cache Utilization

    The address distribution of mid-size buffers can affect the system's
    overall cache utilization. In particular, power-of-two allocators -
    where all buffers are 2^n bytes and are 2^n-byte aligned - are
    pessimal. Suppose, for example, that every inode (~300 bytes) is
    assigned a 512-byte buffer, 512-byte aligned, and that only the first
    dozen fields of an inode (48 bytes) are frequently referenced. Then
    the majority of inode-related memory traffic will be at addresses
    between 0 and 47 modulo 512. Thus the cache lines near 512-byte
    boundaries will be heavily loaded while the rest lie fallow. In
    effect only 9% (48/512) of the cache will be usable by inodes.
    Fully-associative caches would not suffer this problem, but current
    hardware trends are toward simpler rather than more complex caches.

    4.3. Slab Coloring

    The slab allocator incorporates a simple coloring scheme that
    distributes buffers evenly throughout the cache, resulting in
    excellent cache utilization and bus balance. The concept is simple:
    each time a new slab is created, the buffer addresses start at a
    slightly different offset (color) from the slab base (which is always
    page-aligned). For example, for a cache of 200-byte objects with
    8-byte alignment, the first slab's buffers would be at addresses 0,
    200, 400, ... relative to the slab base. The next slab's buffers
    would be at offsets 8, 208, 408, ... and so on. The maximum slab
    color is determined by the amount of unused space in the slab.
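the offset arithmetic in the quoted example can be sketched in C. this is
a toy model with made-up names, assuming a 4096-byte slab and no page
header, not the paper's or the kernel's actual code:

```c
#include <assert.h>

#define SLAB_SIZE 4096	/* assumed page/slab size for the sketch */

/*
 * Toy model of slab colouring as described above: each successive slab
 * starts its buffers at the next colour offset, cycling through the
 * slack space left after packing whole objects into the slab.
 */
unsigned int
slab_color_offset(unsigned int slab_no, unsigned int objsize,
    unsigned int align)
{
	unsigned int nobj = SLAB_SIZE / objsize;	 /* whole objects per slab */
	unsigned int slack = SLAB_SIZE - nobj * objsize; /* unused tail bytes */
	unsigned int ncolors = slack / align + 1;	 /* colours 0..slack/align */

	return (slab_no % ncolors) * align;
}
```

for 200-byte objects with 8-byte alignment this gives slab 0 an offset of
0 (buffers at 0, 200, 400, ...) and slab 1 an offset of 8 (8, 208,
408, ...), matching the paper's example, wrapping once the 96 bytes of
slack are used up.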
we run on enough different machines that i think we should consider this.

so the question is if we do bring colouring back, how do we calculate it?
arc4random? mask bits off ph_magic? atomic_inc something in the pool? read
a counter from the pool? shift bits off the page address?
Re: pool page colouring
On 5 November 2014 01:12, Mike Belopuhov m...@belopuhov.com wrote:
> well, first of all, right now this is a rather theoretical gain. we need
> to test it to understand if it makes things easier.

err. i meant to say go faster not easier.
Re: pool page colouring
On 5 November 2014 00:38, David Gwynne da...@gwynne.id.au wrote:
> looks like we were both ignorant and wrong. mikeb@ points out this from
> the original slab paper:
>
> [slab paper excerpt snipped]
>
> we run on enough different machines that i think we should consider
> this.

well, first of all, right now this is a rather theoretical gain. we need
to test it to understand if it makes things easier. to see cache
statistics we can use performance counters, however current pctr code
might be a bit out of date.

> so the question is if we do bring colouring back, how do we calculate
> it? arc4random? mask bits off ph_magic? atomic_inc something in the
> pool? read a counter from the pool? shift bits off the page address?

the way i read it is that you have a per-pool running value pr_color that
you increment by the item alignment or native cache line size modulo space
available for every page you are getting from uvm. however i can see that
it might entail a problem by locating a page header (or was it page
boundary? don't have the code at hand) using simple math.
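mikeb's reading - a per-pool running colour bumped for every page and
wrapping at the available slack - could be sketched like this. the struct
and helper are hypothetical stand-ins, not the kernel code, and a real
version would need the pool mutex or an atomic op around the increment:

```c
#include <assert.h>

/* minimal stand-in for the relevant pool fields */
struct pool_sketch {
	unsigned int pr_color;		/* running colour counter */
	unsigned int pr_maxcolors;	/* (slack / align) + 1 */
	unsigned int pr_align;		/* item alignment */
};

/*
 * Pick the colour offset for the next page: use the current counter,
 * then advance it, wrapping at the number of available colours.
 */
unsigned int
pool_next_color(struct pool_sketch *pp)
{
	unsigned int c = pp->pr_color;

	pp->pr_color = (c + 1) % pp->pr_maxcolors;
	return c * pp->pr_align;
}
```

each page taken from uvm would then place its first item at
page address + pool_next_color(pp).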
Re: pool page colouring
On 5 Nov 2014, at 10:12, Mike Belopuhov m...@belopuhov.com wrote:
> well, first of all, right now this is a rather theoretical gain. we need
> to test it to understand if it makes things easier. to see cache
> statistics we can use performance counters, however current pctr code
> might be a bit out of date.

pctr is x86 specific though. how would you measure on all the other archs?

> the way i read it is that you have a per-pool running value pr_color
> that you increment by the item alignment or native cache line size
> modulo space available for every page you are getting from uvm. however
> i can see that it might entail a problem by locating a page header (or
> was it page boundary? don't have the code at hand) using simple math.

the stuff that finds a page header for a page doesnt care about the
address of individual items within a page, and colouring doesnt change an
item being wholly contained within a page. ive run with arc4random_uniform
coloured addresses for a couple of weeks now without problems of that
nature.
Re: pool page colouring
On 5 Nov 2014, at 10:27, David Gwynne da...@gwynne.id.au wrote:
> pctr is x86 specific though. how would you measure on all the other
> archs?

i would argue that page colouring was in the code before, so it should be
now unless it can be proven useless. the cost of putting it back in terms
of code is minimal, the only question has been how do we pick the colour
without holding the pools mutex?
Re: pool page colouring
On Wed, Nov 05, 2014 at 09:38, David Gwynne wrote:
> looks like we were both ignorant and wrong. mikeb@ points out this from
> the original slab paper:
>
> [slab paper excerpt snipped]
>
> Suppose, for example, that every inode (~300 bytes) is assigned a
> 512-byte buffer, 512-byte aligned, and that only the first dozen fields
> of an inode (48 bytes) are frequently referenced. [...] In effect only
> 9% (48/512) of the cache will be usable by inodes.

except pool won't align a 300 byte inode on 512 byte boundaries.
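tedu's objection can be made concrete: a power-of-two allocator rounds a
~300-byte inode up to 512 bytes with 512-byte alignment, while pool only
rounds the item size up to a multiple of its alignment, so item start
addresses already land at many different offsets relative to the cache
geometry. a sketch with illustrative helpers (not pool's actual sizing
code):

```c
#include <assert.h>
#include <stddef.h>

/* round up to the next power of two, as a power-of-two allocator would */
size_t
pow2_roundup(size_t sz)
{
	size_t p = 1;

	while (p < sz)
		p <<= 1;
	return p;
}

/* round up to a multiple of the alignment only, closer to what pool does */
size_t
align_roundup(size_t sz, size_t align)
{
	return (sz + align - 1) & ~(align - 1);
}
```

for a 300-byte item the first gives 512-byte buffers (the paper's pessimal
case), the second gives 304-byte buffers at 8-byte alignment, which spread
naturally across cache lines.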
Re: pool page colouring
On 5 Nov 2014, at 10:58, Ted Unangst t...@tedunangst.com wrote:
> except pool won't align a 300 byte inode on 512 byte boundaries.

im not sure thats the point of the example.
Re: pool page colouring
On Tue, Oct 28, 2014 at 16:49, David Gwynne wrote:
> when i shuffled the locking in pools around, page colouring was left
> behind.
>
> page colouring is where you offset items within a page if you have
> enough slack space. the previous implementation simply incremented the
> colour so each new page got the next offset. i didnt do this because the
> page and its items are now initted outside the lock, so maintaining that
> curcolour iterator wasnt as easy.
>
> this sidesteps the curcolor maintenance by just having each page
> randomly pick a colour when it's set up.
>
> tests? ok?

So after all that we're back to this (but updated to apply since I broke
it)? ok, why not? I was trying to save us the trouble, but maybe that was
a bad idea. ok with me.

Index: kern/subr_pool.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.163
diff -u -p -r1.163 subr_pool.c
--- kern/subr_pool.c	13 Oct 2014 00:12:51 -0000	1.163
+++ kern/subr_pool.c	28 Oct 2014 03:05:50 -0000
@@ -299,8 +299,7 @@ pool_init(struct pool *pp, size_t size,
 	 */
 	space = POOL_INPGHDR(pp) ? pp->pr_phoffset : pp->pr_pgsize;
 	space -= pp->pr_itemsperpage * pp->pr_size;
-	pp->pr_maxcolor = (space / align) * align;
-	pp->pr_curcolor = 0;
+	pp->pr_maxcolors = (space / align) + 1;
 
 	pp->pr_nget = 0;
 	pp->pr_nfail = 0;
@@ -750,6 +749,8 @@ pool_p_alloc(struct pool *pp, int flags)
 	XSIMPLEQ_INIT(&ph->ph_itemlist);
 	ph->ph_page = addr;
+	ph->ph_colored = addr +
+	    arc4random_uniform(pp->pr_maxcolors) * pp->pr_align;
 	ph->ph_nmissing = 0;
 	arc4random_buf(&ph->ph_magic, sizeof(ph->ph_magic));
 #ifdef DIAGNOSTIC
@@ -760,6 +761,7 @@ pool_p_alloc(struct pool *pp, int flags)
 	CLR(ph->ph_magic, POOL_MAGICBIT);
 #endif /* DIAGNOSTIC */
 
+	addr = ph->ph_colored;
 	n = pp->pr_itemsperpage;
 	while (n--) {
 		pi = (struct pool_item *)addr;
@@ -996,8 +998,8 @@ pool_print_pagelist(struct pool_pagelist
 	struct pool_item *pi;
 
 	LIST_FOREACH(ph, pl, ph_pagelist) {
-		(*pr)("\t\tpage %p, nmissing %d\n",
-		    ph->ph_page, ph->ph_nmissing);
+		(*pr)("\t\tpage %p, color %p, nmissing %d\n",
+		    ph->ph_page, ph->ph_colored, ph->ph_nmissing);
 		XSIMPLEQ_FOREACH(pi, &ph->ph_itemlist, pi_list) {
 			if (pi->pi_magic != POOL_IMAGIC(ph, pi)) {
 				(*pr)("\t\t\titem %p, magic 0x%lx\n",
@@ -1021,8 +1023,8 @@ pool_print1(struct pool *pp, const char
 		modif++;
 	}
 
-	(*pr)("POOL %s: size %u, align %u, roflags 0x%08x\n",
-	    pp->pr_wchan, pp->pr_size, pp->pr_align,
+	(*pr)("POOL %s: size %u, align %u, maxcolors %u, roflags 0x%08x\n",
+	    pp->pr_wchan, pp->pr_size, pp->pr_align, pp->pr_maxcolors,
 	    pp->pr_roflags);
 	(*pr)("\talloc %p\n", pp->pr_alloc);
 	(*pr)("\tminitems %u, minpages %u, maxpages %u, npages %u\n",
Index: sys/pool.h
===================================================================
RCS file: /cvs/src/sys/sys/pool.h,v
retrieving revision 1.53
diff -u -p -r1.53 pool.h
--- sys/pool.h	22 Sep 2014 01:04:58 -0000	1.53
+++ sys/pool.h	28 Oct 2014 03:05:50 -0000
@@ -128,8 +128,7 @@ struct pool {
 	RB_HEAD(phtree, pool_item_header)
 			pr_phtree;
 
-	int		pr_maxcolor;	/* Cache colouring */
-	int		pr_curcolor;
+	u_int		pr_maxcolors;	/* Cache colouring */
 	int		pr_phoffset;	/* Offset in page of page header */
 
 	/*
Re: pool page colouring
On Wed, Oct 29, 2014 at 07:25, David Gwynne wrote:
> if you want it to go fast, it would make more sense to set the item
> alignment in pool_init to the size of the cacheline. colouring would
> then become irrelevant from a speed perspective.

There's some sense to this. Like round everything to nearest 64, except
things less than 64 (round to 16 or 32).
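the rounding rule tedu sketches - everything to the nearest cache line,
small items to 16 or 32 - might look like this. the helper is
hypothetical and assumes 64-byte cache lines:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cache line size */

/*
 * Round an item size up to a whole cache line, except for small items,
 * which round to 16 or 32 so tiny pools don't waste most of a line.
 */
size_t
round_item_size(size_t sz)
{
	if (sz <= 16)
		return 16;
	if (sz <= 32)
		return 32;
	return (sz + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
}
```

with item sizes rounded like this, every item starts on a cache line
boundary (or packs 2-4 per line for the small sizes), so colouring buys
nothing for speed, which is the trade-off being discussed.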
Re: pool page colouring
On Wed, Oct 29, 2014 at 07:25, David Gwynne wrote:
> i dunno. im fine with either removing colouring altogether or setting it
> from something else completely. i just want a decision to be made cos
> right now ph_color isnt set, which is a bug.

there. i fixed it.

Index: kern/subr_pool.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_pool.c,v
retrieving revision 1.163
diff -u -p -r1.163 subr_pool.c
--- kern/subr_pool.c	13 Oct 2014 00:12:51 -0000	1.163
+++ kern/subr_pool.c	29 Oct 2014 21:49:38 -0000
@@ -82,7 +82,6 @@ struct pool_item_header {
 				ph_node;	/* Off-page page headers */
 	int			ph_nmissing;	/* # of chunks in use */
 	caddr_t			ph_page;	/* this page's address */
-	caddr_t			ph_colored;	/* page's colored address */
 	u_long			ph_magic;
 };
 #define POOL_MAGICBIT (1 << 3) /* keep away from perturbed low bits */
@@ -217,7 +216,7 @@ void
 pool_init(struct pool *pp, size_t size, u_int align, u_int ioff, int flags,
     const char *wchan, struct pool_allocator *palloc)
 {
-	int off = 0, space;
+	int off = 0;
 	unsigned int pgsize = PAGE_SIZE, items;
 #ifdef DIAGNOSTIC
 	struct pool *iter;
@@ -293,15 +292,6 @@ pool_init(struct pool *pp, size_t size,
 	pp->pr_hardlimit_warning_last.tv_usec = 0;
 	RB_INIT(&pp->pr_phtree);
 
-	/*
-	 * Use the space between the chunks and the page header
-	 * for cache coloring.
-	 */
-	space = POOL_INPGHDR(pp) ? pp->pr_phoffset : pp->pr_pgsize;
-	space -= pp->pr_itemsperpage * pp->pr_size;
-	pp->pr_maxcolor = (space / align) * align;
-	pp->pr_curcolor = 0;
-
 	pp->pr_nget = 0;
 	pp->pr_nfail = 0;
 	pp->pr_nput = 0;
@@ -1232,7 +1222,7 @@ pool_walk(struct pool *pp, int full,
 	int n;
 
 	LIST_FOREACH(ph, &pp->pr_fullpages, ph_pagelist) {
-		cp = ph->ph_colored;
+		cp = ph->ph_page;
 		n = ph->ph_nmissing;
 
 		while (n--) {
@@ -1242,7 +1232,7 @@ pool_walk(struct pool *pp, int full,
 	}
 
 	LIST_FOREACH(ph, &pp->pr_partpages, ph_pagelist) {
-		cp = ph->ph_colored;
+		cp = ph->ph_page;
 		n = ph->ph_nmissing;
 
 		do {
Index: sys/pool.h
===================================================================
RCS file: /cvs/src/sys/sys/pool.h,v
retrieving revision 1.53
diff -u -p -r1.53 pool.h
--- sys/pool.h	22 Sep 2014 01:04:58 -0000	1.53
+++ sys/pool.h	29 Oct 2014 21:49:43 -0000
@@ -128,8 +128,6 @@ struct pool {
 	RB_HEAD(phtree, pool_item_header)
 			pr_phtree;
 
-	int		pr_maxcolor;	/* Cache colouring */
-	int		pr_curcolor;
 	int		pr_phoffset;	/* Offset in page of page header */
 
 	/*
Re: pool page colouring
On 29 October 2014 22:52, Ted Unangst t...@tedunangst.com wrote:
> there. i fixed it.

so is there any performance difference?
Re: pool page colouring
On Tue, Oct 28, 2014 at 16:49, David Gwynne wrote:
> this sidesteps the curcolor maintenance by just having each page
> randomly pick a colour when it's set up.

Would it make more sense to use the page address to pick the color?
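deriving the colour from the page address, as suggested here, needs no
per-pool state and no randomness. a sketch with a hypothetical helper,
assuming 4096-byte pages:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed 4096-byte pages */

/*
 * Pick a colour offset from the page's own address: drop the in-page
 * offset bits (always zero for a page-aligned address) and reduce the
 * page number modulo the number of available colours.
 */
unsigned int
page_addr_color(uintptr_t page_addr, unsigned int maxcolors,
    unsigned int align)
{
	return (unsigned int)((page_addr >> PAGE_SHIFT) % maxcolors) * align;
}
```

consecutive pages from uvm would then get consecutive colours for free,
much like the old incrementing curcolor, without any shared counter to
serialise on.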
Re: pool page colouring
On 28 October 2014 17:02, Ted Unangst t...@tedunangst.com wrote:
> Would it make more sense to use the page address to pick the color?

Does it actually still make sense to keep page coloring? Is there still
benefit on modern hardware?
Re: pool page colouring
On 29 Oct 2014, at 2:44 am, Mike Belopuhov m...@belopuhov.com wrote:
> On 28 October 2014 17:02, Ted Unangst t...@tedunangst.com wrote:
>> Would it make more sense to use the page address to pick the color?

yeah. or we could derive it from a counter in the pool like the item or
page get counters

> Does it actually still make sense to keep page coloring? Is there still
> benefit on modern hardware?

if you want it to go fast, it would make more sense to set the item
alignment in pool_init to the size of the cacheline. colouring would then
become irrelevant from a speed perspective.

however, if colouring is more about perturbing item addresses then it may
still be worth it. eg, if you only fit one item on a page, without
colouring your item addresses will always be on a page boundary. moving it
around might flush out assumptions about low bits in addresses.

i dunno. im fine with either removing colouring altogether or setting it
from something else completely. i just want a decision to be made cos
right now ph_color isnt set, which is a bug.