Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-29 Thread Indan Zupancic
On Tue, August 29, 2006 11:49, Peter Zijlstra said:
> On Tue, 2006-08-29 at 02:01 +0200, Indan Zupancic wrote:
>> Good that you're aware of it. Thing is, how much sense does the split-up into
>> adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why 
>> not
>> merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and 
>> only
>> add adjust_memalloc_reserve() when it's really needed? It saves an export.
>
> mm/ vs net/core/

I thought var_free_kbytes was exported too, but it isn't, so merging those two
won't save an export, scrap the idea.

If sk_adjust_memalloc() can be called with socks == 0 then a check for that 
needs
to be added, or else bugginess ensues (my fault ;-):

void sk_adjust_memalloc(int socks, int tx_reserve_pages)
{
unsigned long flags;
int nr_socks;

if (socks){
nr_socks = atomic_add_return(socks, &vmio_socks);
BUG_ON(nr_socks < 0);

if (nr_socks == socks)
tx_reserve_pages += RX_RESERVE_PAGES;
if (nr_socks == 0)
tx_reserve_pages -= RX_RESERVE_PAGES;
}
spin_lock_irqsave(&memalloc_lock, flags);

adjust_memalloc_reserve(tx_reserve_pages);

spin_unlock_irqrestore(&memalloc_lock, flags);
}


> Well, perhaps, its merit is that is allows for full service for a few
> sockets even under severe memory pressure. And it provides the
> primitives to solve this problem for all instances, (NBD, iSCSI, NFS,
> AoE, ...) hence framework.
>
> Evgeniy's allocator does not cater for this, so even if it were to
> replace all the allocation stuff, we would still need the SOCK_VMIO and
> all protocol hooks this patch introduces.

At least what I understood of it, the appealing thing was that it would make
the whole networking stack robust, as network operations will never fail.
So not only for VMIO sockets, but for all. Only thing that's missing then is
a way to guarantee that there's always forward progress. The ironic thing
is that to achieve that skbs/mem needs to be dropped/refused again, in the
case of blocking senders.

VMIO makes the choice what to refuse by making some skbs/sockets more important
than others and, indirectly, the kernel more important than userspace. A
different choice could be made. If one can be made which works for all, then
the VM deadlock problem is solved too.

Great if the VM deadlock problem is solved, but what about user space 
applications
that want to keep working under heavy memory pressure? Say, more or less all
decently written network servers. For those incoming data in general means more
work to do, and hence more memory to use and data to send. What they need is
receive throttling under memory pressure so that enough free memory is available
to handle the data that is received. A reliable network stack gives them nothing
if after receiving the data they can't do anything with it because the free
memory is up. So in that regard splitting the network allocator from the normal
allocator might only increase the problem.

(Swap only makes matters worse, as a program will slow down when it's used under
heavy memory pressure without knowing that there's any memory pressure. Failed
memory allocations are a good sign that there's memory pressure, but 
unfortunately
when that point is reached all performance is gone already and the application
can't really do anything useful, as it can't allocate nay memory. Something like
MAP_NONBLOCK, but which works with MAP_ANONYMOUS and let the mmap fail if swap
would be used may be helpful, because then the flow of new work can be throttled
earlier by the app. But coordination with malloc would be better.
Of course most can already be done by tuning the system and careful coding.
Sucks to be userspace I guess.)

>> - If Evgeniy's network allocator is as good as it looks, then why can't it
>> replace the existing one? Just adding private subsystem specific memory
>> allocators seems wrong. I might be missing the big picture, but it looks
>> like memory allocator things should be at least synchronized and discussed
>> with Christoph Lameter and his "modular slab allocator" patch.
>
> SLAB is very very good in that is will not suffer from external
> fragmentation (one could suffer from external fragmentation when viewing
> the slab allocator from the page allocation layer - but most of that is
> avoidable by allocation strategies in the slab layer), it does however
> suffer from internal fragmentation - by design.
>
> For variable size allocators it has been proven that for each allocator
> there is an allocation pattern that will defeat it. And figuring the
> pattern out and proving it will not happen in a long-running system is
> hard hard work.
>
> (free block coalescence is not a guarantee against fragmentation; there
> is even evidence that delayed coalescence will reduce fragmentation - it
> introduces history and this extra information can

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-29 Thread Peter Zijlstra
On Tue, 2006-08-29 at 02:01 +0200, Indan Zupancic wrote:
> On Mon, August 28, 2006 19:32, Peter Zijlstra said:

> > Ah, no accident there, I'm fully aware that there would need to be a
> > spinlock in adjust_memalloc_reserve() if there were another caller.
> > (I even had it there for some time) - added comment.
> 
> Good that you're aware of it. Thing is, how much sense does the split-up into
> adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
> merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
> add adjust_memalloc_reserve() when it's really needed? It saves an export.

mm/ vs net/core/

> Better to put the lock next to min_free_kbytes, both for readability and
> cache behaviour. And it satisfies the "lock data, not code" mantra.

True enough.

> If you prefer to avoid cmpxchg (which is often used in atomic_add_unless
> and can be expensive) then you can use something like:

Yes, way too large, out of lined it already. Don't care about the
cmpxchg, its not a fast path anyway.

> > @@ -195,6 +196,86 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
> >  /* Maximal space eaten by iovec or ancilliary data plus some space */
> >  int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
> >
> > +static DEFINE_SPINLOCK(memalloc_lock);
> > +static int memalloc_reserve;
> > +static unsigned int vmio_request_queues;
> > +
> > +atomic_t vmio_socks;
> > +atomic_t emergency_rx_pages_used;
> > +EXPORT_SYMBOL_GPL(vmio_socks);
> 
> Is this export needed? It's only used in net/core/skbuff.c and 
> net/core/sock.c,
> which are compiled into one module.
> 
> > +EXPORT_SYMBOL_GPL(emergency_rx_pages_used);
> 
> Same here. It's only used by code in sock.c and skbuff.c, and no external
> code calls emergency_rx_alloc(), nor emergency_rx_free().

Good point, I've gone over the link relations of these things and was
indeed capable of removing several EXPORTs. Thanks.

> I think I depleted my usefulness, there isn't much left to say for me.
> It's up to the big guys to decide about the merrit of this patch.

Thanks for all your feedback.

> IMHO:
> 
> - This patch isn't really a framework, more a minimal fix for one specific,
> though important problem. But it's small and doesn't have much impact

Well, perhaps, its merit is that is allows for full service for a few
sockets even under severe memory pressure. And it provides the
primitives to solve this problem for all instances, (NBD, iSCSI, NFS,
AoE, ...) hence framework.

Evgeniy's allocator does not cater for this, so even if it were to
replace all the allocation stuff, we would still need the SOCK_VMIO and
all protocol hooks this patch introduces.

> - If Evgeniy's network allocator is as good as it looks, then why can't it
> replace the existing one? Just adding private subsystem specific memory
> allocators seems wrong. I might be missing the big picture, but it looks
> like memory allocator things should be at least synchronized and discussed
> with Christoph Lameter and his "modular slab allocator" patch.

SLAB is very very good in that is will not suffer from external
fragmentation (one could suffer from external fragmentation when viewing
the slab allocator from the page allocation layer - but most of that is
avoidable by allocation strategies in the slab layer), it does however
suffer from internal fragmentation - by design.

For variable size allocators it has been proven that for each allocator
there is an allocation pattern that will defeat it. And figuring the
pattern out and proving it will not happen in a long-running system is
hard hard work.

(free block coalescence is not a guarantee against fragmentation; there
is even evidence that delayed coalescence will reduce fragmentation - it
introduces history and this extra information can help predict the
future.)

This is exactly why long running systems (like our kernel) love slabs.

For those interested in memory allocators, this paper is a good (albeit
a bit dated) introduction:
http://citeseer.ist.psu.edu/wilson95dynamic.html

That said, it might be that Evgeniy's allocator works out for our
network load - only time will tell, the math is not tractable afaik.

> All in all it seems it will take a while until Evgeniy's code will be merged,
> so I think applying Peter's patch soonish and removing it again the moment it
> becomes unnecessary is reasonable.

Thanks and like said, I think even then most of this patch will need to
survive.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Indan Zupancic
On Mon, August 28, 2006 19:32, Peter Zijlstra said:
> Also, I'm really past caring what the thing
> is called ;-) But if ppl object I guess its easy enough to run yet
> another sed command over the patches.

True, same here.

>> >> You can get rid of the memalloc_reserve and vmio_request_queues variables
>> >> if you want, they aren't really needed for anything. If using them reduces
>> >> the total code size I'd keep them though.
>> >
>> > I find my version easier to read, but that might just be the way my
>> > brain works.
>>
>> Maybe true, but I believe my version is more natural in the sense that it 
>> makes
>> more clear what the code is doing. Less bookkeeping, more real work, so to 
>> speak.
>
> Ok, I'll have another look at it, perhaps my gray matter has shifted ;-)

I don't care either way, just providing an alternative. I'd compile both and see
which one is smaller.


> Ah, no accident there, I'm fully aware that there would need to be a
> spinlock in adjust_memalloc_reserve() if there were another caller.
> (I even had it there for some time) - added comment.

Good that you're aware of it. Thing is, how much sense does the split-up into
adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
add adjust_memalloc_reserve() when it's really needed? It saves an export.

Feedback on the 28-Aug-2006 19:24 version from
programming.kicks-ass.net/kernel-patches/vm_deadlock/current/


> +void setup_per_zone_pages_min(void)
> +{
> + static DEFINE_SPINLOCK(lock);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&lock, flags);
> + __setup_per_zone_pages_min();
> + spin_unlock_irqrestore(&lock, flags);
> +}

Better to put the lock next to min_free_kbytes, both for readability and
cache behaviour. And it satisfies the "lock data, not code" mantra.


> +static inline void * emergency_rx_alloc(size_t size, gfp_t gfp_mask)
> +{
> + void * page = NULL;
> +
> + if (size > PAGE_SIZE)
> + return page;
> +
> + if (atomic_add_unless(&emergency_rx_pages_used, 1, RX_RESERVE_PAGES)) {
> + page = (void *)__get_free_page(gfp_mask);
> + if (!page) {
> + WARN_ON(1);
> + atomic_dec(&emergency_rx_pages_used);
> + }
> + }
> +
> + return page;
> +}

If you prefer to avoid cmpxchg (which is often used in atomic_add_unless
and can be expensive) then you can use something like:

static inline void * emergency_rx_alloc(size_t size, gfp_t gfp_mask)
{
void * page;

if (size > PAGE_SIZE)
return NULL;

if (atomic_inc_return(&emergency_rx_pages_used) == RX_RESERVE_PAGES)
goto out;
page = (void *)__get_free_page(gfp_mask);
if (page)
return page;
WARN_ON(1);
out:
atomic_dec(&emergency_rx_pages_used);
return NULL;
}

The tiny race should be totally harmless. Both versions are a bit big
to inline though.


> @@ -195,6 +196,86 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
>  /* Maximal space eaten by iovec or ancilliary data plus some space */
>  int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_reserve;
> +static unsigned int vmio_request_queues;
> +
> +atomic_t vmio_socks;
> +atomic_t emergency_rx_pages_used;
> +EXPORT_SYMBOL_GPL(vmio_socks);

Is this export needed? It's only used in net/core/skbuff.c and net/core/sock.c,
which are compiled into one module.

> +EXPORT_SYMBOL_GPL(emergency_rx_pages_used);

Same here. It's only used by code in sock.c and skbuff.c, and no external
code calls emergency_rx_alloc(), nor emergency_rx_free().

--

I think I depleted my usefulness, there isn't much left to say for me.
It's up to the big guys to decide about the merrit of this patch.
If Evgeniy's network allocator fixes all deadlocks and also has other
advantages, then great.

IMHO:

- This patch isn't really a framework, more a minimal fix for one specific,
though important problem. But it's small and doesn't have much impact
(numbers would be nice, e.g. vmlinux/modules size before and after, and
some network benchmark results).

- If Evgeniy's network allocator is as good as it looks, then why can't it
replace the existing one? Just adding private subsystem specific memory
allocators seems wrong. I might be missing the big picture, but it looks
like memory allocator things should be at least synchronized and discussed
with Christoph Lameter and his "modular slab allocator" patch.

All in all it seems it will take a while until Evgeniy's code will be merged,
so I think applying Peter's patch soonish and removing it again the moment it
becomes unnecessary is reasonable.

Greetings,

Indan


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Peter Zijlstra
On Mon, 2006-08-28 at 18:03 +0200, Indan Zupancic wrote:
> On Mon, August 28, 2006 12:22, Peter Zijlstra said:

> >> > @@ -391,6 +391,7 @@ enum sock_flags {
> >> >  SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
> >> >  SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
> >> >  SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> >> > +SOCK_VMIO, /* promise to never block on receive */
> >>
> >> It might be used for IO related to the VM, but that doesn't tell _what_ it 
> >> does.
> >> It also does much more than just not blocking on receive, so overal, aren't
> >> both the vmio name and the comment slightly misleading?
> >
> > I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
> > while, but that is a very bad name because nonblocking has this well
> > defined meaning when talking about sockets, and this is not that.
> >
> > Hence I came up with the VMIO, because that is the only selecting
> > criteria for being special. - I'll fix up the comment.
> 
> It's nice and short, but it might be weird if someone after a while finds 
> another way
> of using this stuff. And it's relation to 'emergency' looks unclear. So maybe 
> calling
> both the same makes most sense, no matter how you name it.

I've tried to come up with another use-case, but failed (of course that
doesn't mean there is no). Also, I'm really past caring what the thing
is called ;-) But if ppl object I guess its easy enough to run yet
another sed command over the patches.

> >> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
> >> >
> >> >  static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", 
> >> > "HighMem" };
> >> >  int min_free_kbytes = 1024;
> >> > +int var_free_kbytes;
> >>
> >> Using var_free_pages makes the code slightly simpler, as all that needless
> >> convertion isn't needed anymore. Perhaps the same is true for 
> >> min_free_kbytes...
> >
> > 't seems I'm a bit puzzled as to what you mean here.
> 
> I mean to store the variable reserve in pages instead of kilobytes. Currently 
> you're
> converting from the one to the other both when setting and when using the 
> value. That
> doesn't make much sense and can be avoided by storing the value in pages from 
> the start.

right, will have a peek.

> void kfree_skbmem(struct sk_buff *skb)
> {
>   struct sk_buff *other;
>   atomic_t *fclone_ref;
>   struct kmem_cache *cache = skbuff_head_cache;
>   struct sk_buff *free = skb;
> 
>   skb_release_data(skb);
>   switch (skb->fclone) {
>   case SKB_FCLONE_UNAVAILABLE:
>   goto free;
> 
>   case SKB_FCLONE_ORIG:
>   fclone_ref = (atomic_t *) (skb + 2);
>   if (atomic_dec_and_test(fclone_ref)){
>   cache = skbuff_fclone_cache;
>   goto free;
>   }
>   break;
> 
>   case SKB_FCLONE_CLONE:
>   fclone_ref = (atomic_t *) (skb + 1);
>   other = skb - 1;
> 
>   /* The clone portion is available for
>* fast-cloning again.
>*/
>   skb->fclone = SKB_FCLONE_UNAVAILABLE;
> 
>   if (atomic_dec_and_test(fclone_ref)){
>   cache = skbuff_fclone_cache;
>   free = other;
>   goto free;
>   }
>   break;
>   };
>   return;
> free:
>   if (!skb->emergency)
>   kmem_cache_free(cache, free);
>   else
>   emergency_rx_free(free, kmem_cache_size(cache));
> }

Ah, like so, sure, that looks good.

> >> You can get rid of the memalloc_reserve and vmio_request_queues variables
> >> if you want, they aren't really needed for anything. If using them reduces
> >> the total code size I'd keep them though.
> >
> > I find my version easier to read, but that might just be the way my
> > brain works.
> 
> Maybe true, but I believe my version is more natural in the sense that it 
> makes
> more clear what the code is doing. Less bookkeeping, more real work, so to 
> speak.

Ok, I'll have another look at it, perhaps my gray matter has shifted ;-)

> But after another look things seem a bit shaky, in the locking corner anyway.
> 
> sk_adjust_memalloc() calls adjust_memalloc_reserve(), which changes 
> var_free_kbytes
> and then calls setup_per_zone_pages_min(), which does the real work. But it 
> reads
> min_free_kbytes without holding any locks. In mainline that's fine as the 
> function
> is only called by the proc handler and in obscure memory hotplug stuff. But 
> with
> your code it can also be called at any moment when a VMIO socket is made, 
> which now
> races with the proc callback. More a theoretical than a real problem, but 
> still
> slightly messy.

Knew about that, hadn't made up my mind on a fix yet. Good spot never
the less. Time to actually fix it I guess.

> adjust_memalloc_reserve() has no locking at all, while it might be called 
> 

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Indan Zupancic
On Mon, August 28, 2006 12:22, Peter Zijlstra said:
> On Sat, 2006-08-26 at 04:37 +0200, Indan Zupancic wrote:
>> Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)
>
> hehe, me lazy, you gentoo ;-)
> sed -i -e 's/emerg/emregency/g' -e 's/EMERG/EMERGENCY/g' *.patch

I used it for a while, long ago, until I figured out that there were better
alternatives. I didn't like the overly complex init and portage system though.

But if you say "emerg" it will sound as "emerge", and all other fields in that
struct aren't abbreviated either and often longer, so it just makes more sense
to use the full name.


>> > @@ -391,6 +391,7 @@ enum sock_flags {
>> >SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
>> >SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
>> >SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
>> > +  SOCK_VMIO, /* promise to never block on receive */
>>
>> It might be used for IO related to the VM, but that doesn't tell _what_ it 
>> does.
>> It also does much more than just not blocking on receive, so overal, aren't
>> both the vmio name and the comment slightly misleading?
>
> I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
> while, but that is a very bad name because nonblocking has this well
> defined meaning when talking about sockets, and this is not that.
>
> Hence I came up with the VMIO, because that is the only selecting
> criteria for being special. - I'll fix up the comment.

It's nice and short, but it might be weird if someone after a while finds 
another way
of using this stuff. And it's relation to 'emergency' looks unclear. So maybe 
calling
both the same makes most sense, no matter how you name it.


>> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
>> >
>> >  static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", 
>> > "HighMem" };
>> >  int min_free_kbytes = 1024;
>> > +int var_free_kbytes;
>>
>> Using var_free_pages makes the code slightly simpler, as all that needless
>> convertion isn't needed anymore. Perhaps the same is true for 
>> min_free_kbytes...
>
> 't seems I'm a bit puzzled as to what you mean here.

I mean to store the variable reserve in pages instead of kilobytes. Currently 
you're
converting from the one to the other both when setting and when using the 
value. That
doesn't make much sense and can be avoided by storing the value in pages from 
the start.


>> > +noskb:
>> > +  /* Attempt emergency allocation when RX skb. */
>> > +  if (!(flags & SKB_ALLOC_RX))
>> > +  goto out;
>>
>> So only incoming skb allocation is guaranteed? What about outgoing skbs?
>> What am I missing? Or can we sleep then, and increasing var_free_kbytes is
>> sufficient to guarantee it?
>
> ->sk_allocation |= __GFP_EMERGENCY - will take care of the outgoing
> packets. Also, since one only sends a limited number of packets out and
> then will wait for answers, we do not need to worry about fragmentation
> issues that much in this case.

Ah, missed that one. Didn't knew that the alloc flags were stored in the sock.


>> > +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
>> > +{
>> > +  free_page((unsigned long)objp);
>> > +  emerg_rx_pages_dec();
>> > +}
>> > +
>> >  /*
>> >   *Free an skbuff by memory without cleaning the state.
>> >   */
>> > @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
>> >  {
>> >struct sk_buff *other;
>> >atomic_t *fclone_ref;
>> > +  void (*free_skb)(struct kmem_cache *, void *);
>> >
>> >skb_release_data(skb);
>> > +
>> > +  free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
>> > +
>> >switch (skb->fclone) {
>> >case SKB_FCLONE_UNAVAILABLE:
>> > -  kmem_cache_free(skbuff_head_cache, skb);
>> > +  free_skb(skbuff_head_cache, skb);
>> >break;
>> >
>> >case SKB_FCLONE_ORIG:
>> >fclone_ref = (atomic_t *) (skb + 2);
>> >if (atomic_dec_and_test(fclone_ref))
>> > -  kmem_cache_free(skbuff_fclone_cache, skb);
>> > +  free_skb(skbuff_fclone_cache, skb);
>> >break;
>> >
>> >case SKB_FCLONE_CLONE:
>> > @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
>> >skb->fclone = SKB_FCLONE_UNAVAILABLE;
>> >
>> >if (atomic_dec_and_test(fclone_ref))
>> > -  kmem_cache_free(skbuff_fclone_cache, other);
>> > +  free_skb(skbuff_fclone_cache, other);
>> >break;
>> >};
>> >  }
>>
>> I don't have the original code in front of me, but isn't it possible to
>> add a "goto free" which has all the freeing in one place? That would get
>> rid of the function pointer stuff and emerg_free_skb.
>
> perhaps, yes, however I prefer this one, it allows access to the size.

What size are you talking about? What I had in mind is probably less readable,
but it avoids a bunch of function calls and that indirect function call, so
with luck it has less overhead and smaller object size

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Peter Zijlstra
On Sat, 2006-08-26 at 04:37 +0200, Indan Zupancic wrote:
> On Fri, August 25, 2006 17:39, Peter Zijlstra said:
> > @@ -282,7 +282,8 @@ struct sk_buff {
> > nfctinfo:3;
> > __u8pkt_type:3,
> > fclone:2,
> > -   ipvs_property:1;
> > +   ipvs_property:1,
> > +   emerg:1;
> > __be16  protocol;
> 
> Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)

hehe, me lazy, you gentoo ;-)
sed -i -e 's/emerg/emregency/g' -e 's/EMERG/EMERGENCY/g' *.patch

> > @@ -391,6 +391,7 @@ enum sock_flags {
> > SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
> > SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
> > SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> > +   SOCK_VMIO, /* promise to never block on receive */
> 
> It might be used for IO related to the VM, but that doesn't tell _what_ it 
> does.
> It also does much more than just not blocking on receive, so overal, aren't
> both the vmio name and the comment slightly misleading?

I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
while, but that is a very bad name because nonblocking has this well
defined meaning when talking about sockets, and this is not that.

Hence I came up with the VMIO, because that is the only selecting
criteria for being special. - I'll fix up the comment.

> > +static inline int emerg_rx_pages_try_inc(void)
> > +{
> > +   return atomic_read(&vmio_socks) &&
> > +   atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
> > +}
> 
> It looks cleaner to move that first check to the caller, as it is often
> redundant and in the other cases makes it more clear what the caller is
> really doing.

Yes, very good suggestion indeed, what was I thinking?!

> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
> >
> >  static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", 
> > "HighMem" };
> >  int min_free_kbytes = 1024;
> > +int var_free_kbytes;
> 
> Using var_free_pages makes the code slightly simpler, as all that needless
> convertion isn't needed anymore. Perhaps the same is true for 
> min_free_kbytes...

't seems I'm a bit puzzled as to what you mean here.

> 
> > +noskb:
> > +   /* Attempt emergency allocation when RX skb. */
> > +   if (!(flags & SKB_ALLOC_RX))
> > +   goto out;
> 
> So only incoming skb allocation is guaranteed? What about outgoing skbs?
> What am I missing? Or can we sleep then, and increasing var_free_kbytes is
> sufficient to guarantee it?

->sk_allocation |= __GFP_EMERGENCY - will take care of the outgoing
packets. Also, since one only sends a limited number of packets out and
then will wait for answers, we do not need to worry about fragmentation
issues that much in this case.

> > +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
> > +{
> > +   free_page((unsigned long)objp);
> > +   emerg_rx_pages_dec();
> > +}
> > +
> >  /*
> >   * Free an skbuff by memory without cleaning the state.
> >   */
> > @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
> >  {
> > struct sk_buff *other;
> > atomic_t *fclone_ref;
> > +   void (*free_skb)(struct kmem_cache *, void *);
> >
> > skb_release_data(skb);
> > +
> > +   free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
> > +
> > switch (skb->fclone) {
> > case SKB_FCLONE_UNAVAILABLE:
> > -   kmem_cache_free(skbuff_head_cache, skb);
> > +   free_skb(skbuff_head_cache, skb);
> > break;
> >
> > case SKB_FCLONE_ORIG:
> > fclone_ref = (atomic_t *) (skb + 2);
> > if (atomic_dec_and_test(fclone_ref))
> > -   kmem_cache_free(skbuff_fclone_cache, skb);
> > +   free_skb(skbuff_fclone_cache, skb);
> > break;
> >
> > case SKB_FCLONE_CLONE:
> > @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
> > skb->fclone = SKB_FCLONE_UNAVAILABLE;
> >
> > if (atomic_dec_and_test(fclone_ref))
> > -   kmem_cache_free(skbuff_fclone_cache, other);
> > +   free_skb(skbuff_fclone_cache, other);
> > break;
> > };
> >  }
> 
> I don't have the original code in front of me, but isn't it possible to
> add a "goto free" which has all the freeing in one place? That would get
> rid of the function pointer stuff and emerg_free_skb.

perhaps, yes, however I prefer this one, it allows access to the size.

> > @@ -435,6 +486,17 @@ struct sk_buff *skb_clone(struct sk_buff
> > atomic_t *fclone_ref = (atomic_t *) (n + 1);
> > n->fclone = SKB_FCLONE_CLONE;
> > atomic_inc(fclone_ref);
> > +   } else if (skb->emerg) {
> > +   if (!emerg_rx_pages_try_inc())
> > +   return NULL;
> > +
> > +   n = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> > +   if (!n) {
> > +

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-25 Thread Indan Zupancic
On Fri, August 25, 2006 17:39, Peter Zijlstra said:
> @@ -282,7 +282,8 @@ struct sk_buff {
>   nfctinfo:3;
>   __u8pkt_type:3,
>   fclone:2,
> - ipvs_property:1;
> + ipvs_property:1,
> + emerg:1;
>   __be16  protocol;

Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)


> @@ -391,6 +391,7 @@ enum sock_flags {
>   SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
>   SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
>   SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> + SOCK_VMIO, /* promise to never block on receive */

It might be used for IO related to the VM, but that doesn't tell _what_ it does.
It also does much more than just not blocking on receive, so overal, aren't
both the vmio name and the comment slightly misleading?


> +static inline int emerg_rx_pages_try_inc(void)
> +{
> + return atomic_read(&vmio_socks) &&
> + atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
> +}

It looks cleaner to move that first check to the caller, as it is often
redundant and in the other cases makes it more clear what the caller is
really doing.


> @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
>
>  static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", 
> "HighMem" };
>  int min_free_kbytes = 1024;
> +int var_free_kbytes;

Using var_free_pages makes the code slightly simpler, as all that needless
convertion isn't needed anymore. Perhaps the same is true for min_free_kbytes...


> +noskb:
> + /* Attempt emergency allocation when RX skb. */
> + if (!(flags & SKB_ALLOC_RX))
> + goto out;

So only incoming skb allocation is guaranteed? What about outgoing skbs?
What am I missing? Or can we sleep then, and increasing var_free_kbytes is
sufficient to guarantee it?


> +
> + if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
> + goto out;
> +
> + if (!emerg_rx_pages_try_inc())
> + goto out;
> +
> + skb = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!skb) {
> + WARN_ON(1); /* we were promised memory but didn't get it? */
> + goto dec_out;
> + }
> +
> + if (!emerg_rx_pages_try_inc())
> + goto skb_free_out;
> +
> + data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!data) {
> + WARN_ON(1); /* we were promised memory but didn't get it? */
> + emerg_rx_pages_dec();
> +skb_free_out:
> + free_page((unsigned long)skb);
> + skb = NULL;
> +dec_out:
> + emerg_rx_pages_dec();
> + goto out;
> + }
> +
> + cache = NULL;
> + goto allocated;
>  }
>
>  /**
> @@ -267,7 +304,7 @@ struct sk_buff *__netdev_alloc_skb(struc
>  {
>   struct sk_buff *skb;
>
> - skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
> + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
>   if (likely(skb)) {
>   skb_reserve(skb, NET_SKB_PAD);
>   skb->dev = dev;
> @@ -315,10 +352,20 @@ static void skb_release_data(struct sk_b
>   if (skb_shinfo(skb)->frag_list)
>   skb_drop_fraglist(skb);
>
> - kfree(skb->head);
> + if (skb->emerg) {
> + free_page((unsigned long)skb->head);
> + emerg_rx_pages_dec();
> + } else
> + kfree(skb->head);
>   }
>  }
>
> +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
> +{
> + free_page((unsigned long)objp);
> + emerg_rx_pages_dec();
> +}
> +
>  /*
>   *   Free an skbuff by memory without cleaning the state.
>   */
> @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
>  {
>   struct sk_buff *other;
>   atomic_t *fclone_ref;
> + void (*free_skb)(struct kmem_cache *, void *);
>
>   skb_release_data(skb);
> +
> + free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
> +
>   switch (skb->fclone) {
>   case SKB_FCLONE_UNAVAILABLE:
> - kmem_cache_free(skbuff_head_cache, skb);
> + free_skb(skbuff_head_cache, skb);
>   break;
>
>   case SKB_FCLONE_ORIG:
>   fclone_ref = (atomic_t *) (skb + 2);
>   if (atomic_dec_and_test(fclone_ref))
> - kmem_cache_free(skbuff_fclone_cache, skb);
> + free_skb(skbuff_fclone_cache, skb);
>   break;
>
>   case SKB_FCLONE_CLONE:
> @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
>   skb->fclone = SKB_FCLONE_UNAVAILABLE;
>
>   if (atomic_dec_and_test(fclone_ref))
> - kmem_cache_free(skbuff_fclone_cache, other);
> + free_skb(skbuff_fclone_cache, other);
>   break;
>   };
>  }


[PATCH 1/4] net: VM deadlock avoidance framework

2006-08-25 Thread Peter Zijlstra

The core of the VM deadlock avoidance framework.

In order to provide robust networked block devices there must be a guarantee
of progress. That is, the block device must never stall because of OOM because
the device itself might be needed to get out of OOM (reclaim pageout).

This means that the device queue must always be unplugable, this in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive ACKs for those packets.

The network stack has a huge capacity for buffering packets; waiting for 
user-space to read them. There is a practical limit imposed to avoid DoS 
scenarios. These two things make for a deadlock; what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device waiting for an ACK to free a page). 

Memory pressure will add to that; what if there is simply no memory left to
receive packets in.

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise the socket will never block on receive. Along with with a memory
reserve that will service a limited number of packets this can guarantee full
service to these critical sockets.

When we make sure that packets allocated from the reserve will only service
critical sockets we will not lose the memory and can guarantee progress.

Since memory is tight and the reserve modest, we do not want to lose memory to
fragmentation effects. Hence a very simple allocator is used to guarantee that
the memory used for each packet is returned to the page allocator.

Converted protocols:
IPv4 & IPv6:
 - icmp
 - udp
 - tcp
IPv4:
 - igmp

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Signed-off-by: Daniel Phillips <[EMAIL PROTECTED]>
---
 include/linux/gfp.h|3 -
 include/linux/mmzone.h |1 
 include/linux/skbuff.h |   13 --
 include/net/sock.h |   37 +
 mm/page_alloc.c|   41 ++-
 net/core/skbuff.c  |  103 ++---
 net/core/sock.c|   97 ++
 net/ipv4/af_inet.c |3 +
 net/ipv4/icmp.c|3 +
 net/ipv4/igmp.c|3 +
 net/ipv4/tcp_ipv4.c|3 +
 net/ipv4/udp.c |8 +++
 net/ipv6/af_inet6.c|3 +
 net/ipv6/icmp.c|3 +
 net/ipv6/tcp_ipv6.c|3 +
 net/ipv6/udp.c |3 +
 16 files changed, 305 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on 
success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency 
reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x2u) /* Enforce hardwall cpuset 
memory allocs */
+#define __GFP_EMERG  ((__force gfp_t)0x4u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-   __GFP_NOMEMALLOC|__GFP_HARDWALL)
+   __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_EMERG)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int bytes);
 
 #include 
 /* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   emerg:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
@@ -327,10 +328,13 @@ struct sk_buff {
 
 #include 
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
 extern void kfree