2.6.23-rc2: WARNING at kernel/irq/resend.c:70 check_irq_resend()

2007-08-08 Thread Indan Zupancic
Hi,

Just added an old network card, RTL-8029(AS), ne2k-pci driver, and tried to
expand the network (failed because I didn't use a cross-over cable).

The code snippet that spat the thing:

/*
 * IRQ resend
 *
 * Is called with interrupts disabled and desc->lock held.
 */
void check_irq_resend(struct irq_desc *desc, unsigned int irq)
{
	unsigned int status = desc->status;

	/*
	 * Make sure the interrupt is enabled, before resending it:
	 */
	desc->chip->enable(irq);

	/*
	 * Temporary hack to figure out more about the problem, which
	 * is causing the ancient network cards to die.
	 */
	if (desc->handle_irq != handle_edge_irq) {
		WARN_ON_ONCE(1);
		return;
	}

and relevant dmesg snippet:

[  446.836399] eth1: RealTek RTL-8029 found at 0x9000, IRQ 16, 
00:00:B4:BB:BF:E5.
[  855.458468] r8169: eth0: link up
[  857.500667] r8169: eth0: link up
[  902.294283] r8169: eth0: link up
[ 4126.348427] r8169: eth0: link up
[ 5598.446233] r8169: eth0: link down
[ 5618.904846] r8169: eth0: link up
[ 5623.133090] r8169: eth0: link down
[ 5624.708071] r8169: eth0: link up
[ 6328.267872] WARNING: at /home/indan/src/git/linux-2.6/kernel/irq/resend.c:70 
check_irq_resend()
[ 6328.267896]  [b0135d8f] check_irq_resend+0x4f/0x8c
[ 6328.267912]  [b0135888] enable_irq+0x84/0xaa
[ 6328.267918]  [c08e37cc] ne2k_pci_block_output+0x0/0x138 [ne2k_pci]
[ 6328.267931]  [c08e37cc] ne2k_pci_block_output+0x0/0x138 [ne2k_pci]
[ 6328.267939]  [c08e37cc] ne2k_pci_block_output+0x0/0x138 [ne2k_pci]
[ 6328.267947]  [c08e0f8a] ei_start_xmit+0x29f/0x2b9 [8390]
[ 6328.267966]  [b022a53e] dev_hard_start_xmit+0x19e/0x1fd
[ 6328.267976]  [b023620a] __qdisc_run+0x6c/0x151
[ 6328.267986]  [b022c459] dev_queue_xmit+0x12a/0x271
[ 6328.267991]  [b0260ec9] arp_send+0x4c/0x64
[ 6328.268002]  [b0260524] arp_xmit+0x4d/0x51
[ 6328.268009]  [b026117c] arp_process+0x29b/0x50a
[ 6328.268018]  [b0244f0a] ip_forward+0x22c/0x23a
[ 6328.268026]  [b02434a8] ip_rcv_finish+0x0/0x282
[ 6328.268032]  [b0243db9] ip_rcv+0x479/0x4a8
[ 6328.268037]  [b02434a8] ip_rcv_finish+0x0/0x282
[ 6328.268044]  [b02614db] arp_rcv+0xf0/0x104
[ 6328.268049]  [b012ae17] hrtimer_run_queues+0x12/0x1a1
[ 6328.268055]  [b02613eb] arp_rcv+0x0/0x104
[ 6328.268061]  [b022a07d] netif_receive_skb+0x2d9/0x317
[ 6328.268068]  [b022baa0] process_backlog+0x6d/0xd2
[ 6328.268075]  [b022c1c2] net_rx_action+0x81/0x14d
[ 6328.268082]  [b011dd84] __do_softirq+0x35/0x75
[ 6328.268088]  [b0106030] do_softirq+0x3e/0x8d
[ 6328.268098]  [b0136530] handle_fasteoi_irq+0x0/0xbe
[ 6328.268103]  [b011dd44] irq_exit+0x25/0x30
[ 6328.268107]  [b010630f] do_IRQ+0x94/0xad
[ 6328.268114]  [b0104763] common_interrupt+0x23/0x28
[ 6328.268123]  [b0102a0c] default_idle+0x27/0x39
[ 6328.268128]  [b0102347] cpu_idle+0x41/0x55
[ 6328.268133]  [b032f99b] start_kernel+0x20e/0x213
[ 6328.268140]  [b032f317] unknown_bootoption+0x0/0x196
[ 6328.268146]  ===

# cat /proc/interrupts
   CPU0
  0:9804926   IO-APIC-edge  timer
  1:   8270   IO-APIC-edge  i8042
  9:  0   IO-APIC-fasteoi   acpi
 12: 181281   IO-APIC-edge  i8042
 16:  40410   IO-APIC-fasteoi   sata_sil, eth1
 17:  0   IO-APIC-fasteoi   ohci_hcd:usb1, NVidia nForce2
 18: 499428   IO-APIC-fasteoi   ohci_hcd:usb2
 19:  2   IO-APIC-fasteoi   ehci_hcd:usb3
 20: 689588   IO-APIC-fasteoi   eth0
NMI:  0
LOC:9805098
ERR:  0
MIS:  0

So the card is sharing an IRQ with the disk controller.

No idea whether the network card died after this warning; for all
practical purposes it didn't work before, so I couldn't check.

I can provide more information and run some tests if anyone
wants that, just keep in mind that the card isn't connected to
anything.

Greetings,

Indan


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Indan Zupancic
On Mon, August 28, 2006 12:22, Peter Zijlstra said:
 On Sat, 2006-08-26 at 04:37 +0200, Indan Zupancic wrote:
 Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)

 hehe, me lazy, you gentoo ;-)
 sed -i -e 's/emerg/emergency/g' -e 's/EMERG/EMERGENCY/g' *.patch

I used it for a while, long ago, until I figured out that there were better
alternatives. I didn't like the overly complex init and portage system though.

But if you say emerg it will sound like emerge, and none of the other fields
in that struct are abbreviated, and they're often longer, so it just makes
more sense to use the full name.


  @@ -391,6 +391,7 @@ enum sock_flags {
 SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
  +  SOCK_VMIO, /* promise to never block on receive */

 It might be used for IO related to the VM, but that doesn't tell _what_ it 
 does.
 It also does much more than just not blocking on receive, so overall, aren't
 both the vmio name and the comment slightly misleading?

 I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
 while, but that is a very bad name because nonblocking has this well
 defined meaning when talking about sockets, and this is not that.

 Hence I came up with the VMIO, because that is the only selecting
 criteria for being special. - I'll fix up the comment.

It's nice and short, but it might be weird if someone after a while finds
another way of using this stuff. And its relation to 'emergency' looks
unclear. So maybe calling both the same makes most sense, no matter how you
name it.


  @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
 
   static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
   int min_free_kbytes = 1024;
  +int var_free_kbytes;

 Using var_free_pages makes the code slightly simpler, as all that needless
 conversion isn't needed anymore. Perhaps the same is true for 
 min_free_kbytes...

 't seems I'm a bit puzzled as to what you mean here.

I mean to store the variable reserve in pages instead of kilobytes. Currently
you're converting from the one to the other both when setting and when using
the value. That doesn't make much sense and can be avoided by storing the
value in pages from the start.


  +noskb:
  +  /* Attempt emergency allocation when RX skb. */
  +	if (!(flags & SKB_ALLOC_RX))
  +  goto out;

 So only incoming skb allocation is guaranteed? What about outgoing skbs?
 What am I missing? Or can we sleep then, and increasing var_free_kbytes is
 sufficient to guarantee it?

 sk->sk_allocation |= __GFP_EMERGENCY - will take care of the outgoing
 packets. Also, since one only sends a limited number of packets out and
 then will wait for answers, we do not need to worry about fragmentation
 issues that much in this case.

Ah, missed that one. Didn't know that the alloc flags were stored in the sock.


  +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
  +{
  +  free_page((unsigned long)objp);
  +  emerg_rx_pages_dec();
  +}
  +
   /*
*Free an skbuff by memory without cleaning the state.
*/
  @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
   {
 struct sk_buff *other;
 atomic_t *fclone_ref;
  +  void (*free_skb)(struct kmem_cache *, void *);
 
 skb_release_data(skb);
  +
  +	free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
  +
  	switch (skb->fclone) {
 case SKB_FCLONE_UNAVAILABLE:
  -  kmem_cache_free(skbuff_head_cache, skb);
  +  free_skb(skbuff_head_cache, skb);
 break;
 
 case SKB_FCLONE_ORIG:
 fclone_ref = (atomic_t *) (skb + 2);
 if (atomic_dec_and_test(fclone_ref))
  -  kmem_cache_free(skbuff_fclone_cache, skb);
  +  free_skb(skbuff_fclone_cache, skb);
 break;
 
 case SKB_FCLONE_CLONE:
  @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
  	skb->fclone = SKB_FCLONE_UNAVAILABLE;
 
 if (atomic_dec_and_test(fclone_ref))
  -  kmem_cache_free(skbuff_fclone_cache, other);
  +  free_skb(skbuff_fclone_cache, other);
 break;
 };
   }

 I don't have the original code in front of me, but isn't it possible to
 add a goto free which has all the freeing in one place? That would get
 rid of the function pointer stuff and emerg_free_skb.

 perhaps, yes, however I prefer this one, it allows access to the size.

What size are you talking about? What I had in mind is probably less readable,
but it avoids a bunch of function calls and that indirect function call, so
with luck it has less overhead and smaller object size:

void kfree_skbmem(struct sk_buff *skb)
{
struct sk_buff *other;
atomic_t *fclone_ref;
struct kmem_cache *cache = skbuff_head_cache;
struct sk_buff *free = skb;

skb_release_data(skb);
switch

Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-28 Thread Indan Zupancic
On Mon, August 28, 2006 19:32, Peter Zijlstra said:
 Also, I'm really past caring what the thing
 is called ;-) But if ppl object I guess its easy enough to run yet
 another sed command over the patches.

True, same here.

  You can get rid of the memalloc_reserve and vmio_request_queues variables
  if you want, they aren't really needed for anything. If using them reduces
  the total code size I'd keep them though.
 
  I find my version easier to read, but that might just be the way my
  brain works.

 Maybe true, but I believe my version is more natural in the sense that it
 makes more clear what the code is doing. Less bookkeeping, more real work,
 so to speak.

 Ok, I'll have another look at it, perhaps my gray matter has shifted ;-)

I don't care either way, just providing an alternative. I'd compile both and see
which one is smaller.


 Ah, no accident there, I'm fully aware that there would need to be a
 spinlock in adjust_memalloc_reserve() if there were another caller.
 (I even had it there for some time) - added comment.

Good that you're aware of it. Thing is, how much sense does the split-up into
adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
add adjust_memalloc_reserve() when it's really needed? It saves an export.

Feedback on the 28-Aug-2006 19:24 version from
programming.kicks-ass.net/kernel-patches/vm_deadlock/current/


 +void setup_per_zone_pages_min(void)
 +{
 +	static DEFINE_SPINLOCK(lock);
 +	unsigned long flags;
 +
 +	spin_lock_irqsave(&lock, flags);
 +	__setup_per_zone_pages_min();
 +	spin_unlock_irqrestore(&lock, flags);
 +}

Better to put the lock next to min_free_kbytes, both for readability and
cache behaviour. And it satisfies the "lock data, not code" mantra.


 +static inline void * emergency_rx_alloc(size_t size, gfp_t gfp_mask)
 +{
 +	void * page = NULL;
 +
 +	if (size > PAGE_SIZE)
 +		return page;
 +
 +	if (atomic_add_unless(&emergency_rx_pages_used, 1, RX_RESERVE_PAGES)) {
 +		page = (void *)__get_free_page(gfp_mask);
 +		if (!page) {
 +			WARN_ON(1);
 +			atomic_dec(&emergency_rx_pages_used);
 +		}
 +	}
 +
 +	return page;
 +}

If you prefer to avoid cmpxchg (which is often used in atomic_add_unless
and can be expensive) then you can use something like:

static inline void * emergency_rx_alloc(size_t size, gfp_t gfp_mask)
{
	void * page;

	if (size > PAGE_SIZE)
		return NULL;

	if (atomic_inc_return(&emergency_rx_pages_used) == RX_RESERVE_PAGES)
		goto out;
	page = (void *)__get_free_page(gfp_mask);
	if (page)
		return page;
	WARN_ON(1);
out:
	atomic_dec(&emergency_rx_pages_used);
	return NULL;
}

The tiny race should be totally harmless. Both versions are a bit big
to inline though.


 @@ -195,6 +196,86 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
  /* Maximal space eaten by iovec or ancilliary data plus some space */
  int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

 +static DEFINE_SPINLOCK(memalloc_lock);
 +static int memalloc_reserve;
 +static unsigned int vmio_request_queues;
 +
 +atomic_t vmio_socks;
 +atomic_t emergency_rx_pages_used;
 +EXPORT_SYMBOL_GPL(vmio_socks);

Is this export needed? It's only used in net/core/skbuff.c and net/core/sock.c,
which are compiled into one module.

 +EXPORT_SYMBOL_GPL(emergency_rx_pages_used);

Same here. It's only used by code in sock.c and skbuff.c, and no external
code calls emergency_rx_alloc(), nor emergency_rx_free().

--

I think I've depleted my usefulness; there isn't much left for me to say.
It's up to the big guys to decide about the merit of this patch.
If Evgeniy's network allocator fixes all deadlocks and also has other
advantages, then great.

IMHO:

- This patch isn't really a framework, more a minimal fix for one specific,
though important problem. But it's small and doesn't have much impact
(numbers would be nice, e.g. vmlinux/modules size before and after, and
some network benchmark results).

- If Evgeniy's network allocator is as good as it looks, then why can't it
replace the existing one? Just adding private subsystem specific memory
allocators seems wrong. I might be missing the big picture, but it looks
like memory allocator things should be at least synchronized and discussed
with Christoph Lameter and his modular slab allocator patch.

All in all it seems it will take a while until Evgeniy's code will be merged,
so I think applying Peter's patch soonish and removing it again the moment it
becomes unnecessary is reasonable.

Greetings,

Indan




Re: [PATCH 1/4] net: VM deadlock avoidance framework

2006-08-25 Thread Indan Zupancic
On Fri, August 25, 2006 17:39, Peter Zijlstra said:
 @@ -282,7 +282,8 @@ struct sk_buff {
   nfctinfo:3;
   __u8pkt_type:3,
   fclone:2,
 - ipvs_property:1;
 + ipvs_property:1,
 + emerg:1;
   __be16  protocol;

Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)


 @@ -391,6 +391,7 @@ enum sock_flags {
   SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
   SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
   SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
 + SOCK_VMIO, /* promise to never block on receive */

It might be used for IO related to the VM, but that doesn't tell _what_ it does.
It also does much more than just not blocking on receive, so overall, aren't
both the vmio name and the comment slightly misleading?


 +static inline int emerg_rx_pages_try_inc(void)
 +{
 +	return atomic_read(&vmio_socks) &&
 +		atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
 +}

It looks cleaner to move that first check to the caller, as it is often
redundant and in the other cases makes it more clear what the caller is
really doing.


 @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);

  static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
  int min_free_kbytes = 1024;
 +int var_free_kbytes;

Using var_free_pages makes the code slightly simpler, as all that needless
conversion isn't needed anymore. Perhaps the same is true for min_free_kbytes...


 +noskb:
 + /* Attempt emergency allocation when RX skb. */
 +	if (!(flags & SKB_ALLOC_RX))
 + goto out;

So only incoming skb allocation is guaranteed? What about outgoing skbs?
What am I missing? Or can we sleep then, and increasing var_free_kbytes is
sufficient to guarantee it?


 +
 +	if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
 + goto out;
 +
 + if (!emerg_rx_pages_try_inc())
 + goto out;
 +
 + skb = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
 + if (!skb) {
 + WARN_ON(1); /* we were promised memory but didn't get it? */
 + goto dec_out;
 + }
 +
 + if (!emerg_rx_pages_try_inc())
 + goto skb_free_out;
 +
 + data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
 + if (!data) {
 + WARN_ON(1); /* we were promised memory but didn't get it? */
 + emerg_rx_pages_dec();
 +skb_free_out:
 + free_page((unsigned long)skb);
 + skb = NULL;
 +dec_out:
 + emerg_rx_pages_dec();
 + goto out;
 + }
 +
 + cache = NULL;
 + goto allocated;
  }

  /**
 @@ -267,7 +304,7 @@ struct sk_buff *__netdev_alloc_skb(struc
  {
   struct sk_buff *skb;

 - skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
 + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
 @@ -315,10 +352,20 @@ static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);

 -		kfree(skb->head);
 +		if (skb->emerg) {
 +			free_page((unsigned long)skb->head);
 +			emerg_rx_pages_dec();
 +		} else
 +			kfree(skb->head);
 	}
  }

 +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
 +{
 + free_page((unsigned long)objp);
 + emerg_rx_pages_dec();
 +}
 +
  /*
   *   Free an skbuff by memory without cleaning the state.
   */
 @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
  {
   struct sk_buff *other;
   atomic_t *fclone_ref;
 + void (*free_skb)(struct kmem_cache *, void *);

   skb_release_data(skb);
 +
 +	free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
 +
 	switch (skb->fclone) {
   case SKB_FCLONE_UNAVAILABLE:
 - kmem_cache_free(skbuff_head_cache, skb);
 + free_skb(skbuff_head_cache, skb);
   break;

   case SKB_FCLONE_ORIG:
   fclone_ref = (atomic_t *) (skb + 2);
   if (atomic_dec_and_test(fclone_ref))
 - kmem_cache_free(skbuff_fclone_cache, skb);
 + free_skb(skbuff_fclone_cache, skb);
   break;

   case SKB_FCLONE_CLONE:
 @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
 	skb->fclone = SKB_FCLONE_UNAVAILABLE;

   if (atomic_dec_and_test(fclone_ref))
 - kmem_cache_free(skbuff_fclone_cache, other);
 + free_skb(skbuff_fclone_cache, other);
   break;
   };
  }

I don't have the original code in front of me, but isn't it possible to
add a goto free which has all the freeing in one place? That would get
rid 

Re: [PATCH 4/4] nfs: deadlock prevention for NFS

2006-08-25 Thread Indan Zupancic
   /* Check if this were the first socks: */
   if (nr_socks - socks == 0)
   reserve += RX_RESERVE_PAGES;

Can of course be:

if (nr_socks == socks)
reserve += RX_RESERVE_PAGES;

Grumble,

Indan




Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Indan Zupancic
Hello,

Here's a patch to get rid of max_dgram_qlen proc option. All it does is
slow down unix datagram packet sending, without giving the program any
control over it.

Applying it decreases code size, simplifies the code and makes poll
behaviour more logical for connected datagram sockets in regard to
POLLOUT. With the current code poll can say the socket is writable,
because it only checks the buffers, while sendmsg will fail because too
many packets are queued up already. In practice this means that a slow
reader will cause a non-blocking sender to hog the CPU.

Either this, or it should be implemented correctly, which means poll needs
to be fixed to also check for max_dgram_qlen, and maybe an ioctl should be
added to change the variable per socket.

Greetings,

Indan


diff --git a/net/unix/Makefile b/net/unix/Makefile
--- a/net/unix/Makefile Sun Aug 06 19:00:05 2006 +
+++ b/net/unix/Makefile Tue Aug 22 21:06:09 2006 +0200
@@ -5,4 +5,3 @@ obj-$(CONFIG_UNIX)  += unix.o
 obj-$(CONFIG_UNIX) += unix.o

 unix-y := af_unix.o garbage.o
-unix-$(CONFIG_SYSCTL)  += sysctl_net_unix.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
--- a/net/unix/af_unix.cSun Aug 06 19:00:05 2006 +
+++ b/net/unix/af_unix.cTue Aug 22 21:05:39 2006 +0200
@@ -117,8 +117,6 @@
 #include <net/checksum.h>
 #include <linux/security.h>

-int sysctl_unix_max_dgram_qlen = 10;
-
 struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 DEFINE_SPINLOCK(unix_table_lock);
 static atomic_t unix_nr_socks = ATOMIC_INIT(0);
@@ -586,7 +584,6 @@ static struct sock * unix_create1(struct
af_unix_sk_receive_queue_lock_key);

 	sk->sk_write_space	= unix_write_space;
-	sk->sk_max_ack_backlog	= sysctl_unix_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->dentry = NULL;
@@ -1379,23 +1376,6 @@ restart:
goto out_unlock;
}

-	if (unix_peer(other) != sk &&
-	    (skb_queue_len(&other->sk_receive_queue) >
-	     other->sk_max_ack_backlog)) {
-		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
-
-		timeo = unix_wait_for_peer(other, timeo);
-
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
-
-		goto restart;
-	}
-
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	unix_state_runlock(other);
 	other->sk_data_ready(other, len);
@@ -2076,7 +2056,6 @@ static int __init af_unix_init(void)
 #ifdef CONFIG_PROC_FS
 	proc_net_fops_create("unix", 0, &unix_seq_fops);
 #endif
-   unix_sysctl_register();
 out:
return rc;
 }
@@ -2084,7 +2063,6 @@ static void __exit af_unix_exit(void)
 static void __exit af_unix_exit(void)
 {
sock_unregister(PF_UNIX);
-   unix_sysctl_unregister();
 	proc_net_remove("unix");
 	proto_unregister(&unix_proto);
 }
diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c deleted 
file mode 100644
--- a/net/unix/sysctl_net_unix.cSun Aug 06 19:00:05 2006 +
+++ /dev/null   Thu Jan 01 00:00:00 1970 +
@@ -1,60 +0,0 @@
-/*
- * NET4:	Sysctl interface to net af_unix subsystem.
- *
- * Authors:	Mike Shaver.
- *
- * This program is free software; you can redistribute it and/or
- *	modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include <linux/mm.h>
-#include <linux/sysctl.h>
-
-#include <net/af_unix.h>
-
-static ctl_table unix_table[] = {
-   {
-   .ctl_name   = NET_UNIX_MAX_DGRAM_QLEN,
-		.procname	= "max_dgram_qlen",
-		.data		= &sysctl_unix_max_dgram_qlen,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table unix_net_table[] = {
-   {
-   .ctl_name   = NET_UNIX,
-		.procname	= "unix",
-   .mode   = 0555,
-   .child  = unix_table
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table unix_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-		.procname	= "net",
-   .mode   = 0555,
-   .child  = unix_net_table
-   },
-   { .ctl_name = 0 }
-};
-
-static struct ctl_table_header * unix_sysctl_header;
-
-void unix_sysctl_register(void)
-{
-   unix_sysctl_header = register_sysctl_table(unix_root_table, 0);
-}
-
-void unix_sysctl_unregister(void)
-{
-   unregister_sysctl_table(unix_sysctl_header);
-}
-





Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Indan Zupancic
On Tue, August 22, 2006 22:39, Alexey Kuznetsov said:
 Feel free to do this correctly. :-)
 Deleting wrong code rarely helps.

 It is the only protection of commiting infinite amount of memory to a socket.

Doesn't the if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) check in
sock_alloc_send_pskb() limit things already? Or don't unix datagram sockets
have a limited send buffer? Ouch, that complicates things.

But wait a moment, I'm running a kernel now with the patch applied, and some
test code shows that everything seems to behave as expected, with send()
returning EAGAIN after a while, and poll giving POLLOUT when more data can
be sent.

What am I missing?

Indan




Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Indan Zupancic
On Wed, August 23, 2006 0:34, Alexey Kuznetsov said:
  It is the only protection of commiting infinite amount of memory to a 
  socket.

  Doesn't the if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) check in
  sock_alloc_send_pskb() limit things already?

 Unfortunately, it does not. You can open a socket, send
 something to a selected victim, close it, and repeat this
 until receiver accumulates enough of skbs to kill the system.

Well, it seems the devil is in the details, as usual.

Isn't a socket freed only after all its skbs are handled? In which case the
limit on the number of open files limits the total memory usage? (Same as
with streaming sockets?)

Greetings,

Indan




Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Indan Zupancic
On Wed, August 23, 2006 1:32, Alexey Kuznetsov said:

 Isn't a socket freed until all skb are handled? In which case the limit on 
 the number of open
 files limits the total memory usage? (Same as with streaming sockets?)

 Alas. Number of closed sockets is not limited. Actually, it is limited
 by sk_max_ack_backlog*max_files, which is a lot.

Hmm... So setting sk_max_ack_backlog to 1 makes it limited by max_files,
which should make the worst case the same as for streaming sockets, right?

 The problem is specific for unconnected datagram sockets
 (predicate unix_peer(other) != sk)

Doesn't that mean that both sockets are connected to each other? I mean,
if only this socket connects to the other, the above check isn't true.

Greetings,

Indan




Re: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 17:06, Peter Zijlstra said:
 On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
 Peter Zijlstra wrote:
  Index: linux-2.6/include/linux/gfp.h
  ===
   --- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.0 +0200
   +++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.0 +0200
   @@ -46,6 +46,7 @@ struct vm_area_struct;
    #define __GFP_ZERO	((__force gfp_t)0x8000u)	/* Return zeroed page on success */
    #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u)	/* Don't use emergency reserves */
    #define __GFP_HARDWALL	((__force gfp_t)0x20000u)	/* Enforce hardwall cpuset memory allocs */
   +#define __GFP_MEMALLOC	((__force gfp_t)0x40000u)	/* Use emergency reserves */

 This symbol name has nothing to do with its purpose.  The entire area of
 code you are modifying could be described as having something to do with
 'memalloc'.

 GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
 symbol name.

 I recognize that is matches with GFP_NOMEMALLOC, but that doesn't change
 the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC
 would be nice.

 I'm rather bad at picking names, but here goes:

 PF_MEMALLOC  - PF_EMERGALLOC
 __GFP_NOMEMALLOC - __GFP_NOEMERGALLOC
 __GFP_MEMALLOC   - __GFP_EMERGALLOC

 Is that suitable and shall I prepare patches? Or do we want more ppl to
 chime in and have a few more rounds?

Pardon my ignorance, but if we're doing cleanup anyway, why not use only one
flag instead of two? Why is __GFP_NOMEMALLOC needed when not setting
__GFP_MEMALLOC could mean the same? Or else what is the expected behaviour
if both flags are set?




Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 16:14, Peter Zijlstra said:
 Hi,

 here the latest effort, it includes a whole new trivial allocator with a
 horrid name and an almost full rewrite of the deadlock prevention core.
 This version does not do anything per device and hence does not depend
 on the new netdev_alloc_skb() API.

 The reason to add a second allocator to the receive side is twofold:
 1) it allows easy detection of the memory pressure / OOM situation;
 2) it allows the receive path to be unbounded and go at full speed when
resources permit.

 The choice of using the global memalloc reserve as a mempool makes that
 the new allocator has to release pages as soon as possible; if we were
 to hoard pages in the allocator the memalloc reserve would not get
 replenished readily.

Version 2 had about 250 new lines of code, while v3 has close to 600, when
including the SROG code. And that while things should have become simpler.
So why use SROG instead of the old alloc_pages() based code? And why couldn't
you use a slightly modified SLOB instead of writing a new allocator?
It looks like overkill to me.

Greetings,

Indan




Re: [RFC][PATCH 3/4] deadlock prevention core

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 16:14, Peter Zijlstra said:
 +struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
 +{
 +	struct sk_buff *skb;
 +
 +	skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
 +
 +	if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
 +		skb = ___alloc_skb(size, gfp_mask, fclone);
 +
 +	return skb;
 +}
 +

I'd drop the memalloc_skbs_available() check, as that's already done by
___alloc_skb.

 +static DEFINE_SPINLOCK(memalloc_lock);
 +static int memalloc_socks;
 +static unsigned long memalloc_reserve;

Why is this a long? adjust_memalloc_reserve() takes an int.
Is it needed at all, considering var_free_kbytes already exists?

Greetings,

Indan




Re: [RFC][PATCH 3/4] deadlock prevention core

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 19:44, Peter Zijlstra said:
 Euhm, right :-) long comes naturally when I think about quantities of
 pages. The adjust_memalloc_reserve() argument is an increment, a delta;
 perhaps I should change that to long.

Maybe, but having 16 TB of reserved memory seems plenty for a while.

 Having them separate would allow ajust_memalloc_reserve() to be used by
 other callers too (would need some extra locking).

True, but currently memalloc_reserve isn't used in a sensible way,
or I'm missing something.

Greetings,

Indan




Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 19:33, Peter Zijlstra said:
 Simpler yes, but also more complete; the old patches had serious issues
 with the alternative allocation scheme.

It sure is more complete, and looks nicer, but the price is IMHO too high.
I'm curious what those serious issues are, and if they can't be fixed.

 As for why SROG, because trying to stick all the semantics needed for
 all skb operations into the old approach was nasty, I had it almost
 complete but it was horror (and more code than the SROG approach).

What was missing or wrong in the old approach? Can't you use the new
approach, but use alloc_pages() instead of SROG?

Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)

Greetings,

Indan




Re: [RFC][PATCH 3/4] deadlock prevention core

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 20:08, Peter Zijlstra said:
 On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
 True, but currently memalloc_reserve isn't used in a sensible way,
 or I'm missing something.

 Well, I'm somewhat reluctant to stick network related code into mm/, it
 seems well separated now.

What I had in mind was something like:

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+int sk_adjust_memalloc(int nr_socks)
+{
+	unsigned long flags;
+	unsigned int reserve;
+	int err;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+
+	memalloc_socks += nr_socks;
+	BUG_ON(memalloc_socks < 0);
+
+	reserve = nr_socks * (2 * MAX_PHYS_SEGMENTS +	/* outbound */
+			      5 * MAX_CONCURRENT_SKBS);	/* inbound */
+
+	err = adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+	if (err) {
+		printk(KERN_WARNING
+			"Unable to change RX reserve to: %u, error: %d\n",
+			reserve, err);
+	}
+	return err;
+}

The original code missed the brackets, so 5 * MAX_CONCURRENT_SKBS wasn't counted
per socket. But the comment said it was per socket, so I added them in this version.

Greetings,

Indan




Re: [RFC][PATCH 3/4] deadlock prevention core

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 20:47, Peter Zijlstra said:
 Ah right, I did that in v3, with a similar comment, but I realised that
 the inbound reserve need not be per socket, and that the comment was
 ambiguous enough to allow this reading.

True, but it's better to change the comment than to confuse people.
Much of it is outdated because reservations aren't per device anymore.

Changes to your version:
- Got rid of memalloc_socks.
- Don't include inetdevice.h (it isn't needed anymore, right?)
- Updated comment.

(I'm editing the diff, so this won't apply)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c  2006-08-12 12:56:06.0 +0200
+++ linux-2.6/net/core/sock.c   2006-08-12 13:02:59.0 +0200
@@ -111,6 +111,8 @@
 #include <linux/poll.h>
 #include <linux/tcp.h>
 #include <linux/init.h>
+#include <linux/blkdev.h>

 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -195,6 +197,78 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+static unsigned long memalloc_reserve;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve
+ * @nr_socks: number of new %SOCK_MEMALLOC sockets
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * For each %SOCK_MEMALLOC socket 2 * %MAX_PHYS_SEGMENTS pages are
+ * reserved for outbound traffic (assumption: each %SOCK_MEMALLOC
+ * socket will have a %request_queue associated).
+ *
+ * Pages for inbound traffic are already reserved.
+ *
+ * 2 * %MAX_PHYS_SEGMENTS - the request queue can hold up to 150%,
+ * the remaining 50% goes to being sure we can write packets
+ * for the outgoing pages.
+ */
+int sk_adjust_memalloc(int nr_socks)
+{
+   unsigned long flags;
+   unsigned int reserve;
+   int err;
+
+   spin_lock_irqsave(&memalloc_lock, flags);
+
+   memalloc_socks += nr_socks;
+   BUG_ON(memalloc_socks < 0);
+
+   reserve = nr_socks * 2 * MAX_PHYS_SEGMENTS; /* outbound */
+
+   err = adjust_memalloc_reserve(reserve);
+   spin_unlock_irqrestore(&memalloc_lock, flags);
+   if (err) {
+   printk(KERN_WARNING
+   "Unable to adjust RX reserve by %u, error: %d\n",
+   reserve, err);
+   }
+   return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);

What's missing now is an adjust_memalloc_reserve(5 * MAX_CONCURRENT_SKBS)
call in some init code.

Greetings,

Indan




Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

2006-08-12 Thread Indan Zupancic
On Sat, August 12, 2006 20:54, Peter Zijlstra said:
  - single allocation group per packet - that is, when I free a packet
 and all its associated object I get my memory back.

This is easy.

  - not waste too much space managing the various objects

This too, when ignoring clones and COW.

 skb operations want to allocate various sk_buffs for the same data
 clones. Also, it wants to be able to break the COW or realloc the data.

So this seems to be what adds all the complexity.

 So I tried manual packing (parts of that you have seen in previous
 attempts). This gets hard when you want to do unlimited clones and COW
 breaks. To do either you need to go link several pages.

It gets messy quite quickly, yes.

 So needing a list of pages and wanting packing gave me SROG. The biggest
 wart is having to deal with higher order pages. Explicitly coding in
 knowledge of the object you're packing just makes the code bigger - such
 is the power of abstraction.

I assume you meant "not explicitly coding in", or else I'm tempted to disagree.
Abstraction that has only one user which uses it in one way only adds bloat.
But looking at the code a bit more, I'm afraid you're right.

Greetings,

Indan




Re: [RFC][PATCH] VM deadlock prevention core -v3

2006-08-10 Thread Indan Zupancic
On Thu, August 10, 2006 18:50, Peter Zijlstra said:
 You are right if the reserve wasn't device bound - which I will abandon
 because you are right that with multi-path routing, bridge device and
 other advanced goodies this scheme is broken in that there is no
 unambiguous mapping from sockets to devices.

The natural thing seems to make reserves socket bound, but that has
overhead too and the simplicity of a global reserve is very tempting.

What about adding a flag to sk_set_memalloc() which says whether memalloc is on
or off for the socket? (Or add sk_unset_memalloc.) That way it's possible
to switch it off again, which doesn't seem like a bad idea, because
then it can be turned on only while the socket can actually be used to reduce
total memory usage. Also, if it is turned off again once no more memory can be
freed through this socket, that solves the starvation problem, as a
starved socket then gets a new chance to do its thing.

Greetings,

Indan




Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-09 Thread Indan Zupancic
On Wed, August 9, 2006 2:25, Daniel Phillips said:
 Indan Zupancic wrote:
 Hello,
 Saw the patch on lkml, and wondered about some things.
 On Tue, August 8, 2006 21:33, Peter Zijlstra said:

+static inline void dev_unreserve_skb(struct net_device *dev)
+{
+	if (atomic_dec_return(&dev->rx_reserve_used) < 0)
+		atomic_inc(&dev->rx_reserve_used);
+}
+

 This looks strange. Is it possible for rx_reserve_used to become negative?
 A quick look at the code says no, except in the one case where there isn't an
 if (unlikely(dev_reserve_used(skb->dev))) check:

 Yes, you can see what I'm trying to do there, I was short an atomic op to
 do it.  My ugly solution may well be... probably is racy.  Let's rewrite
 it with something better.  We want the atomic op that some people call
 monus: decrement unless zero.

Currently atomic_inc_not_zero(), atomic_add_unless() and atomic_cmpxchg()
exist, so making an atomic_dec_not_zero() should be easy.

@@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
 #endif

 	kfree_skbmem(skb);
+	if (dev && (dev->flags & IFF_MEMALLOC))
+		dev_unreserve_skb(dev);
 }

 So it seems that that < 0 check in dev_unreserve_skb() was only added to handle
 this case (though there seems to be a race between those two atomic ops).
 Isn't it better to remove that check and just do:

 if (dev && (dev->flags & IFF_MEMALLOC) && dev_reserve_used(dev))

 Seems to me that unreserve is also called from each protocol handler.
 Agreed, the < 0 check is fairly sickening.

Yes, but those all do that (racy) if (unlikely(dev_reserve_used(skb->dev)))
check. Adding another racy check doesn't improve the situation.


 The use of atomic seems a bit dubious. Either it's necessary, in which case
 changes depending on tests seem unsafe as they're not atomic and something
 crucial could change between the read and the check, or normal reads and writes
 combined with barriers would be sufficient. All in all it seems better to
 move that if (unlikely(dev_reserve_used(skb->dev))) check into
 dev_unreserve_skb(), and make the whole thing atomic if necessary. Then let
 dev_unreserve_skb() return whether rx_reserve_used was positive.
 Getting rid of dev_reserve_used() and using the atomic_read directly might be
 better, as it is set with the bare atomic instructions too and rarely used
 without dev_unreserve_skb().

 Barriers should work for this reserve accounting, but is that better than
 an atomic op?  I don't know, let's let the barrier mavens opine.

The atomic ops are fine, but if you do two atomic ops then that as a whole
isn't atomic any more and often racy. That's what seems to be the case
here.

 IMHO the cleanest thing to do is code up monus, in fact I dimly recall
 somebody already added something similar.

I'm not sure, to me it looks like dev_unreserve_skb() is always called
without really knowing if it is justified or not, or else there wouldn't
be a chance that the counter became negative. So avoiding the negative
reserve usage seems like papering over bad accounting.

The assumption made seems to be that if there's reserve used, then it must
be us using it, and it's unreserved. So it appears that either that
assumption is wrong, and we can unreserve for others while we never
reserved for ourselves, or it is correct, in which case it probably makes
more sense to check for the IFF_MEMALLOC flag.

All in all it seems like a per skb flag which tells us if this skb was the
one reserving anything is missing. Or rx_reserve_used must be updated for
all in flight skbs whenever the IFF_MEMALLOC flag changes, so that we can
be sure that the accounting works correctly. Oh wait, isn't that what the
memalloc flag is for? So shouldn't it be sufficient to check only with
sk_is_memalloc()? That avoids lots of checks and should guarantee that the
accounting is correct, except in the case when the IFF_MEMALLOC flag is
cleared and the counter is set to zero manually. Can't that be avoided and
just let it decrease to zero naturally? So checking IFF_MEMALLOC for new
skbs and use sk_is_memalloc() for existing ones seems workable, if I'm not
missing anything (I think I do).

 Side note: please don't be shy, just reply-all in future so the discussion
 stays public.  Part of what we do is try to share our development process
 so people see not only what we have done, but why we did it.  (And sometimes
 so they can see what dumb mistakes we make, but I won't get into that...)

Yes, I know, but as I don't have much kernel programming experience I
didn't want to add unnecessary noise.

 I beg for forgiveness in advance having taken the liberty of CCing this
 reply to lkml.

No problem.

Greetings,

Indan




Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-09 Thread Indan Zupancic
On Wed, August 9, 2006 14:54, Peter Zijlstra said:
 On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
  That avoids lots of checks and should guarantee that the
 accounting is correct, except in the case when the IFF_MEMALLOC flag is
 cleared and the counter is set to zero manually. Can't that be avoided and
 just let it decrease to zero naturally?

 That would put the atomic op on the free path unconditionally, I think
 davem gets nightmares from that.

I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
to unconditionally decrement the reserved usage only when memalloc is true
on the free path. That way all skbs that increased the reserve also decrease
it, and the counter should never go below zero.

Also as far as I can see it should be possible to replace all atomic
if (unlikely(dev_reserve_used(skb->dev))) checks with a check of whether
memalloc is set. That should make davem happy, as there aren't any
atomic instructions left in hot paths.

If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.
When the skb is being destroyed it doesn't matter if IFF_MEMALLOC is set
or not, only if that skb used reserves and thus only the memalloc flag
needs to be checked. This means that changing the IFF_MEMALLOC doesn't
affect in-flight skbs but only newly created ones, and there's no need to
update in-flight skbs whenever the flag is changed as all should go well.

+int sk_set_memalloc(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+	int err = 0;
+
+	if (!dev)
+		return -ENODEV;
+
+	if (!(dev->features & NETIF_F_MEMALLOC)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	if (atomic_read(&dev->memalloc_socks) == 0) {
+		spin_lock(&dev->memalloc_lock);
+		if (atomic_read(&dev->memalloc_socks) == 0) {
+			dev->memalloc_reserve =
+				dev->rx_reserve * skb_pages(dev->mtu);
+			err = adjust_memalloc_reserve(dev->memalloc_reserve);
+			if (err) {
+				spin_unlock(&dev->memalloc_lock);
+				printk(KERN_WARNING
+					"%s: Unable to allocate RX reserve, error: %d\n",
+					dev->name, err);
+				goto out;
+			}
+			sock_set_flag(sk, SOCK_MEMALLOC);
+			dev->flags |= IFF_MEMALLOC;
+		}
+		atomic_inc(&dev->memalloc_socks);
+		spin_unlock(&dev->memalloc_lock);
+	} else
+		atomic_inc(&dev->memalloc_socks);
+
+out:
+	dev_put(dev);
+	return err;
+}

It seems that here SOCK_MEMALLOC is only set on the first socket.
Shouldn't it be set on all sockets instead?

Greetings,

Indan




Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-09 Thread Indan Zupancic
On Wed, August 9, 2006 16:00, Peter Zijlstra said:
 On Wed, 2006-08-09 at 15:48 +0200, Indan Zupancic wrote:
 On Wed, August 9, 2006 14:54, Peter Zijlstra said:
  On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
   That avoids lots of checks and should guarantee that the
  accounting is correct, except in the case when the IFF_MEMALLOC flag is
  cleared and the counter is set to zero manually. Can't that be avoided and
  just let it decrease to zero naturally?
 
  That would put the atomic op on the free path unconditionally, I think
  davem gets nightmares from that.

 I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
 to unconditionally decrement the reserved usage only when memalloc is true
 on the free path. That way all skbs that increased the reserve also decrease
 it, and the counter should never go below zero.

 OK, so far so good, except we lose the notion of getting memory back
 from regular skbs.

I don't understand this, regular skbs don't have anything to do with
rx_reserve_used as far as I can see. I'm only talking about keeping
that field up to date and correct. rx_reserve_used is only increased
by a skb when memalloc is set to true on that skb, so only if that field
is set rx_reserve_used needs to be reduced when the skb is freed.

Why is it needed for the protocol specific code to call dev_unreserve_skb?

Only problem is if the device can change. rx_reserve_used should probably
be updated when that happens, as a skb can't use reserved memory on a device
it was moved away from. (right?)

 Also as far as I can see it should be possible to replace all atomic
 if (unlikely(dev_reserve_used(skb->dev))) checks with a check of whether
 memalloc is set. That should make davem happy, as there aren't any
 atomic instructions left in hot paths.

 dev_reserve_used() uses atomic_read() which isn't actually a LOCK'ed
 instruction, so that should not matter.

Perhaps, but the main reason to check memalloc instead of using
dev_reserve_used is because the latter doesn't tell which skb did the
reservation.

 If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.

 Not quite, if IFF_MEMALLOC is set new skbs _could_ get memalloc set. We
 only fall back to alloc_pages() if the regular path fails to alloc. If the
 skb is backed by a page (as opposed to kmem_cache fluff) sk_buff::memalloc
 is set.

Yes, true. But doesn't matter for the rx_reserve_used accounting, as long as
memalloc set means that it did increase rx_reserve_used.

 Also, I've been thinking (more pain), should I not up the reserve for
 each SOCK_MEMALLOC socket.

Up rx_reserve_used or the total amount of reserved memory? Probably 'no' for
both though, as it's either device specific or skb dependent.

I'm slowly getting a clearer image of the big picture, I'll take another look
when you post the updated code.

Greetings,

Indan




Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-09 Thread Indan Zupancic
On Wed, August 9, 2006 21:45, Peter Zijlstra said:
 On Wed, 2006-08-09 at 20:34 +0200, Indan Zupancic wrote:
 Why is it needed for the protocol specific code to call dev_unreserve_skb?

 It uses this to get an indication of memory pressure; if we have
 memalloc'ed skbs memory pressure must be high, hence we must drop all
 non critical packets. But you are right in that this is a problematic
 area; the mapping from skb to device is non trivial.

 Your suggestion of testing skb-memalloc might work just as good; indeed
 if we have regressed into the fallback allocator we know we have
 pressure.

You seem to have explained dev_reserve_used usage, not the dev_unreserve_skb calls.
But I've just found -v2 and see that they're gone now, great. -v2 looks much better.

 Only problem is if the device can change. rx_reserve_used should probably
 be updated when that happens, as a skb can't use reserved memory on a device
 it was moved away from. (right?)

 Well yes, this is a problem, only today have I understood how volatile
 the mapping actually is. I think you are right in that transferring the
 accounting from the old to the new device is correct solution.

 However this brings us the problem of limiting the fallback allocator;
 currently this is done in __netdev_alloc_skb where rx_reserve_used is
 compared against rx_reserve. If we transfer accounting away this will
 not work anymore. I'll have to think about this case, perhaps we already
 have a problem here.

The point of the reservations is to avoid deadlocks, and they're always big
enough to hold all in-flight skbs, right? So what about solving the whole
device problem by using a global counter and limit instead of per device?

The question is whether traffic on one device can starve traffic on other
devices or not, and how big a problem that is. It probably can, tricky stuff.
Though getting rid of the per device stuff would simplify a lot...

  Also, I've been thinking (more pain), should I not up the reserve for
  each SOCK_MEMALLOC socket.

 Up rx_reserve_used or the total amount of reserved memory? Probably 'no' for
 both though, as it's either device specific or skb dependent.

 I came up with yes, if for each socket you gain a request queue, the
 number of in-flight pages is proportional to the number of sockets.

Yes, seems so.

Good night,

Indan

