Re: [RFC 1/2] [IPV6] ADDRCONF: Preparation for configurable address selection policy with ifindex.

2007-10-30 Thread David Miller
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 14:52:37 +0900 (JST)

 Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]

What is the substance of this change?  Please add a description of
this to the changelog entry as currently the description is far too
brief and vague.

Even saying simply that the change allows the interface index
to be passed into the address selection routines would be
a great improvement.

Thank you.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 9260] New: tipc_config.h is not installed when doing make headers_install

2007-10-30 Thread David Miller
From: Andrew Morton [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 12:07:16 -0700

 On Mon, 29 Oct 2007 09:10:26 -0700 (PDT)
 [EMAIL PROTECTED] wrote:
 
  http://bugzilla.kernel.org/show_bug.cgi?id=9260
 ...
  Problem Description:
  When doing make headers_install the file tipc_config.h is not installed.
  It describes the interface to configure the TIPC module and it is needed
  when building the config utility (tipc-config).
  Adding the following line to include/linux/Kbuild solves this:
  header-y += tipc_config.h
  

Fair enough, I'll commit the following and submit to
-stable as well.

From 502ef38da15d817f8e67acefc12dc2212f7f8aa1 Mon Sep 17 00:00:00 2001
From: David S. Miller [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 01:19:19 -0700
Subject: [PATCH] [TIPC]: Add tipc_config.h to include/linux/Kbuild.

Needed, as reported in:

http://bugzilla.kernel.org/show_bug.cgi?id=9260

Signed-off-by: David S. Miller [EMAIL PROTECTED]
---
 include/linux/Kbuild |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 6a65231..bd33c22 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -149,6 +149,7 @@ header-y += ticable.h
 header-y += times.h
 header-y += tiocl.h
 header-y += tipc.h
+header-y += tipc_config.h
 header-y += toshiba.h
 header-y += ultrasound.h
 header-y += un.h
-- 
1.5.2.5



Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines

2007-10-30 Thread Luis R. Rodriguez
On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote:
 On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote:
  This commit made an incorrect assumption:
  --
  Author: Lennert Buytenhek [EMAIL PROTECTED]
   Date:   Fri Oct 19 04:10:10 2007 +0200
 
  mv643xx_eth: Move ethernet register definitions into private header
 
  Move the mv643xx's ethernet-related register definitions from
  include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since
  they aren't of any use outside the ethernet driver.
 
  Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]
  Acked-by: Tzachi Perelstein [EMAIL PROTECTED]
  Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
  --
 
  arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there.
 
  [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe
 
  v2.6.24-rc1-138-g0119130
 
  This patch fixes this by internalizing the 3 defines into pegasos_eth.c,
  since they are simply no longer available elsewhere. Without this, your compile will fail

 That compile failure was fixed in commit
 30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro.

 However, as I examine that commit, I see that it defines offsets from
 the eth block in the chip, rather than the full chip register block
 as the Pegasos 2 code expects.  So, I think it fixes the compile
 failure, but leaves the Pegasos 2 broken.

 Luis, do you have Pegasos 2 hardware?  Can you (or anyone) verify that
 the following patch is needed for the Pegasos 2?

Nope, sorry.

  Luis


Re: [PATCH] net: Saner thash_entries default with much memory

2007-10-30 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Fri, 26 Oct 2007 17:34:17 +0200

 On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
 I propose 2 million entries as the arbitrary high limit. This
 
 It's probably still far too large.

I agree.  Perhaps a better number is something on the order of
(512 * 1024) so I think I'll check in a variant of Jean's patch
with just the limit decreased like that.

Using just some back of the envelope calculations, on UP 64-bit
systems each socket uses about 2424 bytes minimum of memory (this is
the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP).
This is an underestimate because it does not even consider things like
allocator overhead.

Next, machines that service that many sockets typically have them
mostly with full transmit queues talking to a very slow receiver at
the other end.  So let's estimate that on average each socket consumes
about 64K of retransmit queue data.

I think this is an extremely conservative estimate because it doesn't
even consider overhead coming from struct sk_buff and related state.

So for (512 * 1024) of established sockets we consume roughly 35GB of
memory, this is '((2424 + (64 * 1024)) * (512 * 1024))'.

So to me (512 * 1024) is a very reasonable limit and (with lockdep
and spinlock debugging disabled) this makes the EHASH table consume
8MB on UP 64-bit and ~12MB on SMP 64-bit systems.

Thanks.


Re: [RFC 2/2] [IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.

2007-10-30 Thread David Miller
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 14:52:54 +0900 (JST)

 diff --git a/include/linux/if_addrlabel.h b/include/linux/if_addrlabel.h
 new file mode 100644
 index 000..66978a5
 --- /dev/null
 +++ b/include/linux/if_addrlabel.h
 @@ -0,0 +1,55 @@
 +/*
 + * ifaddrlabel.h - netlink interface for address labels
 + *
 + * Copyright (C)2007 USAGI/WIDE Project,  All Rights Reserved.
 + *
 + * Redistribution and use in source and binary forms, with or without
 + * modification, are permitted provided that the following conditions
 + * are met:

Please, this is just a very primitive header file defining a
simplistic struct and a few enumerations.  Can't you just GPL it with
just the USAGI/WIDE copyright line, instead of using this complicated
license text?

If it is important for the USAGI Project to take credit for this work,
they will receive it fully in the copyright line and the changelog
entry.

Thank you.


Re: [PATCH] [IPv4] SNMP: Refer correct memory location to display ICMP out-going statistics

2007-10-30 Thread David Miller
From: David Stevens [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 13:54:50 -0700

 Dave,
 I didn't see a response for this one... in case it fell through the
 cracks. Just want to make sure my bone-headed error doesn't hang
 around too long. :-)

It's in my tree now, never fear :-)


Re: dn_route.c momentarily exiting RCU read-side critical section

2007-10-30 Thread David Miller
From: Paul E. McKenney [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 14:15:40 -0700

 net/decnet/dn_route.c in dn_rt_cache_get_next() is as follows:
 
 static struct dn_route *dn_rt_cache_get_next(struct seq_file *seq,
 					     struct dn_route *rt)
 {
 	struct dn_rt_cache_iter_state *s = rcu_dereference(seq->private);
 
 	rt = rt->u.dst.dn_next;
 	while(!rt) {
 		rcu_read_unlock_bh();
 		if (--s->bucket < 0)
 			break;
 
 ...  But what happens if seq->private is freed up right here?
 ...  Or what prevents this from happening?
 ...
 Similar code is in rt_cache_get_next().
 
 So, what am I missing here?

seq->private is allocated on file open (here via seq_open_private()),
and freed up on file close (via seq_release_private).

So it cannot be freed up in the middle of an iteration.



Re: [PATCH] rpc_rdma: we need to cast u64 to unsigned long long for printing

2007-10-30 Thread David Miller
From: Stephen Rothwell [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 16:12:40 +1100

 as some architectures have unsigned long for u64.
 
 net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_create_chunks':
 net/sunrpc/xprtrdma/rpc_rdma.c:222: warning: format '%llx' expects type
 'long long unsigned int', but argument 4 has type 'u64'
 net/sunrpc/xprtrdma/rpc_rdma.c:234: warning: format '%llx' expects type
 'long long unsigned int', but argument 5 has type 'u64'
 net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_count_chunks':
 net/sunrpc/xprtrdma/rpc_rdma.c:577: warning: format '%llx' expects type
 'long long unsigned int', but argument 4 has type 'u64'
 
 Noticed on PowerPC pseries_defconfig build.
 
 Signed-off-by: Stephen Rothwell [EMAIL PROTECTED]

I've applied this, thanks Stephen.


Re: [PATCH] ehea: add kexec support

2007-10-30 Thread Christoph Raisch


Michael Ellerman [EMAIL PROTECTED] wrote on 28.10.2007 23:32:17:


 How do you plan to support kdump?


When kexec is fully supported, kdump should work out of the box
as for any other Ethernet card (if you load the right eth driver).
There's nothing specific to kdump you have to handle in
ethernet device drivers.
Hope I didn't miss anything here...

Gruss / Regards
Christoph R



Re: kernel panic removing devices from a teql queuing discipline

2007-10-30 Thread David Miller
From: Chuck Ebbert [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 14:00:01 -0400

 The panic is in __teql_resolve (which has been inlined into
 teql_master_xmit) in net/sched/sch_teql.c at this line:
 
 	if (n && n->tbl == mn->tbl &&
 
 Specifically the dereference of n->tbl is faulting as n is not valid.
 
 And the address looks like part of an ASCII string...  figt

I studied sch_teql.c a bit and I suspect that the slave list
management in teql_destroy() and teql_qdisc_init() might be
suspect.

If someone can take a closer look at this, I'd appreciate it.


Re: [PATCH][RFC] Add support for the RDC R6040 Fast Ethernet controller

2007-10-30 Thread Ilpo Järvinen
On Mon, 29 Oct 2007, Florian Fainelli wrote:

 +static int  mdio_read(struct net_device *dev, int phy_id, int location);
 +static void mdio_write(struct net_device *dev, int phy_id, int location, int value);
 +static int r6040_open(struct net_device *dev);
 +static int r6040_start_xmit(struct sk_buff *skb, struct net_device *dev);
 +static irqreturn_t r6040_interrupt(int irq, void *dev_id);
 +static int r6040_close(struct net_device *dev);
 +static void set_multicast_list(struct net_device *dev);
 +static struct ethtool_ops netdev_ethtool_ops;
 +static int r6040_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
 +static void r6040_down(struct net_device *dev);
 +static void r6040_up(struct net_device *dev);
 +static void r6040_tx_timeout(struct net_device *dev);
 +static void r6040_timer(unsigned long);
 +static void r6040_mac_address(struct net_device *dev);
 +
 +static int phy_mode_chk(struct net_device *dev);
 +static int phy_read(int ioaddr, int phy_adr, int reg_idx);
 +static void phy_write(int ioaddr, int phy_adr, int reg_idx, int dat);
 +static void rx_buf_alloc(struct r6040_private *lp, struct net_device *dev);
 +#ifdef CONFIG_R6040_NAPI
 +static int r6040_poll(struct napi_struct *napi, int budget);
 +#endif
 +

...Most of those forward declarations can go if the functions are ordered 
properly. One can trivially notice that the mdio_{read,write} are 
unnecessary already:

 +static int mdio_read(struct net_device *dev, int phy_id, int regnum)
 +{
 +	struct r6040_private *lp = netdev_priv(dev);
 +	long ioaddr = dev->base_addr;
 +	return phy_read(ioaddr, lp->phy_addr, regnum);
 +}
 +
 +static void mdio_write(struct net_device *dev, int phy_id, int regnum, int value)
 +{
 +	struct r6040_private *lp = netdev_priv(dev);
 +	long ioaddr = dev->base_addr;
 +
 +	phy_write(ioaddr, lp->phy_addr, regnum, value);
 +}


-- 
 i.


Re: [RFC 1/2] [IPV6] ADDRCONF: Preparation for configurable address selection policy with ifindex.

2007-10-30 Thread Krishna Kumar2
Hi Yoshifuji,

YOSHIFUJI Hideaki wrote on 10/30/2007 11:22:37 AM:

 -static inline int ipv6_saddr_label(const struct in6_addr *addr, int type)
 +static inline int ipv6_addr_label(const struct in6_addr *addr, int type,
 +				  int ifindex)

This function doesn't use this new argument passed to it. Did you perhaps
intend to use it when initializing daddr_ifindex?

 +	int daddr_ifindex = daddr_dev ? daddr_dev->ifindex : 0;

Thanks,

- KK



[PATCH 1/1] Blackfin EMAC driver: Fix Ethernet communication bug (duplicated and lost packets)

2007-10-30 Thread Bryan Wu
From: Michael Hennerich [EMAIL PROTECTED]

Fix an Ethernet communication bug (duplicated and lost packets) in RMII
PHY mode: don't call mac_disable and mac_enable during 10/100 REFCLK
changes - mac_enable screws up the DMA descriptor chain.

Signed-off-by: Michael Hennerich [EMAIL PROTECTED]
Signed-off-by: Bryan Wu [EMAIL PROTECTED]
---
 drivers/net/bfin_mac.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c
index 53fe7de..084acfd 100644
--- a/drivers/net/bfin_mac.c
+++ b/drivers/net/bfin_mac.c
@@ -371,7 +371,6 @@ static void bf537_adjust_link(struct net_device *dev)
 	if (phydev->speed != lp->old_speed) {
 #if defined(CONFIG_BFIN_MAC_RMII)
 		u32 opmode = bfin_read_EMAC_OPMODE();
-		bf537mac_disable();
 		switch (phydev->speed) {
 		case 10:
 			opmode |= RMII_10;
@@ -386,7 +385,6 @@ static void bf537_adjust_link(struct net_device *dev)
 			break;
 		}
 		bfin_write_EMAC_OPMODE(opmode);
-		bf537mac_enable();
 #endif
 
new_state = 1;
-- 
1.5.3.4


Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Boaz Harrosh
On Mon, Oct 29 2007 at 22:16 +0200, Jens Axboe [EMAIL PROTECTED] wrote:
 On Fri, Oct 26 2007, Herbert Xu wrote:
 [CRYPTO] tcrypt: Move sg_init_table out of timing loops

 This patch moves the sg_init_table out of the timing loops for hash
 algorithms so that it doesn't impact on the speed test results.
 
 Wouldn't it be better to just make sg_init_one() call sg_init_table?
 
 diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
 index 4571231..ccc55a6 100644
 --- a/include/linux/scatterlist.h
 +++ b/include/linux/scatterlist.h
 @@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist *sg)
  }
  
  /**
 - * sg_init_one - Initialize a single entry sg list
 - * @sg:   SG entry
 - * @buf:  Virtual address for IO
 - * @buflen:   IO length
 - *
 - * Notes:
 - *   This should not be used on a single entry that is part of a larger
 - *   table. Use sg_init_table() for that.
 - *
 - **/
 -static inline void sg_init_one(struct scatterlist *sg, const void *buf,
 -unsigned int buflen)
 -{
 - memset(sg, 0, sizeof(*sg));
 -#ifdef CONFIG_DEBUG_SG
 -	sg->sg_magic = SG_MAGIC;
 -#endif
 - sg_mark_end(sg, 1);
 - sg_set_buf(sg, buf, buflen);
 -}
 -
 -/**
   * sg_init_table - Initialize SG table
   * @sgl:	The SG table
   * @nents:	Number of entries in table
  @@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist *sgl, unsigned int nents)
  }
  
  /**
 + * sg_init_one - Initialize a single entry sg list
 + * @sg:   SG entry
 + * @buf:  Virtual address for IO
 + * @buflen:   IO length
 + *
 + * Notes:
 + *   This should not be used on a single entry that is part of a larger
 + *   table. Use sg_init_table() for that.
 + *
 + **/
 +static inline void sg_init_one(struct scatterlist *sg, const void *buf,
 +unsigned int buflen)
 +{
 + sg_init_table(sg, 1);
 + sg_set_buf(sg, buf, buflen);
 +}
 +
 +/**
   * sg_phys - Return physical address of an sg entry
   * @sg:   SG entry
   *
 
Yes please submit this patch. scsi-ml is full of sg_init_one, especially
on the error recovery path.

Thanks
Boaz


Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Jens Axboe
On Tue, Oct 30 2007, Boaz Harrosh wrote:
 On Mon, Oct 29 2007 at 22:16 +0200, Jens Axboe [EMAIL PROTECTED] wrote:
  On Fri, Oct 26 2007, Herbert Xu wrote:
  [CRYPTO] tcrypt: Move sg_init_table out of timing loops
 
  This patch moves the sg_init_table out of the timing loops for hash
  algorithms so that it doesn't impact on the speed test results.
  
  Wouldn't it be better to just make sg_init_one() call sg_init_table?
  
  diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
  index 4571231..ccc55a6 100644
  --- a/include/linux/scatterlist.h
  +++ b/include/linux/scatterlist.h
  @@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist *sg)
   }
   
   /**
  - * sg_init_one - Initialize a single entry sg list
  - * @sg: SG entry
  - * @buf:Virtual address for IO
  - * @buflen: IO length
  - *
  - * Notes:
  - *   This should not be used on a single entry that is part of a larger
  - *   table. Use sg_init_table() for that.
  - *
  - **/
  -static inline void sg_init_one(struct scatterlist *sg, const void *buf,
  -  unsigned int buflen)
  -{
  -   memset(sg, 0, sizeof(*sg));
  -#ifdef CONFIG_DEBUG_SG
  -	sg->sg_magic = SG_MAGIC;
  -#endif
  -   sg_mark_end(sg, 1);
  -   sg_set_buf(sg, buf, buflen);
  -}
  -
  -/**
  * sg_init_table - Initialize SG table
  * @sgl:	The SG table
  * @nents:	Number of entries in table
  @@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist *sgl, unsigned int nents)
   }
   
   /**
  + * sg_init_one - Initialize a single entry sg list
  + * @sg: SG entry
  + * @buf:Virtual address for IO
  + * @buflen: IO length
  + *
  + * Notes:
  + *   This should not be used on a single entry that is part of a larger
  + *   table. Use sg_init_table() for that.
  + *
  + **/
  +static inline void sg_init_one(struct scatterlist *sg, const void *buf,
  +  unsigned int buflen)
  +{
  +   sg_init_table(sg, 1);
  +   sg_set_buf(sg, buf, buflen);
  +}
  +
  +/**
* sg_phys - Return physical address of an sg entry
* @sg: SG entry
*
  
 Yes please submit this patch. scsi-ml is full of sg_init_one, especially
 on the error recovery path.

Will do.

-- 
Jens Axboe



Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines

2007-10-30 Thread Sven Luther
On Tue, Oct 30, 2007 at 03:44:59AM -0400, Luis R. Rodriguez wrote:
 On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote:
  On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote:
   This commit made an incorrect assumption:
   --
   Author: Lennert Buytenhek [EMAIL PROTECTED]
Date:   Fri Oct 19 04:10:10 2007 +0200
  
   mv643xx_eth: Move ethernet register definitions into private header
  
   Move the mv643xx's ethernet-related register definitions from
   include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since
   they aren't of any use outside the ethernet driver.
  
   Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]
   Acked-by: Tzachi Perelstein [EMAIL PROTECTED]
   Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
   --
  
    arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there.
  
   [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe
  
   v2.6.24-rc1-138-g0119130
  
    This patch fixes this by internalizing the 3 defines into pegasos_eth.c,
    since they are simply no longer available elsewhere. Without this, your compile will fail
 
  That compile failure was fixed in commit
  30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro.
 
  However, as I examine that commit, I see that it defines offsets from
   the eth block in the chip, rather than the full chip register block
  as the Pegasos 2 code expects.  So, I think it fixes the compile
  failure, but leaves the Pegasos 2 broken.
 
  Luis, do you have Pegasos 2 hardware?  Can you (or anyone) verify that
  the following patch is needed for the Pegasos 2?
 
 Nope, sorry.

I am busy right now, but have various pegasos machines available for
testing. What exactly should I test?

Friendly,

Sven Luther


Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines

2007-10-30 Thread Dale Farnsworth
On Tue, Oct 30, 2007 at 10:36:06AM +0100, Sven Luther wrote:
 On Tue, Oct 30, 2007 at 03:44:59AM -0400, Luis R. Rodriguez wrote:
  On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote:
   On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote:
This commit made an incorrect assumption:
--
Author: Lennert Buytenhek [EMAIL PROTECTED]
 Date:   Fri Oct 19 04:10:10 2007 +0200
   
mv643xx_eth: Move ethernet register definitions into private header
   
Move the mv643xx's ethernet-related register definitions from
include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since
they aren't of any use outside the ethernet driver.
   
Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]
Acked-by: Tzachi Perelstein [EMAIL PROTECTED]
Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
--
   
    arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there.
   
[EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe
   
v2.6.24-rc1-138-g0119130
   
    This patch fixes this by internalizing the 3 defines into pegasos_eth.c,
    since they are simply no longer available elsewhere. Without this, your
    compile will fail
  
   That compile failure was fixed in commit
   30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro.
  
   However, as I examine that commit, I see that it defines offsets from
    the eth block in the chip, rather than the full chip register block
   as the Pegasos 2 code expects.  So, I think it fixes the compile
   failure, but leaves the Pegasos 2 broken.
  
   Luis, do you have Pegasos 2 hardware?  Can you (or anyone) verify that
   the following patch is needed for the Pegasos 2?
  
  Nope, sorry.
 
 I am busy right now, but have various pegasos machines available for
 testing. What exactly should I test?

Thanks Sven.

Test whether an Ethernet port works at all.  I think it's currently
broken, but should work with the patch I supplied.

-Dale


[PATCH] core: fix free_netdev when register fails during notification call chain

2007-10-30 Thread Daniel Lezcano

Point 1:
The unregistering of a network device schedules a netdev_run_todo.
This function calls dev->destructor when it is set, and the
destructor calls free_netdev.

Point 2:
In the case of an initialization of a network device the usual code
is:
 * alloc_netdev
 * register_netdev
	-> if this one fails, call free_netdev and exit with error.

Point 3:
In the register_netdevice function, at the late stage when the device
is in the registered state, a call to the netdevice notifiers is made.
If one of the notifications returns an error, a rollback of the
registration is done using unregister_netdevice.

Conclusion:
When a network device fails to register during initialization because
one network subsystem returned an error during a notification call
chain, the network device is freed twice because of points 1 and 2.
The second free_netdev will be done with an invalid pointer.

Proposed solution:
The following patch moves all the code of unregister_netdevice, *except*
the call to net_set_todo, to a new function, rollback_registered.

The following functions are changed in this way:
 * register_netdevice: calls rollback_registered when a notification fails
 * unregister_netdevice: calls rollback_registered + net_set_todo; the
   call to net_set_todo now comes last. Since it just adds an element
   to a list, that should not break anything.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/core/dev.c |  112 ++---
 1 file changed, 59 insertions(+), 53 deletions(-)

Index: net-2.6/net/core/dev.c
===
--- net-2.6.orig/net/core/dev.c
+++ net-2.6/net/core/dev.c
@@ -3496,6 +3496,60 @@ static void net_set_todo(struct net_devi
 	spin_unlock(&net_todo_list_lock);
 }
 
+static void rollback_registered(struct net_device *dev)
+{
+	BUG_ON(dev_boot_phase);
+	ASSERT_RTNL();
+
+	/* Some devices call without registering for initialization unwind. */
+	if (dev->reg_state == NETREG_UNINITIALIZED) {
+		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
+				  "was registered\n", dev->name, dev);
+
+		WARN_ON(1);
+		return;
+	}
+
+	BUG_ON(dev->reg_state != NETREG_REGISTERED);
+
+	/* If device is running, close it first. */
+	dev_close(dev);
+
+	/* And unlink it from device chain. */
+	unlist_netdevice(dev);
+
+	dev->reg_state = NETREG_UNREGISTERING;
+
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+
+	/* Notify protocols, that we are about to destroy
+	   this device. They should clean all the things.
+	*/
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+
+	/*
+	 *	Flush the unicast and multicast chains
+	 */
+	dev_addr_discard(dev);
+
+	if (dev->uninit)
+		dev->uninit(dev);
+
+	/* Notifier chain MUST detach us from master device. */
+	BUG_TRAP(!dev->master);
+
+	/* Remove entries from kobject tree */
+	netdev_unregister_kobject(dev);
+
+	synchronize_net();
+
+	dev_put(dev);
+}
+
 /**
  *	register_netdevice	- register a network device
  *	@dev: device to register
@@ -3633,8 +3687,10 @@ int register_netdevice(struct net_device
 	/* Notify protocols, that a new device appeared. */
 	ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
 	ret = notifier_to_errno(ret);
-	if (ret)
-		unregister_netdevice(dev);
+	if (ret) {
+		rollback_registered(dev);
+		dev->reg_state = NETREG_UNREGISTERED;
+	}
 
 out:
 	return ret;
@@ -3911,59 +3967,9 @@ void synchronize_net(void)
 
 void unregister_netdevice(struct net_device *dev)
 {
-	BUG_ON(dev_boot_phase);
-	ASSERT_RTNL();
-
-	/* Some devices call without registering for initialization unwind. */
-	if (dev->reg_state == NETREG_UNINITIALIZED) {
-		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
-				  "was registered\n", dev->name, dev);
-
-		WARN_ON(1);
-		return;
-	}
-
-	BUG_ON(dev->reg_state != NETREG_REGISTERED);
-
-	/* If device is running, close it first. */
-	dev_close(dev);
-
-	/* And unlink it from device chain. */
-	unlist_netdevice(dev);
-
-	dev->reg_state = NETREG_UNREGISTERING;
-
-	synchronize_net();
-
-	/* Shutdown queueing discipline. */
-	dev_shutdown(dev);
-
-
-	/* Notify protocols, that we are about to destroy
-	   this device. They should clean all the things.
-	*/
-	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
-
-	/*
-	 *	Flush the unicast and multicast chains
-	 */
-	dev_addr_discard(dev);
-
-	if (dev->uninit)
-		dev->uninit(dev);
-
-	/* Notifier chain MUST detach us from master device. */
-	BUG_TRAP(!dev->master);
-
-	/* Remove entries from kobject tree */
-	netdev_unregister_kobject(dev);
-
+

[PATCH][NETNS] fix net released by rcu callback

2007-10-30 Thread Daniel Lezcano

When a network namespace reference is held by a network subsystem,
and when this reference is decremented in an rcu update callback, we
must ensure that there are no outstanding rcu updates before
trying to free the network namespace.

In the normal case, the rcu_barrier is called when the network namespace
is exiting in the cleanup_net function.

But when a network namespace creation fails, and the subsystems are
undone (like the cleanup), the rcu_barrier is missing.

This patch adds the missing rcu_barrier.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/core/net_namespace.c |2 ++
 1 file changed, 2 insertions(+)

Index: net-2.6/net/core/net_namespace.c
===
--- net-2.6.orig/net/core/net_namespace.c
+++ net-2.6/net/core/net_namespace.c
@@ -112,6 +112,8 @@ out_undo:
 		if (ops->exit)
 			ops->exit(net);
 	}
+
+	rcu_barrier();
 	goto out;
 }


[PATCH 2.6.24] ixgb: TX hangs under heavy load

2007-10-30 Thread Andy Gospodarek

Auke,

It has become clear that this patch resolves some tx-lockups in the ixgb
driver.  IBM did some checking and realized this hunk is in your
sourceforge driver, but not anywhere else.  Mind if we add it?

Thanks,

-andy

Signed-off-by: Andy Gospodarek [EMAIL PROTECTED]

---

 ixgb_main.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index d444de5..3ec7a41 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1324,7 +1324,7 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb,
 
 		/* Workaround for premature desc write-backs
 		 * in TSO mode.  Append 4-byte sentinel desc */
-		if (unlikely(mss && !nr_frags && size == len && size > 8))
+		if (unlikely(mss && (f == (nr_frags-1)) && size == len && size > 8))
 			size -= 4;
 


[IPV6] cleanup : remove proc_net_remove called twice

2007-10-30 Thread Daniel Lezcano

The file /proc/net/if_inet6 is removed twice.
First time in:
inet6_exit
 -> addrconf_cleanup
And followed a few lines after by:
inet6_exit
 -> if6_proc_exit

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/ipv6/addrconf.c |4 
 1 file changed, 4 deletions(-)

Index: net-2.6/net/ipv6/addrconf.c
===
--- net-2.6.orig/net/ipv6/addrconf.c
+++ net-2.6/net/ipv6/addrconf.c
@@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void)
 	del_timer(&addr_chk_timer);
 
 	rtnl_unlock();
-
-#ifdef CONFIG_PROC_FS
-	proc_net_remove(&init_net, "if_inet6");
-#endif
 }
 }


[PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface

2007-10-30 Thread Alexey Dobriyan
One proc_net_create() user less.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 net/ipv6/route.c |   70 +++
 1 file changed, 25 insertions(+), 45 deletions(-)

--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2288,71 +2288,49 @@ struct rt6_proc_arg
 
 static int rt6_info_route(struct rt6_info *rt, void *p_arg)
 {
-	struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
+	struct seq_file *m = p_arg;
 
-	if (arg->skip < arg->offset / RT6_INFO_LEN) {
-		arg->skip++;
-		return 0;
-	}
-
-	if (arg->len >= arg->length)
-		return 0;
-
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_dst.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_dst.addr),
 		   rt->rt6i_dst.plen);
 
 #ifdef CONFIG_IPV6_SUBTREES
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_src.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_src.addr),
 		   rt->rt6i_src.plen);
 #else
-	arg->len += sprintf(arg->buffer + arg->len,
-			    "00000000000000000000000000000000 00 ");
+	seq_puts(m, "00000000000000000000000000000000 00 ");
 #endif
 
 	if (rt->rt6i_nexthop) {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    NIP6_SEQFMT,
+		seq_printf(m, NIP6_SEQFMT,
 			   NIP6(*((struct in6_addr *)rt->rt6i_nexthop->primary_key)));
 	} else {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    "00000000000000000000000000000000");
+		seq_puts(m, "00000000000000000000000000000000");
 	}
-	arg->len += sprintf(arg->buffer + arg->len,
-			    " %08x %08x %08x %08x %8s\n",
+	seq_printf(m, " %08x %08x %08x %08x %8s\n",
 		   rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
 		   rt->u.dst.__use, rt->rt6i_flags,
 		   rt->rt6i_dev ? rt->rt6i_dev->name : "");
 	return 0;
 }
 
-static int rt6_proc_info(char *buffer, char **start, off_t offset, int length)
+static int ipv6_route_show(struct seq_file *m, void *v)
 {
-	struct rt6_proc_arg arg = {
-		.buffer = buffer,
-		.offset = offset,
-		.length = length,
-	};
-
-	fib6_clean_all(rt6_info_route, 0, &arg);
-
-	*start = buffer;
-	if (offset)
-		*start += offset % RT6_INFO_LEN;
-
-	arg.len -= offset % RT6_INFO_LEN;
-
-	if (arg.len > length)
-		arg.len = length;
-	if (arg.len < 0)
-		arg.len = 0;
+	fib6_clean_all(rt6_info_route, 0, m);
+	return 0;
+}
 
-	return arg.len;
+static int ipv6_route_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ipv6_route_show, NULL);
 }
 
+static const struct file_operations ipv6_route_proc_fops = {
+	.open		= ipv6_route_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
@@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
 
 	fib6_init();
 #ifdef	CONFIG_PROC_FS
-	p = proc_net_create(&init_net, "ipv6_route", 0, rt6_proc_info);
-	if (p)
+	p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
+	if (p) {
 		p->owner = THIS_MODULE;
+		p->proc_fops = &ipv6_route_proc_fops;
+	}
 
	proc_net_fops_create(&init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif



[PATCH 2/2] Remove /proc/net/ip_vs_lblcr

2007-10-30 Thread Alexey Dobriyan
It's under the CONFIG_IP_VS_LBLCR_DEBUG option, which never existed.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 I can convert it to seq_file if anyone is secretly using it.

 net/ipv4/ipvs/ip_vs_lblcr.c |   76 
 1 file changed, 76 deletions(-)

--- a/net/ipv4/ipvs/ip_vs_lblcr.c
+++ b/net/ipv4/ipvs/ip_vs_lblcr.c
@@ -48,8 +48,6 @@
 /* for sysctl */
 #include <linux/fs.h>
 #include <linux/sysctl.h>
-/* for proc_net_create/proc_net_remove */
-#include <linux/proc_fs.h>
 #include <net/net_namespace.h>
 
 #include <net/ip_vs.h>
@@ -547,71 +545,6 @@ static void ip_vs_lblcr_check_expire(unsigned long data)
 	mod_timer(&tbl->periodic_timer, jiffies+CHECK_EXPIRE_INTERVAL);
 }
 
-
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-static struct ip_vs_lblcr_table *lblcr_table_list;
-
-/*
- *	/proc/net/ip_vs_lblcr to display the mappings of
- *	destination IP address => its serverSet
- */
-static int
-ip_vs_lblcr_getinfo(char *buffer, char **start, off_t offset, int length)
-{
-	off_t pos=0, begin;
-	int len=0, size;
-	struct ip_vs_lblcr_table *tbl;
-	unsigned long now = jiffies;
-	int i;
-	struct ip_vs_lblcr_entry *en;
-
-	tbl = lblcr_table_list;
-
-	size = sprintf(buffer, "LastTime Dest IP address  Server set\n");
-	pos += size;
-	len += size;
-
-	for (i=0; i<IP_VS_LBLCR_TAB_SIZE; i++) {
-		read_lock_bh(&tbl->lock);
-		list_for_each_entry(en, &tbl->bucket[i], list) {
-			char tbuf[16];
-			struct ip_vs_dest_list *d;
-
-			sprintf(tbuf, "%u.%u.%u.%u", NIPQUAD(en->addr));
-			size = sprintf(buffer+len, "%8lu %-16s ",
-				       now-en->lastuse, tbuf);
-
-			read_lock(&en->set.lock);
-			for (d=en->set.list; d!=NULL; d=d->next) {
-				size += sprintf(buffer+len+size,
-						"%u.%u.%u.%u ",
-						NIPQUAD(d->dest->addr));
-			}
-			read_unlock(&en->set.lock);
-			size += sprintf(buffer+len+size, "\n");
-			len += size;
-			pos += size;
-			if (pos <= offset)
-				len=0;
-			if (pos >= offset+length) {
-				read_unlock_bh(&tbl->lock);
-				goto done;
-			}
-		}
-		read_unlock_bh(&tbl->lock);
-	}
-
-  done:
-	begin = len - (pos - offset);
-	*start = buffer + begin;
-	len -= begin;
-	if (len > length)
-		len = length;
-	return len;
-}
-#endif
-
-
 static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
 {
 	int i;
@@ -650,9 +583,6 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
 	tbl->periodic_timer.expires = jiffies+CHECK_EXPIRE_INTERVAL;
 	add_timer(&tbl->periodic_timer);
 
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	lblcr_table_list = tbl;
-#endif
 	return 0;
 }
 
@@ -843,18 +773,12 @@ static int __init ip_vs_lblcr_init(void)
 {
 	INIT_LIST_HEAD(&ip_vs_lblcr_scheduler.n_list);
 	sysctl_header = register_sysctl_table(lblcr_root_table);
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	proc_net_create(&init_net, "ip_vs_lblcr", 0, ip_vs_lblcr_getinfo);
-#endif
 	return register_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
 }
 
 
 static void __exit ip_vs_lblcr_cleanup(void)
 {
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	proc_net_remove(&init_net, "ip_vs_lblcr");
-#endif
 	unregister_sysctl_table(sysctl_header);
 	unregister_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
 }



Re: [PATCH] net: Saner thash_entries default with much memory

2007-10-30 Thread Jean Delvare
Hi David,

Le mardi 30 octobre 2007, David Miller a écrit :
 From: Andi Kleen [EMAIL PROTECTED]
 Date: Fri, 26 Oct 2007 17:34:17 +0200
 
  On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
  I propose 2 million entries as the arbitrary high limit. This
  
  It's probably still far too large.
 
 I agree.  Perhaps a better number is something on the order of
 (512 * 1024) so I think I'll check in a variant of Jean's patch
 with just the limit decreased like that.

That's very fine with me. I originally proposed an admittedly high
limit value to increase the chance to see it accepted. I am not
familiar enough with networking to know what a more reasonable
limit would be, so I'm leaving it to the experts.

 Using just some back of the envelope calculations, on UP 64-bit
 systems each socket uses about 2424 bytes minimum of memory (this is
 the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP).
 This is an underestimate because it does not even consider things like
 allocator overhead.
 
 Next, machines that service that many sockets typically have them
 mostly with full transmit queues talking to a very slow receiver at
 the other end.  So let's estimate that on average each socket consumes
 about 64K of retransmit queue data.
 
 I think this is an extremely conservative estimate because it doesn't
 even consider overhead coming from struct sk_buff and related state.
 
 So for (512 * 1024) of established sockets we consume roughly 35GB of
 memory, this is '((2424 + (64 * 1024)) * (512 * 1024))'.
 
 So to me (512 * 1024) is a very reasonable limit and (with lockdep
 and spinlock debugging disabled) this makes the EHASH table consume
 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.
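The arithmetic quoted above is easy to check mechanically. A small sketch, using the per-socket figures from the mail (the 2424-byte and 64 KB numbers are taken from the discussion, not measured here):

```c
#include <stdint.h>

/* Worst-case memory for n established sockets, per the quoted estimate:
 * ~2424 bytes of kernel objects (tcp_sock, inode, dentry, socket, file)
 * plus ~64 KB of retransmit queue data per socket. */
static uint64_t socket_memory_bytes(uint64_t sockets)
{
	return (2424ULL + 64ULL * 1024) * sockets;
}
```

For 512 * 1024 sockets this gives 35,630,612,480 bytes, i.e. the "roughly 35GB" in the mail.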

OK, let's go with (512 * 1024) then. Want me to send an updated patch?

Thanks,
-- 
Jean Delvare
Suse L3


Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface

2007-10-30 Thread Benjamin Thery
Alexey Dobriyan wrote:
 One proc_net_create() user less.

Funny, I was working on a similar patch.

See comment below.


 Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
 ---
 
  net/ipv6/route.c |   70 
 +++
  1 file changed, 25 insertions(+), 45 deletions(-)
 
 --- a/net/ipv6/route.c
 +++ b/net/ipv6/route.c
 @@ -2288,71 +2288,49 @@ struct rt6_proc_arg
  
  static int rt6_info_route(struct rt6_info *rt, void *p_arg)
  {
 - struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
 + struct seq_file *m = p_arg;
  
 - if (arg-skip  arg-offset / RT6_INFO_LEN) {
 - arg-skip++;
 - return 0;
 - }
 -
 - if (arg-len = arg-length)
 - return 0;
 -
 - arg-len += sprintf(arg-buffer + arg-len,
 - NIP6_SEQFMT  %02x ,
 - NIP6(rt-rt6i_dst.addr),
 + seq_printf(m, NIP6_SEQFMT  %02x , NIP6(rt-rt6i_dst.addr),
   rt-rt6i_dst.plen);
  
  #ifdef CONFIG_IPV6_SUBTREES
 - arg-len += sprintf(arg-buffer + arg-len,
 - NIP6_SEQFMT  %02x ,
 - NIP6(rt-rt6i_src.addr),
 + seq_printf(m, NIP6_SEQFMT  %02x , NIP6(rt-rt6i_src.addr),
   rt-rt6i_src.plen);
  #else
 - arg-len += sprintf(arg-buffer + arg-len,
 -  00 );
 + seq_puts(m,  00 );
  #endif
  
   if (rt-rt6i_nexthop) {
 - arg-len += sprintf(arg-buffer + arg-len,
 - NIP6_SEQFMT,
 + seq_printf(m, NIP6_SEQFMT,
   NIP6(*((struct in6_addr 
 *)rt-rt6i_nexthop-primary_key)));
   } else {
 - arg-len += sprintf(arg-buffer + arg-len,
 - );
 + seq_puts(m, );
   }
 - arg-len += sprintf(arg-buffer + arg-len,
 -  %08x %08x %08x %08x %8s\n,
 + seq_printf(m,  %08x %08x %08x %08x %8s\n,
   rt-rt6i_metric, atomic_read(rt-u.dst.__refcnt),
   rt-u.dst.__use, rt-rt6i_flags,
   rt-rt6i_dev ? rt-rt6i_dev-name : );
   return 0;
  }
  
 -static int rt6_proc_info(char *buffer, char **start, off_t offset, int 
 length)
 +static int ipv6_route_show(struct seq_file *m, void *v)
  {
 - struct rt6_proc_arg arg = {
 - .buffer = buffer,
 - .offset = offset,
 - .length = length,
 - };
 -
 - fib6_clean_all(rt6_info_route, 0, arg);
 -
 - *start = buffer;
 - if (offset)
 - *start += offset % RT6_INFO_LEN;
 -
 - arg.len -= offset % RT6_INFO_LEN;
 -
 - if (arg.len  length)
 - arg.len = length;
 - if (arg.len  0)
 - arg.len = 0;
 + fib6_clean_all(rt6_info_route, 0, m);
 + return 0;
 +}
  
 - return arg.len;
 +static int ipv6_route_open(struct inode *inode, struct file *file)
 +{
 + return single_open(file, ipv6_route_show, NULL);
  }
  
 +static const struct file_operations ipv6_route_proc_fops = {
 + .open   = ipv6_route_open,
 + .read   = seq_read,
 + .llseek = seq_lseek,
 + .release= single_release,
 +};
 +
  static int rt6_stats_seq_show(struct seq_file *seq, void *v)
  {
   seq_printf(seq, %04x %04x %04x %04x %04x %04x %04x\n,
 @@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
  
   fib6_init();
  #ifdef   CONFIG_PROC_FS
 - p = proc_net_create(&init_net, "ipv6_route", 0, rt6_proc_info);
 - if (p)
 
 + p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
 + if (p) {
 	p->owner = THIS_MODULE;
 + p->proc_fops = &ipv6_route_proc_fops;
 + }

You should use proc_net_fops_create() instead of the above code. 
It does the same thing.

Otherwise the patch looks fine to me.
Tested on i386.

Benjamin

   proc_net_fops_create(init_net, rt6_stats, S_IRUGO, 
 rt6_stats_seq_fops);
  #endif
 
 


-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

   http://www.bull.com


Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface

2007-10-30 Thread Benjamin Thery
Cosmetic comment:

I forgot to say there are a few indentation errors when
I apply your patch. See below.


Benjamin Thery wrote:
 Alexey Dobriyan wrote:
 One proc_net_create() user less.
 
 Funny, I was working on a similar patch.
 
 See comment below.
 
 
 Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
 ---

  net/ipv6/route.c |   70 
 +++
  1 file changed, 25 insertions(+), 45 deletions(-)

 --- a/net/ipv6/route.c
 +++ b/net/ipv6/route.c
 @@ -2288,71 +2288,49 @@ struct rt6_proc_arg
  
  static int rt6_info_route(struct rt6_info *rt, void *p_arg)
  {
 -struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
 +struct seq_file *m = p_arg;
  
 -if (arg-skip  arg-offset / RT6_INFO_LEN) {
 -arg-skip++;
 -return 0;
 -}
 -
 -if (arg-len = arg-length)
 -return 0;
 -
 -arg->len += sprintf(arg->buffer + arg->len,
 -NIP6_SEQFMT " %02x ",
 -NIP6(rt->rt6i_dst.addr),
 +seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_dst.addr),
  rt->rt6i_dst.plen);
 
 #ifdef CONFIG_IPV6_SUBTREES
 -arg->len += sprintf(arg->buffer + arg->len,
 -NIP6_SEQFMT " %02x ",
 -NIP6(rt->rt6i_src.addr),
 +seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_src.addr),
  rt->rt6i_src.plen);

Indent is wrong for the above line.

  #else
 -arg-len += sprintf(arg-buffer + arg-len,
 - 00 );
 +seq_puts(m,  00 );
  #endif
  
  if (rt->rt6i_nexthop) {
 -arg->len += sprintf(arg->buffer + arg->len,
 -NIP6_SEQFMT,
 +seq_printf(m, NIP6_SEQFMT,
  NIP6(*((struct in6_addr *)rt->rt6i_nexthop->primary_key)));

Idem.

  } else {
 -arg->len += sprintf(arg->buffer + arg->len,
 -"00000000000000000000000000000000");
 +seq_puts(m, "00000000000000000000000000000000");
  }
 -arg->len += sprintf(arg->buffer + arg->len,
 -" %08x %08x %08x %08x %8s\n",
 +seq_printf(m, " %08x %08x %08x %08x %8s\n",
  rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
  rt->u.dst.__use, rt->rt6i_flags,
  rt->rt6i_dev ? rt->rt6i_dev->name : "");

Indent of the 3 above lines.

  return 0;
  }
  
 -static int rt6_proc_info(char *buffer, char **start, off_t offset, int 
 length)
 +static int ipv6_route_show(struct seq_file *m, void *v)
  {
 -struct rt6_proc_arg arg = {
 -.buffer = buffer,
 -.offset = offset,
 -.length = length,
 -};
 -
 -fib6_clean_all(rt6_info_route, 0, arg);
 -
 -*start = buffer;
 -if (offset)
 -*start += offset % RT6_INFO_LEN;
 -
 -arg.len -= offset % RT6_INFO_LEN;
 -
 -if (arg.len  length)
 -arg.len = length;
 -if (arg.len  0)
 -arg.len = 0;
 +fib6_clean_all(rt6_info_route, 0, m);
 +return 0;
 +}
  
 -return arg.len;
 +static int ipv6_route_open(struct inode *inode, struct file *file)
 +{
 +return single_open(file, ipv6_route_show, NULL);
  }
  
 +static const struct file_operations ipv6_route_proc_fops = {
 +.open   = ipv6_route_open,
 +.read   = seq_read,
 +.llseek = seq_lseek,
 +.release= single_release,
 +};
 +
  static int rt6_stats_seq_show(struct seq_file *seq, void *v)
  {
  seq_printf(seq, %04x %04x %04x %04x %04x %04x %04x\n,
 @@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
  
  fib6_init();
  #ifdef  CONFIG_PROC_FS
 -p = proc_net_create(init_net, ipv6_route, 0, rt6_proc_info);
 -if (p)
 
 +p = create_proc_entry(ipv6_route, 0, init_net.proc_net);
 +if (p) {
  p-owner = THIS_MODULE;
 +p-proc_fops = ipv6_route_proc_fops;
 +}
 
 You should use proc_net_fops_create() instead of the above code. 
 It does the same thing.
 
 Otherwise the patch looks fine to me.
 Tested on i386.
 
 Benjamin
 
  proc_net_fops_create(init_net, rt6_stats, S_IRUGO, 
 rt6_stats_seq_fops);
  #endif


 
 


-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

   http://www.bull.com


Re: Configuring the same IP on multiple addresses

2007-10-30 Thread Vlad Yasevich
David Miller wrote:
 From: David Miller [EMAIL PROTECTED]
 Date: Mon, 29 Oct 2007 15:25:59 -0700 (PDT)
 
 Can you guys please just state upfront what virtualization
 issue is made more difficult by features you want to remove?
 
 Sorry, I mentioned virtualization because that's been the
 largest majority of the cases being presented lately.

Nope, not virtualization.

 
 I suspect in your case it's some multicast or SCTP thing :-)
 

Neither of these really either, although I should try to see how
SCTP behaves in this configuration.

As Brian said, a customer asked us a question, and we didn't know the
history.  No one is trying to remove functionality or features.
We'd just like to know the "why", and the answer of "why not" doesn't
fly very well.

Although in the IPv6 case, there might be issues.

-vlad



[PATCH] DM9601: Support for ADMtek ADM8515 NIC

2007-10-30 Thread Peter Korsgaard
Add device ID for the ADMtek ADM8515 USB NIC to the DM9601 driver.

Signed-off-by: Peter Korsgaard [EMAIL PROTECTED]

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index a2de32f..2c68573 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -586,6 +586,10 @@ static const struct usb_device_id products[] = {
 USB_DEVICE(0x0a46, 0x0268),/* ShanTou ST268 USB NIC */
 .driver_info = (unsigned long)dm9601_info,
 },
+   {
+USB_DEVICE(0x0a46, 0x8515),/* ADMtek ADM8515 USB NIC */
+.driver_info = (unsigned long)dm9601_info,
+},
{}, // END
 };
 
-- 
1.5.3.4

-- 
Bye, Peter Korsgaard


Re: Oops in 2.6.21-rc4, 2.6.23

2007-10-30 Thread Jarek Poplawski
On Mon, Oct 29, 2007 at 01:41:47AM -0700, David Miller wrote:
...
 Actually, this was caused by a real bug in the SKB_WITH_OVERHEAD macro
 definition, which Herbert Xu quickly spotted and fixed.
 
 Which I hope you've found this by yourself by now.
 

...Btw, of course you have to be right, and I should find this within
12 days at most, if I'm as smart as I hope. But as for now, I really
can't see any meaningful difference between this buggy SKB_WITH_OVERHEAD
version and the 'generic' 2.6.20 one.

There is also a tiny doubt about how all this could influence 2.6.21-rc4,
which seems to be 'generic' here as well. I guess it has to be some
git issue... the more so, as I can't see this other (bisected) patch
there either?! Then, of course, it could be my eyesight - but then
these 12 days are definitely not enough...

Cheers,
Jarek P.


Re: [PATCH] nf_nat_h323.c unneeded rcu_dereference() calls

2007-10-30 Thread Patrick McHardy

Paul E. McKenney wrote:

Hello!

While reviewing rcu_dereference() uses, I came across a number of cases
where I couldn't see how the rcu_dereference() helped.  One class of
cases is where the variable is never subsequently dereferenced, so that
patches like the following one would be appropriate.

So, what am I missing here?



Nothing, it was mainly intended as documentation that the hooks are
protected by RCU. I agree that it's probably more confusing this way
since we're not even in a rcu_read_lock protected section.

I've queued a patch to remove them all.


Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Herbert Xu
On Tue, Oct 30, 2007 at 06:50:58AM +0100, Jens Axboe wrote:

 How so? The reason you changed it to sg_init_table() + sg_set_buf() is
 exactly because sg_init_one() didn't properly init the entry (as the
 name promised).

For one of the cases yes but the other one repeatedly calls
sg_init_one on the same sg entry while we really only need
to initialise it once and call sg_set_buf afterwards.

Normally this is irrelevant but the loops in question are
trying to estimate the speed of the algorithms so it's good
to exclude as much noise from them as possible.
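The principle generalizes to any micro-benchmark: run invariant setup once, outside the timed region, so the loop measures only the operation of interest. A generic user-space sketch of the pattern (not the tcrypt code itself; the function names are illustrative):

```c
static unsigned init_calls;	/* counts the expensive one-time setup,
				   e.g. an sg_init_table() equivalent */
static unsigned op_calls;	/* counts the cheap per-iteration work,
				   e.g. sg_set_buf() plus the measured op */

static void expensive_init(void) { init_calls++; }
static void fast_op(void)        { op_calls++; }

/* Hoisted form: the invariant init runs once, before the measured loop,
 * so only fast_op() contributes to the timing. Returns total op calls. */
static unsigned run_benchmark(unsigned iters)
{
	unsigned i;

	expensive_init();	/* excluded from the timed region */
	for (i = 0; i < iters; i++)
		fast_op();	/* only the measured work stays in the loop */
	return op_calls;
}
```

After a run of 1000 iterations, the setup has executed exactly once while the measured operation ran 1000 times, which is the noise reduction Herbert describes.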

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Jens Axboe
On Tue, Oct 30 2007, Herbert Xu wrote:
 On Tue, Oct 30, 2007 at 06:50:58AM +0100, Jens Axboe wrote:
 
  How so? The reason you changed it to sg_init_table() + sg_set_buf() is
  exactly because sg_init_one() didn't properly init the entry (as the
  name promised).
 
 For one of the cases yes but the other one repeatedly calls
 sg_init_one on the same sg entry while we really only need
 to initialise it once and call sg_set_buf afterwards.
 
 Normally this is irrelevant but the loops in question are
 trying to estimate the speed of the algorithms so it's good
 to exclude as much noise from them as possible.

Ah OK, I was referring to the replacement mentioned above.

-- 
Jens Axboe



Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface

2007-10-30 Thread Stephen Hemminger
On Tue, 30 Oct 2007 16:11:47 +0300
Alexey Dobriyan [EMAIL PROTECTED] wrote:
  
 +static const struct file_operations ipv6_route_proc_fops = {
 + .open   = ipv6_route_open,
 + .read   = seq_read,
 + .llseek = seq_lseek,
 + .release= single_release,
 +};
 +

This needs
.owner  = THIS_MODULE,



  static int rt6_stats_seq_show(struct seq_file *seq, void *v)
  {
 	seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
 @@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
  
 	fib6_init();
  #ifdef	CONFIG_PROC_FS
 -	p = proc_net_create(&init_net, "ipv6_route", 0, rt6_proc_info);
 -	if (p)
 +	p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
 +	if (p) {
 		p->owner = THIS_MODULE;
 +		p->proc_fops = &ipv6_route_proc_fops;
 +	}
  
 	proc_net_fops_create(&init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
  #endif
 

Use proc_net_fops_create():

	proc_net_fops_create(&init_net, "ipv6_route", S_IRUGO, &ipv6_route_proc_fops);

You can get rid of the #ifdef since the proc_net_fops_create() stub does the
correct thing if PROC_FS is not configured.


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: dn_route.c momentarily exiting RCU read-side critical section

2007-10-30 Thread Paul E. McKenney
On Tue, Oct 30, 2007 at 01:10:36AM -0700, David Miller wrote:
 From: Paul E. McKenney [EMAIL PROTECTED]
 Date: Mon, 29 Oct 2007 14:15:40 -0700
 
  net/decnet/dn_route.c in dn_rt_cache_get_next() is as follows:
  
  static struct dn_route *dn_rt_cache_get_next(struct seq_file *seq, struct dn_route *rt)
  {
  	struct dn_rt_cache_iter_state *s = rcu_dereference(seq->private);
  
  	rt = rt->u.dst.dn_next;
  	while (!rt) {
  		rcu_read_unlock_bh();
  		if (--s->bucket < 0)
  			break;
  
  ...  But what happens if seq->private is freed up right here?
  ...  Or what prevents this from happening?
  ...
  Similar code is in rt_cache_get_next().
  
  So, what am I missing here?
 
 seq->private is allocated on file open (here via seq_open_private()),
 and freed up on file close (via seq_release_private).
 
 So it cannot be freed up in the middle of an iteration.

Thank you for the info!!!

OK, for my next stupid question: why is the rcu_dereference(seq->private)
required, as opposed to simply seq->private?

Thanx, Paul


Re: [PATCH] nf_nat_h323.c unneeded rcu_dereference() calls

2007-10-30 Thread Paul E. McKenney
On Tue, Oct 30, 2007 at 03:06:20PM +0100, Patrick McHardy wrote:
 Paul E. McKenney wrote:
 Hello!
 
 While reviewing rcu_dereference() uses, I came across a number of cases
 where I couldn't see how the rcu_dereference() helped.  One class of
 cases is where the variable is never subsequently dereferenced, so that
 patches like the following one would be appropriate.
 
 So, what am I missing here?
 
 Nothing, it was mainly intended as documentation that the hooks are
 protected by RCU. I agree that it's probably more confusing this way
 since we're not even in a rcu_read_lock protected section.
 
 I've queued a patch to remove them all.

Thank you!!!

Thanx, Paul


Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface

2007-10-30 Thread Patrick McHardy

Stephen Hemminger wrote:

On Tue, 30 Oct 2007 16:11:47 +0300
Alexey Dobriyan [EMAIL PROTECTED] wrote:
 
+static const struct file_operations ipv6_route_proc_fops = {

+   .open   = ipv6_route_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+


This needs
.owner  = THIS_MODULE,



Your ip_queue conversion patch was also missing this, I've
fixed it up.



[PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS.

2007-10-30 Thread Peter Zijlstra
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/pagelist.c |2 +-
 fs/nfs/write.c|6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page *p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);

--



[PATCH 05/33] mm: kmem_estimate_pages()

2007-10-30 Thread Peter Zijlstra
Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/slab.h |3 +
 mm/slub.c|   82 +++
 2 files changed, 85 insertions(+)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -94,6 +95,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2293,6 +2293,37 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate @count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long slabs;
+
+	if (WARN_ON(!s) || WARN_ON(!s->objects))
+		return 0;
+
+	slabs = DIV_ROUND_UP(objects, s->objects);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object.
+	 */
+	if (s->objects > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		slabs += num_online_cpus();
+	}
+
+	return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
  * Attempt to free all slabs on a node. Return the number of slabs we
  * were unable to free.
  */
@@ -2630,6 +2661,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+   struct kmem_cache *s = get_slab(size, flags);
+   if (!s)
+   return 0;
+
+   return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocation of heterogeneous size.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+   int i;
+   unsigned long pages;
+
+   /*
+* multiply by two, in order to account the worst case slack space
+* due to the power-of-two allocation sizes.
+*/
+   pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+   /*
+* add the kmem_cache overhead of each possible kmalloc cache
+*/
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_estimate_pages(s, flags, 0);
+   }
+
+   return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up

--



[PATCH 24/33] mm: prepare swap entry methods for use in page methods

2007-10-30 Thread Peter Zijlstra
Move around the swap entry methods in preparation for use from
page methods.

Also provide a function to obtain the swap_info_struct backing
a swap cache page.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm.h  |8 
 include/linux/swap.h|   48 
 include/linux/swapops.h |   44 
 mm/swapfile.c   |1 +
 4 files changed, 57 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -573,6 +574,13 @@ static inline struct address_space *page
return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+   swp_entry_t swap = { .val = page_private(page) };
+   BUG_ON(!PageSwapCache(page));
+   return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -80,6 +80,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+   swp_entry_t ret;
+
+   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+   (offset & SWP_OFFSET_MASK(ret));
+   return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+   return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+   return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
 * current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+   return NULL;
+}
 #define can_share_swap_page(p) (page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6/include/linux/swapops.h
===
--- linux-2.6.orig/include/linux/swapops.h
+++ linux-2.6/include/linux/swapops.h
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-   swp_entry_t ret;
-
-   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-   (offset & SWP_OFFSET_MASK(ret));
-   return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-   return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-   return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
  */
Index: linux-2.6/mm/swapfile.c

[PATCH 16/33] netvm: network reserve infrastructure

2007-10-30 Thread Peter Zijlstra
Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed bounded by it being the upper bound of
memory that can be used for sending pages (not quite true, but good enough)

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side, exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |   35 +++-
 net/Kconfig|3 +
 net/core/sock.c|  113 +
 3 files changed, 150 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -50,6 +50,7 @@
 #include <linux/skbuff.h>  /* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 
@@ -397,6 +398,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -419,9 +421,40 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+   return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to bounce it. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+   return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-   return gfp_mask;
+   return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/reserve.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+struct mem_reserve net_skb_reserve;
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+   return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+   return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+   return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+   mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_MEMALLOC sockets
+ * @tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ *    @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *    we need not account the pages like we do for RX pages.

[PATCH 08/33] mm: emergency pool

2007-10-30 Thread Peter Zijlstra
Provide means to reserve a specific amount of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mmzone.h |3 +
 mm/page_alloc.c|   82 +++--
 mm/vmstat.c|6 +--
 3 files changed, 78 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -213,7 +213,7 @@ enum zone_type {
 
 struct zone {
/* Fields commonly accessed by the page allocator */
-   unsigned long   pages_min, pages_low, pages_high;
+   unsigned long   pages_emerg, pages_min, pages_low, pages_high;
/*
 * We don't know if the memory that we're going to allocate will be freeable
 * or/and it will be released eventually, so to avoid totally wasting several
@@ -682,6 +682,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1252,7 +1254,7 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags  ALLOC_HARDER)
min -= min / 4;
 
-   if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+   if (free_pages <= min + z->lowmem_reserve[classzone_idx] + z->pages_emerg)
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -1733,8 +1735,8 @@ nofail_alloc:
 nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
-" order:%d, mode:0x%x\n",
-   p->comm, order, gfp_mask);
+" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+   p->comm, order, gfp_mask, alloc_flags, p->flags);
dump_stack();
show_mem();
}
@@ -1952,9 +1954,9 @@ void show_free_areas(void)
"\n",
zone-name,
K(zone_page_state(zone, NR_FREE_PAGES)),
-   K(zone->pages_min),
-   K(zone->pages_low),
-   K(zone->pages_high),
+   K(zone->pages_emerg + zone->pages_min),
+   K(zone->pages_emerg + zone->pages_low),
+   K(zone->pages_emerg + zone->pages_high),
K(zone_page_state(zone, NR_ACTIVE)),
K(zone_page_state(zone, NR_INACTIVE)),
K(zone-present_pages),
@@ -4113,7 +4115,7 @@ static void calculate_totalreserve_pages
}
 
/* we treat pages_high as reserved pages. */
-   max += zone->pages_high;
+   max += zone->pages_high + zone->pages_emerg;
 
if (max  zone-present_pages)
max = zone-present_pages;
@@ -4170,7 +4172,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-   unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+   unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+   unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -4182,11 +4185,13 @@ static void __setup_per_zone_pages_min(v
}
 
for_each_zone(zone) {
-   u64 tmp;
+   u64 tmp, tmp_emerg;
 
spin_lock_irqsave(&zone->lru_lock, flags);
tmp = (u64)pages_min * zone->present_pages;
do_div(tmp, lowmem_pages);
+   tmp_emerg = (u64)pages_emerg * zone->present_pages;
+   do_div(tmp_emerg, lowmem_pages);
if (is_highmem(zone)) {
/*
 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ 

[PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages

2007-10-30 Thread Peter Zijlstra
In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. Like page->index is for mapped pages, this function also gives the
correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm.h  |   26 ++
 include/linux/pagemap.h |2 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -13,6 +13,7 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/swap.h>
+#include <linux/fs.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -581,6 +582,16 @@ static inline struct swap_info_struct *p
return get_swap_info_struct(swp_type(swap));
 }
 
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page)))
+   return page_swap_info(page)->swap_file->f_mapping;
+#endif
+   return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -598,6 +609,21 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page))) {
+   swp_entry_t swap = { .val = page_private(page) };
+   return swp_offset(swap);
+   }
+#endif
+   return page->index;
+}
+
+/*
 * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-   return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+   return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,

--



[PATCH 30/33] nfs: swap vs nfs_writepage

2007-10-30 Thread Peter Zijlstra
For now just use the ->writepage() path for swap traffic. Trond would like
to see ->swap_page() or some such additional a_op.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c |   23 +++
 1 file changed, 23 insertions(+)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -336,6 +336,29 @@ static int nfs_do_writepage(struct page 
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);
 
+   if (unlikely(IS_SWAPFILE(inode))) {
+   struct rpc_cred *cred;
+   struct nfs_open_context *ctx;
+   int status;
+
+   cred = rpcauth_lookupcred(NFS_CLIENT(inode)->cl_auth, 0);
+   if (IS_ERR(cred))
+   return PTR_ERR(cred);
+
+   ctx = nfs_find_open_context(inode, cred, FMODE_WRITE);
+   if (!ctx)
+   return -EBADF;
+
+   status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+
+   put_nfs_open_context(ctx);
+
+   if (status < 0) {
+   nfs_set_pageerror(page);
+   return status;
+   }
+   }
+
nfs_pageio_cond_complete(pgio, page->index);
return nfs_page_async_flush(pgio, page);
 }

--



[PATCH 02/33] mm: tag reserve pages

2007-10-30 Thread Peter Zijlstra
Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm_types.h |1 +
 mm/page_alloc.c  |4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -70,6 +70,7 @@ struct page {
union {
pgoff_t index;  /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+   int reserve;/* page_alloc: page is a reserve page */
};
struct list_head lru;   /* Pageout list, eg. active_list
 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1448,8 +1448,10 @@ zonelist_scan:
}
 
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-   if (page)
+   if (page) {
+   page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
break;
+   }
 this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);

--



[PATCH 17/33] sysctl: propagate conv errors

2007-10-30 Thread Peter Zijlstra
Currently the conv routines can only generate -EINVAL; allow other
errors to be propagated.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 kernel/sysctl.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/sysctl.c
===
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1732,6 +1732,7 @@ static int __do_proc_dointvec(void *tbl_
int *i, vleft, first=1, neg, val;
unsigned long lval;
size_t left, len;
+   int ret = 0;

char buf[TMPBUFLEN], *p;
char __user *s = buffer;
@@ -1787,14 +1788,16 @@ static int __do_proc_dointvec(void *tbl_
s += len;
left -= len;
 
-   if (conv(&neg, &lval, i, 1, data))
+   ret = conv(&neg, &lval, i, 1, data);
+   if (ret)
break;
} else {
p = buf;
if (!first)
*p++ = '\t';

-   if (conv(&neg, &lval, i, 0, data))
+   ret = conv(&neg, &lval, i, 0, data);
+   if (ret)
break;
 
sprintf(p, "%s%lu", neg ? "-" : "", lval);
@@ -1823,11 +1826,9 @@ static int __do_proc_dointvec(void *tbl_
left--;
}
}
-   if (write && first)
-   return -EINVAL;
*lenp -= left;
*ppos += *lenp;
-   return 0;
+   return ret;
 #undef TMPBUFLEN
 }
 

--



[PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages

2007-10-30 Thread Peter Zijlstra
Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/file.c |8 
 fs/nfs/internal.h |7 ---
 fs/nfs/pagelist.c |6 +++---
 fs/nfs/read.c |6 +++---
 fs/nfs/write.c|   49 +
 5 files changed, 39 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -357,7 +357,7 @@ static void nfs_invalidate_page(struct p
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
-   nfs_wb_page_cancel(page->mapping->host, page);
+   nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -368,7 +368,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-   return nfs_wb_page(page->mapping->host, page);
+   return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -397,13 +397,13 @@ static int nfs_vm_page_mkwrite(struct vm
loff_t offset;
 
lock_page(page);
-   mapping = page->mapping;
+   mapping = page_file_mapping(page);
if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping) {
unlock_page(page);
return -EINVAL;
}
pagelen = nfs_page_length(page);
-   offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
+   offset = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT;
unlock_page(page);
 
/*
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -77,11 +77,11 @@ nfs_create_request(struct nfs_open_conte
 * update_nfs_request below if the region is not locked. */
req->wb_page= page;
atomic_set(&req->wb_complete, 0);
-   req->wb_index   = page->index;
+   req->wb_index   = page_file_index(page);
page_cache_get(page);
BUG_ON(PagePrivate(page));
BUG_ON(!PageLocked(page));
-   BUG_ON(page->mapping->host != inode);
+   BUG_ON(page_file_mapping(page)->host != inode);
req->wb_offset  = offset;
req->wb_pgbase  = offset;
req->wb_bytes   = count;
@@ -383,7 +383,7 @@ void nfs_pageio_cond_complete(struct nfs
  * nfs_scan_list - Scan a list for matching requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  * @tag: tag to scan for
  *
Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -460,11 +460,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
struct nfs_open_context *ctx;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
int error;
 
dprintk(NFS: nfs_readpage (%p [EMAIL PROTECTED])\n,
-   page, PAGE_CACHE_SIZE, page->index);
+   page, PAGE_CACHE_SIZE, page_file_index(page));
nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -511,7 +511,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *new;
unsigned int len;
int error;
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
 
spin_lock(&inode->i_lock);
@@ -138,13 +138,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
loff_t end, i_size = i_size_read(inode);
pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-   if (i_size > 0 && page->index < end_index)
+   if (i_size > 0 && 

[PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs

2007-10-30 Thread Peter Zijlstra
Avoid memory getting stuck waiting for userspace, drop all emergency packets.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 net/netfilter/core.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -181,9 +181,12 @@ next_hook:
ret = 1;
goto unlock;
} else if (verdict == NF_DROP) {
+drop:
kfree_skb(*pskb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+   if (skb_emergency(*pskb))
+   goto drop;
NFDEBUG("nf_hook: Verdict = QUEUE.\n");
if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
  verdict >> NF_VERDICT_BITS))

--



[PATCH 19/33] netvm: hook skb allocation to reserves

2007-10-30 Thread Peter Zijlstra
Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref.

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm_types.h |1 
 include/linux/skbuff.h   |   25 +-
 net/core/skbuff.c|  173 +--
 3 files changed, 173 insertions(+), 26 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -289,7 +289,8 @@ struct sk_buff {
__u8pkt_type:3,
fclone:2,
ipvs_property:1,
-   nf_trace:1;
+   nf_trace:1,
+   emergency:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
@@ -341,10 +342,22 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+   return unlikely(skb->emergency);
+#else
+   return false;
+#endif
+}
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
@@ -354,7 +367,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern void   kfree_skbmem(struct sk_buff *skb);
@@ -1297,7 +1310,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
  gfp_t gfp_mask)
 {
-   struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+   struct sk_buff *skb =
+   __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1343,6 +1357,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1359,7 +1374,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-   __free_page(page);
+   __netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int flags, int node)
 {
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+   int emergency = 0, memalloc = sk_memalloc_socks();
 
-   cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+   size = SKB_DATA_ALIGN(size);
+   cache = (flags & SKB_ALLOC_FCLONE)
+   ? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+   if (memalloc && (flags & SKB_ALLOC_RX))
+   gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
+#endif
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
-   goto out;
+   goto noskb;
 
-   size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
@@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int
 * See 

[PATCH 13/33] net: wrap sk-sk_backlog_rcv()

2007-10-30 Thread Peter Zijlstra
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h   |5 +
 net/core/sock.c  |4 ++--
 net/ipv4/tcp.c   |2 +-
 net/ipv4/tcp_timer.c |2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -485,6 +485,11 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+   return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)  \
({  int __rc;   \
release_sock(__sk); \
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -320,7 +320,7 @@ int sk_receive_skb(struct sock *sk, stru
 */
mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-   rc = sk->sk_backlog_rcv(sk, skb);
+   rc = sk_backlog_rcv(sk, skb);
 
mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
} else
@@ -1312,7 +1312,7 @@ static void __release_sock(struct sock *
struct sk_buff *next = skb->next;
 
skb->next = NULL;
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
/*
 * We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1134,7 +1134,7 @@ static void tcp_prequeue_process(struct 
 * necessary */
local_bh_disable();
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
local_bh_enable();
 
/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -196,7 +196,7 @@ static void tcp_delack_timer(unsigned lo
NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
tp-ucopy.memory = 0;
}

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/33] mm: __GFP_MEMALLOC

2007-10-30 Thread Peter Zijlstra
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.
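
A userspace model of the intended effect (flag values copied from the patch; the surrounding logic is a simplification of the gfp_to_alloc_flags() hunk below):

```c
#include <assert.h>

#define __GFP_MEMALLOC		0x2000u		/* new: use emergency reserves */
#define __GFP_NOMEMALLOC	0x10000u	/* don't use emergency reserves */
#define PF_MEMALLOC		0x00000800	/* the per-task equivalent */

/* Simplified: does this allocation get to ignore the watermarks? */
static int ignores_watermarks(unsigned gfp_mask, unsigned task_flags)
{
	if (gfp_mask & __GFP_NOMEMALLOC)
		return 0;			/* explicit opt-out wins */
	if (gfp_mask & __GFP_MEMALLOC)
		return 1;			/* new per-allocation grant */
	return !!(task_flags & PF_MEMALLOC);	/* old per-task grant */
}
```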

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/gfp.h |3 ++-
 mm/page_alloc.c |4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-   __GFP_NORETRY|__GFP_NOMEMALLOC)
+   __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1560,7 +1560,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
			unlikely(test_thread_flag(TIF_MEMDIE)))

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/33] mm: gfp_to_alloc_flags()

2007-10-30 Thread Peter Zijlstra
Factor out the gfp to alloc_flags mapping so it can be used in other places.
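
The mapping being factored out can be modelled in userspace like this (constants copied from the patch; the rt_task()/in_interrupt() test is reduced to one boolean):

```c
#include <assert.h>

#define __GFP_WAIT	0x10u
#define __GFP_HIGH	0x20u
#define ALLOC_HARDER	0x01
#define ALLOC_HIGH	0x02
#define ALLOC_WMARK_MIN	0x04
#define ALLOC_CPUSET	0x40

/* Model of gfp_to_alloc_flags(); rt_task stands in for the
 * "realtime task and not in interrupt" condition. */
static int gfp_to_alloc_flags_model(unsigned gfp_mask, int rt_task)
{
	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

	if (gfp_mask & __GFP_HIGH)
		alloc_flags |= ALLOC_HIGH;

	if (!(gfp_mask & __GFP_WAIT)) {
		/* GFP_ATOMIC: try harder, and ignore cpuset rather than fail */
		alloc_flags |= ALLOC_HARDER;
		alloc_flags &= ~ALLOC_CPUSET;
	} else if (rt_task)
		alloc_flags |= ALLOC_HARDER;

	return alloc_flags;
}
```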

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/internal.h   |   11 ++
 mm/page_alloc.c |   98 
 2 files changed, 67 insertions(+), 42 deletions(-)

Index: linux-2.6/mm/internal.h
===
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -47,4 +47,15 @@ static inline unsigned long page_order(s
VM_BUG_ON(!PageBuddy(page));
return page_private(page);
 }
+
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 #endif
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1139,14 +1139,6 @@ failed:
return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1535,6 +1527,44 @@ static void set_page_owner(struct page *
 #endif /* CONFIG_PAGE_OWNER */
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+   struct task_struct *p = current;
+   int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+   /*
+* The caller may dip into page reserves a bit more if the caller
+* cannot run direct reclaim, or if the caller has realtime scheduling
+* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+*/
+	if (gfp_mask & __GFP_HIGH)
+   alloc_flags |= ALLOC_HIGH;
+
+   if (!wait) {
+   alloc_flags |= ALLOC_HARDER;
+   /*
+* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+*/
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+   alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   }
+
+   return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
@@ -1589,48 +1619,28 @@ restart:
 * OK, we're below the kswapd watermark and have kicked background
 * reclaim. Now things get more complex, so set up alloc_flags according
 * to how we want to proceed.
-*
-* The caller may dip into page reserves a bit more if the caller
-* cannot run direct reclaim, or if the caller has realtime scheduling
-* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 */
-   alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-   alloc_flags |= ALLOC_HIGH;
-   if (wait)
-   alloc_flags |= ALLOC_CPUSET;
+   alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-   /*
-* Go through the zonelist again. Let __GFP_HIGH and allocations
-* coming from realtime tasks go deeper into reserves.
-*
-* This is the last chance, in general, before the goto nopage.
-* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-*/
-   page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+   /* This is the last chance, in general, before the goto nopage. */
+   page = get_page_from_freelist(gfp_mask, order, zonelist,
+   

[PATCH 20/33] netvm: filter emergency skbs.

2007-10-30 Thread Peter Zijlstra
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.
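
In miniature (a userspace model; skb_emergency() and sk_has_memalloc() are reduced to flags), the check added to sk_filter() is just:

```c
#include <assert.h>
#include <errno.h>

struct sock_m { int sock_memalloc; };	/* stand-in for SOCK_MEMALLOC */
struct skb_m  { int emergency; };	/* stand-in for skb_emergency() */

/* Emergency skbs were allocated from the reserves; delivering them to an
 * ordinary socket would park reserve memory behind userspace.  Drop them
 * early instead -- the network is assumed lossy anyway. */
static int sk_filter_emergency(const struct sock_m *sk, const struct skb_m *skb)
{
	if (skb->emergency && !sk->sock_memalloc)
		return -ENOMEM;
	return 0;	/* continue with the normal filter path */
}
```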

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -930,6 +930,9 @@ static inline int sk_filter(struct sock 
 {
int err;
struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+   return -ENOMEM;

err = security_sock_rcv_skb(sk, skb);
if (err)

--



[PATCH 23/33] netvm: skb processing

2007-10-30 Thread Peter Zijlstra
In order to make sure emergency packets receive all memory needed to proceed,
ensure processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.
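
The save/restore discipline around PF_MEMALLOC can be modelled in userspace (current->flags becomes a plain variable); the point is that restoring the *saved* state keeps nested callers safe:

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800

static unsigned long task_flags;	/* stand-in for current->flags */

/* Restore only the masked bits to their saved state, as
 * tsk_restore_flags() does in this patch set. */
static void tsk_restore_flags_m(unsigned long saved, unsigned long mask)
{
	task_flags = (task_flags & ~mask) | (saved & mask);
}

/* Model of netif_receive_skb(): returns whether the receive path ran
 * with PF_MEMALLOC set. */
static int receive_model(int emergency)
{
	unsigned long pflags = task_flags;
	int ran_with_memalloc;

	if (emergency)
		task_flags |= PF_MEMALLOC;

	/* ... the receive path would run here ... */
	ran_with_memalloc = !!(task_flags & PF_MEMALLOC);

	tsk_restore_flags_m(pflags, PF_MEMALLOC);
	return ran_with_memalloc;
}
```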

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |5 +
 net/core/dev.c |   44 ++--
 net/core/sock.c|   18 ++
 3 files changed, 61 insertions(+), 6 deletions(-)

Index: linux-2.6/net/core/dev.c
===
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+	unsigned long pflags = current->flags;
+
+   /* Emergency skb are special, they should
+*  - be delivered to SOCK_MEMALLOC sockets only
+*  - stay away from userspace
+*  - have bounded memory usage
+*
+* Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+* This saves us from propagating the allocation context down to all
+* allocation sites.
+*/
+   if (skb_emergency(skb))
+		current->flags |= PF_MEMALLOC;
 
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
-   return NET_RX_DROP;
+   goto out;
 
	if (!skb->tstamp.tv64)
net_timestamp(skb);
@@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
 
if (!orig_dev)
-   return NET_RX_DROP;
+   goto out;
 
__get_cpu_var(netdev_rx_stat).total++;
 
@@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk
}
 #endif
 
+   if (skb_emergency(skb))
+   goto skip_taps;
+
list_for_each_entry_rcu(ptype, ptype_all, list) {
		if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk
}
}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
 
if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
kfree_skb(skb);
-   goto out;
+   goto unlock;
}
 
	skb->tc_verd = 0;
 ncls:
 #endif
 
+   if (skb_emergency(skb))
+		switch (skb->protocol) {
+   case __constant_htons(ETH_P_ARP):
+   case __constant_htons(ETH_P_IP):
+   case __constant_htons(ETH_P_IPV6):
+   case __constant_htons(ETH_P_8021Q):
+   break;
+
+   default:
+   goto drop;
+   }
+
skb = handle_bridge(skb, pt_prev, ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
skb = handle_macvlan(skb, pt_prev, ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 
	type = skb->protocol;
	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
@@ -2056,6 +2085,7 @@ ncls:
if (pt_prev) {
		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
 * me how you were going to use this. :-)
@@ -2063,8 +2093,10 @@ ncls:
ret = NET_RX_DROP;
}
 
-out:
+unlock:
rcu_read_unlock();
+out:
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
 }
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct
	skb->next = NULL;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+   if (skb_emergency(skb))
+   return __sk_backlog_rcv(sk, skb);
+
	return sk->sk_backlog_rcv(sk, skb);
 }
 
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 
+#ifdef CONFIG_NETVM
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+   int ret;
+	unsigned long pflags = current->flags;
+
+   /* these should have been dropped before queueing */
+   BUG_ON(!sk_has_memalloc(sk));
+
+   

[PATCH 25/33] mm: add support for non block device backed swap files

2007-10-30 Thread Peter Zijlstra
A new address_space_operations method is added:
  int swapfile(struct address_space *, int);

When, during sys_swapon(), this method is found and returns no error, the
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like
reserving memory for mempools or the like).
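
The dispatch this adds to page_io can be sketched with plain function pointers (the _m-suffixed names are illustrative stand-ins, not kernel API):

```c
#include <assert.h>

#define SWP_FILE_M 0x4	/* mirrors the new SWP_FILE flag */

struct page_m { int id; };
struct aops_m { int (*writepage)(struct page_m *page); };
struct swap_info_m {
	unsigned flags;
	const struct aops_m *a_ops;	/* swap_file->f_mapping->a_ops */
};

static int fs_writes, bio_writes;
static int fs_writepage(struct page_m *p)  { (void)p; fs_writes++;  return 0; }
static int bio_writepage(struct page_m *p) { (void)p; bio_writes++; return 0; }

/* Model of swap_writepage(): SWP_FILE areas proxy to the filesystem's
 * writepage; block-backed areas keep using the classic bio path. */
static int swap_writepage_m(struct swap_info_m *sis, struct page_m *page)
{
	if (sis->flags & SWP_FILE_M)
		return sis->a_ops->writepage(page);
	return bio_writepage(page);
}
```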

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 Documentation/filesystems/Locking |9 +
 include/linux/buffer_head.h   |2 -
 include/linux/fs.h|1 
 include/linux/swap.h  |3 +
 mm/Kconfig|3 +
 mm/page_io.c  |   58 ++
 mm/swap_state.c   |5 +++
 mm/swapfile.c |   22 +-
 8 files changed, 101 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -164,6 +164,7 @@ enum {
	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
	/* add others here before... */
	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -264,6 +265,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
+#ifdef CONFIG_SWAP_FILE
+   {
+   struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->writepage(page, wbc);
+   if (!ret)
+   count_vm_event(PSWPOUT);
+   return ret;
+   }
+   }
+#endif
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,6 +133,39 @@ out:
return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		int (*spd)(struct page *) = a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+   if (!spd)
+   spd = __set_page_dirty_buffers;
+#endif
+   return (*spd)(page);
+   }
+
+   return __set_page_dirty_nobuffers(page);
+}
+#endif
+
 int swap_readpage(struct file *file, struct page *page)
 {
struct bio *bio;
@@ -127,6 +173,18 @@ int swap_readpage(struct file *file, str
 
BUG_ON(!PageLocked(page));
ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+   {
+   struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->readpage(sis->swap_file, page);
+   if (!ret)
+   count_vm_event(PSWPIN);
+   return ret;
+   }
+   }
+#endif
bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
end_swap_bio_read);
if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===
--- linux-2.6.orig/mm/swap_state.c

[PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-30 Thread Peter Zijlstra
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it.

Care is taken to only touch the SLUB slow path.

This is done to ensure reserve pages don't leak out and get consumed.
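
The gate in the SLUB slow path reduces to one predicate (a userspace model; gfp_to_alloc_flags() is reduced to a precomputed flags word):

```c
#include <assert.h>

#define ALLOC_NO_WATERMARKS 0x20

/* May this allocation context consume objects from a reserve slab
 * (one that was allocated below the watermarks)?  If not, the slow
 * path must branch to alloc_slab and re-test the real watermarks. */
static int may_use_reserve_slab(int slab_is_reserve, int alloc_flags)
{
	if (!slab_is_reserve)
		return 1;	/* ordinary slab: anyone may use it */
	return !!(alloc_flags & ALLOC_NO_WATERMARKS);
}
```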

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/slub_def.h |1 +
 mm/slub.c|   31 +++
 2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -20,11 +20,12 @@
 #include linux/mempolicy.h
 #include linux/ctype.h
 #include linux/kallsyms.h
+#include internal.h
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1074,7 +1075,7 @@ static void setup_object(struct kmem_cac
	s->ctor(s, object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
struct page *page;
struct kmem_cache_node *n;
@@ -1090,6 +1091,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
 
+	*reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(n-nr_slabs);
@@ -1468,10 +1470,22 @@ static void *__slab_alloc(struct kmem_ca
 {
void **object;
struct page *new;
+   int reserve = 0;
 
	if (!c->page)
goto new_slab;
 
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves
+		 * we must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto alloc_slab;
+		reserve = 1;
+	}
+
	slab_lock(c->page);
	if (unlikely(!node_match(c, node)))
		goto another_slab;
@@ -1479,10 +1493,9 @@ load_freelist:
	object = c->page->freelist;
	if (unlikely(!object))
		goto another_slab;
-	if (unlikely(SlabDebug(c->page)))
+	if (unlikely(SlabDebug(c->page) || reserve))
		goto debug;

-	object = c->page->freelist;
	c->freelist = object[c->offset];
	c->page->inuse = s->objects;
	c->page->freelist = NULL;
@@ -1500,16 +1513,18 @@ new_slab:
goto load_freelist;
}
 
+alloc_slab:
if (gfpflags  __GFP_WAIT)
local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	new = new_slab(s, gfpflags, node, &reserve);
 
if (gfpflags  __GFP_WAIT)
local_irq_disable();
 
if (new) {
c = get_cpu_slab(s, smp_processor_id());
+		c->reserve = reserve;
		if (c->page) {
/*
 * Someone else populated the cpu_slab while we
@@ -1537,8 +1552,7 @@ new_slab:
}
return NULL;
 debug:
-	object = c->page->freelist;
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (SlabDebug(c->page) && !alloc_debug_processing(s, c->page, object, addr))
		goto another_slab;
 
	c->page->inuse++;
@@ -2010,10 +2024,11 @@ static struct kmem_cache_node *early_kme
 {
struct page *page;
struct kmem_cache_node *n;
+   int reserve;
 
	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
BUG_ON(!page);
if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -17,6 +17,7 @@ struct kmem_cache_cpu {
int node;
unsigned int offset;
unsigned int objsize;
+   int reserve;
 };
 
 struct kmem_cache_node {

--



[PATCH 15/33] net: sk_allocation() - concentrate socket related allocations

2007-10-30 Thread Peter Zijlstra
Introduce sk_allocation(); this function allows injecting socket-specific
flags into each socket-related allocation.
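
In this patch the hook is the identity; its value is being the single injection point. A userspace model, including the SOCK_MEMALLOC behaviour a later patch can layer on top (that branch is an assumption here, not part of this patch):

```c
#include <assert.h>

typedef unsigned gfp_t_m;

#define GFP_ATOMIC_M		0x20u
#define __GFP_MEMALLOC_M	0x2000u

struct sock_m { int sock_memalloc; };

/* As introduced here, sk_allocation() just returns gfp_mask; the
 * SOCK_MEMALLOC branch models what the hook exists to enable. */
static gfp_t_m sk_allocation_m(const struct sock_m *sk, gfp_t_m gfp_mask)
{
	if (sk->sock_memalloc)
		gfp_mask |= __GFP_MEMALLOC_M;
	return gfp_mask;
}
```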

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h|7 ++-
 net/ipv4/tcp_output.c |   11 ++-
 net/ipv6/tcp_ipv6.c   |   14 +-
 3 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6/net/ipv4/tcp_output.c
===
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2081,7 +2081,7 @@ void tcp_send_fin(struct sock *sk)
} else {
/* Socket is locked, keep trying until memory is available. */
for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
if (skb)
break;
yield();
@@ -2114,7 +2114,7 @@ void tcp_send_active_reset(struct sock *
struct sk_buff *skb;
 
/* NOTE: No TCP options attached and we never retransmit this. */
-   skb = alloc_skb(MAX_TCP_HEADER, priority);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
return;
@@ -2187,7 +2187,8 @@ struct sk_buff * tcp_make_synack(struct 
__u8 *md5_hash_location;
 #endif
 
-   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+   sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
 
@@ -2446,7 +2447,7 @@ void tcp_send_ack(struct sock *sk)
 * tcp_transmit_skb() will set the ownership to this
 * sock.
 */
-   buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)-icsk_ack.ato = TCP_ATO_MIN;
@@ -2488,7 +2489,7 @@ static int tcp_xmit_probe_skb(struct soc
struct sk_buff *skb;
 
/* We don't queue it, tcp_transmit_skb() sets ownership. */
-   skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -419,6 +419,11 @@ static inline int sock_flag(struct sock 
	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+   return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
	sk->sk_ack_backlog--;
@@ -1212,7 +1217,7 @@ static inline struct sk_buff *sk_stream_
int hdr_len;
 
	hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
-   skb = alloc_skb_fclone(size + hdr_len, gfp);
+   skb = alloc_skb_fclone(size + hdr_len, sk_allocation(sk, gfp));
if (skb) {
		skb->truesize += mem;
		if (sk_stream_wmem_schedule(sk, skb->truesize)) {
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -573,7 +573,8 @@ static int tcp_v6_md5_do_add(struct sock
} else {
/* reallocate new list if current one is full. */
		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
			if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -583,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock
tcp_alloc_md5sig_pool();
	if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
		keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-			       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+			       (tp->md5sig_info->entries6 + 1)),
+			       sk_allocation(sk, GFP_ATOMIC));
 
if (!keys) {
tcp_free_md5sig_pool();
@@ -709,7 +711,7 @@ static int tcp_v6_parse_md5_keys (struct
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
 
-   p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+	p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
 

[PATCH 18/33] netvm: INET reserves.

2007-10-30 Thread Peter Zijlstra
Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under the generic RX reserve; its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassembly line, we avoid fragment-attack deadlocks.

Use proc conv() routines to update these limits and return -ENOMEM to user
space.

Adds to the reserve tree:

  total network reserve  
network TX reserve   
  protocol TX pages  
network RX reserve   
+ IPv6 route cache   
+ IPv4 route cache   
  SKB data reserve   
+   IPv6 fragment cache  
+   IPv4 fragment cache  
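
The conv() pattern described above, resize the reserve first and store the sysctl value only on success, can be modelled like this (reserve_set_m() stands in for mem_reserve_kmalloc_set(); the backing limit is invented for the test):

```c
#include <assert.h>
#include <errno.h>

static long reserve_backing = 1 << 20;	/* pretend 1 MiB can be reserved */
static long reserve_bytes;
static int sysctl_frag_high_thresh_m = 256 * 1024;

static int reserve_set_m(long bytes)
{
	if (bytes > reserve_backing)
		return -ENOMEM;
	reserve_bytes = bytes;
	return 0;
}

/* Model of do_proc_dointvec_fragment_conv() on the write path: a failed
 * reserve update propagates -ENOMEM and leaves the sysctl untouched. */
static int frag_thresh_write_m(int value)
{
	int err = reserve_set_m(value);
	if (err)
		return err;
	sysctl_frag_high_thresh_m = value;
	return 0;
}
```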

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/sysctl.h |   11 +++
 kernel/sysctl.c|8 ++--
 net/ipv4/ip_fragment.c |7 +++
 net/ipv4/route.c   |   30 +-
 net/ipv4/sysctl_net_ipv4.c |   24 +++-
 net/ipv6/reassembly.c  |7 +++
 net/ipv6/route.c   |   31 ++-
 net/ipv6/sysctl_net_ipv6.c |   24 +++-
 8 files changed, 136 insertions(+), 6 deletions(-)

Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -18,6 +18,7 @@
 #include net/route.h
 #include net/tcp.h
 #include net/cipso_ipv4.h
+#include linux/reserve.h
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,27 @@ static int strategy_allowed_congestion_c
 
 }
 
+extern struct mem_reserve ipv4_frag_reserve;
+
+static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp,
+int *valp, int write, void *data)
+{
+   if (write) {
+   long value = *negp ? -*lvalp : *lvalp;
+		int err = mem_reserve_kmalloc_set(&ipv4_frag_reserve, value);
+   if (err)
+   return err;
+   }
+   return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file 
*filp,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+   do_proc_dointvec_fragment_conv, NULL);
+}
+
 ctl_table ipv4_table[] = {
{
.ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +313,7 @@ ctl_table ipv4_table[] = {
		.data		= &sysctl_ipfrag_high_thresh,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec
+   .proc_handler   = proc_dointvec_fragment
},
{
.ctl_name   = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6/net/ipv6/sysctl_net_ipv6.c
===
--- linux-2.6.orig/net/ipv6/sysctl_net_ipv6.c
+++ linux-2.6/net/ipv6/sysctl_net_ipv6.c
@@ -12,9 +12,31 @@
 #include net/ndisc.h
 #include net/ipv6.h
 #include net/addrconf.h
+#include linux/reserve.h
 
 #ifdef CONFIG_SYSCTL
 
+extern struct mem_reserve ipv6_frag_reserve;
+
+static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp,
+int *valp, int write, void *data)
+{
+   if (write) {
+   long value = *negp ? -*lvalp : *lvalp;
+		int err = mem_reserve_kmalloc_set(&ipv6_frag_reserve, value);
+   if (err)
+   return err;
+   }
+   return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file 
*filp,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+   do_proc_dointvec_fragment_conv, NULL);
+}
+
 static ctl_table ipv6_table[] = {
{
.ctl_name   = NET_IPV6_ROUTE,
@@ -44,7 +66,7 @@ static ctl_table ipv6_table[] = {
		.data		= &sysctl_ip6frag_high_thresh,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec
+   .proc_handler   = proc_dointvec_fragment
},
{
.ctl_name   = NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6/net/ipv4/ip_fragment.c
===
--- linux-2.6.orig/net/ipv4/ip_fragment.c
+++ linux-2.6/net/ipv4/ip_fragment.c
@@ -43,6 +43,7 @@
 #include linux/udp.h
 #include linux/inet.h
 #include linux/netfilter_ipv4.h
+#include linux/reserve.h
 
 /* 

[PATCH 07/33] mm: serialize access to min_free_kbytes

2007-10-30 Thread Peter Zijlstra
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
 Movable,
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4162,12 +4163,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -4222,6 +4223,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+   unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4257,7 +4267,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
-   setup_per_zone_pages_min();
+   __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
 }

--



[PATCH 11/33] mm: memory reserve management

2007-10-30 Thread Peter Zijlstra
Generic reserve management code. 

It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools can be built, which could fully replace mempool_t
functionality.
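
The charge contract, usage counts against a limit and a failed charge means the caller must not allocate, can be modelled minimally (struct res_m is illustrative, not the patch's struct mem_reserve):

```c
#include <assert.h>

struct res_m { long usage, limit; };

/* Charge: fails (and changes nothing) when it would exceed the limit,
 * unless the caller allows overcommit.  Negative amounts uncharge. */
static int mem_reserve_charge_m(struct res_m *r, long amount, int overcommit)
{
	if (!overcommit && amount > 0 && r->usage + amount > r->limit)
		return -1;
	r->usage += amount;
	return 0;
}
```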

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/reserve.h |   54 +
 mm/Makefile |2 
 mm/reserve.c|  436 
 3 files changed, 491 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/reserve.h
===
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,54 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED]
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mem_reserve {
+   struct mem_reserve *parent;
+   struct list_head children;
+   struct list_head siblings;
+
+   const char *name;
+
+   long pages;
+   long limit;
+   long usage;
+   spinlock_t lock;/* protects limit and usage */
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+ struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+   struct mem_reserve *node);
+int mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages,
+int overcommit);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+  int overcommit);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+  struct kmem_cache *s,
+  int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+ long objs,
+ int overcommit);
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
   page_alloc.o page-writeback.o pdflush.o \
   readahead.o swap.o truncate.o vmscan.o \
   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-  page_isolation.o $(mmu-y)
+  page_isolation.o reserve.o $(mmu-y)
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
Index: linux-2.6/mm/reserve.c
===
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,436 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED]
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of object of specified
+ * size. Since memory is managed in pages, this reserve demand is then
+ * translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the unit starts mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve(), which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+   

[PATCH 21/33] netvm: prevent a TCP specific deadlock

2007-10-30 Thread Peter Zijlstra
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |7 ---
 net/core/stream.c  |5 +++--
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -743,7 +743,8 @@ static inline struct inode *SOCK_INODE(s
 }
 
 extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+   int size, int kind);
 
 #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
 
@@ -761,13 +762,13 @@ static inline void sk_stream_mem_reclaim
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
	return (int)skb->truesize <= sk->sk_forward_alloc ||
-		sk_stream_mem_schedule(sk, skb->truesize, 1);
+		sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
 }
 
 static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
 {
	return size <= sk->sk_forward_alloc ||
-	       sk_stream_mem_schedule(sk, size, 0);
+	       sk_stream_mem_schedule(sk, NULL, size, 0);
 }
 
 /* Used by processes to lock a socket state, so that
Index: linux-2.6/net/core/stream.c
===
--- linux-2.6.orig/net/core/stream.c
+++ linux-2.6/net/core/stream.c
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
 
 EXPORT_SYMBOL(__sk_stream_mem_reclaim);
 
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
 {
int amt = sk_stream_pages(size);
 
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
/* Over hard limit. */
	if (atomic_read(&sk->sk_prot->memory_allocated) >
			sk->sk_prot->sysctl_mem[2]) {
		sk->sk_prot->enter_memory_pressure();
-		goto suppress_allocation;
+		if (!skb || (skb && !skb_emergency(skb)))
+			goto suppress_allocation;
}
 
/* Under pressure. */

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/33] net: packet split receive api

2007-10-30 Thread Peter Zijlstra
Add some packet-split receive hooks.

For one, this allows NUMA-node-affine page allocations. Later on these hooks
will be extended to do emergency reserve allocations for fragments.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 drivers/net/e1000/e1000_main.c |8 ++--
 drivers/net/sky2.c |   16 ++--
 include/linux/skbuff.h |   23 +++
 net/core/skbuff.c  |   20 
 4 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4407,12 +4407,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
					PAGE_SIZE, PCI_DMA_FROMDEVICE);
			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-					   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
}
 
/* strip the ethernet crc, problem is we're using pages now so
@@ -4618,7 +4614,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
		if (j < adapter->rx_ps_pages) {
			if (likely(!ps_page->ps_page[j])) {
				ps_page->ps_page[j] =
-					alloc_page(GFP_ATOMIC);
+					netdev_alloc_page(netdev);
				if (unlikely(!ps_page->ps_page[j])) {
					adapter->alloc_rx_buff_failed++;
goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -846,6 +846,9 @@ static inline void skb_fill_page_desc(st
	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+   int off, int size);
+
 #define SKB_PAGE_ASSERT(skb)	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb)	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)	BUG_ON(skb_is_nonlinear(skb))
@@ -1339,6 +1342,26 @@ static inline struct sk_buff *netdev_all
return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ * netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ *
+ * Allocate a new page node local to the specified device.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+   return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+   __free_page(page);
+}
+
 /**
  * skb_clone_writable - is the header of a clone writable
  * @skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+   struct page *page;
+
+   page = alloc_pages_node(node, gfp_mask, 0);
+   return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+   int size)
+{
+   skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
@@ -2464,6 +2482,8 @@ EXPORT_SYMBOL(kfree_skb);
 EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
Index: linux-2.6/drivers/net/sky2.c
===
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1173,7 +1173,7 @@ static struct sk_buff *sky2_rx_alloc(str
skb_reserve(skb, ALIGN(p, RX_SKB_ALIGN) - p);
 
for 

[PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Peter Zijlstra

Hi,

Another posting of the full swap over NFS series. 

[ I tried just posting the first part last time around, but
  that just caused more confusion for lack of a general picture ]

[ patches against 2.6.23-mm1, also to be found online at:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.23-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


  Part 1, patches 1-12

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use
fixed-size allocations, but relies heavily on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it is currently SLUB only.

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: slub: add knowledge of reserve pages
 4 - mm: allow mempool to fall back to memalloc reserves
 5 - mm: kmem_estimate_pages()
 6 - mm: allow PF_MEMALLOC from softirq context
 7 - mm: serialize access to min_free_kbytes
 8 - mm: emergency pool
 9 - mm: system wide ALLOC_NO_WATERMARK
10 - mm: __GFP_MEMALLOC
11 - mm: memory reserve management
12 - selinux: tag avc cache alloc as non-critical


  Part 2, patches 13-15

Provide some generic network infrastructure needed later on.

13 - net: wrap sk->sk_backlog_rcv()
14 - net: packet split receive api
15 - net: sk_allocation() - concentrate socket related allocations


  Part 3, patches 16-23

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive path require memory allocations. 

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet comes in within the window
that that memory buys us.

The solution to this problem is found in the fact that network is to be assumed
lossy. Even now, when there is no memory to receive packets the network card
will have to discard packets. What we do is move this into the network stack.

So we reserve a little pool to act as a receive buffer, this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (as would
have happened anyway). [ NOTE: this is a stable mode of operation with limited
memory usage, exactly the kind of thing we need ]

Again, care is taken to keep much of the overhead of this to only affect the
slow path. Only packets allocated from the reserves will suffer the extra
atomic overhead needed for accounting.

16 - netvm: network reserve infrastructure
17 - sysctl: propagate conv errors
18 - netvm: INET reserves.
19 - netvm: hook skb allocation to reserves
20 - netvm: filter emergency skbs.
21 - netvm: prevent a TCP specific deadlock
22 - netfilter: NF_QUEUE vs emergency skbs
23 - netvm: skb processing


  Part 4, patches 24-26

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device. The approach here has been questioned, people would like to see a less
invasive approach.

One suggestion is to create and use a_ops->swap_{in,out}().

24 - mm: prepare swap entry methods for use in page methods
25 - mm: add support for non block device backed swap files
26 - mm: methods for teaching filesystems about PG_swapcache pages


  Part 5, patches 27-33

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

27 - nfs: remove mempools
28 - nfs: teach the NFS client how to treat PG_swapcache pages
29 - nfs: disable data cache revalidation for swapfiles
30 - nfs: swap vs nfs_writepage
31 - nfs: enable swap on NFS
32 - nfs: fix various memory recursions possible with swap over NFS.
33 - nfs: do not warn on radix tree node allocation failures



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 29/33] nfs: disable data cache revalidation for swapfiles

2007-10-30 Thread Peter Zijlstra
Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/inode.c |6 
 fs/nfs/write.c |   73 ++---
 2 files changed, 65 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -744,6 +744,12 @@ int nfs_revalidate_mapping_nolock(struct
struct nfs_inode *nfsi = NFS_I(inode);
int ret = 0;
 
+   /*
+* swapfiles are not supposed to be shared.
+*/
+   if (IS_SWAPFILE(inode))
+   goto out;
+
	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
struct nfs_page *req = NULL;
 
-   if (PagePrivate(page)) {
+   if (PagePrivate(page))
req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			kref_get(&req->wb_kref);
-	}
+	else if (unlikely(PageSwapCache(page)))
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+	if (get && req)
+		kref_get(&req->wb_kref);
+
+
return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+   return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+	struct inode *inode = page_file_mapping(page)->host;
+   struct nfs_page *req = NULL;
+
+	spin_lock(&inode->i_lock);
+	req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+	spin_unlock(&inode->i_lock);
+
+   /*
+* hole here plugged by the caller holding onto PG_locked
+*/
+
+   return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+   if (PagePrivate(page))
+   return 1;
+
+   if (unlikely(PageSwapCache(page)))
+   return __nfs_page_has_request(page);
+
+   return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
	struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
 
	spin_lock(&inode->i_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
	spin_unlock(&inode->i_lock);
return req;
 }
@@ -255,7 +292,7 @@ static int nfs_page_async_flush(struct n
 
	spin_lock(&inode->i_lock);
	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
		if (req == NULL) {
			spin_unlock(&inode->i_lock);
return 0;
@@ -374,8 +411,14 @@ static int nfs_inode_add_request(struct 
		if (nfs_have_delegation(inode, FMODE_WRITE))
			nfsi->change_attr++;
	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
	nfsi->npages++;
	kref_get(&req->wb_kref);
return 0;
@@ -392,8 +435,10 @@ static void nfs_inode_remove_request(str
	BUG_ON (!NFS_WBACK_BUSY(req));

	spin_lock(&inode->i_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
	radix_tree_delete(&nfsi->nfs_page_tree, 

[PATCH 27/33] nfs: remove mempools

2007-10-30 Thread Peter Zijlstra
With the introduction of the shared dirty page accounting in .19, NFS should
not be able to surprise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence no more need for mempools.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/read.c  |   15 +++
 fs/nfs/write.c |   27 +--
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ  (32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-   struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+   struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
else {
			p->pagevec = kcalloc(pagecount, sizeof(struct page *),
					GFP_NOFS);
			if (!p->pagevec) {
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
p = NULL;
}
}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
	struct nfs_read_data *p = container_of(head, struct nfs_read_data,
			task.u.tk_rcu);
	if (p && (p->pagevec != p->page_array[0]))
		kfree(p->pagevec);
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -597,16 +594,10 @@ int __init nfs_init_readpagecache(void)
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
 
-   nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-nfs_rdata_cachep);
-   if (nfs_rdata_mempool == NULL)
-   return -ENOMEM;
-
return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-   mempool_destroy(nfs_rdata_mempool);
kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
 
 #define NFSDBG_FACILITYNFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE (32)
-#define MIN_POOL_COMMIT(4)
-
 /*
  * Local function declarations
  */
@@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r
	struct nfs_write_data *p = container_of(head, struct nfs_write_data,
			task.u.tk_rcu);
	if (p && (p->pagevec != p->page_array[0]))
		kfree(p->pagevec);
-	mempool_free(p, nfs_commit_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all
else {
			p->pagevec = kcalloc(pagecount, sizeof(struct page *),
					GFP_NOFS);
			if (!p->pagevec) {
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
}
}
@@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc
	struct nfs_write_data *p = container_of(head, struct nfs_write_data,
			task.u.tk_rcu);
	if (p && (p->pagevec != p->page_array[0]))
		kfree(p->pagevec);
-	mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1474,16 +1469,6 @@ int __init nfs_init_writepagecache(void)
if (nfs_wdata_cachep == NULL)
return -ENOMEM;
 
-  

[PATCH 31/33] nfs: enable swap on NFS

2007-10-30 Thread Peter Zijlstra
Provide an a_ops->swapfile() implementation for NFS. This will set the
NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC, as well
as reset SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects,
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/Kconfig  |   18 
 fs/nfs/file.c   |   10 ++
 include/linux/sunrpc/xprt.h |5 ++-
 net/sunrpc/sched.c  |9 --
 net/sunrpc/xprtsock.c   |   63 
 5 files changed, 102 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -371,6 +371,13 @@ static int nfs_launder_page(struct page 
	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -385,6 +392,9 @@ const struct address_space_operations nf
.direct_IO = nfs_direct_IO,
 #endif
.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+   .swapfile = nfs_swapfile,
+#endif
 };
 
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/include/linux/sunrpc/xprt.h
===
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -143,7 +143,9 @@ struct rpc_xprt {
	unsigned int		max_reqs;	/* total slots */
	unsigned long		state;		/* transport state */
	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
	unsigned int		bind_index;	/* bind function index */
 
/*
@@ -246,6 +248,7 @@ struct rpc_rqst *   xprt_lookup_rqst(struc
 void   xprt_complete_rqst(struct rpc_task *task, int copied);
 void   xprt_release_rqst_cong(struct rpc_task *task);
 void   xprt_disconnect(struct rpc_xprt *xprt);
+intxs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt-state
Index: linux-2.6/net/sunrpc/sched.c
===
--- linux-2.6.orig/net/sunrpc/sched.c
+++ linux-2.6/net/sunrpc/sched.c
@@ -761,7 +761,10 @@ struct rpc_buffer {
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
struct rpc_buffer *buf;
-   gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+   gfp_t gfp = GFP_NOWAIT;
+
+   if (RPC_IS_SWAPPER(task))
+   gfp |= __GFP_MEMALLOC;
 
size += sizeof(struct rpc_buffer);
	if (size <= RPC_BUFFER_MAXSIZE)
@@ -817,6 +820,8 @@ void rpc_init_task(struct rpc_task *task
	atomic_set(&task->tk_count, 1);
	task->tk_client = clnt;
	task->tk_flags  = flags;
+	if (clnt->cl_xprt->swapper)
+		task->tk_flags |= RPC_TASK_SWAPPER;
	task->tk_ops = tk_ops;
	if (tk_ops->rpc_call_prepare != NULL)
		task->tk_action = rpc_prepare_task;
@@ -853,7 +858,7 @@ void rpc_init_task(struct rpc_task *task
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-   return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+   return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 static void rpc_free_task(struct rcu_head *rcu)
Index: linux-2.6/net/sunrpc/xprtsock.c
===
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1397,6 +1397,9 @@ static void xs_udp_finish_connecting(str
		transport->sock = sock;
		transport->inet = sk;

+		if (xprt->swapper)
+			sk_set_memalloc(sk);
+
		write_unlock_bh(&sk->sk_callback_lock);
	}
	xs_udp_do_set_buffer_size(xprt);
@@ -1414,11 +1417,15 @@ static void xs_udp_connect_worker4(struc
container_of(work, struct sock_xprt, connect_worker.work);
	struct rpc_xprt *xprt = transport->xprt;
	struct socket *sock = transport->sock;
+   unsigned long pflags = 

[PATCH 33/33] nfs: do not warn on radix tree node allocation failures

2007-10-30 Thread Peter Zijlstra
GFP_ATOMIC failures are rather common, do not warn about them.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/inode.c |2 +-
 fs/nfs/write.c |   10 ++
 2 files changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/nfs/inode.c
===
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -1172,7 +1172,7 @@ static void init_once(struct kmem_cache 
	INIT_LIST_HEAD(&nfsi->open_files);
	INIT_LIST_HEAD(&nfsi->access_cache_entry_lru);
	INIT_LIST_HEAD(&nfsi->access_cache_inode_lru);
-	INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
+	INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC|__GFP_NOWARN);
	nfsi->ncommit = 0;
	nfsi->npages = 0;
nfs4_init_once(nfsi);
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -652,6 +652,7 @@ static struct nfs_page * nfs_update_requ
	struct inode *inode = mapping->host;
struct nfs_page *req, *new = NULL;
pgoff_t rqend, end;
+   int error;
 
end = offset + bytes;
 
@@ -659,6 +660,10 @@ static struct nfs_page * nfs_update_requ
/* Loop over all inode entries and see if we find
 * A request for the page we wish to update
 */
+   error = radix_tree_preload(GFP_NOIO);
+   if (error)
+   return ERR_PTR(error);
+
	spin_lock(&inode->i_lock);
req = nfs_page_find_request_locked(NFS_I(inode), page);
if (req) {
@@ -666,6 +671,7 @@ static struct nfs_page * nfs_update_requ
int error;
 
			spin_unlock(&inode->i_lock);
+			radix_tree_preload_end();
			error = nfs_wait_on_request(req);
			nfs_release_request(req);
			if (error < 0) {
@@ -676,6 +682,7 @@ static struct nfs_page * nfs_update_requ
			continue;
		}
		spin_unlock(&inode->i_lock);
+		radix_tree_preload_end();
if (new)
nfs_release_request(new);
break;
@@ -687,13 +694,16 @@ static struct nfs_page * nfs_update_requ
error = nfs_inode_add_request(inode, new);
if (error) {
			spin_unlock(&inode->i_lock);
+			radix_tree_preload_end();
			nfs_unlock_request(new);
			return ERR_PTR(error);
		}
		spin_unlock(&inode->i_lock);
+		radix_tree_preload_end();
		return new;
	}
	spin_unlock(&inode->i_lock);
+   radix_tree_preload_end();
 
new = nfs_create_request(ctx, inode, page, offset, bytes);
if (IS_ERR(new))

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-30 Thread Peter Zijlstra
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save current->flags; ksoftirqd will have its own
task_struct.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/sched.h |4 
 kernel/softirq.c  |3 +++
 mm/page_alloc.c   |7 ---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1557,9 +1557,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}
 
Index: linux-2.6/kernel/softirq.c
===
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
account_system_vtime(current);
@@ -249,6 +251,7 @@ restart:
 
account_system_vtime(current);
_local_bh_enable();
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1389,6 +1389,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/33] selinux: tag avc cache alloc as non-critical

2007-10-30 Thread Peter Zijlstra
Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Acked-by: James Morris [EMAIL PROTECTED]
---
 security/selinux/avc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-2/security/selinux/avc.c
===
--- linux-2.6-2.orig/security/selinux/avc.c
+++ linux-2.6-2/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
struct avc_node *node;
 
-   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
 

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/33] mm: allow mempool to fall back to memalloc reserves

2007-10-30 Thread Peter Zijlstra
Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/mempool.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/mempool.c
===
--- linux-2.6.orig/mm/mempool.c
+++ linux-2.6/mm/mempool.c
@@ -14,6 +14,7 @@
 #include linux/mempool.h
 #include linux/blkdev.h
 #include linux/writeback.h
+#include internal.h
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
void *element;
unsigned long flags;
wait_queue_t wait;
-   gfp_t gfp_temp;
+   gfp_t gfp_temp, gfp_orig = gfp_mask;
 
	might_sleep_if(gfp_mask & __GFP_WAIT);

@@ -228,6 +229,15 @@ repeat_alloc:
	}
	spin_unlock_irqrestore(&pool->lock, flags);

+	/* if we really had right to the emergency reserves try those */
+	if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
+		if (gfp_temp & __GFP_NOMEMALLOC) {
+			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			goto repeat_alloc;
+		} else
+			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+	}
+
	/* We must not sleep in the GFP_ATOMIC case */
	if (!(gfp_mask & __GFP_WAIT))
		return NULL;

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK

2007-10-30 Thread Peter Zijlstra
Change ALLOC_NO_WATERMARKS page allocation so that the reserves are treated
as system-wide - which they are, per setup_per_zone_pages_min(). When we
scrape the barrel, do it properly.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1638,6 +1638,12 @@ restart:
 rebalance:
	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+				gfp_zone(gfp_mask);
+
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order, zonelist,
ALLOC_NO_WATERMARKS);

--

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.24] ixgb: TX hangs under heavy load

2007-10-30 Thread Kok, Auke
Andy Gospodarek wrote:
 Auke,
 
 It has become clear that this patch resolves some tx-lockups on the ixgb
 driver.  IBM did some checking and realized this hunk is in your
 sourceforge driver, but not anywhere else.  Mind if we add it?


I'll quickly double check where this came from in the first place and will post
this to Jeff

Thanks!

Auke


 Signed-off-by: Andy Gospodarek [EMAIL PROTECTED]
 
 ---
 
  ixgb_main.c |2 +-
  1 files changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
 index d444de5..3ec7a41 100644
 --- a/drivers/net/ixgb/ixgb_main.c
 +++ b/drivers/net/ixgb/ixgb_main.c
 @@ -1324,7 +1324,7 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct 
 sk_buff *skb,
  
   /* Workaround for premature desc write-backs
* in TSO mode.  Append 4-byte sentinel desc */
 - if (unlikely(mss && !nr_frags && size == len
 + if (unlikely(mss && (f == (nr_frags-1)) && size == len
  && size > 8))
   size -= 4;
  
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/1][CORE] resend - fix free_netdev on register_netdev failure

2007-10-30 Thread Daniel Lezcano
Point 1:
The unregistering of a network device schedules a netdev_run_todo.
This function calls dev->destructor when it is set, and the
destructor calls free_netdev.

Point 2:
In the case of an initialization of a network device the usual code
is:
 * alloc_netdev
 * register_netdev
- if this one fails, call free_netdev and exit with error.

Point 3:
In the register_netdevice function at the later state, when the device
is at the registered state, a call to the netdevice_notifiers is made.
If one of the notification falls into an error, a rollback to the
registered state is done using unregister_netdevice.

Conclusion:
When a network device fails to register during initialization because
one network subsystem returned an error during a notification call
chain, the network device is freed twice because of fact 1 and fact 2.
The second free_netdev will be done with an invalid pointer.

Proposed solution:
The following patch moves all the code of unregister_netdevice *except*
the call to net_set_todo to a new function, rollback_registered.

The following functions are changed in this way:
 * register_netdevice: calls rollback_registered when a notification fails
 * unregister_netdevice: calls rollback_registered + net_set_todo; the call
 to net_set_todo now comes last. Since it just adds an
 element to a list, that should not break anything.
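The double free described above can be captured in a toy model (the names are hypothetical stand-ins; it only mimics who calls free_netdev and how often):

```c
#include <assert.h>
#include <stdbool.h>

static int frees;			/* counts calls to "free_netdev" */
static void fake_free_netdev(void) { frees++; }

/* Old flow: a failed notifier ran the full unregister path, whose
 * deferred todo-work invoked dev->destructor -> free_netdev. */
static int register_old(bool notifier_fails)
{
	if (notifier_fails) {
		fake_free_netdev();	/* via netdev_run_todo */
		return -1;
	}
	return 0;
}

/* New flow: rollback_registered() unwinds state but does NOT queue the
 * todo-work, so only the caller's free_netdev() runs. */
static int register_new(bool notifier_fails)
{
	return notifier_fails ? -1 : 0;
}
```

In both flows the caller follows the usual alloc_netdev/register_netdev idiom and frees on error; only the old flow adds a second, hidden free.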

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/core/dev.c |  112 ++---
 1 file changed, 59 insertions(+), 53 deletions(-)

Index: net-2.6/net/core/dev.c
===
--- net-2.6.orig/net/core/dev.c
+++ net-2.6/net/core/dev.c
@@ -3496,6 +3496,60 @@ static void net_set_todo(struct net_devi
	spin_unlock(&net_todo_list_lock);
 }
 
+static void rollback_registered(struct net_device *dev)
+{
+   BUG_ON(dev_boot_phase);
+   ASSERT_RTNL();
+
+   /* Some devices call without registering for initialization unwind. */
+	if (dev->reg_state == NETREG_UNINITIALIZED) {
+		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
+			"was registered\n", dev->name, dev);
+
+   WARN_ON(1);
+   return;
+   }
+
+	BUG_ON(dev->reg_state != NETREG_REGISTERED);
+
+   /* If device is running, close it first. */
+   dev_close(dev);
+
+   /* And unlink it from device chain. */
+   unlist_netdevice(dev);
+
+	dev->reg_state = NETREG_UNREGISTERING;
+
+   synchronize_net();
+
+   /* Shutdown queueing discipline. */
+   dev_shutdown(dev);
+
+
+   /* Notify protocols, that we are about to destroy
+  this device. They should clean all the things.
+   */
+   call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+
+   /*
+*  Flush the unicast and multicast chains
+*/
+   dev_addr_discard(dev);
+
+	if (dev->uninit)
+		dev->uninit(dev);
+
+   /* Notifier chain MUST detach us from master device. */
+	BUG_TRAP(!dev->master);
+
+   /* Remove entries from kobject tree */
+   netdev_unregister_kobject(dev);
+
+   synchronize_net();
+
+   dev_put(dev);
+}
+
 /**
  * register_netdevice  - register a network device
  * @dev: device to register
@@ -3633,8 +3687,10 @@ int register_netdevice(struct net_device
/* Notify protocols, that a new device appeared. */
ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
ret = notifier_to_errno(ret);
-   if (ret)
-   unregister_netdevice(dev);
+   if (ret) {
+   rollback_registered(dev);
+		dev->reg_state = NETREG_UNREGISTERED;
+   }
 
 out:
return ret;
@@ -3911,59 +3967,9 @@ void synchronize_net(void)
 
 void unregister_netdevice(struct net_device *dev)
 {
-   BUG_ON(dev_boot_phase);
-   ASSERT_RTNL();
-
-   /* Some devices call without registering for initialization unwind. */
-	if (dev->reg_state == NETREG_UNINITIALIZED) {
-		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
-			"was registered\n", dev->name, dev);
-
-		WARN_ON(1);
-		return;
-	}
-
-	BUG_ON(dev->reg_state != NETREG_REGISTERED);
-
-	/* If device is running, close it first. */
-	dev_close(dev);
-
-	/* And unlink it from device chain. */
-	unlist_netdevice(dev);
-
-	dev->reg_state = NETREG_UNREGISTERING;
-
-   synchronize_net();
-
-   /* Shutdown queueing discipline. */
-   dev_shutdown(dev);
-
-
-   /* Notify protocols, that we are about to destroy
-  this device. They should clean all the things.
-   */
-   call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
-
-   /*
-*  Flush the unicast and multicast chains
- 

[patch 1/1][NETNS] resend: fix net released by rcu callback

2007-10-30 Thread Daniel Lezcano
When a network namespace reference is held by a network subsystem,
and when this reference is decremented in an rcu update callback, we
must ensure that there are no more outstanding rcu updates before
trying to free the network namespace.

In the normal case, the rcu_barrier is called when the network namespace
is exiting in the cleanup_net function.

But when a network namespace creation fails, and the subsystems are
undone (like the cleanup), the rcu_barrier is missing.

This patch adds the missing rcu_barrier.
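The role of the barrier can be sketched with a toy model (plain counters standing in for the RCU machinery; not kernel code): freeing is only safe once every queued callback - each of which still holds a reference - has run.

```c
#include <assert.h>

/* Toy model: each call_rcu() queues a callback that still holds a
 * reference on the namespace; freeing is only safe after a barrier
 * has run every queued callback. */
static int pending_callbacks;
static int net_refcount;

static void call_rcu_put(void)
{
	pending_callbacks++;		/* callback queued, ref still held */
}

static void rcu_barrier_model(void)
{
	while (pending_callbacks) {	/* drain: run each callback */
		pending_callbacks--;
		net_refcount--;		/* callback drops its reference */
	}
}
```

Skipping the barrier in the error-undo path corresponds to freeing while pending_callbacks is still non-zero, i.e. while a callback still holds a reference.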

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/core/net_namespace.c |2 ++
 1 file changed, 2 insertions(+)

Index: net-2.6/net/core/net_namespace.c
===
--- net-2.6.orig/net/core/net_namespace.c
+++ net-2.6/net/core/net_namespace.c
@@ -112,6 +112,8 @@ out_undo:
		if (ops->exit)
			ops->exit(net);
}
+
+   rcu_barrier();
goto out;
 }
 

-- 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/1][IPV6] resend: remove duplicate call to proc_net_remove

2007-10-30 Thread Daniel Lezcano
The file /proc/net/if_inet6 is removed twice.
First time in:
inet6_exit
 -> addrconf_cleanup
and again a few lines later in:
inet6_exit
 -> if6_proc_exit

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 net/ipv6/addrconf.c |4 
 1 file changed, 4 deletions(-)

Index: net-2.6/net/ipv6/addrconf.c
===
--- net-2.6.orig/net/ipv6/addrconf.c
+++ net-2.6/net/ipv6/addrconf.c
@@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void)
	del_timer(&addr_chk_timer);
 
rtnl_unlock();
-
-#ifdef CONFIG_PROC_FS
-	proc_net_remove(&init_net, "if_inet6");
-#endif
 }

-- 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] NFS: change the ip_map cache code to handle IPv6 addresses

2007-10-30 Thread Aurélien Charbon

Hi,

Here is the IPv6 support patch for the ip_map caching code in the NFS
server.

I have ported it onto 2.6.24-rc1 (which includes Brian Haley's
ipv6_addr_v4mapped function).

In case of bad formatting due to my mailer, you can also find the patch
attached.

Tests: tested with an IPv4-only network and basic NFS operations (mount,
file creation and modification)


Signed-off-by: Aurelien Charbon [EMAIL PROTECTED]
---

diff -p -u -r -N linux-2.6.24-rc1/fs/nfsd/export.c 
linux-2.6.24-rc1-ipmap/fs/nfsd/export.c

--- linux-2.6.24-rc1/fs/nfsd/export.c2007-10-30 12:47:21.0 +0100
+++ linux-2.6.24-rc1-ipmap/fs/nfsd/export.c2007-10-30 
17:18:21.0 +0100

@@ -35,6 +35,7 @@
#include <linux/lockd/bind.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/sunrpc/gss_api.h>
+#include <net/ipv6.h>

#define NFSDDBG_FACILITY	NFSDDBG_EXPORT

@@ -1556,6 +1557,7 @@ exp_addclient(struct nfsctl_client *ncp)
{
struct auth_domain	*dom;
int	i, err;
+	struct in6_addr addr6;

/* First, consistency check. */
err = -EINVAL;
@@ -1574,9 +1576,12 @@ exp_addclient(struct nfsctl_client *ncp)
goto out_unlock;

/* Insert client into hashtable. */
-	for (i = 0; i < ncp->cl_naddr; i++)
-		auth_unix_add_addr(ncp->cl_addrlist[i], dom);
-
+	for (i = 0; i < ncp->cl_naddr; i++) {
+		/* Mapping address */
+		ipv6_addr_set(&addr6, 0, 0,
+			htonl(0x0000FFFF), ncp->cl_addrlist[i].s_addr);
+		auth_unix_add_addr(&addr6, dom);
+	}
auth_unix_forget_old(dom);
auth_domain_put(dom);

diff -p -u -r -N linux-2.6.24-rc1/fs/nfsd/nfsctl.c 
linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c

--- linux-2.6.24-rc1/fs/nfsd/nfsctl.c2007-10-30 12:47:21.0 +0100
+++ linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c2007-10-30 
17:15:45.0 +0100

@@ -37,6 +37,7 @@
#include <linux/nfsd/syscall.h>

#include <asm/uaccess.h>
+#include <net/ipv6.h>

/*
 * We have a single directory with 9 nodes in it.
@@ -222,6 +223,7 @@ static ssize_t write_getfs(struct file *
struct auth_domain *clp;
int err = 0;
struct knfsd_fh *res;
+struct in6_addr in6;

if (size < sizeof(*data))
	return -EINVAL;
@@ -236,7 +238,12 @@ static ssize_t write_getfs(struct file *
res = (struct knfsd_fh*)buf;

exp_readlock();
-	if (!(clp = auth_unix_lookup(sin->sin_addr)))
+
+	/* IPv6 address mapping */
+	ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
+		((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+
+	if (!(clp = auth_unix_lookup(&in6)))
	err = -EPERM;
else {
	err = exp_rootfh(clp, data->gd_path, res, data->gd_maxlen);
@@ -257,6 +264,7 @@ static ssize_t write_getfd(struct file *
int err = 0;
struct knfsd_fh fh;
char *res;
+struct in6_addr in6;

if (size < sizeof(*data))
	return -EINVAL;
@@ -271,7 +279,11 @@ static ssize_t write_getfd(struct file *
res = buf;
sin = (struct sockaddr_in *)&data->gd_addr;
exp_readlock();
-	if (!(clp = auth_unix_lookup(sin->sin_addr)))
+
+	/* IPv6 address mapping */
+	ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
+		((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+
+	if (!(clp = auth_unix_lookup(&in6)))
	err = -EPERM;
else {
	err = exp_rootfh(clp, data->gd_path, fh, NFS_FHSIZE);
diff -p -u -r -N linux-2.6.24-rc1/include/linux/sunrpc/svcauth.h 
linux-2.6.24-rc1-ipmap/include/linux/sunrpc/svcauth.h
--- linux-2.6.24-rc1/include/linux/sunrpc/svcauth.h2007-10-30 
12:47:04.0 +0100
+++ linux-2.6.24-rc1-ipmap/include/linux/sunrpc/svcauth.h2007-10-30 
13:14:04.0 +0100

@@ -120,10 +120,10 @@ extern voidsvc_auth_unregister(rpc_auth

extern struct auth_domain *unix_domain_find(char *name);
extern void auth_domain_put(struct auth_domain *item);
-extern int auth_unix_add_addr(struct in_addr addr, struct auth_domain *dom);
+extern int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom);
extern struct auth_domain *auth_domain_lookup(char *name, struct auth_domain *new);

extern struct auth_domain *auth_domain_find(char *name);
-extern struct auth_domain *auth_unix_lookup(struct in_addr addr);
+extern struct auth_domain *auth_unix_lookup(struct in6_addr *addr);
extern int auth_unix_forget_old(struct auth_domain *dom);
extern void svcauth_unix_purge(void);
extern void svcauth_unix_info_release(void *);
diff -p -u -r -N linux-2.6.24-rc1/net/sunrpc/svcauth_unix.c 
linux-2.6.24-rc1-ipmap/net/sunrpc/svcauth_unix.c
--- linux-2.6.24-rc1/net/sunrpc/svcauth_unix.c2007-10-30 
12:47:07.0 +0100
+++ linux-2.6.24-rc1-ipmap/net/sunrpc/svcauth_unix.c2007-10-30 
17:17:00.0 +0100

@@ -11,7 +11,8 @@
#include <linux/hash.h>
#include <linux/string.h>
#include <net/sock.h>
-
+#include <net/ipv6.h>
+#include <linux/kernel.h>
#define RPCDBG_FACILITY	RPCDBG_AUTH


@@ -84,7 +85,7 @@ static void svcauth_unix_domain_release(
struct ip_map {
	struct cache_head	h;
	char			m_class[8];

[PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl

2007-10-30 Thread Aurélien Charbon
Here is a second missing part of the IPv6 support in the NFS server code,
concerning the knfsd syscall interface.

It updates write_getfs and write_getfd to accept IPv6 addresses.

Applies on a kernel including the ip_map cache modifications.

Tests: tested with an IPv4-only network and basic NFS operations (mount,
file creation and modification)


Signed-off-by: Aurelien Charbon [EMAIL PROTECTED]
---
diff -p -u -r -N linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c 
linux-2.6.24-rc1-nfsctl/fs/nfsd/nfsctl.c
--- linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c2007-10-30 
17:15:45.0 +0100
+++ linux-2.6.24-rc1-nfsctl/fs/nfsd/nfsctl.c2007-10-30 
17:21:36.0 +0100

@@ -219,7 +219,7 @@ static ssize_t write_unexport(struct fil
static ssize_t write_getfs(struct file *file, char *buf, size_t size)
{
struct nfsctl_fsparm *data;
-struct sockaddr_in *sin;
+struct sockaddr_in6 *sin6, sin6_storage;
struct auth_domain *clp;
int err = 0;
struct knfsd_fh *res;
@@ -229,9 +229,21 @@ static ssize_t write_getfs(struct file *
return -EINVAL;
data = (struct nfsctl_fsparm*)buf;
err = -EPROTONOSUPPORT;
-	if (data->gd_addr.sa_family != AF_INET)
+	switch (data->gd_addr.sa_family) {
+	case AF_INET6:
+		sin6 = &sin6_storage;
+		sin6 = (struct sockaddr_in6 *)&data->gd_addr;
+		ipv6_addr_copy(&in6, &sin6->sin6_addr);
+		break;
+	case AF_INET:
+		/* Map v4 address into v6 structure */
+		ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
+			((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+		break;
+	default:
		goto out;
-	sin = (struct sockaddr_in *)&data->gd_addr;
+	}
+
	if (data->gd_maxlen > NFS3_FHSIZE)
		data->gd_maxlen = NFS3_FHSIZE;

@@ -239,11 +251,7 @@ static ssize_t write_getfs(struct file *

exp_readlock();

-	/* IPv6 address mapping */
-	ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
-		((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
-
-	if (!(clp = auth_unix_lookup(&in6)))
+	if (!(clp = auth_unix_lookup(&in6)))
		err = -EPERM;
	else {
		err = exp_rootfh(clp, data->gd_path, res, data->gd_maxlen);
@@ -259,7 +267,7 @@ static ssize_t write_getfs(struct file *
static ssize_t write_getfd(struct file *file, char *buf, size_t size)
{
struct nfsctl_fdparm *data;
-struct sockaddr_in *sin;
+struct sockaddr_in6 *sin6, sin6_storage;
struct auth_domain *clp;
int err = 0;
struct knfsd_fh fh;
@@ -270,20 +278,31 @@ static ssize_t write_getfd(struct file *
return -EINVAL;
data = (struct nfsctl_fdparm*)buf;
err = -EPROTONOSUPPORT;
-	if (data->gd_addr.sa_family != AF_INET)
+	if (data->gd_addr.sa_family != AF_INET && data->gd_addr.sa_family != AF_INET6)
		goto out;
	err = -EINVAL;
	if (data->gd_version < 2 || data->gd_version > NFSSVC_MAXVERS)
		goto out;

	res = buf;
-	sin = (struct sockaddr_in *)&data->gd_addr;
	exp_readlock();
-
-	/* IPv6 address mapping */
-	ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
-		((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
-
-	if (!(clp = auth_unix_lookup(&in6)))
+	switch (data->gd_addr.sa_family) {
+	case AF_INET:
+		/* IPv6 address mapping */
+		ipv6_addr_set(&in6, 0, 0, htonl(0x0000FFFF),
+			((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+		break;
+	case AF_INET6:
+		sin6 = &sin6_storage;
+		sin6 = (struct sockaddr_in6 *)&data->gd_addr;
+		ipv6_addr_copy(&in6, &sin6->sin6_addr);
+		break;
+	default:
+		BUG();
+	}
+
+	if (!(clp = auth_unix_lookup(&in6)))
		err = -EPERM;
	else {
		err = exp_rootfh(clp, data->gd_path, fh, NFS_FHSIZE);



--


  Aurelien Charbon
  Linux NFSv4 team
  Bull SAS
Echirolles - France
http://nfsv4.bullopensource.org/
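The ::ffff:a.b.c.d mapped form that both patches construct with ipv6_addr_set() can be reproduced from userspace with the standard socket API; a minimal sketch (not part of the patch):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Build the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 address,
 * mirroring what the kernel-side ipv6_addr_set(&in6, 0, 0,
 * htonl(0x0000FFFF), v4addr) calls in the patches produce. */
static struct in6_addr map_v4(struct in_addr v4)
{
	struct in6_addr in6;

	memset(&in6, 0, sizeof(in6));
	in6.s6_addr[10] = 0xff;			/* the ::ffff: prefix */
	in6.s6_addr[11] = 0xff;
	memcpy(&in6.s6_addr[12], &v4.s_addr, 4);  /* low 32 bits = v4 */
	return in6;
}
```

IN6_IS_ADDR_V4MAPPED() recognizes the result, which is the same test ipv6_addr_v4mapped() performs kernel-side.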


[PATCH] ixgb: fix TX hangs under heavy load

2007-10-30 Thread Auke Kok
A merge error occurred where we merged the wrong block here
in version 1.0.120. The right condition for frags is slightly
different than for the skb, so account for the difference properly
and trim the TSO-based size right.

Originally part of a fix reported by IBM to fix TSO hangs on
pSeries hardware.
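The corrected sentinel test can be captured in a tiny standalone check (the helper name is made up, and mss is reduced to a boolean here): the old code only ever fired when there were no frags at all, while the sentinel trim belongs on the last frag.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether to trim 4 bytes for the TSO sentinel descriptor.
 * Old code tested !nr_frags (fires only when there are no frags);
 * the fix tests f == nr_frags - 1 (fires on the LAST frag). */
static bool trim_sentinel(bool tso, int f, int nr_frags, int size, int len)
{
	return tso && (f == nr_frags - 1) && size == len && size > 8;
}
```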

Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
Cc: Andy Gospodarek [EMAIL PROTECTED]
---

 drivers/net/ixgb/ixgb_main.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index e564335..3021234 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1321,8 +1321,8 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff 
*skb,
 
/* Workaround for premature desc write-backs
 * in TSO mode.  Append 4-byte sentinel desc */
-	if (unlikely(mss && !nr_frags && size == len
-		     && size > 8))
+	if (unlikely(mss && (f == (nr_frags - 1))
+		     && size == len && size > 8))
size -= 4;
 
		buffer_info->length = size;
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines

2007-10-30 Thread Jeff Garzik

Dale Farnsworth wrote:

On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote:

This commit made an incorrect assumption:
--
Author: Lennert Buytenhek [EMAIL PROTECTED]
 Date:   Fri Oct 19 04:10:10 2007 +0200

mv643xx_eth: Move ethernet register definitions into private header

Move the mv643xx's ethernet-related register definitions from

include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since
they aren't of any use outside the ethernet driver.

Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]

Acked-by: Tzachi Perelstein [EMAIL PROTECTED]
Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
--

arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there.

[EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe 


v2.6.24-rc1-138-g0119130

This patch fixes this by internalizing the 3 defines into pegasos_eth.c,
since they are no longer available elsewhere. Without this, the compile will fail.


That compile failure was fixed in commit
30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro.

However, as I examine that commit, I see that it defines offsets from
the eth block in the chip, rather than the full chip register block
as the Pegasos 2 code expects.  So, I think it fixes the compile
failure, but leaves the Pegasos 2 broken.

Luis, do you have Pegasos 2 hardware?  Can you (or anyone) verify that
the following patch is needed for the Pegasos 2?

Thanks,
-Dale

-

mv643xx_eth: Fix MV643XX_ETH offsets used by Pegasos 2

In the mv643xx_eth driver, we now use offsets from the ethernet
register block within the chip, but the pegasos 2 platform still
needs offsets from the full chip's register base address.

Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
---
 include/linux/mv643xx_eth.h |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mv643xx_eth.h b/include/linux/mv643xx_eth.h
index 8df230a..30e11aa 100644
--- a/include/linux/mv643xx_eth.h
+++ b/include/linux/mv643xx_eth.h
@@ -8,9 +8,9 @@
 #define MV643XX_ETH_NAME   "mv643xx_eth"
 #define MV643XX_ETH_SHARED_REGS0x2000
 #define MV643XX_ETH_SHARED_REGS_SIZE   0x2000
-#define MV643XX_ETH_BAR_4  0x220
-#define MV643XX_ETH_SIZE_REG_4 0x224
-#define MV643XX_ETH_BASE_ADDR_ENABLE_REG   0x0290
+#define MV643XX_ETH_BAR_4  0x2220
+#define MV643XX_ETH_SIZE_REG_4 0x2224
+#define MV643XX_ETH_BASE_ADDR_ENABLE_REG   0x2290


applied


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] ixgbe: minor sparse fixes

2007-10-30 Thread Jeff Garzik

Auke Kok wrote:

From: Stephen Hemminger [EMAIL PROTECTED]

Make strings const if possible, and fix includes so forward definitions
are seen.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/ixgbe/ixgbe.h   |2 +-
 drivers/net/ixgbe/ixgbe_82598.c |3 +--
 drivers/net/ixgbe/ixgbe_main.c  |9 +
 3 files changed, 7 insertions(+), 7 deletions(-)


applied 1-4

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e1000e: Fix typo !

2007-10-30 Thread Jeff Garzik

Auke Kok wrote:

From: Roel Kluin [EMAIL PROTECTED]

Signed-off-by: Roel Kluin [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000e/82571.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/82571.c b/drivers/net/e1000e/82571.c
index cf70522..14141a5 100644
--- a/drivers/net/e1000e/82571.c
+++ b/drivers/net/e1000e/82571.c
@@ -283,7 +283,7 @@ static s32 e1000_get_invariants_82571(struct e1000_adapter 
*adapter)
	adapter->flags &= ~FLAG_HAS_WOL;
	/* quad ports only support WoL on port A */
	if (adapter->flags & FLAG_IS_QUAD_PORT &&
-	    (!adapter->flags & FLAG_IS_QUAD_PORT_A))
+	    (!(adapter->flags & FLAG_IS_QUAD_PORT_A)))
		adapter->flags &= ~FLAG_HAS_WOL;
	break;


applied


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
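The "typo" fixed above is the classic precedence trap: `!` binds tighter than `&`, so the unparenthesized test never looks at the flag bit. A minimal standalone demonstration (the flag value is made up):

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_IS_QUAD_PORT_A 0x4u

/* What the buggy source '!flags & FLAG_IS_QUAD_PORT_A' actually
 * computes: (!flags) & FLAG..., i.e. (0 or 1) & 0x4, which is
 * always 0 - the condition can never be true. */
static bool buggy(unsigned flags)
{
	return (!flags) & FLAG_IS_QUAD_PORT_A;
}

/* The intended test: is the quad-port-A bit clear? */
static bool fixed(unsigned flags)
{
	return !(flags & FLAG_IS_QUAD_PORT_A);
}
```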


Re: [PATCH] ixgb: fix TX hangs under heavy load

2007-10-30 Thread Jeff Garzik

Auke Kok wrote:

A merge error occurred where we merged the wrong block here
in version 1.0.120. The right condition for frags is slightly
different than for the skb, so account for the difference properly
and trim the TSO-based size right.

Originally part of a fix reported by IBM to fix TSO hangs on
pSeries hardware.

Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
Cc: Andy Gospodarek [EMAIL PROTECTED]
---

 drivers/net/ixgb/ixgb_main.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c


applied


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] using mii-bitbang on different processor ports

2007-10-30 Thread Sergej Stepanov
The patch makes it possible to have the mdio and mdc pins on different
physical ports, also for CONFIG_PPC_CPM_NEW_BINDING.
To set it up in the device tree:
reg = <10d40 14 10d60 14>; // mdc: 0x10d40, mdio: 0x10d60
or
reg = <10d40 14>; // mdc and mdio have the same offset 0x10d40
The approach was taken from the older version.
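The one-or-two-tuple reg convention can be sketched as a small parser (plain arrays standing in for the flattened device tree; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* One <offset size> tuple: mdc and mdio share the offset.
 * Two tuples: the first is mdc, the second is mdio. */
struct mii_pins {
	uint32_t mdc;
	uint32_t mdio;
};

static struct mii_pins parse_reg(const uint32_t *cells, int ntuples)
{
	struct mii_pins p;

	p.mdc  = cells[0];
	p.mdio = (ntuples > 1) ? cells[2] : cells[0];
	return p;
}
```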

Signed-off-by: Sergej Stepanov [EMAIL PROTECTED]
--

diff --git a/drivers/net/fs_enet/mii-bitbang.c 
b/drivers/net/fs_enet/mii-bitbang.c
index b8e4a73..eea5feb 100644
--- a/drivers/net/fs_enet/mii-bitbang.c
+++ b/drivers/net/fs_enet/mii-bitbang.c
@@ -29,12 +29,16 @@
 
 #include fs_enet.h
 
-struct bb_info {
-   struct mdiobb_ctrl ctrl;
+struct bb_port {
__be32 __iomem *dir;
__be32 __iomem *dat;
-   u32 mdio_msk;
-   u32 mdc_msk;
+   u32 msk;
+};
+
+struct bb_info {
+   struct mdiobb_ctrl ctrl;
+   struct bb_port mdc;
+   struct bb_port mdio;
 };
 
 /* FIXME: If any other users of GPIO crop up, then these will have to
@@ -62,18 +66,18 @@ static inline void mdio_dir(struct mdiobb_ctrl *ctrl, int 
dir)
struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 
if (dir)
-		bb_set(bitbang->dir, bitbang->mdio_msk);
+		bb_set(bitbang->mdio.dir, bitbang->mdio.msk);
	else
-		bb_clr(bitbang->dir, bitbang->mdio_msk);
+		bb_clr(bitbang->mdio.dir, bitbang->mdio.msk);
 
	/* Read back to flush the write. */
-	in_be32(bitbang->dir);
+	in_be32(bitbang->mdio.dir);
 }
 
 static inline int mdio_read(struct mdiobb_ctrl *ctrl)
 {
struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
-	return bb_read(bitbang->dat, bitbang->mdio_msk);
+	return bb_read(bitbang->mdio.dat, bitbang->mdio.msk);
 }
 
 static inline void mdio(struct mdiobb_ctrl *ctrl, int what)
@@ -81,12 +85,12 @@ static inline void mdio(struct mdiobb_ctrl *ctrl, int what)
struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 
if (what)
-		bb_set(bitbang->dat, bitbang->mdio_msk);
+		bb_set(bitbang->mdio.dat, bitbang->mdio.msk);
	else
-		bb_clr(bitbang->dat, bitbang->mdio_msk);
+		bb_clr(bitbang->mdio.dat, bitbang->mdio.msk);
 
	/* Read back to flush the write. */
-	in_be32(bitbang->dat);
+	in_be32(bitbang->mdio.dat);
 }
 
 static inline void mdc(struct mdiobb_ctrl *ctrl, int what)
@@ -94,12 +98,12 @@ static inline void mdc(struct mdiobb_ctrl *ctrl, int what)
struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 
if (what)
-		bb_set(bitbang->dat, bitbang->mdc_msk);
+		bb_set(bitbang->mdc.dat, bitbang->mdc.msk);
	else
-		bb_clr(bitbang->dat, bitbang->mdc_msk);
+		bb_clr(bitbang->mdc.dat, bitbang->mdc.msk);
 
	/* Read back to flush the write. */
-	in_be32(bitbang->dat);
+	in_be32(bitbang->mdc.dat);
 }
 
 static struct mdiobb_ops bb_ops = {
@@ -114,23 +118,23 @@ static struct mdiobb_ops bb_ops = {
 static int __devinit fs_mii_bitbang_init(struct mii_bus *bus,
  struct device_node *np)
 {
-   struct resource res;
+   struct resource res[2];
const u32 *data;
int mdio_pin, mdc_pin, len;
	struct bb_info *bitbang = bus->priv;
 
-	int ret = of_address_to_resource(np, 0, &res);
+	int ret = of_address_to_resource(np, 0, &res[0]);
	if (ret)
		return ret;
 
-	if (res.end - res.start < 13)
+	if (res[0].end - res[0].start < 13)
		return -ENODEV;
 
/* This should really encode the pin number as well, but all
 * we get is an int, and the odds of multiple bitbang mdio buses
 * is low enough that it's not worth going too crazy.
 */
-	bus->id = res.start;
+	bus->id = res[0].start;
 
data = of_get_property(np, fsl,mdio-pin, len);
if (!data || len != 4)
@@ -142,15 +146,32 @@ static int __devinit fs_mii_bitbang_init(struct mii_bus 
*bus,
return -ENODEV;
mdc_pin = *data;
 
-	bitbang->dir = ioremap(res.start, res.end - res.start + 1);
-	if (!bitbang->dir)
+	bitbang->mdc.dir = ioremap(res[0].start, res[0].end - res[0].start + 1);
+	if (!bitbang->mdc.dir)
		return -ENOMEM;
 
-	bitbang->dat = bitbang->dir + 4;
-	bitbang->mdio_msk = 1 << (31 - mdio_pin);
-	bitbang->mdc_msk = 1 << (31 - mdc_pin);
+	bitbang->mdc.dat = bitbang->mdc.dir + 4;
+	if( !of_address_to_resource(np, 1, &res[1])) {
+		if (res[1].end - res[1].start < 13)
+			goto bad_resource;
+		bitbang->mdio.dir = ioremap(res[1].start, res[1].end - res[1].start + 1);
+		if (!bitbang->mdio.dir)
+			goto unmap_and_exit;
+		bitbang->mdio.dat = bitbang->mdio.dir + 4;
+	} else {
+		bitbang->mdio.dir = bitbang->mdc.dir;
+	

[git patches] net driver fixes

2007-10-30 Thread Jeff Garzik

Fixes, and a new DM9601 USB NIC id.

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/bfin_mac.c|2 -
 drivers/net/e1000/e1000.h |8 +++
 drivers/net/e1000/e1000_ethtool.c |   29 ++--
 drivers/net/e1000/e1000_hw.c  |4 +-
 drivers/net/e1000/e1000_main.c|7 +
 drivers/net/e1000/e1000_param.c   |   23 ++-
 drivers/net/e1000e/82571.c|2 +-
 drivers/net/e1000e/ethtool.c  |4 +-
 drivers/net/e1000e/param.c|   35 +++--
 drivers/net/ixgb/ixgb.h   |7 ++
 drivers/net/ixgb/ixgb_ethtool.c   |7 +
 drivers/net/ixgb/ixgb_hw.c|4 +-
 drivers/net/ixgb/ixgb_main.c  |   15 +---
 drivers/net/ixgb/ixgb_param.c |   43 +++--
 drivers/net/ixgbe/ixgbe.h |2 +-
 drivers/net/ixgbe/ixgbe_82598.c   |3 +-
 drivers/net/ixgbe/ixgbe_main.c|9 ---
 drivers/net/usb/dm9601.c  |4 +++
 include/linux/mv643xx_eth.h   |6 ++--
 19 files changed, 110 insertions(+), 104 deletions(-)

Auke Kok (1):
  ixgb: fix TX hangs under heavy load

Dale Farnsworth (1):
  mv643xx_eth: Fix MV643XX_ETH offsets used by Pegasos 2

Michael Hennerich (1):
  Blackfin EMAC driver: Fix Ethernet communication bug (dupliated and lost 
packets)

Peter Korsgaard (1):
  DM9601: Support for ADMtek ADM8515 NIC

Roel Kluin (1):
  e1000e: Fix typo ! 

Stephen Hemminger (4):
  e1000e: fix sparse warnings
  ixgb: fix sparse warnings
  e1000: sparse warnings fixes
  ixgbe: minor sparse fixes

diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c
index 53fe7de..084acfd 100644
--- a/drivers/net/bfin_mac.c
+++ b/drivers/net/bfin_mac.c
@@ -371,7 +371,6 @@ static void bf537_adjust_link(struct net_device *dev)
	if (phydev->speed != lp->old_speed) {
 #if defined(CONFIG_BFIN_MAC_RMII)
		u32 opmode = bfin_read_EMAC_OPMODE();
-		bf537mac_disable();
		switch (phydev->speed) {
		case 10:
			opmode |= RMII_10;
@@ -386,7 +385,6 @@ static void bf537_adjust_link(struct net_device *dev)
			break;
		}
		bfin_write_EMAC_OPMODE(opmode);
-		bf537mac_enable();
 #endif
 
new_state = 1;
diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 781ed99..3b84028 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -351,4 +351,12 @@ enum e1000_state_t {
__E1000_DOWN
 };
 
+extern char e1000_driver_name[];
+extern const char e1000_driver_version[];
+
+extern void e1000_power_up_phy(struct e1000_adapter *);
+extern void e1000_set_ethtool_ops(struct net_device *netdev);
+extern void e1000_check_options(struct e1000_adapter *adapter);
+
+
 #endif /* _E1000_H_ */
diff --git a/drivers/net/e1000/e1000_ethtool.c 
b/drivers/net/e1000/e1000_ethtool.c
index 6c9a643..667f18b 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -32,9 +32,6 @@
 
 #include <asm/uaccess.h>
 
-extern char e1000_driver_name[];
-extern char e1000_driver_version[];
-
 extern int e1000_up(struct e1000_adapter *adapter);
 extern void e1000_down(struct e1000_adapter *adapter);
 extern void e1000_reinit_locked(struct e1000_adapter *adapter);
@@ -733,16 +730,16 @@ err_setup:
 
 #define REG_PATTERN_TEST(R, M, W)  
\
 {  
\
-	uint32_t pat, value;                                               \
-	uint32_t test[] =                                                  \
+	uint32_t pat, val;                                                 \
+	const uint32_t test[] =                                            \
	{0x5A5A5A5A, 0xA5A5A5A5, 0x00000000, 0xFFFFFFFF};                  \
-	for (pat = 0; pat < ARRAY_SIZE(test); pat++) {                     \
+	for (pat = 0; pat < ARRAY_SIZE(test); pat++) {                     \
		E1000_WRITE_REG(&adapter->hw, R, (test[pat] & W));         \
-		value = E1000_READ_REG(&adapter->hw, R);                   \
-		if (value != (test[pat] & W & M)) {                        \
+		val = E1000_READ_REG(&adapter->hw, R);                     \
+		if (val != (test[pat] & W & M)) {                          \
			DPRINTK(DRV, ERR, "pattern test reg %04X failed: got " \
				"0x%08X expected 0x%08X\n",                \
-				E1000_##R, value, (test[pat] & W & M));    \
+ 

Re: [PATCH 1/2] NFS: change the ip_map cache code to handle IPv6 addresses

2007-10-30 Thread J. Bruce Fields
Thanks for working on this.

Could you run linux/scripts/checkpatch.pl on your patch and fix the
problems it complains about?

On Tue, Oct 30, 2007 at 06:05:42PM +0100, Aurélien Charbon wrote:
 static void update(struct cache_head *cnew, struct cache_head *citem)
 {
 @@ -149,22 +157,24 @@ static void ip_map_request(struct cache_
   struct cache_head *h,
   char **bpp, int *blen)
 {
 -char text_addr[20];
 +char text_addr[40];
 struct ip_map *im = container_of(h, struct ip_map, h);
 -__be32 addr = im->m_addr.s_addr;
 -
 -snprintf(text_addr, 20, "%u.%u.%u.%u",
 - ntohl(addr) >> 24 & 0xff,
 - ntohl(addr) >> 16 & 0xff,
 - ntohl(addr) >>  8 & 0xff,
 - ntohl(addr) >>  0 & 0xff);
 +if (ipv6_addr_v4mapped(&(im->m_addr))) {
 +snprintf(text_addr, 20, NIPQUAD_FMT,
 +ntohl(im->m_addr.s6_addr32[3]) >> 24 & 0xff,
 +ntohl(im->m_addr.s6_addr32[3]) >> 16 & 0xff,
 +ntohl(im->m_addr.s6_addr32[3]) >>  8 & 0xff,
 +ntohl(im->m_addr.s6_addr32[3]) >>  0 & 0xff);
 +} else {
 +snprintf(text_addr, 40, NIP6_FMT, NIP6(im->m_addr));
 +}
 qword_add(bpp, blen, im->m_class);
 qword_add(bpp, blen, text_addr);
 (*bpp)[-1] = '\n';
 }
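As an aside, the v4-mapped check and the two formatting branches in the hunk above can be sketched as a standalone user-space C program; addr_is_v4mapped() and format_addr() here are illustrative stand-ins, not the kernel's helpers:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Stand-in for ipv6_addr_v4mapped(): an IPv4-mapped IPv6 address is
 * ::ffff:a.b.c.d -- 80 zero bits, 16 one bits, then the IPv4 address. */
static int addr_is_v4mapped(const uint8_t a[16])
{
    static const uint8_t prefix[12] =
        { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff };
    return memcmp(a, prefix, sizeof(prefix)) == 0;
}

/* Mirror of the patched ip_map_request() logic: dotted quad for
 * v4-mapped addresses, full hex groups otherwise. */
static void format_addr(const uint8_t a[16], char *buf, size_t len)
{
    if (addr_is_v4mapped(a)) {
        snprintf(buf, len, "%u.%u.%u.%u", a[12], a[13], a[14], a[15]);
    } else {
        snprintf(buf, len,
                 "%02x%02x:%02x%02x:%02x%02x:%02x%02x:"
                 "%02x%02x:%02x%02x:%02x%02x:%02x%02x",
                 a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
                 a[8], a[9], a[10], a[11], a[12], a[13], a[14], a[15]);
    }
}
```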

What happens when an unpatched mountd gets this request?  Does it
ignore it, or respond with a negative entry?

--b.


Re: [PATCH v2] using mii-bitbang on different processor ports

2007-10-30 Thread Scott Wood

Sergej Stepanov wrote:

+   if( !of_address_to_resource(np, 1, &res[1])) {


The spacing is still wrong.


-   iounmap(bitbang->dir);
+   if ( bitbang->mdio.dir != bitbang->mdc.dir)
+   iounmap(bitbang->mdio.dir);
+   iounmap(bitbang->mdc.dir);


And here.

-Scott


Re: [PATCH] DM9601: Support for ADMtek ADM8515 NIC

2007-10-30 Thread Jeff Garzik

Peter Korsgaard wrote:

Add device ID for the ADMtek ADM8515 USB NIC to the DM9601 driver.

Signed-off-by: Peter Korsgaard [EMAIL PROTECTED]


applied




Re: [PATCH 1/1] Blackfin EMAC driver: Fix Ethernet communication bug (duplicated and lost packets)

2007-10-30 Thread Jeff Garzik

Bryan Wu wrote:

From: Michael Hennerich [EMAIL PROTECTED]

Fix Ethernet communication bug (duplicated and lost packets)
in RMII PHY mode: don't call mac_disable and mac_enable during
10/100 REFCLK changes; mac_enable screws up the DMA descriptor chain.

Signed-off-by: Michael Hennerich [EMAIL PROTECTED]
Signed-off-by: Bryan Wu [EMAIL PROTECTED]
---
 drivers/net/bfin_mac.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)


applied




Re: [PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl

2007-10-30 Thread Brian Haley

Aurélien Charbon wrote:
Here is a second missing part of the IPv6 support in NFS server code 
concerning knfd syscall interface.

It updates write_getfd and write_getfd to accept IPv6 addresses.

Applies on a kernel including ip_map cache modifications


Both patches still have bugs; I think the patch I sent yesterday fixed 
them all, so I would recommend using that instead.  Of course, Neil's 
comment possibly trumps all that anyway...


-Brian


Re: [PATCH] ixgb: fix TX hangs under heavy load

2007-10-30 Thread Andy Gospodarek
On Tue, Oct 30, 2007 at 11:21:50AM -0700, Auke Kok wrote:
 A merge error occurred where we merged the wrong block here
 in version 1.0.120. The right condition for frags is slightly
 different than for the skb, so account for the difference properly
 and trim the TSO-based size correctly.
 
 Originally part of a fix reported by IBM to fix TSO hangs on
 pSeries hardware.
 
 Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
 Signed-off-by: Auke Kok [EMAIL PROTECTED]
 Cc: Andy Gospodarek [EMAIL PROTECTED]
 ---
 

Thanks, Auke and Jesse!


[PATCH] remove claim balance_rr won't reorder on many to one

2007-10-30 Thread Rick Jones
Remove the text which suggests that many balance_rr links feeding into
a single uplink will not experience packet reordering.

More up-to-date tests, with 1G links feeding into a switch with a 10G
uplink, using a 2.6.23-rc8 kernel on the system on which the 1G links
were bonded with balance_rr (mode=0) shows that even a many to one
link configuration will experience packet reordering and the attendant
TCP issues involving spurious retransmissions and the congestion
window.  This happens even with a single, simple bulk transfer such as
a netperf TCP_STREAM test.  A more complete description of the tests
and results, including tcptrace analysis of packet traces showing the
degree of reordering and such can be found at:

http://marc.info/?l=linux-netdev&m=119101513406349&w=2

Also, note that some switches use the term "trunking" in a context
other than link aggregation.

Signed-off-by:  Rick Jones [EMAIL PROTECTED]

---
diff -r 35e54d4beaad Documentation/networking/bonding.txt
--- a/Documentation/networking/bonding.txt  Wed Oct 24 05:06:40 2007 +
+++ b/Documentation/networking/bonding.txt  Mon Oct 29 03:47:19 2007 -0700
@@ -1696,23 +1696,6 @@ balance-rr: This mode is the only mode t
interface's worth of throughput, even after adjusting
tcp_reordering.
 
-   Note that this out of order delivery occurs when both the
-   sending and receiving systems are utilizing a multiple
-   interface bond.  Consider a configuration in which a
-   balance-rr bond feeds into a single higher capacity network
-   channel (e.g., multiple 100Mb/sec ethernets feeding a single
-   gigabit ethernet via an etherchannel capable switch).  In this
-   configuration, traffic sent from the multiple 100Mb devices to
-   a destination connected to the gigabit device will not see
-   packets out of order.  However, traffic sent from the gigabit
-   device to the multiple 100Mb devices may or may not see
-   traffic out of order, depending upon the balance policy of the
-   switch.  Many switches do not support any modes that stripe
-   traffic (instead choosing a port based upon IP or MAC level
-   addresses); for those devices, traffic flowing from the
-   gigabit device to the many 100Mb devices will only utilize one
-   interface.
-
If you are utilizing protocols other than TCP/IP, UDP for
example, and your application can tolerate out of order
delivery, then this mode can allow for single stream datagram
@@ -1720,7 +1703,9 @@ balance-rr: This mode is the only mode t
to the bond.
 
This mode requires the switch to have the appropriate ports
-   configured for etherchannel or trunking.
+   configured for etherchannel or aggregation. N.B. some
+   switches might use the term "trunking" for something other
+   than link aggregation.
 
 active-backup: There is not much advantage in this network topology to
the active-backup mode, as the inactive backup devices are all


[2.6 patch] fix drivers/net/wan/lmc/ compilation

2007-10-30 Thread Adrian Bunk
Documentation/SubmitChecklist, point 1:

--  snip  --

...
  CC  drivers/net/wan/lmc/lmc_main.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c: In 
function ‘lmc_ioctl’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c:239: 
error: expected expression before ‘else’
...
make[5]: *** [drivers/net/wan/lmc/lmc_main.o] Error 1

--  snip  --

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---
d5e92a30491abf073e0a7f4d46b466c7c97f0f61 
diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
index 64eb578..37c52e1 100644
--- a/drivers/net/wan/lmc/lmc_main.c
+++ b/drivers/net/wan/lmc/lmc_main.c
@@ -234,7 +234,7 @@ int lmc_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) /*fold00*/
 sc->lmc_xinfo.Magic1 = 0xDEADBEEF;
 
 if (copy_to_user(ifr->ifr_data, &sc->lmc_xinfo,
-   sizeof(struct lmc_xinfo))) {
+sizeof(struct lmc_xinfo)))
ret = -EFAULT;
else
ret = 0;
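The hunk drops a stray opening brace before the `else`, which is exactly what produced the "expected expression before 'else'" compile error. A minimal user-space sketch of the corrected shape (fake_copy_to_user() and xinfo_ioctl() are stand-ins, not the driver's code):

```c
#include <errno.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Models copy_to_user(): returns the number of bytes it could NOT copy
 * (non-zero means fault), copying via memcpy on success. */
static unsigned long fake_copy_to_user(void *dst, const void *src, size_t n)
{
    if (dst == NULL)
        return n;               /* simulate a faulting user pointer */
    memcpy(dst, src, n);
    return 0;
}

/* The corrected if/else: no braces, so the else binds cleanly. */
static int xinfo_ioctl(void *user_buf, const void *xinfo, size_t len)
{
    int ret;

    if (fake_copy_to_user(user_buf, xinfo, len))
        ret = -EFAULT;
    else
        ret = 0;
    return ret;
}
```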



Re: [PATCH] net: Saner thash_entries default with much memory

2007-10-30 Thread Andi Kleen

 Next, machines that service that many sockets typically have them
 mostly with full transmit queues talking to a very slow receiver at
 the other end. 

Not sure -- there are likely use cases with lots of idle but connected 
sockets.

Also, the constraint here is not really how many sockets are served,
but how well the hash function manages to spread them in the table.  I don't
have good data on that.

But (512 * 1024) still sounds reasonable because, e.g., in the
lots-of-idle-sockets case you're probably fine with the hash chains
having more than one entry in the worst case: a small working set
will fit in cache, and as long as the chains do not end up very long,
walking a short in-cache list will still be fast enough.

 So to me (512 * 1024) is a very reasonable limit and (with lockdep
 and spinlock debugging disabled) this makes the EHASH table consume
 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.

I still have my doubts that it makes sense to have a separate lock for each
bucket.  It would probably be better to divide the hash value by a factor
again and use that to index a smaller table containing only locks.
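That "smaller table of locks" idea can be sketched like this (the sizes and the pthread mutexes are illustrative; the kernel table would use its own spinlocks):

```c
#include <pthread.h>
#include <assert.h>

/* Illustrative sizes: many hash buckets share one lock by scaling the
 * bucket index down to a much smaller lock array. */
#define EHASH_BUCKETS  (512 * 1024)     /* hash table size under discussion */
#define LOCK_TABLE_SZ  1024             /* power of two, much smaller */

static pthread_mutex_t lock_table[LOCK_TABLE_SZ];

static pthread_mutex_t *bucket_lock(unsigned int hash)
{
    unsigned int bucket = hash & (EHASH_BUCKETS - 1);

    /* divide by a constant factor to index the smaller lock table */
    return &lock_table[bucket / (EHASH_BUCKETS / LOCK_TABLE_SZ)];
}
```

With 1024 locks instead of one per bucket, lock storage shrinks by several orders of magnitude, at the cost of unrelated buckets occasionally contending on a shared lock.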

-Andi


Re: [2.6 patch] fix drivers/net/wan/lmc/ compilation

2007-10-30 Thread Roel Kluin
Adrian Bunk wrote:
 Documentation/SubmitChecklist, point 1:
 
 --  snip  --
 
 ...
   CC  drivers/net/wan/lmc/lmc_main.o
 /home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c: In 
 function ‘lmc_ioctl’:
 /home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c:239: 
 error: expected expression before ‘else’
 ...
 make[5]: *** [drivers/net/wan/lmc/lmc_main.o] Error 1
 
 --  snip  --
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 ---
 d5e92a30491abf073e0a7f4d46b466c7c97f0f61 
 diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
 index 64eb578..37c52e1 100644
 --- a/drivers/net/wan/lmc/lmc_main.c
 +++ b/drivers/net/wan/lmc/lmc_main.c
 @@ -234,7 +234,7 @@ int lmc_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) /*fold00*/
  sc->lmc_xinfo.Magic1 = 0xDEADBEEF;
  
  if (copy_to_user(ifr->ifr_data, &sc->lmc_xinfo,
 - sizeof(struct lmc_xinfo))) {
 +  sizeof(struct lmc_xinfo)))
   ret = -EFAULT;
   else
   ret = 0;
 

I am sorry, my patch broke this; Kristov Provost also noticed it.
See http://lkml.org/lkml/2007/10/30/355




Re: [patch 1/1][NETNS] resend: fix net released by rcu callback

2007-10-30 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

 When a network namespace reference is held by a network subsystem,
 and when this reference is decremented in a rcu update callback, we
 must ensure that there is no more outstanding rcu update before 
 trying to free the network namespace.

 In the normal case, the rcu_barrier is called when the network namespace
 is exiting in the cleanup_net function.

 But when a network namespace creation fails, and the subsystems are
 undone (like the cleanup), the rcu_barrier is missing.

 This patch adds the missing rcu_barrier.

Looks sane.  Did you have any specific failures related to this or was
this something that was just caught in review?

Eric


Re: [patch 1/1][IPV6] resend: remove duplicate call to proc_net_remove

2007-10-30 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

 The file /proc/net/if_inet6 is removed twice.
 First time in:
 inet6_exit
  -> addrconf_cleanup
 And followed a few lines later by:
 inet6_exit
  -> if6_proc_exit

 Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Eric W. Biederman [EMAIL PROTECTED]

Looks like a good clean up to me.




 ---
  net/ipv6/addrconf.c |4 
  1 file changed, 4 deletions(-)

 Index: net-2.6/net/ipv6/addrconf.c
 ===
 --- net-2.6.orig/net/ipv6/addrconf.c
 +++ net-2.6/net/ipv6/addrconf.c
 @@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void)
  del_timer(&addr_chk_timer);
  
   rtnl_unlock();
 -
 -#ifdef CONFIG_PROC_FS
 - proc_net_remove(&init_net, "if_inet6");
 -#endif
  }

 -- 


Re: [PATCH] remove claim balance_rr won't reorder on many to one

2007-10-30 Thread Jay Vosburgh
Rick Jones [EMAIL PROTECTED] wrote:
[...]
-  Note that this out of order delivery occurs when both the
-  sending and receiving systems are utilizing a multiple
-  interface bond.  Consider a configuration in which a
-  balance-rr bond feeds into a single higher capacity network
-  channel (e.g., multiple 100Mb/sec ethernets feeding a single
-  gigabit ethernet via an etherchannel capable switch).  In this
-  configuration, traffic sent from the multiple 100Mb devices to
-  a destination connected to the gigabit device will not see
-  packets out of order.  However, traffic sent from the gigabit
-  device to the multiple 100Mb devices may or may not see
-  traffic out of order, depending upon the balance policy of the
-  switch.  Many switches do not support any modes that stripe
-  traffic (instead choosing a port based upon IP or MAC level
-  addresses); for those devices, traffic flowing from the
-  gigabit device to the many 100Mb devices will only utilize one
-  interface.

Rather than simply removing this entirely (because I do think
there is value in discussion of the reordering aspects of balance-rr),
I'd rather see something that makes the following points:

1- the worst reordering is balance-rr to balance-rr, back to
back.  The reordering rate here depends upon (a) the number of slaves
involved and (b) packet reception scheduling behaviors (packet
coalescing, NAPI, etc), and thus will vary significantly, but won't be
better than case #2.

2- next worst is balance-rr many slow to single fast, with
the reordering rate generally being substantially lower than case #1 (it
looked like your test showed about a 1% reordering rate, if I'm reading
your data correctly).

3- For the single fast to balance-rr many case, going
through a switch configured for etherchannel may or may not see traffic
out of order, depending upon the balance policy of the switch.  Many
switches do not support any modes that stripe traffic (instead choosing
a port based upon IP or MAC level addresses); for those devices, traffic
flowing from the [single fast] device to the [balance-rr many] devices
will only utilize one interface.

[...]
   This mode requires the switch to have the appropriate ports
-  configured for etherchannel or trunking.
+  configured for etherchannel or aggregation. N.B. some
+  switches might use the term "trunking" for something other
+  than link aggregation.

If memory serves, Sun uses the term "trunking" to refer to
etherchannel-compatible behavior.

I'm also hearing aggregation used to described 802.3ad
specifically.

Perhaps text of the form:

This mode requires the switch to have the appropriate ports
configured for Etherchannel.  Some switches use different terms, so
the configuration may be called "trunking" or "aggregation".  Note that
both of these terms also have other meanings.  For example, "trunking"
is also used to describe a type of switch port, and "aggregation" or
"link aggregation" is often used to refer to 802.3ad link aggregation,
which is compatible with bonding's 802.3ad mode, but not balance-rr.

Thoughts?

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: [PATCH] net: Saner thash_entries default with much memory

2007-10-30 Thread David Miller
From: Jean Delvare [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 14:18:27 +0100

 OK, let's go with (512 * 1024) then. Want me to send an updated patch?

Why submit a patch that's already in Linus's tree :-)


Re: [PATCH 23/33] netvm: skb processing

2007-10-30 Thread Stephen Hemminger
On Tue, 30 Oct 2007 17:04:24 +0100
Peter Zijlstra [EMAIL PROTECTED] wrote:

 In order to make sure emergency packets receive all memory needed to proceed
 ensure processing of emergency SKBs happens under PF_MEMALLOC.
 
 Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.
 
 Skip taps, since those are user-space again.
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  include/net/sock.h |5 +
  net/core/dev.c |   44 ++--
  net/core/sock.c|   18 ++
  3 files changed, 61 insertions(+), 6 deletions(-)
 
 Index: linux-2.6/net/core/dev.c
 ===
 --- linux-2.6.orig/net/core/dev.c
 +++ linux-2.6/net/core/dev.c
 @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
   struct net_device *orig_dev;
   int ret = NET_RX_DROP;
   __be16 type;
 + unsigned long pflags = current->flags;
 +
 + /* Emergency skb are special, they should
 +  *  - be delivered to SOCK_MEMALLOC sockets only
 +  *  - stay away from userspace
 +  *  - have bounded memory usage
 +  *
 +  * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
 +  * This saves us from propagating the allocation context down to all
 +  * allocation sites.
 +  */
 + if (skb_emergency(skb))
 + current->flags |= PF_MEMALLOC;
  
   /* if we've gotten here through NAPI, check netpoll */
   if (netpoll_receive_skb(skb))
 - return NET_RX_DROP;
 + goto out;

Why the change? doesn't gcc optimize the common exit case anyway?

  
   if (!skb->tstamp.tv64)
   net_timestamp(skb);
 @@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk
   orig_dev = skb_bond(skb);
  
   if (!orig_dev)
 - return NET_RX_DROP;
 + goto out;
  
   __get_cpu_var(netdev_rx_stat).total++;
  
 @@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk
   }
  #endif
  
 + if (skb_emergency(skb))
 + goto skip_taps;
 +
   list_for_each_entry_rcu(ptype, ptype_all, list) {
   if (!ptype->dev || ptype->dev == skb->dev) {
   if (pt_prev)
 @@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk
   }
   }
  
 +skip_taps:
  #ifdef CONFIG_NET_CLS_ACT
   if (pt_prev) {
   ret = deliver_skb(skb, pt_prev, orig_dev);
 @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
  
   if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
   kfree_skb(skb);
 - goto out;
 + goto unlock;
   }
  
   skb->tc_verd = 0;
  ncls:
  #endif
  
 + if (skb_emergency(skb))
 + switch(skb->protocol) {
 + case __constant_htons(ETH_P_ARP):
 + case __constant_htons(ETH_P_IP):
 + case __constant_htons(ETH_P_IPV6):
 + case __constant_htons(ETH_P_8021Q):
 + break;

Indentation is wrong, and hard-coding protocol values as a special case
seems bad here. What about VLANs, etc.?

 + default:
 + goto drop;
 + }
 +
   skb = handle_bridge(skb, pt_prev, ret, orig_dev);
   if (!skb)
 - goto out;
 + goto unlock;
   skb = handle_macvlan(skb, pt_prev, ret, orig_dev);
   if (!skb)
 - goto out;
 + goto unlock;
  
   type = skb->protocol;
   list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
 @@ -2056,6 +2085,7 @@ ncls:
   if (pt_prev) {
   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
   } else {
 +drop:
   kfree_skb(skb);
   /* Jamal, now you will not able to escape explaining
* me how you were going to use this. :-)
 @@ -2063,8 +2093,10 @@ ncls:
   ret = NET_RX_DROP;
   }
  
 -out:
 +unlock:
   rcu_read_unlock();
 +out:
 + tsk_restore_flags(current, pflags, PF_MEMALLOC);
   return ret;
  }
  
 Index: linux-2.6/include/net/sock.h
 ===
 --- linux-2.6.orig/include/net/sock.h
 +++ linux-2.6/include/net/sock.h
 @@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct
   skb-next = NULL;
  }
  
 +extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
 +
  static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
  {
 + if (skb_emergency(skb))
 + return __sk_backlog_rcv(sk, skb);
 +
   return sk->sk_backlog_rcv(sk, skb);
  }
  
 Index: linux-2.6/net/core/sock.c
 ===
 --- linux-2.6.orig/net/core/sock.c
 +++ linux-2.6/net/core/sock.c
 @@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
  }
  EXPORT_SYMBOL_GPL(sk_clear_memalloc);
  
 +#ifdef 

Re: [PATCH 23/33] netvm: skb processing

2007-10-30 Thread Peter Zijlstra
On Tue, 2007-10-30 at 14:26 -0700, Stephen Hemminger wrote:
 On Tue, 30 Oct 2007 17:04:24 +0100
 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  In order to make sure emergency packets receive all memory needed to proceed
  ensure processing of emergency SKBs happens under PF_MEMALLOC.
  
  Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog 
  processing.
  
  Skip taps, since those are user-space again.
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/net/sock.h |5 +
   net/core/dev.c |   44 ++--
   net/core/sock.c|   18 ++
   3 files changed, 61 insertions(+), 6 deletions(-)
  
  Index: linux-2.6/net/core/dev.c
  ===
  --- linux-2.6.orig/net/core/dev.c
  +++ linux-2.6/net/core/dev.c
  @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
  struct net_device *orig_dev;
  int ret = NET_RX_DROP;
  __be16 type;
  +   unsigned long pflags = current->flags;
  +
  +   /* Emergency skb are special, they should
  +*  - be delivered to SOCK_MEMALLOC sockets only
  +*  - stay away from userspace
  +*  - have bounded memory usage
  +*
  +* Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
  +* This saves us from propagating the allocation context down to all
  +* allocation sites.
  +*/
  +   if (skb_emergency(skb))
  +   current->flags |= PF_MEMALLOC;
   
  /* if we've gotten here through NAPI, check netpoll */
  if (netpoll_receive_skb(skb))
  -   return NET_RX_DROP;
  +   goto out;
 
 Why the change? doesn't gcc optimize the common exit case anyway?

It needs to unset PF_MEMALLOC at the exit.

  @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
   
  if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
  kfree_skb(skb);
  -   goto out;
  +   goto unlock;
  }
   
  skb->tc_verd = 0;
   ncls:
   #endif
   
  +   if (skb_emergency(skb))
  +   switch(skb->protocol) {
  +   case __constant_htons(ETH_P_ARP):
  +   case __constant_htons(ETH_P_IP):
  +   case __constant_htons(ETH_P_IPV6):
  +   case __constant_htons(ETH_P_8021Q):
  +   break;
 
 Indentation is wrong, and hard coding protocol values as special case
 seems bad here. What about vlan's, etc?

The other protocols needs analysis on what memory allocations occur
during packet processing, if anything is done that is not yet accounted
for (skb, route cache) then that needs to be added to a reserve, if
there are any paths that could touch user-space, those need to be
handled.

I've started looking at a few others, but it's hard, tedious work if
one is not familiar with the protocols.


  @@ -2063,8 +2093,10 @@ ncls:
  ret = NET_RX_DROP;
  }
   
  -out:
  +unlock:
  rcu_read_unlock();
  +out:
  +   tsk_restore_flags(current, pflags, PF_MEMALLOC);
  return ret;
   }

It's that tsk_restore_flags() there that requires the s/return/goto/
changes you noted earlier.
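That save/set/restore pattern can be shown in a short user-space sketch; PF_MEMALLOC's value, the global task_flags, and this tsk_restore_flags() are modeled after the patch, not taken from a kernel header:

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800UL        /* illustrative flag bit */

static unsigned long task_flags;        /* stands in for current->flags */

/* Restore only the bits in mask to their saved state. */
static void tsk_restore_flags(unsigned long *flags, unsigned long saved,
                              unsigned long mask)
{
    *flags = (*flags & ~mask) | (saved & mask);
}

/* Every exit path -- including early failures -- must restore the flag,
 * which is why each early "return" in the patch became "goto out". */
static int process_skb(int emergency, int fail_early)
{
    unsigned long pflags = task_flags;
    int ret = -1;

    if (emergency)
        task_flags |= PF_MEMALLOC;

    if (fail_early)
        goto out;

    ret = 0;
out:
    tsk_restore_flags(&task_flags, pflags, PF_MEMALLOC);
    return ret;
}
```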

 I am still not convinced that this solves the problem well enough
 to be useful.  Can you really survive a heavy memory overcommit?

On a machine with mem=128M, I ran 4 processes of 64M each: 2 file-backed
with the files on NFS, 2 anonymous. The processes just cycle through the
memory using writes. This is a 100% overcommit.

During these tests I ran various network loads.

I shut down the NFS server, waited for, say, 15 minutes, then restarted
the NFS server, and the machine came back up and continued.

 In other words, can you prove that the added complexity causes the system
 to survive a real test where otherwise it would not?

I've put some statistics in the skb reserve allocations; those are most
definitely used. I'm quite certain the machine would lock up solid
without them.



Re: [patch 1/1][NETNS] resend: fix net released by rcu callback

2007-10-30 Thread Daniel Lezcano

Eric W. Biederman wrote:

Daniel Lezcano [EMAIL PROTECTED] writes:


When a network namespace reference is held by a network subsystem,
and when this reference is decremented in a rcu update callback, we
must ensure that there is no more outstanding rcu update before 
trying to free the network namespace.


In the normal case, the rcu_barrier is called when the network namespace
is exiting in the cleanup_net function.

But when a network namespace creation fails, and the subsystems are
undone (like the cleanup), the rcu_barrier is missing.

This patch adds the missing rcu_barrier.


Looks sane.  Did you have any specific failures related to this or was
this something that was just caught in review?


Yes, I had this problem when doing ipv6 isolation for netns49. The ipv6 
subsystem creation failed and the different subsystems were rolled back 
in the setup_net function.
When the network namespace was about to be freed in the free_net function, 
I hit the error of a usage refcount different from zero.

It appeared to be coming from net/core/neighbour.c:

neigh_parms_release
 -> neigh_rcu_free_parms
   -> neigh_parms_put
     -> neigh_parms_destroy
       -> release_net

The free_net function was called before rcu callback neigh_rcu_free_parms.

