Re: [RFC 1/2] [IPV6] ADDRCONF: Preparation for configurable address selection policy with ifindex.
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] Date: Tue, 30 Oct 2007 14:52:37 +0900 (JST) Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED] What is the substance of this change? Please add a description of this to the changelog entry as currently the description is far too brief and vague. Even saying simply that the change allows the interface index to be passed into the address selection routines would be a great improvement. Thank you. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9260] New: tipc_config.h is not installed when doing make headers_install
From: Andrew Morton [EMAIL PROTECTED] Date: Mon, 29 Oct 2007 12:07:16 -0700 On Mon, 29 Oct 2007 09:10:26 -0700 (PDT) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9260 ... Problem Description: When doing make headers_install the file tipc_config.h is not installed. It describes the interface to configure the TIPC module and it is needed when building the config utility (tipc-config). Adding the following line to include/linux/Kbuild solves this: header-y += tipc_config.h Fair enough, I'll commit the following and submit to -stable as well. From 502ef38da15d817f8e67acefc12dc2212f7f8aa1 Mon Sep 17 00:00:00 2001 From: David S. Miller [EMAIL PROTECTED] Date: Tue, 30 Oct 2007 01:19:19 -0700 Subject: [PATCH] [TIPC]: Add tipc_config.h to include/linux/Kbuild. Needed, as reported in: http://bugzilla.kernel.org/show_bug.cgi?id=9260 Signed-off-by: David S. Miller [EMAIL PROTECTED] --- include/linux/Kbuild | 1 + 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 6a65231..bd33c22 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -149,6 +149,7 @@ header-y += ticable.h
 header-y += times.h
 header-y += tiocl.h
 header-y += tipc.h
+header-y += tipc_config.h
 header-y += toshiba.h
 header-y += ultrasound.h
 header-y += un.h
-- 
1.5.2.5
Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines
On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote: On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote: This commit made an incorrect assumption: -- Author: Lennert Buytenhek [EMAIL PROTECTED] Date: Fri Oct 19 04:10:10 2007 +0200 mv643xx_eth: Move ethernet register definitions into private header Move the mv643xx's ethernet-related register definitions from include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since they aren't of any use outside the ethernet driver. Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Acked-by: Tzachi Perelstein [EMAIL PROTECTED] Signed-off-by: Dale Farnsworth [EMAIL PROTECTED] -- arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there. [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe v2.6.24-rc1-138-g0119130 This patch fixes this by internalizing 3 defines onto pegasos which are simply no longer available elsewhere. Without this your compile will fail. That compile failure was fixed in commit 30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro. However, as I examine that commit, I see that it defines offsets from the eth block in the chip, rather than the full chip register block as the Pegasos 2 code expects. So, I think it fixes the compile failure, but leaves the Pegasos 2 broken. Luis, do you have Pegasos 2 hardware? Can you (or anyone) verify that the following patch is needed for the Pegasos 2? Nope, sorry. Luis
Re: [PATCH] net: Saner thash_entries default with much memory
From: Andi Kleen [EMAIL PROTECTED] Date: Fri, 26 Oct 2007 17:34:17 +0200 On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote: I propose 2 million entries as the arbitrary high limit. This It's probably still far too large. I agree. Perhaps a better number is something on the order of (512 * 1024), so I think I'll check in a variant of Jean's patch with just the limit decreased like that. Using just some back of the envelope calculations, on UP 64-bit systems each socket uses about 2424 bytes minimum of memory (this is the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP). This is an underestimate because it does not even consider things like allocator overhead. Next, machines that service that many sockets typically have them mostly with full transmit queues talking to a very slow receiver at the other end. So let's estimate that on average each socket consumes about 64K of retransmit queue data. I think this is an extremely conservative estimate because it doesn't even consider overhead coming from struct sk_buff and related state. So for (512 * 1024) of established sockets we consume roughly 35GB of memory, this is '((2424 + (64 * 1024)) * (512 * 1024))'. So to me (512 * 1024) is a very reasonable limit and (with lockdep and spinlock debugging disabled) this makes the EHASH table consume 8MB on UP 64-bit and ~12MB on SMP 64-bit systems. Thanks.
Re: [RFC 2/2] [IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] Date: Tue, 30 Oct 2007 14:52:54 +0900 (JST) diff --git a/include/linux/if_addrlabel.h b/include/linux/if_addrlabel.h new file mode 100644 index 000..66978a5 --- /dev/null +++ b/include/linux/if_addrlabel.h @@ -0,0 +1,55 @@ +/* + * ifaddrlabel.h - netlink interface for address labels + * + * Copyright (C) 2007 USAGI/WIDE Project, All Rights Reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: Please, this is just a very primitive header file defining a simplistic struct and a few enumerations. Can't you just GPL it with just the USAGI/WIDE copyright line, instead of using this complicated license text? If it is important for the USAGI Project to take credit for this work, they will receive it fully in the copyright line and the changelog entry. Thank you.
Re: [PATCH] [IPv4] SNMP: Refer correct memory location to display ICMP out-going statistics
From: David Stevens [EMAIL PROTECTED] Date: Mon, 29 Oct 2007 13:54:50 -0700 Dave, I didn't see a response for this one... in case it fell through the cracks. Just want to make sure my bone-headed error doesn't hang around too long. :-) It's in my tree now, never fear :-)
Re: dn_route.c momentarily exiting RCU read-side critical section
From: Paul E. McKenney [EMAIL PROTECTED] Date: Mon, 29 Oct 2007 14:15:40 -0700 net/decnet/dn_route.c in dn_rt_cache_get_next() is as follows: static struct dn_route *dn_rt_cache_get_next(struct seq_file *seq, struct dn_route *rt) { struct dn_rt_cache_iter_state *s = rcu_dereference(seq->private); rt = rt->u.dst.dn_next; while (!rt) { rcu_read_unlock_bh(); if (--s->bucket < 0) break; ... But what happens if seq->private is freed up right here? ... Or what prevents this from happening? ... Similar code is in rt_cache_get_next(). So, what am I missing here? seq->private is allocated on file open (here via seq_open_private()), and freed up on file close (via seq_release_private). So it cannot be freed up in the middle of an iteration.
Re: [PATCH] rpc_rdma: we need to cast u64 to unsigned long long for printing
From: Stephen Rothwell [EMAIL PROTECTED] Date: Tue, 30 Oct 2007 16:12:40 +1100 as some architectures have unsigned long for u64. net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_create_chunks': net/sunrpc/xprtrdma/rpc_rdma.c:222: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64' net/sunrpc/xprtrdma/rpc_rdma.c:234: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64' net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_count_chunks': net/sunrpc/xprtrdma/rpc_rdma.c:577: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64' Noticed on PowerPC pseries_defconfig build. Signed-off-by: Stephen Rothwell [EMAIL PROTECTED] I've applied this, thanks Stephen.
Re: [PATCH] ehea: add kexec support
Michael Ellerman [EMAIL PROTECTED] wrote on 28.10.2007 23:32:17: How do you plan to support kdump? When kexec is fully supported kdump should work out of the box as for any other ethernet card (if you load the right eth driver). There's nothing specific to kdump you have to handle in ethernet device drivers. Hope I didn't miss anything here... Gruss / Regards Christoph R
Re: kernel panic removing devices from a teql queuing discipline
From: Chuck Ebbert [EMAIL PROTECTED] Date: Mon, 29 Oct 2007 14:00:01 -0400 The panic is in __teql_resolve (which has been inlined into teql_master_xmit) in net/sched/sch_teql.c at this line: if (n && n->tbl == mn->tbl Specifically the dereference of n->tbl is faulting, as n is not valid. And the address looks like part of an ASCII string... I studied sch_teql.c a bit and I suspect the slave list management in teql_destroy() and teql_qdisc_init(). If someone can take a closer look at this, I'd appreciate it.
Re: [PATCH][RFC] Add support for the RDC R6040 Fast Ethernet controller
On Mon, 29 Oct 2007, Florian Fainelli wrote:
+static int mdio_read(struct net_device *dev, int phy_id, int location);
+static void mdio_write(struct net_device *dev, int phy_id, int location, int value);
+static int r6040_open(struct net_device *dev);
+static int r6040_start_xmit(struct sk_buff *skb, struct net_device *dev);
+static irqreturn_t r6040_interrupt(int irq, void *dev_id);
+static int r6040_close(struct net_device *dev);
+static void set_multicast_list(struct net_device *dev);
+static struct ethtool_ops netdev_ethtool_ops;
+static int r6040_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
+static void r6040_down(struct net_device *dev);
+static void r6040_up(struct net_device *dev);
+static void r6040_tx_timeout(struct net_device *dev);
+static void r6040_timer(unsigned long);
+static void r6040_mac_address(struct net_device *dev);
+
+static int phy_mode_chk(struct net_device *dev);
+static int phy_read(int ioaddr, int phy_adr, int reg_idx);
+static void phy_write(int ioaddr, int phy_adr, int reg_idx, int dat);
+static void rx_buf_alloc(struct r6040_private *lp, struct net_device *dev);
+#ifdef CONFIG_R6040_NAPI
+static int r6040_poll(struct napi_struct *napi, int budget);
+#endif
+
...Most of those forward declarations can go if the functions are ordered properly. One can trivially notice that the mdio_{read,write} are unnecessary already:
+static int mdio_read(struct net_device *dev, int phy_id, int regnum)
+{
+	struct r6040_private *lp = netdev_priv(dev);
+	long ioaddr = dev->base_addr;
+	return (phy_read(ioaddr, lp->phy_addr, regnum));
+}
+
+static void mdio_write(struct net_device *dev, int phy_id, int regnum, int value)
+{
+	struct r6040_private *lp = netdev_priv(dev);
+	long ioaddr = dev->base_addr;
+
+	phy_write(ioaddr, lp->phy_addr, regnum, value);
+}
-- 
	i.
Re: [RFC 1/2] [IPV6] ADDRCONF: Preparation for configurable address selection policy with ifindex.
Hi Yoshifuji, YOSHIFUJI Hideaki wrote on 10/30/2007 11:22:37 AM: -static inline int ipv6_saddr_label(const struct in6_addr *addr, int type) +static inline int ipv6_addr_label(const struct in6_addr *addr, int type, + int ifindex) This function doesn't use this new argument passed to it. Did you perhaps intend to use it to initialize daddr_ifindex? + int daddr_ifindex = daddr_dev ? daddr_dev->ifindex : 0; Thanks, - KK
[PATCH 1/1] Blackfin EMAC driver: Fix Ethernet communication bug (duplicated and lost packets)
From: Michael Hennerich [EMAIL PROTECTED] Fix Ethernet communication bug (duplicated and lost packets) in RMII PHY mode - don't call mac_disable and mac_enable during 10/100 REFCLK changes - mac_enable screws up the DMA descriptor chain Signed-off-by: Michael Hennerich [EMAIL PROTECTED] Signed-off-by: Bryan Wu [EMAIL PROTECTED] --- drivers/net/bfin_mac.c | 2 -- 1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c
index 53fe7de..084acfd 100644
--- a/drivers/net/bfin_mac.c
+++ b/drivers/net/bfin_mac.c
@@ -371,7 +371,6 @@ static void bf537_adjust_link(struct net_device *dev)
 	if (phydev->speed != lp->old_speed) {
 #if defined(CONFIG_BFIN_MAC_RMII)
 		u32 opmode = bfin_read_EMAC_OPMODE();
-		bf537mac_disable();
 		switch (phydev->speed) {
 		case 10:
 			opmode |= RMII_10;
@@ -386,7 +385,6 @@
 			break;
 		}
 		bfin_write_EMAC_OPMODE(opmode);
-		bf537mac_enable();
 #endif
 		new_state = 1;
-- 
1.5.3.4
Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops
On Mon, Oct 29 2007 at 22:16 +0200, Jens Axboe [EMAIL PROTECTED] wrote: On Fri, Oct 26 2007, Herbert Xu wrote: [CRYPTO] tcrypt: Move sg_init_table out of timing loops This patch moves the sg_init_table out of the timing loops for hash algorithms so that it doesn't impact on the speed test results. Wouldn't it be better to just make sg_init_one() call sg_init_table?
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4571231..ccc55a6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist *sg)
 }
 
 /**
- * sg_init_one - Initialize a single entry sg list
- * @sg:	 SG entry
- * @buf:	 Virtual address for IO
- * @buflen:	 IO length
- *
- * Notes:
- *   This should not be used on a single entry that is part of a larger
- *   table. Use sg_init_table() for that.
- *
- **/
-static inline void sg_init_one(struct scatterlist *sg, const void *buf,
-			       unsigned int buflen)
-{
-	memset(sg, 0, sizeof(*sg));
-#ifdef CONFIG_DEBUG_SG
-	sg->sg_magic = SG_MAGIC;
-#endif
-	sg_mark_end(sg, 1);
-	sg_set_buf(sg, buf, buflen);
-}
-
-/**
 * sg_init_table - Initialize SG table
 * @sgl:	 The SG table
 * @nents:	 Number of entries in table
@@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist *sgl, unsigned int nents)
 }
 
 /**
+ * sg_init_one - Initialize a single entry sg list
+ * @sg:	 SG entry
+ * @buf:	 Virtual address for IO
+ * @buflen:	 IO length
+ *
+ * Notes:
+ *   This should not be used on a single entry that is part of a larger
+ *   table. Use sg_init_table() for that.
+ *
+ **/
+static inline void sg_init_one(struct scatterlist *sg, const void *buf,
+			       unsigned int buflen)
+{
+	sg_init_table(sg, 1);
+	sg_set_buf(sg, buf, buflen);
+}
+
+/**
 * sg_phys - Return physical address of an sg entry
 * @sg:	 SG entry
 *
Yes please submit this patch. scsi-ml is full of sg_init_one, especially on the error recovery path.
Thanks Boaz
Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops
On Tue, Oct 30 2007, Boaz Harrosh wrote: On Mon, Oct 29 2007 at 22:16 +0200, Jens Axboe [EMAIL PROTECTED] wrote: On Fri, Oct 26 2007, Herbert Xu wrote: [CRYPTO] tcrypt: Move sg_init_table out of timing loops This patch moves the sg_init_table out of the timing loops for hash algorithms so that it doesn't impact on the speed test results. Wouldn't it be better to just make sg_init_one() call sg_init_table?
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4571231..ccc55a6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist *sg)
 }
 
 /**
- * sg_init_one - Initialize a single entry sg list
- * @sg:	 SG entry
- * @buf:	Virtual address for IO
- * @buflen:	 IO length
- *
- * Notes:
- *   This should not be used on a single entry that is part of a larger
- *   table. Use sg_init_table() for that.
- *
- **/
-static inline void sg_init_one(struct scatterlist *sg, const void *buf,
-			       unsigned int buflen)
-{
-	memset(sg, 0, sizeof(*sg));
-#ifdef CONFIG_DEBUG_SG
-	sg->sg_magic = SG_MAGIC;
-#endif
-	sg_mark_end(sg, 1);
-	sg_set_buf(sg, buf, buflen);
-}
-
-/**
 * sg_init_table - Initialize SG table
 * @sgl:	 The SG table
 * @nents:	Number of entries in table
@@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist *sgl, unsigned int nents)
 }
 
 /**
+ * sg_init_one - Initialize a single entry sg list
+ * @sg:	 SG entry
+ * @buf:	Virtual address for IO
+ * @buflen:	 IO length
+ *
+ * Notes:
+ *   This should not be used on a single entry that is part of a larger
+ *   table. Use sg_init_table() for that.
+ *
+ **/
+static inline void sg_init_one(struct scatterlist *sg, const void *buf,
+			       unsigned int buflen)
+{
+	sg_init_table(sg, 1);
+	sg_set_buf(sg, buf, buflen);
+}
+
+/**
 * sg_phys - Return physical address of an sg entry
 * @sg:	 SG entry
 *
Yes please submit this patch. scsi-ml is full of sg_init_one, especially on the error recovery path. Will do.
-- Jens Axboe
Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines
On Tue, Oct 30, 2007 at 03:44:59AM -0400, Luis R. Rodriguez wrote: On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote: On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote: This commit made an incorrect assumption: -- Author: Lennert Buytenhek [EMAIL PROTECTED] Date: Fri Oct 19 04:10:10 2007 +0200 mv643xx_eth: Move ethernet register definitions into private header Move the mv643xx's ethernet-related register definitions from include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since they aren't of any use outside the ethernet driver. Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Acked-by: Tzachi Perelstein [EMAIL PROTECTED] Signed-off-by: Dale Farnsworth [EMAIL PROTECTED] -- arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there. [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe v2.6.24-rc1-138-g0119130 This patch fixes this by internalizing 3 defines onto pegasos which are simply no longer available elsewhere. Without this your compile will fail. That compile failure was fixed in commit 30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro. However, as I examine that commit, I see that it defines offsets from the eth block in the chip, rather than the full chip register block as the Pegasos 2 code expects. So, I think it fixes the compile failure, but leaves the Pegasos 2 broken. Luis, do you have Pegasos 2 hardware? Can you (or anyone) verify that the following patch is needed for the Pegasos 2? Nope, sorry. I am busy right now, but have various pegasos machines available for testing. What exactly should I test? Friendly, Sven Luther
Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines
On Tue, Oct 30, 2007 at 10:36:06AM +0100, Sven Luther wrote: On Tue, Oct 30, 2007 at 03:44:59AM -0400, Luis R. Rodriguez wrote: On 10/29/07, Dale Farnsworth [EMAIL PROTECTED] wrote: On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote: This commit made an incorrect assumption: -- Author: Lennert Buytenhek [EMAIL PROTECTED] Date: Fri Oct 19 04:10:10 2007 +0200 mv643xx_eth: Move ethernet register definitions into private header Move the mv643xx's ethernet-related register definitions from include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since they aren't of any use outside the ethernet driver. Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Acked-by: Tzachi Perelstein [EMAIL PROTECTED] Signed-off-by: Dale Farnsworth [EMAIL PROTECTED] -- arch/powerpc/platforms/chrp/pegasos_eth.c made use of 3 defines there. [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe v2.6.24-rc1-138-g0119130 This patch fixes this by internalizing 3 defines onto pegasos which are simply no longer available elsewhere. Without this your compile will fail. That compile failure was fixed in commit 30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro. However, as I examine that commit, I see that it defines offsets from the eth block in the chip, rather than the full chip register block as the Pegasos 2 code expects. So, I think it fixes the compile failure, but leaves the Pegasos 2 broken. Luis, do you have Pegasos 2 hardware? Can you (or anyone) verify that the following patch is needed for the Pegasos 2? Nope, sorry. I am busy right now, but have various pegasos machines available for testing. What exactly should I test? Thanks, Sven. Test whether an Ethernet port works at all. I think it's currently broken, but should work with the patch I supplied. -Dale
[PATCH] core: fix free_netdev when register fails during notification call chain
Point 1: The unregistering of a network device schedules netdev_run_todo. This function calls dev->destructor when it is set, and the destructor calls free_netdev.

Point 2: In the case of an initialization of a network device the usual code is: * alloc_netdev * register_netdev -> if this one fails, call free_netdev and exit with error.

Point 3: In the register_netdevice function, at the later stage when the device is in the registered state, a call to the netdevice notifiers is made. If one of the notifications returns an error, a rollback to the registered state is done using unregister_netdevice.

Conclusion: When a network device fails to register during initialization because one network subsystem returned an error during a notification call chain, the network device is freed twice because of point 1 and point 2. The second free_netdev will be done with an invalid pointer.

Proposed solution: The following patch moves all the code of unregister_netdevice *except* the call to net_set_todo, to a new function rollback_registered. The following functions are changed in this way: * register_netdevice: calls rollback_registered when a notification fails * unregister_netdevice: calls rollback_registered + net_set_todo; the call order is changed so that net_set_todo comes last. Since it just adds an element to a list, that should not break anything.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/core/dev.c | 112 ++--- 1 file changed, 59 insertions(+), 53 deletions(-)
Index: net-2.6/net/core/dev.c
===
--- net-2.6.orig/net/core/dev.c
+++ net-2.6/net/core/dev.c
@@ -3496,6 +3496,60 @@ static void net_set_todo(struct net_device *dev)
 	spin_unlock(&net_todo_list_lock);
 }
 
+static void rollback_registered(struct net_device *dev)
+{
+	BUG_ON(dev_boot_phase);
+	ASSERT_RTNL();
+
+	/* Some devices call without registering for initialization unwind.
	 */
+	if (dev->reg_state == NETREG_UNINITIALIZED) {
+		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
+			"was registered\n", dev->name, dev);
+
+		WARN_ON(1);
+		return;
+	}
+
+	BUG_ON(dev->reg_state != NETREG_REGISTERED);
+
+	/* If device is running, close it first. */
+	dev_close(dev);
+
+	/* And unlink it from device chain. */
+	unlist_netdevice(dev);
+
+	dev->reg_state = NETREG_UNREGISTERING;
+
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+
+	/* Notify protocols, that we are about to destroy
+	   this device. They should clean all the things.
+	*/
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+
+	/*
+	 *	Flush the unicast and multicast chains
+	 */
+	dev_addr_discard(dev);
+
+	if (dev->uninit)
+		dev->uninit(dev);
+
+	/* Notifier chain MUST detach us from master device. */
+	BUG_TRAP(!dev->master);
+
+	/* Remove entries from kobject tree */
+	netdev_unregister_kobject(dev);
+
+	synchronize_net();
+
+	dev_put(dev);
+}
+
 /**
 *	register_netdevice	- register a network device
 *	@dev: device to register
@@ -3633,8 +3687,10 @@ int register_netdevice(struct net_device
 	/* Notify protocols, that a new device appeared. */
 	ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
 	ret = notifier_to_errno(ret);
-	if (ret)
-		unregister_netdevice(dev);
+	if (ret) {
+		rollback_registered(dev);
+		dev->reg_state = NETREG_UNREGISTERED;
+	}
 
 out:
 	return ret;
@@ -3911,59 +3967,9 @@ void synchronize_net(void)
 
 void unregister_netdevice(struct net_device *dev)
 {
-	BUG_ON(dev_boot_phase);
-	ASSERT_RTNL();
-
-	/* Some devices call without registering for initialization unwind. */
-	if (dev->reg_state == NETREG_UNINITIALIZED) {
-		printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
-			"was registered\n", dev->name, dev);
-
-		WARN_ON(1);
-		return;
-	}
-
-	BUG_ON(dev->reg_state != NETREG_REGISTERED);
-
-	/* If device is running, close it first. */
-	dev_close(dev);
-
-	/* And unlink it from device chain.
	 */
-	unlist_netdevice(dev);
-
-	dev->reg_state = NETREG_UNREGISTERING;
-
-	synchronize_net();
-
-	/* Shutdown queueing discipline. */
-	dev_shutdown(dev);
-
-
-	/* Notify protocols, that we are about to destroy
-	   this device. They should clean all the things.
-	*/
-	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
-
-	/*
-	 *	Flush the unicast and multicast chains
-	 */
-	dev_addr_discard(dev);
-
-	if (dev->uninit)
-		dev->uninit(dev);
-
-	/* Notifier chain MUST detach us from master device. */
-	BUG_TRAP(!dev->master);
-
-	/* Remove entries from kobject tree */
-	netdev_unregister_kobject(dev);
-
+
[PATCH][NETNS] fix net released by rcu callback
When a network namespace reference is held by a network subsystem, and when this reference is decremented in an rcu update callback, we must ensure that there are no outstanding rcu updates before trying to free the network namespace. In the normal case, the rcu_barrier is called when the network namespace is exiting in the cleanup_net function. But when a network namespace creation fails, and the subsystems are undone (like the cleanup), the rcu_barrier is missing. This patch adds the missing rcu_barrier. Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/core/net_namespace.c | 2 ++ 1 file changed, 2 insertions(+)
Index: net-2.6/net/core/net_namespace.c
===
--- net-2.6.orig/net/core/net_namespace.c
+++ net-2.6/net/core/net_namespace.c
@@ -112,6 +112,8 @@ out_undo:
 		if (ops->exit)
 			ops->exit(net);
 	}
+
+	rcu_barrier();
 	goto out;
 }
[PATCH 2.6.24] ixgb: TX hangs under heavy load
Auke, It has become clear that this patch resolves some tx-lockups on the ixgb driver. IBM did some checking and realized this hunk is in your sourceforge driver, but not anywhere else. Mind if we add it? Thanks, -andy Signed-off-by: Andy Gospodarek [EMAIL PROTECTED] --- ixgb_main.c | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index d444de5..3ec7a41 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1324,7 +1324,7 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb,
 			/* Workaround for premature desc write-backs
 			 * in TSO mode. Append 4-byte sentinel desc */
-			if (unlikely(mss && !nr_frags && size == len
+			if (unlikely(mss && (f == (nr_frags-1)) && size == len
 				     && size > 8))
 				size -= 4;
[IPV6] cleanup: remove proc_net_remove called twice
The file /proc/net/if_inet6 is removed twice. First time in: inet6_exit -> addrconf_cleanup And followed a few lines after by: inet6_exit -> if6_proc_exit Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/ipv6/addrconf.c | 4 1 file changed, 4 deletions(-)
Index: net-2.6/net/ipv6/addrconf.c
===
--- net-2.6.orig/net/ipv6/addrconf.c
+++ net-2.6/net/ipv6/addrconf.c
@@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void)
 	del_timer(&addr_chk_timer);
 	rtnl_unlock();
-
-#ifdef CONFIG_PROC_FS
-	proc_net_remove(&init_net, "if_inet6");
-#endif
 }
[PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
One proc_net_create() user less. Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED] --- net/ipv6/route.c | 70 +++ 1 file changed, 25 insertions(+), 45 deletions(-)
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2288,71 +2288,49 @@ struct rt6_proc_arg
 static int rt6_info_route(struct rt6_info *rt, void *p_arg)
 {
-	struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
+	struct seq_file *m = p_arg;
 
-	if (arg->skip < arg->offset / RT6_INFO_LEN) {
-		arg->skip++;
-		return 0;
-	}
-
-	if (arg->len >= arg->length)
-		return 0;
-
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_dst.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_dst.addr),
 		   rt->rt6i_dst.plen);
 
 #ifdef CONFIG_IPV6_SUBTREES
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_src.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_src.addr),
 		   rt->rt6i_src.plen);
 #else
-	arg->len += sprintf(arg->buffer + arg->len,
-			    "00 ");
+	seq_puts(m, "00 ");
 #endif
 
 	if (rt->rt6i_nexthop) {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    NIP6_SEQFMT,
+		seq_printf(m, NIP6_SEQFMT,
 			   NIP6(*((struct in6_addr *)rt->rt6i_nexthop->primary_key)));
 	} else {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    "");
+		seq_puts(m, "");
 	}
-	arg->len += sprintf(arg->buffer + arg->len,
-			    " %08x %08x %08x %08x %8s\n",
+	seq_printf(m, " %08x %08x %08x %08x %8s\n",
 		   rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
 		   rt->u.dst.__use, rt->rt6i_flags,
 		   rt->rt6i_dev ?
rt->rt6i_dev->name : "");
 	return 0;
 }
 
-static int rt6_proc_info(char *buffer, char **start, off_t offset, int length)
+static int ipv6_route_show(struct seq_file *m, void *v)
 {
-	struct rt6_proc_arg arg = {
-		.buffer = buffer,
-		.offset = offset,
-		.length = length,
-	};
-
-	fib6_clean_all(rt6_info_route, 0, &arg);
-
-	*start = buffer;
-	if (offset)
-		*start += offset % RT6_INFO_LEN;
-
-	arg.len -= offset % RT6_INFO_LEN;
-
-	if (arg.len > length)
-		arg.len = length;
-	if (arg.len < 0)
-		arg.len = 0;
+	fib6_clean_all(rt6_info_route, 0, m);
+	return 0;
+}
 
-	return arg.len;
+static int ipv6_route_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ipv6_route_show, NULL);
 }
 
+static const struct file_operations ipv6_route_proc_fops = {
+	.open		= ipv6_route_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
@@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
 	fib6_init();
 
 #ifdef CONFIG_PROC_FS
-	p = proc_net_create(&init_net, "ipv6_route", 0, rt6_proc_info);
-	if (p)
+	p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
+	if (p) {
 		p->owner = THIS_MODULE;
+		p->proc_fops = &ipv6_route_proc_fops;
+	}
 
 	proc_net_fops_create(&init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif
[PATCH 2/2] Remove /proc/net/ip_vs_lblcr
It's under the CONFIG_IP_VS_LBLCR_DEBUG option, which never existed.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---
I can convert it to seq_file if anyone is secretly using it.

 net/ipv4/ipvs/ip_vs_lblcr.c |   76 --------------------------------------
 1 file changed, 76 deletions(-)

--- a/net/ipv4/ipvs/ip_vs_lblcr.c
+++ b/net/ipv4/ipvs/ip_vs_lblcr.c
@@ -48,8 +48,6 @@
 /* for sysctl */
 #include <linux/fs.h>
 #include <linux/sysctl.h>
-/* for proc_net_create/proc_net_remove */
-#include <linux/proc_fs.h>
 #include <net/net_namespace.h>
 
 #include <net/ip_vs.h>
@@ -547,71 +545,6 @@ static void ip_vs_lblcr_check_expire(unsigned long data)
 	mod_timer(&tbl->periodic_timer, jiffies+CHECK_EXPIRE_INTERVAL);
 }
 
-
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-static struct ip_vs_lblcr_table *lblcr_table_list;
-
-/*
- *	/proc/net/ip_vs_lblcr to display the mappings of
- *	destination IP address <==> its serverSet
- */
-static int
-ip_vs_lblcr_getinfo(char *buffer, char **start, off_t offset, int length)
-{
-	off_t pos = 0, begin;
-	int len = 0, size;
-	struct ip_vs_lblcr_table *tbl;
-	unsigned long now = jiffies;
-	int i;
-	struct ip_vs_lblcr_entry *en;
-
-	tbl = lblcr_table_list;
-
-	size = sprintf(buffer, "LastTime Dest IP address  Server set\n");
-	pos += size;
-	len += size;
-
-	for (i = 0; i < IP_VS_LBLCR_TAB_SIZE; i++) {
-		read_lock_bh(&tbl->lock);
-		list_for_each_entry(en, &tbl->bucket[i], list) {
-			char tbuf[16];
-			struct ip_vs_dest_list *d;
-
-			sprintf(tbuf, "%u.%u.%u.%u", NIPQUAD(en->addr));
-			size = sprintf(buffer+len, "%8lu %-16s ",
-				       now-en->lastuse, tbuf);
-
-			read_lock(&en->set.lock);
-			for (d = en->set.list; d != NULL; d = d->next) {
-				size += sprintf(buffer+len+size,
-						"%u.%u.%u.%u ",
-						NIPQUAD(d->dest->addr));
-			}
-			read_unlock(&en->set.lock);
-			size += sprintf(buffer+len+size, "\n");
-			len += size;
-			pos += size;
-			if (pos <= offset)
-				len = 0;
-			if (pos >= offset+length) {
-				read_unlock_bh(&tbl->lock);
-				goto done;
-			}
-		}
-		read_unlock_bh(&tbl->lock);
-	}
-
-  done:
-	begin = len - (pos - offset);
-	*start = buffer + begin;
-	len -= begin;
-	if (len > length)
-		len = length;
-	return len;
-}
-#endif
-
-
 static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
 {
 	int i;
@@ -650,9 +583,6 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
 	tbl->periodic_timer.expires = jiffies+CHECK_EXPIRE_INTERVAL;
 	add_timer(&tbl->periodic_timer);
 
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	lblcr_table_list = tbl;
-#endif
 	return 0;
 }
 
@@ -843,18 +773,12 @@ static int __init ip_vs_lblcr_init(void)
 {
 	INIT_LIST_HEAD(&ip_vs_lblcr_scheduler.n_list);
 	sysctl_header = register_sysctl_table(lblcr_root_table);
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	proc_net_create(init_net, "ip_vs_lblcr", 0, ip_vs_lblcr_getinfo);
-#endif
	return register_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
 }
 
 static void __exit ip_vs_lblcr_cleanup(void)
 {
-#ifdef CONFIG_IP_VS_LBLCR_DEBUG
-	proc_net_remove(init_net, "ip_vs_lblcr");
-#endif
 	unregister_sysctl_table(sysctl_header);
 	unregister_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
 }
Re: [PATCH] net: Saner thash_entries default with much memory
Hi David,

On Tuesday 30 October 2007, David Miller wrote:
From: Andi Kleen [EMAIL PROTECTED]
Date: Fri, 26 Oct 2007 17:34:17 +0200
On Fri, Oct 26, 2007 at 05:21:31PM +0200, Jean Delvare wrote:
I propose 2 million entries as the arbitrary high limit.
It's probably still far too large.
I agree. Perhaps a better number is something on the order of (512 * 1024), so I think I'll check in a variant of Jean's patch with just the limit decreased like that.

That's very fine with me. I originally proposed an admittedly high limit value to increase the chance of seeing it accepted. I am not familiar enough with networking to know what a more reasonable limit would be, so I'm leaving it to the experts.

Using just some back-of-the-envelope calculations, on UP 64-bit systems each socket uses about 2424 bytes minimum of memory (this is the sum of tcp_sock, inode, dentry, socket, and file on sparc64 UP). This is an underestimate because it does not even consider things like allocator overhead.

Next, machines that service that many sockets typically have them mostly with full transmit queues, talking to a very slow receiver at the other end. So let's estimate that on average each socket consumes about 64K of retransmit queue data. I think this is an extremely conservative estimate because it doesn't even consider overhead coming from struct sk_buff and related state.

So for (512 * 1024) established sockets we consume roughly 35GB of memory; this is ((2424 + (64 * 1024)) * (512 * 1024)).

So to me (512 * 1024) is a very reasonable limit, and (with lockdep and spinlock debugging disabled) this makes the EHASH table consume 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.

OK, let's go with (512 * 1024) then. Want me to send an updated patch?

Thanks,
--
Jean Delvare
Suse L3
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
Alexey Dobriyan wrote:
One proc_net_create() user less.

Funny, I was working on a similar patch. See comment below.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---
 net/ipv6/route.c |   70 ++++++++++++++++++-----------------------------
 1 file changed, 25 insertions(+), 45 deletions(-)

--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2288,71 +2288,49 @@ struct rt6_proc_arg
 static int rt6_info_route(struct rt6_info *rt, void *p_arg)
 {
-	struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
+	struct seq_file *m = p_arg;
 
-	if (arg->skip < arg->offset / RT6_INFO_LEN) {
-		arg->skip++;
-		return 0;
-	}
-
-	if (arg->len >= arg->length)
-		return 0;
-
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_dst.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_dst.addr),
 		   rt->rt6i_dst.plen);
 
 #ifdef CONFIG_IPV6_SUBTREES
-	arg->len += sprintf(arg->buffer + arg->len,
-			    NIP6_SEQFMT " %02x ",
-			    NIP6(rt->rt6i_src.addr),
+	seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_src.addr),
 		   rt->rt6i_src.plen);
 #else
-	arg->len += sprintf(arg->buffer + arg->len,
-			    "00000000000000000000000000000000 00 ");
+	seq_puts(m, "00000000000000000000000000000000 00 ");
 #endif
 
 	if (rt->rt6i_nexthop) {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    NIP6_SEQFMT,
+		seq_printf(m, NIP6_SEQFMT,
 			   NIP6(*((struct in6_addr *)rt->rt6i_nexthop->primary_key)));
 	} else {
-		arg->len += sprintf(arg->buffer + arg->len,
-				    "00000000000000000000000000000000");
	}
-	arg->len += sprintf(arg->buffer + arg->len,
-			    " %08x %08x %08x %08x %8s\n",
+	seq_printf(m, " %08x %08x %08x %08x %8s\n",
 		   rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
 		   rt->u.dst.__use, rt->rt6i_flags,
 		   rt->rt6i_dev ? rt->rt6i_dev->name : "");
 	return 0;
 }
 
-static int rt6_proc_info(char *buffer, char **start, off_t offset, int length)
+static int ipv6_route_show(struct seq_file *m, void *v)
 {
-	struct rt6_proc_arg arg = {
-		.buffer = buffer,
-		.offset = offset,
-		.length = length,
-	};
-
-	fib6_clean_all(rt6_info_route, 0, &arg);
-
-	*start = buffer;
-	if (offset)
-		*start += offset % RT6_INFO_LEN;
-
-	arg.len -= offset % RT6_INFO_LEN;
-
-	if (arg.len > length)
-		arg.len = length;
-	if (arg.len < 0)
-		arg.len = 0;
+	fib6_clean_all(rt6_info_route, 0, m);
+	return 0;
+}
 
-	return arg.len;
+static int ipv6_route_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ipv6_route_show, NULL);
 }
 
+static const struct file_operations ipv6_route_proc_fops = {
+	.open		= ipv6_route_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
@@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
 	fib6_init();
 #ifdef CONFIG_PROC_FS
-	p = proc_net_create(init_net, "ipv6_route", 0, rt6_proc_info);
-	if (p)
+	p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
+	if (p) {
 		p->owner = THIS_MODULE;
+		p->proc_fops = &ipv6_route_proc_fops;
+	}

You should use proc_net_fops_create() instead of the above code. It does the same thing.

Otherwise the patch looks fine to me. Tested on i386.

Benjamin

 	proc_net_fops_create(init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif

--
B e n j a m i n   T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
Cosmetic comment: I forgot to say there are a few indentation errors when I apply your patch. See below.

Benjamin Thery wrote:
Alexey Dobriyan wrote:
One proc_net_create() user less.

Funny, I was working on a similar patch. See comment below.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---
 net/ipv6/route.c |   70 ++++++++++++++++++-----------------------------
 1 file changed, 25 insertions(+), 45 deletions(-)

--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2288,71 +2288,49 @@ struct rt6_proc_arg
static int rt6_info_route(struct rt6_info *rt, void *p_arg)
{
-struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg;
+struct seq_file *m = p_arg;

-if (arg->skip < arg->offset / RT6_INFO_LEN) {
-arg->skip++;
-return 0;
-}
-
-if (arg->len >= arg->length)
-return 0;
-
-arg->len += sprintf(arg->buffer + arg->len,
-NIP6_SEQFMT " %02x ",
-NIP6(rt->rt6i_dst.addr),
+seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_dst.addr),
rt->rt6i_dst.plen);

#ifdef CONFIG_IPV6_SUBTREES
-arg->len += sprintf(arg->buffer + arg->len,
-NIP6_SEQFMT " %02x ",
-NIP6(rt->rt6i_src.addr),
+seq_printf(m, NIP6_SEQFMT " %02x ", NIP6(rt->rt6i_src.addr),
rt->rt6i_src.plen);

Indent is wrong for the above line.

#else
-arg->len += sprintf(arg->buffer + arg->len,
-"00000000000000000000000000000000 00 ");
+seq_puts(m, "00000000000000000000000000000000 00 ");
#endif

if (rt->rt6i_nexthop) {
-arg->len += sprintf(arg->buffer + arg->len,
-NIP6_SEQFMT,
+seq_printf(m, NIP6_SEQFMT,
NIP6(*((struct in6_addr *)rt->rt6i_nexthop->primary_key)));

Idem.

} else {
-arg->len += sprintf(arg->buffer + arg->len,
-"00000000000000000000000000000000");
+seq_puts(m, "00000000000000000000000000000000");
}
-arg->len += sprintf(arg->buffer + arg->len,
-" %08x %08x %08x %08x %8s\n",
+seq_printf(m, " %08x %08x %08x %08x %8s\n",
rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
rt->u.dst.__use, rt->rt6i_flags,
rt->rt6i_dev ? rt->rt6i_dev->name : "");

Indent of the 3 above lines.

return 0;
}

-static int rt6_proc_info(char *buffer, char **start, off_t offset, int length)
+static int ipv6_route_show(struct seq_file *m, void *v)
{
-struct rt6_proc_arg arg = {
-.buffer = buffer,
-.offset = offset,
-.length = length,
-};
-
-fib6_clean_all(rt6_info_route, 0, &arg);
-
-*start = buffer;
-if (offset)
-*start += offset % RT6_INFO_LEN;
-
-arg.len -= offset % RT6_INFO_LEN;
-
-if (arg.len > length)
-arg.len = length;
-if (arg.len < 0)
-arg.len = 0;
+fib6_clean_all(rt6_info_route, 0, m);
+return 0;
+}

-return arg.len;
+static int ipv6_route_open(struct inode *inode, struct file *file)
+{
+return single_open(file, ipv6_route_show, NULL);
}

+static const struct file_operations ipv6_route_proc_fops = {
+.open		= ipv6_route_open,
+.read		= seq_read,
+.llseek		= seq_lseek,
+.release	= single_release,
+};
+
static int rt6_stats_seq_show(struct seq_file *seq, void *v)
{
seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
@@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
fib6_init();
#ifdef CONFIG_PROC_FS
-p = proc_net_create(init_net, "ipv6_route", 0, rt6_proc_info);
-if (p)
+p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
+if (p) {
p->owner = THIS_MODULE;
+p->proc_fops = &ipv6_route_proc_fops;
+}

You should use proc_net_fops_create() instead of the above code. It does the same thing.

Otherwise the patch looks fine to me. Tested on i386.

Benjamin

proc_net_fops_create(init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
#endif

--
B e n j a m i n   T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: Configuring the same IP on multiple addresses
David Miller wrote:
From: David Miller [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 15:25:59 -0700 (PDT)
Can you guys please just state upfront what virtualization issue is made more difficult by features you want to remove?
Sorry, I mentioned virtualization because that's been the largest majority of the cases being presented lately.
Nope, not virtualization.
I suspect in your case it's some multicast or SCTP thing :-)

Neither of these really either, although I should try to see how SCTP behaves in this configuration. As Brian said, a customer asked us a question, and we didn't know the history. No one is trying to remove functionality or features. We'd just like to know the "why", and the answer of "why not" doesn't fly very well. Although in the IPv6 case, there might be issues.

-vlad
[PATCH] DM9601: Support for ADMtek ADM8515 NIC
Add device ID for the ADMtek ADM8515 USB NIC to the DM9601 driver.

Signed-off-by: Peter Korsgaard [EMAIL PROTECTED]

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index a2de32f..2c68573 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -586,6 +586,10 @@ static const struct usb_device_id products[] = {
 	 USB_DEVICE(0x0a46, 0x0268),	/* ShanTou ST268 USB NIC */
 	 .driver_info = (unsigned long)&dm9601_info,
 	 },
+	{
+	 USB_DEVICE(0x0a46, 0x8515),	/* ADMtek ADM8515 USB NIC */
+	 .driver_info = (unsigned long)&dm9601_info,
+	 },
 	{},			// END
 };
--
1.5.3.4

--
Bye, Peter Korsgaard
Re: Oops in 2.6.21-rc4, 2.6.23
On Mon, Oct 29, 2007 at 01:41:47AM -0700, David Miller wrote: ... Actually, this was caused by a real bug in the SKB_WITH_OVERHEAD macro definition, which Herbert Xu quickly spotted and fixed. Which I hope you've found this by yourself by now. ...Btw, of course you have to be right, and I should find this in max. 12 days yet, if I'm as smart as I hope. But as for now, I really can't see any meaningful difference between this buggy SKB_WITH_OVERHEAD version and 'generic' 2.6.20. There is also a tiny doubt, how this all could influence 2.6.21-rc4, which seems to be 'generic' here as well. I guess it has to be some git issue... the more so, as I can't see there this other (bisected) patch as well?! Then, of course, this could be my sight issue - but then these 12 days are definitely not enough... Cheers, Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] nf_nat_h323.c unneeded rcu_dereference() calls
Paul E. McKenney wrote:
Hello! While reviewing rcu_dereference() uses, I came across a number of cases where I couldn't see how the rcu_dereference() helped. One class of cases is where the variable is never subsequently dereferenced, so that patches like the following one would be appropriate. So, what am I missing here?

Nothing, it was mainly intended as documentation that the hooks are protected by RCU. I agree that it's probably more confusing this way since we're not even in an rcu_read_lock protected section. I've queued a patch to remove them all.
Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops
On Tue, Oct 30, 2007 at 06:50:58AM +0100, Jens Axboe wrote:
How so? The reason you changed it to sg_init_table() + sg_set_buf() is exactly because sg_init_one() didn't properly init the entry (as the name promised).

For one of the cases yes, but the other one repeatedly calls sg_init_one on the same sg entry while we really only need to initialise it once and call sg_set_buf afterwards.

Normally this is irrelevant, but the loops in question are trying to estimate the speed of the algorithms, so it's good to exclude as much noise from them as possible.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops
On Tue, Oct 30 2007, Herbert Xu wrote:
On Tue, Oct 30, 2007 at 06:50:58AM +0100, Jens Axboe wrote:
How so? The reason you changed it to sg_init_table() + sg_set_buf() is exactly because sg_init_one() didn't properly init the entry (as the name promised).
For one of the cases yes, but the other one repeatedly calls sg_init_one on the same sg entry while we really only need to initialise it once and call sg_set_buf afterwards. Normally this is irrelevant, but the loops in question are trying to estimate the speed of the algorithms, so it's good to exclude as much noise from them as possible.

Ah OK, I was referring to the replacement mentioned above.

--
Jens Axboe
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
On Tue, 30 Oct 2007 16:11:47 +0300
Alexey Dobriyan [EMAIL PROTECTED] wrote:

+static const struct file_operations ipv6_route_proc_fops = {
+	.open		= ipv6_route_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+

This needs:
	.owner = THIS_MODULE,

 static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n",
@@ -2499,9 +2477,11 @@ void __init ip6_route_init(void)
 	fib6_init();
 #ifdef CONFIG_PROC_FS
-	p = proc_net_create(init_net, "ipv6_route", 0, rt6_proc_info);
-	if (p)
+	p = create_proc_entry("ipv6_route", 0, init_net.proc_net);
+	if (p) {
 		p->owner = THIS_MODULE;
+		p->proc_fops = &ipv6_route_proc_fops;
+	}
 
 	proc_net_fops_create(init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif

Use proc_net_fops_create():
	proc_net_fops_create(init_net, "ipv6_route", S_IRUGO, &ipv6_route_proc_fops);

You can get rid of the #ifdef since the proc_net_fops_create() stub does the correct thing if PROC_FS is not configured.

--
Stephen Hemminger [EMAIL PROTECTED]
Re: dn_route.c momentarily exiting RCU read-side critical section
On Tue, Oct 30, 2007 at 01:10:36AM -0700, David Miller wrote:
From: Paul E. McKenney [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 14:15:40 -0700
net/decnet/dn_route.c in dn_rt_cache_get_next() is as follows:

static struct dn_route *dn_rt_cache_get_next(struct seq_file *seq,
					     struct dn_route *rt)
{
	struct dn_rt_cache_iter_state *s = rcu_dereference(seq->private);
	rt = rt->u.dst.dn_next;
	while (!rt) {
		rcu_read_unlock_bh();
		if (--s->bucket < 0)
			break;
	...

But what happens if seq->private is freed up right here? ... Or what prevents this from happening? ... Similar code is in rt_cache_get_next(). So, what am I missing here?

seq->private is allocated on file open (here via seq_open_private()), and freed up on file close (via seq_release_private). So it cannot be freed up in the middle of an iteration.

Thank you for the info!!! OK, for my next stupid question: why is the rcu_dereference(seq->private) required, as opposed to simply seq->private?

Thanx, Paul
Re: [PATCH] nf_nat_h323.c unneeded rcu_dereference() calls
On Tue, Oct 30, 2007 at 03:06:20PM +0100, Patrick McHardy wrote:
Paul E. McKenney wrote:
Hello! While reviewing rcu_dereference() uses, I came across a number of cases where I couldn't see how the rcu_dereference() helped. One class of cases is where the variable is never subsequently dereferenced, so that patches like the following one would be appropriate. So, what am I missing here?
Nothing, it was mainly intended as documentation that the hooks are protected by RCU. I agree that it's probably more confusing this way since we're not even in an rcu_read_lock protected section. I've queued a patch to remove them all.

Thank you!!!

Thanx, Paul
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
Stephen Hemminger wrote:
On Tue, 30 Oct 2007 16:11:47 +0300
Alexey Dobriyan [EMAIL PROTECTED] wrote:

+static const struct file_operations ipv6_route_proc_fops = {
+	.open		= ipv6_route_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+

This needs:
	.owner = THIS_MODULE,

Your ip_queue conversion patch was also missing this, I've fixed it up.
[PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS.
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);
--
[PATCH 05/33] mm: kmem_estimate_pages()
Provide a method to get the upper bound on the pages needed to allocate a given number of objects from a given kmem_cache.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/slab.h |    3 +
 mm/slub.c            |   82 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -94,6 +95,8 @@ int kmem_ptr_validate(struct kmem_cache
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2293,6 +2293,37 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long slabs;
+
+	if (WARN_ON(!s) || WARN_ON(!s->objects))
+		return 0;
+
+	slabs = DIV_ROUND_UP(objects, s->objects);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object.
+	 */
+	if (s->objects > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		slabs += num_online_cpus();
+	}
+
+	return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
 * Attempt to free all slabs on a node. Return the number of slabs we
 * were unable to free.
 */
@@ -2630,6 +2661,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous size.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_estimate_pages(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
 * kmem_cache_shrink removes empty slabs from the partial lists and sorts
 * the remaining slabs by the number of items in use. The slabs with the
 * most items in use come first. New allocations will then fill those up
--
[PATCH 24/33] mm: prepare swap entry methods for use in page methods
Move around the swap entry methods in preparation for use from page methods.

Also provide a function to obtain the swap_info_struct backing a swap cache page.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm.h      |    8 ++++
 include/linux/swap.h    |   48 ++++++++++++++++++++++++
 include/linux/swapops.h |   44 ----------------------
 mm/swapfile.c           |    1 +
 4 files changed, 57 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -573,6 +574,13 @@ static inline struct address_space *page
 	return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -80,6 +80,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+	swp_entry_t ret;
+
+	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+			(offset & SWP_OFFSET_MASK(ret));
+	return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t. The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+	return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+	return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
 * current->reclaim_state points to one of these when a task is running
 * memory reclaim
 */
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
 	return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+	return NULL;
+}
 #define can_share_swap_page(p)	(page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6/include/linux/swapops.h
===================================================================
--- linux-2.6.orig/include/linux/swapops.h
+++ linux-2.6/include/linux/swapops.h
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-	swp_entry_t ret;
-
-	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-			(offset & SWP_OFFSET_MASK(ret));
-	return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t. The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-	return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-	return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
 * Convert the arch-dependent pte representation of a swp_entry_t into an
 * arch-independent swp_entry_t.
 */
Index: linux-2.6/mm/swapfile.c
[PATCH 16/33] netvm: network reserve infrastructure
Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

  1) total network reserve
  2) network TX reserve
  3) protocol TX pages
  4) network RX reserve
  5) SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy manipulation. [2] and [4] are merely for aesthetic reasons. The TX pages reserve [3] is assumed bounded, since it is the upper bound of memory that can be used for sending pages (not quite true, but good enough). The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data against in the fallback path.

The consumers of these reserves are sockets marked with SOCK_MEMALLOC. Such sockets are to be used to service the VM (iow. to swap over). They must be handled kernel side; exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h | 35 +++- net/Kconfig | 3 + net/core/sock.c | 113 + 3 files changed, 150 insertions(+), 1 deletion(-) Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -50,6 +50,7 @@ #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/mm.h> #include <linux/security.h> +#include <linux/reserve.h> #include <linux/filter.h> @@ -397,6 +398,7 @@ enum sock_flags { SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */ SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ + SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */ }; static inline void sock_copy_flags(struct sock *nsk, struct sock *osk) @@ -419,9 +421,40 @@ static inline int sock_flag(struct sock return test_bit(flag, &sk->sk_flags); } +static inline int sk_has_memalloc(struct sock *sk) +{ + return sock_flag(sk, SOCK_MEMALLOC); +} + +/* + * Guestimate the per request queue TX upper bound. + * + * Max packet size is 64k, and we need to reserve that much since the data + * might need to bounce it. Double it to be on the safe side. + */ +#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE) + +extern atomic_t memalloc_socks; + +extern struct mem_reserve net_rx_reserve; +extern struct mem_reserve net_skb_reserve; + +static inline int sk_memalloc_socks(void) +{ + return atomic_read(&memalloc_socks); +} + +extern int rx_emergency_get(int bytes); +extern int rx_emergency_get_overcommit(int bytes); +extern void rx_emergency_put(int bytes); + +extern int sk_adjust_memalloc(int socks, long tx_reserve_pages); +extern int sk_set_memalloc(struct sock *sk); +extern int sk_clear_memalloc(struct sock *sk); + static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask) { - return gfp_mask; + return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC); } static inline void sk_acceptq_removed(struct sock *sk) Index: linux-2.6/net/core/sock.c === --- linux-2.6.orig/net/core/sock.c +++ linux-2.6/net/core/sock.c @@ -112,6 +112,7 @@ #include <linux/tcp.h> #include <linux/init.h> #include <linux/highmem.h> +#include <linux/reserve.h> #include <asm/uaccess.h> #include <asm/system.h> @@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly /* Maximal space eaten by iovec or ancillary data plus some space */ int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512); +atomic_t memalloc_socks; + +static struct mem_reserve net_reserve; +struct mem_reserve net_rx_reserve; +struct mem_reserve net_skb_reserve; +static struct mem_reserve net_tx_reserve; +static struct mem_reserve net_tx_pages; + +EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */ +EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */ + +/* + * is there room for another emergency packet?
+ */ +static int __rx_emergency_get(int bytes, bool overcommit) +{ + return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit); +} + +int rx_emergency_get(int bytes) +{ + return __rx_emergency_get(bytes, false); +} + +int rx_emergency_get_overcommit(int bytes) +{ + return __rx_emergency_get(bytes, true); +} + +void rx_emergency_put(int bytes) +{ + mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0); +} + +/** + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX + * @socks: number of new %SOCK_MEMALLOC sockets + * @tx_reserve_pages: number of pages to (un)reserve for TX + * + * This function adjusts the memalloc reserve based on system demand. + * The RX reserve is a limit, and only added once, not for each socket. + * + * NOTE: + *    @tx_reserve_pages is an upper-bound of memory used for TX, hence + *    we need not account the pages like we do for RX pages.
[PATCH 08/33] mm: emergency pool
Provide means to reserve a specific amount of pages. The emergency pool is separated from the min watermark because ALLOC_HARDER and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure a strict minimum. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mmzone.h |3 + mm/page_alloc.c| 82 +++-- mm/vmstat.c|6 +-- 3 files changed, 78 insertions(+), 13 deletions(-) Index: linux-2.6/include/linux/mmzone.h === --- linux-2.6.orig/include/linux/mmzone.h +++ linux-2.6/include/linux/mmzone.h @@ -213,7 +213,7 @@ enum zone_type { struct zone { /* Fields commonly accessed by the page allocator */ - unsigned long pages_min, pages_low, pages_high; + unsigned long pages_emerg, pages_min, pages_low, pages_high; /* * We don't know if the memory that we're going to allocate will be freeable * or/and it will be released eventually, so to avoid totally wasting several @@ -682,6 +682,7 @@ int sysctl_min_unmapped_ratio_sysctl_han struct file *, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); +int adjust_memalloc_reserve(int pages); extern int numa_zonelist_order_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO static DEFINE_SPINLOCK(min_free_lock); int min_free_kbytes = 1024; +static DEFINE_MUTEX(var_free_mutex); +int var_free_kbytes; unsigned long __meminitdata nr_kernel_pages; unsigned long __meminitdata nr_all_pages; @@ -1252,7 +1254,7 @@ int zone_watermark_ok(struct zone *z, in if (alloc_flags ALLOC_HARDER) min -= min / 4; - if (free_pages = min + z-lowmem_reserve[classzone_idx]) + if (free_pages = min + z-lowmem_reserve[classzone_idx] + z-pages_emerg) return 0; for (o = 0; o order; o++) { /* At the next order, this order's pages become unavailable */ 
@@ -1733,8 +1735,8 @@ nofail_alloc: nopage: if (!(gfp_mask __GFP_NOWARN) printk_ratelimit()) { printk(KERN_WARNING %s: page allocation failure. -order:%d, mode:0x%x\n, - p-comm, order, gfp_mask); +order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n, + p-comm, order, gfp_mask, alloc_flags, p-flags); dump_stack(); show_mem(); } @@ -1952,9 +1954,9 @@ void show_free_areas(void) \n, zone-name, K(zone_page_state(zone, NR_FREE_PAGES)), - K(zone-pages_min), - K(zone-pages_low), - K(zone-pages_high), + K(zone-pages_emerg + zone-pages_min), + K(zone-pages_emerg + zone-pages_low), + K(zone-pages_emerg + zone-pages_high), K(zone_page_state(zone, NR_ACTIVE)), K(zone_page_state(zone, NR_INACTIVE)), K(zone-present_pages), @@ -4113,7 +4115,7 @@ static void calculate_totalreserve_pages } /* we treat pages_high as reserved pages. */ - max += zone-pages_high; + max += zone-pages_high + zone-pages_emerg; if (max zone-present_pages) max = zone-present_pages; @@ -4170,7 +4172,8 @@ static void setup_per_zone_lowmem_reserv */ static void __setup_per_zone_pages_min(void) { - unsigned long pages_min = min_free_kbytes (PAGE_SHIFT - 10); + unsigned pages_min = min_free_kbytes (PAGE_SHIFT - 10); + unsigned pages_emerg = var_free_kbytes (PAGE_SHIFT - 10); unsigned long lowmem_pages = 0; struct zone *zone; unsigned long flags; @@ -4182,11 +4185,13 @@ static void __setup_per_zone_pages_min(v } for_each_zone(zone) { - u64 tmp; + u64 tmp, tmp_emerg; spin_lock_irqsave(zone-lru_lock, flags); tmp = (u64)pages_min * zone-present_pages; do_div(tmp, lowmem_pages); + tmp_emerg = (u64)pages_emerg * zone-present_pages; + do_div(tmp_emerg, lowmem_pages); if (is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't @@
[PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages
In order to teach filesystems to handle swap cache pages, two new page functions are introduced: pgoff_t page_file_index(struct page *); struct address_space *page_file_mapping(struct page *); page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE blocks. Like page-index is for mapped pages, this function also gives the correct index for PG_swapcache pages. page_file_mapping - gives the mapping backing the actual page; that is for swap cache pages it will give swap_file-f_mapping. page_offset() is modified to use page_file_index(), so that it will give the expected result, even for PG_swapcache pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm.h | 26 ++ include/linux/pagemap.h |2 +- 2 files changed, 27 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -13,6 +13,7 @@ #include linux/debug_locks.h #include linux/mm_types.h #include linux/swap.h +#include linux/fs.h struct mempolicy; struct anon_vma; @@ -581,6 +582,16 @@ static inline struct swap_info_struct *p return get_swap_info_struct(swp_type(swap)); } +static inline +struct address_space *page_file_mapping(struct page *page) +{ +#ifdef CONFIG_SWAP_FILE + if (unlikely(PageSwapCache(page))) + return page_swap_info(page)-swap_file-f_mapping; +#endif + return page-mapping; +} + static inline int PageAnon(struct page *page) { return ((unsigned long)page-mapping PAGE_MAPPING_ANON) != 0; @@ -598,6 +609,21 @@ static inline pgoff_t page_index(struct } /* + * Return the file index of the page. 
Regular pagecache pages use ->index + * whereas swapcache pages use swp_offset(->private) + */ +static inline pgoff_t page_file_index(struct page *page) +{ +#ifdef CONFIG_SWAP_FILE + if (unlikely(PageSwapCache(page))) { + swp_entry_t swap = { .val = page_private(page) }; + return swp_offset(swap); + } +#endif + return page->index; +} + +/* * The atomic page->_mapcount, like _count, starts from -1: * so that transitions both from it and to it can be tracked, * using atomic_inc_and_test and atomic_add_negative(-1). Index: linux-2.6/include/linux/pagemap.h === --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str */ static inline loff_t page_offset(struct page *page) { - return ((loff_t)page->index) << PAGE_CACHE_SHIFT; + return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; } static inline pgoff_t linear_page_index(struct vm_area_struct *vma, -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 30/33] nfs: swap vs nfs_writepage
For now just use the ->writepage() path for swap traffic. Trond would like to see a ->swap_page() or some such additional a_op. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/write.c | 23 +++ 1 file changed, 23 insertions(+) Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -336,6 +336,29 @@ static int nfs_do_writepage(struct page nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE); nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1); + if (unlikely(IS_SWAPFILE(inode))) { + struct rpc_cred *cred; + struct nfs_open_context *ctx; + int status; + + cred = rpcauth_lookupcred(NFS_CLIENT(inode)->cl_auth, 0); + if (IS_ERR(cred)) + return PTR_ERR(cred); + + ctx = nfs_find_open_context(inode, cred, FMODE_WRITE); + if (!ctx) + return -EBADF; + + status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page)); + + put_nfs_open_context(ctx); + + if (status < 0) { + nfs_set_pageerror(page); + return status; + } + } + nfs_pageio_cond_complete(pgio, page->index); return nfs_page_async_flush(pgio, page); } --
[PATCH 02/33] mm: tag reserve pages
Tag pages allocated from the reserves with a non-zero page->reserve. This allows us to distinguish and account reserve pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm_types.h |1 + mm/page_alloc.c |4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/mm_types.h === --- linux-2.6.orig/include/linux/mm_types.h +++ linux-2.6/include/linux/mm_types.h @@ -70,6 +70,7 @@ struct page { union { pgoff_t index; /* Our offset within mapping. */ void *freelist; /* SLUB: freelist req. slab lock */ + int reserve; /* page_alloc: page is a reserve page */ }; struct list_head lru; /* Pageout list, eg. active_list * protected by zone->lru_lock ! Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1448,8 +1448,10 @@ zonelist_scan: } page = buffered_rmqueue(zonelist, zone, order, gfp_mask); - if (page) + if (page) { + page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS); break; + } this_zone_full: if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); --
[PATCH 17/33] sysctl: propagate conv errors
Currently the conv routines can only generate -EINVAL; allow other errors to be propagated. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- kernel/sysctl.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) Index: linux-2.6/kernel/sysctl.c === --- linux-2.6.orig/kernel/sysctl.c +++ linux-2.6/kernel/sysctl.c @@ -1732,6 +1732,7 @@ static int __do_proc_dointvec(void *tbl_ int *i, vleft, first=1, neg, val; unsigned long lval; size_t left, len; + int ret = 0; char buf[TMPBUFLEN], *p; char __user *s = buffer; @@ -1787,14 +1788,16 @@ static int __do_proc_dointvec(void *tbl_ s += len; left -= len; - if (conv(&neg, &lval, i, 1, data)) + ret = conv(&neg, &lval, i, 1, data); + if (ret) break; } else { p = buf; if (!first) *p++ = '\t'; - if (conv(&neg, &lval, i, 0, data)) + ret = conv(&neg, &lval, i, 0, data); + if (ret) break; sprintf(p, "%s%lu", neg ? "-" : "", lval); @@ -1823,11 +1826,9 @@ static int __do_proc_dointvec(void *tbl_ left--; } } - if (write && first) - return -EINVAL; *lenp -= left; *ppos += *lenp; - return 0; + return ret; #undef TMPBUFLEN } --
[PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages
Replace all relevant occurences of page-index and page-mapping in the NFS client with the new page_file_index() and page_file_mapping() functions. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/file.c |8 fs/nfs/internal.h |7 --- fs/nfs/pagelist.c |6 +++--- fs/nfs/read.c |6 +++--- fs/nfs/write.c| 49 + 5 files changed, 39 insertions(+), 37 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -357,7 +357,7 @@ static void nfs_invalidate_page(struct p if (offset != 0) return; /* Cancel any unstarted writes on this page */ - nfs_wb_page_cancel(page-mapping-host, page); + nfs_wb_page_cancel(page_file_mapping(page)-host, page); } static int nfs_release_page(struct page *page, gfp_t gfp) @@ -368,7 +368,7 @@ static int nfs_release_page(struct page static int nfs_launder_page(struct page *page) { - return nfs_wb_page(page-mapping-host, page); + return nfs_wb_page(page_file_mapping(page)-host, page); } const struct address_space_operations nfs_file_aops = { @@ -397,13 +397,13 @@ static int nfs_vm_page_mkwrite(struct vm loff_t offset; lock_page(page); - mapping = page-mapping; + mapping = page_file_mapping(page); if (mapping != vma-vm_file-f_path.dentry-d_inode-i_mapping) { unlock_page(page); return -EINVAL; } pagelen = nfs_page_length(page); - offset = (loff_t)page-index PAGE_CACHE_SHIFT; + offset = (loff_t)page_file_index(page) PAGE_CACHE_SHIFT; unlock_page(page); /* Index: linux-2.6/fs/nfs/pagelist.c === --- linux-2.6.orig/fs/nfs/pagelist.c +++ linux-2.6/fs/nfs/pagelist.c @@ -77,11 +77,11 @@ nfs_create_request(struct nfs_open_conte * update_nfs_request below if the region is not locked. 
*/ req-wb_page= page; atomic_set(req-wb_complete, 0); - req-wb_index = page-index; + req-wb_index = page_file_index(page); page_cache_get(page); BUG_ON(PagePrivate(page)); BUG_ON(!PageLocked(page)); - BUG_ON(page-mapping-host != inode); + BUG_ON(page_file_mapping(page)-host != inode); req-wb_offset = offset; req-wb_pgbase = offset; req-wb_bytes = count; @@ -383,7 +383,7 @@ void nfs_pageio_cond_complete(struct nfs * nfs_scan_list - Scan a list for matching requests * @nfsi: NFS inode * @dst: Destination list - * @idx_start: lower bound of page-index to scan + * @idx_start: lower bound of page_file_index(page) to scan * @npages: idx_start + npages sets the upper bound to scan. * @tag: tag to scan for * Index: linux-2.6/fs/nfs/read.c === --- linux-2.6.orig/fs/nfs/read.c +++ linux-2.6/fs/nfs/read.c @@ -460,11 +460,11 @@ static const struct rpc_call_ops nfs_rea int nfs_readpage(struct file *file, struct page *page) { struct nfs_open_context *ctx; - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; int error; dprintk(NFS: nfs_readpage (%p [EMAIL PROTECTED])\n, - page, PAGE_CACHE_SIZE, page-index); + page, PAGE_CACHE_SIZE, page_file_index(page)); nfs_inc_stats(inode, NFSIOS_VFSREADPAGE); nfs_add_stats(inode, NFSIOS_READPAGES, 1); @@ -511,7 +511,7 @@ static int readpage_async_filler(void *data, struct page *page) { struct nfs_readdesc *desc = (struct nfs_readdesc *)data; - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; struct nfs_page *new; unsigned int len; int error; Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re static struct nfs_page *nfs_page_find_request(struct page *page) { - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; struct nfs_page *req = NULL; spin_lock(inode-i_lock); @@ -138,13 +138,13 @@ static struct 
nfs_page *nfs_page_find_re /* Adjust the file length if we're writing beyond the end */ static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count) { - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; loff_t end, i_size = i_size_read(inode); pgoff_t end_index = (i_size - 1) PAGE_CACHE_SHIFT; - if (i_size 0 page-index end_index) + if (i_size 0
[PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs
Avoid reserve memory getting stuck waiting for user-space: drop all emergency packets that would be queued there. This of course requires the regular storage route to not include an NF_QUEUE target ;-) Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- net/netfilter/core.c |3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/net/netfilter/core.c === --- linux-2.6.orig/net/netfilter/core.c +++ linux-2.6/net/netfilter/core.c @@ -181,9 +181,12 @@ next_hook: ret = 1; goto unlock; } else if (verdict == NF_DROP) { +drop: kfree_skb(*pskb); ret = -EPERM; } else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) { + if (skb_emergency(*pskb)) + goto drop; NFDEBUG("nf_hook: Verdict = QUEUE.\n"); if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn, verdict >> NF_VERDICT_BITS)) --
[PATCH 19/33] netvm: hook skb allocation to reserves
Change the skb allocation api to indicate RX usage and use this to fall back to the reserve when needed. SKBs allocated from the reserve are tagged in skb-emergency. Teach all other skb ops about emergency skbs and the reserve accounting. Use the (new) packet split API to allocate and track fragment pages from the emergency reserve. Do this using an atomic counter in page-index. This is needed because the fragments have a different sharing semantic than that indicated by skb_shinfo()-dataref. Note that the decision to distinguish between regular and emergency SKBs allows the accounting overhead to be limited to the later kind. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm_types.h |1 include/linux/skbuff.h | 25 +- net/core/skbuff.c| 173 +-- 3 files changed, 173 insertions(+), 26 deletions(-) Index: linux-2.6/include/linux/skbuff.h === --- linux-2.6.orig/include/linux/skbuff.h +++ linux-2.6/include/linux/skbuff.h @@ -289,7 +289,8 @@ struct sk_buff { __u8pkt_type:3, fclone:2, ipvs_property:1, - nf_trace:1; + nf_trace:1, + emergency:1; __be16 protocol; void(*destructor)(struct sk_buff *skb); @@ -341,10 +342,22 @@ struct sk_buff { #include asm/system.h +#define SKB_ALLOC_FCLONE 0x01 +#define SKB_ALLOC_RX 0x02 + +static inline bool skb_emergency(const struct sk_buff *skb) +{ +#ifdef CONFIG_NETVM + return unlikely(skb-emergency); +#else + return false; +#endif +} + extern void kfree_skb(struct sk_buff *skb); extern void __kfree_skb(struct sk_buff *skb); extern struct sk_buff *__alloc_skb(unsigned int size, - gfp_t priority, int fclone, int node); + gfp_t priority, int flags, int node); static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority) { @@ -354,7 +367,7 @@ static inline struct sk_buff *alloc_skb( static inline struct sk_buff *alloc_skb_fclone(unsigned int size, gfp_t priority) { - return __alloc_skb(size, priority, 1, -1); + return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1); } extern void kfree_skbmem(struct sk_buff 
*skb); @@ -1297,7 +1310,8 @@ static inline void __skb_queue_purge(str static inline struct sk_buff *__dev_alloc_skb(unsigned int length, gfp_t gfp_mask) { - struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask); + struct sk_buff *skb = + __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1); if (likely(skb)) skb_reserve(skb, NET_SKB_PAD); return skb; @@ -1343,6 +1357,7 @@ static inline struct sk_buff *netdev_all } extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask); +extern void __netdev_free_page(struct net_device *dev, struct page *page); /** * netdev_alloc_page - allocate a page for ps-rx on a specific device @@ -1359,7 +1374,7 @@ static inline struct page *netdev_alloc_ static inline void netdev_free_page(struct net_device *dev, struct page *page) { - __free_page(page); + __netdev_free_page(dev, page); } /** Index: linux-2.6/net/core/skbuff.c === --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug); * %GFP_ATOMIC. */ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int fclone, int node) + int flags, int node) { struct kmem_cache *cache; struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; + int emergency = 0, memalloc = sk_memalloc_socks(); - cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; + size = SKB_DATA_ALIGN(size); + cache = (flags SKB_ALLOC_FCLONE) + ? skbuff_fclone_cache : skbuff_head_cache; +#ifdef CONFIG_NETVM + if (memalloc (flags SKB_ALLOC_RX)) + gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN; +retry_alloc: +#endif /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask ~__GFP_DMA, node); if (!skb) - goto out; + goto noskb; - size = SKB_DATA_ALIGN(size); data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), gfp_mask, node); if (!data) @@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int * See
[PATCH 13/33] net: wrap sk-sk_backlog_rcv()
Wrap calling sk-sk_backlog_rcv() in a function. This will allow extending the generic sk_backlog_rcv behaviour. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h |5 + net/core/sock.c |4 ++-- net/ipv4/tcp.c |2 +- net/ipv4/tcp_timer.c |2 +- 4 files changed, 9 insertions(+), 4 deletions(-) Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -485,6 +485,11 @@ static inline void sk_add_backlog(struct skb-next = NULL; } +static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) +{ + return sk-sk_backlog_rcv(sk, skb); +} + #define sk_wait_event(__sk, __timeo, __condition) \ ({ int __rc; \ release_sock(__sk); \ Index: linux-2.6/net/core/sock.c === --- linux-2.6.orig/net/core/sock.c +++ linux-2.6/net/core/sock.c @@ -320,7 +320,7 @@ int sk_receive_skb(struct sock *sk, stru */ mutex_acquire(sk-sk_lock.dep_map, 0, 1, _RET_IP_); - rc = sk-sk_backlog_rcv(sk, skb); + rc = sk_backlog_rcv(sk, skb); mutex_release(sk-sk_lock.dep_map, 1, _RET_IP_); } else @@ -1312,7 +1312,7 @@ static void __release_sock(struct sock * struct sk_buff *next = skb-next; skb-next = NULL; - sk-sk_backlog_rcv(sk, skb); + sk_backlog_rcv(sk, skb); /* * We are in process context here with softirqs Index: linux-2.6/net/ipv4/tcp.c === --- linux-2.6.orig/net/ipv4/tcp.c +++ linux-2.6/net/ipv4/tcp.c @@ -1134,7 +1134,7 @@ static void tcp_prequeue_process(struct * necessary */ local_bh_disable(); while ((skb = __skb_dequeue(tp-ucopy.prequeue)) != NULL) - sk-sk_backlog_rcv(sk, skb); + sk_backlog_rcv(sk, skb); local_bh_enable(); /* Clear memory counter. 
*/ Index: linux-2.6/net/ipv4/tcp_timer.c === --- linux-2.6.orig/net/ipv4/tcp_timer.c +++ linux-2.6/net/ipv4/tcp_timer.c @@ -196,7 +196,7 @@ static void tcp_delack_timer(unsigned lo NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED); while ((skb = __skb_dequeue(tp-ucopy.prequeue)) != NULL) - sk-sk_backlog_rcv(sk, skb); + sk_backlog_rcv(sk, skb); tp-ucopy.memory = 0; } -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/33] mm: __GFP_MEMALLOC
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much like PF_MEMALLOC. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/gfp.h |3 ++- mm/page_alloc.c |4 +++- 2 files changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6/include/linux/gfp.h === --- linux-2.6.orig/include/linux/gfp.h +++ linux-2.6/include/linux/gfp.h @@ -43,6 +43,7 @@ struct vm_area_struct; #define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */ #define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */ #define __GFP_NORETRY ((__force gfp_t)0x1000u) /* Do not retry. Might fail */ +#define __GFP_MEMALLOC ((__force gfp_t)0x2000u) /* Use emergency reserves */ #define __GFP_COMP ((__force gfp_t)0x4000u) /* Add compound page metadata */ #define __GFP_ZERO ((__force gfp_t)0x8000u) /* Return zeroed page on success */ #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */ @@ -88,7 +89,7 @@ struct vm_area_struct; /* Control page allocator reclaim behavior */ #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\ __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\ - __GFP_NORETRY|__GFP_NOMEMALLOC) + __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC) /* Control allocation constraints */ #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE) Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1560,7 +1560,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask) alloc_flags |= ALLOC_HARDER; if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) { - if (!in_irq() && (p->flags & PF_MEMALLOC)) + if (gfp_mask & __GFP_MEMALLOC) + alloc_flags |= ALLOC_NO_WATERMARKS; + else if (!in_irq() && (p->flags & PF_MEMALLOC)) alloc_flags |= ALLOC_NO_WATERMARKS; else if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE))) --
[PATCH 01/33] mm: gfp_to_alloc_flags()
Factor out the gfp to alloc_flags mapping so it can be used in other places. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- mm/internal.h | 11 ++ mm/page_alloc.c | 98 2 files changed, 67 insertions(+), 42 deletions(-) Index: linux-2.6/mm/internal.h === --- linux-2.6.orig/mm/internal.h +++ linux-2.6/mm/internal.h @@ -47,4 +47,15 @@ static inline unsigned long page_order(s VM_BUG_ON(!PageBuddy(page)); return page_private(page); } + +#define ALLOC_HARDER 0x01 /* try to alloc harder */ +#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */ +#define ALLOC_WMARK_MIN0x04 /* use pages_min watermark */ +#define ALLOC_WMARK_LOW0x08 /* use pages_low watermark */ +#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */ +#define ALLOC_NO_WATERMARKS0x20 /* don't check watermarks at all */ +#define ALLOC_CPUSET 0x40 /* check for correct cpuset */ + +int gfp_to_alloc_flags(gfp_t gfp_mask); + #endif Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1139,14 +1139,6 @@ failed: return NULL; } -#define ALLOC_NO_WATERMARKS0x01 /* don't check watermarks at all */ -#define ALLOC_WMARK_MIN0x02 /* use pages_min watermark */ -#define ALLOC_WMARK_LOW0x04 /* use pages_low watermark */ -#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ -#define ALLOC_HARDER 0x10 /* try to alloc harder */ -#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ -#define ALLOC_CPUSET 0x40 /* check for correct cpuset */ - #ifdef CONFIG_FAIL_PAGE_ALLOC static struct fail_page_alloc_attr { @@ -1535,6 +1527,44 @@ static void set_page_owner(struct page * #endif /* CONFIG_PAGE_OWNER */ /* + * get the deepest reaching allocation flags for the given gfp_mask + */ +int gfp_to_alloc_flags(gfp_t gfp_mask) +{ + struct task_struct *p = current; + int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET; + const gfp_t wait = gfp_mask __GFP_WAIT; + + /* +* The caller may dip into page reserves a bit more if the caller +* cannot run direct reclaim, or if the caller has realtime 
scheduling +* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will +* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). +*/ + if (gfp_mask __GFP_HIGH) + alloc_flags |= ALLOC_HIGH; + + if (!wait) { + alloc_flags |= ALLOC_HARDER; + /* +* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. +* See also cpuset_zone_allowed() comment in kernel/cpuset.c. +*/ + alloc_flags = ~ALLOC_CPUSET; + } else if (unlikely(rt_task(p)) !in_interrupt()) + alloc_flags |= ALLOC_HARDER; + + if (likely(!(gfp_mask __GFP_NOMEMALLOC))) { + if (!in_interrupt() + ((p-flags PF_MEMALLOC) || +unlikely(test_thread_flag(TIF_MEMDIE + alloc_flags |= ALLOC_NO_WATERMARKS; + } + + return alloc_flags; +} + +/* * This is the 'heart' of the zoned buddy allocator. */ struct page * fastcall @@ -1589,48 +1619,28 @@ restart: * OK, we're below the kswapd watermark and have kicked background * reclaim. Now things get more complex, so set up alloc_flags according * to how we want to proceed. -* -* The caller may dip into page reserves a bit more if the caller -* cannot run direct reclaim, or if the caller has realtime scheduling -* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will -* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). */ - alloc_flags = ALLOC_WMARK_MIN; - if ((unlikely(rt_task(p)) !in_interrupt()) || !wait) - alloc_flags |= ALLOC_HARDER; - if (gfp_mask __GFP_HIGH) - alloc_flags |= ALLOC_HIGH; - if (wait) - alloc_flags |= ALLOC_CPUSET; + alloc_flags = gfp_to_alloc_flags(gfp_mask); - /* -* Go through the zonelist again. Let __GFP_HIGH and allocations -* coming from realtime tasks go deeper into reserves. -* -* This is the last chance, in general, before the goto nopage. -* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. -* See also cpuset_zone_allowed() comment in kernel/cpuset.c. -*/ - page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags); + /* This is the last chance, in general, before the goto nopage. 
*/ + page = get_page_from_freelist(gfp_mask, order, zonelist, +
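The mapping this patch factors out can be sketched in userspace C. Flag names and values mirror the patch, but the task-state checks (rt_task(), PF_MEMALLOC, TIF_MEMDIE) are stubbed out and the gfp bit values are stand-ins, so this only illustrates the !__GFP_WAIT and __GFP_HIGH handling, not the kernel function itself:

```c
#include <assert.h>

/* Stand-ins for the kernel gfp bits (values are illustrative only). */
#define MY_GFP_WAIT  0x01u
#define MY_GFP_HIGH  0x02u

/* alloc_flags as defined in the patch's mm/internal.h. */
#define ALLOC_HARDER     0x01
#define ALLOC_HIGH       0x02
#define ALLOC_WMARK_MIN  0x04
#define ALLOC_CPUSET     0x40

static int gfp_to_alloc_flags(unsigned int gfp_mask)
{
	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

	if (gfp_mask & MY_GFP_HIGH)
		alloc_flags |= ALLOC_HIGH;

	if (!(gfp_mask & MY_GFP_WAIT)) {
		/* GFP_ATOMIC-style caller: try harder, and ignore the
		 * cpuset rather than fail the allocation. */
		alloc_flags |= ALLOC_HARDER;
		alloc_flags &= ~ALLOC_CPUSET;
	}

	return alloc_flags;
}
```

The point of the factoring is that SLUB (patch 3) can call the same mapping to decide whether the current context is entitled to reserve pages.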
[PATCH 20/33] netvm: filter emergency skbs.
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be assumed lossy.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -930,6 +930,9 @@ static inline int sk_filter(struct sock
 {
 	int err;
 	struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;
 
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
--
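The policy can be modelled in plain C. Here `struct sk_buff` and `struct sock` are reduced to the one flag each that the check consults, so this is a sketch of the rule, not the kernel structures:

```c
#include <assert.h>
#include <errno.h>

/* Stand-in structures: only the flags this check consults. */
struct sk_buff { int emergency; };  /* skb_emergency()   */
struct sock    { int memalloc;  };  /* sk_has_memalloc() */

/* Emergency skbs are only admitted to SOCK_MEMALLOC sockets; anything
 * else is refused before further (memory-consuming) processing. */
static int sk_filter_gate(const struct sock *sk, const struct sk_buff *skb)
{
	if (skb->emergency && !sk->memalloc)
		return -ENOMEM;
	return 0;
}
```

Dropping early is safe precisely because of the lossy-network assumption: the packet could just as well have been discarded by the NIC.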
[PATCH 23/33] netvm: skb processing
In order to make sure emergency packets receive all memory needed to proceed ensure processing of emergency SKBs happens under PF_MEMALLOC. Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing. Skip taps, since those are user-space again. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h |5 + net/core/dev.c | 44 ++-- net/core/sock.c| 18 ++ 3 files changed, 61 insertions(+), 6 deletions(-) Index: linux-2.6/net/core/dev.c === --- linux-2.6.orig/net/core/dev.c +++ linux-2.6/net/core/dev.c @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk struct net_device *orig_dev; int ret = NET_RX_DROP; __be16 type; + unsigned long pflags = current-flags; + + /* Emergency skb are special, they should +* - be delivered to SOCK_MEMALLOC sockets only +* - stay away from userspace +* - have bounded memory usage +* +* Use PF_MEMALLOC as a poor mans memory pool - the grouping kind. +* This saves us from propagating the allocation context down to all +* allocation sites. 
+*/ + if (skb_emergency(skb)) + current-flags |= PF_MEMALLOC; /* if we've gotten here through NAPI, check netpoll */ if (netpoll_receive_skb(skb)) - return NET_RX_DROP; + goto out; if (!skb-tstamp.tv64) net_timestamp(skb); @@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk orig_dev = skb_bond(skb); if (!orig_dev) - return NET_RX_DROP; + goto out; __get_cpu_var(netdev_rx_stat).total++; @@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk } #endif + if (skb_emergency(skb)) + goto skip_taps; + list_for_each_entry_rcu(ptype, ptype_all, list) { if (!ptype-dev || ptype-dev == skb-dev) { if (pt_prev) @@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk } } +skip_taps: #ifdef CONFIG_NET_CLS_ACT if (pt_prev) { ret = deliver_skb(skb, pt_prev, orig_dev); @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) { kfree_skb(skb); - goto out; + goto unlock; } skb-tc_verd = 0; ncls: #endif + if (skb_emergency(skb)) + switch(skb-protocol) { + case __constant_htons(ETH_P_ARP): + case __constant_htons(ETH_P_IP): + case __constant_htons(ETH_P_IPV6): + case __constant_htons(ETH_P_8021Q): + break; + + default: + goto drop; + } + skb = handle_bridge(skb, pt_prev, ret, orig_dev); if (!skb) - goto out; + goto unlock; skb = handle_macvlan(skb, pt_prev, ret, orig_dev); if (!skb) - goto out; + goto unlock; type = skb-protocol; list_for_each_entry_rcu(ptype, ptype_base[ntohs(type)15], list) { @@ -2056,6 +2085,7 @@ ncls: if (pt_prev) { ret = pt_prev-func(skb, skb-dev, pt_prev, orig_dev); } else { +drop: kfree_skb(skb); /* Jamal, now you will not able to escape explaining * me how you were going to use this. 
:-) @@ -2063,8 +2093,10 @@ ncls: ret = NET_RX_DROP; } -out: +unlock: rcu_read_unlock(); +out: + tsk_restore_flags(current, pflags, PF_MEMALLOC); return ret; } Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct skb-next = NULL; } +extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb); + static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) { + if (skb_emergency(skb)) + return __sk_backlog_rcv(sk, skb); + return sk-sk_backlog_rcv(sk, skb); } Index: linux-2.6/net/core/sock.c === --- linux-2.6.orig/net/core/sock.c +++ linux-2.6/net/core/sock.c @@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk) } EXPORT_SYMBOL_GPL(sk_clear_memalloc); +#ifdef CONFIG_NETVM +int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) +{ + int ret; + unsigned long pflags = current-flags; + + /* these should have been dropped before queueing */ + BUG_ON(!sk_has_memalloc(sk)); + +
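The save/set/restore dance that netif_receive_skb() performs around packet delivery can be sketched in userspace. `task_flags` stands in for current->flags, and the helper mirrors the tsk_restore_flags() idea: restore only the PF_MEMALLOC bit to whatever it was on entry, so a pre-existing setting is preserved and a temporary one does not leak:

```c
#include <assert.h>

#define PF_MEMALLOC 0x0800

static unsigned long task_flags;   /* stand-in for current->flags */

/* Mirror of the tsk_restore_flags() idea: put the masked bits back to
 * their value on entry, leaving all other bits alone. */
static void restore_flags(unsigned long orig, unsigned long mask)
{
	task_flags &= ~mask;
	task_flags |= orig & mask;
}

static void receive(int emergency)
{
	unsigned long pflags = task_flags;

	if (emergency)
		task_flags |= PF_MEMALLOC;  /* allocations may dip into reserves */

	/* ... deliver the packet ... */

	restore_flags(pflags, PF_MEMALLOC);
}
```

This is why the allocation context does not have to be threaded through every allocation site: the task flag acts as an implicit "poor man's memory pool" for the whole delivery path.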
[PATCH 25/33] mm: add support for non block device backed swap files
A new addres_space_operations method is added: int swapfile(struct address_space *, int) When during sys_swapon() this method is found and returns no error the swapper_space.a_ops will proxy to sis-swap_file-f_mapping-a_ops. The swapfile method will be used to communicate to the address_space that the VM relies on it, and the address_space should take adequate measures (like reserving memory for mempools or the like). Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- Documentation/filesystems/Locking |9 + include/linux/buffer_head.h |2 - include/linux/fs.h|1 include/linux/swap.h |3 + mm/Kconfig|3 + mm/page_io.c | 58 ++ mm/swap_state.c |5 +++ mm/swapfile.c | 22 +- 8 files changed, 101 insertions(+), 2 deletions(-) Index: linux-2.6/include/linux/swap.h === --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -164,6 +164,7 @@ enum { SWP_USED= (1 0), /* is slot in swap_info[] used? */ SWP_WRITEOK = (1 1), /* ok to write to this swap?*/ SWP_ACTIVE = (SWP_USED | SWP_WRITEOK), + SWP_FILE= (1 2), /* file swap area */ /* add others here before... 
*/ SWP_SCANNING= (1 8), /* refcount in scan_swap_map */ }; @@ -264,6 +265,8 @@ extern void swap_unplug_io_fn(struct bac /* linux/mm/page_io.c */ extern int swap_readpage(struct file *, struct page *); extern int swap_writepage(struct page *page, struct writeback_control *wbc); +extern void swap_sync_page(struct page *page); +extern int swap_set_page_dirty(struct page *page); extern void end_swap_bio_read(struct bio *bio, int err); /* linux/mm/swap_state.c */ Index: linux-2.6/mm/page_io.c === --- linux-2.6.orig/mm/page_io.c +++ linux-2.6/mm/page_io.c @@ -17,6 +17,7 @@ #include linux/bio.h #include linux/swapops.h #include linux/writeback.h +#include linux/buffer_head.h #include asm/pgtable.h static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index, @@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st unlock_page(page); goto out; } +#ifdef CONFIG_SWAP_FILE + { + struct swap_info_struct *sis = page_swap_info(page); + if (sis-flags SWP_FILE) { + ret = sis-swap_file-f_mapping- + a_ops-writepage(page, wbc); + if (!ret) + count_vm_event(PSWPOUT); + return ret; + } + } +#endif bio = get_swap_bio(GFP_NOIO, page_private(page), page, end_swap_bio_write); if (bio == NULL) { @@ -120,6 +133,39 @@ out: return ret; } +#ifdef CONFIG_SWAP_FILE +void swap_sync_page(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations * a_ops = + sis-swap_file-f_mapping-a_ops; + if (a_ops-sync_page) + a_ops-sync_page(page); + } else + block_sync_page(page); +} + +int swap_set_page_dirty(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations * a_ops = + sis-swap_file-f_mapping-a_ops; + int (*spd)(struct page *) = a_ops-set_page_dirty; +#ifdef CONFIG_BLOCK + if (!spd) + spd = __set_page_dirty_buffers; +#endif + return (*spd)(page); + } + + return __set_page_dirty_nobuffers(page); +} +#endif + int 
swap_readpage(struct file *file, struct page *page) { struct bio *bio; @@ -127,6 +173,18 @@ int swap_readpage(struct file *file, str BUG_ON(!PageLocked(page)); ClearPageUptodate(page); +#ifdef CONFIG_SWAP_FILE + { + struct swap_info_struct *sis = page_swap_info(page); + if (sis-flags SWP_FILE) { + ret = sis-swap_file-f_mapping- + a_ops-readpage(sis-swap_file, page); + if (!ret) + count_vm_event(PSWPIN); + return ret; + } + } +#endif bio = get_swap_bio(GFP_KERNEL, page_private(page), page, end_swap_bio_read); if (bio == NULL) { Index: linux-2.6/mm/swap_state.c === --- linux-2.6.orig/mm/swap_state.c
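The SWP_FILE dispatch in swap_writepage()/swap_readpage() can be sketched as a branch between the filesystem's own address_space method and the classic bio path. The structures below are stand-ins for the kernel ones, and the writepage functions here are dummies that just count calls:

```c
#include <assert.h>

#define SWP_FILE 0x4  /* file swap area, as in the patch's swap.h */

struct page;
struct aops      { int (*writepage)(struct page *); };
struct swap_info { int flags; struct aops *a_ops; };

static int bio_writes, file_writes;

static int file_writepage(struct page *p) { (void)p; file_writes++; return 0; }
static int bio_writepage(struct page *p)  { (void)p; bio_writes++;  return 0; }

static int swap_writepage(struct swap_info *sis, struct page *page)
{
	if (sis->flags & SWP_FILE)
		return sis->a_ops->writepage(page);  /* filesystem handles it */
	return bio_writepage(page);                  /* block-device path */
}
```

Block-device swap is entirely unaffected: only areas activated with the new swapfile() method get SWP_FILE set.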
[PATCH 03/33] mm: slub: add knowledge of reserve pages
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. Care is taken to only touch the SLUB slow path. This is done to ensure reserve pages don't leak out and get consumed. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/slub_def.h |1 + mm/slub.c| 31 +++ 2 files changed, 24 insertions(+), 8 deletions(-) Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c +++ linux-2.6/mm/slub.c @@ -20,11 +20,12 @@ #include linux/mempolicy.h #include linux/ctype.h #include linux/kallsyms.h +#include internal.h /* * Lock order: * 1. slab_lock(page) - * 2. slab-list_lock + * 2. node-list_lock * * The slab_lock protects operations on the object of a particular * slab and its metadata in the page struct. If the slab lock @@ -1074,7 +1075,7 @@ static void setup_object(struct kmem_cac s-ctor(s, object); } -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve) { struct page *page; struct kmem_cache_node *n; @@ -1090,6 +1091,7 @@ static struct page *new_slab(struct kmem if (!page) goto out; + *reserve = page-reserve; n = get_node(s, page_to_nid(page)); if (n) atomic_long_inc(n-nr_slabs); @@ -1468,10 +1470,22 @@ static void *__slab_alloc(struct kmem_ca { void **object; struct page *new; + int reserve = 0; if (!c-page) goto new_slab; + if (unlikely(c-reserve)) { + /* +* If the current slab is a reserve slab and the current +* allocation context does not allow access to the reserves +* we must force an allocation to test the current levels. 
+*/ + if (!(gfp_to_alloc_flags(gfpflags) ALLOC_NO_WATERMARKS)) + goto alloc_slab; + reserve = 1; + } + slab_lock(c-page); if (unlikely(!node_match(c, node))) goto another_slab; @@ -1479,10 +1493,9 @@ load_freelist: object = c-page-freelist; if (unlikely(!object)) goto another_slab; - if (unlikely(SlabDebug(c-page))) + if (unlikely(SlabDebug(c-page) || reserve)) goto debug; - object = c-page-freelist; c-freelist = object[c-offset]; c-page-inuse = s-objects; c-page-freelist = NULL; @@ -1500,16 +1513,18 @@ new_slab: goto load_freelist; } +alloc_slab: if (gfpflags __GFP_WAIT) local_irq_enable(); - new = new_slab(s, gfpflags, node); + new = new_slab(s, gfpflags, node, reserve); if (gfpflags __GFP_WAIT) local_irq_disable(); if (new) { c = get_cpu_slab(s, smp_processor_id()); + c-reserve = reserve; if (c-page) { /* * Someone else populated the cpu_slab while we @@ -1537,8 +1552,7 @@ new_slab: } return NULL; debug: - object = c-page-freelist; - if (!alloc_debug_processing(s, c-page, object, addr)) + if (SlabDebug(c-page) !alloc_debug_processing(s, c-page, object, addr)) goto another_slab; c-page-inuse++; @@ -2010,10 +2024,11 @@ static struct kmem_cache_node *early_kme { struct page *page; struct kmem_cache_node *n; + int reserve; BUG_ON(kmalloc_caches-size sizeof(struct kmem_cache_node)); - page = new_slab(kmalloc_caches, gfpflags, node); + page = new_slab(kmalloc_caches, gfpflags, node, reserve); BUG_ON(!page); if (page_to_nid(page) != node) { Index: linux-2.6/include/linux/slub_def.h === --- linux-2.6.orig/include/linux/slub_def.h +++ linux-2.6/include/linux/slub_def.h @@ -17,6 +17,7 @@ struct kmem_cache_cpu { int node; unsigned int offset; unsigned int objsize; + int reserve; }; struct kmem_cache_node { -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
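The slow-path rule the patch adds can be reduced to one predicate. This is a sketch, not the SLUB code: objects from a reserve slab (one allocated with ALLOC_NO_WATERMARKS) are only handed out when the current allocation context itself qualifies for the reserves; otherwise a fresh slab allocation is forced to probe whether the memory pressure has passed:

```c
#include <assert.h>

#define ALLOC_NO_WATERMARKS 0x20  /* as in the reordered mm/internal.h */

/* Sketch: may the current context consume an object from this slab? */
static int may_use_reserve_slab(int slab_is_reserve, int alloc_flags)
{
	if (!slab_is_reserve)
		return 1;  /* normal slab: anyone may allocate */
	return (alloc_flags & ALLOC_NO_WATERMARKS) != 0;
}
```

This is what keeps reserve pages from leaking out through the per-cpu slab to unentitled callers while still touching only the SLUB slow path.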
[PATCH 15/33] net: sk_allocation() - concentrate socket related allocations
Introduce sk_allocation(), this function allows to inject sock specific flags to each sock related allocation. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h|7 ++- net/ipv4/tcp_output.c | 11 ++- net/ipv6/tcp_ipv6.c | 14 +- 3 files changed, 21 insertions(+), 11 deletions(-) Index: linux-2.6/net/ipv4/tcp_output.c === --- linux-2.6.orig/net/ipv4/tcp_output.c +++ linux-2.6/net/ipv4/tcp_output.c @@ -2081,7 +2081,7 @@ void tcp_send_fin(struct sock *sk) } else { /* Socket is locked, keep trying until memory is available. */ for (;;) { - skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL); + skb = alloc_skb_fclone(MAX_TCP_HEADER, sk-sk_allocation); if (skb) break; yield(); @@ -2114,7 +2114,7 @@ void tcp_send_active_reset(struct sock * struct sk_buff *skb; /* NOTE: No TCP options attached and we never retransmit this. */ - skb = alloc_skb(MAX_TCP_HEADER, priority); + skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority)); if (!skb) { NET_INC_STATS(LINUX_MIB_TCPABORTFAILED); return; @@ -2187,7 +2187,8 @@ struct sk_buff * tcp_make_synack(struct __u8 *md5_hash_location; #endif - skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC); + skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, + sk_allocation(sk, GFP_ATOMIC)); if (skb == NULL) return NULL; @@ -2446,7 +2447,7 @@ void tcp_send_ack(struct sock *sk) * tcp_transmit_skb() will set the ownership to this * sock. */ - buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC); + buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC)); if (buff == NULL) { inet_csk_schedule_ack(sk); inet_csk(sk)-icsk_ack.ato = TCP_ATO_MIN; @@ -2488,7 +2489,7 @@ static int tcp_xmit_probe_skb(struct soc struct sk_buff *skb; /* We don't queue it, tcp_transmit_skb() sets ownership. 
*/ - skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC); + skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC)); if (skb == NULL) return -1; Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -419,6 +419,11 @@ static inline int sock_flag(struct sock return test_bit(flag, sk-sk_flags); } +static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask) +{ + return gfp_mask; +} + static inline void sk_acceptq_removed(struct sock *sk) { sk-sk_ack_backlog--; @@ -1212,7 +1217,7 @@ static inline struct sk_buff *sk_stream_ int hdr_len; hdr_len = SKB_DATA_ALIGN(sk-sk_prot-max_header); - skb = alloc_skb_fclone(size + hdr_len, gfp); + skb = alloc_skb_fclone(size + hdr_len, sk_allocation(sk, gfp)); if (skb) { skb-truesize += mem; if (sk_stream_wmem_schedule(sk, skb-truesize)) { Index: linux-2.6/net/ipv6/tcp_ipv6.c === --- linux-2.6.orig/net/ipv6/tcp_ipv6.c +++ linux-2.6/net/ipv6/tcp_ipv6.c @@ -573,7 +573,8 @@ static int tcp_v6_md5_do_add(struct sock } else { /* reallocate new list if current one is full. */ if (!tp-md5sig_info) { - tp-md5sig_info = kzalloc(sizeof(*tp-md5sig_info), GFP_ATOMIC); + tp-md5sig_info = kzalloc(sizeof(*tp-md5sig_info), + sk_allocation(sk, GFP_ATOMIC)); if (!tp-md5sig_info) { kfree(newkey); return -ENOMEM; @@ -583,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock tcp_alloc_md5sig_pool(); if (tp-md5sig_info-alloced6 == tp-md5sig_info-entries6) { keys = kmalloc((sizeof (tp-md5sig_info-keys6[0]) * - (tp-md5sig_info-entries6 + 1)), GFP_ATOMIC); + (tp-md5sig_info-entries6 + 1)), + sk_allocation(sk, GFP_ATOMIC)); if (!keys) { tcp_free_md5sig_pool(); @@ -709,7 +711,7 @@ static int tcp_v6_parse_md5_keys (struct struct tcp_sock *tp = tcp_sk(sk); struct tcp_md5sig_info *p; - p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL); + p = kzalloc(sizeof(struct tcp_md5sig_info), sk-sk_allocation);
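In this patch sk_allocation() is still the identity function; its value is being a single choke point for all socket-related allocations. A hypothetical later use of the hook (sketched here with stand-in gfp bits, and not part of this patch) would fold a per-socket flag into every allocation made on the socket's behalf:

```c
#include <assert.h>

#define MY_GFP_ATOMIC   0x01u
#define MY_GFP_MEMALLOC 0x02u  /* hypothetical reserve-access bit */

struct sock { int memalloc; };  /* stand-in for SOCK_MEMALLOC state */

static unsigned int sk_allocation(const struct sock *sk, unsigned int gfp_mask)
{
	if (sk->memalloc)
		gfp_mask |= MY_GFP_MEMALLOC;  /* socket may reach the reserves */
	return gfp_mask;
}
```

Routing every GFP_ATOMIC/GFP_KERNEL use through the wrapper first, as this patch does, is what makes that later injection a one-line change.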
[PATCH 18/33] netvm: INET reserves.
Add reserves for INET. The two big users seem to be the route cache and ip-fragment cache. Reserve the route cache under generic RX reserve, its usage is bounded by the high reclaim watermark, and thus does not need further accounting. Reserve the ip-fragement caches under SKB data reserve, these add to the SKB RX limit. By ensuring we can at least receive as much data as fits in the reassmbly line we avoid fragment attack deadlocks. Use proc conv() routines to update these limits and return -ENOMEM to user space. Adds to the reserve tree: total network reserve network TX reserve protocol TX pages network RX reserve + IPv6 route cache + IPv4 route cache SKB data reserve + IPv6 fragment cache + IPv4 fragment cache Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/sysctl.h | 11 +++ kernel/sysctl.c|8 ++-- net/ipv4/ip_fragment.c |7 +++ net/ipv4/route.c | 30 +- net/ipv4/sysctl_net_ipv4.c | 24 +++- net/ipv6/reassembly.c |7 +++ net/ipv6/route.c | 31 ++- net/ipv6/sysctl_net_ipv6.c | 24 +++- 8 files changed, 136 insertions(+), 6 deletions(-) Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c === --- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c +++ linux-2.6/net/ipv4/sysctl_net_ipv4.c @@ -18,6 +18,7 @@ #include net/route.h #include net/tcp.h #include net/cipso_ipv4.h +#include linux/reserve.h /* From af_inet.c */ extern int sysctl_ip_nonlocal_bind; @@ -186,6 +187,27 @@ static int strategy_allowed_congestion_c } +extern struct mem_reserve ipv4_frag_reserve; + +static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp, +int *valp, int write, void *data) +{ + if (write) { + long value = *negp ? 
-*lvalp : *lvalp; + int err = mem_reserve_kmalloc_set(ipv4_frag_reserve, value); + if (err) + return err; + } + return do_proc_dointvec_conv(negp, lvalp, valp, write, data); +} + +static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp, +void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return do_proc_dointvec(table, write, filp, buffer, lenp, ppos, + do_proc_dointvec_fragment_conv, NULL); +} + ctl_table ipv4_table[] = { { .ctl_name = NET_IPV4_TCP_TIMESTAMPS, @@ -291,7 +313,7 @@ ctl_table ipv4_table[] = { .data = sysctl_ipfrag_high_thresh, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec + .proc_handler = proc_dointvec_fragment }, { .ctl_name = NET_IPV4_IPFRAG_LOW_THRESH, Index: linux-2.6/net/ipv6/sysctl_net_ipv6.c === --- linux-2.6.orig/net/ipv6/sysctl_net_ipv6.c +++ linux-2.6/net/ipv6/sysctl_net_ipv6.c @@ -12,9 +12,31 @@ #include net/ndisc.h #include net/ipv6.h #include net/addrconf.h +#include linux/reserve.h #ifdef CONFIG_SYSCTL +extern struct mem_reserve ipv6_frag_reserve; + +static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp, +int *valp, int write, void *data) +{ + if (write) { + long value = *negp ? 
-*lvalp : *lvalp; + int err = mem_reserve_kmalloc_set(ipv6_frag_reserve, value); + if (err) + return err; + } + return do_proc_dointvec_conv(negp, lvalp, valp, write, data); +} + +static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp, +void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return do_proc_dointvec(table, write, filp, buffer, lenp, ppos, + do_proc_dointvec_fragment_conv, NULL); +} + static ctl_table ipv6_table[] = { { .ctl_name = NET_IPV6_ROUTE, @@ -44,7 +66,7 @@ static ctl_table ipv6_table[] = { .data = sysctl_ip6frag_high_thresh, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec + .proc_handler = proc_dointvec_fragment }, { .ctl_name = NET_IPV6_IP6FRAG_LOW_THRESH, Index: linux-2.6/net/ipv4/ip_fragment.c === --- linux-2.6.orig/net/ipv4/ip_fragment.c +++ linux-2.6/net/ipv4/ip_fragment.c @@ -43,6 +43,7 @@ #include linux/udp.h #include linux/inet.h #include linux/netfilter_ipv4.h +#include linux/reserve.h /*
[PATCH 07/33] mm: serialize access to min_free_kbytes
There is a small race between the procfs caller and the memory hotplug caller of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
 	"Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4162,12 +4163,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4222,6 +4223,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4257,7 +4267,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }
--
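The shape of the fix is the standard locked-wrapper pattern: keep an unlocked `__` variant for the single-threaded boot path and wrap it for concurrent callers. A toy userspace sketch (the spinlock is a single-threaded stub that only records correct lock/unlock pairing, and PAGE_SHIFT == 12 is an assumption):

```c
#include <assert.h>

/* Toy spinlock standing in for min_free_lock. */
static int min_free_lock;
static int lock_taken_count;

static void spin_lock(int *l)   { assert(*l == 0); *l = 1; lock_taken_count++; }
static void spin_unlock(int *l) { assert(*l == 1); *l = 0; }

static int min_free_kbytes = 1024;
static long pages_min;

#define PAGE_SHIFT 12  /* assumption: 4K pages */

/* Unlocked variant: for the boot-time initcall, before SMP matters. */
static void __setup_per_zone_pages_min(void)
{
	/* kbytes -> pages, as in the kernel's expression */
	pages_min = (long)min_free_kbytes >> (PAGE_SHIFT - 10);
}

/* Locked wrapper: for the procfs and memory-hotplug callers. */
static void setup_per_zone_pages_min(void)
{
	spin_lock(&min_free_lock);
	__setup_per_zone_pages_min();
	spin_unlock(&min_free_lock);
}
```

The next patch's extra caller then simply uses the locked wrapper too.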
[PATCH 11/33] mm: memory reserve management
Generic reserve management code. It provides methods to reserve and charge. Upon this, generic alloc/free style reserve pools could be build, which could fully replace mempool_t functionality. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/reserve.h | 54 + mm/Makefile |2 mm/reserve.c| 436 3 files changed, 491 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/reserve.h === --- /dev/null +++ linux-2.6/include/linux/reserve.h @@ -0,0 +1,54 @@ +/* + * Memory reserve management. + * + * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED] + * + * This file contains the public data structure and API definitions. + */ + +#ifndef _LINUX_RESERVE_H +#define _LINUX_RESERVE_H + +#include linux/list.h +#include linux/spinlock.h + +struct mem_reserve { + struct mem_reserve *parent; + struct list_head children; + struct list_head siblings; + + const char *name; + + long pages; + long limit; + long usage; + spinlock_t lock;/* protects limit and usage */ +}; + +extern struct mem_reserve mem_reserve_root; + +void mem_reserve_init(struct mem_reserve *res, const char *name, + struct mem_reserve *parent); +int mem_reserve_connect(struct mem_reserve *new_child, + struct mem_reserve *node); +int mem_reserve_disconnect(struct mem_reserve *node); + +int mem_reserve_pages_set(struct mem_reserve *res, long pages); +int mem_reserve_pages_add(struct mem_reserve *res, long pages); +int mem_reserve_pages_charge(struct mem_reserve *res, long pages, +int overcommit); + +int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes); +int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes, + int overcommit); + +struct kmem_cache; + +int mem_reserve_kmem_cache_set(struct mem_reserve *res, + struct kmem_cache *s, + int objects); +int mem_reserve_kmem_cache_charge(struct mem_reserve *res, + long objs, + int overcommit); + +#endif /* _LINUX_RESERVE_H */ Index: linux-2.6/mm/Makefile === --- linux-2.6.orig/mm/Makefile +++ linux-2.6/mm/Makefile @@ 
-11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o page_alloc.o page-writeback.o pdflush.o \ readahead.o swap.o truncate.o vmscan.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ - page_isolation.o $(mmu-y) + page_isolation.o reserve.o $(mmu-y) obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o Index: linux-2.6/mm/reserve.c === --- /dev/null +++ linux-2.6/mm/reserve.c @@ -0,0 +1,436 @@ +/* + * Memory reserve management. + * + * Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED] + * + * Description: + * + * Manage a set of memory reserves. + * + * A memory reserve is a reserve for a specified number of object of specified + * size. Since memory is managed in pages, this reserve demand is then + * translated into a page unit. + * + * So each reserve has a specified object limit, an object usage count and a + * number of pages required to back these objects. + * + * Usage is charged against a reserve, if the charge fails, the resource must + * not be allocated/used. + * + * The reserves are managed in a tree, and the resource demands (pages and + * limit) are propagated up the tree. Obviously the object limit will be + * meaningless as soon as the unit starts mixing, but the required page reserve + * (being of one unit) is still valid at the root. + * + * It is the page demand of the root node that is used to set the global + * reserve (adjust_memalloc_reserve() which sets zone-pages_emerg). + * + * As long as a subtree has the same usage unit, an aggregate node can be used + * to charge against, instead of the leaf nodes. However, do be consistent with + * who is charged, resource usage is not propagated up the tree (for + * performance reasons). 
+ */ + +#include linux/reserve.h +#include linux/mutex.h +#include linux/mmzone.h +#include linux/log2.h +#include linux/proc_fs.h +#include linux/seq_file.h +#include linux/module.h +#include linux/slab.h + +static DEFINE_MUTEX(mem_reserve_mutex); + +/** + * @mem_reserve_root - the global reserve root + * + * The global reserve is empty, and has no limit unit, it merely + * acts as an aggregation point for reserves and an interface to + * adjust_memalloc_reserve(). + */ +struct mem_reserve mem_reserve_root = { +
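The tree bookkeeping described above can be sketched minimally: page demand added at a leaf propagates through the parent chain to the root, whose total is what would drive adjust_memalloc_reserve(). This uses only the parent/pages fields of struct mem_reserve and omits the limit/usage accounting and locking:

```c
#include <assert.h>
#include <stddef.h>

/* Reduced struct mem_reserve: just the tree link and page demand. */
struct mem_reserve {
	struct mem_reserve *parent;
	long pages;
};

/* Demand accumulates from the charged node up toward the root. */
static void mem_reserve_pages_add(struct mem_reserve *res, long pages)
{
	for (; res; res = res->parent)
		res->pages += pages;
}
```

Note the asymmetry the comment block calls out: the *demand* (pages, limit) propagates up like this, but *usage* is charged against a single node only, for performance.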
[PATCH 21/33] netvm: prevent a TCP specific deadlock
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers from receiving data, which will prevent userspace from running, which is needed to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |    7 ---
 net/core/stream.c  |    5 +++--
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -743,7 +743,8 @@ static inline struct inode *SOCK_INODE(s
 }
 
 extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+				  int size, int kind);
 
 #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
 
@@ -761,13 +762,13 @@ static inline void sk_stream_mem_reclaim
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	return (int)skb->truesize <= sk->sk_forward_alloc ||
-		sk_stream_mem_schedule(sk, skb->truesize, 1);
+		sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
 }
 
 static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
 {
 	return size <= sk->sk_forward_alloc ||
-		sk_stream_mem_schedule(sk, size, 0);
+		sk_stream_mem_schedule(sk, NULL, size, 0);
 }
 
 /* Used by processes to lock a socket state, so that

Index: linux-2.6/net/core/stream.c
===
--- linux-2.6.orig/net/core/stream.c
+++ linux-2.6/net/core/stream.c
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
 
 EXPORT_SYMBOL(__sk_stream_mem_reclaim);
 
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
 {
 	int amt = sk_stream_pages(size);
 
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
 	/* Over hard limit. */
 	if (atomic_read(&sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
 		sk->sk_prot->enter_memory_pressure();
-		goto suppress_allocation;
+		if (!skb || (skb && !skb_emergency(skb)))
+			goto suppress_allocation;
 	}
 
 	/* Under pressure. */
--
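The fixed suppression rule boils down to one condition, sketched here outside the kernel's accounting machinery: over the hard limit an allocation is normally suppressed, but an emergency skb must still be admitted, otherwise the writeback completions that would shrink the buffered data can never be received:

```c
#include <assert.h>

/* Sketch of the exemption: 1 = admit the memory, 0 = suppress. */
static int admit(long allocated, long hard_limit, int emergency_skb)
{
	if (allocated > hard_limit && !emergency_skb)
		return 0;  /* the old unconditional suppress_allocation path */
	return 1;      /* SOCK_MEMALLOC emergency traffic is exempt */
}
```

The exemption is bounded because emergency skbs are themselves accounted against the reserves set up earlier in the series.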
[PATCH 14/33] net: packet split receive api
Add some packet-split receive hooks. For one this allows to do NUMA node affine page allocs. Later on these hooks will be extended to do emergency reserve allocations for fragments. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- drivers/net/e1000/e1000_main.c |8 ++-- drivers/net/sky2.c | 16 ++-- include/linux/skbuff.h | 23 +++ net/core/skbuff.c | 20 4 files changed, 51 insertions(+), 16 deletions(-) Index: linux-2.6/drivers/net/e1000/e1000_main.c === --- linux-2.6.orig/drivers/net/e1000/e1000_main.c +++ linux-2.6/drivers/net/e1000/e1000_main.c @@ -4407,12 +4407,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt pci_unmap_page(pdev, ps_page_dma-ps_page_dma[j], PAGE_SIZE, PCI_DMA_FROMDEVICE); ps_page_dma-ps_page_dma[j] = 0; - skb_fill_page_desc(skb, j, ps_page-ps_page[j], 0, - length); + skb_add_rx_frag(skb, j, ps_page-ps_page[j], 0, length); ps_page-ps_page[j] = NULL; - skb-len += length; - skb-data_len += length; - skb-truesize += length; } /* strip the ethernet crc, problem is we're using pages now so @@ -4618,7 +4614,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a if (j adapter-rx_ps_pages) { if (likely(!ps_page-ps_page[j])) { ps_page-ps_page[j] = - alloc_page(GFP_ATOMIC); + netdev_alloc_page(netdev); if (unlikely(!ps_page-ps_page[j])) { adapter-alloc_rx_buff_failed++; goto no_buffers; Index: linux-2.6/include/linux/skbuff.h === --- linux-2.6.orig/include/linux/skbuff.h +++ linux-2.6/include/linux/skbuff.h @@ -846,6 +846,9 @@ static inline void skb_fill_page_desc(st skb_shinfo(skb)-nr_frags = i + 1; } +extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, + int off, int size); + #define SKB_PAGE_ASSERT(skb) BUG_ON(skb_shinfo(skb)-nr_frags) #define SKB_FRAG_ASSERT(skb) BUG_ON(skb_shinfo(skb)-frag_list) #define SKB_LINEAR_ASSERT(skb) BUG_ON(skb_is_nonlinear(skb)) @@ -1339,6 +1342,26 @@ static inline struct sk_buff *netdev_all return __netdev_alloc_skb(dev, length, GFP_ATOMIC); } +extern struct page *__netdev_alloc_page(struct net_device *dev, 
gfp_t gfp_mask); + +/** + * netdev_alloc_page - allocate a page for ps-rx on a specific device + * @dev: network device to receive on + * + * Allocate a new page node local to the specified device. + * + * %NULL is returned if there is no free memory. + */ +static inline struct page *netdev_alloc_page(struct net_device *dev) +{ + return __netdev_alloc_page(dev, GFP_ATOMIC); +} + +static inline void netdev_free_page(struct net_device *dev, struct page *page) +{ + __free_page(page); +} + /** * skb_clone_writable - is the header of a clone writable * @skb: buffer to check Index: linux-2.6/net/core/skbuff.c === --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struc return skb; } +struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask) +{ + int node = dev-dev.parent ? dev_to_node(dev-dev.parent) : -1; + struct page *page; + + page = alloc_pages_node(node, gfp_mask, 0); + return page; +} + +void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, + int size) +{ + skb_fill_page_desc(skb, i, page, off, size); + skb-len += size; + skb-data_len += size; + skb-truesize += size; +} + static void skb_drop_list(struct sk_buff **listp) { struct sk_buff *list = *listp; @@ -2464,6 +2482,8 @@ EXPORT_SYMBOL(kfree_skb); EXPORT_SYMBOL(__pskb_pull_tail); EXPORT_SYMBOL(__alloc_skb); EXPORT_SYMBOL(__netdev_alloc_skb); +EXPORT_SYMBOL(__netdev_alloc_page); +EXPORT_SYMBOL(skb_add_rx_frag); EXPORT_SYMBOL(pskb_copy); EXPORT_SYMBOL(pskb_expand_head); EXPORT_SYMBOL(skb_checksum); Index: linux-2.6/drivers/net/sky2.c === --- linux-2.6.orig/drivers/net/sky2.c +++ linux-2.6/drivers/net/sky2.c @@ -1173,7 +1173,7 @@ static struct sk_buff *sky2_rx_alloc(str skb_reserve(skb, ALIGN(p, RX_SKB_ALIGN) - p); for
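The accounting that skb_add_rx_frag() centralizes is easy to state in isolation: attaching a page fragment must bump len, data_len and truesize together, which the e1000 and sky2 paths previously open-coded. A sketch with struct sk_buff reduced to those fields:

```c
#include <assert.h>

/* Reduced sk_buff: just the fields the helper updates. */
struct sk_buff { int nr_frags, len, data_len, truesize; };

static void skb_add_rx_frag(struct sk_buff *skb, int size)
{
	skb->nr_frags++;          /* stands in for skb_fill_page_desc() */
	skb->len      += size;
	skb->data_len += size;
	skb->truesize += size;
}
```

Centralizing the update is what lets the later emergency-reserve patches hook fragment accounting in one place instead of per driver.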
[PATCH 00/33] Swap over NFS -v14
Hi,

Another posting of the full swap over NFS series. [ I tried just posting the first part last time around, but that just caused more confusion through lack of a general picture ] [ patches against 2.6.23-mm1, also to be found online at: http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.23-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give a description.

Part 1, patches 1-12

The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer. Swap over network has the problem that the network subsystem does not use fixed-size allocations, but heavily relies on kmalloc(). This makes mempools unusable. This first part provides a generic reserve framework. Care is taken to only affect the slow paths - when we're low on memory. Caveat: it is currently SLUB only.

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: slub: add knowledge of reserve pages
 4 - mm: allow mempool to fall back to memalloc reserves
 5 - mm: kmem_estimate_pages()
 6 - mm: allow PF_MEMALLOC from softirq context
 7 - mm: serialize access to min_free_kbytes
 8 - mm: emergency pool
 9 - mm: system wide ALLOC_NO_WATERMARK
10 - mm: __GFP_MEMALLOC
11 - mm: memory reserve management
12 - selinux: tag avc cache alloc as non-critical

Part 2, patches 13-15

Provide some generic network infrastructure needed later on.

13 - net: wrap sk->sk_backlog_rcv()
14 - net: packet split receive api
15 - net: sk_allocation() - concentrate socket related allocations

Part 3, patches 16-23

Now that we have a generic memory reserve system, use it on the network stack. The thing that makes this interesting is that, contrary to BIO, both the transmit and receive path require memory allocations. That is, in the BIO layer write back completion is usually just an ISR flipping a bit and waking stuff up.
A network write back completion involves receiving packets, which, when there is no memory, is rather hard. And even when there is memory, there is no guarantee that the required packet comes in within the window that that memory buys us. The solution to this problem is found in the fact that the network is to be assumed lossy: even now, when there is no memory to receive packets, the network card has to discard them. What we do is move this into the network stack. So we reserve a little pool to act as a receive buffer; this allows us to inspect packets before tossing them. This way, we can filter out those packets that ensure progress (writeback completion) and disregard the others (which would have been dropped anyway). [ NOTE: this is a stable mode of operation with limited memory usage, exactly the kind of thing we need ] Again, care is taken to confine the overhead of this to the slow path: only packets allocated from the reserves suffer the extra atomic overhead needed for accounting.

16 - netvm: network reserve infrastructure
17 - sysctl: propagate conv errors
18 - netvm: INET reserves.
19 - netvm: hook skb allocation to reserves
20 - netvm: filter emergency skbs.
21 - netvm: prevent a TCP specific deadlock
22 - netfilter: NF_QUEUE vs emergency skbs
23 - netvm: skb processing

Part 4, patches 24-26

Generic vm infrastructure to handle swapping to a filesystem instead of a block device. The approach here has been questioned; people would like to see a less invasive approach. One suggestion is to create and use a_ops->swap_{in,out}().

24 - mm: prepare swap entry methods for use in page methods
25 - mm: add support for non block device backed swap files
26 - mm: methods for teaching filesystems about PG_swapcache pages

Part 5, patches 27-33

Finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS.
27 - nfs: remove mempools
28 - nfs: teach the NFS client how to treat PG_swapcache pages
29 - nfs: disable data cache revalidation for swapfiles
30 - nfs: swap vs nfs_writepage
31 - nfs: enable swap on NFS
32 - nfs: fix various memory recursions possible with swap over NFS.
33 - nfs: do not warn on radix tree node allocation failures

- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
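[ Editorial note: the core idea of part 3, admit packets paid for out of the emergency reserve only when they can ensure forward progress, reduces to a small admission check. A hedged sketch; the names are illustrative, not the series' actual API: ]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-socket / per-skb state. */
struct mock_sock { bool memalloc; };  /* SOCK_MEMALLOC-style flag */
struct mock_skb  { bool emergency; }; /* allocated from reserves  */

/* Packets from the emergency reserve are only worth keeping if
 * they go to a socket that services writeback (e.g. the NFS
 * transport); everything else is dropped, as the NIC would have
 * dropped it anyway under memory pressure. */
static bool mock_admit_skb(const struct mock_skb *skb,
			   const struct mock_sock *sk)
{
	if (!skb->emergency)
		return true;     /* normal memory: no restriction   */
	return sk->memalloc;     /* reserve memory: progress only   */
}
```
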
[PATCH 29/33] nfs: disable data cache revalidation for swapfiles
Do as Trond suggested: http://lkml.org/lkml/2006/8/25/348 Disable NFS data cache revalidation on swap files since it doesn't really make sense to have other clients change the file while you are using it. Thereby we can stop setting PG_private on swap pages, since there ought to be no further races with invalidate_inode_pages2() to deal with. And since we cannot set PG_private we cannot use page-private (which is already used by PG_swapcache pages anyway) to store the nfs_page. Thus augment the new nfs_page_find_request logic. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/inode.c |6 fs/nfs/write.c | 73 ++--- 2 files changed, 65 insertions(+), 14 deletions(-) Index: linux-2.6/fs/nfs/inode.c === --- linux-2.6.orig/fs/nfs/inode.c +++ linux-2.6/fs/nfs/inode.c @@ -744,6 +744,12 @@ int nfs_revalidate_mapping_nolock(struct struct nfs_inode *nfsi = NFS_I(inode); int ret = 0; + /* +* swapfiles are not supposed to be shared. +*/ + if (IS_SWAPFILE(inode)) + goto out; + if ((nfsi-cache_validity NFS_INO_REVAL_PAGECACHE) || nfs_attribute_timeout(inode) || NFS_STALE(inode)) { ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode); Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -112,25 +112,62 @@ static void nfs_context_set_write_error( set_bit(NFS_CONTEXT_ERROR_WRITE, ctx-flags); } -static struct nfs_page *nfs_page_find_request_locked(struct page *page) +static struct nfs_page * +__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get) { struct nfs_page *req = NULL; - if (PagePrivate(page)) { + if (PagePrivate(page)) req = (struct nfs_page *)page_private(page); - if (req != NULL) - kref_get(req-wb_kref); - } + else if (unlikely(PageSwapCache(page))) + req = radix_tree_lookup(nfsi-nfs_page_tree, page_file_index(page)); + + if (get req) + kref_get(req-wb_kref); + return req; } +static inline struct nfs_page * +nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page) +{ + 
return __nfs_page_find_request_locked(nfsi, page, 1); +} + +static int __nfs_page_has_request(struct page *page) +{ + struct inode *inode = page_file_mapping(page)-host; + struct nfs_page *req = NULL; + + spin_lock(inode-i_lock); + req = __nfs_page_find_request_locked(NFS_I(inode), page, 0); + spin_unlock(inode-i_lock); + + /* +* hole here plugged by the caller holding onto PG_locked +*/ + + return req != NULL; +} + +static inline int nfs_page_has_request(struct page *page) +{ + if (PagePrivate(page)) + return 1; + + if (unlikely(PageSwapCache(page))) + return __nfs_page_has_request(page); + + return 0; +} + static struct nfs_page *nfs_page_find_request(struct page *page) { struct inode *inode = page_file_mapping(page)-host; struct nfs_page *req = NULL; spin_lock(inode-i_lock); - req = nfs_page_find_request_locked(page); + req = nfs_page_find_request_locked(NFS_I(inode), page); spin_unlock(inode-i_lock); return req; } @@ -255,7 +292,7 @@ static int nfs_page_async_flush(struct n spin_lock(inode-i_lock); for(;;) { - req = nfs_page_find_request_locked(page); + req = nfs_page_find_request_locked(nfsi, page); if (req == NULL) { spin_unlock(inode-i_lock); return 0; @@ -374,8 +411,14 @@ static int nfs_inode_add_request(struct if (nfs_have_delegation(inode, FMODE_WRITE)) nfsi-change_attr++; } - SetPagePrivate(req-wb_page); - set_page_private(req-wb_page, (unsigned long)req); + /* +* Swap-space should not get truncated. Hence no need to plug the race +* with invalidate/truncate. 
+*/ + if (likely(!PageSwapCache(req-wb_page))) { + SetPagePrivate(req-wb_page); + set_page_private(req-wb_page, (unsigned long)req); + } nfsi-npages++; kref_get(req-wb_kref); return 0; @@ -392,8 +435,10 @@ static void nfs_inode_remove_request(str BUG_ON (!NFS_WBACK_BUSY(req)); spin_lock(inode-i_lock); - set_page_private(req-wb_page, 0); - ClearPagePrivate(req-wb_page); + if (likely(!PageSwapCache(req-wb_page))) { + set_page_private(req-wb_page, 0); + ClearPagePrivate(req-wb_page); + } radix_tree_delete(nfsi-nfs_page_tree,
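[ Editorial note: the two-way lookup introduced by __nfs_page_find_request_locked() can be modelled in userspace: ordinary pages carry the request in page->private, while swap-cache pages, which must not set PG_private, fall back to the per-inode index. A simplified sketch with hypothetical names, the flat table standing in for the radix tree: ]

```c
#include <assert.h>
#include <stddef.h>

struct mock_req { int id; };

struct mock_page {
	int private_set;          /* PagePrivate()     */
	int swapcache;            /* PageSwapCache()   */
	struct mock_req *private; /* page_private()    */
	size_t index;             /* page_file_index() */
};

/* Ordinary path: request hangs off the page. Swap-cache path:
 * look it up in the per-inode index instead. */
static struct mock_req *mock_find_request(struct mock_page *page,
					  struct mock_req **tree,
					  size_t tree_len)
{
	if (page->private_set)
		return page->private;
	if (page->swapcache && page->index < tree_len)
		return tree[page->index];
	return NULL;
}
```
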
[PATCH 27/33] nfs: remove mempools
With the introduction of the shared dirty page accounting in .19, NFS should not be able to surprise the VM with all dirty pages. Thus it should always be able to free some memory. Hence no more need for mempools. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/read.c | 15 +++ fs/nfs/write.c | 27 +-- 2 files changed, 8 insertions(+), 34 deletions(-) Index: linux-2.6/fs/nfs/read.c === --- linux-2.6.orig/fs/nfs/read.c +++ linux-2.6/fs/nfs/read.c @@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea static const struct rpc_call_ops nfs_read_full_ops; static struct kmem_cache *nfs_rdata_cachep; -static mempool_t *nfs_rdata_mempool; - -#define MIN_POOL_READ (32) struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount) { - struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS); + struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc else { p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS); if (!p->pagevec) { - mempool_free(p, nfs_rdata_mempool); + kmem_cache_free(nfs_rdata_cachep, p); p = NULL; } } @@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu); if (p && (p->pagevec != &p->page_array[0])) kfree(p->pagevec); - mempool_free(p, nfs_rdata_mempool); + kmem_cache_free(nfs_rdata_cachep, p); } static void nfs_readdata_free(struct nfs_read_data *rdata) @@ -597,16 +594,10 @@ int __init nfs_init_readpagecache(void) if (nfs_rdata_cachep == NULL) return -ENOMEM; - nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ, -nfs_rdata_cachep); - if (nfs_rdata_mempool == NULL) - return -ENOMEM; - return 0; } void nfs_destroy_readpagecache(void) { - mempool_destroy(nfs_rdata_mempool); kmem_cache_destroy(nfs_rdata_cachep); } Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -28,9 +28,6 @@
#define NFSDBG_FACILITYNFSDBG_PAGECACHE -#define MIN_POOL_WRITE (32) -#define MIN_POOL_COMMIT(4) - /* * Local function declarations */ @@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri static const struct rpc_call_ops nfs_commit_ops; static struct kmem_cache *nfs_wdata_cachep; -static mempool_t *nfs_wdata_mempool; -static mempool_t *nfs_commit_mempool; struct nfs_write_data *nfs_commit_alloc(void) { - struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS); + struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu); if (p (p-pagevec != p-page_array[0])) kfree(p-pagevec); - mempool_free(p, nfs_commit_mempool); + kmem_cache_free(nfs_wdata_cachep, p); } void nfs_commit_free(struct nfs_write_data *wdata) @@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount) { - struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS); + struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all else { p-pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS); if (!p-pagevec) { - mempool_free(p, nfs_wdata_mempool); + kmem_cache_free(nfs_wdata_cachep, p); p = NULL; } } @@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu); if (p (p-pagevec != p-page_array[0])) kfree(p-pagevec); - mempool_free(p, nfs_wdata_mempool); + kmem_cache_free(nfs_wdata_cachep, p); } static void nfs_writedata_free(struct nfs_write_data *wdata) @@ -1474,16 +1469,6 @@ int __init nfs_init_writepagecache(void) if (nfs_wdata_cachep == NULL) return -ENOMEM; -
[PATCH 31/33] nfs: enable swap on NFS
Provide an a_ops-swapfile() implementation for NFS. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol -connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. (swapping continues over a server reset during heavy network traffic) Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/Kconfig | 18 fs/nfs/file.c | 10 ++ include/linux/sunrpc/xprt.h |5 ++- net/sunrpc/sched.c |9 -- net/sunrpc/xprtsock.c | 63 5 files changed, 102 insertions(+), 3 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -371,6 +371,13 @@ static int nfs_launder_page(struct page return nfs_wb_page(page_file_mapping(page)-host, page); } +#ifdef CONFIG_NFS_SWAP +static int nfs_swapfile(struct address_space *mapping, int enable) +{ + return xs_swapper(NFS_CLIENT(mapping-host)-cl_xprt, enable); +} +#endif + const struct address_space_operations nfs_file_aops = { .readpage = nfs_readpage, .readpages = nfs_readpages, @@ -385,6 +392,9 @@ const struct address_space_operations nf .direct_IO = nfs_direct_IO, #endif .launder_page = nfs_launder_page, +#ifdef CONFIG_NFS_SWAP + .swapfile = nfs_swapfile, +#endif }; static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page) Index: linux-2.6/include/linux/sunrpc/xprt.h === --- linux-2.6.orig/include/linux/sunrpc/xprt.h +++ linux-2.6/include/linux/sunrpc/xprt.h @@ -143,7 +143,9 @@ struct rpc_xprt { unsigned intmax_reqs; /* total slots */ unsigned long state; /* transport state */ unsigned char shutdown : 1, /* being shut down */ - resvport : 1; /* use a reserved port */ + resvport : 1, /* use a reserved port */ + swapper: 1; /* we're swapping over this + transport */ unsigned intbind_index; /* bind function index */ /* @@ -246,6 
+248,7 @@ struct rpc_rqst * xprt_lookup_rqst(struc void xprt_complete_rqst(struct rpc_task *task, int copied); void xprt_release_rqst_cong(struct rpc_task *task); void xprt_disconnect(struct rpc_xprt *xprt); +intxs_swapper(struct rpc_xprt *xprt, int enable); /* * Reserved bit positions in xprt-state Index: linux-2.6/net/sunrpc/sched.c === --- linux-2.6.orig/net/sunrpc/sched.c +++ linux-2.6/net/sunrpc/sched.c @@ -761,7 +761,10 @@ struct rpc_buffer { void *rpc_malloc(struct rpc_task *task, size_t size) { struct rpc_buffer *buf; - gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT; + gfp_t gfp = GFP_NOWAIT; + + if (RPC_IS_SWAPPER(task)) + gfp |= __GFP_MEMALLOC; size += sizeof(struct rpc_buffer); if (size = RPC_BUFFER_MAXSIZE) @@ -817,6 +820,8 @@ void rpc_init_task(struct rpc_task *task atomic_set(task-tk_count, 1); task-tk_client = clnt; task-tk_flags = flags; + if (clnt-cl_xprt-swapper) + task-tk_flags |= RPC_TASK_SWAPPER; task-tk_ops = tk_ops; if (tk_ops-rpc_call_prepare != NULL) task-tk_action = rpc_prepare_task; @@ -853,7 +858,7 @@ void rpc_init_task(struct rpc_task *task static struct rpc_task * rpc_alloc_task(void) { - return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS); + return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO); } static void rpc_free_task(struct rcu_head *rcu) Index: linux-2.6/net/sunrpc/xprtsock.c === --- linux-2.6.orig/net/sunrpc/xprtsock.c +++ linux-2.6/net/sunrpc/xprtsock.c @@ -1397,6 +1397,9 @@ static void xs_udp_finish_connecting(str transport-sock = sock; transport-inet = sk; + if (xprt-swapper) + sk_set_memalloc(sk); + write_unlock_bh(sk-sk_callback_lock); } xs_udp_do_set_buffer_size(xprt); @@ -1414,11 +1417,15 @@ static void xs_udp_connect_worker4(struc container_of(work, struct sock_xprt, connect_worker.work); struct rpc_xprt *xprt = transport-xprt; struct socket *sock = transport-sock; + unsigned long pflags =
[PATCH 33/33] nfs: do not warn on radix tree node allocation failures
GFP_ATOMIC failures are rather common, so do not warn about them. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/inode.c |2 +- fs/nfs/write.c | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) Index: linux-2.6/fs/nfs/inode.c === --- linux-2.6.orig/fs/nfs/inode.c +++ linux-2.6/fs/nfs/inode.c @@ -1172,7 +1172,7 @@ static void init_once(struct kmem_cache INIT_LIST_HEAD(&nfsi->open_files); INIT_LIST_HEAD(&nfsi->access_cache_entry_lru); INIT_LIST_HEAD(&nfsi->access_cache_inode_lru); - INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC); + INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC|__GFP_NOWARN); nfsi->ncommit = 0; nfsi->npages = 0; nfs4_init_once(nfsi); Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -652,6 +652,7 @@ static struct nfs_page * nfs_update_requ struct inode *inode = mapping->host; struct nfs_page *req, *new = NULL; pgoff_t rqend, end; + int error; end = offset + bytes; @@ -659,6 +660,10 @@ static struct nfs_page * nfs_update_requ /* Loop over all inode entries and see if we find * A request for the page we wish to update */ + error = radix_tree_preload(GFP_NOIO); + if (error) + return ERR_PTR(error); + spin_lock(&inode->i_lock); req = nfs_page_find_request_locked(NFS_I(inode), page); if (req) { @@ -666,6 +671,7 @@ static struct nfs_page * nfs_update_requ int error; spin_unlock(&inode->i_lock); + radix_tree_preload_end(); error = nfs_wait_on_request(req); nfs_release_request(req); if (error < 0) { @@ -676,6 +682,7 @@ static struct nfs_page * nfs_update_requ continue; } spin_unlock(&inode->i_lock); + radix_tree_preload_end(); if (new) nfs_release_request(new); break; @@ -687,13 +694,16 @@ static struct nfs_page * nfs_update_requ error = nfs_inode_add_request(inode, new); if (error) { spin_unlock(&inode->i_lock); + radix_tree_preload_end(); nfs_unlock_request(new); return ERR_PTR(error); } spin_unlock(&inode->i_lock); + radix_tree_preload_end(); return new; } spin_unlock(&inode->i_lock); + radix_tree_preload_end();
new = nfs_create_request(ctx, inode, page, offset, bytes); if (IS_ERR(new)) -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
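[ Editorial note: the radix_tree_preload()/radix_tree_preload_end() pairing added above follows a standard kernel pattern: reserve tree nodes with a sleeping allocation (GFP_NOIO here) before taking the spinlock, so the insertion under the lock never has to allocate. A toy userspace model of that contract, with illustrative names: ]

```c
#include <assert.h>

/* Model: a preloaded node pool that insertion under the "lock"
 * consumes, so the locked section never needs to allocate. */
static int pool;     /* preloaded nodes available */
static int inserted; /* successful insertions     */

static int mock_preload(int can_sleep)
{
	if (!can_sleep)
		return -1;  /* -ENOMEM: could not reserve a node */
	pool++;
	return 0;
}

static int mock_insert_locked(void)
{
	if (pool == 0)
		return -1;  /* would have needed an atomic alloc */
	pool--;
	inserted++;
	return 0;
}

static void mock_preload_end(void)
{
	pool = 0;           /* drop any unused reservation */
}
```
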
[PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from a borrowed context, save and restore current->flags; ksoftirqd will have its own task_struct. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/sched.h |4 kernel/softirq.c |3 +++ mm/page_alloc.c |7 --- 3 files changed, 11 insertions(+), 3 deletions(-) Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1557,9 +1557,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask) alloc_flags |= ALLOC_HARDER; if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) { - if (!in_interrupt() && - ((p->flags & PF_MEMALLOC) || - unlikely(test_thread_flag(TIF_MEMDIE)))) + if (!in_irq() && (p->flags & PF_MEMALLOC)) + alloc_flags |= ALLOC_NO_WATERMARKS; + else if (!in_interrupt() && + unlikely(test_thread_flag(TIF_MEMDIE))) alloc_flags |= ALLOC_NO_WATERMARKS; } Index: linux-2.6/kernel/softirq.c === --- linux-2.6.orig/kernel/softirq.c +++ linux-2.6/kernel/softirq.c @@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void) __u32 pending; int max_restart = MAX_SOFTIRQ_RESTART; int cpu; + unsigned long pflags = current->flags; + current->flags &= ~PF_MEMALLOC; pending = local_softirq_pending(); account_system_vtime(current); @@ -249,6 +251,7 @@ restart: account_system_vtime(current); _local_bh_enable(); + tsk_restore_flags(current, pflags, PF_MEMALLOC); } #ifndef __ARCH_HAS_DO_SOFTIRQ Index: linux-2.6/include/linux/sched.h === --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1389,6 +1389,10 @@ static inline void put_task_struct(struc #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) #define used_math() tsk_used_math(current) +#define tsk_restore_flags(p, pflags, mask) \ + do { (p)->flags &= ~(mask); \ + (p)->flags |= ((pflags) & (mask)); } while (0) + #ifdef CONFIG_SMP extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask); #else -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at
http://vger.kernel.org/majordomo-info.html
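[ Editorial note: the tsk_restore_flags() macro introduced above restores only the masked bits, leaving any flags the softirq handler legitimately changed untouched. Its semantics, written out as a plain function: ]

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800 /* value in 2.6-era kernels */

/* (flags & ~mask) | (saved & mask): put the saved state of the
 * masked bits back, keep every other bit as it currently is. */
static unsigned long restore_flags(unsigned long flags,
				   unsigned long saved,
				   unsigned long mask)
{
	return (flags & ~mask) | (saved & mask);
}
```
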
[PATCH 12/33] selinux: tag avc cache alloc as non-critical
Failing to allocate a cache entry will only harm performance, not correctness. Do not consume valuable reserve pages for something like that. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] Acked-by: James Morris [EMAIL PROTECTED] --- security/selinux/avc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-2/security/selinux/avc.c === --- linux-2.6-2.orig/security/selinux/avc.c +++ linux-2.6-2/security/selinux/avc.c @@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v { struct avc_node *node; - node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC); + node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC); if (!node) goto out; -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
Allow the mempool to use the memalloc reserves when all else fails and the allocation context would otherwise allow it. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- mm/mempool.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) Index: linux-2.6/mm/mempool.c === --- linux-2.6.orig/mm/mempool.c +++ linux-2.6/mm/mempool.c @@ -14,6 +14,7 @@ #include linux/mempool.h #include linux/blkdev.h #include linux/writeback.h +#include internal.h static void add_element(mempool_t *pool, void *element) { @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf void *element; unsigned long flags; wait_queue_t wait; - gfp_t gfp_temp; + gfp_t gfp_temp, gfp_orig = gfp_mask; might_sleep_if(gfp_mask __GFP_WAIT); @@ -228,6 +229,15 @@ repeat_alloc: } spin_unlock_irqrestore(pool-lock, flags); + /* if we really had right to the emergency reserves try those */ + if (gfp_to_alloc_flags(gfp_orig) ALLOC_NO_WATERMARKS) { + if (gfp_temp __GFP_NOMEMALLOC) { + gfp_temp = ~(__GFP_NOMEMALLOC|__GFP_NOWARN); + goto repeat_alloc; + } else + gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN; + } + /* We must not sleep in the GFP_ATOMIC case */ if (!(gfp_mask __GFP_WAIT)) return NULL; -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
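[ Editorial note: the new branch in mempool_alloc() retries the element allocation once with __GFP_NOMEMALLOC toggled, so a caller entitled to the reserves gets one dip into them before the mempool falls back to waiting. The control flow, modelled with a fake page allocator; the names and the single-page "reserve" are illustrative only: ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NOMEMALLOC 0x1 /* stand-in for __GFP_NOMEMALLOC */

static bool reserve_only; /* simulate: normal memory exhausted */

static void *mock_page_alloc(unsigned flags)
{
	static char reserve_page;
	if (!reserve_only)
		return &reserve_page;  /* normal path succeeds  */
	if (flags & MOCK_NOMEMALLOC)
		return NULL;           /* reserves refused      */
	return &reserve_page;          /* dip into the reserve  */
}

/* mempool_alloc()-style: first try politely, then, if the
 * calling context is entitled, retry once with reserves allowed. */
static void *mock_mempool_alloc(bool entitled)
{
	unsigned flags = MOCK_NOMEMALLOC;
	void *p = mock_page_alloc(flags);
	if (p || !entitled)
		return p;
	flags &= ~MOCK_NOMEMALLOC;
	return mock_page_alloc(flags);
}
```
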
[PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
Change ALLOC_NO_WATERMARK page allocation such that the reserves are system wide - which they are per setup_per_zone_pages_min(), when we scrape the barrel, do it properly. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- mm/page_alloc.c |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1638,6 +1638,12 @@ restart: rebalance: if (alloc_flags ALLOC_NO_WATERMARKS) { nofail_alloc: + /* +* break out of mempolicy boundaries +*/ + zonelist = NODE_DATA(numa_node_id())-node_zonelists + + gfp_zone(gfp_mask); + /* go through the zonelist yet again, ignoring mins */ page = get_page_from_freelist(gfp_mask, order, zonelist, ALLOC_NO_WATERMARKS); -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.24] ixgb: TX hangs under heavy load
Andy Gospodarek wrote: Auke, It has become clear that this patch resolves some tx-lockups on the ixgb driver. IBM did some checking and realized this hunk is in your sourceforge driver, but not anywhere else. Mind if we add it? I'll quickly double check where this came from in the first place and will post this to Jeff. Thanks! Auke Signed-off-by: Andy Gospodarek [EMAIL PROTECTED] --- ixgb_main.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c index d444de5..3ec7a41 100644 --- a/drivers/net/ixgb/ixgb_main.c +++ b/drivers/net/ixgb/ixgb_main.c @@ -1324,7 +1324,7 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb, /* Workaround for premature desc write-backs * in TSO mode. Append 4-byte sentinel desc */ - if (unlikely(mss && !nr_frags && size == len + if (unlikely(mss && (f == (nr_frags-1)) && size == len && size > 8)) size -= 4; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
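[ Editorial note: the behavioural change in the hunk above is narrow. The old test applied the 4-byte sentinel workaround only when the skb had no fragments, so a TSO frame whose final buffer was a page fragment could still trigger the premature write-back erratum; the new test fires on the last fragment. The two conditions as standalone predicates, a hypothetical extraction of the in-loop check: ]

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: sentinel only for skbs with no fragments. */
static bool sentinel_old(int mss, int f, int nr_frags,
			 int size, int len)
{
	return mss && !nr_frags && size == len && size > 8;
}

/* New condition: also fires on the last fragment (index f)
 * of a fragmented TSO skb. */
static bool sentinel_new(int mss, int f, int nr_frags,
			 int size, int len)
{
	return mss && (f == (nr_frags - 1)) && size == len && size > 8;
}
```
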
[patch 1/1][CORE] resend - fix free_netdev on register_netdev failure
Point 1: the unregistering of a network device schedules a netdev_run_todo. This function calls dev->destructor when it is set, and the destructor calls free_netdev.

Point 2: in the case of an initialization of a network device the usual code is:
 * alloc_netdev
 * register_netdev - if this one fails, call free_netdev and exit with error.

Point 3: in the register_netdevice function, at the later stage when the device is in the registered state, a call to the netdevice notifiers is made. If one of the notifications returns an error, a rollback from the registered state is done using unregister_netdevice.

Conclusion: when a network device fails to register during initialization because one network subsystem returned an error during the notification call chain, the network device is freed twice because of points 1 and 2. The second free_netdev will be done with an invalid pointer.

Proposed solution: the following patch moves all the code of unregister_netdevice, *except* the call to net_set_todo, to a new function rollback_registered. The following functions are changed this way:
 * register_netdevice: calls rollback_registered when a notification fails
 * unregister_netdevice: calls rollback_registered + net_set_todo; the call to net_set_todo now comes last. Since it just adds an element to a list, that should not break anything.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/core/dev.c | 112 ++--- 1 file changed, 59 insertions(+), 53 deletions(-) Index: net-2.6/net/core/dev.c === --- net-2.6.orig/net/core/dev.c +++ net-2.6/net/core/dev.c @@ -3496,6 +3496,60 @@ static void net_set_todo(struct net_devi spin_unlock(&net_todo_list_lock); } +static void rollback_registered(struct net_device *dev) +{ + BUG_ON(dev_boot_phase); + ASSERT_RTNL(); + + /* Some devices call without registering for initialization unwind.
*/ + if (dev-reg_state == NETREG_UNINITIALIZED) { + printk(KERN_DEBUG unregister_netdevice: device %s/%p never + was registered\n, dev-name, dev); + + WARN_ON(1); + return; + } + + BUG_ON(dev-reg_state != NETREG_REGISTERED); + + /* If device is running, close it first. */ + dev_close(dev); + + /* And unlink it from device chain. */ + unlist_netdevice(dev); + + dev-reg_state = NETREG_UNREGISTERING; + + synchronize_net(); + + /* Shutdown queueing discipline. */ + dev_shutdown(dev); + + + /* Notify protocols, that we are about to destroy + this device. They should clean all the things. + */ + call_netdevice_notifiers(NETDEV_UNREGISTER, dev); + + /* +* Flush the unicast and multicast chains +*/ + dev_addr_discard(dev); + + if (dev-uninit) + dev-uninit(dev); + + /* Notifier chain MUST detach us from master device. */ + BUG_TRAP(!dev-master); + + /* Remove entries from kobject tree */ + netdev_unregister_kobject(dev); + + synchronize_net(); + + dev_put(dev); +} + /** * register_netdevice - register a network device * @dev: device to register @@ -3633,8 +3687,10 @@ int register_netdevice(struct net_device /* Notify protocols, that a new device appeared. */ ret = call_netdevice_notifiers(NETDEV_REGISTER, dev); ret = notifier_to_errno(ret); - if (ret) - unregister_netdevice(dev); + if (ret) { + rollback_registered(dev); + dev-reg_state = NETREG_UNREGISTERED; + } out: return ret; @@ -3911,59 +3967,9 @@ void synchronize_net(void) void unregister_netdevice(struct net_device *dev) { - BUG_ON(dev_boot_phase); - ASSERT_RTNL(); - - /* Some devices call without registering for initialization unwind. */ - if (dev-reg_state == NETREG_UNINITIALIZED) { - printk(KERN_DEBUG unregister_netdevice: device %s/%p never - was registered\n, dev-name, dev); - - WARN_ON(1); - return; - } - - BUG_ON(dev-reg_state != NETREG_REGISTERED); - - /* If device is running, close it first. */ - dev_close(dev); - - /* And unlink it from device chain. 
*/ - unlist_netdevice(dev); - - dev-reg_state = NETREG_UNREGISTERING; - - synchronize_net(); - - /* Shutdown queueing discipline. */ - dev_shutdown(dev); - - - /* Notify protocols, that we are about to destroy - this device. They should clean all the things. - */ - call_netdevice_notifiers(NETDEV_UNREGISTER, dev); - - /* -* Flush the unicast and multicast chains -
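[ Editorial note: the double-free scenario and its fix can be condensed. Before the patch, register_netdevice() reacted to a notifier error by calling the full unregister_netdevice(), which queued the device for a deferred free, and then the driver's error path called free_netdev() again. Factoring out rollback_registered() lets the failure path unwind without scheduling the deferred free. A toy model of the control flow; states and names are illustrative: ]

```c
#include <assert.h>

enum reg_state { MOCK_UNINITIALIZED, MOCK_REGISTERED, MOCK_UNREGISTERED };

struct mock_dev {
	enum reg_state state;
	int deferred_frees; /* frees queued via net_set_todo()  */
	int notifier_ok;
};

/* Common unwind: close, unlink, notify... but do NOT queue a free. */
static void mock_rollback(struct mock_dev *dev)
{
	dev->state = MOCK_UNREGISTERED;
}

static int mock_register(struct mock_dev *dev)
{
	dev->state = MOCK_REGISTERED;
	if (!dev->notifier_ok) {
		mock_rollback(dev); /* unwind only; caller frees once */
		return -1;
	}
	return 0;
}

static void mock_unregister(struct mock_dev *dev)
{
	mock_rollback(dev);
	dev->deferred_frees++; /* net_set_todo(): deferred free */
}
```
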
[patch 1/1][NETNS] resend: fix net released by rcu callback
When a network namespace reference is held by a network subsystem, and this reference is decremented in an RCU update callback, we must ensure that there are no outstanding RCU updates before trying to free the network namespace. In the normal case, rcu_barrier is called when the network namespace is exiting, in the cleanup_net function. But when a network namespace creation fails and the subsystems are undone (as in the cleanup path), the rcu_barrier is missing. This patch adds the missing rcu_barrier. Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/core/net_namespace.c |2 ++ 1 file changed, 2 insertions(+) Index: net-2.6/net/core/net_namespace.c === --- net-2.6.orig/net/core/net_namespace.c +++ net-2.6/net/core/net_namespace.c @@ -112,6 +112,8 @@ out_undo: if (ops->exit) ops->exit(net); } + + rcu_barrier(); goto out; } -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
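[ Editorial note: why rcu_barrier() rather than synchronize_rcu(): the subsystems drop their namespace references from RCU *callbacks*, so teardown must wait for queued callbacks to have actually run, not merely for readers to finish. A toy model of the ordering, with the deferred callbacks as a simple queue: ]

```c
#include <assert.h>
#include <stddef.h>

struct mock_net { int refs; };

/* Deferred "RCU callbacks": each one drops a namespace ref. */
static struct mock_net *pending[8];
static int npending;

static void mock_call_rcu(struct mock_net *net)
{
	pending[npending++] = net;
}

/* rcu_barrier(): run every queued callback before returning. */
static void mock_rcu_barrier(void)
{
	for (int i = 0; i < npending; i++)
		pending[i]->refs--;
	npending = 0;
}
```
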
[patch 1/1][IPV6] resend: remove duplicate call to proc_net_remove
The file /proc/net/if_inet6 is removed twice. First in: inet6_exit -> addrconf_cleanup. And again a few lines later by: inet6_exit -> if6_proc_exit. Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- net/ipv6/addrconf.c |4 1 file changed, 4 deletions(-) Index: net-2.6/net/ipv6/addrconf.c === --- net-2.6.orig/net/ipv6/addrconf.c +++ net-2.6/net/ipv6/addrconf.c @@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void) del_timer(&addr_chk_timer); rtnl_unlock(); - -#ifdef CONFIG_PROC_FS - proc_net_remove(&init_net, "if_inet6"); -#endif } -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] NFS: change the ip_map cache code to handle IPv6 addresses
Hi, Here is the IPv6 support patch for the ip_map caching code part in nfs server. I have ported it on 2.6.24-rc1 (in which Brian Haley's ipv6_addr_v4mapped function is included) In case of bad formatting due to my mailer, you can also find the patch in attachment. Tests: tested with only IPv4 network and basic nfs ops (mount, file creation and modification) Signed-off-by: Aurelien Charbon [EMAIL PROTECTED] --- diff -p -u -r -N linux-2.6.24-rc1/fs/nfsd/export.c linux-2.6.24-rc1-ipmap/fs/nfsd/export.c --- linux-2.6.24-rc1/fs/nfsd/export.c2007-10-30 12:47:21.0 +0100 +++ linux-2.6.24-rc1-ipmap/fs/nfsd/export.c2007-10-30 17:18:21.0 +0100 @@ -35,6 +35,7 @@ #include linux/lockd/bind.h #include linux/sunrpc/msg_prot.h #include linux/sunrpc/gss_api.h +#include net/ipv6.h #define NFSDDBG_FACILITYNFSDDBG_EXPORT @@ -1556,6 +1557,7 @@ exp_addclient(struct nfsctl_client *ncp) { struct auth_domain*dom; inti, err; +struct in6_addr addr6; /* First, consistency check. */ err = -EINVAL; @@ -1574,9 +1576,12 @@ exp_addclient(struct nfsctl_client *ncp) goto out_unlock; /* Insert client into hashtable. */ -for (i = 0; i ncp-cl_naddr; i++) -auth_unix_add_addr(ncp-cl_addrlist[i], dom); - +for (i = 0; i ncp-cl_naddr; i++) { +/* Mapping address */ +ipv6_addr_set(addr6, 0, 0, +htonl(0x), ncp-cl_addrlist[i].s_addr); +auth_unix_add_addr(addr6, dom); +} auth_unix_forget_old(dom); auth_domain_put(dom); diff -p -u -r -N linux-2.6.24-rc1/fs/nfsd/nfsctl.c linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c --- linux-2.6.24-rc1/fs/nfsd/nfsctl.c2007-10-30 12:47:21.0 +0100 +++ linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c2007-10-30 17:15:45.0 +0100 @@ -37,6 +37,7 @@ #include linux/nfsd/syscall.h #include asm/uaccess.h +#include net/ipv6.h /* *We have a single directory with 9 nodes in it. 
@@ -222,6 +223,7 @@ static ssize_t write_getfs(struct file * struct auth_domain *clp; int err = 0; struct knfsd_fh *res; +struct in6_addr in6; if (size sizeof(*data)) return -EINVAL; @@ -236,7 +238,12 @@ static ssize_t write_getfs(struct file * res = (struct knfsd_fh*)buf; exp_readlock(); -if (!(clp = auth_unix_lookup(sin-sin_addr))) + +/* IPv6 address mapping */ +ipv6_addr_set(in6, 0, 0, +htonl(0x), (((struct sockaddr_in *)data-gd_addr)-sin_addr.s_addr)); + +if (!(clp = auth_unix_lookup(in6))) err = -EPERM; else { err = exp_rootfh(clp, data-gd_path, res, data-gd_maxlen); @@ -257,6 +264,7 @@ static ssize_t write_getfd(struct file * int err = 0; struct knfsd_fh fh; char *res; +struct in6_addr in6; if (size sizeof(*data)) return -EINVAL; @@ -271,7 +279,11 @@ static ssize_t write_getfd(struct file * res = buf; sin = (struct sockaddr_in *)data-gd_addr; exp_readlock(); -if (!(clp = auth_unix_lookup(sin-sin_addr))) + +/* IPv6 address mapping */ +ipv6_addr_set(in6, 0, 0, htonl(0x), (((struct sockaddr_in *)data-gd_addr)-sin_addr.s_addr)); + +if (!(clp = auth_unix_lookup(in6))) err = -EPERM; else { err = exp_rootfh(clp, data-gd_path, fh, NFS_FHSIZE); diff -p -u -r -N linux-2.6.24-rc1/include/linux/sunrpc/svcauth.h linux-2.6.24-rc1-ipmap/include/linux/sunrpc/svcauth.h --- linux-2.6.24-rc1/include/linux/sunrpc/svcauth.h2007-10-30 12:47:04.0 +0100 +++ linux-2.6.24-rc1-ipmap/include/linux/sunrpc/svcauth.h2007-10-30 13:14:04.0 +0100 @@ -120,10 +120,10 @@ extern voidsvc_auth_unregister(rpc_auth extern struct auth_domain *unix_domain_find(char *name); extern void auth_domain_put(struct auth_domain *item); -extern int auth_unix_add_addr(struct in_addr addr, struct auth_domain *dom); +extern int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom); extern struct auth_domain *auth_domain_lookup(char *name, struct auth_domain *new); extern struct auth_domain *auth_domain_find(char *name); -extern struct auth_domain *auth_unix_lookup(struct in_addr addr); +extern struct 
auth_domain *auth_unix_lookup(struct in6_addr *addr); extern int auth_unix_forget_old(struct auth_domain *dom); extern void svcauth_unix_purge(void); extern void svcauth_unix_info_release(void *); diff -p -u -r -N linux-2.6.24-rc1/net/sunrpc/svcauth_unix.c linux-2.6.24-rc1-ipmap/net/sunrpc/svcauth_unix.c --- linux-2.6.24-rc1/net/sunrpc/svcauth_unix.c2007-10-30 12:47:07.0 +0100 +++ linux-2.6.24-rc1-ipmap/net/sunrpc/svcauth_unix.c2007-10-30 17:17:00.0 +0100 @@ -11,7 +11,8 @@ #include linux/hash.h #include linux/string.h #include net/sock.h - +#include net/ipv6.h +#include linux/kernel.h #define RPCDBG_FACILITYRPCDBG_AUTH @@ -84,7 +85,7 @@ static void svcauth_unix_domain_release( struct ip_map { struct cache_headh; charm_class[8];
[PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl
Here is a second missing part of the IPv6 support in the NFS server code, concerning the knfsd syscall interface. It updates write_getfs and write_getfd to accept IPv6 addresses. Applies on a kernel including the ip_map cache modifications.

Tests: tested with only an IPv4 network and basic nfs ops (mount, file creation and modification)

Signed-off-by: Aurelien Charbon [EMAIL PROTECTED]
---
diff -p -u -r -N linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c linux-2.6.24-rc1-nfsctl/fs/nfsd/nfsctl.c
--- linux-2.6.24-rc1-ipmap/fs/nfsd/nfsctl.c	2007-10-30 17:15:45.000000000 +0100
+++ linux-2.6.24-rc1-nfsctl/fs/nfsd/nfsctl.c	2007-10-30 17:21:36.000000000 +0100
@@ -219,7 +219,7 @@ static ssize_t write_unexport(struct fil
 static ssize_t write_getfs(struct file *file, char *buf, size_t size)
 {
 	struct nfsctl_fsparm *data;
-	struct sockaddr_in *sin;
+	struct sockaddr_in6 *sin6, sin6_storage;
 	struct auth_domain *clp;
 	int err = 0;
 	struct knfsd_fh *res;
@@ -229,9 +229,21 @@ static ssize_t write_getfs(struct file *
 		return -EINVAL;
 	data = (struct nfsctl_fsparm*)buf;
 	err = -EPROTONOSUPPORT;
-	if (data->gd_addr.sa_family != AF_INET)
+	switch (data->gd_addr.sa_family) {
+	case AF_INET6:
+		sin6 = &sin6_storage;
+		sin6 = (struct sockaddr_in6 *)&data->gd_addr;
+		ipv6_addr_copy(&in6, &sin6->sin6_addr);
+		break;
+	case AF_INET:
+		/* Map v4 address into v6 structure */
+		ipv6_addr_set(&in6, 0, 0, htonl(0x0000ffff),
+			      ((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+		break;
+	default:
 		goto out;
-	sin = (struct sockaddr_in *)&data->gd_addr;
+	}
 
 	if (data->gd_maxlen > NFS3_FHSIZE)
 		data->gd_maxlen = NFS3_FHSIZE;
@@ -239,11 +251,7 @@ static ssize_t write_getfs(struct file *
 	exp_readlock();
 
-	/* IPv6 address mapping */
-	ipv6_addr_set(&in6, 0, 0, htonl(0x0000ffff),
-		      ((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
-
-	if (!(clp = auth_unix_lookup(&in6)))
+	if (!(clp = auth_unix_lookup(&in6)))
 		err = -EPERM;
 	else {
 		err = exp_rootfh(clp, data->gd_path, res, data->gd_maxlen);
@@ -259,7 +267,7 @@ static ssize_t write_getfs(struct file *
 static ssize_t write_getfd(struct file *file,
 		char *buf, size_t size)
 {
 	struct nfsctl_fdparm *data;
-	struct sockaddr_in *sin;
+	struct sockaddr_in6 *sin6, sin6_storage;
 	struct auth_domain *clp;
 	int err = 0;
 	struct knfsd_fh fh;
@@ -270,20 +278,31 @@ static ssize_t write_getfd(struct file *
 		return -EINVAL;
 	data = (struct nfsctl_fdparm*)buf;
 	err = -EPROTONOSUPPORT;
-	if (data->gd_addr.sa_family != AF_INET)
+	if (data->gd_addr.sa_family != AF_INET &&
+	    data->gd_addr.sa_family != AF_INET6)
 		goto out;
 	err = -EINVAL;
 	if (data->gd_version < 2 || data->gd_version > NFSSVC_MAXVERS)
 		goto out;
 	res = buf;
-	sin = (struct sockaddr_in *)&data->gd_addr;
 	exp_readlock();
-
-	/* IPv6 address mapping */
-	ipv6_addr_set(&in6, 0, 0, htonl(0x0000ffff),
-		      ((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
-	if (!(clp = auth_unix_lookup(&in6)))
+	switch (data->gd_addr.sa_family) {
+	case AF_INET:
+		/* IPv6 address mapping */
+		ipv6_addr_set(&in6, 0, 0, htonl(0x0000ffff),
+			      ((struct sockaddr_in *)&data->gd_addr)->sin_addr.s_addr);
+		break;
+	case AF_INET6:
+		sin6 = &sin6_storage;
+		sin6 = (struct sockaddr_in6 *)&data->gd_addr;
+		ipv6_addr_copy(&in6, &sin6->sin6_addr);
+		break;
+	default:
+		BUG();
+	}
+
+	if (!(clp = auth_unix_lookup(&in6)))
 		err = -EPERM;
 	else {
 		err = exp_rootfh(clp, data->gd_path, &fh, NFS_FHSIZE);
--
Aurelien Charbon
Linux NFSv4 team
Bull SAS
Echirolles - France
http://nfsv4.bullopensource.org/
[PATCH] ixgb: fix TX hangs under heavy load
A merge error occurred where we merged the wrong block here in version 1.0.120. The right condition for frags is slightly different than for the skb, so account for the difference properly and trim the TSO-based size right. Originally part of a fix reported by IBM to fix TSO hangs on pSeries hardware.

Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
Cc: Andy Gospodarek [EMAIL PROTECTED]
---
 drivers/net/ixgb/ixgb_main.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index e564335..3021234 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1321,8 +1321,8 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb,
 			/* Workaround for premature desc write-backs
 			 * in TSO mode.  Append 4-byte sentinel desc */
-			if (unlikely(mss && !nr_frags && size == len
-				     && size > 8))
+			if (unlikely(mss && (f == (nr_frags - 1))
+				     && size == len && size > 8))
 				size -= 4;
 
 			buffer_info->length = size;
--
Re: [PATCH] pegasos_eth.c: Fix compile error over MV643XX_ defines
Dale Farnsworth wrote: On Mon, Oct 29, 2007 at 05:27:29PM -0400, Luis R. Rodriguez wrote: This commit made an incorrect assumption: -- Author: Lennert Buytenhek [EMAIL PROTECTED] Date: Fri Oct 19 04:10:10 2007 +0200 mv643xx_eth: Move ethernet register definitions into private header Move the mv643xx's ethernet-related register definitions from include/linux/mv643xx.h into drivers/net/mv643xx_eth.h, since they aren't of any use outside the ethernet driver. Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Acked-by: Tzachi Perelstein [EMAIL PROTECTED] Signed-off-by: Dale Farnsworth [EMAIL PROTECTED] -- arch/powerpc/platforms/chrp/pegasos_eth.c made use of a 3 defines there. [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe v2.6.24-rc1-138-g0119130 This patch fixes this by internalizing 3 defines onto pegasos which are simply no longer available elsewhere. Without this your compile will fail That compile failure was fixed in commit 30e69bf4cce16d4c2dcfd629a60fcd8e1aba9fee by Al Viro. However, as I examine that commit, I see that it defines offsets from the eth block in the chip, rather than the full chip registeri block as the Pegasos 2 code expects. So, I think it fixes the compile failure, but leaves the Pegasos 2 broken. Luis, do you have Pegasos 2 hardware? Can you (or anyone) verify that the following patch is needed for the Pegasos 2? Thanks, -Dale - mv643xx_eth: Fix MV643XX_ETH offsets used by Pegasos 2 In the mv643xx_eth driver, we now use offsets from the ethernet register block within the chip, but the pegasos 2 platform still needs offsets from the full chip's register base address. 
Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
---
 include/linux/mv643xx_eth.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mv643xx_eth.h b/include/linux/mv643xx_eth.h
index 8df230a..30e11aa 100644
--- a/include/linux/mv643xx_eth.h
+++ b/include/linux/mv643xx_eth.h
@@ -8,9 +8,9 @@
 #define MV643XX_ETH_NAME			"mv643xx_eth"
 #define MV643XX_ETH_SHARED_REGS			0x2000
 #define MV643XX_ETH_SHARED_REGS_SIZE		0x2000
-#define MV643XX_ETH_BAR_4			0x220
-#define MV643XX_ETH_SIZE_REG_4			0x224
-#define MV643XX_ETH_BASE_ADDR_ENABLE_REG	0x0290
+#define MV643XX_ETH_BAR_4			0x2220
+#define MV643XX_ETH_SIZE_REG_4			0x2224
+#define MV643XX_ETH_BASE_ADDR_ENABLE_REG	0x2290

applied
Re: [PATCH 4/4] ixgbe: minor sparse fixes
Auke Kok wrote: From: Stephen Hemminger [EMAIL PROTECTED] Make strings const if possible, and fix includes so forward definitions are seen. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- drivers/net/ixgbe/ixgbe.h |2 +- drivers/net/ixgbe/ixgbe_82598.c |3 +-- drivers/net/ixgbe/ixgbe_main.c |9 + 3 files changed, 7 insertions(+), 7 deletions(-) applied 1-4 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e1000e: Fix typo !
Auke Kok wrote:

From: Roel Kluin [EMAIL PROTECTED]

Signed-off-by: Roel Kluin [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---
 drivers/net/e1000e/82571.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/82571.c b/drivers/net/e1000e/82571.c
index cf70522..14141a5 100644
--- a/drivers/net/e1000e/82571.c
+++ b/drivers/net/e1000e/82571.c
@@ -283,7 +283,7 @@ static s32 e1000_get_invariants_82571(struct e1000_adapter *adapter)
 			adapter->flags &= ~FLAG_HAS_WOL;
 		/* quad ports only support WoL on port A */
 		if (adapter->flags & FLAG_IS_QUAD_PORT &&
-		    (!adapter->flags & FLAG_IS_QUAD_PORT_A))
+		    (!(adapter->flags & FLAG_IS_QUAD_PORT_A)))
 			adapter->flags &= ~FLAG_HAS_WOL;
 		break;

applied
Re: [PATCH] ixgb: fix TX hangs under heavy load
Auke Kok wrote: A merge error occurred where we merged the wrong block here in version 1.0.120. The right condition for frags is slightly different then for the skb, so account for the difference properly and trim the TSO based size right. Originally part of a fix reported by IBM to fix TSO hangs on pSeries hardware. Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] Cc: Andy Gospodarek [EMAIL PROTECTED] --- drivers/net/ixgb/ixgb_main.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c applied - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] using mii-bitbang on different processor ports
The patch makes possible to have mdio and mdc pins on different physical ports also for CONFIG_PPC_CPM_NEW_BINDING. To setup it in the device tree: reg = 10d40 14 10d60 14; // mdc: 0x10d40, mdio: 0x10d60 or reg = 10d40 14; // mdc and mdio have the same offset 10d40 The approach was taken from older version. Signed-off-by: Sergej Stepanov [EMAIL PROTECTED] -- diff --git a/drivers/net/fs_enet/mii-bitbang.c b/drivers/net/fs_enet/mii-bitbang.c index b8e4a73..eea5feb 100644 --- a/drivers/net/fs_enet/mii-bitbang.c +++ b/drivers/net/fs_enet/mii-bitbang.c @@ -29,12 +29,16 @@ #include fs_enet.h -struct bb_info { - struct mdiobb_ctrl ctrl; +struct bb_port { __be32 __iomem *dir; __be32 __iomem *dat; - u32 mdio_msk; - u32 mdc_msk; + u32 msk; +}; + +struct bb_info { + struct mdiobb_ctrl ctrl; + struct bb_port mdc; + struct bb_port mdio; }; /* FIXME: If any other users of GPIO crop up, then these will have to @@ -62,18 +66,18 @@ static inline void mdio_dir(struct mdiobb_ctrl *ctrl, int dir) struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); if (dir) - bb_set(bitbang-dir, bitbang-mdio_msk); + bb_set(bitbang-mdio.dir, bitbang-mdio.msk); else - bb_clr(bitbang-dir, bitbang-mdio_msk); + bb_clr(bitbang-mdio.dir, bitbang-mdio.msk); /* Read back to flush the write. */ - in_be32(bitbang-dir); + in_be32(bitbang-mdio.dir); } static inline int mdio_read(struct mdiobb_ctrl *ctrl) { struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); - return bb_read(bitbang-dat, bitbang-mdio_msk); + return bb_read(bitbang-mdio.dat, bitbang-mdio.msk); } static inline void mdio(struct mdiobb_ctrl *ctrl, int what) @@ -81,12 +85,12 @@ static inline void mdio(struct mdiobb_ctrl *ctrl, int what) struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); if (what) - bb_set(bitbang-dat, bitbang-mdio_msk); + bb_set(bitbang-mdio.dat, bitbang-mdio.msk); else - bb_clr(bitbang-dat, bitbang-mdio_msk); + bb_clr(bitbang-mdio.dat, bitbang-mdio.msk); /* Read back to flush the write. 
*/ - in_be32(bitbang-dat); + in_be32(bitbang-mdio.dat); } static inline void mdc(struct mdiobb_ctrl *ctrl, int what) @@ -94,12 +98,12 @@ static inline void mdc(struct mdiobb_ctrl *ctrl, int what) struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); if (what) - bb_set(bitbang-dat, bitbang-mdc_msk); + bb_set(bitbang-mdc.dat, bitbang-mdc.msk); else - bb_clr(bitbang-dat, bitbang-mdc_msk); + bb_clr(bitbang-mdc.dat, bitbang-mdc.msk); /* Read back to flush the write. */ - in_be32(bitbang-dat); + in_be32(bitbang-mdc.dat); } static struct mdiobb_ops bb_ops = { @@ -114,23 +118,23 @@ static struct mdiobb_ops bb_ops = { static int __devinit fs_mii_bitbang_init(struct mii_bus *bus, struct device_node *np) { - struct resource res; + struct resource res[2]; const u32 *data; int mdio_pin, mdc_pin, len; struct bb_info *bitbang = bus-priv; - int ret = of_address_to_resource(np, 0, res); + int ret = of_address_to_resource(np, 0, res[0]); if (ret) return ret; - if (res.end - res.start 13) + if (res[0].end - res[0].start 13) return -ENODEV; /* This should really encode the pin number as well, but all * we get is an int, and the odds of multiple bitbang mdio buses * is low enough that it's not worth going too crazy. 
*/ - bus-id = res.start; + bus-id = res[0].start; data = of_get_property(np, fsl,mdio-pin, len); if (!data || len != 4) @@ -142,15 +146,32 @@ static int __devinit fs_mii_bitbang_init(struct mii_bus *bus, return -ENODEV; mdc_pin = *data; - bitbang-dir = ioremap(res.start, res.end - res.start + 1); - if (!bitbang-dir) + bitbang-mdc.dir = ioremap(res[0].start, res[0].end - res[0].start + 1); + if (!bitbang-mdc.dir) return -ENOMEM; - bitbang-dat = bitbang-dir + 4; - bitbang-mdio_msk = 1 (31 - mdio_pin); - bitbang-mdc_msk = 1 (31 - mdc_pin); + bitbang-mdc.dat = bitbang-mdc.dir + 4; + if( !of_address_to_resource(np, 1, res[1])) { + if (res[1].end - res[1].start 13) + goto bad_resource; + bitbang-mdio.dir = ioremap(res[1].start, res[1].end - res[1].start + 1); + if (!bitbang-mdio.dir) + goto unmap_and_exit; + bitbang-mdio.dat = bitbang-mdio.dir + 4; + } else { + bitbang-mdio.dir = bitbang-mdc.dir; +
[git patches] net driver fixes
Fixes, and a new DM9601 USB NIC id. Please pull from 'upstream-linus' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus to receive the following updates: drivers/net/bfin_mac.c|2 - drivers/net/e1000/e1000.h |8 +++ drivers/net/e1000/e1000_ethtool.c | 29 ++-- drivers/net/e1000/e1000_hw.c |4 +- drivers/net/e1000/e1000_main.c|7 + drivers/net/e1000/e1000_param.c | 23 ++- drivers/net/e1000e/82571.c|2 +- drivers/net/e1000e/ethtool.c |4 +- drivers/net/e1000e/param.c| 35 +++-- drivers/net/ixgb/ixgb.h |7 ++ drivers/net/ixgb/ixgb_ethtool.c |7 + drivers/net/ixgb/ixgb_hw.c|4 +- drivers/net/ixgb/ixgb_main.c | 15 +--- drivers/net/ixgb/ixgb_param.c | 43 +++-- drivers/net/ixgbe/ixgbe.h |2 +- drivers/net/ixgbe/ixgbe_82598.c |3 +- drivers/net/ixgbe/ixgbe_main.c|9 --- drivers/net/usb/dm9601.c |4 +++ include/linux/mv643xx_eth.h |6 ++-- 19 files changed, 110 insertions(+), 104 deletions(-) Auke Kok (1): ixgb: fix TX hangs under heavy load Dale Farnsworth (1): mv643xx_eth: Fix MV643XX_ETH offsets used by Pegasos 2 Michael Hennerich (1): Blackfin EMAC driver: Fix Ethernet communication bug (dupliated and lost packets) Peter Korsgaard (1): DM9601: Support for ADMtek ADM8515 NIC Roel Kluin (1): e1000e: Fix typo ! 
Stephen Hemminger (4): e1000e: fix sparse warnings ixgb: fix sparse warnings e1000: sparse warnings fixes ixgbe: minor sparse fixes diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c index 53fe7de..084acfd 100644 --- a/drivers/net/bfin_mac.c +++ b/drivers/net/bfin_mac.c @@ -371,7 +371,6 @@ static void bf537_adjust_link(struct net_device *dev) if (phydev-speed != lp-old_speed) { #if defined(CONFIG_BFIN_MAC_RMII) u32 opmode = bfin_read_EMAC_OPMODE(); - bf537mac_disable(); switch (phydev-speed) { case 10: opmode |= RMII_10; @@ -386,7 +385,6 @@ static void bf537_adjust_link(struct net_device *dev) break; } bfin_write_EMAC_OPMODE(opmode); - bf537mac_enable(); #endif new_state = 1; diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h index 781ed99..3b84028 100644 --- a/drivers/net/e1000/e1000.h +++ b/drivers/net/e1000/e1000.h @@ -351,4 +351,12 @@ enum e1000_state_t { __E1000_DOWN }; +extern char e1000_driver_name[]; +extern const char e1000_driver_version[]; + +extern void e1000_power_up_phy(struct e1000_adapter *); +extern void e1000_set_ethtool_ops(struct net_device *netdev); +extern void e1000_check_options(struct e1000_adapter *adapter); + + #endif /* _E1000_H_ */ diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c index 6c9a643..667f18b 100644 --- a/drivers/net/e1000/e1000_ethtool.c +++ b/drivers/net/e1000/e1000_ethtool.c @@ -32,9 +32,6 @@ #include asm/uaccess.h -extern char e1000_driver_name[]; -extern char e1000_driver_version[]; - extern int e1000_up(struct e1000_adapter *adapter); extern void e1000_down(struct e1000_adapter *adapter); extern void e1000_reinit_locked(struct e1000_adapter *adapter); @@ -733,16 +730,16 @@ err_setup: #define REG_PATTERN_TEST(R, M, W) \ { \ - uint32_t pat, value; \ - uint32_t test[] = \ + uint32_t pat, val; \ + const uint32_t test[] =\ {0x5A5A5A5A, 0xA5A5A5A5, 0x, 0x}; \ - for (pat = 0; pat ARRAY_SIZE(test); pat++) { \ + for (pat = 0; pat ARRAY_SIZE(test); pat++) { \ 
E1000_WRITE_REG(adapter-hw, R, (test[pat] W)); \ - value = E1000_READ_REG(adapter-hw, R); \ - if (value != (test[pat] W M)) { \ + val = E1000_READ_REG(adapter-hw, R); \ + if (val != (test[pat] W M)) { \ DPRINTK(DRV, ERR, pattern test reg %04X failed: got \ 0x%08X expected 0x%08X\n,\ - E1000_##R, value, (test[pat] W M));\ +
Re: [PATCH 1/2] NFS: change the ip_map cache code to handle IPv6 addresses
Thanks for working on this. Could you run linux/scripts/checkpatch.pl on your patch and fix the problems it complains about?

On Tue, Oct 30, 2007 at 06:05:42PM +0100, Aurélien Charbon wrote:

 static void update(struct cache_head *cnew, struct cache_head *citem)
 {
@@ -149,22 +157,24 @@ static void ip_map_request(struct cache_
 		struct cache_head *h,
 		char **bpp, int *blen)
 {
-	char text_addr[20];
+	char text_addr[40];
 	struct ip_map *im = container_of(h, struct ip_map, h);
-	__be32 addr = im->m_addr.s_addr;
-
-	snprintf(text_addr, 20, "%u.%u.%u.%u",
-		 ntohl(addr) >> 24 & 0xff,
-		 ntohl(addr) >> 16 & 0xff,
-		 ntohl(addr) >>  8 & 0xff,
-		 ntohl(addr) >>  0 & 0xff);
+	if (ipv6_addr_v4mapped(&(im->m_addr))) {
+		snprintf(text_addr, 20, NIPQUAD_FMT,
+			 ntohl(im->m_addr.s6_addr32[3]) >> 24 & 0xff,
+			 ntohl(im->m_addr.s6_addr32[3]) >> 16 & 0xff,
+			 ntohl(im->m_addr.s6_addr32[3]) >>  8 & 0xff,
+			 ntohl(im->m_addr.s6_addr32[3]) >>  0 & 0xff);
+	} else {
+		snprintf(text_addr, 40, NIP6_FMT, NIP6(im->m_addr));
+	}
 	qword_add(bpp, blen, im->m_class);
 	qword_add(bpp, blen, text_addr);
 	(*bpp)[-1] = '\n';
 }

What happens when an unpatched mountd gets this request? Does it ignore it, or respond with a negative entry?

--b.
Re: [PATCH v2] using mii-bitbang on different processor ports
Sergej Stepanov wrote:

+	if( !of_address_to_resource(np, 1, &res[1])) {

The spacing is still wrong.

-	iounmap(bitbang->dir);
+	if ( bitbang->mdio.dir != bitbang->mdc.dir)
+		iounmap(bitbang->mdio.dir);
+	iounmap(bitbang->mdc.dir);

And here.

-Scott
Re: [PATCH] DM9601: Support for ADMtek ADM8515 NIC
Peter Korsgaard wrote: Add device ID for the ADMtek ADM8515 USB NIC to the DM9601 driver. Signed-off-by: Peter Korsgaard [EMAIL PROTECTED] applied - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] Blackfin EMAC driver: Fix Ethernet communication bug (duplicated and lost packets)
Bryan Wu wrote:

From: Michael Hennerich [EMAIL PROTECTED]

Fix Ethernet communication bug (duplicated and lost packets) in RMII PHY mode: don't call mac_disable and mac_enable during 10/100 REFCLK changes -- mac_enable screws up the DMA descriptor chain.

Signed-off-by: Michael Hennerich [EMAIL PROTECTED]
Signed-off-by: Bryan Wu [EMAIL PROTECTED]
---
 drivers/net/bfin_mac.c | 2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

applied
Re: [PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl
Aurélien Charbon wrote:

Here is a second missing part of the IPv6 support in the NFS server code, concerning the knfsd syscall interface. It updates write_getfs and write_getfd to accept IPv6 addresses. Applies on a kernel including the ip_map cache modifications.

Both patches still have bugs; I think the patch I sent yesterday fixed them all, so I would recommend using that instead. Of course Neil's comment possibly trumps all that anyways...

-Brian
Re: [PATCH] ixgb: fix TX hangs under heavy load
On Tue, Oct 30, 2007 at 11:21:50AM -0700, Auke Kok wrote: A merge error occurred where we merged the wrong block here in version 1.0.120. The right condition for frags is slightly different then for the skb, so account for the difference properly and trim the TSO based size right. Originally part of a fix reported by IBM to fix TSO hangs on pSeries hardware. Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] Cc: Andy Gospodarek [EMAIL PROTECTED] --- Thanks, Auke and Jesse! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] remove claim balance_rr won't reorder on many to one
Remove the text which suggests that many balance_rr links feeding into a single uplink will not experience packet reordering. More up-to-date tests, with 1G links feeding into a switch with a 10G uplink, using a 2.6.23-rc8 kernel on the system on which the 1G links were bonded with balance_rr (mode=0) shows that even a many to one link configuration will experience packet reordering and the attendant TCP issues involving spurrious retransmissions and the congestion window. This happens even with a single, simple bulk transfer such as a netperf TCP_STREAM test. A more complete description of the tests and results, including tcptrace analysis of packet traces showing the degree of reordering and such can be found at: http://marc.info/?l=linux-netdevm=119101513406349w=2 Also, note that some switches use the term trunking in a context other than link aggregation. Signed-off-by: Rick Jones [EMAIL PROTECTED] --- diff -r 35e54d4beaad Documentation/networking/bonding.txt --- a/Documentation/networking/bonding.txt Wed Oct 24 05:06:40 2007 + +++ b/Documentation/networking/bonding.txt Mon Oct 29 03:47:19 2007 -0700 @@ -1696,23 +1696,6 @@ balance-rr: This mode is the only mode t interface's worth of throughput, even after adjusting tcp_reordering. - Note that this out of order delivery occurs when both the - sending and receiving systems are utilizing a multiple - interface bond. Consider a configuration in which a - balance-rr bond feeds into a single higher capacity network - channel (e.g., multiple 100Mb/sec ethernets feeding a single - gigabit ethernet via an etherchannel capable switch). In this - configuration, traffic sent from the multiple 100Mb devices to - a destination connected to the gigabit device will not see - packets out of order. However, traffic sent from the gigabit - device to the multiple 100Mb devices may or may not see - traffic out of order, depending upon the balance policy of the - switch. 
Many switches do not support any modes that stripe - traffic (instead choosing a port based upon IP or MAC level - addresses); for those devices, traffic flowing from the - gigabit device to the many 100Mb devices will only utilize one - interface. - If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram @@ -1720,7 +1703,9 @@ balance-rr: This mode is the only mode t to the bond. This mode requires the switch to have the appropriate ports - configured for etherchannel or trunking. + configured for etherchannel or aggregation. N.B. some + switches might use the term trunking for something other + than link aggregation. active-backup: There is not much advantage in this network topology to the active-backup mode, as the inactive backup devices are all - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[2.6 patch] fix drivers/net/wan/lmc/ compilation
Documentation/SubmitChecklist, point 1:

-- snip --
...
  CC      drivers/net/wan/lmc/lmc_main.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c: In function ‘lmc_ioctl’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c:239: error: expected expression before ‘else’
...
make[5]: *** [drivers/net/wan/lmc/lmc_main.o] Error 1
-- snip --

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---
d5e92a30491abf073e0a7f4d46b466c7c97f0f61
diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
index 64eb578..37c52e1 100644
--- a/drivers/net/wan/lmc/lmc_main.c
+++ b/drivers/net/wan/lmc/lmc_main.c
@@ -234,7 +234,7 @@ int lmc_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) /*fold00*/
 		sc->lmc_xinfo.Magic1 = 0xDEADBEEF;
 
 		if (copy_to_user(ifr->ifr_data, &sc->lmc_xinfo,
-				 sizeof(struct lmc_xinfo))) {
+				 sizeof(struct lmc_xinfo)))
 			ret = -EFAULT;
 		else
 			ret = 0;
Re: [PATCH] net: Saner thash_entries default with much memory
Next, machines that service that many sockets typically have them mostly with full transmit queues talking to a very slow receiver at the other end.

Not sure -- there are likely use cases with lots of idle but connected sockets. Also, the constraint here is not really how many sockets are served, but how well the hash function manages to spread them in the table. I don't have good data on that.

But still, (512 * 1024) sounds reasonable because e.g. in the lots-of-idle-sockets case you're probably fine with the hash chains having more than one entry in the worst case: a small working set will fit in cache, and as long as the chains do not end up very long, walking a short in-cache list will still be fast enough.

So to me (512 * 1024) is a very reasonable limit, and (with lockdep and spinlock debugging disabled) this makes the EHASH table consume 8MB on UP 64-bit and ~12MB on SMP 64-bit systems.

I still have my doubts that it makes sense to have its own lock for each bucket. It would probably be better to just divide the hash value by a factor again and then use that to index a smaller lock-only table.

-Andi
Re: [2.6 patch] fix drivers/net/wan/lmc/ compilation
Adrian Bunk wrote: Documentation/SubmitChecklist, point 1: -- snip -- ... CC drivers/net/wan/lmc/lmc_main.o /home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c: In function ‘lmc_ioctl’: /home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/net/wan/lmc/lmc_main.c:239: error: expected expression before ‘else’ ... make[5]: *** [drivers/net/wan/lmc/lmc_main.o] Error 1 -- snip -- Signed-off-by: Adrian Bunk [EMAIL PROTECTED] --- d5e92a30491abf073e0a7f4d46b466c7c97f0f61 diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c index 64eb578..37c52e1 100644 --- a/drivers/net/wan/lmc/lmc_main.c +++ b/drivers/net/wan/lmc/lmc_main.c @@ -234,7 +234,7 @@ int lmc_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) /*fold00*/ sc-lmc_xinfo.Magic1 = 0xDEADBEEF; if (copy_to_user(ifr-ifr_data, sc-lmc_xinfo, - sizeof(struct lmc_xinfo))) { + sizeof(struct lmc_xinfo))) ret = -EFAULT; else ret = 0; I am sorry, my patch broke this and Kristov Provost also noticed this. See http://lkml.org/lkml/2007/10/30/355 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/1][NETNS] resend: fix net released by rcu callback
Daniel Lezcano [EMAIL PROTECTED] writes:

 When a network namespace reference is held by a network subsystem, and when this reference is decremented in an rcu update callback, we must ensure that there is no more outstanding rcu update before trying to free the network namespace. In the normal case, the rcu_barrier is called when the network namespace is exiting in the cleanup_net function. But when a network namespace creation fails, and the subsystems are undone (like in the cleanup case), the rcu_barrier is missing. This patch adds the missing rcu_barrier.

Looks sane. Did you have any specific failures related to this, or was this something that was just caught in review?

Eric
Re: [patch 1/1][IPV6] resend: remove duplicate call to proc_net_remove
Daniel Lezcano [EMAIL PROTECTED] writes:

 The file /proc/net/if_inet6 is removed twice. First time in:

   inet6_exit -> addrconf_cleanup

 And followed a few lines after by:

   inet6_exit -> if6_proc_exit

 Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]

Acked-by: Eric W. Biederman [EMAIL PROTECTED]

Looks like a good clean up to me.

 ---
  net/ipv6/addrconf.c | 4 ----
  1 file changed, 4 deletions(-)

 Index: net-2.6/net/ipv6/addrconf.c
 ===================================================================
 --- net-2.6.orig/net/ipv6/addrconf.c
 +++ net-2.6/net/ipv6/addrconf.c
 @@ -4288,8 +4288,4 @@ void __exit addrconf_cleanup(void)
  	del_timer(&addr_chk_timer);

  	rtnl_unlock();
 -
 -#ifdef CONFIG_PROC_FS
 -	proc_net_remove(init_net, "if_inet6");
 -#endif
  }
Re: [PATCH] remove claim balance_rr won't reorder on many to one
Rick Jones [EMAIL PROTECTED] wrote:
[...]
 -	Note that this out of order delivery occurs when both the
 -	sending and receiving systems are utilizing a multiple
 -	interface bond.  Consider a configuration in which a
 -	balance-rr bond feeds into a single higher capacity network
 -	channel (e.g., multiple 100Mb/sec ethernets feeding a single
 -	gigabit ethernet via an etherchannel capable switch).  In this
 -	configuration, traffic sent from the multiple 100Mb devices to
 -	a destination connected to the gigabit device will not see
 -	packets out of order.  However, traffic sent from the gigabit
 -	device to the multiple 100Mb devices may or may not see
 -	traffic out of order, depending upon the balance policy of the
 -	switch.  Many switches do not support any modes that stripe
 -	traffic (instead choosing a port based upon IP or MAC level
 -	addresses); for those devices, traffic flowing from the
 -	gigabit device to the many 100Mb devices will only utilize one
 -	interface.

Rather than simply removing this entirely (because I do think there is value in discussion of the reordering aspects of balance-rr), I'd rather see something that makes the following points:

1- The worst reordering is balance-rr to balance-rr, back to back. The reordering rate here depends upon (a) the number of slaves involved and (b) packet reception scheduling behaviors (packet coalescing, NAPI, etc), and thus will vary significantly, but won't be better than case #2.

2- Next worst is balance-rr many-slow to single-fast, with the reordering rate generally being substantially lower than case #1 (it looked like your test showed about a 1% reordering rate, if I'm reading your data correctly).

3- For the single-fast to balance-rr-many case, going through a switch configured for etherchannel may or may not see traffic out of order, depending upon the balance policy of the switch.
Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic flowing from the [single fast] device to the [balance-rr many] devices will only utilize one interface.

[...]
 	This mode requires the switch to have the appropriate ports
 -	configured for etherchannel or trunking.
 +	configured for etherchannel or aggregation.  N.B. some
 +	switches might use the term trunking for something other
 +	than link aggregation.

If memory serves, Sun uses the term trunking to refer to etherchannel compatible behavior. I'm also hearing aggregation used to describe 802.3ad specifically. Perhaps text of the form:

	This mode requires the switch to have the appropriate ports
	configured for Etherchannel.  Some switches use different
	terms, so the configuration may be called trunking or
	aggregation.  Note that both of these terms also have other
	meanings.  For example, trunking is also used to describe a
	type of switch port, and aggregation or link aggregation is
	often used to refer to 802.3ad link aggregation, which is
	compatible with bonding's 802.3ad mode, but not balance-rr.

Thoughts?

-J

---
	-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
Re: [PATCH] net: Saner thash_entries default with much memory
From: Jean Delvare [EMAIL PROTECTED] Date: Tue, 30 Oct 2007 14:18:27 +0100

 OK, let's go with (512 * 1024) then. Want me to send an updated patch?

Why submit a patch that's already in Linus's tree :-)
Re: [PATCH 23/33] netvm: skb processing
On Tue, 30 Oct 2007 17:04:24 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

 In order to make sure emergency packets receive all memory needed to proceed, ensure processing of emergency SKBs happens under PF_MEMALLOC. Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing. Skip taps, since those are user-space again.

 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  include/net/sock.h |    5 +
  net/core/dev.c     |   44 ++--
  net/core/sock.c    |   18 ++
  3 files changed, 61 insertions(+), 6 deletions(-)

 Index: linux-2.6/net/core/dev.c
 ===================================================================
 --- linux-2.6.orig/net/core/dev.c
 +++ linux-2.6/net/core/dev.c
 @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
  	struct net_device *orig_dev;
  	int ret = NET_RX_DROP;
  	__be16 type;
 +	unsigned long pflags = current->flags;
 +
 +	/* Emergency skb are special, they should
 +	 *  - be delivered to SOCK_MEMALLOC sockets only
 +	 *  - stay away from userspace
 +	 *  - have bounded memory usage
 +	 *
 +	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
 +	 * This saves us from propagating the allocation context down to all
 +	 * allocation sites.
 +	 */
 +	if (skb_emergency(skb))
 +		current->flags |= PF_MEMALLOC;

  	/* if we've gotten here through NAPI, check netpoll */
  	if (netpoll_receive_skb(skb))
 -		return NET_RX_DROP;
 +		goto out;

Why the change? doesn't gcc optimize the common exit case anyway?
  	if (!skb->tstamp.tv64)
  		net_timestamp(skb);

 @@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk

  	orig_dev = skb_bond(skb);
  	if (!orig_dev)
 -		return NET_RX_DROP;
 +		goto out;

  	__get_cpu_var(netdev_rx_stat).total++;

 @@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk
  	}
  #endif

 +	if (skb_emergency(skb))
 +		goto skip_taps;
 +
  	list_for_each_entry_rcu(ptype, &ptype_all, list) {
  		if (!ptype->dev || ptype->dev == skb->dev) {
  			if (pt_prev)
 @@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk
  		}
  	}

 +skip_taps:
  #ifdef CONFIG_NET_CLS_ACT
  	if (pt_prev) {
  		ret = deliver_skb(skb, pt_prev, orig_dev);
 @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk

  	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
  		kfree_skb(skb);
 -		goto out;
 +		goto unlock;
  	}

  	skb->tc_verd = 0;
 ncls:
  #endif

 +	if (skb_emergency(skb))
 +		switch(skb->protocol) {
 +		case __constant_htons(ETH_P_ARP):
 +		case __constant_htons(ETH_P_IP):
 +		case __constant_htons(ETH_P_IPV6):
 +		case __constant_htons(ETH_P_8021Q):
 +			break;

Indentation is wrong, and hard coding protocol values as special case seems bad here. What about vlan's, etc?

 +		default:
 +			goto drop;
 +		}
 +
  	skb = handle_bridge(skb, pt_prev, ret, orig_dev);
  	if (!skb)
 -		goto out;
 +		goto unlock;

  	skb = handle_macvlan(skb, pt_prev, ret, orig_dev);
  	if (!skb)
 -		goto out;
 +		goto unlock;

  	type = skb->protocol;
  	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
 @@ -2056,6 +2085,7 @@ ncls:
  	if (pt_prev) {
  		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
  	} else {
 +drop:
  		kfree_skb(skb);
  		/* Jamal, now you will not able to escape explaining
  		 * me how you were going to use this.
  		 :-)

 @@ -2063,8 +2093,10 @@ ncls:
  		ret = NET_RX_DROP;
  	}

 -out:
 +unlock:
  	rcu_read_unlock();
 +out:
 +	tsk_restore_flags(current, pflags, PF_MEMALLOC);
  	return ret;
  }

 Index: linux-2.6/include/net/sock.h
 ===================================================================
 --- linux-2.6.orig/include/net/sock.h
 +++ linux-2.6/include/net/sock.h
 @@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct
  	skb->next = NULL;
  }

 +extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
 +
  static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
  {
 +	if (skb_emergency(skb))
 +		return __sk_backlog_rcv(sk, skb);
 +
  	return sk->sk_backlog_rcv(sk, skb);
  }

 Index: linux-2.6/net/core/sock.c
 ===================================================================
 --- linux-2.6.orig/net/core/sock.c
 +++ linux-2.6/net/core/sock.c
 @@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
  }
  EXPORT_SYMBOL_GPL(sk_clear_memalloc);

 +#ifdef
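The emergency-skb protocol whitelist quoted in this review can be sketched in user space. The function name below is illustrative; the ethertype values are the standard ones. (The kernel patch uses __constant_htons() in the case labels because case labels must be compile-time constants, so the byte swap has to happen at compile time.)

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Standard ethertypes, host byte order. */
#define ETH_P_IP    0x0800
#define ETH_P_ARP   0x0806
#define ETH_P_8021Q 0x8100
#define ETH_P_IPV6  0x86DD

/* Hypothetical analogue of the whitelist in the quoted hunk: an
 * emergency packet is only allowed through for a fixed set of
 * protocols; everything else would be dropped.  wire_proto is in
 * network byte order, as skb->protocol is. */
static int emergency_protocol_ok(uint16_t wire_proto)
{
	return wire_proto == htons(ETH_P_ARP)  ||
	       wire_proto == htons(ETH_P_IP)   ||
	       wire_proto == htons(ETH_P_IPV6) ||
	       wire_proto == htons(ETH_P_8021Q);
}
```

This also makes Stephen's objection concrete: the whitelist is a closed enumeration, so any protocol not listed (including traffic inside a VLAN tag, which needs a second look after the 802.1Q header is stripped) is silently excluded.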
Re: [PATCH 23/33] netvm: skb processing
On Tue, 2007-10-30 at 14:26 -0700, Stephen Hemminger wrote:

 On Tue, 30 Oct 2007 17:04:24 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:

  In order to make sure emergency packets receive all memory needed to proceed, ensure processing of emergency SKBs happens under PF_MEMALLOC. Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing. Skip taps, since those are user-space again.

  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/net/sock.h |    5 +
   net/core/dev.c     |   44 ++--
   net/core/sock.c    |   18 ++
   3 files changed, 61 insertions(+), 6 deletions(-)

  Index: linux-2.6/net/core/dev.c
  ===================================================================
  --- linux-2.6.orig/net/core/dev.c
  +++ linux-2.6/net/core/dev.c
  @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
   	struct net_device *orig_dev;
   	int ret = NET_RX_DROP;
   	__be16 type;
  +	unsigned long pflags = current->flags;
  +
  +	/* Emergency skb are special, they should
  +	 *  - be delivered to SOCK_MEMALLOC sockets only
  +	 *  - stay away from userspace
  +	 *  - have bounded memory usage
  +	 *
  +	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
  +	 * This saves us from propagating the allocation context down to all
  +	 * allocation sites.
  +	 */
  +	if (skb_emergency(skb))
  +		current->flags |= PF_MEMALLOC;

   	/* if we've gotten here through NAPI, check netpoll */
   	if (netpoll_receive_skb(skb))
  -		return NET_RX_DROP;
  +		goto out;

 Why the change? doesn't gcc optimize the common exit case anyway?

It needs to unset PF_MEMALLOC at the exit.

  @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk

   	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
   		kfree_skb(skb);
  -		goto out;
  +		goto unlock;
   	}

   	skb->tc_verd = 0;
  ncls:
  #endif

  +	if (skb_emergency(skb))
  +		switch(skb->protocol) {
  +		case __constant_htons(ETH_P_ARP):
  +		case __constant_htons(ETH_P_IP):
  +		case __constant_htons(ETH_P_IPV6):
  +		case __constant_htons(ETH_P_8021Q):
  +			break;

 Indentation is wrong, and hard coding protocol values as special case seems bad here. What about vlan's, etc?
The other protocols need analysis of what memory allocations occur during packet processing; if anything is done that is not yet accounted for (skb, route cache), then that needs to be added to a reserve, and if there are any paths that could touch user-space, those need to be handled. I've started looking at a few others, but it's hard and difficult work if one is not familiar with the protocols.

  @@ -2063,8 +2093,10 @@ ncls:
   		ret = NET_RX_DROP;
   	}

  -out:
  +unlock:
   	rcu_read_unlock();
  +out:
  +	tsk_restore_flags(current, pflags, PF_MEMALLOC);
   	return ret;
   }

It's that tsk_restore_flags() there that requires the s/return/goto/ stuff you noted earlier.

 I am still not convinced that this solves the problem well enough to be useful. Can you really survive a heavy memory overcommit?

On a machine with mem=128M, I've run 4 processes of 64M, 2 file-backed with the files on NFS, 2 anonymous. The processes just cycle through the memory using writes. This is a 100% overcommit. During these tests I've run various network loads. I've shut down the NFS server, waited for say 15 minutes, and restarted the NFS server, and the machine came back up and continued.

 In other words, can you prove that the added complexity causes the system to survive a real test where otherwise it would not?

I've put some statistics in the skb reserve allocations; those are most definitely used. I'm quite certain the machine would lock up solid without it.
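The save/restore pattern being discussed -- snapshot the task flags on entry, OR in PF_MEMALLOC, and restore only that bit on every exit path -- can be sketched with a plain bit-mask helper. The helper name and flag value below are illustrative, not the kernel's definitions:

```c
#include <assert.h>

/* Illustrative flag value; the point is the masking, not the number. */
#define PF_MEMALLOC 0x0800UL

/* Sketch of a tsk_restore_flags()-style helper: restore only the
 * masked bit(s) to their saved value, leaving every other bit as it
 * currently is.  This is why all return paths must funnel through one
 * exit label -- the restore has to run on each of them. */
static unsigned long restore_flags(unsigned long cur, unsigned long saved,
				   unsigned long mask)
{
	return (cur & ~mask) | (saved & mask);
}
```

Restoring just the masked bit, instead of assigning the whole saved word back, matters because other flag bits may legitimately have changed while the packet was being processed.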
Re: [patch 1/1][NETNS] resend: fix net released by rcu callback
Eric W. Biederman wrote:

 Daniel Lezcano [EMAIL PROTECTED] writes:

  When a network namespace reference is held by a network subsystem, and when this reference is decremented in an rcu update callback, we must ensure that there is no more outstanding rcu update before trying to free the network namespace. In the normal case, the rcu_barrier is called when the network namespace is exiting in the cleanup_net function. But when a network namespace creation fails, and the subsystems are undone (like in the cleanup case), the rcu_barrier is missing. This patch adds the missing rcu_barrier.

 Looks sane. Did you have any specific failures related to this or was this something that was just caught in review?

Yes, I had this problem when doing ipv6 isolation for netns49. The ipv6 subsystem creation failed and the different subsystems were rolled back in the setup_net function. When the network namespace was about to be freed in the free_net function, I had the error, with a usage refcount different from zero. It appears that was coming from core/neighbour.c:

	neigh_parms_release
	  -> neigh_rcu_free_parms
	    -> neigh_parms_put
	      -> neigh_parms_destroy
	        -> release_net

The free_net function was called before the rcu callback neigh_rcu_free_parms.
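The failure Daniel describes can be boiled down to a tiny user-space analogue: the last reference is dropped from a deferred (RCU-style) callback, so freeing the namespace before draining the callback queue still sees a nonzero refcount. The single-slot queue and all names here are illustrative, not the kernel's RCU machinery:

```c
#include <assert.h>
#include <stddef.h>

struct net_ns { int count; };

typedef void (*cb_fn)(struct net_ns *);

/* One pending deferred callback, standing in for the RCU queue. */
static cb_fn pending_cb;
static struct net_ns *pending_arg;

static void call_deferred(cb_fn fn, struct net_ns *ns)  /* ~call_rcu */
{
	pending_cb = fn;
	pending_arg = ns;
}

static void barrier_deferred(void)                      /* ~rcu_barrier */
{
	/* run the pending callback to completion before returning */
	if (pending_cb) {
		pending_cb(pending_arg);
		pending_cb = NULL;
	}
}

static void put_ns(struct net_ns *ns)                   /* ~release_net */
{
	ns->count--;
}
```

Without the barrier, checking the refcount right after call_deferred() still shows the reference held, which is the "usage refcount different from zero" error in the rolled-back setup_net path; the fix is to run the barrier on the error path too, before the free.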