Re: tbench regression in 2.6.25-rc1

2008-02-20 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Wed, 20 Feb 2008 08:38:17 +0100

> Thanks very much Yanmin, I think we can apply your patch as is, if no 
> regression was found for 32bits.

Great.  Can I get a resubmission of the patch with a cleaned-up
changelog entry that describes the regression, along with the
changelog bits I saw in the most recent version of the patch?

An explicit "Acked-by:" from Eric would be nice too :-)

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lksctp-developers] [PATCH][SCTP]: Pick up an orphaned sctp_sockets_allocated counter.

2008-02-20 Thread David Miller
From: Vlad Yasevich <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 11:43:44 -0500

> Pavel Emelyanov wrote:
> > This counter is currently write-only.
> > 
> > Drawing an analogy with the similar tcp counter, I think
> > that this one should be pointed by the sockets_allocated
> > members of sctp_prot and sctpv6_prot.
> > 
> > Or should it be instead removed at all?
> > 
> > Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> 
> Ack.  Looks like it got missed.  It should be added.

Applied, thanks everyone.


Re: [2.6.25 patch] fix broken error handling in ieee80211_sta_process_addba_request()

2008-02-20 Thread Tomas Winkler
On Feb 20, 2008 8:46 AM, Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> On 19-02-2008 23:58, Adrian Bunk wrote:
> ...
> > --- a/net/mac80211/ieee80211_sta.c
> > +++ b/net/mac80211/ieee80211_sta.c
> > @@ -1116,9 +1116,10 @@ static void ieee80211_sta_process_addba_request(struct net_device *dev,
> ...
> > + printk(KERN_ERR "can not allocate reordering buffer "
>
>   + printk(KERN_ERR "cannot allocate reordering buffer "
>
> Probably this can be fixed during the commit.
>
> Jarek P
ACK: both patches.


Re: tbench regression in 2.6.25-rc1

2008-02-20 Thread Zhang, Yanmin
Compared with kernel 2.6.24, the tbench result has a regression with
2.6.25-rc1.
1) On a 2 quad-core processor stoakley: 4%.
2) On a 4 quad-core processor tigerton: more than 30%.

Bisect located the patch below.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.

The above patch changes the cache line alignment, especially of member __refcnt.
I did a test by adding 2 unsigned long paddings before lastuse, so the 3 members,
lastuse/__refcnt/__use, are moved to the next cache line. The performance is
recovered.

I created a patch to rearrange the members in struct dst_entry.

With Eric's and Valdis Kletnieks's suggestions, I made a finer arrangement.
1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so
sizeof(dst_entry)=200 no matter whether CONFIG_NET_CLS_ROUTE is y or n.
I tested many patches on my 16-core tigerton by moving tclassid to
different places. It looks like tclassid could also have an impact on
performance. If tclassid is moved before metrics, or not moved at all,
the performance isn't good, so I moved it behind metrics.
2) Add comments before __refcnt.

On 16-core tigerton:
If CONFIG_NET_CLS_ROUTE=y, the result with the patch below is about 18%
better than the one without the patch;
if CONFIG_NET_CLS_ROUTE=n, the result with the patch below is about 30%
better than the one without the patch.

With 32-bit 2.6.25-rc1 on 8-core stoakley, the new patch doesn't introduce
a regression.

Thanks to Eric, Valdis, and David!

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
Acked-by: Eric Dumazet <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




RE: e1000: Question about polling

2008-02-20 Thread Brandeburg, Jesse
Badalian Vyacheslav wrote:
> Hello all.
> 
> Interesting think:
> 
> Have PC that do NAT. Bandwidth about 600 mbs.
> 
> Have  4 CPU (2xCoRe 2 DUO "HT OFF" 3.2 HZ).
> 
> irqbalance in kernel is off.
> 
> nat2 ~ # cat /proc/irq/217/smp_affinity
> 0001
This binds all IRQ 217 interrupts to CPU 0.

> nat2 ~ # cat /proc/irq/218/smp_affinity
> 0003

Do you mean to be balancing interrupts between CPU 0 and CPU 1 here?
1 = cpu 0
2 = cpu 1
4 = cpu 2
8 = cpu 3

So 1+2 = 3 for IRQ 218, i.e. balancing between the two.

Sometimes the CPUs will have a paired cache; depending on your BIOS it
will be organized like cpu 0/2 = shared cache, and cpu 1/3 = shared
cache.
You can find this out by looking at the physical id and core id in
/proc/cpuinfo.
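The smp_affinity masks above are just per-CPU bits OR'd together. A tiny
illustrative helper (hypothetical, not a standard tool) for composing them:

```shell
# cpu_mask: build an smp_affinity hex mask from a list of CPU numbers.
cpu_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))   # one bit per CPU
  done
  printf '%x\n' "$mask"
}

cpu_mask 0      # prints 1
cpu_mask 0 1    # prints 3
cpu_mask 1 3    # prints a
# echo "$(cpu_mask 2)" > /proc/irq/218/smp_affinity   # pin IRQ 218 to cpu 2 (root only)
```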

> Load SI on CPU0 and CPU1 is about 90%
> 
> Good... try do
> echo  > /proc/irq/217/smp_affinity
> echo  > /proc/irq/218/smp_affinity
> 
> Get 100% SI at CPU0
> 
> Question Why?

Because as each adapter generating interrupts gets rotated through cpu0,
it gets "stuck" on cpu0: the napi scheduling can only run one at
a time, so each is always waiting in line behind the other to run
its napi poll, always fills its quota (work_done is always != 0), and
keeps interrupts disabled "forever".

> I listen that if use IRQ from 1 netdevice to 1 CPU i can get 30%
> perfomance... but i have 4 CPU... i must get more perfomance if i cat
> ""  to smp_affinity.

Only if your performance is not cache limited but cpu horsepower
limited.  You're sacrificing cache coherency for cpu power, but if that
works for you then great.
 
> picture looks liks this:
> 0-3 CPU get over 50% SI bandwith up 55% SI... bandwith up...
> 100% SI on CPU0
> 
> I remember patch to fix problem like it... patched function
> e1000_clean...  kernel on pc have this patch (2.6.24-rc7-git2)...
> e1000 driver work much better (i up to 1.5-2x bandwidth before i get
> 100% SI), but i think that it not get 100% that it can =)

the patch helps a little because it decreases the amount of time the
driver spends in napi mode, basically shortening the exit condition
(which reenables interrupts, and therefore balancing) to work_done <
budget, not work_done == 0.

> Thanks for answers and sorry for my English

You basically can't get much more than one cpu can do for each nic.  It's
possible to get a little more, but my guess is you won't get much.  The
best thing you can do is make sure as much traffic as possible stays in
the same cache, on two different cores.

you can try turning off NAPI mode either in the .config, or build the
sourceforge driver with CFLAGS_EXTRA=-DE1000_NO_NAPI,  which seems
counterintuitive, but with the non-napi e1000 pushing packets to the
backlog queue on each cpu, you may actually get better performance due
to the balancing.

some day soon (maybe) we'll have some coherent way to have one tx and rx
interrupt per core, and enough queues for each port to be able to handle
1 queue per core.

good luck,
  Jesse  


Re: [PATCH] [NETNS]: Namespace leak in pneigh_lookup.

2008-02-20 Thread David Miller
From: "Denis V. Lunev" <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 16:12:38 +0300

> release_net is missed on the error path in pneigh_lookup.
> 
> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

Applied, thanks a lot.


Re: [PATCH] [NETLABEL] Minor cleanup: remove unused method definition

2008-02-20 Thread David Miller
From: Paul Moore <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 11:08:27 -0500

> On Tuesday 19 February 2008 9:25:31 am Rami Rosen wrote:
> > Hi,
> >
> > This patch removes definition of netlbl_cfg_cipsov4_del() method in
> > netlabel/netlabel_kapi.c and in include/net/netlabel.h as it is not
> > used.
> >
> >
> > Regards,
> > Rami Rosen
> >
> >
> > Signed-off-by: Rami Rosen <[EMAIL PROTECTED]>
> 
> This was added for use by Smack (and any other LSMs which want to 
> configure NetLabel directly) and since this is an area that is 
> undergoing a lot of churn at this point I'd prefer if this function was 
> left in place for the time being.
> 
> At a later date if this function is still unused, I'll gladly ack it's 
> removal or do so myself.

Ok, let's leave it in for now.


Re: [net-2.6][DRIVER][VETH] fix dev refcount race

2008-02-20 Thread David Miller
From: Daniel Lezcano <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 17:18:00 +0100

> veth: fix dev refcount race
> 
> When deleting the veth driver, veth_close calls netif_carrier_off
> for the two extremities of the network device. netif_carrier_off on
> the peer device will fire an event and hold a reference on the peer
> device. Just after, the peer is unregistered, taking the rtnl_lock while
> the linkwatch_event is scheduled. If __linkwatch_run_queue does not
> occur before the unregistering, unregister_netdevice will wait for
> the dev refcount to reach zero holding the rtnl_lock, while linkwatch_event
> will wait for the rtnl_lock holding the dev refcount.
> 
> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

Thank you for fixing this bug, patch applied.


Re: e1000: Question about polling

2008-02-20 Thread Badalian Vyacheslav
Very big thanks for this answer. You ask for all my questions and for 
all future questions too. Thanks Again!




Re: e1000: Question about polling

2008-02-20 Thread Badalian Vyacheslav
Sorry for little information and mistakes in letter. Jesse Brandeburg 
ask for all my questions. In future i will try to be more accurate then 
write letters and post more info.

Please not think that it disrespect for you. Its simple language barrier =(

> On 18-02-2008 10:18, Badalian Vyacheslav wrote:
>> Hello all.
>
> Hi,
>
>> Interesting think:
>>
>> Have PC that do NAT. Bandwidth about 600 mbs.
>>
>> Have  4 CPU (2xCoRe 2 DUO "HT OFF" 3.2 HZ).
>>
>> irqbalance in kernel is off.
>>
>> nat2 ~ # cat /proc/irq/217/smp_affinity
>> 0001
>> nat2 ~ # cat /proc/irq/218/smp_affinity
>> 0003
>>
>> Load SI on CPU0 and CPU1 is about 90%
>>
>> Good... try do
>> echo  > /proc/irq/217/smp_affinity
>> echo  > /proc/irq/218/smp_affinity
>>
>> Get 100% SI at CPU0
>>
>> Question Why?
>
> I think you should show here /proc/interrupts in all these cases.
>
>> I listen that if use IRQ from 1 netdevice to 1 CPU i can get 30%
>> perfomance... but i have 4 CPU... i must get more perfomance if i cat
>> ""  to smp_affinity.
>>
>> picture looks liks this:
>> 0-3 CPU get over 50% SI bandwith up 55% SI... bandwith up...
>> 100% SI on CPU0
>>
>> I remember patch to fix problem like it... patched function
>> e1000_clean...  kernel on pc have this patch (2.6.24-rc7-git2)... e1000
>> driver work much better (i up to 1.5-2x bandwidth before i get 100% SI),
>> but i think that it not get 100% that it can =)
>
> If some patch works for you, and you can show here its advantages,
> you should probably add here some link and request for merging.
>
> BTW, I wonder if you tried to check if changing CONFIG_HZ makes any
> difference here?
>
> Regards,
> Jarek P.




Re: e1000: Question about polling

2008-02-20 Thread Jarek Poplawski
On Wed, Feb 20, 2008 at 12:25:32PM +0300, Badalian Vyacheslav wrote:
...
> Please not think that it disrespect for you. Its simple language barrier =(

OK! Don't disrespect for me -  I'll try fix my English next time!)

Jarek P.


[PATCH 03/04] smc91x: make superh use default config

2008-02-20 Thread Magnus Damm
Removes superh board specific configuration from the header file. These boards
will instead be configured using platform data.

Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
---

 drivers/net/smc91x.h |   30 --
 1 file changed, 30 deletions(-)

--- 0018/drivers/net/smc91x.h
+++ work/drivers/net/smc91x.h   2008-02-20 17:09:28.0 +0900
@@ -292,36 +292,6 @@ SMC_outw(u16 val, void __iomem *ioaddr, 
 #define SMC_insw(a, r, p, l)   insw((a) + (r), p, l)
 #define SMC_outsw(a, r, p, l)  outsw((a) + (r), p, l)
 
-#elif   defined(CONFIG_SUPERH)
-
-#ifdef CONFIG_SOLUTION_ENGINE
-#define SMC_IRQ_FLAGS  (0)
-#define SMC_CAN_USE_8BIT   0
-#define SMC_CAN_USE_16BIT  1
-#define SMC_CAN_USE_32BIT  0
-#define SMC_IO_SHIFT   0
-#define SMC_NOWAIT 1
-
-#define SMC_inw(a, r)  inw((a) + (r))
-#define SMC_outw(v, a, r)  outw(v, (a) + (r))
-#define SMC_insw(a, r, p, l)   insw((a) + (r), p, l)
-#define SMC_outsw(a, r, p, l)  outsw((a) + (r), p, l)
-
-#else /* BOARDS */
-
-#define SMC_CAN_USE_8BIT   1
-#define SMC_CAN_USE_16BIT  1
-#define SMC_CAN_USE_32BIT  0
-
-#define SMC_inb(a, r)  inb((a) + (r))
-#define SMC_inw(a, r)  inw((a) + (r))
-#define SMC_outb(v, a, r)  outb(v, (a) + (r))
-#define SMC_outw(v, a, r)  outw(v, (a) + (r))
-#define SMC_insw(a, r, p, l)   insw((a) + (r), p, l)
-#define SMC_outsw(a, r, p, l)  outsw((a) + (r), p, l)
-
-#endif  /* BOARDS */
-
 #elif   defined(CONFIG_M32R)
 
 #define SMC_CAN_USE_8BIT   0


[PATCH 04/04] smc91x: add insw/outsw to default config

2008-02-20 Thread Magnus Damm
This patch makes sure SMC_insw()/SMC_outsw() are defined for the
default configuration. Without this change BUG()s will be triggered
when using 16-bit only platform data and the default configuration.

Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
---

 drivers/net/smc91x.h |2 ++
 1 file changed, 2 insertions(+)

--- 0019/drivers/net/smc91x.h
+++ work/drivers/net/smc91x.h   2008-02-20 17:11:19.0 +0900
@@ -446,6 +446,8 @@ static inline void LPD7_SMC_outsw (unsig
 #define SMC_outb(v, a, r)  writeb(v, (a) + (r))
 #define SMC_outw(v, a, r)  writew(v, (a) + (r))
 #define SMC_outl(v, a, r)  writel(v, (a) + (r))
+#define SMC_insw(a, r, p, l)   readsw((a) + (r), p, l)
+#define SMC_outsw(a, r, p, l)  writesw((a) + (r), p, l)
 #define SMC_insl(a, r, p, l)   readsl((a) + (r), p, l)
 #define SMC_outsl(a, r, p, l)  writesl((a) + (r), p, l)
 


[PATCH 02/04] smc91x: introduce platform data flags

2008-02-20 Thread Magnus Damm
This patch introduces struct smc91x_platdata and modifies the driver so the
bus width is checked at run time using SMC_nBIT() instead of
SMC_CAN_USE_nBIT.

Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
---

 drivers/net/smc91x.c   |   31 -
 drivers/net/smc91x.h   |   50 ++--
 include/linux/smc91x.h |   13 
 3 files changed, 67 insertions(+), 27 deletions(-)

--- 0017/drivers/net/smc91x.c
+++ work/drivers/net/smc91x.c   2008-02-20 17:04:41.0 +0900
@@ -1997,6 +1997,8 @@ err_out:
 
 static int smc_enable_device(struct platform_device *pdev)
 {
+   struct net_device *ndev = platform_get_drvdata(pdev);
+   struct smc_local *lp = netdev_priv(ndev);
unsigned long flags;
unsigned char ecor, ecsr;
void __iomem *addr;
@@ -2039,7 +2041,7 @@ static int smc_enable_device(struct plat
 * Set the appropriate byte/word mode.
 */
ecsr = readb(addr + (ECSR << SMC_IO_SHIFT)) & ~ECSR_IOIS8;
-   if (!SMC_CAN_USE_16BIT)
+   if (!SMC_16BIT(lp))
ecsr |= ECSR_IOIS8;
writeb(ecsr, addr + (ECSR << SMC_IO_SHIFT));
local_irq_restore(flags);
@@ -2124,10 +2126,11 @@ static void smc_release_datacs(struct pl
  */
 static int smc_drv_probe(struct platform_device *pdev)
 {
+   struct smc91x_platdata *pd = pdev->dev.platform_data;
+   struct smc_local *lp;
struct net_device *ndev;
struct resource *res, *ires;
unsigned int __iomem *addr;
-   unsigned long irq_flags = SMC_IRQ_FLAGS;
int ret;
 
res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "smc91x-regs");
@@ -2152,6 +2155,24 @@ static int smc_drv_probe(struct platform
}
SET_NETDEV_DEV(ndev, &pdev->dev);
 
+   /* get configuration from platform data, only allow use of
+* bus width if both SMC_CAN_USE_xxx and SMC91X_USE_xxx are set.
+*/
+
+   lp = netdev_priv(ndev);
+   if (pd)
+   memcpy(&lp->cfg, pd, sizeof(lp->cfg));
+   else {
+   lp->cfg.flags = SMC91X_USE_8BIT;
+   lp->cfg.flags |= SMC91X_USE_16BIT;
+   lp->cfg.flags |= SMC91X_USE_32BIT;
+   lp->cfg.irq_flags = SMC_IRQ_FLAGS;
+   }
+
+   lp->cfg.flags &= ~(SMC_CAN_USE_8BIT ? 0 : SMC91X_USE_8BIT);
+   lp->cfg.flags &= ~(SMC_CAN_USE_16BIT ? 0 : SMC91X_USE_16BIT);
+   lp->cfg.flags &= ~(SMC_CAN_USE_32BIT ? 0 : SMC91X_USE_32BIT);
+
ndev->dma = (unsigned char)-1;
 
ires = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
@@ -2162,7 +2183,7 @@ static int smc_drv_probe(struct platform
 
ndev->irq = ires->start;
if (SMC_IRQ_FLAGS == -1)
-   irq_flags = ires->flags & IRQF_TRIGGER_MASK;
+   lp->cfg.irq_flags = ires->flags & IRQF_TRIGGER_MASK;
 
ret = smc_request_attrib(pdev);
if (ret)
@@ -2170,6 +2191,7 @@ static int smc_drv_probe(struct platform
 #if defined(CONFIG_SA1100_ASSABET)
NCR_0 |= NCR_ENET_OSC_EN;
 #endif
+   platform_set_drvdata(pdev, ndev);
ret = smc_enable_device(pdev);
if (ret)
goto out_release_attrib;
@@ -2188,8 +2210,7 @@ static int smc_drv_probe(struct platform
}
 #endif
 
-   platform_set_drvdata(pdev, ndev);
-   ret = smc_probe(ndev, addr, irq_flags);
+   ret = smc_probe(ndev, addr, lp->cfg.irq_flags);
if (ret != 0)
goto out_iounmap;
 
--- 0017/drivers/net/smc91x.h
+++ work/drivers/net/smc91x.h   2008-02-20 17:08:35.0 +0900
@@ -34,6 +34,7 @@
 #ifndef _SMC91X_H_
 #define _SMC91X_H_
 
+#include <linux/smc91x.h>
 
 /*
  * Define your architecture specific bus configuration parameters here.
@@ -526,8 +527,13 @@ struct smc_local {
 #endif
void __iomem *base;
void __iomem *datacs;
+
+   struct smc91x_platdata cfg;
 };
 
+#define SMC_8BIT(p) (((p)->cfg.flags & SMC91X_USE_8BIT) && SMC_CAN_USE_8BIT)
+#define SMC_16BIT(p) (((p)->cfg.flags & SMC91X_USE_16BIT) && SMC_CAN_USE_16BIT)
+#define SMC_32BIT(p) (((p)->cfg.flags & SMC91X_USE_32BIT) && SMC_CAN_USE_32BIT)
 
 #ifdef SMC_USE_PXA_DMA
 /*
@@ -1108,41 +1114,41 @@ static const char * chip_ids[ 16 ] =  {
  *
  * Enforce it on any 32-bit capable setup for now.
  */
-#define SMC_MUST_ALIGN_WRITE   SMC_CAN_USE_32BIT
+#define SMC_MUST_ALIGN_WRITE(priv) SMC_32BIT(priv)
 
 #define SMC_GET_PN(priv)   \
-   (SMC_CAN_USE_8BIT   ? (SMC_inb(ioaddr, PN_REG(priv)))   \
+   (SMC_8BIT(priv) ? (SMC_inb(ioaddr, PN_REG(priv)))   \
: (SMC_inw(ioaddr, PN_REG(priv)) & 0xFF))
 
 #define SMC_SET_PN(priv, x)\
do {\
-   if (SMC_MUST_ALIGN_WRITE)   \
+   if (SMC_MUST_ALIGN_WRITE(priv)) \

[PATCH 00/04] smc91x: request bus width using platform data

2008-02-20 Thread Magnus Damm
These patches make it possible to request bus width in the platform data.

Instead of continually updating smc91x.h with board-specific configuration,
use platform data to pass along bus width and irq flags to the driver.
This change is designed to be backwards compatible, so all boards configured
in the header file should just work as usual.

[PATCH 01/04] smc91x: pass along private data
[PATCH 02/04] smc91x: introduce platform data flags
[PATCH 03/04] smc91x: make superh use default config
[PATCH 04/04] smc91x: add insw/outsw to default config

Tested with and without platform data on a SuperH sh7722 MigoR board.

Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
---

 drivers/net/smc91x.c   |  340 +---
 drivers/net/smc91x.h   |  336 ++-
 include/linux/smc91x.h |   13 +
 3 files changed, 353 insertions(+), 336 deletions(-)


[PATCH 01/04] smc91x: pass along private data

2008-02-20 Thread Magnus Damm
Pass a private data pointer to macros and functions. This makes it easy
to make run-time decisions later on. This patch does not change any logic.
These changes should be optimized away during compilation.

Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
---

 drivers/net/smc91x.c |  309 +-
 drivers/net/smc91x.h |  254 -
 2 files changed, 284 insertions(+), 279 deletions(-)

--- 0001/drivers/net/smc91x.c
+++ work/drivers/net/smc91x.c   2008-02-20 16:52:48.0 +0900
@@ -220,23 +220,23 @@ static void PRINT_PKT(u_char *buf, int l
 
 
 /* this enables an interrupt in the interrupt mask register */
-#define SMC_ENABLE_INT(x) do { \
+#define SMC_ENABLE_INT(priv, x) do {   \
unsigned char mask; \
-   spin_lock_irq(&lp->lock);   \
-   mask = SMC_GET_INT_MASK();  \
+   spin_lock_irq(&priv->lock); \
+   mask = SMC_GET_INT_MASK(priv);  \
mask |= (x);\
-   SMC_SET_INT_MASK(mask); \
-   spin_unlock_irq(&lp->lock); \
+   SMC_SET_INT_MASK(priv, mask);   \
+   spin_unlock_irq(&priv->lock);   \
 } while (0)
 
 /* this disables an interrupt from the interrupt mask register */
-#define SMC_DISABLE_INT(x) do {   \
+#define SMC_DISABLE_INT(priv, x) do {  \
unsigned char mask; \
-   spin_lock_irq(&lp->lock);   \
-   mask = SMC_GET_INT_MASK();  \
+   spin_lock_irq(&priv->lock); \
+   mask = SMC_GET_INT_MASK(priv);  \
mask &= ~(x);   \
-   SMC_SET_INT_MASK(mask); \
-   spin_unlock_irq(&lp->lock); \
+   SMC_SET_INT_MASK(priv, mask);   \
+   spin_unlock_irq(&priv->lock);   \
 } while (0)
 
 /*
@@ -244,10 +244,10 @@ static void PRINT_PKT(u_char *buf, int l
  * if at all, but let's avoid deadlocking the system if the hardware
  * decides to go south.
  */
-#define SMC_WAIT_MMU_BUSY() do {   \
-   if (unlikely(SMC_GET_MMU_CMD() & MC_BUSY)) {\
+#define SMC_WAIT_MMU_BUSY(priv) do {   \
+   if (unlikely(SMC_GET_MMU_CMD(priv) & MC_BUSY)) {\
unsigned long timeout = jiffies + 2;\
-   while (SMC_GET_MMU_CMD() & MC_BUSY) {   \
+   while (SMC_GET_MMU_CMD(priv) & MC_BUSY) {   \
if (time_after(jiffies, timeout)) { \
printk("%s: timeout %s line %d\n",  \
dev->name, __FILE__, __LINE__); \
@@ -273,8 +273,8 @@ static void smc_reset(struct net_device 
 
/* Disable all interrupts, block TX tasklet */
spin_lock_irq(&lp->lock);
-   SMC_SELECT_BANK(2);
-   SMC_SET_INT_MASK(0);
+   SMC_SELECT_BANK(lp, 2);
+   SMC_SET_INT_MASK(lp, 0);
pending_skb = lp->pending_tx_skb;
lp->pending_tx_skb = NULL;
spin_unlock_irq(&lp->lock);
@@ -290,15 +290,15 @@ static void smc_reset(struct net_device 
 * This resets the registers mostly to defaults, but doesn't
 * affect EEPROM.  That seems unnecessary
 */
-   SMC_SELECT_BANK(0);
-   SMC_SET_RCR(RCR_SOFTRST);
+   SMC_SELECT_BANK(lp, 0);
+   SMC_SET_RCR(lp, RCR_SOFTRST);
 
/*
 * Setup the Configuration Register
 * This is necessary because the CONFIG_REG is not affected
 * by a soft reset
 */
-   SMC_SELECT_BANK(1);
+   SMC_SELECT_BANK(lp, 1);
 
cfg = CONFIG_DEFAULT;
 
@@ -316,7 +316,7 @@ static void smc_reset(struct net_device 
 */
cfg |= CONFIG_EPH_POWER_EN;
 
-   SMC_SET_CONFIG(cfg);
+   SMC_SET_CONFIG(lp, cfg);
 
/* this should pause enough for the chip to be happy */
/*
@@ -329,12 +329,12 @@ static void smc_reset(struct net_device 
udelay(1);
 
/* Disable transmit and receive functionality */
-   SMC_SELECT_BANK(0);
-   SMC_SET_RCR(RCR_CLEAR);
-   SMC_SET_TCR(TCR_CLEAR);
+   SMC_SELECT_BANK(lp, 0);
+   SMC_SET_RCR(lp, RCR_CL

Re: [NETFILTER]: Introduce nf_inet_address

2008-02-20 Thread Jan Engelhardt

On Feb 19 2008 15:45, Patrick McHardy wrote:
>> 
>> It's in busybox 1.9.1. Just including <netinet/in.h> seems to be
>> sufficient to make it happy again. I wonder if netfilter.h should
>> include that for itself?
>
> That would break iptables compilation, which already includes
> linux/in.h in some files. I guess the best fix for now is to
> include netinet/in.h in busybox and long-term clean this up
> properly.
>

If <linux/netfilter.h> includes <netinet/in.h>, userspace compilation
works but kernel compilation fails (file not found); if it includes
<linux/in.h>, userspace compilation fails (clashing defs, etc.).

What comes to mind is the ugly hack we had for a few days:

#ifdef __KERNEL__
#   include <linux/in.h>
#else
#   include <netinet/in.h>
#endif

and I think we should not impose any inclusion rules on userspace
this way.

Right now, it is solved this way:

kernel .c files explicitly include <linux/in.h> when loading
<linux/netfilter.h>,

userspace .c files explicitly include <netinet/in.h> when loading
<linux/netfilter.h>. And that seems to work out.



Re: e1000: Question about polling

2008-02-20 Thread Badalian Vyacheslav
Khrm i try to say that i have language barrier and some time may 
wrong compose clauses. Example below =)

"I'll try fix my English next time!"

Vyacheslav




Re: [VLAN] vlan_skb_recv

2008-02-20 Thread Patrick McHardy

Ben Greear wrote:
> Stephen Anderson wrote:
>> Hello,
>>
>> To help increase throughput and bypass the backlog queue, I changed the
>> netif_rx() to netif_receive_skb() in vlan_skb_recv().  What's the
>> argument for using netif_rx() other than legacy maintenance?  At this
>> point, interrupt context should not be an issue.  Layer 2 performance
>> has been a big focus in my area of development.

I guess the only point is to reduce stack usage. It's probably not
a problem with only VLAN, but it might be with further tunnels,
IPsec, ...

>> I'm sure you have seen many attempts to implement a single VLAN aware
>> IVL FDB in the past and I was wondering which attempt do you feel was
>> the best?  Have you ever considered integrating your VLAN support
>> natively into the bridging code base or know of any attempts to do just
>> that?

Without having thought about this much, it seems to me that
it needs to be integrated in the bridge fdb to work properly.



Re: e1000: Question about polling

2008-02-20 Thread Jarek Poplawski
On Wed, Feb 20, 2008 at 02:54:27PM +0300, Badalian Vyacheslav wrote:
> Khrm i try to say that i have language barrier and some time may  
> wrong compose clauses. Example below =)

No, only a bit joking...

> "I'll try fix my English next time!"

Don't worry, Vyacheslav: I think your message was understandable enough
if you got a good answer from Jesse. (And I've learned something BTW too;
thanks Jesse!) And after all, this is not a language group: we care
about serious problems here!

Jarek P.


[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/sk98lin/skge.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sk98lin/skge.c b/drivers/net/sk98lin/skge.c
index 20890e4..eedcbeb 100644
--- a/drivers/net/sk98lin/skge.c
+++ b/drivers/net/sk98lin/skge.c
@@ -5166,7 +5166,7 @@ err_out:
 #define skge_resume NULL
 #endif
 
-static struct pci_device_id skge_pci_tbl[] = {
+static PCI_DEVICE_TABLE(skge_pci_tbl) = {
 #ifdef SK98LIN_ALL_DEVICES
{ PCI_VENDOR_ID_3COM, 0x1700, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
{ PCI_VENDOR_ID_3COM, 0x80eb, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
-- 
1.5.3.8




Re: pci_device_id cleanups

2008-02-20 Thread Sam Ravnborg
On Wed, Feb 20, 2008 at 01:53:36PM +0100, Jonas Bonn wrote:
> 
> The PCI_DEVICE_TABLE patch I sent earlier doesn't necessarily make
> much sense by itself... here is a set of patches that apply
> this macro, in turn moving a lot of this data into __devinitconst
> which is discardable in certain situations.
> Hopefully the benefit of this approach is a bit clearer now.
[shorter lines please..]

Can you please confirm that this does not break powerpc (64 bit)
as they have troubles with the constification..

Sam


[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/starfire.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/starfire.c b/drivers/net/starfire.c
index c49214f..a67bac5 100644
--- a/drivers/net/starfire.c
+++ b/drivers/net/starfire.c
@@ -337,7 +337,7 @@ enum chipset {
CH_6915 = 0,
 };
 
-static struct pci_device_id starfire_pci_tbl[] = {
+static PCI_DEVICE_TABLE(starfire_pci_tbl) = {
{ 0x9004, 0x6915, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CH_6915 },
{ 0, }
 };
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/skfp/skfddi.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/skfp/skfddi.c b/drivers/net/skfp/skfddi.c
index 7cf9b9f..2a8386b 100644
--- a/drivers/net/skfp/skfddi.c
+++ b/drivers/net/skfp/skfddi.c
@@ -150,7 +150,7 @@ extern void mac_drv_rx_mode(struct s_smc *smc, int mode);
 extern void mac_drv_clear_rx_queue(struct s_smc *smc);
 extern void enable_tx_irq(struct s_smc *smc, u_short queue);
 
-static struct pci_device_id skfddi_pci_tbl[] = {
+static PCI_DEVICE_TABLE(skfddi_pci_tbl) = {
{ PCI_VENDOR_ID_SK, PCI_DEVICE_ID_SK_FP, PCI_ANY_ID, PCI_ANY_ID, },
{ } /* Terminating entry */
 };
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/wan/dscc4.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c
index c6f26e2..16d3a4c 100644
--- a/drivers/net/wan/dscc4.c
+++ b/drivers/net/wan/dscc4.c
@@ -2048,7 +2048,7 @@ static int __init dscc4_setup(char *str)
 __setup("dscc4.setup=", dscc4_setup);
 #endif
 
-static struct pci_device_id dscc4_pci_tbl[] = {
+static PCI_DEVICE_TABLE(dscc4_pci_tbl) = {
{ PCI_VENDOR_ID_SIEMENS, PCI_DEVICE_ID_SIEMENS_DSCC4,
PCI_ANY_ID, PCI_ANY_ID, },
{ 0,}
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/niu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/niu.c b/drivers/net/niu.c
index e98ce1e..ab8148a 100644
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -62,7 +62,7 @@ static void writeq(u64 val, void __iomem *reg)
 }
 #endif
 
-static struct pci_device_id niu_pci_tbl[] = {
+static PCI_DEVICE_TABLE(niu_pci_tbl) = {
{PCI_DEVICE(PCI_VENDOR_ID_SUN, 0xabcd)},
{}
 };
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/wan/lmc/lmc_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
index 6635ece..e85cfe7 100644
--- a/drivers/net/wan/lmc/lmc_main.c
+++ b/drivers/net/wan/lmc/lmc_main.c
@@ -82,7 +82,7 @@ static int lmc_first_load = 0;
 
 static int LMC_PKT_BUF_SZ = 1542;
 
-static struct pci_device_id lmc_pci_tbl[] = {
+static PCI_DEVICE_TABLE(lmc_pci_tbl) = {
{ PCI_VENDOR_ID_DEC, PCI_DEVICE_ID_DEC_TULIP_FAST,
  PCI_VENDOR_ID_LMC, PCI_ANY_ID },
{ PCI_VENDOR_ID_DEC, PCI_DEVICE_ID_DEC_TULIP_FAST,
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/hamachi.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/hamachi.c b/drivers/net/hamachi.c
index b53f6b6..d8056e9 100644
--- a/drivers/net/hamachi.c
+++ b/drivers/net/hamachi.c
@@ -1987,7 +1987,7 @@ static void __devexit hamachi_remove_one (struct pci_dev *pdev)
}
 }
 
-static struct pci_device_id hamachi_pci_tbl[] = {
+static PCI_DEVICE_TABLE(hamachi_pci_tbl) = {
{ 0x1318, 0x0911, PCI_ANY_ID, PCI_ANY_ID, },
{ 0, }
 };
-- 
1.5.3.8




RE: Dealing with limited resources and DMA Engine copies

2008-02-20 Thread Sosnowski, Maciej
>-- Original message --
>From: Olof Johansson <[EMAIL PROTECTED]>
>Date: Feb 14, 2008 3:38 AM
>Subject: Dealing with limited resources and DMA Engine copies
>To: [EMAIL PROTECTED], [EMAIL PROTECTED]
>Cc: netdev@vger.kernel.org
>
>Hi,
>
>My DMA Engine has a limited resource: It's got a descriptor ring, so
>it's not always possible to add a new descriptor to it (i.e. it might
be
>full). While allocating a huge ring will help, eventually I'm sure I
>will hit a case where it'll overflow.
>
>I thought this was going to be taken care of automatically by the fact
>that you return your max(?) number of descriptors in the channel
>allocation function, but it looks like that value is discarded in
>dma_client_chan_alloc().
>
>So, I just got a couple of spurious:
>dma_cookie < 0
>dma_cookie < 0
>
>...on the console and the connection terminated. Looks like that came
>from tcp_recvmsg(). Ouch.
>
>How about falling back to the cpu-based copy in case of failure? Or
would
>you prefer that I sleep locally in my driver and wait on a descriptor
>slot to open up?
>

I have taken a closer look at this in the code. It seems a good
idea to back off from the ioatdma copy for a while after a "dma_cookie <
0" error, to let the ring free some descriptors.
I will probably be able to work on it next week. As soon as I have some
stable results, I will get back to you with it.

Maciej


-
Intel Technology Poland sp. z o.o.
z siedziba w Gdansku
ul. Slowackiego 173
80-298 Gdansk

Sad Rejonowy Gdansk Polnoc w Gdansku, 
VII Wydzial Gospodarczy Krajowego Rejestru Sadowego, 
numer KRS 101882

NIP 957-07-52-316
Kapital zakladowy 200.000 zl

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/amd8111e.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/amd8111e.c b/drivers/net/amd8111e.c
index 85f7276..a4ad2fb 100644
--- a/drivers/net/amd8111e.c
+++ b/drivers/net/amd8111e.c
@@ -113,7 +113,7 @@ MODULE_PARM_DESC(coalesce, "Enable or Disable interrupt coalescing, 1: Enable, 0
 module_param_array(dynamic_ipg, bool, NULL, 0);
 MODULE_PARM_DESC(dynamic_ipg, "Enable or Disable dynamic IPG, 1: Enable, 0: Disable");
 
-static struct pci_device_id amd8111e_pci_tbl[] = {
+static PCI_DEVICE_TABLE(amd8111e_pci_tbl) = {
 
{ PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD8111E_7462,
 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL },
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/arcnet/com20020-pci.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/arcnet/com20020-pci.c b/drivers/net/arcnet/com20020-pci.c
index b8c0fa6..87ee0db 100644
--- a/drivers/net/arcnet/com20020-pci.c
+++ b/drivers/net/arcnet/com20020-pci.c
@@ -141,7 +141,7 @@ static void __devexit com20020pci_remove(struct pci_dev *pdev)
free_netdev(dev);
 }
 
-static struct pci_device_id com20020pci_id_table[] = {
+static PCI_DEVICE_TABLE(com20020pci_id_table) = {
{ 0x1571, 0xa001, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
{ 0x1571, 0xa002, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
{ 0x1571, 0xa003, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
-- 
1.5.3.8




pci_device_id cleanups

2008-02-20 Thread Jonas Bonn

The PCI_DEVICE_TABLE patch I sent earlier doesn't necessarily make much sense 
by itself... here is a set of patches that apply this macro, in turn moving a 
lot of this data into __devinitconst which is discardable in certain 
situations.  Hopefully the benefit of this approach is a bit clearer now.

 drivers/net/3c59x.c   |2 +-
 drivers/net/amd8111e.c|2 +-
 drivers/net/arcnet/com20020-pci.c |2 +-
 drivers/net/defxx.c   |2 +-
 drivers/net/hamachi.c |2 +-
 drivers/net/niu.c |2 +-
 drivers/net/pasemi_mac.c  |2 +-
 drivers/net/sk98lin/skge.c|2 +-
 drivers/net/skfp/skfddi.c |2 +-
 drivers/net/starfire.c|2 +-
 drivers/net/sunhme.c  |2 +-
 drivers/net/tlan.c|2 +-
 drivers/net/wan/dscc4.c   |2 +-
 drivers/net/wan/lmc/lmc_main.c|2 +-
 include/linux/pci.h   |9 +
 15 files changed, 23 insertions(+), 14 deletions(-)




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/tlan.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tlan.c b/drivers/net/tlan.c
index 3af5b92..bea59c6 100644
--- a/drivers/net/tlan.c
+++ b/drivers/net/tlan.c
@@ -253,7 +253,7 @@ static struct board {
	{ "Compaq NetFlex-3/E", TLAN_ADAPTER_ACTIVITY_LED, 0x83 }, /* EISA card */
 };
 
-static struct pci_device_id tlan_pci_tbl[] = {
+static PCI_DEVICE_TABLE(tlan_pci_tbl) = {
{ PCI_VENDOR_ID_COMPAQ, PCI_DEVICE_ID_COMPAQ_NETEL10,
PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
{ PCI_VENDOR_ID_COMPAQ, PCI_DEVICE_ID_COMPAQ_NETEL100,
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/defxx.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/defxx.c b/drivers/net/defxx.c
index ddc30c4..84a3ce5 100644
--- a/drivers/net/defxx.c
+++ b/drivers/net/defxx.c
@@ -3630,7 +3630,7 @@ static int __devinit dfx_pci_register(struct pci_dev *,
  const struct pci_device_id *);
 static void __devexit dfx_pci_unregister(struct pci_dev *);
 
-static struct pci_device_id dfx_pci_table[] = {
+static PCI_DEVICE_TABLE(dfx_pci_table) = {
{ PCI_DEVICE(PCI_VENDOR_ID_DEC, PCI_DEVICE_ID_DEC_FDDI) },
{ }
 };
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/sunhme.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sunhme.c b/drivers/net/sunhme.c
index b4e7f30..beb0d27 100644
--- a/drivers/net/sunhme.c
+++ b/drivers/net/sunhme.c
@@ -3247,7 +3247,7 @@ static void __devexit happy_meal_pci_remove(struct pci_dev *pdev)
dev_set_drvdata(&pdev->dev, NULL);
 }
 
-static struct pci_device_id happymeal_pci_ids[] = {
+static PCI_DEVICE_TABLE(happymeal_pci_ids) = {
{ PCI_DEVICE(PCI_VENDOR_ID_SUN, PCI_DEVICE_ID_SUN_HAPPYMEAL) },
{ } /* Terminating entry */
 };
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/pasemi_mac.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/pasemi_mac.c b/drivers/net/pasemi_mac.c
index 2e39e02..069fa7c 100644
--- a/drivers/net/pasemi_mac.c
+++ b/drivers/net/pasemi_mac.c
@@ -1648,7 +1648,7 @@ static void __devexit pasemi_mac_remove(struct pci_dev *pdev)
free_netdev(netdev);
 }
 
-static struct pci_device_id pasemi_mac_pci_tbl[] = {
+static PCI_DEVICE_TABLE(pasemi_mac_pci_tbl) = {
{ PCI_DEVICE(PCI_VENDOR_ID_PASEMI, 0xa005) },
{ PCI_DEVICE(PCI_VENDOR_ID_PASEMI, 0xa006) },
{ },
-- 
1.5.3.8




[PATCH] Add PCI_DEVICE_TABLE macro

2008-02-20 Thread Jonas Bonn
The definitions of struct pci_device_id arrays should generally follow
the same pattern across the entire kernel.  This macro defines this
array as const and puts it into the __devinitconst section.

Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 include/linux/pci.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 87195b6..c7a91b1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -389,6 +389,15 @@ struct pci_driver {
 #defineto_pci_driver(drv) container_of(drv, struct pci_driver, driver)
 
 /**
+ * PCI_DEVICE_TABLE - macro used to describe a pci device table
+ * @_table: device table name
+ *
+ * This macro is used to create a struct pci_device_id array (a device table) 
+ * in a generic manner.
+ */
+#define PCI_DEVICE_TABLE(_table) const struct pci_device_id _table[] __devinitconst
+
+/**
  * PCI_DEVICE - macro used to describe a specific pci device
  * @vend: the 16 bit PCI Vendor ID
  * @dev: the 16 bit PCI Device ID
-- 
1.5.3.8




[PATCH] [net] use PCI_DEVICE_TABLE: makes struct pci_device_id array const and adds section attribute __devinitconst

2008-02-20 Thread Jonas Bonn
Signed-off-by: Jonas Bonn <[EMAIL PROTECTED]>
---
 drivers/net/3c59x.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index 6f8e7d4..d2045d4 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -372,7 +372,7 @@ static struct vortex_chip_info {
 };
 
 
-static struct pci_device_id vortex_pci_tbl[] = {
+static PCI_DEVICE_TABLE(vortex_pci_tbl) = {
{ 0x10B7, 0x5900, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CH_3C590 },
{ 0x10B7, 0x5920, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CH_3C592 },
{ 0x10B7, 0x5970, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CH_3C597 },
-- 
1.5.3.8




Re: pci_device_id cleanups

2008-02-20 Thread Jonas Bonn

Sam Ravnborg wrote:

On Wed, Feb 20, 2008 at 01:53:36PM +0100, Jonas Bonn wrote:

The PCI_DEVICE_TABLE patch I sent earlier doesn't necessarily make
much sense by itself... here is a set of patches that apply
this macro, in turn moving a lot of this data into __devinitconst
which is discardable in certain situations.
Hopefully the benefit of this approach is a bit clearer now.

[shorter lines please..]


Sorry...



Can you please confirm that this does not break powerpc (64 bit)
as they have troubles with the constification..


I do not have access to any PowerPC machine... Olof Johansson built the 
tree I posted earlier on PowerPC; there's nothing really new here except 
the wrapping of the definition in a macro.


But of course, it would be great if someone could confirm this...



Sam





[RFC PATCH 0/8]: uninline & uninline

2008-02-20 Thread Ilpo Järvinen
Hi all,

I ran some lengthy tests to measure the cost of inlines in headers under
include/; a simple coverage calculation yields 89%, but most of the
failed compiles are due to the preprocessor cutting the tested block away
anyway. Test setup: v2.6.24-mm1, make allyesconfig, 32-bit x86,
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Because one inline was
tested (function uninlined) at a time, the actual benefit of removing
multiple inlines may well be below the sum of the individual figures
(especially when something calls an __-prefixed function of the same name).

Ok, here's the top of the list (10000+ bytes):

-110805  869 f, 198 +, 111003 -, diff: -110805  skb_put 
-41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
-36290  42 f, 197 +, 36487 -, diff: -36290  cfi_build_cmd 
-35698  1234 f, 2391 +, 38089 -, diff: -35698  atomic_dec_and_test 
-28162  354 f, 3005 +, 31167 -, diff: -28162  skb_pull 
-23668  392 f, 104 +, 23772 -, diff: -23668  dev_alloc_skb 
-22212  415 f, 130 +, 22342 -, diff: -22212  __dev_alloc_skb 
-21593  356 f, 2418 +, 24011 -, diff: -21593  skb_push 
-19036  478 f, 259 +, 19295 -, diff: -19036  netif_wake_queue 
-18409  396 f, 6447 +, 24856 -, diff: -18409  __skb_pull 
-16420  187 f, 103 +, 16523 -, diff: -16420  dst_release 
-16025  13 f, 280 +, 16305 -, diff: -16025  cfi_send_gen_cmd 
-14925  486 f, 978 +, 15903 -, diff: -14925  add_timer 
-14896  199 f, 558 +, 15454 -, diff: -14896  sg_page 
-12870  36 f, 121 +, 12991 -, diff: -12870  le_key_k_type 
-12310  692 f, 7215 +, 19525 -, diff: -12310  signal_pending 
-11640  251 f, 118 +, 11758 -, diff: -11640  __skb_trim 
-11059  111 f, 293 +, 11352 -, diff: -11059  __nlmsg_put 
-10976  209 f, 123 +, 11099 -, diff: -10976  skb_trim 
-10344  125 f, 462 +, 10806 -, diff: -10344  pskb_may_pull 
-10061  300 f, 1163 +, 11224 -, diff: -10061  try_module_get 
-10008  75 f, 341 +, 10349 -, diff: -10008  nlmsg_put 

~250 are in the 1000+ bytes category and ~440 in 500+. The full list
has some entries without a number because the given config doesn't build
them, so nothing got uninlined; the missing entries consist solely of
compile failures. The list is available here:

  http://www.cs.helsinki.fi/u/ijjarvin/inlines/sorted

I made some patches to uninline a couple of them (picked mostly
net-related) to stir up some discussion; however, some of them
are not ready for inclusion as is (see patch descriptions).
The cases don't represent all of the top 8 because some of them
require a bit more analysis (e.g., config-dependent behavior,
comments about gcc optimizations).

The tools I used are available here, except for the site-specific
distribution machinery (in addition, one needs a pretty recent
codiff from Arnaldo's toolset, because some inline-related bugs
were fixed lately):

  http://www.cs.helsinki.fi/u/ijjarvin/inline-tools.git/

I'm planning to run similar tests on inlines in headers that are
not under include/ as well, but that requires minor modifications
to those tools.

--
 i.







[RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen
~500 files changed
...
kernel/uninlined.c:
  skb_put   | +104
 1 function changed, 104 bytes added, diff: +104

vmlinux.o:
 869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805

This change is INCOMPLETE: I think the call to current_text_addr()
should be rethought, but I don't have a clue how to do that.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   20 +---
 net/core/skbuff.c  |   21 +
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 412672a..5925435 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -896,25 +896,7 @@ static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
return tmp;
 }
 
-/**
- * skb_put - add data to a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer. If this would
- * exceed the total buffer size the kernel will panic. A pointer to the
- * first byte of the extra data is returned.
- */
-static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-   unsigned char *tmp = skb_tail_pointer(skb);
-   SKB_LINEAR_ASSERT(skb);
-   skb->tail += len;
-   skb->len  += len;
-   if (unlikely(skb->tail > skb->end))
-   skb_over_panic(skb, len, current_text_addr());
-   return tmp;
-}
+extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e35422..661d439 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -857,6 +857,27 @@ free_skb:
return err;
 }
 
+/**
+ * skb_put - add data to a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer. If this would
+ * exceed the total buffer size the kernel will panic. A pointer to the
+ * first byte of the extra data is returned.
+ */
+unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+   unsigned char *tmp = skb_tail_pointer(skb);
+   SKB_LINEAR_ASSERT(skb);
+   skb->tail += len;
+   skb->len  += len;
+   if (unlikely(skb->tail > skb->end))
+   skb_over_panic(skb, len, current_text_addr());
+   return tmp;
+}
+EXPORT_SYMBOL(skb_put);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2



[RFC PATCH 4/8] [NET]: uninline skb_push, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen
-21593  356 funcs, 2418 +, 24011 -, diff: -21593 --- skb_push

Again, current_text_addr() needs to be addressed.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   18 +-
 net/core/skbuff.c  |   19 +++
 2 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df3cce2..c11f248 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -905,23 +905,7 @@ static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
return skb->data;
 }
 
-/**
- * skb_push - add data to the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer at the buffer
- * start. If this would exceed the total buffer headroom the kernel will
- * panic. A pointer to the first byte of the extra data is returned.
- */
-static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-   skb->data -= len;
-   skb->len  += len;
-   if (unlikely(skb->data<skb->head))
-   skb_under_panic(skb, len, current_text_addr());
-   return skb->data;
-}
+extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
 
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 081bffb..05d43fd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -897,6 +897,25 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
 EXPORT_SYMBOL(skb_put);
 
 /**
+ * skb_push - add data to the start of a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer at the buffer
+ * start. If this would exceed the total buffer headroom the kernel will
+ * panic. A pointer to the first byte of the extra data is returned.
+ */
+unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+   skb->data -= len;
+   skb->len  += len;
+   if (unlikely(skb->data<skb->head))
+   skb_under_panic(skb, len, current_text_addr());
+   return skb->data;
+}
+EXPORT_SYMBOL(skb_push);
+
+/**
  * skb_pull - remove data from the start of a buffer
  * @skb: buffer to use
  * @len: amount of data to remove
-- 
1.5.2.2



[RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen
-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   17 +
 net/core/skbuff.c  |   18 ++
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a9f8f15..df3cce2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1269,22 +1269,7 @@ static inline struct sk_buff *__dev_alloc_skb(unsigned 
int length,
return skb;
 }
 
-/**
- * dev_alloc_skb - allocate an skbuff for receiving
- * @length: length to allocate
- *
- * Allocate a new &sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory. Although this function
- * allocates memory it can be called from an interrupt.
- */
-static inline struct sk_buff *dev_alloc_skb(unsigned int length)
-{
-   return __dev_alloc_skb(length, GFP_ATOMIC);
-}
+extern struct sk_buff *dev_alloc_skb(unsigned int length);
 
 extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int length, gfp_t gfp_mask);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 14f462b..081bffb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
return skb;
 }
 
+/**
+ * dev_alloc_skb - allocate an skbuff for receiving
+ * @length: length to allocate
+ *
+ * Allocate a new &sk_buff and assign it a usage count of one. The
+ * buffer has unspecified headroom built in. Users should allocate
+ * the headroom they think they need without accounting for the
+ * built in space. The built in space is used for optimisations.
+ *
+ * %NULL is returned if there is no free memory. Although this function
+ * allocates memory it can be called from an interrupt.
+ */
+struct sk_buff *dev_alloc_skb(unsigned int length)
+{
+   return __dev_alloc_skb(length, GFP_ATOMIC);
+}
+EXPORT_SYMBOL(dev_alloc_skb);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
-- 
1.5.2.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/8] [NET]: uninline skb_pull, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen
-28162  354 funcs, 3005 +, 31167 -, diff: -28162 --- skb_pull

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   15 +--
 net/core/skbuff.c  |   16 
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5925435..a9f8f15 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -930,20 +930,7 @@ static inline unsigned char *__skb_pull(struct sk_buff 
*skb, unsigned int len)
return skb->data += len;
 }
 
-/**
- * skb_pull - remove data from the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to remove
- *
- * This function removes data from the start of a buffer, returning
- * the memory to the headroom. A pointer to the next data in the buffer
- * is returned. Once the data has been pulled future pushes will overwrite
- * the old data.
- */
-static inline unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
-{
-   return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
-}
+extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 
 extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 661d439..14f462b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -878,6 +878,22 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int 
len)
 }
 EXPORT_SYMBOL(skb_put);
 
+/**
+ * skb_pull - remove data from the start of a buffer
+ * @skb: buffer to use
+ * @len: amount of data to remove
+ *
+ * This function removes data from the start of a buffer, returning
+ * the memory to the headroom. A pointer to the next data in the buffer
+ * is returned. Once the data has been pulled future pushes will overwrite
+ * the old data.
+ */
+unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
+{
+   return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
+}
+EXPORT_SYMBOL(skb_pull);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
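The helpers uninlined in this series (skb_put(), skb_push(), skb_pull()) all perform the same head/data/tail pointer arithmetic on the buffer. A minimal userland sketch of that arithmetic, using a hypothetical simplified struct in place of sk_buff (assert() stands in for the skb_over_panic()/skb_under_panic() calls):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; invariants:
 * head <= data <= tail <= end, and len == tail - data. */
struct buf {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/* Like skb_put(): extend the used area at the tail. */
static unsigned char *buf_put(struct buf *b, unsigned int n)
{
	unsigned char *old_tail = b->tail;

	assert(b->tail + n <= b->end);	/* kernel: skb_over_panic() */
	b->tail += n;
	b->len  += n;
	return old_tail;		/* first byte of the new data */
}

/* Like skb_push(): extend the used area at the front, into headroom. */
static unsigned char *buf_push(struct buf *b, unsigned int n)
{
	b->data -= n;
	b->len  += n;
	assert(b->data >= b->head);	/* kernel: skb_under_panic() */
	return b->data;
}

/* Like skb_pull(): drop bytes from the front, returning them to headroom. */
static unsigned char *buf_pull(struct buf *b, unsigned int n)
{
	if (n > b->len)
		return NULL;
	b->len -= n;
	return b->data += n;
}
```

In this model, skb_reserve() corresponds to advancing data and tail together before any put/push, which is why a later push can succeed without touching the tail.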


[RFC PATCH 6/8] [NET]: uninline skb_trim, de-bloats

2008-02-20 Thread Ilpo Järvinen
-10976  209 funcs, 123 +, 11099 -, diff: -10976 --- skb_trim

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/linux/skbuff.h |   16 +---
 net/core/skbuff.c  |   16 
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c11f248..75d8a66 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1156,21 +1156,7 @@ static inline void __skb_trim(struct sk_buff *skb, 
unsigned int len)
skb_set_tail_pointer(skb, len);
 }
 
-/**
- * skb_trim - remove end from a buffer
- * @skb: buffer to alter
- * @len: new length
- *
- * Cut the length of a buffer down by removing data from the tail. If
- * the buffer is already under the length specified it is not modified.
- * The skb must be linear.
- */
-static inline void skb_trim(struct sk_buff *skb, unsigned int len)
-{
-   if (skb->len > len)
-   __skb_trim(skb, len);
-}
-
+extern void skb_trim(struct sk_buff *skb, unsigned int len);
 
 static inline int __pskb_trim(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 05d43fd..b57cadb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -931,6 +931,22 @@ unsigned char *skb_pull(struct sk_buff *skb, unsigned int 
len)
 }
 EXPORT_SYMBOL(skb_pull);
 
+/**
+ * skb_trim - remove end from a buffer
+ * @skb: buffer to alter
+ * @len: new length
+ *
+ * Cut the length of a buffer down by removing data from the tail. If
+ * the buffer is already under the length specified it is not modified.
+ * The skb must be linear.
+ */
+void skb_trim(struct sk_buff *skb, unsigned int len)
+{
+   if (skb->len > len)
+   __skb_trim(skb, len);
+}
+EXPORT_SYMBOL(skb_trim);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 7/8] [SCTP]: uninline sctp_add_cmd_sf

2008-02-20 Thread Ilpo Järvinen
I added inline to sctp_add_cmd and appropriate comment there to
avoid adding another call into the call chain. This works at least
with "gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)". Alternatively,
__sctp_add_cmd could be introduced to .h.
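The layout described above (keep the helper inline inside the .c file so the uninlined wrapper costs one call, not two) can be sketched in userland; the names below are hypothetical stand-ins for sctp_add_cmd()/sctp_add_cmd_sf():

```c
#include <stdlib.h>

#define MAX_CMDS 4

struct cmd_seq {
	int next_free_slot;
	int cmds[MAX_CMDS];
};

/* Stays inline within this translation unit, mirroring the comment the
 * patch adds to sctp_add_cmd(): the wrapper below then compiles to a
 * single call frame.  Returns 0 when the sequence is full. */
static inline int add_cmd(struct cmd_seq *seq, int cmd)
{
	if (seq->next_free_slot >= MAX_CMDS)
		return 0;
	seq->cmds[seq->next_free_slot++] = cmd;
	return 1;
}

/* The formerly-inlined wrapper, now a real function: one copy in the
 * binary instead of one per call site.  abort() stands in for BUG(). */
void add_cmd_sf(struct cmd_seq *seq, int cmd)
{
	if (!add_cmd(seq, cmd))
		abort();
}
```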

net/sctp/sm_statefuns.c:
  sctp_sf_cookie_wait_prm_abort  | -125
  sctp_sf_cookie_wait_prm_shutdown   |  -75
  sctp_sf_do_9_1_prm_abort   |  -75
  sctp_sf_shutdown_sent_prm_abort|  -50
  sctp_sf_pdiscard   |  -25
  sctp_stop_t1_and_abort | -100
  sctp_sf_do_9_2_start_shutdown  | -154
  __sctp_sf_do_9_1_abort |  -50
  sctp_send_stale_cookie_err |  -29
  sctp_sf_abort_violation| -181
  sctp_sf_do_9_2_shutdown_ack| -154
  sctp_sf_do_9_2_reshutack   |  -86
  sctp_sf_tabort_8_4_8   |  -28
  sctp_sf_heartbeat  |  -52
  sctp_sf_shut_8_4_5 |  -27
  sctp_eat_data  | -246
  sctp_sf_shutdown_sent_abort|  -58
  sctp_sf_check_restart_addrs|  -50
  sctp_sf_do_unexpected_init | -110
  sctp_sf_sendbeat_8_3   | -107
  sctp_sf_unk_chunk  |  -65
  sctp_sf_do_prm_asoc| -129
  sctp_sf_do_prm_send|  -25
  sctp_sf_do_9_2_prm_shutdown|  -50
  sctp_sf_error_closed   |  -25
  sctp_sf_error_shutdown |  -25
  sctp_sf_shutdown_pending_prm_abort |  -25
  sctp_sf_do_prm_requestheartbeat|  -28
  sctp_sf_do_prm_asconf  |  -75
  sctp_sf_do_6_3_3_rtx   | -104
  sctp_sf_do_6_2_sack|  -25
  sctp_sf_t1_init_timer_expire   | -133
  sctp_sf_t1_cookie_timer_expire | -104
  sctp_sf_t2_timer_expire| -161
  sctp_sf_t4_timer_expire| -175
  sctp_sf_t5_timer_expire|  -75
  sctp_sf_autoclose_timer_expire |  -50
  sctp_sf_do_5_2_4_dupcook   | -579
  sctp_sf_do_4_C | -125
  sctp_sf_shutdown_pending_abort |  -32
  sctp_sf_do_5_1E_ca | -186
  sctp_sf_backbeat_8_3   |  -27
  sctp_sf_cookie_echoed_err  | -300
  sctp_sf_eat_data_6_2   | -146
  sctp_sf_eat_data_fast_4_4  | -125
  sctp_sf_eat_sack_6_2   |  -29
  sctp_sf_operr_notify   |  -25
  sctp_sf_do_9_2_final   | -152
  sctp_sf_do_asconf  |  -64
  sctp_sf_do_asconf_ack  | -284
  sctp_sf_eat_fwd_tsn_fast   | -160
  sctp_sf_eat_auth   |  -86
  sctp_sf_do_5_1B_init   | -110
  sctp_sf_do_5_1C_ack| -204
  sctp_sf_do_9_2_shutdown|  -78
  sctp_sf_do_ecn_cwr |  -24
  sctp_sf_do_ecne|  -32
  sctp_sf_eat_fwd_tsn| -135
  sctp_sf_do_5_1D_ce | -197
  sctp_sf_beat_8_3   |  -28
 60 functions changed, 6184 bytes removed, diff: -6184
net/sctp/sm_sideeffect.c:
  sctp_side_effects | -3873
  sctp_do_sm| +3429
 2 functions changed, 3429 bytes added, 3873 bytes removed, diff: -444
kernel/uninlined.c:
  sctp_add_cmd_sf   |  +35
 1 function changed, 35 bytes added, diff: +35

vmlinux.o:
 63 functions changed, 3464 bytes added, 10057 bytes removed, diff: -6593

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
Cc: Vlad Yasevich <[EMAIL PROTECTED]>
---
 include/net/sctp/sm.h |8 ++--
 net/sctp/command.c|   12 +++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index ef9e7ed..6740b11 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -385,13 +385,9 @@ static inline int ADDIP_SERIAL_gte(__u16 s, __u16 t)
return (((s) == (t)) || (((t) - (s)) & ADDIP_SERIAL_SIGN_BIT));
 }
 
-
 /* Run sctp_add_cmd() generating a BUG() if there is a failure.  */
-static inline void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb, 
sctp_arg_t obj)
-{
-   if (unlikely(!sctp_add_cmd(seq, verb, obj)))
-   BUG();
-}
+extern void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb,
+   sctp_arg_t obj);
 
 /* Check VTAG of the packet matches the sender's own tag. */
 static inline int
diff --git a/net/sctp/command.c b/net/sctp/command.c
index bb97733..187da2d 100644
--- a/net/sctp/command.c
+++ b/net/sctp/command.c
@@ -51,8 +51,11 @@ int sctp_init_cmd_seq(sctp_cmd_seq_t *seq)
 
 /* Add a command to a sctp_cmd_seq_t.
  * Return 0 if the command sequence is full.
+ *
+ * Inline here is not a mistake, this way sctp_add_cmd_sf doesn't need extra
+ * calls, size penalty is of insignificant magnitude here
  */
-int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj)
+inline int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj)
 {
if (seq->next_free_slot >= SCTP_MAX_NUM_COMMANDS)
goto fail;
@@ -66,6 +69,13 @@ fail:
return 0;
 }
 
+/* Run sctp_add_cmd() generating a BUG() if there is a failure.  */
+void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj)
+{
+	if (unlikely(!sctp_add_cmd(seq, verb, obj)))
+		BUG();
+}

[RFC PATCH 8/8] Jhash is too big for inlining, move under lib/

2008-02-20 Thread Ilpo Järvinen
vmlinux.o:
 62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869

...+ these to lib/jhash.o:
 jhash_3words: 112
 jhash2: 276
 jhash: 475

select for networking code might need a more fine-grained approach.
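For reference, a userland replica of the jhash() this patch moves out of line. The __jhash_mix() body is not visible in the archived diff, so the mix below assumes the classic lookup2 mix credited in the removed header comment; treat it as a sketch, not the authoritative kernel source:

```c
#include <assert.h>
#include <stdint.h>

#define JHASH_GOLDEN_RATIO 0x9e3779b9U

/* Assumed to match the kernel's __jhash_mix() (Bob Jenkins' lookup2.c). */
#define jhash_mix(a, b, c)				\
{							\
	a -= b; a -= c; a ^= (c >> 13);			\
	b -= c; b -= a; b ^= (a << 8);			\
	c -= a; c -= b; c ^= (b >> 13);			\
	a -= b; a -= c; a ^= (c >> 12);			\
	b -= c; b -= a; b ^= (a << 16);			\
	c -= a; c -= b; c ^= (b >> 12);			\
	a -= b; a -= c; a ^= (c >> 3);			\
	b -= c; b -= a; b ^= (a << 10);			\
	c -= a; c -= b; c ^= (b >> 15);			\
}

/* Hash an arbitrary byte sequence: 12 bytes per round, then the tail. */
static uint32_t jhash_sketch(const void *key, uint32_t length, uint32_t initval)
{
	uint32_t a, b, c, len = length;
	const uint8_t *k = key;

	a = b = JHASH_GOLDEN_RATIO;
	c = initval;

	while (len >= 12) {
		a += k[0] + ((uint32_t)k[1] << 8) + ((uint32_t)k[2] << 16) + ((uint32_t)k[3] << 24);
		b += k[4] + ((uint32_t)k[5] << 8) + ((uint32_t)k[6] << 16) + ((uint32_t)k[7] << 24);
		c += k[8] + ((uint32_t)k[9] << 8) + ((uint32_t)k[10] << 16) + ((uint32_t)k[11] << 24);
		jhash_mix(a, b, c);
		k += 12;
		len -= 12;
	}

	c += length;
	switch (len) {	/* all cases fall through by design */
	case 11: c += (uint32_t)k[10] << 24;
	case 10: c += (uint32_t)k[9] << 16;
	case 9:  c += (uint32_t)k[8] << 8;
	case 8:  b += (uint32_t)k[7] << 24;
	case 7:  b += (uint32_t)k[6] << 16;
	case 6:  b += (uint32_t)k[5] << 8;
	case 5:  b += k[4];
	case 4:  a += (uint32_t)k[3] << 24;
	case 3:  a += (uint32_t)k[2] << 16;
	case 2:  a += (uint32_t)k[1] << 8;
	case 1:  a += k[0];
	}
	jhash_mix(a, b, c);
	return c;
}
```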

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 drivers/infiniband/Kconfig |1 +
 drivers/net/Kconfig|1 +
 fs/Kconfig |1 +
 fs/dlm/Kconfig |1 +
 fs/gfs2/Kconfig|1 +
 include/linux/jhash.h  |   99 +
 init/Kconfig   |2 +
 lib/Kconfig|6 ++
 lib/Makefile   |1 +
 lib/jhash.c|  116 
 net/Kconfig|1 +
 11 files changed, 134 insertions(+), 96 deletions(-)
 create mode 100644 lib/jhash.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a5dc78a..421ab71 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -2,6 +2,7 @@ menuconfig INFINIBAND
tristate "InfiniBand support"
depends on PCI || BROKEN
depends on HAS_IOMEM
+   select JHASH
---help---
  Core support for InfiniBand (IB).  Make sure to also select
  any protocols you wish to use as well as drivers for your
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index f337800..8257648 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2496,6 +2496,7 @@ config CHELSIO_T3
tristate "Chelsio Communications T3 10Gb Ethernet support"
depends on PCI
select FW_LOADER
+   select JHASH
help
  This driver supports Chelsio T3-based gigabit and 10Gb Ethernet
  adapters.
diff --git a/fs/Kconfig b/fs/Kconfig
index d731282..693fe71 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1667,6 +1667,7 @@ config NFSD
select LOCKD
select SUNRPC
select EXPORTFS
+   select JHASH
select NFSD_V2_ACL if NFSD_V3_ACL
select NFS_ACL_SUPPORT if NFSD_V2_ACL
select NFSD_TCP if NFSD_V4
diff --git a/fs/dlm/Kconfig b/fs/dlm/Kconfig
index 2dbb422..f27a99a 100644
--- a/fs/dlm/Kconfig
+++ b/fs/dlm/Kconfig
@@ -4,6 +4,7 @@ menuconfig DLM
depends on SYSFS && (IPV6 || IPV6=n)
select CONFIGFS_FS
select IP_SCTP
+   select JHASH
help
A general purpose distributed lock manager for kernel or userspace
applications.
diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig
index de8e64c..b9dcabf 100644
--- a/fs/gfs2/Kconfig
+++ b/fs/gfs2/Kconfig
@@ -3,6 +3,7 @@ config GFS2_FS
depends on EXPERIMENTAL
select FS_POSIX_ACL
select CRC32
+   select JHASH
help
  A cluster filesystem.
 
diff --git a/include/linux/jhash.h b/include/linux/jhash.h
index 2a2f99f..14200c6 100644
--- a/include/linux/jhash.h
+++ b/include/linux/jhash.h
@@ -1,25 +1,6 @@
 #ifndef _LINUX_JHASH_H
 #define _LINUX_JHASH_H
 
-/* jhash.h: Jenkins hash support.
- *
- * Copyright (C) 1996 Bob Jenkins ([EMAIL PROTECTED])
- *
- * http://burtleburtle.net/bob/hash/
- *
- * These are the credits from Bob's sources:
- *
- * lookup2.c, by Bob Jenkins, December 1996, Public Domain.
- * hash(), hash2(), hash3, and mix() are externally useful functions.
- * Routines to test the hash are included if SELF_TEST is defined.
- * You can use this free for any purpose.  It has no warranty.
- *
- * Copyright (C) 2003 David S. Miller ([EMAIL PROTECTED])
- *
- * I've modified Bob's hash to be useful in the Linux kernel, and
- * any bugs present are surely my fault.  -DaveM
- */
-
 /* NOTE: Arguments are modified. */
 #define __jhash_mix(a, b, c) \
 { \
@@ -41,77 +22,12 @@
  * of bytes.  No alignment or length assumptions are made about
  * the input key.
  */
-static inline u32 jhash(const void *key, u32 length, u32 initval)
-{
-   u32 a, b, c, len;
-   const u8 *k = key;
-
-   len = length;
-   a = b = JHASH_GOLDEN_RATIO;
-   c = initval;
-
-   while (len >= 12) {
-   a += (k[0] +((u32)k[1]<<8) +((u32)k[2]<<16) +((u32)k[3]<<24));
-   b += (k[4] +((u32)k[5]<<8) +((u32)k[6]<<16) +((u32)k[7]<<24));
-   c += (k[8] +((u32)k[9]<<8) +((u32)k[10]<<16)+((u32)k[11]<<24));
-
-   __jhash_mix(a,b,c);
-
-   k += 12;
-   len -= 12;
-   }
-
-   c += length;
-   switch (len) {
-   case 11: c += ((u32)k[10]<<24);
-   case 10: c += ((u32)k[9]<<16);
-   case 9 : c += ((u32)k[8]<<8);
-   case 8 : b += ((u32)k[7]<<24);
-   case 7 : b += ((u32)k[6]<<16);
-   case 6 : b += ((u32)k[5]<<8);
-   case 5 : b += k[4];
-   case 4 : a += ((u32)k[3]<<24);
-   case 3 : a += ((u32)k[2]<<16);
-   case 2 : a += ((u32)k[1]<<8);
-   case 1 : a += k[0];
-   };
-
-   __jhash_mix(a,b,c);
-
-   return c;
-}
+extern u32 jhash(const void *key, u32 length, u32 initval);
 
 /* A special optimized version 

[RFC PATCH 5/8] [NET]: uninline dst_release

2008-02-20 Thread Ilpo Järvinen
Codiff stats:
-16420  187 funcs, 103 +, 16523 -, diff: -16420 --- dst_release

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/net/dst.h |   10 +-
 net/core/dst.c|   10 ++
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index e3ac7d0..bf33471 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -158,15 +158,7 @@ struct dst_entry * dst_clone(struct dst_entry * dst)
return dst;
 }
 
-static inline
-void dst_release(struct dst_entry * dst)
-{
-   if (dst) {
-   WARN_ON(atomic_read(&dst->__refcnt) < 1);
-   smp_mb__before_atomic_dec();
-   atomic_dec(&dst->__refcnt);
-   }
-}
+extern void dst_release(struct dst_entry *dst);
 
 /* Children define the path of the packet through the
  * Linux networking.  Thus, destinations are stackable.
diff --git a/net/core/dst.c b/net/core/dst.c
index 7deef48..cc2e724 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -259,6 +259,16 @@ again:
return NULL;
 }
 
+void dst_release(struct dst_entry *dst)
+{
+   if (dst) {
+   WARN_ON(atomic_read(&dst->__refcnt) < 1);
+   smp_mb__before_atomic_dec();
+   atomic_dec(&dst->__refcnt);
+   }
+}
+EXPORT_SYMBOL(dst_release);
+
 /* Dirty hack. We did it in 2.2 (in __dst_free),
  * we have _very_ good reasons not to repeat
  * this mistake in 2.3, but we have no choice
-- 
1.5.2.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
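The dst_release() body moved out of line above is a refcount drop behind a barrier. A userland sketch of the same pattern with C11 atomics, where memory_order_release plays the role of smp_mb__before_atomic_dec() and assert() stands in for WARN_ON():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct dst { atomic_int refcnt; };	/* stand-in for dst_entry.__refcnt */

static void dst_release_sketch(struct dst *dst)
{
	if (dst) {
		/* Dropping a reference nobody holds is a bug. */
		assert(atomic_load(&dst->refcnt) >= 1);
		/* Release ordering: prior writes become visible before the
		 * count drops, like the kernel's barrier before the dec. */
		atomic_fetch_sub_explicit(&dst->refcnt, 1,
					  memory_order_release);
	}
}
```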


Re: PowerPC toolchain for x86 [Was: pci_device_id cleanups]

2008-02-20 Thread Jonas Bonn



Sam Ravnborg wrote:
> On Wed, Feb 20, 2008 at 02:27:19PM +0100, Jonas Bonn wrote:
>> Sam Ravnborg wrote:
>>> On Wed, Feb 20, 2008 at 01:53:36PM +0100, Jonas Bonn wrote:
>>>> The PCI_DEVICE_TABLE patch I sent earlier doesn't necessarily make
>>>> much sense by itself... here is a set of patches that apply
>>>> this macro, in turn moving a lot of this data into __devinitconst
>>>> which is discardable in certain situations.
>>>> Hopefully the benefit of this approach is a bit clearer now.
>>> [shorter lines please..]
>>
>> Sorry...
>>
>>> Can you please confirm that this does not break powerpc (64 bit)
>>> as they have troubles with the constification..
>>
>> I do not have access to any PowerPC machine... Olof Johansson built the
>> tree I posted earlier on PowerPC; there's nothing really new here except
>> the wrapping of the definition in a macro.
>
> And you added const and a specific section.

No... once the macro is expanded the code is exactly the same as that
which built cleanly on powerpc previously (which Olof built, I mean)...
nothing new here.

> Exactly what could break on PowerPC.
>
> To do the build break check is easy.
> Google for "crosstool" and build your own powerpc toolchain.

Thanks... I'll throw together a cross compiler and see what I can do.

/Jonas

> Andrew has something precompiled somewhere but I lost the link.
>
> 	Sam



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Patrick McHardy

Ilpo Järvinen wrote:

~500 files changed
...
kernel/uninlined.c:
  skb_put   | +104
 1 function changed, 104 bytes added, diff: +104

vmlinux.o:
 869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805

This change is INCOMPLETE, I think that the call to current_text_addr()
should be rethought but I don't have a clue how to do that.



I guess __builtin_return_address(0) would work.
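The suggestion matters because, once skb_put() is out of line, current_text_addr() would always report a point inside skb_put() itself rather than the offending caller; __builtin_return_address(0) reports the call site instead. A minimal sketch:

```c
#include <assert.h>
#include <stddef.h>

/* noinline keeps a real call frame, so there is a return address to read.
 * In the uninlined skb_put(), skb_over_panic() would print this value
 * instead of an address inside the helper itself. */
__attribute__((noinline))
static const void *blame_my_caller(void)
{
	return __builtin_return_address(0);
}
```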
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Pavel Emelyanov
There are three places which declare a char buf[...] on the stack
to push it later into dprintk(). Since dprintk() sometimes (if
CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
cause gcc to produce "unused variable" warnings.

Mark them as __maybe_unused.
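A compressed illustration of the warning being silenced (the macro names here are hypothetical stand-ins; the real dprintk lives in sunrpc's debug machinery):

```c
#include <assert.h>
#include <stdio.h>

#ifdef RPC_DEBUG
#define dprintk(fmt, ...)	fprintf(stderr, fmt, __VA_ARGS__)
#else
/* With debugging compiled out, the macro swallows its arguments... */
#define dprintk(fmt, ...)	do { } while (0)
#endif

#define __maybe_unused __attribute__((unused))

static int demo(void)
{
	/* ...leaving buf with no remaining user: gcc's -Wunused-variable
	 * would fire unless the attribute marks this as intentional. */
	char buf[64] __maybe_unused;

	dprintk("peer address: %s\n", buf);
	return 0;
}
```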

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d3e5fc..303f105 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -175,7 +175,7 @@ static int svc_sendto(struct svc_rqst *rqstp, struct 
xdr_buf *xdr)
size_t  base = xdr->page_base;
unsigned intpglen = xdr->page_len;
unsigned intflags = MSG_MORE;
-   charbuf[RPC_MAX_ADDRBUFLEN];
+   charbuf[RPC_MAX_ADDRBUFLEN] __maybe_unused;
 
slen = xdr->len;
 
@@ -716,7 +716,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt 
*xprt)
struct socket   *newsock;
struct svc_sock *newsvsk;
int err, slen;
-   charbuf[RPC_MAX_ADDRBUFLEN];
+   charbuf[RPC_MAX_ADDRBUFLEN] __maybe_unused;
 
dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
if (!sock)
@@ -1206,7 +1206,7 @@ static struct svc_xprt *svc_create_socket(struct svc_serv 
*serv,
struct socket   *sock;
int error;
int type;
-   charbuf[RPC_MAX_ADDRBUFLEN];
+   charbuf[RPC_MAX_ADDRBUFLEN] __maybe_unused;
struct sockaddr_storage addr;
struct sockaddr *newsin = (struct sockaddr *)&addr;
int newlen;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Eric Dumazet
On Wed, 20 Feb 2008 15:47:11 +0200
"Ilpo Järvinen" <[EMAIL PROTECTED]> wrote:

> ~500 files changed
> ...
> kernel/uninlined.c:
>   skb_put   | +104
>  1 function changed, 104 bytes added, diff: +104
> 
> vmlinux.o:
>  869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805
> 
> This change is INCOMPLETE, I think that the call to current_text_addr()
> should be rethought but I don't have a clue how to do that.

You want to use __builtin_return_address(0)

> 
> Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
> ---
>  include/linux/skbuff.h |   20 +---
>  net/core/skbuff.c  |   21 +
>  2 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 412672a..5925435 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -896,25 +896,7 @@ static inline unsigned char *__skb_put(struct sk_buff 
> *skb, unsigned int len)
>   return tmp;
>  }
>  
> -/**
> - *   skb_put - add data to a buffer
> - *   @skb: buffer to use
> - *   @len: amount of data to add
> - *
> - *   This function extends the used data area of the buffer. If this would
> - *   exceed the total buffer size the kernel will panic. A pointer to the
> - *   first byte of the extra data is returned.
> - */
> -static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
> -{
> - unsigned char *tmp = skb_tail_pointer(skb);
> - SKB_LINEAR_ASSERT(skb);
> - skb->tail += len;
> - skb->len  += len;
> - if (unlikely(skb->tail > skb->end))
> - skb_over_panic(skb, len, current_text_addr());
> - return tmp;
> -}
> +extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
>  
>  static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int 
> len)
>  {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4e35422..661d439 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -857,6 +857,27 @@ free_skb:
>   return err;
>  }
>  
> +/**
> + *   skb_put - add data to a buffer
> + *   @skb: buffer to use
> + *   @len: amount of data to add
> + *
> + *   This function extends the used data area of the buffer. If this would
> + *   exceed the total buffer size the kernel will panic. A pointer to the
> + *   first byte of the extra data is returned.
> + */
> +unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
> +{
> + unsigned char *tmp = skb_tail_pointer(skb);
> + SKB_LINEAR_ASSERT(skb);
> + skb->tail += len;
> + skb->len  += len;
> + if (unlikely(skb->tail > skb->end))
> + skb_over_panic(skb, len, current_text_addr());
> + return tmp;
> +}
> +EXPORT_SYMBOL(skb_put);
> +
>  /* Trims skb to length len. It can change skb pointers.
>   */
>  
> -- 
> 1.5.2.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


PowerPC toolchain for x86 [Was: pci_device_id cleanups]

2008-02-20 Thread Sam Ravnborg
On Wed, Feb 20, 2008 at 02:27:19PM +0100, Jonas Bonn wrote:
> Sam Ravnborg wrote:
> >On Wed, Feb 20, 2008 at 01:53:36PM +0100, Jonas Bonn wrote:
> >>The PCI_DEVICE_TABLE patch I sent earlier doesn't necessarily make
> >>much sense by itself... here is a set of patches that apply
> >>this macro, in turn moving a lot of this data into __devinitconst
> >>which is discardable in certain situations.
> >>Hopefully the benefit of this approach is a bit clearer now.
> >[shorter lines please..]
> 
> Sorry...
> 
> >
> >Can you please confirm that this does not break powerpc (64 bit)
> >as they have troubles with the constification..
> 
> I do not have access to any PowerPC machine... Olof Johansson built the 
> tree I posted earlier on PowerPC; there's nothing really new here except 
> the wrapping of the definition in a macro.
And you added const and a specific section.
Exactly what could break on PowerPC.

To do the build break check is easy.
Google for "crosstool" and build your own powerpc toolchain.

Andrew has something precompiled somewhere but I lost the link.


Sam
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 network data corruption regression 2.6.24/2.6.23.4

2008-02-20 Thread Tony Battersby
Michael Chan wrote:
> On Tue, 2008-02-19 at 17:14 -0500, Tony Battersby wrote:
>
>   
>> Update: when I revert Herbert's patch in addition to applying your
>> patch, the iSCSI performance goes back up to 115 MB/s again in both
>> directions.  So it looks like turning off SG for TX didn't itself cause
>> the performance drop, but rather that the performance drop is just
>> another manifestation of whatever bug is causing the data corruption.
>>
>> I do not regularly use wireshark or look at network packet dumps, so I
>> am not really sure what to look for.  Given the above information, do
>> you still believe that there is value in examining the packet dump?
>>
>> 
>
> Can you confirm whether you're getting TCP checksum errors on the other
> side that is receiving packets from the 5701?  You can just check
> statistics using netstat -s.  I suspect that after we turn off SG,
> checksum is no longer offloaded and we are getting lots of TCP checksum
> errors instead that are slowing the performance.
>
>
>   
Confirmed.  With a 100 MB read/write test, netstat -s shows 75 bad
segments received, and performance in the one direction is about 5
MB/s.  When I switch to the SysKonnect NIC, netstat -s shows 0 bad
segments received, and performance is 115 MB/s.  So that solves that
mystery - there is still data corruption, but the software-computed TCP
checksum causes the bad packets to be retransmitted rather than being
passed on to the application.

Tony
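The counter checked here with netstat -s ("bad segments received") is the InErrs column of the Tcp block in /proc/net/snmp. A hedged sketch of pulling it out programmatically, using the file's self-describing header to locate the column (sample lines below are made up; on a live host the same code would run over the real file's contents):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Locate the "InErrs" column on a /proc/net/snmp "Tcp:" header line and
 * return the matching value from the value line; -1 on parse failure. */
static long tcp_in_errs(const char *hdr_line, const char *val_line)
{
	char hdr[512], val[512], *tok;
	int col = -1, i;

	snprintf(hdr, sizeof(hdr), "%s", hdr_line);
	snprintf(val, sizeof(val), "%s", val_line);

	for (i = 0, tok = strtok(hdr, " \t"); tok;
	     tok = strtok(NULL, " \t"), i++)
		if (strcmp(tok, "InErrs") == 0)
			col = i;
	if (col < 0)
		return -1;
	for (i = 0, tok = strtok(val, " \t"); tok;
	     tok = strtok(NULL, " \t"), i++)
		if (i == col)
			return strtol(tok, NULL, 10);
	return -1;
}
```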

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/28] mm: slb: add knowledge of reserve pages

2008-02-20 Thread Peter Zijlstra
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.
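The enforcement point this patch adds to __slab_alloc() reduces to one predicate; a hypothetical userland condensation (the names mirror the patch, the flag value is a stand-in):

```c
#include <assert.h>

#define ALLOC_NO_WATERMARKS	0x04	/* stand-in for the real flag */

struct cpu_slab {
	int reserve;	/* set when this slab's page came from the reserve */
};

/* Mirrors the new check: an allocation context not entitled to the
 * reserves must go back to the page allocator ("grow_slab") so the
 * watermarks are re-tested, instead of consuming reserve-slab objects. */
static int must_grow_slab(const struct cpu_slab *c, int alloc_flags)
{
	return c->reserve && !(alloc_flags & ALLOC_NO_WATERMARKS);
}
```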

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/slub_def.h |1 
 mm/slab.c|   60 +++
 mm/slub.c|   42 +---
 3 files changed, 80 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -21,11 +21,12 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1098,15 +1099,15 @@ static struct page *allocate_slab(struct
return page;
 }
 
-static void setup_object(struct kmem_cache *s, struct page *page,
-   void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)
 {
setup_object_debug(s, page, object);
if (unlikely(s->ctor))
s->ctor(s, object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static
+struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int 
*reserve)
 {
struct page *page;
struct kmem_cache_node *n;
@@ -1121,6 +1122,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
 
+   *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1228,8 +1230,7 @@ static __always_inline int slab_trylock(
 /*
  * Management of partially allocated slabs
  */
-static void add_partial(struct kmem_cache_node *n,
-   struct page *page, int tail)
+static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
spin_lock(&n->list_lock);
n->nr_partial++;
@@ -1240,8 +1241,7 @@ static void add_partial(struct kmem_cach
spin_unlock(&n->list_lock);
 }
 
-static void remove_partial(struct kmem_cache *s,
-   struct page *page)
+static void remove_partial(struct kmem_cache *s, struct page *page)
 {
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
 
@@ -1256,7 +1256,8 @@ static void remove_partial(struct kmem_c
  *
  * Must hold list_lock.
  */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n, struct page 
*page)
+static inline
+int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
 {
if (slab_trylock(page)) {
list_del(&page->lru);
@@ -1514,11 +1515,21 @@ static void *__slab_alloc(struct kmem_ca
 {
void **object;
struct page *new;
+   int reserve;
 #ifdef SLUB_FASTPATH
unsigned long flags;
 
local_irq_save(flags);
 #endif
+   if (unlikely(c->reserve)) {
+   /*
+* If the current slab is a reserve slab and the current
+* allocation context does not allow access to the reserves we
+* must force an allocation to test the current levels.
+*/
+   if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+   goto grow_slab;
+   }
if (!c->page)
goto new_slab;
 
@@ -1530,7 +1541,7 @@ load_freelist:
object = c->page->freelist;
if (unlikely(object == c->page->end))
goto another_slab;
-   if (unlikely(SlabDebug(c->page)))
+   if (unlikely(SlabDebug(c->page) || c->reserve))
goto debug;
 
object = c->page->freelist;
@@ -1557,16 +1568,18 @@ new_slab:
goto load_freelist;
}
 
+grow_slab:
if (gfpflags & __GFP_WAIT)
local_irq_enable();
 
-   new = new_slab(s, gfpflags, node);
+   new = new_slab(s, gfpflags, node, &reserve);
 
if (gfpflags & __GFP_WAIT)
local_irq_disable();
 
if (new) {
c = get_cpu_slab(s, smp_processor_id());
+   c->reserve = reserve;
stat(c, ALLOC_SLAB);
if (c->page)
flush_slab(s, c);
@@ -1594,8 +1607,8 @@ new_slab:
 
return NULL;
 debug:
-   object = c->page->freelist;
-   if (!alloc_debug_processing(s, c->page, object, addr))
+   if (SlabDebug(c->page) &&
+   !alloc_debug_processing(s, c->page, object, addr))
goto another_slab;
 
c->page->inuse++;
@@ -2153,10 +2166,11 @@ static struct kmem_cache_node *early_kme
struct page *page;
struct kmem_cache_node *n;
unsigned long flags;
+   int reserve;
 
  

[PATCH 02/28] mm: tag reserve pages

2008-02-20 Thread Peter Zijlstra
Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mm_types.h |1 +
 mm/page_alloc.c  |4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -73,6 +73,7 @@ struct page {
union {
pgoff_t index;  /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+   int reserve;/* page_alloc: page is a reserve page */
};
struct list_head lru;   /* Pageout list, eg. active_list
 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1418,8 +1418,10 @@ zonelist_scan:
}
 
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-   if (page)
+   if (page) {
+   page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
break;
+   }
 this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);

--

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/28] mm: kmem_estimate_pages()

2008-02-20 Thread Peter Zijlstra
Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/slab.h |4 ++
 mm/slab.c|   75 ++
 mm/slub.c|   82 +++
 3 files changed, 161 insertions(+)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,8 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep,
+   gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -94,6 +96,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2465,6 +2465,37 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate @objects
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+   unsigned long slabs;
+
+   if (WARN_ON(!s) || WARN_ON(!s->objects))
+   return 0;
+
+   slabs = DIV_ROUND_UP(objects, s->objects);
+
+   /*
+* Account the possible additional overhead if the slab holds more than
+* one object.
+*/
+   if (s->objects > 1) {
+   /*
+* Account the possible additional overhead if per cpu slabs
+* are currently empty and have to be allocated. This is very
+* unlikely but a possible scenario immediately after
+* kmem_cache_shrink.
+*/
+   slabs += num_online_cpus();
+   }
+
+   return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
  * Attempt to free all slabs on a node. Return the number of slabs we
  * were unable to free.
  */
@@ -2818,6 +2849,57 @@ static unsigned long count_partial(struc
 }
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+   struct kmem_cache *s = get_slab(size, flags);
+   if (!s)
+   return 0;
+
+   return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocation of heterogeneous size.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+   int i;
+   unsigned long pages;
+
+   /*
+* multiply by two, in order to account for the worst-case slack space
+* due to the power-of-two allocation sizes.
+*/
+   pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+   /*
+* add the kmem_cache overhead of each possible kmalloc cache
+*/
+   for (i = 1; i < PAGE_SHIFT; i++) {
+   struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+   if (unlikely(flags & SLUB_DMA))
+   s = dma_kmalloc_cache(i, flags);
+   else
+#endif
+   s = &kmalloc_caches[i];
+
+   if (s)
+   pages += kmem_estimate_pages(s, flags, 0);
+   }
+
+   return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -3851,6 +3851,81 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate @objects
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *cachep,
+   gfp_t flags, int objects)
+{
+   /*
+  

[PATCH 12/28] net: wrap sk->sk_backlog_rcv()

2008-02-20 Thread Peter Zijlstra
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h   |5 +
 include/net/tcp.h|2 +-
 net/core/sock.c  |4 ++--
 net/ipv4/tcp.c   |2 +-
 net/ipv4/tcp_timer.c |2 +-
 5 files changed, 10 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -474,6 +474,11 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+   return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)  \
({  int __rc;   \
release_sock(__sk); \
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -325,7 +325,7 @@ int sk_receive_skb(struct sock *sk, stru
 */
mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-   rc = sk->sk_backlog_rcv(sk, skb);
+   rc = sk_backlog_rcv(sk, skb);
 
mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
} else
@@ -1360,7 +1360,7 @@ static void __release_sock(struct sock *
struct sk_buff *next = skb->next;
 
skb->next = NULL;
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
/*
 * We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1158,7 +1158,7 @@ static void tcp_prequeue_process(struct 
 * necessary */
local_bh_disable();
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
local_bh_enable();
 
/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -203,7 +203,7 @@ static void tcp_delack_timer(unsigned lo
NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-   sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
tp->ucopy.memory = 0;
}
Index: linux-2.6/include/net/tcp.h
===
--- linux-2.6.orig/include/net/tcp.h
+++ linux-2.6/include/net/tcp.h
@@ -879,7 +879,7 @@ static inline int tcp_prequeue(struct so
BUG_ON(sock_owned_by_user(sk));
 
while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != 
NULL) {
-   sk->sk_backlog_rcv(sk, skb1);
+   sk_backlog_rcv(sk, skb1);
NET_INC_STATS_BH(LINUX_MIB_TCPPREQUEUEDROPPED);
}
 

--



[PATCH 07/28] mm: emergency pool

2008-02-20 Thread Peter Zijlstra
Provide means to reserve a specific number of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mmzone.h |3 +
 mm/page_alloc.c|   84 +++--
 mm/vmstat.c|6 +--
 3 files changed, 79 insertions(+), 14 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -213,7 +213,7 @@ enum zone_type {
 
 struct zone {
/* Fields commonly accessed by the page allocator */
-   unsigned long   pages_min, pages_low, pages_high;
+   unsigned long   pages_emerg, pages_min, pages_low, pages_high;
/*
 * We don't know if the memory that we're going to allocate will be 
freeable
 * or/and it will be released eventually, so to avoid totally wasting 
several
@@ -683,6 +683,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1240,7 +1242,7 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
 
-   if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+   if (free_pages <= min+z->lowmem_reserve[classzone_idx]+z->pages_emerg)
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -1569,7 +1571,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
struct reclaim_state reclaim_state;
struct task_struct *p = current;
int do_retry;
-   int alloc_flags;
+   int alloc_flags = 0;
int did_some_progress;
 
might_sleep_if(wait);
@@ -1721,8 +1723,8 @@ nofail_alloc:
 nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
-   " order:%d, mode:0x%x\n",
-   p->comm, order, gfp_mask);
+   " order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+   p->comm, order, gfp_mask, alloc_flags, p->flags);
dump_stack();
show_mem();
}
@@ -1937,9 +1939,9 @@ void show_free_areas(void)
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
-   K(zone->pages_min),
-   K(zone->pages_low),
-   K(zone->pages_high),
+   K(zone->pages_emerg + zone->pages_min),
+   K(zone->pages_emerg + zone->pages_low),
+   K(zone->pages_emerg + zone->pages_high),
K(zone_page_state(zone, NR_ACTIVE)),
K(zone_page_state(zone, NR_INACTIVE)),
K(zone->present_pages),
@@ -4125,7 +4127,7 @@ static void calculate_totalreserve_pages
}
 
/* we treat pages_high as reserved pages. */
-   max += zone->pages_high;
+   max += zone->pages_high + zone->pages_emerg;
 
if (max > zone->present_pages)
max = zone->present_pages;
@@ -4182,7 +4184,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-   unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+   unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+   unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -4194,11 +4197,13 @@ static void __setup_per_zone_pages_min(v
}
 
for_each_zone(zone) {
-   u64 tmp;
+   u64 tmp, tmp_emerg;
 
spin_lock_irqsave(&zone->lru_lock, flags);
tmp = (u64)pages_min * zon

[PATCH 10/28] mm: memory reserve management

2008-02-20 Thread Peter Zijlstra
Generic reserve management code. 

It provides methods to reserve and charge. On top of this, generic
alloc/free-style reserve pools could be built, which could fully replace
mempool_t functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/reserve.h |   54 ++
 mm/Makefile |2 
 mm/reserve.c|  429 
 3 files changed, 484 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/reserve.h
===
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,54 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[EMAIL PROTECTED]>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include 
+#include 
+
+struct mem_reserve {
+   struct mem_reserve *parent;
+   struct list_head children;
+   struct list_head siblings;
+
+   const char *name;
+
+   long pages;
+   long limit;
+   long usage;
+   spinlock_t lock;/* protects limit and usage */
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+ struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+   struct mem_reserve *node);
+int mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages,
+int overcommit);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+  int overcommit);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+  struct kmem_cache *s,
+  int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+ long objs,
+ int overcommit);
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
   page_alloc.o page-writeback.o pdflush.o \
   readahead.o swap.o truncate.o vmscan.o \
   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-  page_isolation.o $(mmu-y)
+  page_isolation.o reserve.o $(mmu-y)
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)   += bounce.o
Index: linux-2.6/mm/reserve.c
===
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,429 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra <[EMAIL PROTECTED]>
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of objects of a
+ * specified size. Since memory is managed in pages, this reserve demand is
+ * then translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the unit starts mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, be consistent about
+ * which node is charged; resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+   .children = LIST_HEAD_INIT(mem_reserve_root.children),
+   .siblings = LIST_HEAD_

[PATCH 13/28] net: packet split receive api

2008-02-20 Thread Peter Zijlstra
Add some packet-split receive hooks.

For one, this allows us to do NUMA-node-affine page allocations. Later on these hooks
will be extended to do emergency reserve allocations for fragments.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 drivers/net/bnx2.c |8 +++-
 drivers/net/e1000/e1000_main.c |8 ++--
 drivers/net/e1000e/netdev.c|7 ++-
 drivers/net/igb/igb_main.c |8 ++--
 drivers/net/ixgbe/ixgbe_main.c |   10 +++---
 drivers/net/sky2.c |   16 ++--
 include/linux/skbuff.h |   23 +++
 net/core/skbuff.c  |   20 
 8 files changed, 61 insertions(+), 39 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4478,12 +4478,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
PAGE_SIZE, PCI_DMA_FROMDEVICE);
ps_page_dma->ps_page_dma[j] = 0;
-   skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-  length);
+   skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
ps_page->ps_page[j] = NULL;
-   skb->len += length;
-   skb->data_len += length;
-   skb->truesize += length;
}
 
/* strip the ethernet crc, problem is we're using pages now so
@@ -4691,7 +4687,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
if (j < adapter->rx_ps_pages) {
if (likely(!ps_page->ps_page[j])) {
ps_page->ps_page[j] =
-   alloc_page(GFP_ATOMIC);
+   netdev_alloc_page(netdev);
if (unlikely(!ps_page->ps_page[j])) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -846,6 +846,9 @@ static inline void skb_fill_page_desc(st
skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+   int off, int size);
+
 #define SKB_PAGE_ASSERT(skb)   BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb)   BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1339,6 +1342,26 @@ static inline struct sk_buff *netdev_all
return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t 
gfp_mask);
+
+/**
+ * netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ *
+ * Allocate a new page node local to the specified device.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+   return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+   __free_page(page);
+}
+
 /**
  * skb_clone_writable - is the header of a clone writable
  * @skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,26 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+   int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+   struct page *page;
+
+   page = alloc_pages_node(node, gfp_mask, 0);
+   return page;
+}
+EXPORT_SYMBOL(__netdev_alloc_page);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+   int size)
+{
+   skb_fill_page_desc(skb, i, page, off, size);
+   skb->len += size;
+   skb->data_len += size;
+   skb->truesize += size;
+}
+EXPORT_SYMBOL(skb_add_rx_frag);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
Index: linux-2.6/drivers/net/sky2.c
===
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1216,7 +1216,7 @@ static struct sk_buff *sky2_rx_alloc(str
}
 
for (i = 0; i < sky2->rx_nfrags; i++) {
-   struct page *p

[PATCH 15/28] netvm: network reserve infrastructure

2008-02-20 Thread Peter Zijlstra
Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)network TX reserve
3)  protocol TX pages
4)network RX reserve
5)  SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed bounded because it is the upper bound of
memory that can be used for sending pages (not quite true, but good enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side, exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |   35 +++-
 net/Kconfig|3 +
 net/core/sock.c|  113 +
 3 files changed, 150 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -51,6 +51,7 @@
 #include   /* struct sk_buff */
 #include 
 #include 
+#include 
 
 #include 
 
@@ -405,6 +406,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -427,9 +429,40 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+   return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to bounce it. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+   return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-   return gfp_mask;
+   return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+struct mem_reserve net_skb_reserve;
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+   return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+   return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+   return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+   mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_MEMALLOC sockets
+ * @tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ *@tx_reserve_pages is an upper bound on the memory used for TX, hence
+ *we need not account the pages like we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+   int nr_socks;
+   int err;
+
+   err = mem_reserve_pages

[PATCH 08/28] mm: system wide ALLOC_NO_WATERMARKS

2008-02-20 Thread Peter Zijlstra
Change ALLOC_NO_WATERMARKS page allocation such that the reserves are system
wide - which they are per setup_per_zone_pages_min(). When we scrape the
barrel, do it properly.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page_alloc.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1552,6 +1552,12 @@ restart:
 rebalance:
if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+   /*
+* break out of mempolicy boundaries
+*/
+   zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+   gfp_zone(gfp_mask);
+
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order, zonelist,
ALLOC_NO_WATERMARKS);

--



[PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages

2008-02-20 Thread Peter Zijlstra
In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. Like page->index is for mapped pages, this function also gives the
correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mm.h  |   25 +
 include/linux/pagemap.h |2 +-
 mm/swapfile.c   |   19 +++
 3 files changed, 45 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -600,6 +600,17 @@ static inline struct address_space *page
return mapping;
 }
 
+extern struct address_space *__page_file_mapping(struct page *);
+
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+   if (unlikely(PageSwapCache(page)))
+   return __page_file_mapping(page);
+
+   return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -616,6 +627,20 @@ static inline pgoff_t page_index(struct 
return page->index;
 }
 
+extern pgoff_t __page_file_index(struct page *page);
+
+/*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+   if (unlikely(PageSwapCache(page)))
+   return __page_file_index(page);
+
+   return page->index;
+}
+
 /*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-   return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+   return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
Index: linux-2.6/mm/swapfile.c
===
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1818,6 +1818,25 @@ struct swap_info_struct *page_swap_info(
 }
 
 /*
+ * out-of-line __page_file_ methods to avoid include hell.
+ */
+
+struct address_space *__page_file_mapping(struct page *page)
+{
+   VM_BUG_ON(!PageSwapCache(page));
+   return page_swap_info(page)->swap_file->f_mapping;
+}
+EXPORT_SYMBOL_GPL(__page_file_mapping);
+
+pgoff_t __page_file_index(struct page *page)
+{
+   swp_entry_t swap = { .val = page_private(page) };
+   VM_BUG_ON(!PageSwapCache(page));
+   return swp_offset(swap);
+}
+EXPORT_SYMBOL_GPL(__page_file_index);
+
+/*
  * swap_lock prevents swap_map being freed. Don't grab an extra
  * reference on the swaphandle, it doesn't matter if it becomes unused.
  */

--



[PATCH 27/28] nfs: enable swap on NFS

2008-02-20 Thread Peter Zijlstra
Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset
SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects,
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/Kconfig  |   17 +++
 fs/nfs/file.c   |   12 
 fs/nfs/write.c  |   19 +
 include/linux/nfs_fs.h  |2 +
 include/linux/sunrpc/xprt.h |5 ++-
 net/sunrpc/sched.c  |9 --
 net/sunrpc/xprtsock.c   |   63 
 7 files changed, 124 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -373,6 +373,13 @@ static int nfs_launder_page(struct page 
return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+   return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -387,6 +394,11 @@ const struct address_space_operations nf
.direct_IO = nfs_direct_IO,
 #endif
.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+   .swapfile = nfs_swapfile,
+   .swap_out = nfs_swap_out,
+   .swap_in = nfs_readpage,
+#endif
 };
 
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -362,6 +362,25 @@ int nfs_writepage(struct page *page, str
return ret;
 }
 
+int nfs_swap_out(struct file *file, struct page *page,
+struct writeback_control *wbc)
+{
+   struct nfs_open_context *ctx = nfs_file_open_context(file);
+   int status;
+
+   status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+   if (status < 0) {
+   nfs_set_pageerror(page);
+   goto out;
+   }
+
+   status = nfs_writepage_locked(page, wbc);
+
+out:
+   unlock_page(page);
+   return status;
+}
+
 static int nfs_writepages_callback(struct page *page, struct writeback_control 
*wbc, void *data)
 {
int ret;
Index: linux-2.6/include/linux/nfs_fs.h
===
--- linux-2.6.orig/include/linux/nfs_fs.h
+++ linux-2.6/include/linux/nfs_fs.h
@@ -453,6 +453,8 @@ extern int  nfs_flush_incompatible(struc
 extern int  nfs_updatepage(struct file *, struct page *, unsigned int, 
unsigned int);
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
+extern int  nfs_swap_out(struct file *file, struct page *page,
+struct writeback_control *wbc);
 
 /*
  * Try to write back everything synchronously (but check the
Index: linux-2.6/fs/Kconfig
===
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1661,6 +1661,18 @@ config NFS_DIRECTIO
  causes open() to return EINVAL if a file residing in NFS is
  opened with the O_DIRECT flag.
 
+config NFS_SWAP
+   bool "Provide swap over NFS support"
+   default n
+   depends on NFS_FS
+   select SUNRPC_SWAP
+   help
+ This option enables swapon to work on files located on NFS mounts.
+
+ For more details, see Documentation/vm_deadlock.txt
+
+ If unsure, say N.
+
 config NFSD
tristate "NFS server support"
depends on INET
@@ -1794,6 +1806,11 @@ config SUNRPC_BIND34
  If unsure, say N to get traditional behavior (version 2 rpcbind
  requests only).
 
+config SUNRPC_SWAP
+   def_bool n
+   depends on SUNRPC
+   select NETVM
+
 config RPCSEC_GSS_KRB5
tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
depends on SUNRPC && EXPERIMENTAL
Index: linux-2.6/include/linux/sunrpc/xprt.h
===
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -143,7 +143,9 @@ struct rpc_xprt {
unsigned intmax_reqs;   /* total slots */
unsigned long   state;  /* transport state */
unsigned char   shutdown   : 1, /* being shut down */
-   resvport   : 1; /* use a reserved port */
+  

[PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages

2008-02-20 Thread Peter Zijlstra
Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/file.c |6 +++---
 fs/nfs/internal.h |7 ---
 fs/nfs/pagelist.c |6 +++---
 fs/nfs/read.c |6 +++---
 fs/nfs/write.c|   51 ++-
 5 files changed, 39 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -359,7 +359,7 @@ static void nfs_invalidate_page(struct p
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
-   nfs_wb_page_cancel(page->mapping->host, page);
+   nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -370,7 +370,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-   return nfs_wb_page(page->mapping->host, page);
+   return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -397,7 +397,7 @@ static int nfs_vm_page_mkwrite(struct vm
struct address_space *mapping;
 
lock_page(page);
-   mapping = page->mapping;
+   mapping = page_file_mapping(page);
if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping)
goto out_unlock;
 
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -76,11 +76,11 @@ nfs_create_request(struct nfs_open_conte
 * update_nfs_request below if the region is not locked. */
req->wb_page= page;
atomic_set(&req->wb_complete, 0);
-   req->wb_index   = page->index;
+   req->wb_index   = page_file_index(page);
page_cache_get(page);
BUG_ON(PagePrivate(page));
BUG_ON(!PageLocked(page));
-   BUG_ON(page->mapping->host != inode);
+   BUG_ON(page_file_mapping(page)->host != inode);
req->wb_offset  = offset;
req->wb_pgbase  = offset;
req->wb_bytes   = count;
@@ -376,7 +376,7 @@ void nfs_pageio_cond_complete(struct nfs
  * nfs_scan_list - Scan a list for matching requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  * @tag: tag to scan for
  *
Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -458,11 +458,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
struct nfs_open_context *ctx;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
int error;
 
dprintk("NFS: nfs_readpage (%p [EMAIL PROTECTED])\n",
-   page, PAGE_CACHE_SIZE, page->index);
+   page, PAGE_CACHE_SIZE, page_file_index(page));
nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -509,7 +509,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *new;
unsigned int len;
int error;
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
 
spin_lock(&inode->i_lock);
@@ -138,13 +138,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
loff_t end, i_size = i_size_read(inode);
pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-   if (i_size > 0 && page->index < end_index)
+   if (i_size > 0 && page_file_index(page) < end_index)
return;
-   end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+   end = page_offset(page) + ((loff_t)offset+count)

[PATCH 17/28] netvm: hook skb allocation to reserves

2008-02-20 Thread Peter Zijlstra
Change the skb allocation API to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref. 

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mm_types.h |1 
 include/linux/skbuff.h   |   26 +-
 net/core/skbuff.c|  177 +--
 3 files changed, 177 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -308,7 +308,9 @@ struct sk_buff {
__u16   tc_verd;/* traffic control verdict */
 #endif
 #endif
-   /* 2 byte hole */
+   __u8emergency:1;
+   /* 7 bit hole */
+   /* 1 byte hole */
 
 #ifdef CONFIG_NET_DMA
dma_cookie_tdma_cookie;
@@ -339,10 +341,22 @@ struct sk_buff {
 
 #include 
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+   return unlikely(skb->emergency);
+#else
+   return false;
+#endif
+}
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
@@ -352,7 +366,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
@@ -1297,7 +1311,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
  gfp_t gfp_mask)
 {
-   struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+   struct sk_buff *skb =
+   __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1343,6 +1358,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1359,7 +1375,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-   __free_page(page);
+   __netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int flags, int node)
 {
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+   int emergency = 0, memalloc = sk_memalloc_socks();
 
-   cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+   size = SKB_DATA_ALIGN(size);
+   cache = (flags & SKB_ALLOC_FCLONE)
+   ? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+   if (memalloc && (flags & SKB_ALLOC_RX))
+   gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
+#endif
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
-   goto out;
+   goto noskb;
 
-   size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
@@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int
 * See comment in sk_buff definition, just before the 'tail' memb

[PATCH 00/28] Swap over NFS -v16

2008-02-20 Thread Peter Zijlstra
Hi,

Another posting of the full swap over NFS series. 

Andrew/Linus, could we start thinking of sticking this in -mm?

[ patches against 2.6.25-rc2-mm1, also to be found online at:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.25-rc2-mm1/ ]

The patch set can be split into roughly five parts; a description of each
follows.


  Part 1, patches 1-11

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use
fixed-size allocations, but heavily relies on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it currently doesn't do SLOB.

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: sl[au]b: add knowledge of reserve pages
 4 - mm: kmem_estimate_pages()
 5 - mm: allow PF_MEMALLOC from softirq context
 6 - mm: serialize access to min_free_kbytes
 7 - mm: emergency pool
 8 - mm: system wide ALLOC_NO_WATERMARK
 9 - mm: __GFP_MEMALLOC
10 - mm: memory reserve management
11 - selinux: tag avc cache alloc as non-critical


  Part 2, patches 12-14

Provide some generic network infrastructure needed later on.

12 - net: wrap sk->sk_backlog_rcv()
13 - net: packet split receive api
14 - net: sk_allocation() - concentrate socket related allocations


  Part 3, patches 15-21

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive paths require memory allocations.

That is, in the BIO layer writeback completion is usually just an ISR flipping
a bit and waking stuff up. A network writeback completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory, there is no guarantee that the required packet arrives within the
window that memory buys us.

The solution to this problem lies in the fact that the network is assumed to be
lossy. Even now, when there is no memory to receive packets, the network card
has to discard packets. What we do is move this into the network stack.

So we reserve a little pool to act as a receive buffer, this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (as would
have happened anyway). [ NOTE: this is a stable mode of operation with limited
memory usage, exactly the kind of thing we need ]

Again, care is taken to confine most of this overhead to the slow path: only
packets allocated from the reserves suffer the extra atomic overhead needed
for accounting.

15 - netvm: network reserve infrastructure
16 - netvm: INET reserves.
17 - netvm: hook skb allocation to reserves
18 - netvm: filter emergency skbs.
19 - netvm: prevent a TCP specific deadlock
20 - netfilter: NF_QUEUE vs emergency skbs
21 - netvm: skb processing


  Part 4, patches 22-23

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

This provides new a_ops to handle swapcache pages and could be used to obsolete
the bmap usage for swapfiles.

22 - mm: add support for non block device backed swap files
23 - mm: methods for teaching filesystems about PG_swapcache pages


  Part 5, patches 24-28

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

24 - nfs: remove mempools
25 - nfs: teach the NFS client how to treat PG_swapcache pages
26 - nfs: disable data cache revalidation for swapfiles
27 - nfs: enable swap on NFS
28 - nfs: fix various memory recursions possible with swap over NFS.


Changes since -v15:
 - fwd port
 - more SGE fragment drivers ported
 - made the new swapfile logic unconditional
 - various bug fixes and cleanups


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/28] selinux: tag avc cache alloc as non-critical

2008-02-20 Thread Peter Zijlstra
Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Acked-by: James Morris <[EMAIL PROTECTED]>
---
 security/selinux/avc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/security/selinux/avc.c
===
--- linux-2.6.orig/security/selinux/avc.c
+++ linux-2.6/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
struct avc_node *node;
 
-   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
 

--

--


[PATCH 01/28] mm: gfp_to_alloc_flags()

2008-02-20 Thread Peter Zijlstra
Factor out the gfp to alloc_flags mapping so it can be used in other places.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/internal.h   |   11 ++
 mm/page_alloc.c |   98 
 2 files changed, 67 insertions(+), 42 deletions(-)

Index: linux-2.6/mm/internal.h
===
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -47,4 +47,15 @@ static inline unsigned long page_order(s
VM_BUG_ON(!PageBuddy(page));
return page_private(page);
 }
+
+#define ALLOC_HARDER   0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH   0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET   0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 #endif
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1127,14 +1127,6 @@ failed:
return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH   0x08 /* use pages_high watermark */
-#define ALLOC_HARDER   0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET   0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1523,6 +1515,44 @@ static void set_page_owner(struct page *
 #endif /* CONFIG_PAGE_OWNER */
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+   struct task_struct *p = current;
+   int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+   const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+   /*
+* The caller may dip into page reserves a bit more if the caller
+* cannot run direct reclaim, or if the caller has realtime scheduling
+* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+*/
+   if (gfp_mask & __GFP_HIGH)
+   alloc_flags |= ALLOC_HIGH;
+
+   if (!wait) {
+   alloc_flags |= ALLOC_HARDER;
+   /*
+* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+*/
+   alloc_flags &= ~ALLOC_CPUSET;
+   } else if (unlikely(rt_task(p)) && !in_interrupt())
+   alloc_flags |= ALLOC_HARDER;
+
+   if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+   if (!in_interrupt() &&
+   ((p->flags & PF_MEMALLOC) ||
+unlikely(test_thread_flag(TIF_MEMDIE
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   }
+
+   return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
@@ -1577,48 +1607,28 @@ restart:
 * OK, we're below the kswapd watermark and have kicked background
 * reclaim. Now things get more complex, so set up alloc_flags according
 * to how we want to proceed.
-*
-* The caller may dip into page reserves a bit more if the caller
-* cannot run direct reclaim, or if the caller has realtime scheduling
-* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 */
-   alloc_flags = ALLOC_WMARK_MIN;
-   if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-   alloc_flags |= ALLOC_HARDER;
-   if (gfp_mask & __GFP_HIGH)
-   alloc_flags |= ALLOC_HIGH;
-   if (wait)
-   alloc_flags |= ALLOC_CPUSET;
+   alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-   /*
-* Go through the zonelist again. Let __GFP_HIGH and allocations
-* coming from realtime tasks go deeper into reserves.
-*
-* This is the last chance, in general, before the goto nopage.
-* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-*/
-   page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+   /* This is the last chance, in general, before the goto nopage. */
+   page = get_page_from_freelist(gfp_mask, order, zonelist,
+   

[PATCH 14/28] net: sk_allocation() - concentrate socket related allocations

2008-02-20 Thread Peter Zijlstra
Introduce sk_allocation(); this function makes it possible to inject
socket-specific flags into every socket-related allocation.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h|5 +
 net/ipv4/tcp.c|3 ++-
 net/ipv4/tcp_output.c |   12 +++-
 net/ipv6/tcp_ipv6.c   |   14 +-
 4 files changed, 23 insertions(+), 11 deletions(-)

Index: linux-2.6/net/ipv4/tcp_output.c
===
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2078,7 +2078,8 @@ void tcp_send_fin(struct sock *sk)
} else {
/* Socket is locked, keep trying until memory is available. */
for (;;) {
-   skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+   skb = alloc_skb_fclone(MAX_TCP_HEADER,
+  sk->sk_allocation);
if (skb)
break;
yield();
@@ -2104,7 +2105,7 @@ void tcp_send_active_reset(struct sock *
struct sk_buff *skb;
 
/* NOTE: No TCP options attached and we never retransmit this. */
-   skb = alloc_skb(MAX_TCP_HEADER, priority);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
return;
@@ -2171,7 +2172,8 @@ struct sk_buff *tcp_make_synack(struct s
__u8 *md5_hash_location;
 #endif
 
-   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+   sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
 
@@ -2425,7 +2427,7 @@ void tcp_send_ack(struct sock *sk)
 * tcp_transmit_skb() will set the ownership to this
 * sock.
 */
-   buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2460,7 +2462,7 @@ static int tcp_xmit_probe_skb(struct soc
struct sk_buff *skb;
 
/* We don't queue it, tcp_transmit_skb() sets ownership. */
-   skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -427,6 +427,11 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+   return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
sk->sk_ack_backlog--;
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -568,7 +568,8 @@ static int tcp_v6_md5_do_add(struct sock
} else {
/* reallocate new list if current one is full. */
if (!tp->md5sig_info) {
-   tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+   tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+   sk_allocation(sk, GFP_ATOMIC));
if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -581,7 +582,8 @@ static int tcp_v6_md5_do_add(struct sock
}
if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-  (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+  (tp->md5sig_info->entries6 + 1)),
+  sk_allocation(sk, GFP_ATOMIC));
 
if (!keys) {
tcp_free_md5sig_pool();
@@ -705,7 +707,7 @@ static int tcp_v6_parse_md5_keys (struct
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
 
-   p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+   p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
if (!p)
return -ENOMEM;
 
@@ -1006,7 +1008,7 @@ static void tcp_v6_send_reset(struct soc
 */
 
buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-GFP_ATOMIC);
+sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL)
return;
 
@@ -1085,10 +1087,12

[PATCH 19/28] netvm: prevent a stream specific deadlock

2008-02-20 Thread Peter Zijlstra
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC sockets
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h   |7 ---
 net/core/sock.c  |2 +-
 net/ipv4/tcp_input.c |8 
 net/sctp/ulpevent.c  |2 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -791,12 +791,13 @@ static inline int sk_wmem_schedule(struc
__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
-static inline int sk_rmem_schedule(struct sock *sk, int size)
+static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
if (!sk_has_account(sk))
return 1;
-   return size <= sk->sk_forward_alloc ||
-   __sk_mem_schedule(sk, size, SK_MEM_RECV);
+   return skb->truesize <= sk->sk_forward_alloc ||
+   __sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
+   skb_emergency(skb);
 }
 
 static inline void sk_mem_reclaim(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -388,7 +388,7 @@ int sock_queue_rcv_skb(struct sock *sk, 
if (err)
goto out;
 
-   if (!sk_rmem_schedule(sk, skb->truesize)) {
+   if (!sk_rmem_schedule(sk, skb)) {
err = -ENOBUFS;
goto out;
}
Index: linux-2.6/net/ipv4/tcp_input.c
===
--- linux-2.6.orig/net/ipv4/tcp_input.c
+++ linux-2.6/net/ipv4/tcp_input.c
@@ -3858,9 +3858,9 @@ static void tcp_data_queue(struct sock *
 queue_and_out:
if (eaten < 0 &&
(atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-!sk_rmem_schedule(sk, skb->truesize))) {
+!sk_rmem_schedule(sk, skb))) {
if (tcp_prune_queue(sk) < 0 ||
-   !sk_rmem_schedule(sk, skb->truesize))
+   !sk_rmem_schedule(sk, skb))
goto drop;
}
skb_set_owner_r(skb, sk);
@@ -3932,9 +3932,9 @@ drop:
TCP_ECN_check_ce(tp, skb);
 
if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-   !sk_rmem_schedule(sk, skb->truesize)) {
+   !sk_rmem_schedule(sk, skb)) {
if (tcp_prune_queue(sk) < 0 ||
-   !sk_rmem_schedule(sk, skb->truesize))
+   !sk_rmem_schedule(sk, skb))
goto drop;
}
 
Index: linux-2.6/net/sctp/ulpevent.c
===
--- linux-2.6.orig/net/sctp/ulpevent.c
+++ linux-2.6/net/sctp/ulpevent.c
@@ -701,7 +701,7 @@ struct sctp_ulpevent *sctp_ulpevent_make
if (rx_count >= asoc->base.sk->sk_rcvbuf) {
 
if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
-   (!sk_rmem_schedule(asoc->base.sk, chunk->skb->truesize)))
+   (!sk_rmem_schedule(asoc->base.sk, chunk->skb)))
goto fail;
}
 

--

--


[PATCH 21/28] netvm: skb processing

2008-02-20 Thread Peter Zijlstra
In order to make sure emergency packets receive all the memory needed to
proceed, ensure processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |5 
 net/core/dev.c |   59 +++--
 net/core/sock.c|   18 
 3 files changed, 76 insertions(+), 6 deletions(-)

Index: linux-2.6/net/core/dev.c
===
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2004,6 +2004,30 @@ out:
 }
 #endif
 
+/*
+ * Filter the protocols for which the reserves are adequate.
+ *
+ * Before adding a protocol make sure that it is either covered by the existing
+ * reserves, or add reserves covering the memory need of the new protocol's
+ * packet processing.
+ */
+static int skb_emergency_protocol(struct sk_buff *skb)
+{
+   if (skb_emergency(skb))
+   switch (skb->protocol) {
+   case __constant_htons(ETH_P_ARP):
+   case __constant_htons(ETH_P_IP):
+   case __constant_htons(ETH_P_IPV6):
+   case __constant_htons(ETH_P_8021Q):
+   break;
+
+   default:
+   return 0;
+   }
+
+   return 1;
+}
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
@@ -2025,10 +2049,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+   unsigned long pflags = current->flags;
+
+   /* Emergency skb are special, they should
+*  - be delivered to SOCK_MEMALLOC sockets only
+*  - stay away from userspace
+*  - have bounded memory usage
+*
+* Use PF_MEMALLOC as a poor man's memory pool - the grouping kind.
+* This saves us from propagating the allocation context down to all
+* allocation sites.
+*/
+   if (skb_emergency(skb))
+   current->flags |= PF_MEMALLOC;
 
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
-   return NET_RX_DROP;
+   goto out;
 
if (!skb->tstamp.tv64)
net_timestamp(skb);
@@ -2039,7 +2076,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
 
if (!orig_dev)
-   return NET_RX_DROP;
+   goto out;
 
__get_cpu_var(netdev_rx_stat).total++;
 
@@ -2058,6 +2095,9 @@ int netif_receive_skb(struct sk_buff *sk
}
 #endif
 
+   if (skb_emergency(skb))
+   goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -2066,19 +2106,23 @@ int netif_receive_skb(struct sk_buff *sk
}
}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 ncls:
 #endif
 
+   if (!skb_emergency_protocol(skb))
+   goto drop;
+
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 
type = skb->protocol;
list_for_each_entry_rcu(ptype,
@@ -2094,6 +2138,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
 * me how you were going to use this. :-)
@@ -2101,8 +2146,10 @@ ncls:
ret = NET_RX_DROP;
}
 
-out:
+unlock:
rcu_read_unlock();
+out:
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
 }
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -512,8 +512,13 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+   if (skb_emergency(skb))
+   return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
 }
 
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
 }
 EXPORT_SYMBOL_GP

Re: TG3 network data corruption regression 2.6.24/2.6.23.4

2008-02-20 Thread Tony Battersby
Herbert Xu wrote:
> On Tue, Feb 19, 2008 at 05:14:26PM -0500, Tony Battersby wrote:
>   
>> Update: when I revert Herbert's patch in addition to applying your
>> patch, the iSCSI performance goes back up to 115 MB/s again in both
>> directions.  So it looks like turning off SG for TX didn't itself cause
>> the performance drop, but rather that the performance drop is just
>> another manifestation of whatever bug is causing the data corruption.
>> 
>
> Interesting.  So the workload that regressed is mostly RX with a
> little TX traffic? Can you try to reproduce this with something
> like netperf to eliminate other variables?
>
> This is all very puzzling since the patch in question shouldn't
> change an RX load at all.
>
> Thanks,
>   
We have established that the slowdown was caused by TCP checksum errors
and retransmits.  I assume that the slowdown in my test was due to the
light TX rather than the heavy RX.  I am no TCP protocol expert, but
perhaps heavy TX (such as iperf) might not be affected as much because
the wire stays busy while waiting for the retransmit, whereas with my
light TX iSCSI load, the wire goes idle while waiting for the retransmit
because the iSCSI state machine is stalled.

Tony

--


[PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs

2008-02-20 Thread Peter Zijlstra
To avoid memory getting stuck waiting for userspace, drop all emergency packets.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 net/netfilter/core.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
ret = 1;
goto unlock;
} else if (verdict == NF_DROP) {
+drop:
kfree_skb(skb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+   if (skb_emergency(*pskb))
+   goto drop;
if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
  verdict >> NF_VERDICT_BITS))
goto next_hook;

--



[PATCH 16/28] netvm: INET reserves.

2008-02-20 Thread Peter Zijlstra
Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under generic RX reserve, its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can receive at least as much data as fits in
the reassembly queues, we avoid fragment-attack deadlocks.

Use the proc conv() routines to update these limits and return -ENOMEM to
user space on failure.

Adds to the reserve tree:

  total network reserve  
network TX reserve   
  protocol TX pages  
network RX reserve   
+ IPv6 route cache   
+ IPv4 route cache   
  SKB data reserve   
+   IPv6 fragment cache  
+   IPv4 fragment cache  

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 net/ipv4/ip_fragment.c |   65 +++--
 net/ipv4/route.c   |   65 +++--
 net/ipv6/reassembly.c  |   65 +++--
 net/ipv6/route.c   |   65 +++--
 4 files changed, 252 insertions(+), 8 deletions(-)

Index: linux-2.6/net/ipv4/ip_fragment.c
===
--- linux-2.6.orig/net/ipv4/ip_fragment.c
+++ linux-2.6/net/ipv4/ip_fragment.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -591,17 +592,72 @@ int ip_defrag(struct sk_buff *skb, u32 u
return -ENOMEM;
 }
 
+static struct mem_reserve ipv4_frag_reserve;
+
 #ifdef CONFIG_SYSCTL
+static int ipv4_frag_bytes;
+
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+   struct file *filp, void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int old_bytes, ret;
+
+   if (!write)
+   ipv4_frag_bytes = init_net.ipv4.frags.high_thresh;
+   old_bytes = ipv4_frag_bytes;
+
+   ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+   if (!ret && write) {
+   ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+ ipv4_frag_bytes);
+   if (!ret)
+   init_net.ipv4.frags.high_thresh = ipv4_frag_bytes;
+   else
+   ipv4_frag_bytes = old_bytes;
+   }
+
+   return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+   int __user *name, int nlen,
+   void __user *oldval, size_t __user *oldlenp,
+   void __user *newval, size_t newlen)
+{
+   int old_bytes, ret;
+   int write = (newval && newlen);
+
+   if (!write)
+   ipv4_frag_bytes = init_net.ipv4.frags.high_thresh;
+   old_bytes = ipv4_frag_bytes;
+
+   ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+   if (!ret && write) {
+   ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+ ipv4_frag_bytes);
+   if (!ret)
+   init_net.ipv4.frags.high_thresh = ipv4_frag_bytes;
+   else
+   ipv4_frag_bytes = old_bytes;
+   }
+
+   return ret;
+}
+
 static int zero;
 
 static struct ctl_table ip4_frags_ctl_table[] = {
{
.ctl_name   = NET_IPV4_IPFRAG_HIGH_THRESH,
.procname   = "ipfrag_high_thresh",
-   .data   = &init_net.ipv4.frags.high_thresh,
+   .data   = &ipv4_frag_bytes,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = &proc_dointvec
+   .proc_handler   = &proc_dointvec_fragment,
+   .strategy   = &sysctl_intvec_fragment,
},
{
.ctl_name   = NET_IPV4_IPFRAG_LOW_THRESH,
@@ -736,6 +792,11 @@ void __init ipfrag_init(void)
ip4_frags.frag_expire = ip_expire;
ip4_frags.secret_interval = 10 * 60 * HZ;
inet_frags_init(&ip4_frags);
+
+   mem_reserve_init(&ipv4_frag_reserve, "IPv4 fragment cache",
+&net_skb_reserve);
+   mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+   init_net.ipv4.frags.high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6/net/ipv6/reassembly.c
===
--- linux-2.6.orig/net/ipv6/reassembly.c
+++ linux-2.6/net/ipv6/reassembly.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -628,15 +629,70 @@ static struct inet6_protocol frag_protoc
.flags  =   INET6_PROTO_NOPOLICY,
 };
 
+static struct mem_reserve ipv6_frag_reserve;

[PATCH 26/28] nfs: disable data cache revalidation for swapfiles

2008-02-20 Thread Peter Zijlstra
Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
we augment the nfs_page_find_request() logic.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/inode.c |6 
 fs/nfs/write.c |   73 ++---
 2 files changed, 65 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -763,6 +763,12 @@ int nfs_revalidate_mapping_nolock(struct
struct nfs_inode *nfsi = NFS_I(inode);
int ret = 0;
 
+   /*
+* swapfiles are not supposed to be shared.
+*/
+   if (IS_SWAPFILE(inode))
+   goto out;
+
if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
struct nfs_page *req = NULL;
 
-   if (PagePrivate(page)) {
+   if (PagePrivate(page))
req = (struct nfs_page *)page_private(page);
-   if (req != NULL)
-   kref_get(&req->wb_kref);
-   }
+   else if (unlikely(PageSwapCache(page)))
+   req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+   if (get && req)
+   kref_get(&req->wb_kref);
+
return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+   return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+   struct inode *inode = page_file_mapping(page)->host;
+   struct nfs_page *req = NULL;
+
+   spin_lock(&inode->i_lock);
+   req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+   spin_unlock(&inode->i_lock);
+
+   /*
+* hole here plugged by the caller holding onto PG_locked
+*/
+
+   return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+   if (PagePrivate(page))
+   return 1;
+
+   if (unlikely(PageSwapCache(page)))
+   return __nfs_page_has_request(page);
+
+   return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
 
spin_lock(&inode->i_lock);
-   req = nfs_page_find_request_locked(page);
+   req = nfs_page_find_request_locked(NFS_I(inode), page);
spin_unlock(&inode->i_lock);
return req;
 }
@@ -252,7 +289,7 @@ static int nfs_page_async_flush(struct n
 
spin_lock(&inode->i_lock);
for(;;) {
-   req = nfs_page_find_request_locked(page);
+   req = nfs_page_find_request_locked(NFS_I(inode), page);
if (req == NULL) {
spin_unlock(&inode->i_lock);
return 0;
@@ -367,8 +404,14 @@ static void nfs_inode_add_request(struct
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
-   SetPagePrivate(req->wb_page);
-   set_page_private(req->wb_page, (unsigned long)req);
+   /*
+* Swap-space should not get truncated. Hence no need to plug the race
+* with invalidate/truncate.
+*/
+   if (likely(!PageSwapCache(req->wb_page))) {
+   SetPagePrivate(req->wb_page);
+   set_page_private(req->wb_page, (unsigned long)req);
+   }
nfsi->npages++;
kref_get(&req->wb_kref);
radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
@@ -386,8 +429,10 @@ static void nfs_inode_remove_request(str
BUG_ON (!NFS_WBACK_BUSY(req));
 
spin_lock(&inode->i_lock);
-   set_page_private(req->wb_page, 0);
-   ClearPagePrivate(req->wb_page);
+   if (likely(!PageSwapCache(req->wb_page))) {
+   set_page_private(req->wb_page, 0);
+   ClearPagePrivate(req->wb_page);

[PATCH 18/28] netvm: filter emergency skbs.

2008-02-20 Thread Peter Zijlstra
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -998,6 +998,9 @@ static inline int sk_filter(struct sock 
 {
int err;
struct sk_filter *filter;
+
+   if (skb_emergency(skb) && !sk_has_memalloc(sk))
+   return -ENOMEM;

err = security_sock_rcv_skb(sk, skb);
if (err)

--



[PATCH 09/28] mm: __GFP_MEMALLOC

2008-02-20 Thread Peter Zijlstra
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/gfp.h |3 ++-
 mm/page_alloc.c |4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT   ((__force gfp_t)0x400u) /* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL   ((__force gfp_t)0x800u) /* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY  ((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-   __GFP_NORETRY|__GFP_NOMEMALLOC)
+   __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1474,7 +1474,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-   if (!in_irq() && (p->flags & PF_MEMALLOC))
+   if (gfp_mask & __GFP_MEMALLOC)
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   else if (!in_irq() && (p->flags & PF_MEMALLOC))
alloc_flags |= ALLOC_NO_WATERMARKS;
else if (!in_interrupt() &&
unlikely(test_thread_flag(TIF_MEMDIE)))

--



[PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS.

2008-02-20 Thread Peter Zijlstra
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/pagelist.c |2 +-
 fs/nfs/write.c|6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
if (pagecount <= ARRAY_SIZE(p->page_array))
p->pagevec = p->page_array;
else {
-   p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+   p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
if (!p->pagevec) {
kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
struct nfs_page *p;
-   p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+   p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
if (p) {
memset(p, 0, sizeof(*p));
INIT_LIST_HEAD(&p->wb_list);

--



[PATCH 22/28] mm: add support for non block device backed swap files

2008-02-20 Thread Peter Zijlstra
New address_space_operations methods are added:
  int swapfile(struct address_space *, int);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When, during sys_swapon(), the swapfile() method is found and returns no error,
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops and
make use of swap_{out,in}() to write/read swapcache pages.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like 
reserving memory for mempools or the like).

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) to provide a struct file * for credential context (normally not needed
in the context of writepage, as the page content is normally dirtied
using either of the following interfaces:
  write_{begin,end}()
  {prepare,commit}_write()
  page_mkwrite()
which do have the file context).

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 Documentation/filesystems/Locking |   19 +
 Documentation/filesystems/vfs.txt |   17 
 include/linux/buffer_head.h   |2 -
 include/linux/fs.h|8 +
 include/linux/swap.h  |4 ++
 mm/page_io.c  |   52 ++
 mm/swap_state.c   |4 +-
 mm/swapfile.c |   26 ++-
 8 files changed, 128 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -120,6 +120,7 @@ enum {
SWP_USED= (1 << 0), /* is slot in swap_info[] used? */
SWP_WRITEOK = (1 << 1), /* ok to write to this swap?*/
SWP_ACTIVE  = (SWP_USED | SWP_WRITEOK),
+   SWP_FILE= (1 << 2), /* file swap area */
/* add others here before... */
SWP_SCANNING= (1 << 8), /* refcount in scan_swap_map */
 };
@@ -217,6 +218,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
@@ -250,6 +253,7 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
+extern struct swap_info_struct *page_swap_info(struct page *);
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 struct backing_dev_info;
Index: linux-2.6/mm/page_io.c
===
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -97,11 +98,21 @@ int swap_writepage(struct page *page, st
 {
struct bio *bio;
int ret = 0, rw = WRITE;
+   struct swap_info_struct *sis = page_swap_info(page);
 
if (remove_exclusive_swap_page(page)) {
unlock_page(page);
goto out;
}
+
+   if (sis->flags & SWP_FILE) {
+   ret = sis->swap_file->f_mapping->
+   a_ops->swap_out(sis->swap_file, page, wbc);
+   if (!ret)
+   count_vm_event(PSWPOUT);
+   return ret;
+   }
+
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,13 +131,54 @@ out:
return ret;
 }
 
+void swap_sync_page(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+   if (sis->flags & SWP_FILE) {
+   const struct address_space_operations *a_ops =
+   sis->swap_file->f_mapping->a_ops;
+   if (a_ops->sync_page)
+   a_ops->sync_page(page);
+   } else
+   block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
if (sis->flags & SWP_FILE) {

Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Joe Perches
On Wed, 2008-02-20 at 17:02 +0300, Pavel Emelyanov wrote:
> There are three places that declare the char buf[...] on the stack
> to push it later into dprintk(). Since the dprintk sometimes (if the 
> CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
> cause gcc to produce appropriate warnings.

What about the uses in fs?

fs/lockd/svc.c: char buf[RPC_MAX_ADDRBUFLEN];
fs/lockd/svc4proc.c:char buf[RPC_MAX_ADDRBUFLEN];
fs/lockd/svcproc.c: char buf[RPC_MAX_ADDRBUFLEN];
fs/nfs/callback.c:  char buf[RPC_MAX_ADDRBUFLEN];
fs/nfsd/nfsfh.c:char buf[RPC_MAX_ADDRBUFLEN];
fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];

Perhaps there should be a DECLARE_RPC_BUF(buf) macro?

#define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused




Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Joe Perches
On Wed, 2008-02-20 at 07:29 -0800, Joe Perches wrote:
> fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
> Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
> #define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused

Make that:

#define DECLARE_RPC_BUF(var) char var[RPC_MAX_ADDRBUFLEN] __maybe_unused




Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Patrick McHardy

Joe Perches wrote:
> On Wed, 2008-02-20 at 07:29 -0800, Joe Perches wrote:
>> fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
>> Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
>> #define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused
>
> Make that:
>
> #define DECLARE_RPC_BUF(var) char var[RPC_MAX_ADDRBUFLEN] __maybe_unused

Alternatively, change the dprintk macro to behave similarly to
pr_debug() and mark things like svc_print_addr() __pure, which
has the advantage that it still performs format checking even
if debugging is disabled.



Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Pavel Emelyanov
Joe Perches wrote:
> On Wed, 2008-02-20 at 17:02 +0300, Pavel Emelyanov wrote:
> There are three places that declare the char buf[...] on the stack
>> to push it later into dprintk(). Since the dprintk sometimes (if the 
>> CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
>> cause gcc to produce appropriate warnings.
> 
> What about the uses in fs?
> 
> fs/lockd/svc.c: char buf[RPC_MAX_ADDRBUFLEN];
> fs/lockd/svc4proc.c:char buf[RPC_MAX_ADDRBUFLEN];
> fs/lockd/svcproc.c: char buf[RPC_MAX_ADDRBUFLEN];
> fs/nfs/callback.c:  char buf[RPC_MAX_ADDRBUFLEN];
> fs/nfsd/nfsfh.c:char buf[RPC_MAX_ADDRBUFLEN];
> fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
> 
> Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
> 
> #define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused

Sigh... Why is that better than a strait declaration with attribute?

> 
> 



Re: [PATCH 00/04] smc91x: request bus width using platform data

2008-02-20 Thread Nicolas Pitre
On Wed, 20 Feb 2008, Magnus Damm wrote:

> These patches make it possible to request bus width in the platform data.
> 
> Instead of continually updating smc91x.h with board-specific configuration,
> use platform data to pass along bus width and irq flags to the driver.
> This change is designed to be backwards-compatible, so all boards configured
> in the header file should just work as usual.

Thank you for doing this work.  I really meant to do it, and commit 
09779c6df2dbe95483269d194b327d41fe2cc57e was the first step towards that 
goal, but as you can see we're nearly two years later and I didn't do 
it.

I have a few comments though.  I will reply to those messages 
separately.


Nicolas


[PATCH 05/28] mm: allow PF_MEMALLOC from softirq context

2008-02-20 Thread Peter Zijlstra
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save and restore current->flags; ksoftirqd will have its
own task_struct.

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/sched.h |4 
 kernel/softirq.c  |3 +++
 mm/page_alloc.c   |7 ---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1471,9 +1471,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-   if (!in_interrupt() &&
-   ((p->flags & PF_MEMALLOC) ||
-unlikely(test_thread_flag(TIF_MEMDIE))))
+   if (!in_irq() && (p->flags & PF_MEMALLOC))
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   else if (!in_interrupt() &&
+   unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
 
Index: linux-2.6/kernel/softirq.c
===
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -213,6 +213,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+   unsigned long pflags = current->flags;
+   current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
account_system_vtime(current);
@@ -251,6 +253,7 @@ restart:
 
account_system_vtime(current);
_local_bh_enable();
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1497,6 +1497,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+   do {(p)->flags &= ~(mask); \
+   (p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else

--



[PATCH 24/28] nfs: remove mempools

2008-02-20 Thread Peter Zijlstra
With the introduction of shared dirty-page accounting in 2.6.19, NFS should
not be able to surprise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence there is no more need for mempools.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/read.c  |   15 +++
 fs/nfs/write.c |   27 +--
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ  (32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-   struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+   struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
p = NULL;
}
}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -595,16 +592,10 @@ int __init nfs_init_readpagecache(void)
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
 
-   nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-nfs_rdata_cachep);
-   if (nfs_rdata_mempool == NULL)
-   return -ENOMEM;
-
return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-   mempool_destroy(nfs_rdata_mempool);
kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
 
 #define NFSDBG_FACILITYNFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE (32)
-#define MIN_POOL_COMMIT(4)
-
 /*
  * Local function declarations
  */
@@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_commit_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
}
}
@@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1514,16 +1509,6 @@ int __init nfs_init_writepagecache(void)
if (nfs_wdata_cachep == NULL)
return -ENOMEM;

[PATCH 06/28] mm: serialize access to min_free_kbytes

2008-02-20 Thread Peter Zijlstra
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page_alloc.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4087,12 +4088,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -4147,6 +4148,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&min_free_lock, flags);
+   __setup_per_zone_pages_min();
+   spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4182,7 +4192,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
-   setup_per_zone_pages_min();
+   __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
 }

--



Re: [PATCH 01/04] smc91x: pass along private data

2008-02-20 Thread Nicolas Pitre
On Wed, 20 Feb 2008, Magnus Damm wrote:

> Pass a private data pointer to macros and functions. This makes it easy
> to make run-time decisions later on. This patch does not change any logic.
> These changes should be optimized away during compilation.
> 
> Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
> ---
> --- 0001/drivers/net/smc91x.c
> +++ work/drivers/net/smc91x.c 2008-02-20 16:52:48.0 +0900
> @@ -220,23 +220,23 @@ static void PRINT_PKT(u_char *buf, int l
>  
>  
>  /* this enables an interrupt in the interrupt mask register */
> -#define SMC_ENABLE_INT(x) do { \
> +#define SMC_ENABLE_INT(priv, x) do { \
>   unsigned char mask; \
> - spin_lock_irq(&lp->lock);   \
> - mask = SMC_GET_INT_MASK();  \
> + spin_lock_irq(&priv->lock); \
> + mask = SMC_GET_INT_MASK(priv);  \

Since "lp" is already used all over the place, could you simply use "lp" 
for the macro argument name as well instead of "priv"?  This will make 
the code more uniform and reduce the patch size.


Nicolas


Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Joe Perches
> Sigh... Why is that better than a strait declaration with attribute?

If at some point there's a gcc'ism to remove a maybe_unused
variable from the stack declaration, you only have to change
the macro.

cheers, Joe




Re: [PATCH 02/04] smc91x: introduce platform data flags

2008-02-20 Thread Nicolas Pitre
On Wed, 20 Feb 2008, Magnus Damm wrote:

> This patch introduces struct smc91x_platdata and modifies the driver so
> bus width is checked during run time using SMC_nBIT() instead of
> SMC_CAN_USE_nBIT.
> 
> Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>
> ---

NAK.

The SMC91C111 (for example) is often used on devices which have a CPU 
clock barely higher than the network throughput, hence it is crutial for 
those devices to have the most efficient access possible to the chip or 
performance will suffer.  This is the main reason behind the heavily 
macroized register access so things are always optimized for the data 
bus capabilities at compile time.

This patch introduces a runtime test on lp->cfg.flags for every 
register access even on those platforms not using the platform data 
based bus configuration at all.

I think you should add a SMC_DYNAMIC_BUS_CONFIG and redefine SMC_nBITS() 
so they dereference cfg.flags only when it is defined.


Nicolas


Re: [PATCH 04/04] smc91x: add insw/outsw to default config

2008-02-20 Thread Nicolas Pitre
On Wed, 20 Feb 2008, Magnus Damm wrote:

> This patch makes sure SMC_insw()/SMC_outsw() are defined for the
> default configuration. Without this change BUG()s will be triggered
> when using 16-bit only platform data and the default configuration.
> 
> Signed-off-by: Magnus Damm <[EMAIL PROTECTED]>

You should have introduced this patch as 3/4 instead of 4/4, so as to make 
sure the series won't create a non-functional kernel between 3/4 and 
4/4.


Nicolas


Re: [PATCH][PPPOL2TP]: Fix SMP oops in pppol2tp driver

2008-02-20 Thread James Chapman

Jarek Poplawski wrote:

On Mon, Feb 18, 2008 at 10:09:24PM +, James Chapman wrote:
...

Unfortunately the ISP's syslog stops. But I've been able to borrow
two Quad Xeon boxes and have reproduced the problem.

Here's a new version of the patch. The patch avoids disabling irqs
and fixes the sk_dst_get() usage that DaveM mentioned. But even with
this patch, lockdep still complains if hundreds of ppp sessions are
inserted into a tunnel as rapidly as possible (lockdep trace is below).
I can stop these errors by wrapping the call to ppp_input() in
pppol2tp_recv_dequeue_skb() with local_irq_save/restore. What is a
better fix?


Here is my proposal: it's intended for testing and to check one of the
possible solutions. IMHO your lockdep reports show there is no point in
changing anything around sk_dst_lock: fixing the problem there would
require a global change of that lock. So the fix should be done around
the pch->upl lock, and that means changing ppp_generic.


Hmm, I need to study the lockdep report again. It seems I'm misreading 
the lockdep output. :(



In the patch below I've used trylock in places which seem to allow
skipping some work (only while the config is being changed) or which
simply don't need this lock because there is no ppp struct. This could
be modified to add a waiting loop if necessary. Another option is to
change the write side of this lock: it looks more vulnerable if
something is missed, because more locks are involved, but it should
probably be enough to solve this problem too.

I think pppol2tp needs to be checked first with only the hlist_lock bh
patch, unless there were some lockdep reports on these other locks
too. (BTW, I added the ppp maintainer to CC - I hope we get Paul's
opinion on this.)


I tried your ppp_generic patch with only the hlist_lock bh patch in 
pppol2tp and it seems to fix the ppp create/delete issue. However, when 
I added much more traffic into the test (flood pings over ppp interfaces 
while repeatedly creating/deleting the L2TP (PPP) sessions) I get a soft 
lockup detected in pppol2tp_xmit() after anything between 1 minute and 
an hour. :( I'm investigating that now.


Thanks for your help!


(testing patch #1)
---

 drivers/net/ppp_generic.c |   33 +++--
 1 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 4dc5b4b..5cbc534 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1473,7 +1473,7 @@ void
 ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 {
struct channel *pch = chan->ppp;
-   int proto;
+   int proto, locked;
 
	if (!pch || skb->len == 0) {
		kfree_skb(skb);
@@ -1481,8 +1481,13 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
}
 
 	proto = PPP_PROTO(skb);

-   read_lock_bh(&pch->upl);
-   if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
+   /*
+* We use trylock to avoid dependency between soft-irq-safe upl lock
+* and soft-irq-unsafe sk_dst_lock.
+*/
+   local_bh_disable();
+   locked = read_trylock(&pch->upl);
+   if (!locked || !pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
/* put it on the channel queue */
skb_queue_tail(&pch->file.rq, skb);
/* drop old frames if queue too long */
@@ -1493,7 +1498,10 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
} else {
ppp_do_recv(pch->ppp, skb, pch);
}
-   read_unlock_bh(&pch->upl);
+
+   if (locked)
+   read_unlock(&pch->upl);
+   local_bh_enable();
 }
 
 /* Put a 0-length skb in the receive queue as an error indication */

@@ -1506,16 +1514,18 @@ ppp_input_error(struct ppp_channel *chan, int code)
if (!pch)
return;
 
-	read_lock_bh(&pch->upl);

-   if (pch->ppp) {
+   /* see the trylock comment in ppp_input() */
+   local_bh_disable();
+   if (read_trylock(&pch->upl) && pch->ppp) {
skb = alloc_skb(0, GFP_ATOMIC);
if (skb) {
skb->len = 0;/* probably unnecessary */
skb->cb[0] = code;
ppp_do_recv(pch->ppp, skb, pch);
}
+   read_unlock(&pch->upl);
}
-   read_unlock_bh(&pch->upl);
+   local_bh_enable();
 }
 
 /*

@@ -2044,10 +2054,13 @@ int ppp_unit_number(struct ppp_channel *chan)
int unit = -1;
 
 	if (pch) {

-   read_lock_bh(&pch->upl);
-   if (pch->ppp)
+   /* see the trylock comment in ppp_input() */
+   local_bh_disable();
+   if (read_trylock(&pch->upl) && pch->ppp) {
unit = pch->ppp->file.index;
-   read_unlock_bh(&pch->upl);
+   read_unlock(&pch->upl);
+   }
+   local_bh_enable();
}
return unit;
 }
--



--

Re: TG3 network data corruption regression 2.6.24/2.6.23.4

2008-02-20 Thread Tony Battersby
Matt Carlson wrote:
> Hi Tony.  Can you give us the output of :
>
> sudo lspci -vvv -xxx -s 03:01.0
>   
03:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit 
Ethernet (rev 15)
Subsystem: Compaq Computer Corporation NC7770 Gigabit Server Adapter 
(PCI-X, 10/100/1000-T)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
Enable-
Address: 063000119b608000  Data: 0423
00: e4 14 45 16 06 00 b0 02 15 00 00 02 10 40 00 00
10: 04 00 7f df 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 11 0e 7c 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 40 00
40: 07 48 00 00 09 03 03 00 01 50 02 c0 00 20 00 64
50: 03 58 00 00 08 10 21 08 05 00 86 00 00 80 60 9b
60: 11 00 30 06 23 04 00 00 98 02 05 01 0f 00 db 76
70: 8a 10 00 00 c7 00 00 80 50 00 00 00 00 00 00 00
80: 03 58 00 00 00 00 00 00 34 80 13 04 82 10 00 00
90: 09 06 00 01 00 00 00 00 00 00 00 00 c6 01 00 00
a0: 00 00 00 00 fe 02 00 00 00 00 00 00 af 01 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00



> Also, after some digging, I found that the 5701 can run into trouble if
> a 64-bit DMA read terminates early and then completes as a 32-bit transfer.
> The problem is reportedly very rare, but the failure mode looks like a
> match.  Can you apply the following patch and see if it helps your
> performance / corruption problems?
>
>
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index db606b6..7ad08ce 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -11409,6 +11409,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
>   tp->tg3_flags |= TG3_FLAG_PCI_HIGH_SPEED;
>   if ((pci_state_reg & PCISTATE_BUS_32BIT) != 0)
>   tp->tg3_flags |= TG3_FLAG_PCI_32BIT;
> + else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701)
> + tp->grc_mode |= GRC_MODE_FORCE_PCI32BIT;
>  
>   /* Chip-specific fixup from Broadcom driver */
>   if ((tp->pci_chip_rev_id == CHIPREV_ID_5704_A0) &&
>
>   
Sorry, this didn't help.  I still get data corruption with hardware
checksumming or poor performance with software checksumming.

Tony



Re: [RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Jan Engelhardt

On Feb 20 2008 15:47, Ilpo Järvinen wrote:
>
>-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb
>
>-static inline struct sk_buff *dev_alloc_skb(unsigned int length)
>-{
>-  return __dev_alloc_skb(length, GFP_ATOMIC);
>-}
>+extern struct sk_buff *dev_alloc_skb(unsigned int length);

Striking. How can this even happen? A callsite which calls

dev_alloc_skb(n)

is just equivalent to

__dev_alloc_skb(n, GFP_ATOMIC);

which means there's like 4 (or 8 if it's long) bytes more on the
stack. For a worst case, count in another 8 bytes for push and pop or mov on
the stack. But that still does not add up to 23 kb.


Re: [RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Patrick McHardy

Jan Engelhardt wrote:

On Feb 20 2008 15:47, Ilpo Järvinen wrote:

-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb

-static inline struct sk_buff *dev_alloc_skb(unsigned int length)
-{
-   return __dev_alloc_skb(length, GFP_ATOMIC);
-}
+extern struct sk_buff *dev_alloc_skb(unsigned int length);


Striking. How can this even happen? A callsite which calls

dev_alloc_skb(n)

is just equivalent to

__dev_alloc_skb(n, GFP_ATOMIC);

which means there's like 4 (or 8 if it's long) bytes more on the
stack. For a worst case, count in another 8 bytes for push and pop or mov on
the stack. But that still does not add up to 23 kb.



__dev_alloc_skb() is also an inline function which performs
some extra work. Which raises the question - if dev_alloc_skb()
is uninlined, shouldn't __dev_alloc_skb() be uninlined as well?


Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Pavel Emelyanov
Patrick McHardy wrote:
> Joe Perches wrote:
>> On Wed, 2008-02-20 at 07:29 -0800, Joe Perches wrote:
>>   
>>> fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
>>> Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
>>> #define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused
>>> 
>> Make that:
>>
>> #define DECLARE_RPC_BUF(var) char var[RPC_MAX_ADDRBUFLEN] __maybe_unused

OK, I'll send the patch in a moment.

> Alternatively change the dprintk macro to behave similar like

This is too heavy. The problem is that some arguments passed to this
function exist only under appropriate ifdefs, so having a static
inline there will produce a warning :(

> pr_debug() and mark things like svc_print_addr() __pure, which
> has the advantage that is still performs format checking even
> if debugging is disabled.

Taking my above statement into account, this becomes useless, since
svc_print_addr() is used inside those "empty" macros and is compiled
out automatically.




[PATCH] SUNRPC: Compile out bufs for debug printks

2008-02-20 Thread Pavel Emelyanov
There are many places which declare a char buf[...] on the stack only
to pass it later to dprintk(). Since dprintk() sometimes (if
CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
cause gcc to produce unused-variable warnings.

Introduce a macro that declares such a buf as __maybe_unused.

More candidates for patching were found by Joe Perches.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0822646..b7e179e 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -153,7 +153,7 @@ lockd(struct svc_rqst *rqstp)
 */
while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
long timeout = MAX_SCHEDULE_TIMEOUT;
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
 
if (signalled()) {
flush_signals(current);
diff --git a/fs/lockd/svc4proc.c b/fs/lockd/svc4proc.c
index 385437e..5643f44 100644
--- a/fs/lockd/svc4proc.c
+++ b/fs/lockd/svc4proc.c
@@ -436,7 +436,7 @@ nlm4svc_proc_sm_notify(struct svc_rqst *rqstp, struct nlm_reboot *argp,
dprintk("lockd: SM_NOTIFY called\n");
if (saddr.sin_addr.s_addr != htonl(INADDR_LOOPBACK)
 || ntohs(saddr.sin_port) >= 1024) {
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
printk(KERN_WARNING "lockd: rejected NSM callback from %s\n",
svc_print_addr(rqstp, buf, sizeof(buf)));
return rpc_system_err;
diff --git a/fs/lockd/svcproc.c b/fs/lockd/svcproc.c
index 88379cc..5f0cf50 100644
--- a/fs/lockd/svcproc.c
+++ b/fs/lockd/svcproc.c
@@ -468,7 +468,7 @@ nlmsvc_proc_sm_notify(struct svc_rqst *rqstp, struct nlm_reboot *argp,
dprintk("lockd: SM_NOTIFY called\n");
if (saddr.sin_addr.s_addr != htonl(INADDR_LOOPBACK)
 || ntohs(saddr.sin_port) >= 1024) {
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
printk(KERN_WARNING "lockd: rejected NSM callback from %s\n",
svc_print_addr(rqstp, buf, sizeof(buf)));
return rpc_system_err;
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index bd185a5..33950dc 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -165,7 +165,7 @@ void nfs_callback_down(void)
 static int nfs_callback_authenticate(struct svc_rqst *rqstp)
 {
struct nfs_client *clp;
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
 
/* Don't talk to strangers */
clp = nfs_find_client(svc_addr(rqstp), 4);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 8fbd2dc..94fe70e 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -101,7 +101,7 @@ static __be32 nfsd_setuser_and_check_port(struct svc_rqst *rqstp,
 {
/* Check if the request originated from a secure port. */
if (!rqstp->rq_secure && EX_SECURE(exp)) {
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
dprintk(KERN_WARNING
   "nfsd: request from insecure port %s!\n",
   svc_print_addr(rqstp, buf, sizeof(buf)));
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 977a71f..66383ae 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -148,7 +148,7 @@ nfsd_proc_read(struct svc_rqst *rqstp, struct nfsd_readargs *argp,
 */
 
if (NFSSVC_MAXBLKSIZE_V2 < argp->count) {
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
printk(KERN_NOTICE
"oversized read request from %s (%d bytes)\n",
svc_print_addr(rqstp, buf, sizeof(buf)),
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 64c9755..fd701e6 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -402,6 +402,8 @@ char * svc_print_addr(struct svc_rqst *, char *, size_t);
 
 #define RPC_MAX_ADDRBUFLEN  (63U)
 
+#define DECLARE_RPC_BUF(name)  char name[RPC_MAX_ADDRBUFLEN] __maybe_unused
+
 /*
  * When we want to reduce the size of the reserved space in the response
  * buffer, we need to take into account the size of any checksum data that
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index a290e15..895d365 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -813,7 +813,7 @@ svc_printk(struct svc_rqst *rqstp, const char *fmt, ...)
 {
va_list args;
int r;
-   char buf[RPC_MAX_ADDRBUFLEN];
+   DECLARE_RPC_BUF(buf);
 
if (!net_ratelimit())
return 0;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d3e5fc..87dc4bc 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -175,7 +175,7 @@ static int svc_sendto(struct svc_rqst *rqstp, struct xdr_buf *xdr)
size_t  base = xdr->page_base;
unsigned intpg
