Re: [VLAN]: Avoid expensive divides

2008-01-08 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 12:28:52 +0100

> We can avoid divides (as seen with CONFIG_CC_OPTIMIZE_FOR_SIZE=y on x86)
> changing vlan_group_get_device()/vlan_group_set_device()  id parameter 
> from signed to
> unsigned.
> 
> Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

Applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Wed, 09 Jan 2008 08:51:39 +0100

> Yes I have a patch for these divides, but will apply on 2.6.25 once this one 
> hits it.  (this saves 192 bytes of kernel text BTW)

I never doubted you for a second.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [IPv6]: IPV6_MULTICAST_IF setting is ignored on link-local connect()

2008-01-08 Thread David Miller
From: David Stevens <[EMAIL PROTECTED]>
Date: Mon, 7 Jan 2008 17:18:56 -0800

> Acked-by: David L Stevens <[EMAIL PROTECTED]>

Patch applied, thanks everyone.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread Eric Dumazet

David Miller a écrit :

From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Wed, 09 Jan 2008 08:29:11 +0100

Thanks for catching this.

Applied to net-2.6


+static inline int xfrm_alg_len(struct xfrm_algo *alg)
+{
+   return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
+}


That gets emitted as a divide doesn't it :-


Yes I have a patch for these divides, but will apply on 2.6.25 once this one 
hits it.  (this saves 192 bytes of kernel text BTW)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SOCK]: Adds a rcu_dereference() in sk_filter

2008-01-08 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Wed, 09 Jan 2008 17:17:06 +1100

> Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > 
> > It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU 
> > protection for sk_filter(), without a rcu_dereference()
> > 
> > Either we need a rcu_dereference(), either a comment should explain why we 
> > dont need it. I vote for the former.
> > 
> > Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
> 
> Acked-by: Herbert Xu <[EMAIL PROTECTED]>

Applied, thanks everyone.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache

2008-01-08 Thread Eric Dumazet

Herbert Xu a écrit :

Eric Dumazet <[EMAIL PROTECTED]> wrote:
Very good question, but honestly I really dont see why it was there at the 
first place :


It was there because someone went through this file and robotically
replaced all conditional read barriers with rcu_dereference, even when
it made zero sense.

Basically you can add a conditional barrier either at the point where
the pointer gets read, or where it gets derferenced.  Previously we
did the latter (except that the show function didn't have a barrier
at all which is technically a bug though harmless in pratice).  This
patch moves it to the spot where it gets read which is also OK.


static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
{
-   struct rt_cache_iter_state *st = rcu_dereference(seq->private);
+   struct rt_cache_iter_state *st = seq->private;

-   r = r->u.dst.rt_next;
+   r = rcu_dereference(r->u.dst.rt_next);
   while (!r) {
   rcu_read_unlock_bh();
   if (--st->bucket < 0)
   break;
   rcu_read_lock_bh();
-   r = rt_hash_table[st->bucket].chain;
+   r = rcu_dereference(rt_hash_table[st->bucket].chain);
   }
   return r;


Slight optimisation: please move both barriers onto the return statement,
i.e.,

return rcu_dereference(r);


I am not sure this is valid, since it will do this :

r = rt_hash_table[st->bucket].chain;
if (r)
return rcu_dereference(r);

So compiler might be dumb enough do dereference 
&rt_hash_table[st->bucket].chain two times.



It seems Dipankar is busy at this moment, so I will post again the patch and 
ask a comment from Paul :)


Thank you

[NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache

In rt_cache_get_next(), no need to guard seq->private by a rcu_dereference()
since seq is private to the thread running this function. Reading seq.private
once (as guaranted bu rcu_dereference()) or several time if compiler really is 
dumb enough wont change the result.


But we miss real spots where rcu_dereference() are needed, both in 
rt_cache_get_first() and rt_cache_get_next()


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d337706..3b7562f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -278,7 +278,7 @@ static struct rtable *rt_cache_get_first(struct seq_file 
*seq)
 
for (st->bucket = rt_hash_mask; st->bucket >= 0; --st->bucket) {
rcu_read_lock_bh();
-   r = rt_hash_table[st->bucket].chain;
+   r = rcu_dereference(rt_hash_table[st->bucket].chain);
if (r)
break;
rcu_read_unlock_bh();
@@ -288,15 +288,15 @@ static struct rtable *rt_cache_get_first(struct seq_file 
*seq)
 
 static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
 {
-   struct rt_cache_iter_state *st = rcu_dereference(seq->private);
+   struct rt_cache_iter_state *st = seq->private;
 
-   r = r->u.dst.rt_next;
+   r = rcu_dereference(r->u.dst.rt_next);
while (!r) {
rcu_read_unlock_bh();
if (--st->bucket < 0)
break;
rcu_read_lock_bh();
-   r = rt_hash_table[st->bucket].chain;
+   r = rcu_dereference(rt_hash_table[st->bucket].chain);
}
return r;
 }


Re: [XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Wed, 09 Jan 2008 08:29:11 +0100

Thanks for catching this.

Applied to net-2.6

> +static inline int xfrm_alg_len(struct xfrm_algo *alg)
> +{
> + return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
> +}

That gets emitted as a divide doesn't it :-
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please pull 'upstream-davem' branch of wireless-2.6

2008-01-08 Thread David Miller
From: "John W. Linville" <[EMAIL PROTECTED]>
Date: Tue, 8 Jan 2008 14:29:14 -0500

> Here are a few more for 2.6.25.  The are mostly clean-ups for the new
> PID rate control algorithm, and some A-MPDU bits related to supporting
> 802.11n.

Pulled and pushed back out to net-2.6.25, thanks John.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread Eric Dumazet

Herbert Xu a écrit :

Eric Dumazet <[EMAIL PROTECTED]> wrote:

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 58dfa82..731f0a8 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
   return ret;
}

+static inline int alg_len(struct xfrm_algo *alg)


Could you please add an xfrm prefix to this?


Sure, thanks for the suggestion :)

[XFRM]: xfrm_algo_clone() allocates too much memory

alg_key_len is the length in bits of the key, not in bytes.

Best way to fix this is to move alg_len() function from net/xfrm/xfrm_user.c 
to include/net/xfrm.h, and to use it in xfrm_algo_clone()


alg_len() is renamed to xfrm_alg_len() because of its global exposition.

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

 include/net/xfrm.h   |7 ++-
 net/xfrm/xfrm_user.c |   17 ++---
 2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 58dfa82..1dd20cf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
return ret;
 }
 
+static inline int xfrm_alg_len(struct xfrm_algo *alg)
+{
+   return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
+}
+
 #ifdef CONFIG_XFRM_MIGRATE
 static inline struct xfrm_algo *xfrm_algo_clone(struct xfrm_algo *orig)
 {
-   return (struct xfrm_algo *)kmemdup(orig, sizeof(*orig) + 
orig->alg_key_len, GFP_KERNEL);
+   return kmemdup(orig, xfrm_alg_len(orig), GFP_KERNEL);
 }
 
 static inline void xfrm_states_put(struct xfrm_state **states, int n)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e75dbdc..c4f6419 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -31,11 +31,6 @@
 #include 
 #endif
 
-static inline int alg_len(struct xfrm_algo *alg)
-{
-   return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
-}
-
 static int verify_one_alg(struct nlattr **attrs, enum xfrm_attr_type_t type)
 {
struct nlattr *rt = attrs[type];
@@ -45,7 +40,7 @@ static int verify_one_alg(struct nlattr **attrs, enum 
xfrm_attr_type_t type)
return 0;
 
algp = nla_data(rt);
-   if (nla_len(rt) < alg_len(algp))
+   if (nla_len(rt) < xfrm_alg_len(algp))
return -EINVAL;
 
switch (type) {
@@ -204,7 +199,7 @@ static int attach_one_algo(struct xfrm_algo **algpp, u8 
*props,
return -ENOSYS;
*props = algo->desc.sadb_alg_id;
 
-   p = kmemdup(ualg, alg_len(ualg), GFP_KERNEL);
+   p = kmemdup(ualg, xfrm_alg_len(ualg), GFP_KERNEL);
if (!p)
return -ENOMEM;
 
@@ -516,9 +511,9 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
NLA_PUT_U64(skb, XFRMA_LASTUSED, x->lastused);
 
if (x->aalg)
-   NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x->aalg), x->aalg);
+   NLA_PUT(skb, XFRMA_ALG_AUTH, xfrm_alg_len(x->aalg), x->aalg);
if (x->ealg)
-   NLA_PUT(skb, XFRMA_ALG_CRYPT, alg_len(x->ealg), x->ealg);
+   NLA_PUT(skb, XFRMA_ALG_CRYPT, xfrm_alg_len(x->ealg), x->ealg);
if (x->calg)
NLA_PUT(skb, XFRMA_ALG_COMP, sizeof(*(x->calg)), x->calg);
 
@@ -1978,9 +1973,9 @@ static inline size_t xfrm_sa_len(struct xfrm_state *x)
 {
size_t l = 0;
if (x->aalg)
-   l += nla_total_size(alg_len(x->aalg));
+   l += nla_total_size(xfrm_alg_len(x->aalg));
if (x->ealg)
-   l += nla_total_size(alg_len(x->ealg));
+   l += nla_total_size(xfrm_alg_len(x->ealg));
if (x->calg)
l += nla_total_size(sizeof(*x->calg));
if (x->encap)


Re: SACK scoreboard

2008-01-08 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 9 Jan 2008 08:03:18 +0100

> Also even freeing a lot of objects doesn't have to be
> that expensive.  I suspect the most cost is in taking
> the slab locks, but that could be batched.

We're touching SKB struct members, doing atomics on them, etc. for
objects we haven't referenced for at least two RTTs so are guarenteed
to be cache cold.

Let's say best case we can get it down to 2 cache line misses, and as
a very aggressive goal we can get the cost down to 100 cycles.  For
500,000 packets this is 500 million cpu cycles to free them all up.

That's 1/4 of a second even on a 2 GHZ cpu.

And yes there are inherent costs in handling TCP windows that are
500,000 packets in size.  But, that freeing cost should be spread
throughout the handling of the RTT feedback, not handled all at once.

> Your hand waved numbers on inline sizes there were definitely worse 
> than mine. 

Your primary objective seems to be "being right", and that's fine but
realize that it makes discussing anything with you about as fun as
picking one's toe nails with an ice axe.

Eventually you will be ignored by most folks who get fed up by this
style of argument.

So, have fun being right rather than being pleasant to work with.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread Andi Kleen
> It adds severe spikes in CPU utilization that are even moderate
> line rates begins to affect RTTs.
> 
> Or do you think it's OK to process 500,000 SKBs while locked
> in a software interrupt.

You can always push it into a work queue.  Even put it to
other cores if you want. 

In fact this is already done partly for the ->completion_queue.
Wouldn't be a big change to queue it another level down.

Also even freeing a lot of objects doesn't have to be
that expensive. I suspect the most cost is in taking
the slab locks, but that could be batched. Without
that the kmem_free fast path isn't particularly
expensive, as long as the headers are still in cache.

Long ago I had sk_buff optimized for the routing case so that freeing
can be all done with a single cache line. That is 
long gone, but could be probably restored.

But asking for protocol changes just to work around such a
internal implementation detail is ...  

> Perhaps you have another broken awk script to prove this :-)

Your hand waved numbers on inline sizes there were definitely worse 
than mine. 

-Andi

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7]: [NET]: Fix drivers to handle napi_disable() disabling interrupts.

2008-01-08 Thread David Miller
From: Don Fry <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 22:14:29 -0800

> Tested pcnet32 on x86_64 box and see no problems with the change.
> The code is only exercised if doing loopback testing, or changing
> the ring size during a receive storm.
> 
> Acked-by:  Don Fry <[EMAIL PROTECTED]>

Thanks for testing and the feedback.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] bridge-utils 1.4

2008-01-08 Thread Denys Fedoryshchenko

As mentioned in
http://marc.info/?l=linux-bridge&m=113105949718826&w=2

Released package doesn't contain ./configure script

For people who know what is make on, it is easy to run autoconf , but some 
know only how to use ./configure :-)

Other than this, it is works fine with me, but i didn't test it deeply yet.

On Tue, 8 Jan 2008 08:56:07 -0800, Stephen Hemminger wrote
> Minor update to bridge-utils. Mostly fixing bugs in usage of sysfs.
> 
> Release tarball:
>   http://downloads.sourceforge.net/bridge/bridge-utils-1.4.tar.gz
> 
> Alon Bar-Lev (1):
>   Allow bridge-utils to run when no TCP/IP is available
> 
> Denys Vlasenko (1):
>   fix use of sysfs (affects 32/64 bit compat)
> 
> Jeremy Jackson (1):
>   Fix parsing of port_id's (hex).
> 
> Stephen Hemminger (3):
>   Add ignore for generated files.
>   Update gitignore
>   Use linux/if.h rather than net/if.h
> 
> -- 
> Stephen Hemminger <[EMAIL PROTECTED]>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread David Miller
From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 23:27:08 -0500

> I also wonder how much of a problem this is (for now, with window sizes 
> of order 1 packets.  My understanding is that the biggest problems 
> arise from O(N^2) time for recovery because every ack was expensive. 
> Have current tests shown the final ack to be a major source of problems?

Yes, several people have reported this.

It's a real issue, and it's only going to get worse.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 09 Jan 2008 03:25:05 +0100

> David Miller <[EMAIL PROTECTED]> writes:
> >
> > The big problem is that recovery from even a single packet loss in a
> > window makes us run kfree_skb() for a all the packets in a full
> > window's worth of data when recovery completes.
> 
> Why exactly is it a problem to free them all at once? Are you worried
> about kernel preemption latencies?

If the cpu is there spinning freeing up 500,000 SKBs, it isn't
processing RX packets.

It adds severe spikes in CPU utilization that are even moderate
line rates begins to affect RTTs.

Or do you think it's OK to process 500,000 SKBs while locked
in a software interrupt.

Perhaps you have another broken awk script to prove this :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Krzysztof Oledzki



On Tue, 8 Jan 2008, Jay Vosburgh wrote:


Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:


On Mon, 7 Jan 2008, Jay Vosburgh wrote:


Following are three fixes to fix locking problems and
silence locking-related warnings in the current 2.6.24-rc.

patch 1: fix locking in sysfs primary/active selection

Call core network functions with expected locks to
eliminate potential deadlock and silence warnings.

patch 2: fix ASSERT_RTNL that produces spurious warnings

Relocate ASSERT_RTNL to remove a false warning; after patch,
ASSERT is located in code that holds only RTNL (additional locks were
causing the ASSERT to trip)

patch 3: fix locking during alb failover and slave removal

Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
Eliminates deadlock and silences warnings.

Patches are against the current netdev-2.6#upstream branch.

Please apply for 2.6.24.


2.6.24-rc7 + patches #1, #2, #3:

bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow 
Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

=
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc7 #1
-
events/0/9 just changed the state of lock:
(&mc->mca_lock){-+..}, at: [] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
(&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.


Just to be clear: the patch set I posted yesterday was not
intended to resolve the lockdep problem; I haven't studied that one yet.


Fine. Just let you know that someone test your patches and everything 
works, except mentioned problem.


Best regards,

Krzysztof Olędzki

Re: SACK scoreboard

2008-01-08 Thread David Miller
From: "Lachlan Andrew" <[EMAIL PROTECTED]>
Date: Tue, 8 Jan 2008 17:34:03 -0800

> John also suggested freeing the packets as a lower priority task, just
> doing it after they're acknowledged.
> 
> When the ACK finally comes, you could do something like moving John's
> entire list of packets to a "to be freed" list, and free a few every
> time (say) another ACK comes in.

You could, but this requires extra state and the operations to
properly queue such items to the deferred task and wake it up.

And none of this would be necessary if we could make SACK
indications hard.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7]: [NET]: Fix drivers to handle napi_disable() disabling interrupts.

2008-01-08 Thread Don Fry
Tested pcnet32 on x86_64 box and see no problems with the change.
The code is only exercised if doing loopback testing, or changing
the ring size during a receive storm.

Acked-by:  Don Fry <[EMAIL PROTECTED]>
---

[NET]: Fix drivers to handle napi_disable() disabling interrupts.

When we add the generic napi_disable_pending() breakout
logic to net_rx_action() it means that napi_disable()
can cause NAPI poll interrupt events to be disabled.

And this is exactly what we want.  If a napi_disable()
is pending, and we are looping in the ->poll(), we want
->poll() event interrupts to stay disabled and we want
to complete the NAPI poll ASAP.

When ->poll() break out during device down was being handled on a
per-driver basis, often these drivers would turn interrupts back on
when '!netif_running()' was detected.

And this would just cause a reschedule of the NAPI ->poll() in the
interrupt handler before the napi_disable() could get in there and
grab the NAPI_STATE_SCHED bit.

The vast majority of drivers don't care if napi_disable() might have
the side effect of disabling NAPI ->poll() event interrupts.  In all
such cases, when a napi_disable() is performed, the driver just
disabled interrupts or is about to.

However there were three exceptions to this in PCNET32, R8169, and
SKY2.  To fix those cases, at the subsequent napi_enable() points, I
added code to ensure that the ->poll() interrupt events are enabled in
the hardware.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
---
 drivers/net/pcnet32.c |5 +
 drivers/net/r8169.c   |2 ++
 drivers/net/sky2.c|3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/pcnet32.c b/drivers/net/pcnet32.c
index ff92aca..90498ff 100644
--- a/drivers/net/pcnet32.c
+++ b/drivers/net/pcnet32.c
@@ -455,9 +455,14 @@ static void pcnet32_netif_start(struct net_device
*dev)
 {
 #ifdef CONFIG_PCNET32_NAPI
struct pcnet32_private *lp = netdev_priv(dev);
+   ulong ioaddr = dev->base_addr;
+   u16 val;
 #endif
netif_wake_queue(dev);
 #ifdef CONFIG_PCNET32_NAPI
+   val = lp->a.read_csr(ioaddr, CSR3);
+   val &= 0x00ff;
+   lp->a.write_csr(ioaddr, CSR3, val);
napi_enable(&lp->napi);
 #endif
 }
-- 
1.5.4.rc2.38.gd6da3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SOCK]: Adds a rcu_dereference() in sk_filter

2008-01-08 Thread Herbert Xu
Eric Dumazet <[EMAIL PROTECTED]> wrote:
> 
> It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU 
> protection for sk_filter(), without a rcu_dereference()
> 
> Either we need a rcu_dereference(), either a comment should explain why we 
> dont need it. I vote for the former.
> 
> Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

Acked-by: Herbert Xu <[EMAIL PROTECTED]>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread Herbert Xu
Eric Dumazet <[EMAIL PROTECTED]> wrote:
> 
> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index 58dfa82..731f0a8 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
>return ret;
> }
> 
> +static inline int alg_len(struct xfrm_algo *alg)

Could you please add an xfrm prefix to this?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Herbert Xu
Andy Gospodarek <[EMAIL PROTECTED]> wrote:
> 
> Jay's patches will not fix this issue.  I think something like this did
> it for me, but as I mentioned to Jay in the last thread, I'm not
> convinced it doesn't violate some of the locking expectations we have.
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 423298c..3c6619a 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3915,7 +3915,7 @@ static void bond_set_multicast_list(struct net_device 
> *bond_dev)
>struct bonding *bond = bond_dev->priv;
>struct dev_mc_list *dmi;
> 
> -   write_lock_bh(&bond->lock);
> +   read_lock(&bond->lock);

This is wrong.  As I said before, the correct fix (until set_multicast
no longer gets called from BH context) is to change all you process
context read_lock's to read_lock_bh.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread linux
Just some idle brainstorming on the subject...

It seems the only way to handle network pipes sigificantly larger (delay *
bandwidth product) than the processor cache is to make freeing retransmit
data o(n).

Now, there are some ways to reduce the constant factor.  The one that
comes to mind first is to not queue sk_buffs.  Throw away the struct
sk_buff after transmission and just queue skb_frag_structs, pages, or
maybe even higher-order pages of data.  Then freeing the data when it's
acked has a much smaller constant factor, particularly d-cache footprint,
and no slab operations.

The downside is more work to recreate the skb if you do have to
retransmit, but optimizing for retransmits is silly.

Some implementations could leave large chunks of memory locked until
all of the sk_buff->skb_shared_info->skb_frag_structs referencing them
have gone away, but you can look at the transmit window when deciding
how big a chunk size to use.


Then, to actually get below O(n), you want to keep the queued data in
a data structure known to the memory manager.  Basically, splice the
retransmit queue onto the free list.

It may require some kludgery in the memory manager.  In particular, doing
that in O(1) time obviously means that you can't coalesce adjacent free
regions to build higher-order pages.  So you'd have to have a threshold
for uncoalesced pages and a way to force coalescing under memory pressure.

You're just deferring work until the page is allocated, but the point
is that then it's okay to bring it into cache when it's about to be
used again.  It's the redundant round trip just because an ack arrived
that's annoying.

I've done thins sort of thing with specialized fixed-block-size allocators
before (an alpha-beta minimax search tree allocates nodes one at a
time, but frees whole subtrees at once), but might it be feasible for
kernel use?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPV4] ROUTE: ip_rt_dump() is unecessary slow

2008-01-08 Thread Herbert Xu
Eric Dumazet <[EMAIL PROTECTED]> wrote:
> 
> Very good question, but honestly I really dont see why it was there at the 
> first place :

It was there because someone went through this file and robotically
replaced all conditional read barriers with rcu_dereference, even when
it made zero sense.

Basically you can add a conditional barrier either at the point where
the pointer gets read, or where it gets derferenced.  Previously we
did the latter (except that the show function didn't have a barrier
at all which is technically a bug though harmless in pratice).  This
patch moves it to the spot where it gets read which is also OK.

> static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable 
> *r)
> {
> -   struct rt_cache_iter_state *st = rcu_dereference(seq->private);
> +   struct rt_cache_iter_state *st = seq->private;
> 
> -   r = r->u.dst.rt_next;
> +   r = rcu_dereference(r->u.dst.rt_next);
>while (!r) {
>rcu_read_unlock_bh();
>if (--st->bucket < 0)
>break;
>rcu_read_lock_bh();
> -   r = rt_hash_table[st->bucket].chain;
> +   r = rcu_dereference(rt_hash_table[st->bucket].chain);
>}
>return r;

Slight optimisation: please move both barriers onto the return statement,
i.e.,

return rcu_dereference(r);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread John Heffner

Andi Kleen wrote:

David Miller <[EMAIL PROTECTED]> writes:

The big problem is that recovery from even a single packet loss in a
window makes us run kfree_skb() for a all the packets in a full
window's worth of data when recovery completes.


Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi



I also wonder how much of a problem this is (for now, with window sizes 
of order 1 packets.  My understanding is that the biggest problems 
arise from O(N^2) time for recovery because every ack was expensive. 
Have current tests shown the final ack to be a major source of problems?


  -John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Please pull 'upstream-jgarzik' branch of wireless-2.6 (use this one)

2008-01-08 Thread John W. Linville
[ 2nd try...more thoroughly checked... ]

Jeff,

Another round of patches intended for 2.6.25...the biggest factions are
rt2x00 and b43 updates, as well as some Viro-isms... :-)

Please let me know if there are any problems!

John

P.S.  Copying Dave in case he is handling these requests...FWIW, it
will definitely depend on the patches already in netdev-2.6#upstream...

---

Individual patches are available here:


http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik

---

The following changes since commit 65d0aa09c183ee45dc1786675209313fa75cf4ec:
  Jeff Garzik (1):
wireless/iwl: fix namespace breakage

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git 
upstream-jgarzik

Al Viro (35):
  eliminate byteswapping in struct ieee80211_qos_parameters
  several missing cpu_to_le16() in ieee80211softmac_capabilities()
  ieee80211softmac_auth_resp() fix
  ieee80211: fix misannotations
  ieee80211: beacon->capability is little-endian
  airo: fix transmit_802_11_packet()
  airo: fix endianness bug in ->dBm handling
  airo: bug in airo_interrupt() handling on incoming 802.11
  airo endianness bug: cap_rid.extSoftCap
  airo: fix writerids() endianness
  hostap: fix endianness with txdesc->sw_support
  p54common annotations and fixes
  ipw2100 annotations and fixes
  ray_cs fixes
  ipw2200 fix: struct ieee80211_radiotap_header is little-endian
  ipw2200 fix: ->rt_chbitmask is le16
  ipw2200: ipw_tx_skb() endianness bug
  airo: trivial endianness annotations
  airo: sanitize handling of SSID_rid
  bap_read()/bap_write() work with fixed-endian buffers
  airo: sanitize BSSListRid handling
  airo: sanitize handling of WepKeyRid
  airo: sanitize handling of StatsRid
  airo: sanitize handling of CapabilityRid
  airo: sanitize APListRid handling
  airo: sanitize handling of StatusRid
  airo: last of endianness annotations
  hostap annotations
  hostap: don't mess with mixed-endian even for internal skb queues
  p54pci: endianness annotations and fixes
  bcm43xx annotations
  prism54 trivial annotations
  ipw2200 trivial annotations
  ipw2200: do not byteswap struct ipw_associate
  misc wireless annotations

Daniel Walker (1):
  prism54: remove questionable down_interruptible usage

Ivo van Doorn (12):
  rt2x00: Fix chipset debugfs file
  rt2x00: Always call ieee80211_stop_queue() when return NETDEV_TX_BUSY
  rt2x00: Only set the TBCN flag when the interface is configured to send 
beacons.
  rt2x00: Store queue idx and entry idx in data_ring and data_entry
  rt2x00: Move start() and stop() handlers into rt2x00lib.c
  rt2x00: Put 802.11 data on 4 byte boundary
  rt2x00: Move packet filter flags
  rt2x00: Cleanup write_tx_desc() arguments
  rt2x00: Determine MY_BSS from descriptor
  rt2x00: Move init_txring and init_rxring into rt2x00lib
  rt2x00: Correctly initialize data and desc pointer
  rt2x00: Release rt2x00 2.0.14

John W. Linville (1):
  Revert "rtl8187: fix tx power reading"

Michael Buesch (10):
  ssb: Fix extraction of values from SPROM
  b43: Only select allowed TX and RX antennas
  b43: Fix chip access validation for new devices
  ssb: Fix PCMCIA lowlevel register access
  b43: Remove PIO support
  b43: Add definitions for MAC Control register
  b43-ssb-bridge: Add PCI ID for BCM43XG
  b43: Add NPHY kconfig option
  b43: Fix any N-PHY related WARN_ON() in the attach stage.
  zd1211rw: fix alignment for QOS and WDS frames

Miguel Botón (3):
  ssb: add 'ssb_pcihost_set_power_state' function
  b44: power down PHY when interface down
  iwlwifi: fix compilation warning in 'iwl-4965.c'

Zhu Yi (1):
  iwlwifi: fix typo in 'drivers/net/wireless/iwlwifi/Kconfig'

 drivers/net/b44.c |   28 +-
 drivers/net/wireless/adm8211.c|8 +-
 drivers/net/wireless/airo.c   | 1233 -
 drivers/net/wireless/atmel.c  |   30 +-
 drivers/net/wireless/b43/Kconfig  |   58 +-
 drivers/net/wireless/b43/Makefile |9 +-
 drivers/net/wireless/b43/b43.h|   78 +-
 drivers/net/wireless/b43/debugfs.c|1 -
 drivers/net/wireless/b43/dma.c|   19 +-
 drivers/net/wireless/b43/dma.h|   50 -
 drivers/net/wireless/b43/main.c   |  357 +++-
 drivers/net/wireless/b43/main.h   |3 +
 drivers/net/wireless/b43/xmit.c   |   26 +-
 drivers/net/wireless/b43legacy/main.c |5 -
 drivers/net/wireless/b43legacy/phy.c  |2 +-
 drivers/net/wireless/bcm43xx/bcm43xx.h|6 +-
 drivers/net/wireless/bcm43xx/bcm43xx_main.c   |   40 +-
 drivers/net/wireless/bcm43xx/

Re: [PATCH net-2.6.25 2/6][IPVS] Switch to using ctl_paths.

2008-01-08 Thread Simon Horman
On Tue, Jan 08, 2008 at 06:58:11PM +0300, Pavel Emelyanov wrote:
> The feature of ipvs ctls is that the net/ipv4/vs path
> is common for core ipvs ctls and for two schedulers,
> so I make it exported and re-use it in modules.
> 
> Two other .c files required linux/sysctl.h to make the
> extern declaration of this path compile well.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Thanks, this looks good to me and I've confirmed that
the same entires with the same permissions exist under
/proc/sys/net/ipv4/vs before and after the change.

Acked-by: Simon Horman <[EMAIL PROTECTED]>

-- 
Horms

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread Andi Kleen
David Miller <[EMAIL PROTECTED]> writes:
>
> The big problem is that recovery from even a single packet loss in a
> window makes us run kfree_skb() for a all the packets in a full
> window's worth of data when recovery completes.

Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Belay that... -- Re: Please pull 'upstream-jgarzik' branch of wireless-2.6

2008-01-08 Thread John W. Linville
Please don't pull yet -- I let a patch get in out of order.

I'll post a new pull request when I straighten this out...

John

On Tue, Jan 08, 2008 at 05:42:02PM -0500, John W. Linville wrote:
> On Tue, Jan 08, 2008 at 05:23:05PM -0500, John W. Linville wrote:
> > Jeff,
> > 
> > Another round of patches intended for 2.6.25...the biggest factions are
> > rt2x00 and b43 updates, as well as some Viro-isms... :-)
> > 
> > Please let me know if there are any problems!
> > 
> > John
> > 
> > P.S.  Copying Dave in case he is handling these requests...FWIW, it
> > will definitely depend on the patches already in netdev-2.6#upstream...
> 
> I left out a patch.  I have pushed it on top of the previous request.
> 
> Let me know if there are problems!
> 
> Thanks,
> 
> John
-- 
John W. Linville
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread Lachlan Andrew
Greetings David,

On 08/01/2008, David Miller <[EMAIL PROTECTED]> wrote:
> From: John Heffner <[EMAIL PROTECTED]>
>
> > I haven't thought about this too hard, but can we approximate this by
> > moving scaked data into a sacked queue, then if something bad happens
> > merge this back into the retransmit queue?
>
> That defeats the impetus for the change.
>
> We want to free up the data, say, 2 packets at a time as
> ACKs come in.  The key goal is smooth liberation of
> retransmit queue packets over time.

John also suggested freeing the packets as a lower priority task, just
doing it after they're acknowledged.

When the ACK finally comes, you could do something like moving John's
entire list of packets to a "to be freed" list, and free a few every
time (say) another ACK comes in.

$0.02,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6-mm1

2008-01-08 Thread Andrew Morton
On Wed, 09 Jan 2008 09:54:45 +0900
FUJITA Tomonori <[EMAIL PROTECTED]> wrote:

> > > --- a/lib/iommu-helper.c~a
> > > +++ a/lib/iommu-helper.c
> > > @@ -8,15 +8,20 @@
> > >  static unsigned long find_next_zero_area(unsigned long *map,
> > >unsigned long size,
> > >unsigned long start,
> > > -  unsigned int nr)
> > > +  unsigned int nr,
> > > +  unsigned long align_mask)
> > >  {
> > >   unsigned long index, end, i;
> > >  again:
> > >   index = find_next_zero_bit(map, size, start);
> > > +
> > > + /* Align allocation */
> > > + index = (index + align_mask) & ~align_mask;
> > 
> > The ALIGN() macro is the approved way of doing this.
> > 
> > (I don't think ALIGN adds much value really, especially given that you've
> > commented what's going on, but I guess it does make reviewing and reading a
> > little easier).
> 
> Would be better to use __ALIGN_MASK? I can find only one user who
> directly use __ALIGN_MASK. The POWER IOMMU calculates align_mask by
> itself so it's easier to pass align_mask as an argument.

ALIGN() should be OK - its aditional type coercion isn't useful in this
case but ALIGN() is the official interface.

I don't see any reason why vermilion.c had to reach for __ALIGN_MASK.  I'll
switch it to ALIGN().

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6-mm1

2008-01-08 Thread FUJITA Tomonori
On Tue, 8 Jan 2008 16:27:39 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Wed, 09 Jan 2008 08:57:53 +0900
> FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> 
> > Andrew, can you replace
> > 
> > iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch
> > 
> > with the updated patch:
> > 
> > http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048997.html
> > 
> > For your convenience I've attached the updated patch too.
> 
> 

Thanks for putting the fix to -mm.


> > --- a/lib/iommu-helper.c~a
> > +++ a/lib/iommu-helper.c
> > @@ -8,15 +8,20 @@
> >  static unsigned long find_next_zero_area(unsigned long *map,
> >  unsigned long size,
> >  unsigned long start,
> > -unsigned int nr)
> > +unsigned int nr,
> > +unsigned long align_mask)
> >  {
> > unsigned long index, end, i;
> >  again:
> > index = find_next_zero_bit(map, size, start);
> > +
> > +   /* Align allocation */
> > +   index = (index + align_mask) & ~align_mask;
> 
> The ALIGN() macro is the approved way of doing this.
> 
> (I don't think ALIGN adds much value really, especially given that you've
> commented what's going on, but I guess it does make reviewing and reading a
> little easier).

Would be better to use __ALIGN_MASK? I can find only one user who
directly use __ALIGN_MASK. The POWER IOMMU calculates align_mask by
itself so it's easier to pass align_mask as an argument.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread linux
> Can you try the patch below ?

Testing now... (I presume you noticed the one-character typo in my
earlier patch.  That should be "mc = mc->next", not "mv = mc->next".)

That doesn't seem to do it.  Not entirely, at least.  After downloading
and partially re-uploading an 800M file, slabtop reports:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
341576 341574  99%0.50K  426978170788K kmalloc-512
342006 341953  99%0.19K  16286   21 65144K kmalloc-192
 30592  30575  99%2.00K   76484 61184K kmalloc-2048
 30213  30193  99%0.44K   33579 13428K skbuff_fclone_cache
  7650   7643  99%0.08K150   51   600K sysfs_dir_cache
  4000   3938  98%0.12K125   32   500K kmalloc-128
   258258 100%1.15K 436   344K raid5-md5
   232221  95%1.00K 584   232K kmalloc-1024
  3136   3110  99%0.06K 49   64   196K kmalloc-64
   264 80  30%0.68K 24   11   192K ext3_inode_cache

The "kmalloc-2048" was down in the noise before the upload started.
This is in single-user mode, after sync and echo 3 > /proc/sys/vm/drop_caches.


I'll have to try this after this evening's social plans, but I'm thinking
of implementing more rapid bug detection: explicitly zero the sp->TxBuff
slot when the skb is freed, and check that it is zero before putting
anything else in there.  (And likewise for RxBuff.)

That way, I don't have to use up a noticeable amount of memory to see
the bug and reboot to clear up the damage each test cycle.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6-mm1

2008-01-08 Thread Andrew Morton
On Wed, 09 Jan 2008 08:57:53 +0900
FUJITA Tomonori <[EMAIL PROTECTED]> wrote:

> Andrew, can you replace
> 
> iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch
> 
> with the updated patch:
> 
> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048997.html
> 
> For your convenience I've attached the updated patch too.




> --- a/lib/iommu-helper.c~a
> +++ a/lib/iommu-helper.c
> @@ -8,15 +8,20 @@
>  static unsigned long find_next_zero_area(unsigned long *map,
>unsigned long size,
>unsigned long start,
> -  unsigned int nr)
> +  unsigned int nr,
> +  unsigned long align_mask)
>  {
>   unsigned long index, end, i;
>  again:
>   index = find_next_zero_bit(map, size, start);
> +
> + /* Align allocation */
> + index = (index + align_mask) & ~align_mask;

The ALIGN() macro is the approved way of doing this.

(I don't think ALIGN adds much value really, especially given that you've
commented what's going on, but I guess it does make reviewing and reading a
little easier).


>   end = index + nr;
> - if (end > size)
> + if (end >= size)
>   return -1;
> - for (i = index + 1; i < end; i++) {
> + for (i = index; i < end; i++) {
>   if (test_bit(i, map)) {
>   start = i+1;
>   goto again;
> @@ -50,9 +55,8 @@ unsigned long iommu_area_alloc(unsigned 
>  {
>   unsigned long index;
>  again:
> - index = find_next_zero_area(map, size, start, nr);
> + index = find_next_zero_area(map, size, start, nr, align_mask);
>   if (index != -1) {
> - index = (index + align_mask) & ~align_mask;
>   if (is_span_boundary(index, nr, shift, boundary_size)) {
>   /* we could do more effectively */
>   start = index + 1;
> _
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Brent Casavant
On Tue, 8 Jan 2008, Tom Spink wrote:

> Ach.  I *am* missing something... and what I'm missing is my
> understanding of the sendmsg and recvmsg calls.

No problem.  We all have those days. :)

Brent

-- 
Brent Casavant  All music is folk music.  I ain't
[EMAIL PROTECTED]never heard a horse sing a song.
Silicon Graphics, Inc.-- Louis Armstrong
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6-mm1

2008-01-08 Thread FUJITA Tomonori
On Tue, 8 Jan 2008 16:59:48 +0100
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> 
> > The patches are available at:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/tomo/iommu/
> > 
> > Or if you prefer the git tree:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git 
> > iommu-sg-fixes
> 
> btw., these improvements to the IOMMU code are in -mm and will go into 
> v2.6.25, right? The changes look robust to me.

Thanks, they have been in -mm though the iommu helper fix hasn't
yet. Balbir Singh found the bug in 2.6.24-rc6-mm1. I've just check
mmotm and found that the IOMMU helper patch doesn't include the fix.

Andrew, can you replace

iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch

with the updated patch:

http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048997.html

For your convenience I've attached the updated patch too.

Hopefully, they will go into v2.6.25. At least, I hope that the
patches (0001-0011) that make the IOMMUs respect segment size limits
when merging sg lists will be merged. They are simple and I got ACKs
on POWER and PARISC.


Thanks,

=
From: FUJITA Tomonori <[EMAIL PROTECTED]>
Subject: [PATCH] add IOMMU helper functions for the free area management

This adds IOMMU helper functions for the free area management.  These
functions take care of LLD's segment boundary limit for IOMMUs.  They would be
useful for IOMMUs that use bitmap for the free area management.

Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
---
 include/linux/iommu-helper.h |7 
 lib/Makefile |1 +
 lib/iommu-helper.c   |   80 ++
 3 files changed, 88 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/iommu-helper.h
 create mode 100644 lib/iommu-helper.c

diff --git a/include/linux/iommu-helper.h b/include/linux/iommu-helper.h
new file mode 100644
index 000..4dd4c04
--- /dev/null
+++ b/include/linux/iommu-helper.h
@@ -0,0 +1,7 @@
+extern unsigned long iommu_area_alloc(unsigned long *map, unsigned long size,
+ unsigned long start, unsigned int nr,
+ unsigned long shift,
+ unsigned long boundary_size,
+ unsigned long align_mask);
+extern void iommu_area_free(unsigned long *map, unsigned long start,
+   unsigned int nr);
diff --git a/lib/Makefile b/lib/Makefile
index b6793ed..0e7383f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -64,6 +64,7 @@ obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 
 lib-$(CONFIG_GENERIC_BUG) += bug.o
diff --git a/lib/iommu-helper.c b/lib/iommu-helper.c
new file mode 100644
index 000..495575a
--- /dev/null
+++ b/lib/iommu-helper.c
@@ -0,0 +1,80 @@
+/*
+ * IOMMU helper functions for the free area management
+ */
+
+#include 
+#include 
+
+static unsigned long find_next_zero_area(unsigned long *map,
+unsigned long size,
+unsigned long start,
+unsigned int nr,
+unsigned long align_mask)
+{
+   unsigned long index, end, i;
+again:
+   index = find_next_zero_bit(map, size, start);
+
+   /* Align allocation */
+   index = (index + align_mask) & ~align_mask;
+
+   end = index + nr;
+   if (end >= size)
+   return -1;
+   for (i = index; i < end; i++) {
+   if (test_bit(i, map)) {
+   start = i+1;
+   goto again;
+   }
+   }
+   return index;
+}
+
+static inline void set_bit_area(unsigned long *map, unsigned long i,
+   int len)
+{
+   unsigned long end = i + len;
+   while (i < end) {
+   __set_bit(i, map);
+   i++;
+   }
+}
+
+static inline int is_span_boundary(unsigned int index, unsigned int nr,
+  unsigned long shift,
+  unsigned long boundary_size)
+{
+   shift = (shift + index) & (boundary_size - 1);
+   return shift + nr > boundary_size;
+}
+
+unsigned long iommu_area_alloc(unsigned long *map, unsigned long size,
+  unsigned long start, unsigned int nr,
+  unsigned long shift, unsigned long boundary_size,
+  unsigned long align_mask)
+{
+   unsigned long index;
+again:
+   index = find_next_zero_area(map, size, start, nr, align_mask);
+   if (index != -1) {
+   if (is_span_boundary(index, nr, shift, boundary_size)) {
+   /* we could do more ef

Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Tom Spink
On 08/01/2008, Tom Spink <[EMAIL PROTECTED]> wrote:
> On 08/01/2008, Brent Casavant <[EMAIL PROTECTED]> wrote:
> > On Tue, 8 Jan 2008, Tom Spink wrote:
> >
> > > Where in the code is the message length being sent across the socket?
> >
> > In do_producer(), there are the following lines in the main loop:
> >
> > /* Send random lengths of data */
> > messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
> > iov[i].iov_len = messages[i].length;
> >
> > The entire "struct sockmsg" is sent across the socket, so the first
> > size_t in each message contains the length of the entire message
> > (including the size_t).  This size gets picked up at the
> > recv(...,MSG_PEEK) line in do_consumer().
> >
> > Thanks,
> > Brent
> >
> > --
> > Brent Casavant  All music is folk music.  I ain't
> > [EMAIL PROTECTED]never heard a horse sing a song.
> > Silicon Graphics, Inc.-- Louis Armstrong
> >
>
> Hi,
>
> But you're not consuming the size_t on the other end.  You're only
> peeking it, i.e. you're doing the recv to peek at the message, but
> never calling recv to remove that data from the queue... or am I
> missing something?
>
> --
> Regards,
> Tom Spink
> University of Edinburgh
>

Ach.  I *am* missing something... and what I'm missing is my
understanding of the sendmsg and recvmsg calls.

A quick Google has sorted me out.

-- 
Regards,
Tom Spink
University of Edinburgh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread Francois Romieu
David Miller <[EMAIL PROTECTED]> :
[...]
> Same kind of bug as the RX side :-)  I bet this fixes his
> problem...

I am not sure but the Rx side is probably just here to distract
from the real problem. Please don't ask... :o)

Anyway I'll poke an adapter in the test computer and give it a
try tomorrow. Nobody will complain if I crash it.

-- 
Ueimor
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Tom Spink
On 08/01/2008, Brent Casavant <[EMAIL PROTECTED]> wrote:
> On Tue, 8 Jan 2008, Tom Spink wrote:
>
> > Where in the code is the message length being sent across the socket?
>
> In do_producer(), there are the following lines in the main loop:
>
> /* Send random lengths of data */
> messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
> iov[i].iov_len = messages[i].length;
>
> The entire "struct sockmsg" is sent across the socket, so the first
> size_t in each message contains the length of the entire message
> (including the size_t).  This size gets picked up at the
> recv(...,MSG_PEEK) line in do_consumer().
>
> Thanks,
> Brent
>
> --
> Brent Casavant  All music is folk music.  I ain't
> [EMAIL PROTECTED]never heard a horse sing a song.
> Silicon Graphics, Inc.-- Louis Armstrong
>

Hi,

But you're not consuming the size_t on the other end.  You're only
peeking it, i.e. you're doing the recv to peek at the message, but
never calling recv to remove that data from the queue... or am I
missing something?

-- 
Regards,
Tom Spink
University of Edinburgh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/net/ipg.c: fix horrible mdio_read and _write

2008-01-08 Thread linux
>> +do {
>> +/* IPG_PC_MGMTDATA is a power of 2; compiler knows to shift */
>> +u8 d = ((data >> --len) & 1) * IPG_PC_MGMTDATA;
>> +/* + rather than | lets compiler microoptimize better */
>> +ipg_drive_phy_ctl_low_high(ioaddr, d + otherbits);
>> +} while (len);

> Imho something is not quite right when the code needs a comment every line
> and I am mildly convinced that we really want to honk an "optimizing mdio
> methods is ok" signal around.

Oh, but those are SPACE-saving optimiztions. :-)
I know it's not time-critical; it's really pure hack value, but is it
that evil?

> "while (len--) {" is probably more akpm-ish btw.

Well spotted.

[...]
>>  static int mdio_read(struct net_device * dev, int phy_id, int phy_reg)
>>  {
>>  void __iomem *ioaddr = ipg_ioaddr(dev);
>> +u8 const polarity = ipg_r8(PHY_CTRL) &
>> +(IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY);

> (IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY) appears twice. I would not
> mind a #define for it.

I'm hardly going to go to war over over the matter, but actually I disagree.

There's a non-zero mental cost to keeping track of an additional name,
and when it's only used two times, and is pretty simple, I think reducing
the number of layers of #defines to understand is a positive advantage.
The above reads "the two polarity bits from the PHY_CTRL register"
to a person who's never read ipg.h.  Adding IPG_PC_POLARITY_BITS just
requires mentally dereferencing another layer of pointers.

Think of it as a function small enough that it can be inlined.

>> @@ -221,75 +240,30 @@ static int mdio_read(struct net_device * dev, int 
>> phy_id, int phy_reg)
>[...]
>> -for (i = 0; i < p[6].len; i++) {
>> -p[6].field |=
>> -(read_phy_bit(ioaddr, polarity) << (p[6].len - 1 - i));
>> -}
>> +send_three_state(ioaddr, polarity); /* TA first bit */
>> +(void)read_phy_bit(ioaddr, polarity);   /* TA second bit */
>> +
>> +for (i = 0; i < 16; i++)
>> +data += data + read_phy_bit(ioaddr, polarity);

> Huh ?

Okay, I guess you prefer
+   data = 2*data + read_phy_bit(ioaddr, polarity);

That's only one character longer and easier to understand.
Or even four characters:
+   data = (data<<1) + read_phy_bit(ioaddr, polarity);

That's just the synonym that happened to come out of my fingers at the
time.  There's no particular meaning to it.

>> @@ -299,11 +273,13 @@ static int mdio_read(struct net_device * dev, int 
>> phy_id, int phy_reg)
>>  static void mdio_write(struct net_device *dev, int phy_id, int phy_reg, int 
>> val)
[...]
>> +mdio_write_bits(ioaddr, polarity, GMII_PREAMBLE, 32);
>> +mdio_write_bits(ioaddr, polarity, GMII_ST << 14 | GMII_WRITE << 12 |
>> +  phy_id << 7 | phy_reg << 2 |
>> +  0x2, 16);

> Use the 80 cols luke:
> phy_id << 7 | phy_reg << 2 | 0x2, 16);

Good spotting, thanks.

Here's a revised patch:

drivers/net/ipg.c: Fixed style problems that AKPM noticed.

(And a few more while at it.  Including an actual bug in enabling multicast
due to & vs. && confusion.)

diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index 3860fcd..fb69374 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -188,9 +188,9 @@ static void send_end(void __iomem *ioaddr, u8 
phyctrlpolarity)
phyctrlpolarity) & IPG_PC_RSVD_MASK, PHY_CTRL);
 }
 
-static u16 read_phy_bit(void __iomem * ioaddr, u8 phyctrlpolarity)
+static unsigned read_phy_bit(void __iomem * ioaddr, u8 phyctrlpolarity)
 {
-   u16 bit_data;
+   unsigned bit_data;
 
ipg_write_phy_ctl(ioaddr, IPG_PC_MGMTCLK_LO | phyctrlpolarity);
 
@@ -202,12 +202,31 @@ static u16 read_phy_bit(void __iomem * ioaddr, u8 
phyctrlpolarity)
 }
 
 /*
+ * Transmit the given bits, MSB-first, through the MgmtData bit (bit 1)
+ * of the PhyCtrl register. 1 <= len <= 32.  "ioaddr" is the register
+ * address, and "otherbits" are the values of the other bits.
+ */
+static void mdio_write_bits(void __iomem *ioaddr, u8 otherbits, u32 data, 
unsigned len)
+{
+   otherbits |= IPG_PC_MGMTDIR;
+   while (len--) {
+   /* IPG_PC_MGMTDATA is a power of 2; compiler knows to shift */
+   u8 d = ((data >> len) & 1) * IPG_PC_MGMTDATA;
+   /* + rather than | allows slight code size microoptimization */
+   ipg_drive_phy_ctl_low_high(ioaddr, d + otherbits);
+   }
+}
+
+/*
  * Read a register from the Physical Layer device located
  * on the IPG NIC, using the IPG PHYCTRL register.
  */
 static int mdio_read(struct net_device * dev, int phy_id, int phy_reg)
 {
void __iomem *ioaddr = ipg_ioaddr(dev);
+   u8 const polarity = ipg_r8(PHY_CTRL) &
+   (IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY);

Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Brent Casavant
On Tue, 8 Jan 2008, Rick Jones wrote:

> Potential bugs notwithstanding, given that this is a STREAM socket, and as
> such shouldn't (I hope, or I'm eating toes for dinner again) have side effects
> like tossing the rest of a datagram, why are you using MSG_PEEK?  Why not
> simply read the N bytes of the message that will have the message length with
> a normal read/recv, and then read that many bytes in the next call?

That's entirely reasonable, and probably a worthwhile change to make.
But, as you say, it doesn't change whether or not this is a bug in
the MSG_PEEK code.

With a small bit of complication I certainly can do what you suggest.

The initial reasoning was that this made it easy to handle the case where
the caller of the library routine (my code which stumbled on this was
part of a small library I wrote as part of the application) did not supply
a sufficiently sized buffer for reception of the next message.  The "easy"
way to do this was a MSG_PEEK to validate the buffer size against the
message size, followed by a regular recv() if the buffer is large enough.
To do what you suggest (which effectively works around the bug) is certainly
possible, but also requires maintaining state between calls to the
"get message" routine, as I need to track whether or not the size has already
been read from the next message on the stream socket.

The change isn't terribly difficult, but wasn't the initial idea.

Plus it's always good to flush a bug out of hiding, right? ;)

Brent

-- 
Brent Casavant  All music is folk music.  I ain't
[EMAIL PROTECTED]never heard a horse sing a song.
Silicon Graphics, Inc.-- Louis Armstrong
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Brent Casavant
On Tue, 8 Jan 2008, Tom Spink wrote:

> Where in the code is the message length being sent across the socket?

In do_producer(), there are the following lines in the main loop:

/* Send random lengths of data */
messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
iov[i].iov_len = messages[i].length;

The entire "struct sockmsg" is sent across the socket, so the first
size_t in each message contains the length of the entire message
(including the size_t).  This size gets picked up at the
recv(...,MSG_PEEK) line in do_consumer().

Thanks,
Brent

-- 
Brent Casavant  All music is folk music.  I ain't
[EMAIL PROTECTED]never heard a horse sing a song.
Silicon Graphics, Inc.-- Louis Armstrong
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()

2008-01-08 Thread Ramkrishna Vepa
Dave,

Got it. These new napi interface changes were introduced by someone else
and we assumed it to be correct. We will make the fix and submit.

Thanks,
Ram

> -Original Message-
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 08, 2008 3:08 PM
> To: Ramkrishna Vepa
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH 3/7]: [NET]: Do not check netif_running() and
carrier
> state in ->poll()
> 
> From: "Ramkrishna Vepa" <[EMAIL PROTECTED]>
> Date: Tue, 8 Jan 2008 18:01:32 -0500
> 
> > Dave,
> > Sorry, should have been clearer. When I meant "brought down" did not
> > mean close, but when a adapter reset is initiated. The
napi_disable() is
> > called only on a close. When the driver does a reset, napi_disable()
is
> > not called.
> 
> You should be doing a napi_disable() during a reset, like every
> other driver does.
> 
> It is the only reliable way to prevent the code path from running.
> 
> Otherwise, you can start resetting the device right after
> that check in the ->poll() routine, and thus still touch
> the device during the reset sequence.
> 
> In short the check is wrong, because it doesn't fully prevent
> what you want it to prevent.  Only a napi_disable() would do
> that fully for you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


my NAPI patches

2008-01-08 Thread David Miller

Can people do me a favor and do more constructive things like test
that my patches really do fix the "device down during packet flood"
bug instead of nit-picking all the individual driver fixups I made?

That's what is useful at this time, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()

2008-01-08 Thread David Miller
From: "Ramkrishna Vepa" <[EMAIL PROTECTED]>
Date: Tue, 8 Jan 2008 18:01:32 -0500

> Dave,
> Sorry, should have been clearer. When I meant "brought down" did not
> mean close, but when a adapter reset is initiated. The napi_disable() is
> called only on a close. When the driver does a reset, napi_disable() is
> not called. 

You should be doing a napi_disable() during a reset, like every
other driver does.

It is the only reliable way to prevent the code path from running.

Otherwise, you can start resetting the device right after
that check in the ->poll() routine, and thus still touch
the device during the reset sequence.

In short the check is wrong, because it doesn't fully prevent
what you want it to prevent.  Only a napi_disable() would do
that fully for you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()

2008-01-08 Thread Ramkrishna Vepa
Dave,
Sorry, should have been clearer. When I meant "brought down" did not
mean close, but when a adapter reset is initiated. The napi_disable() is
called only on a close. When the driver does a reset, napi_disable() is
not called. 

Ram
> -Original Message-
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 08, 2008 2:48 PM
> To: Ramkrishna Vepa
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH 3/7]: [NET]: Do not check netif_running() and
carrier
> state in ->poll()
> 
> From: "Ramkrishna Vepa" <[EMAIL PROTECTED]>
> Date: Tue, 8 Jan 2008 13:17:03 -0500
> 
> > Dave,
> >
> > This change is not required as the macro, is_s2io_card_up() checks
for
> > an internal state of the adapter and not netif's state. We want to
make
> > sure that the adapter registers are not accessed when the adapter is
> > being brought down.
> 
> If the adapter is being brought down, you would have done
> a napi_disable() first and therefore never reach this code
> path.
> 
> The removal is correct, I read how your driver works.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread David Miller
From: Francois Romieu <[EMAIL PROTECTED]>
Date: Tue, 8 Jan 2008 22:36:40 +0100

> [EMAIL PROTECTED] <[EMAIL PROTECTED]> :
> > I take that back.  This patch does NOT fix the leak, at least if
> > ping: sendmsg: No buffer space available
> > is any indication...
> 
> Can you try the patch below ?

Same kind of bug as the RX side :-)  I bet this fixes his
problem...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7]: [NET]: Make ->poll() breakout consistent in Intel ethernet drivers.

2008-01-08 Thread David Miller
From: "Kok, Auke" <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 11:04:59 -0800

> this is exactly the change I was eyeballing and indeed this seems to be the
> general use case in most drivers anyway.
> 
> I'll try to see how this impacts the (especially 4-port) TX performance issue 
> with
> e1000e, but this should be just fine for now, and we can address later anyway.
> 
> Acked-by: Auke Kok <[EMAIL PROTECTED]>
> 

Ok, thanks for reviewing.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Rafael J. Wysocki
On Tuesday, 8 of January 2008, Linus Torvalds wrote:
> 
> On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> > 
> > ok done; I had to fizzle a bit because some things aren't *exactly* a 
> > BUG() statement but I track them anyway (things like the "sleeping in 
> > invalid context" check), so I had to somewhat arbitrarily assign 
> > categories for those. I might fine tune these over time some; if you or 
> > someone else sees problems with categorization please let me know
> 
> Looking good. I wonder if we could also have some way to cross-ref these 
> things with the regression list (notably try to get pointers to them in 
> the regression list). 

I'm thinking about that, but haven't invented any automated solution yet.
I only can manually add references to kerneloops.org, for now.

Greetings,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Tom Spink
On 08/01/2008, Rick Jones <[EMAIL PROTECTED]> wrote:
> Potential bugs notwithstanding, given that this is a STREAM socket, and
> as such shouldn't (I hope, or I'm eating toes for dinner again) have
> side effects like tossing the rest of a datagram, why are you using
> MSG_PEEK?  Why not simply read the N bytes of the message that will have
> the message length with a normal read/recv, and then read that many
> bytes in the next call?
>
> rick jones
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Hi,

Where in the code is the message length being sent across the socket?

-- 
Regards,
Tom Spink
University of Edinburgh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


One more patch... -- Re: Please pull 'upstream-jgarzik' branch of wireless-2.6

2008-01-08 Thread John W. Linville
On Tue, Jan 08, 2008 at 05:23:05PM -0500, John W. Linville wrote:
> Jeff,
> 
> Another round of patches intended for 2.6.25...the biggest factions are
> rt2x00 and b43 updates, as well as some Viro-isms... :-)
> 
> Please let me know if there are any problems!
> 
> John
> 
> P.S.  Copying Dave in case he is handling these requests...FWIW, it
> will definitely depend on the patches already in netdev-2.6#upstream...

I left out a patch.  I have pushed it on top of the previous request.

Let me know if there are problems!

Thanks,

John

---

The following changes since commit f94de7b013f78ad8bbe1064c108dd55141efb177:
  Miguel Botón (1):
iwlwifi: fix compilation warning in 'iwl-4965.c'

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git 
upstream-jgarzik

Michael Buesch (1):
  zd1211rw: fix alignment for QOS and WDS frames

 drivers/net/wireless/zd1211rw/zd_mac.c |   17 -
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wireless/zd1211rw/zd_mac.c 
b/drivers/net/wireless/zd1211rw/zd_mac.c
index 14fb727..7b86930 100644
--- a/drivers/net/wireless/zd1211rw/zd_mac.c
+++ b/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -623,6 +623,8 @@ int zd_mac_rx(struct ieee80211_hw *hw, const u8 *buffer, 
unsigned int length)
const struct rx_status *status;
struct sk_buff *skb;
int bad_frame = 0;
+   u16 fc;
+   bool is_qos, is_4addr, need_padding;
 
if (length < ZD_PLCP_HEADER_SIZE + 10 /* IEEE80211_1ADDR_LEN */ +
 FCS_LEN + sizeof(struct rx_status))
@@ -674,9 +676,22 @@ int zd_mac_rx(struct ieee80211_hw *hw, const u8 *buffer, 
unsigned int length)
&& !mac->pass_ctrl)
return 0;
 
-   skb = dev_alloc_skb(length);
+   fc = le16_to_cpu(*((__le16 *) buffer));
+
+   is_qos = ((fc & IEEE80211_FCTL_FTYPE) == IEEE80211_FTYPE_DATA) &&
+((fc & IEEE80211_FCTL_STYPE) == IEEE80211_STYPE_QOS_DATA);
+   is_4addr = (fc & (IEEE80211_FCTL_TODS | IEEE80211_FCTL_FROMDS)) ==
+  (IEEE80211_FCTL_TODS | IEEE80211_FCTL_FROMDS);
+   need_padding = is_qos ^ is_4addr;
+
+   skb = dev_alloc_skb(length + (need_padding ? 2 : 0));
if (skb == NULL)
return -ENOMEM;
+   if (need_padding) {
+   /* Make sure the the payload data is 4 byte aligned. */
+   skb_reserve(skb, 2);
+   }
+
memcpy(skb_put(skb, length), buffer, length);
 
ieee80211_rx_irqsafe(hw, skb, &stats);
-- 
John W. Linville
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()

2008-01-08 Thread David Miller
From: "Ramkrishna Vepa" <[EMAIL PROTECTED]>
Date: Tue, 8 Jan 2008 13:17:03 -0500

> Dave,
> 
> This change is not required as the macro, is_s2io_card_up() checks for
> an internal state of the adapter and not netif's state. We want to make
> sure that the adapter registers are not accessed when the adapter is
> being brought down.

If the adapter is being brought down, you would have done
a napi_disable() first and therefore never reach this code
path.

The removal is correct, I read how your driver works.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread David Miller
From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 11:51:53 -0500

> I haven't thought about this too hard, but can we approximate this by 
> moving scaked data into a sacked queue, then if something bad happens 
> merge this back into the retransmit queue?

That defeats the impetus for the change.

We want to free up the data, say, 2 packets at a time as
ACKs come in.  The key goal is smooth liberation of
retransmit queue packets over time.

The big problem is that recovery from even a single packet loss in a
window makes us run kfree_skb() for a all the packets in a full
window's worth of data when recovery completes.

If we just move such packets to a seperate list, we still have to
iterate over all of them when the cumulative ACK arrives.

This problem, that retransmit queue liberation is not smooth, is the
biggest flaw in how SACK is specified.  I mean, consider Ilpo's
mentioned case of 500,000 packet windows.  The issue cannot be
ignored.  SACK is clearly broken.

You speak of a path in Linux where we can reneg on SACKs, but I doubt
it really ever runs because of how aggressive the queue collapser is.
Alexey even has a comment there:

 * This must not ever occur. */

To be honest this code sits here because it was written before the
queue collapser was coded up.

Really, find me a box where the LINUX_MIB_OFOPRUNED or
LINUX_MIB_RECVPRUNED counters are anything other than zero.

So this is a non-issue and I did consider it before proposing that we
redefine SACK.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Rick Jones
Potential bugs notwithstanding, given that this is a STREAM socket, and 
as such shouldn't (I hope, or I'm eating toes for dinner again) have 
side effects like tossing the rest of a datagram, why are you using 
MSG_PEEK?  Why not simply read the N bytes of the message that will have 
the message length with a normal read/recv, and then read that many 
bytes in the next call?


rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


AF_UNIX MSG_PEEK bug?

2008-01-08 Thread Brent Casavant
Hello,

I was coding an application which passes variable-length messages
over an AF_UNIX SOCK_STREAM socket.  As such I would pass a message
length embedded as the first element in the message, use recv(...,MSG_PEEK)
to determine the message length, and perform the necessary allocation
on the receiving end before performing the real recv().

However, the program would occasionally get into a situation where
a call to recv(sockfd, &buf, len, MSG_PEEK) returns some number
of bytes less than the requested length, and persists in this state
(i.e. retrying the call continues to return the same amount of data)
even when more than sufficient data is known to have been successfully
written to the socket.

By my reading of the af_unix.c kernel code, I believe the problem lies in
unix_stream_recvmsg().  Here we find that, under MSG_PEEK, only a single
skb is pulled from the receive queue and the data copied into the user
buffer.  This fails to account for the fact that this skb could contain less
data than is necessary to satisfy the MSG_PEEK request.  The "short" skb
is then pushed back onto the queue, and thus a retried MSG_PEEK will again
return a short response.  The only way to clear this condition is to
perform a recv() without specifying MSG_PEEK.

I'll be happy to submit a patch to fix this, but would like to first
confirm with others that A) this is incorrect behavior for MSG_PEEK
semantics and B) my approach to the fix sounds reasonable.

What I propose as a fix is, for MSG_PEEK operations, dequeueing enough
skbs to satisfy the operation (checking of course that there is enough
data in the queue in the first place), copying out the requested data
to the user buffer, then pushing those skbs back onto the receive queue.

I'm not terribly familiar with the Linux networking code, so if the
above approach sounds misguided, please feel free to let me know.

Below you'll find an example program (not the real program, I know
there's some no-nos in here, this was written simply to demonstrate
the bug) that demonstrates the problem.  It fails quite reliably and
quickly on an SGI Altix IA64, though given the nature of the problem I
don't believe it's architecture-specific.  Running the program with
no arguments produces output similar to the following:

consumer: recv of message size is not progressing at time 1199829455094356.
consumer: want 8 bytes, but only receiving 5.
consumer: SIOCOUTQ=0, SIOCINQ=202421
consumer: had 36985 successes
producer: sendmsg failed at time 1199829455094571

The meaning of the first two lines is obvious (UNIX time in microseconds).
The third line gives the output and input queue lengths for the AF_UNIX
socket at the time of the error message, to confirm that there is indeed
sufficient data available for satisfying the MSG_PEEK operation.  The
fourth line details how many iterations of the main loop of the program
were successful until we got hung up on the MSG_PEEK.  The final line
simply confirms that the producer failed as a result of the consumer
closing the socket, not the other way around.

Thanks,
Brent Casavant

#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 

#include 

#define SOCKNAME "/tmp/peektest.sock"
#define MAXLEN 2048
#define NUMMSG 64
#define ERRLOOPS 5

struct sockmsg {
size_t length;
char payload[MAXLEN];
};

struct sockmsg messages[NUMMSG];
struct iovec iov[NUMMSG];

void
do_consumer(char *socketname) {
struct stat sb;
int status;
int sockfd, fd;
struct sockaddr_un su;
ssize_t bytes;
struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0,
  .msg_iov = iov, .msg_iovlen = 1,
  .msg_control = NULL, .msg_controllen = 0,
  .msg_flags = 0 };
int successes = 0;

/* Wait for socket to become available */
do {
sleep(1);
status = stat(socketname, &sb);
} while ((status < 0) && (ENOENT == errno));
if ((status < 0) && (ENOENT != errno)) {
printf("consumer: stat failed\n");
return;
}
if (!S_ISSOCK(sb.st_mode)) {
printf("consumer: %s is not a socket\n", socketname);
return;
}

/* Connect to the producer */
sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
if (sockfd < 0) {
printf("consumer: socket() failed\n");
return;
}
su.sun_family = PF_UNIX;
strcpy(su.sun_path, socketname);
if (connect(sockfd, (struct sockaddr*) &su, SUN_LEN(&su)) < 0) {
printf("consumer: connect() failed\n");
return;
}

/* Main loop here.  Inspect the first size_t in the incoming
 * message to determine the message length.  Then read in
 * the entire message.
 */
do {

FW: ccid2/ccid3 oopses

2008-01-08 Thread devzero
Hello !

as suggested by Ian McDonald, i`m forwarding this to dccp and netdev mailing 
lists.


> hi !
>
> i know dccp_ccid2 and ccid3 modules are considered experimental, but i`m 
> unsure if i probably triggered a bug inside or outside that modules here (i`m 
> no kernel developer)
>
> apparently, i got crashes when loading/unloading other driver modules just 
> after ccid2 or ccid3 had been loaded/unloaded _once_ (have not used them at 
> all, just modprobe module;modprobe -r module)
>
> this was detected during some hardcore module load/unload testing session and 
> apparently these modules seem to be the root cause of other modules crashing, 
> so they seem to leave the system in an inconsistent state after load/unload.
>
> this can be reproduced with recent 2.6.24rc6 kernel which was mostly built 
> with allmodconfig.
> i could not reproduce this with a more minimalistic configuration, e.g. the 
> suse kernel of the day runs fine.
>
> the easiest way to reproduce is:
>
> while true;do modprobe dccp_ccid2/3;modprobe -r dccp_ccid2/3;done
> after short time, the kernel oopses (messages below)
>
> i`m not sure if this is worth to be filed at kernel bugzilla, so i`m 
> contacting you personally first.
>
> i`d happily assist in helping debug this or provide more input (.config etc) 
> if you want to take a look.
>
> regards
> Roland 
>
>
> [ 2322.177054] CCID: Unregistered CCID 2 (ccid2)
> [ 2322.377927] CCID: Registered CCID 2 (ccid2)
>
> [ 2322.413793] BUG: unable to handle kernel paging request at virtual address 
> 4864
> [ 2322.425066] printing eip: c01792e1 *pde = 
> [ 2322.431523] Oops:  [#1] SMP
> [ 2322.435249] Modules linked in: dccp_ccid2 dccp edd iptable_filter 
> ip_tables ip6table_filter ip6_tables x_tables ipv6 microcode firmware_class 
> fuse loop dm_mod ide_cd cdrom pata_acpi ata_piix ahci parport_pc floppy 
> ata_generic parport pcnet32 rtc_cmos libata rtc_core rtc_lib mii pcspkr 
> container thermal piix generic i2c_piix4 processor button ac i2c_core 
> power_supply shpchp ide_core intel_agp pci_hotplug agpgart mousedev evdev sg 
> ext3 jbd mbcache sd_mod mptspi mptscsih mptbase scsi_transport_spi ehci_hcd 
> uhci_hcd scsi_mod usbcore
> [ 2322.489115]
> [ 2322.491535] Pid: 1730, comm: kjournald Not tainted (2.6.24-rc6 #4)
> [ 2322.497266] EIP: 0060:[] EFLAGS: 00010002 CPU: 0
> [ 2322.503205] EIP is at kmem_cache_alloc+0x5d/0xa6
> [ 2322.508789] EAX:  EBX: 0282 ECX: c03750a0 EDX: 4864
> [ 2322.514864] ESI: c1408314 EDI: 4864 EBP: c03750a0 ESP: df9cfe94
> [ 2322.521110]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
> [ 2322.527346] Process kjournald (pid: 1730, ti=df9ce000 task=deaf31a0 
> task.ti=df9ce000)
> [ 2322.535722] Stack: c016094d c1408314 4864 00011200 df4032a0 df408e40 
> def45000 000f
> [ 2322.545350]00011210 c016094d 0010 e08cb15f  dead9c00 
> df161ab8 0004
> [ 2322.556833] def45000 000f  c019acdf  
> df402940 0010
> [ 2322.565168] Call Trace:
> [ 2322.568637]  [] mempool_alloc+0x24/0xc2
> [ 2322.573169]  [] mempool_alloc+0x24/0xc2
> [ 2322.577175]  [] __journal_file_buffer+0x9b/0x11c [jbd]
> [ 2322.585033]  [] bio_alloc_bioset+0x8c/0xe6
> [ 2322.589301]  [] bio_alloc+0xb/0x17
> [ 2322.593309]  [] submit_bh+0x6e/0xf8
> [ 2322.597358]  [] journal_commit_transaction+0x6de/0xbe8 [jbd]
> [ 2322.605109]  [] lock_timer_base+0x19/0x35
> [ 2322.610478]  [] kjournald+0xae/0x1dd [jbd]
> [ 2322.616182]  [] autoremove_wake_function+0x0/0x33
> [ 2322.621341]  [] kjournald+0x0/0x1dd [jbd]
> [ 2322.628588]  [] kthread+0x38/0x60
> [ 2322.633306]  [] kthread+0x0/0x60
> [ 2322.637365]  [] kernel_thread_helper+0x7/0x10
> [ 2322.645002]  ===
> [ 2322.649049] Code: 3e 85 ff 89 7c 24 08 75 1b 89 14 24 8b 54 24 0c 83 c9 ff 
> 89 e8 89 74 24 04 e8 2b fb ff ff 89 44 24 08 eb 0c 8b 54 24 08 8b 46 0c <8b> 
> 04 82 89 06 89 d8 50 9d 0f 1f 84 00 00 00 00 00 66 83 7c 24
> [ 2322.673340] EIP: [] kmem_cache_alloc+0x5d/0xa6 SS:ESP 
> 0068:df9cfe94
> [ 2322.681327] ---[ end trace 35dbcab07ee48cc5 ]---
> [ 2322.737700] [ cut here ]
> [ 2322.748822] Kernel BUG at c0199e6d [verbose debug info unavailable]
> [ 2322.755960] invalid opcode:  [#2] SMP
> [ 2322.760773] Modules linked in: dccp_ccid2 dccp edd iptable_filter 
> ip_tables ip6table_filter ip6_tables x_tables ipv6 microcode firmware_class 
> fuse loop dm_mod ide_cd cdrom pata_acpi ata_piix ahci parport_pc floppy 
> ata_generic parport pcnet32 rtc_cmos libata rtc_core rtc_lib mii pcspkr 
> container thermal piix generic i2c_piix4 processor button ac i2c_core 
> power_supply shpchp ide_core intel_agp pci_hotplug agpgart mousedev evdev sg 
> ext3 jbd mbcache sd_mod mptspi mptscsih mptbase scsi_transport_spi ehci_hcd 
> uhci_hcd scsi_mod usbcore
> [ 2322.813338]
> [ 2322.817134] Pid: 3125, comm: klogd Tainted: G  D (2.6.24-rc6 #4)
> [ 2322.821416] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
> [ 2322.828832] EIP

Re: [PATCH] Add Linksys WCF11 to hostap driver

2008-01-08 Thread John W. Linville
On Sun, Dec 30, 2007 at 11:39:31AM -0500, Pavel Roskin wrote:
> Quoting Marcin Juszkiewicz <[EMAIL PROTECTED]>:
>
>> +   PCMCIA_DEVICE_PROD_ID1233(
>> +   "The Linksys Group, Inc.", "Wireless Network CF  Card", 
>> "ISL37300P",
>> +   0xa5f472c2, 0x9c05598d, 0xc9049a39),
>
> Acked-by: Pavel Roskin <[EMAIL PROTECTED]>

The ID1234 version of this patch is already in Linus' tree.  Feel free
to submit a patch that redacts it to the ID123 form.

John
-- 
John W. Linville
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread Francois Romieu
[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
> I take that back.  This patch does NOT fix the leak, at least if
> ping: sendmsg: No buffer space available
> is any indication...

Can you try the patch below ?

diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..c304e5c 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -860,7 +860,7 @@ static void ipg_nic_txfree(struct net_device *dev)
void __iomem *ioaddr = sp->ioaddr;
unsigned int curr;
u64 txd_map;
-   unsigned int released, pending;
+   unsigned int released, pending, dirty;
 
txd_map = (u64)sp->txd_map;
curr = ipg_r32(TFD_LIST_PTR_0) -
@@ -869,9 +869,9 @@ static void ipg_nic_txfree(struct net_device *dev)
IPG_DEBUG_MSG("_nic_txfree\n");
 
pending = sp->tx_current - sp->tx_dirty;
+   dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
 
for (released = 0; released < pending; released++) {
-   unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
struct sk_buff *skb = sp->TxBuff[dirty];
struct ipg_tx *txfd = sp->txd + dirty;
 
@@ -898,6 +898,7 @@ static void ipg_nic_txfree(struct net_device *dev)
 
sp->TxBuff[dirty] = NULL;
}
+   dirty = (dirty + 1) % IPG_TFDLIST_LENGTH;
}
 
sp->tx_dirty += released;
-- 
1.5.3.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/net/ipg.c: fix horrible mdio_read and _write

2008-01-08 Thread Francois Romieu
[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...]
> diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
> index 3860fcd..b3d3fc8 100644
> --- a/drivers/net/ipg.c
> +++ b/drivers/net/ipg.c
> @@ -202,12 +202,31 @@ static u16 read_phy_bit(void __iomem * ioaddr, u8 
> phyctrlpolarity)
>  }
>  
>  /*
> + * Transmit the given bits, MSB-first, through the MgmtData bit (bit 1)
> + * of the PhyCtrl register. 1 <= len <= 32.  "ioaddr" is the register
> + * address, and "otherbits" are the values of the other bits.
> + */
> +static void mdio_write_bits(void __iomem *ioaddr, u8 otherbits, u32 data, 
> unsigned len)
> +{
> + otherbits |= IPG_PC_MGMTDIR;
> + do {
> + /* IPG_PC_MGMTDATA is a power of 2; compiler knows to shift */
> + u8 d = ((data >> --len) & 1) * IPG_PC_MGMTDATA;
> + /* + rather than | lets compiler microoptimize better */
> + ipg_drive_phy_ctl_low_high(ioaddr, d + otherbits);
> + } while (len);

Imho something is not quite right when the code needs a comment every line
and I am mildly convinced that we really want to honk an "optimizing mdio
methods is ok" signal around.

"while (len--) {" is probably more akpm-ish btw.

[...]
>  static int mdio_read(struct net_device * dev, int phy_id, int phy_reg)
>  {
>   void __iomem *ioaddr = ipg_ioaddr(dev);
> + u8 const polarity = ipg_r8(PHY_CTRL) &
> + (IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY);

(IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY) appears twice. I would not
mind a #define for it.

> @@ -221,75 +240,30 @@ static int mdio_read(struct net_device * dev, int 
> phy_id, int phy_reg)
[...]
> - for (i = 0; i < p[6].len; i++) {
> - p[6].field |=
> - (read_phy_bit(ioaddr, polarity) << (p[6].len - 1 - i));
> - }
> + send_three_state(ioaddr, polarity); /* TA first bit */
> + (void)read_phy_bit(ioaddr, polarity);   /* TA second bit */
> +
> + for (i = 0; i < 16; i++)
> + data += data + read_phy_bit(ioaddr, polarity);

Huh ?

> @@ -299,11 +273,13 @@ static int mdio_read(struct net_device * dev, int 
> phy_id, int phy_reg)
>  static void mdio_write(struct net_device *dev, int phy_id, int phy_reg, int 
> val)
[...]
> + mdio_write_bits(ioaddr, polarity, GMII_PREAMBLE, 32);
> + mdio_write_bits(ioaddr, polarity, GMII_ST << 14 | GMII_WRITE << 12 |
> +   phy_id << 7 | phy_reg << 2 |
> +   0x2, 16);

Use the 80 cols luke:
  phy_id << 7 | phy_reg << 2 | 0x2, 16);

It is a nice improvement otherwise.

-- 
Ueimor
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[SOCK]: Adds a rcu_dereference() in sk_filter

2008-01-08 Thread Eric Dumazet
It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU 
protection for sk_filter(), without a rcu_dereference()


Either we need a rcu_dereference(), either a comment should explain why we 
dont need it. I vote for the former.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff --git a/include/net/sock.h b/include/net/sock.h
index 67e35c7..6e1542d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -944,7 +944,7 @@ static inline int sk_filter(struct sock *sk, struct sk_buff 
*skb)
return err;

rcu_read_lock_bh();
-   filter = sk->sk_filter;
+   filter = rcu_dereference(sk->sk_filter);
if (filter) {
unsigned int pkt_len = sk_run_filter(skb, filter->insns,
filter->len);


Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Jay Vosburgh
Andy Gospodarek <[EMAIL PROTECTED]> wrote:
[...]
>Jay's patches will not fix this issue.  I think something like this did
>it for me, but as I mentioned to Jay in the last thread, I'm not
>convinced it doesn't violate some of the locking expectations we have.
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 423298c..3c6619a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3915,7 +3915,7 @@ static void bond_set_multicast_list(struct net_device 
>*bond_dev)
>   struct bonding *bond = bond_dev->priv;
>   struct dev_mc_list *dmi;
>
>-  write_lock_bh(&bond->lock);
>+  read_lock(&bond->lock);
>
>   /*
>* Do promisc before checking multicast_mode
>@@ -3957,7 +3957,7 @@ static void bond_set_multicast_list(struct net_device 
>*bond_dev)
>   bond_mc_list_destroy(bond);
>   bond_mc_list_copy(bond_dev->mc_list, bond, GFP_ATOMIC);
>
>-  write_unlock_bh(&bond->lock);
>+  read_unlock(&bond->lock);
> }
>
> /*

Actually, I think we might be good here with no locks at all, as
it appears that all of the accesses to and manipulations of the
bond->mc_list are protected under RTNL.  I haven't checked this 100%,
but it looks that way to me after 20 minutes of poking around.  I'm
pretty sure that bonding doesn't internally mess with the mc_lists
without RTNL, it's the outside callers that I'm not entirely sure of.

I delve into "no locks" because bond_set_multicast_list should
do a bunch of things with no extra locks beyond RTNL (all of the calls
to bond_set_promisc, and _allmulti), so simply removing the acquisition
of bond->lock would help there, too.  I don't think we'll go down the
promisc or allmulti paths when called from ipv6 (which holds extra locks
in addition to RTNL) because those (apparently) won't alter the
IFF_PROMISC or IFF_ALLMULTI flags.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Linus Torvalds


On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> ok done; I had to fizzle a bit because some things aren't *exactly* a 
> BUG() statement but I track them anyway (things like the "sleeping in 
> invalid context" check), so I had to somewhat arbitrarily assign 
> categories for those. I might fine tune these over time some; if you or 
> someone else sees problems with categorization please let me know

Looking good. I wonder if we could also have some way to cross-ref these 
things with the regression list (notably try to get pointers to them in 
the regression list). 

Right now, the regression list keeps track of the things that are closed, 
but kerneloops.org obviously doesn't see that, so it's not obvious which 
oopses are "uninteresting" since they've been fixed.

Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Jay Vosburgh
Krzysztof Oledzki <[EMAIL PROTECTED]> wrote:

>On Mon, 7 Jan 2008, Jay Vosburgh wrote:
>
>>  Following are three fixes to fix locking problems and
>> silence locking-related warnings in the current 2.6.24-rc.
>>
>>  patch 1: fix locking in sysfs primary/active selection
>>
>>  Call core network functions with expected locks to
>> eliminate potential deadlock and silence warnings.
>>
>>  patch 2: fix ASSERT_RTNL that produces spurious warnings
>>
>>  Relocate ASSERT_RTNL to remove a false warning; after patch,
>> ASSERT is located in code that holds only RTNL (additional locks were
>> causing the ASSERT to trip)
>>
>>  patch 3: fix locking during alb failover and slave removal
>>
>>  Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
>> Eliminates deadlock and silences warnings.
>>
>>  Patches are against the current netdev-2.6#upstream branch.
>>
>>  Please apply for 2.6.24.
>
>2.6.24-rc7 + patches #1, #2, #3:
>
>bonding: bond0: setting mode to active-backup (1).
>bonding: bond0: Setting MII monitoring interval to 100.
>ADDRCONF(NETDEV_UP): bond0: link is not ready
>bonding: bond0: Adding slave eth0.
>e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow 
>Control: RX/TX
>bonding: bond0: making interface eth0 the new active one.
>bonding: bond0: first active interface up!
>bonding: bond0: enslaving eth0 as an active interface with an up link.
>bonding: bond0: Adding slave eth1.
>ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
>
>=
>[ INFO: possible irq lock inversion dependency detected ]
>2.6.24-rc7 #1
>-
>events/0/9 just changed the state of lock:
> (&mc->mca_lock){-+..}, at: [] mld_ifc_timer_expire+0x130/0x1fb
>but this lock took another, soft-read-irq-unsafe lock in the past:
> (&bond->lock){-.--}
>
>and interrupts could create inverse lock ordering between them.

Just to be clear: the patch set I posted yesterday was not
intended to resolve the lockdep problem; I haven't studied that one yet.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Andy Gospodarek
On Tue, Jan 08, 2008 at 07:50:22PM +0100, Krzysztof Oledzki wrote:
> 
> 
> On Mon, 7 Jan 2008, Jay Vosburgh wrote:
> 
> > Following are three fixes to fix locking problems and
> >silence locking-related warnings in the current 2.6.24-rc.
> >
> > patch 1: fix locking in sysfs primary/active selection
> >
> > Call core network functions with expected locks to
> >eliminate potential deadlock and silence warnings.
> >
> > patch 2: fix ASSERT_RTNL that produces spurious warnings
> >
> > Relocate ASSERT_RTNL to remove a false warning; after patch,
> >ASSERT is located in code that holds only RTNL (additional locks were
> >causing the ASSERT to trip)
> >
> > patch 3: fix locking during alb failover and slave removal
> >
> > Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
> >Eliminates deadlock and silences warnings.
> >
> > Patches are against the current netdev-2.6#upstream branch.
> >
> > Please apply for 2.6.24.
> 
> 2.6.24-rc7 + patches #1, #2, #3:
> 
> bonding: bond0: setting mode to active-backup (1).
> bonding: bond0: Setting MII monitoring interval to 100.
> ADDRCONF(NETDEV_UP): bond0: link is not ready
> bonding: bond0: Adding slave eth0.
> e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow 
> Control: RX/TX
> bonding: bond0: making interface eth0 the new active one.
> bonding: bond0: first active interface up!
> bonding: bond0: enslaving eth0 as an active interface with an up link.
> bonding: bond0: Adding slave eth1.
> ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
> 
> =
> [ INFO: possible irq lock inversion dependency detected ]
> 2.6.24-rc7 #1
> -
> events/0/9 just changed the state of lock:
>  (&mc->mca_lock){-+..}, at: [] mld_ifc_timer_expire+0x130/0x1fb
> but this lock took another, soft-read-irq-unsafe lock in the past:
>  (&bond->lock){-.--}
> 
> and interrupts could create inverse lock ordering between them.
> 
> 

Jay's patches will not fix this issue.  I think something like this did
it for me, but as I mentioned to Jay in the last thread, I'm not
convinced it doesn't violate some of the locking expectations we have.

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 423298c..3c6619a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3915,7 +3915,7 @@ static void bond_set_multicast_list(struct net_device 
*bond_dev)
struct bonding *bond = bond_dev->priv;
struct dev_mc_list *dmi;
 
-   write_lock_bh(&bond->lock);
+   read_lock(&bond->lock);
 
/*
 * Do promisc before checking multicast_mode
@@ -3957,7 +3957,7 @@ static void bond_set_multicast_list(struct net_device 
*bond_dev)
bond_mc_list_destroy(bond);
bond_mc_list_copy(bond_dev->mc_list, bond, GFP_ATOMIC);
 
-   write_unlock_bh(&bond->lock);
+   read_unlock(&bond->lock);
 }
 
 /*


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Arjan van de Ven

Linus Torvalds wrote:


On Tue, 8 Jan 2008, Arjan van de Ven wrote:

the database has the information so it's just a matter of slightly different
php code ;)
Before I do that... do you want the BUG's separate, part of the warnings or
part of the oopses?
(I rather make this change once ;)


I'd like them all separate, they tend to be very different and contain 
different information.


Put the warnings last, as the least important. Oopses at the top, since 
they tend to be the ones that are less expected.


ok done; I had to fizzle a bit because some things aren't *exactly* a BUG() 
statement
but I track them anyway (things like the "sleeping in invalid context" check), 
so I
had to somewhat arbitrarily assign categories for those. I might fine tune 
these over time
some; if you or someone else sees problems with categorization please let me 
know

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7]: [NET]: Make ->poll() breakout consistent in Intel ethernet drivers.

2008-01-08 Thread Kok, Auke
David Miller wrote:
> [NET]: Make ->poll() breakout consistent in Intel ethernet drivers.
> 
> This makes the ->poll() routines of the E100, E1000, E1000E, IXGB, and
> IXGBE drivers complete ->poll() consistently.
> 
> Now they will all break out when the amount of RX work done is less
> than 'budget'.
> 
> At a later time, we may want put back code to include the TX work as
> well (as at least one other NAPI driver does, but by in large NAPI
> drivers do not do this).  But if so, it should be done consistently
> across the board to all of these drivers.
> 
> Signed-off-by: David S. Miller <[EMAIL PROTECTED]>


this is exactly the change I was eyeballing and indeed this seems to be the
general use case in most drivers anyway.

I'll try to see how this impacts the (especially 4-port) TX performance issue 
with
e1000e, but this should be just fine for now, and we can address later anyway.

Acked-by: Auke Kok <[EMAIL PROTECTED]>


Auke


> ---
>  drivers/net/e100.c |7 +++
>  drivers/net/e1000/e1000_main.c |   10 +-
>  drivers/net/e1000e/netdev.c|8 
>  drivers/net/ixgb/ixgb_main.c   |7 +++
>  drivers/net/ixgbe/ixgbe_main.c |8 
>  5 files changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
> index 68316f1..b87402b 100644
> --- a/drivers/net/e100.c
> +++ b/drivers/net/e100.c
> @@ -1991,13 +1991,12 @@ static int e100_poll(struct napi_struct *napi, int 
> budget)
>   struct nic *nic = container_of(napi, struct nic, napi);
>   struct net_device *netdev = nic->netdev;
>   unsigned int work_done = 0;
> - int tx_cleaned;
>  
>   e100_rx_clean(nic, &work_done, budget);
> - tx_cleaned = e100_tx_clean(nic);
> + e100_tx_clean(nic);
>  
> - /* If no Rx and Tx cleanup work was done, exit polling mode. */
> - if((!tx_cleaned && (work_done == 0))) {
> + /* If budget not fully consumed, exit the polling mode */
> + if (work_done < budget) {
>   netif_rx_complete(netdev, napi);
>   e100_enable_irq(nic);
>   }
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 9de7144..13d57b0 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
>  {
>   struct e1000_adapter *adapter = container_of(napi, struct 
> e1000_adapter, napi);
>   struct net_device *poll_dev = adapter->netdev;
> - int tx_cleaned = 0, work_done = 0;
> + int work_done = 0;
>  
>   /* Must NOT use netdev_priv macro here. */
>   adapter = poll_dev->priv;
> @@ -3929,16 +3929,16 @@ e1000_clean(struct napi_struct *napi, int budget)
>* simultaneously.  A failure obtaining the lock means
>* tx_ring[0] is currently being cleaned anyway. */
>   if (spin_trylock(&adapter->tx_queue_lock)) {
> - tx_cleaned = e1000_clean_tx_irq(adapter,
> - &adapter->tx_ring[0]);
> + e1000_clean_tx_irq(adapter,
> +&adapter->tx_ring[0]);
>   spin_unlock(&adapter->tx_queue_lock);
>   }
>  
>   adapter->clean_rx(adapter, &adapter->rx_ring[0],
> &work_done, budget);
>  
> - /* If no Tx and not enough Rx work done, exit the polling mode */
> - if ((!tx_cleaned && (work_done == 0))) {
> + /* If budget not fully consumed, exit the polling mode */
> + if (work_done < budget) {
>   if (likely(adapter->itr_setting & 3))
>   e1000_set_itr(adapter);
>   netif_rx_complete(poll_dev, napi);
> diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> index dd9698c..4a6fc74 100644
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
> @@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int 
> budget)
>  {
>   struct e1000_adapter *adapter = container_of(napi, struct 
> e1000_adapter, napi);
>   struct net_device *poll_dev = adapter->netdev;
> - int tx_cleaned = 0, work_done = 0;
> + int work_done = 0;
>  
>   /* Must NOT use netdev_priv macro here. */
>   adapter = poll_dev->priv;
> @@ -1394,14 +1394,14 @@ static int e1000_clean(struct napi_struct *napi, int 
> budget)
>* simultaneously.  A failure obtaining the lock means
>* tx_ring is currently being cleaned anyway. */
>   if (spin_trylock(&adapter->tx_queue_lock)) {
> - tx_cleaned = e1000_clean_tx_irq(adapter);
> + e1000_clean_tx_irq(adapter);
>   spin_unlock(&adapter->tx_queue_lock);
>   }
>  
>   adapter->clean_rx(adapter, &work_done, budget);
>  
> - /* If no Tx and not enough Rx work done, exit the polling mode */
> - if ((!tx_cleaned && (work_done < budget))) {
> + /* If budget not fully consumed, exit the polling mode */
> + if (work_done < 

Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24

2008-01-08 Thread Krzysztof Oledzki



On Mon, 7 Jan 2008, Jay Vosburgh wrote:


Following are three fixes to fix locking problems and
silence locking-related warnings in the current 2.6.24-rc.

patch 1: fix locking in sysfs primary/active selection

Call core network functions with expected locks to
eliminate potential deadlock and silence warnings.

patch 2: fix ASSERT_RTNL that produces spurious warnings

Relocate ASSERT_RTNL to remove a false warning; after patch,
ASSERT is located in code that holds only RTNL (additional locks were
causing the ASSERT to trip)

patch 3: fix locking during alb failover and slave removal

Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
Eliminates deadlock and silences warnings.

Patches are against the current netdev-2.6#upstream branch.

Please apply for 2.6.24.


2.6.24-rc7 + patches #1, #2, #3:

bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow 
Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

=
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc7 #1
-
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
4 locks held by events/0/9:
 #0:  (events){--..}, at: [] run_workqueue+0x87/0x1b6
 #1:  ((linkwatch_work).work){--..}, at: [] run_workqueue+0x87/0x1b6
 #2:  (rtnl_mutex){--..}, at: [] linkwatch_event+0x5/0x22
 #3:  (&ndev->lock){-.-+}, at: [] mld_ifc_timer_expire+0x17/0x1fb

the first lock's dependencies:
-> (&mc->mca_lock){-+..} ops: 10 {
   initial-use  at:
[] dump_trace+0x83/0x8d
[] __lock_acquire+0x4ba/0xc07
[] save_stack_trace+0x20/0x3a
[] __lock_acquire+0xbbf/0xc07
[] ipv6_dev_mc_inc+0x24d/0x31c
[] lock_acquire+0x79/0x93
[] igmp6_group_added+0x18/0x11d
[] _spin_lock_bh+0x3b/0x64
[] igmp6_group_added+0x18/0x11d
[] igmp6_group_added+0x18/0x11d
[] trace_hardirqs_on+0x122/0x14c
[] ipv6_dev_mc_inc+0x2a3/0x31c
[] ipv6_dev_mc_inc+0x24d/0x31c
[] ipv6_dev_mc_inc+0x2d8/0x31c
[] ipv6_dev_mc_inc+0x0/0x31c
[] ipv6_add_dev+0x21c/0x24b
[] ndisc_ifinfo_sysctl_change+0x0/0x1ef
[] addrconf_init+0x13/0x193
[] proc_net_fops_create+0x10/0x21
[] ip6_flowlabel_init+0x1e/0x20
[] inet6_init+0x1f0/0x2ad
[] kernel_init+0x150/0x2b7
[] kernel_init+0x0/0x2b7
[] kernel_init+0x0/0x2b7
[] kernel_thread_helper+0x7/0x10
[] 0x
   in-softirq-W at:
[] mark_lock+0x64/0x451
[] __lock_acquire+0x440/0xc07
[] restore_nocheck+0x12/0x15
[] lock_acquire+0x79/0x93
[] mld_ifc_timer_expire+0x130/0x1fb
[] mld_ifc_timer_expire+0x0/0x1fb
[] _spin_lock_bh+0x3b/0x64
[] mld_ifc_timer_expire+0x130/0x1fb
[] mld_ifc_timer_expire+0x130/0x1fb
[] mld_ifc_timer_expire+0x0/0x1fb
[] trace_hardirqs_on+0x10c/0x14c
[] mld_ifc_timer_expire+0x0/0x1fb
[] run_timer_softirq+0xfa/0x15d
[] __do_softirq+0x56/0xdb
[] trace_hardirqs_on+0x10c/0x14c
[] __do_softirq+0x68/0xdb
[] do_softirq+0x36/0x51
[] local_bh_enable_ip+0xad/0xed
[] rt_run_flush+0x64/0x8b
[] fib_netdev_event+0x61/0x65
[] notifier_call_chain+0x2a/0x52
[] raw_notifier_call_chain+0x17/0x1a
[] netdev_state_change+0x18/0x29
[] __linkwatch_run_qu

Re: [PATCH 6/8] [PATCH] Split up rndis_host.c

2008-01-08 Thread David Brownell
On Tuesday 08 January 2008, Johannes Berg wrote:
> 
> > I see that the rndis_wext.c Kconfig won't kick in unless the
> > 802.11 stuff is available ... what additional dependencies
> > does that imply for a fatter rndis_host module?
> 
> No extra modules are currently required for just plain wext [1]. In the
> future, we hope to migrate stuff to cfg80211 though, which is a separate
> module.

Hmmm.  Well, go the current route then ... but come up with a good
plan to split it up later, so that connecting a typical non-wireless
RNDIS device won't incur such needless extra modules.

- Dave


> johannes
> 
> [1] mostly because all the code is simply compiled into the kernel...
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: WARNING: at kernel/softirq.c:139 local_bh_enable()

2008-01-08 Thread Jayakrishnan.Chathu
It is an embedded system, With a small file system and 2.6.23 kernel.
There are no link-monitoring tool running.  I will
Try to reproduce with 2.6.24-rc7. 

Chathu

-Original Message-
From: ext Kok, Auke [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 07, 2008 5:10 PM
To: Chathu Jayakrishnan (Nokia-S&S/MtView)
Cc: [EMAIL PROTECTED]; NetDev
Subject: Re: WARNING: at kernel/softirq.c:139 local_bh_enable()

[EMAIL PROTECTED] wrote:
> I am running 2.6.23 kernel on a DUAL core and QUAD core i386 boxes and

> after everyboot, when the ethernet traffic starts i get this warning.
> 
> All the ports in the system are e1000 and i am using the kernel e1000 
> driver.

[added netdev to the Cc:]

can you repro this with 2.6.24-rc7? What distro are you using? Is your
distro running a link-monitoring tool of some sorts?

Auke



> 
> Jan  7 22:31:00 localhost [warning] WARNING: at kernel/softirq.c:139
> local_bh_enable()
> Jan  7 22:31:00 localhost [warning] []
> local_bh_enable+0x49/0xa9
> Jan  7 22:31:00 localhost [warning] []
> dev_queue_xmit+0x26c/0x275
> Jan  7 22:31:00 localhost [warning] [] arp_xmit+0x4d/0x51 
> Jan  7 22:31:00 localhost [warning] [] 
> arp_solicit+0x156/0x174
> 
> Jan  7 22:31:00 localhost [warning] []
> neigh_timer_handler+0x1e0/0x224
> Jan  7 22:31:00 localhost [warning] []
> run_timer_softirq+0x113/0x172
> Jan  7 22:31:00 localhost [warning] [] WARNING: at
> kernel/softirq.c:139 local_bh_enable() Jan  7 22:31:00 localhost 
> [warning] hrtimer_interrupt+0x19c/0x1c4 Jan  7 22:31:00 localhost 
> [warning] []  []
> local_bh_enable+0x49/0xa9
> Jan  7 22:31:00 localhost [warning] []
> dev_queue_xmit+0x26c/0x275
> Jan  7 22:31:00 localhost [warning] [] 
> neigh_resolve_output+0x12c/0x15e Jan  7 22:31:00 localhost [warning] 
> [] neigh_update+0x246/0x2cb Jan  7 22:31:00 localhost 
> [warning] [] neigh_lookup+0xa9/0xb3 Jan  7 22:31:00 
> localhost [warning] [] arp_process+0x43c/0x477
> 
> Jan  7 22:31:00 localhost [warning] [] 
> enqueue_task_fair+0x2d/0x30 Jan  7 22:31:00 localhost [warning] 
> tick_sched_timer+0x0/0xba Jan  7 22:31:00 localhost [warning] 
> [] arp_rcv+0x104/0x119 Jan  7 22:31:00 localhost [warning] 
> []  [] netif_receive_skb+0x1c5/0x1de Jan  7 
> 22:31:00 localhost [warning] [] 
> e1000_clean_rx_irq+0x40e/0x4ca [e1000] Jan  7 22:31:00 localhost 
> [warning] []
> getnstimeofday+0x36/0x10c
> Jan  7 22:31:00 localhost [warning] neigh_timer_handler+0x0/0x224 Jan

> 7 22:31:00 localhost [warning] [] __do_softirq+0x60/0xc1 Jan

> 7 22:31:00 localhost [warning] [] e1000_clean+0x74/0x119 
> [e1000] Jan  7 22:31:00 localhost [warning] []  []
> net_rx_action+0x5a/0xd3
> Jan  7 22:31:00 localhost [warning] [] 
> __do_softirq+0x60/0xc1 Jan  7 22:31:00 localhost [warning] 
> do_softirq+0x31/0x35 Jan  7 22:31:00 localhost [warning] [] 
> do_softirq+0x31/0x35 Jan  7 22:31:00 localhost [warning] [] 
> irq_exit+0x38/0x6b Jan  7 22:31:00 localhost [warning] []  
> []
> do_IRQ+0x80/0x93
> Jan  7 22:31:00 localhost [warning] irq_exit+0x38/0x6b Jan  7 22:31:00

> localhost [warning] []
> common_interrupt+0x23/0x28
> Jan  7 22:31:00 localhost [warning] []  []
> get_swap_page+0xe7/0x215
> Jan  7 22:31:00 localhost [warning] []
> mwait_idle_with_hints+0x34/0x38 Jan  7 22:31:00 localhost [warning] 
> [] mwait_idle+0x0/0xa Jan  7 22:31:00 localhost [warning] 
> [] cpu_idle+0x98/0xb9 Jan  7 22:31:00 localhost [warning] 
> smp_apic_timer_interrupt+0x2c/0x35
> Jan  7 22:31:00 localhost [warning] === Jan  7 
> 22:31:00 localhost [warning] [] 
> apic_timer_interrupt+0x28/0x30 Jan  7 22:31:00 localhost [warning] 
> []
> get_swap_page+0xe7/0x215
> Jan  7 22:31:00 localhost [warning] []
> mwait_idle_with_hints+0x34/0x38
> Jan  7 22:31:00 localhost [warning] [] mwait_idle+0x0/0xa 
> Jan  7 22:31:00 localhost [warning] [] cpu_idle+0x98/0xb9 
> Jan  7 22:31:00 localhost [warning] ===
> 
> 
> Thanks
> Jayakrishnan Chathu
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[XFRM]: xfrm_algo_clone() allocates too much memory

2008-01-08 Thread Eric Dumazet

alg_key_len is the length in bits of the key, not in bytes.

Best way to fix this is to move alg_len() function from net/xfrm/xfrm_user.c 
to include/net/xfrm.h, and to use it in xfrm_algo_clone()


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

 include/net/xfrm.h   |7 ++-
 net/xfrm/xfrm_user.c |5 -
 2 files changed, 6 insertions(+), 6 deletions(-)




diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 58dfa82..731f0a8 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
return ret;
 }
 
+static inline int alg_len(struct xfrm_algo *alg)
+{
+   return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
+}
+
 #ifdef CONFIG_XFRM_MIGRATE
 static inline struct xfrm_algo *xfrm_algo_clone(struct xfrm_algo *orig)
 {
-   return (struct xfrm_algo *)kmemdup(orig, sizeof(*orig) + 
orig->alg_key_len, GFP_KERNEL);
+   return kmemdup(orig, alg_len(orig), GFP_KERNEL);
 }
 
 static inline void xfrm_states_put(struct xfrm_state **states, int n)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e75dbdc..aa667a4 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -31,11 +31,6 @@
 #include 
 #endif
 
-static inline int alg_len(struct xfrm_algo *alg)
-{
-   return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
-}
-
 static int verify_one_alg(struct nlattr **attrs, enum xfrm_attr_type_t type)
 {
struct nlattr *rt = attrs[type];


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Linus Torvalds


On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> the database has the information so it's just a matter of slightly different
> php code ;)
> Before I do that... do you want the BUG's separate, part of the warnings or
> part of the oopses?
> (I rather make this change once ;)

I'd like them all separate, they tend to be very different and contain 
different information.

Put the warnings last, as the least important. Oopses at the top, since 
they tend to be the ones that are less expected.

> > and notice how the Code: never made it into the raw message (and thus there
> > is also no instruction disassembly).
> 
> ok I'll fix this; I can fix this for all new entries at least, fixing
> retroactive is going to be near impossible I suspect.

Oh well..

Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Arjan van de Ven

Linus Torvalds wrote:

Cool.

One thing I wonder about - could you separate out the bug-ons and warnings 
from the oopses? They really are different issues, and an oops with 
register information etc is very different from a BUG() with line numbers, 
which in turn is very different from a WARN_ON().



and in fact three of those five entries are really WARN_ON's. It would be 
nicer if it would look more along the lines of


Backtraces reported for kernel 2.6.24-rc7


4 oopses reported

hfsplus_releasepage 3
__hfs_brec_find 1


3 warnings repored

enqueue_task1
lock_acquire1
__ieee80211_rx  1

because those things really don't have the same kind of impact at all, and 
tend to be very different to debug (a "BUG_ON()" is perhaps somewhat 
closer to an oops, but a WARN_ON() is definitely in a class of its own).


the database has the information so it's just a matter of slightly different 
php code ;)
Before I do that... do you want the BUG's separate, part of the warnings or 
part of the oopses?
(I rather make this change once ;)



On that "Code:" side, it seems there is still some problem with oops 
parsing. See for example:



http://www.kerneloops.org/raw.php?rawid=1521&msgid=http://mid.gmane.org/[EMAIL 
PROTECTED]

and notice how the Code: never made it into the raw message (and thus 
there is also no instruction disassembly).


ok I'll fix this; I can fix this for all new entries at least, fixing 
retroactive is going to be
near impossible I suspect.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()

2008-01-08 Thread Ramkrishna Vepa
Dave,

This change is not required as the macro, is_s2io_card_up() checks for
an internal state of the adapter and not netif's state. We want to make
sure that the adapter registers are not accessed when the adapter is
being brought down.

> @@ -2704,9 +2704,6 @@ static int s2io_poll(struct napi_struct *napi,
int
> budget)
>   struct XENA_dev_config __iomem *bar0 = nic->bar0;
>   int i;
> 
> - if (!is_s2io_card_up(nic))
> - return 0;
> -
>   mac_control = &nic->mac_control;
>   config = &nic->config;
> 

Ram
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Linus Torvalds


On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> I've made life easier for those using the www.kerneloops.org website;
> at least for x86 oopses the website now does this for you and shows
> the decoded Code: line in the raw oops data:
> 
> http://www.kerneloops.org/raw.php?rawid=2716

Cool.

One thing I wonder about - could you separate out the bug-ons and warnings 
from the oopses? They really are different issues, and an oops with 
register information etc is very different from a BUG() with line numbers, 
which in turn is very different from a WARN_ON().

Right now, it says

Oopses reported for kernel 2.6.24-rc7


7 oopses reported

hfsplus_releasepage 3
enqueue_task1
lock_acquire1
__hfs_brec_find 1
__ieee80211_rx  1

and in fact three of those five entries are really WARN_ON's. It would be 
nicer if it would look more along the lines of

Backtraces reported for kernel 2.6.24-rc7


4 oopses reported

hfsplus_releasepage 3
__hfs_brec_find 1


3 warnings repored

enqueue_task1
lock_acquire1
__ieee80211_rx  1

because those things really don't have the same kind of impact at all, and 
tend to be very different to debug (a "BUG_ON()" is perhaps somewhat 
closer to an oops, but a WARN_ON() is definitely in a class of its own).

On that "Code:" side, it seems there is still some problem with oops 
parsing. See for example:


http://www.kerneloops.org/raw.php?rawid=1521&msgid=http://mid.gmane.org/[EMAIL 
PROTECTED]

and notice how the Code: never made it into the raw message (and thus 
there is also no instruction disassembly).

Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Arjan van de Ven

Randy Dunlap wrote:


(You can do it other and smarter ways too, I'm not claiming that's a 
particularly good way to do it, and the old "ksymoops" program used to do 
a pretty good job of this, but I'm used to that particular idiotic way 
myself, since it's how I've basically always done it)


One other way to do it (at least for x86-32/64) is to use
$kerneltree/scripts/decodecode.  It may work on other $arches also,
but I haven't tested it on others.


I've made life easier for those using the www.kerneloops.org website;
at least for x86 oopses the website now does this for you and shows
the decoded Code: line in the raw oops data:

http://www.kerneloops.org/raw.php?rawid=2716
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] [PATCH] Split up rndis_host.c

2008-01-08 Thread Johannes Berg

> I see that the rndis_wext.c Kconfig won't kick in unless the
> 802.11 stuff is available ... what additional dependencies
> does that imply for a fatter rndis_host module?

No extra modules are currently required for just plain wext [1]. In the
future, we hope to migrate stuff to cfg80211 though, which is a separate
module.

johannes

[1] mostly because all the code is simply compiled into the kernel...


signature.asc
Description: This is a digitally signed message part


[ANNOUNCE] iproute2-2.6.24-rc7

2008-01-08 Thread Stephen Hemminger
This is a preliminary release that includes all the changes for new
features in 2.6.24.  It should be backward compatible with older kernels.

   
http://devresources.linux-foundation.org/dev/iproute2/download/iproute2-2.6.24-rc7.tar.bz2

Note: This release is for validation (don't put it in your distros), therefore
I didn't bother signing it.

Changelog since v2.6.23 release (edited).

Alexander Wirt (2):
  Fix various typos and nitpicks
  Add parameters to usage help text.

Andreas Barth (1):
  Remove bogus reference to tc-filters(8) from tc(8) manpage.

Andreas Henriksson (4):
  Fix corruption when using batch files with comments and broken lines.
  iproute2: support dotted-quad netmask notation.
  iproute2: revert syntax help text mistake.
  iproute2: add synonyms for ip rule options to ip(8) manpage.

Denys Fedoryshchenko (1):
  iptables compatiablity

François Delawarde (1):
  tc mask patch

Herbert Xu:
  Fix typo in tunnel code (o_key vs. i_key).
  Add NAT action

Jesper Dangaard Brouer (3):
  Overhead calculation is now done in the kernel.
  Cleanup: tc_calc_rtable().
  Change the rate table calc of transmit cost to use upper bound value.

Patrick McHardy (1):
  iproute 2.6.23 incompatibility

Pavel Emelyanov (1):
  iplink_parse() routine

Stephen Hemminger
  2.6.24-rc3 headers
  Fix off by one in nested attribute management.
  Fix dotted quad for bit order
  veth: use kernel header file
  snapshot target for makefile
  veth.h move to linux/
  Manual page fixes
  add decode of match rules
  Use netinet/tcp.h (with correction) rather than kernel headers
  add include/netinet/tcp.h
  Revert "TC action parsing bug fix"

Tomas Janousek (1):
  Correct documentation regarding PROMISC and ALLMULTI.

Vitaliy Gusev (2):
  Fix lost export-dynamic
  veth device link management

YOSHIFUJI Hideaki / 吉藤英明 (1):
  rto_min value display overflow

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Andi Kleen
Linus Torvalds <[EMAIL PROTECTED]> writes:
>
> I usually just compile a small program like

Just use scripts/decodecode and cat the Code line into that.

> particularly good way to do it, and the old "ksymoops" program used to do 
> a pretty good job of this, but I'm used to that particular idiotic way 
> myself, since it's how I've basically always done it)
>
> After that, you still need to try to match up the assembly code with the 
> source code and figure out what variables the register contents actually 
> are all about. You can often try to do a
>
>   make the/affected/file.s


IMHO better is  

make the/file/xyz.lst

which gives you a listing with binary data in there which can be
grepped for.

But you should install a very recent binutils because older objdump -S
couldn't deal with unit-at-a-time compilers.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ANNOUNCE] bridge-utils 1.4

2008-01-08 Thread Stephen Hemminger
Minor update to bridge-utils. Mostly fixing bugs in usage of sysfs.

Release tarball:
  http://downloads.sourceforge.net/bridge/bridge-utils-1.4.tar.gz

Alon Bar-Lev (1):
  Allow bridge-utils to run when no TCP/IP is available

Denys Vlasenko (1):
  fix use of sysfs (affects 32/64 bit compat)

Jeremy Jackson (1):
  Fix parsing of port_id's (hex).

Stephen Hemminger (3):
  Add ignore for generated files.
  Update gitignore
  Use linux/if.h rather than net/if.h


-- 
Stephen Hemminger <[EMAIL PROTECTED]>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SACK scoreboard

2008-01-08 Thread John Heffner

David Miller wrote:

Ilpo, just trying to keep an old conversation from dying off.

Did you happen to read a recent blog posting of mine?

http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead

I've been thinking more and more and I think we might be able
to get away with enforcing that SACKs are always increasing in
coverage.

I doubt there are any real systems out there that drop out of order
packets that are properly formed and are in window, even though the
SACK specification (foolishly, in my opinion) allows this.

If we could free packets as SACK blocks cover them, all the problems
go away.

For one thing, this will allow the retransmit queue liberation during
loss recovery to be spread out over the event, instead of batched up
like crazy to the point where the cumulative ACK finally moves and
releases an entire window's worth of data.

Next, it would simplify all of this scanning code trying to figure out
which holes to fill during recovery.

And for SACK scoreboard marking, the RB trie would become very nearly
unecessary as far as I can tell.

I would not even entertain this kind of crazy idea unless I thought
the fundamental complexity simplification payback was enormous.  And
in this case I think it is.

What we could do is put some experimental hack in there for developers
to start playing with, which would enforce that SACKs always increase
in coverage.  If violated the connection reset and a verbose log
message is logged so we can analyze any cases that occur.

Sounds crazy, but maybe has potential.  What do you think?



Linux has a code path where this can happen under memory over-commit, in 
tcp_prune_queue().  Also, I think one of the motivations for making SACK 
strictly advisory is there was some concern about buggy SACK 
implementations.  Keeping data in your retransmit queue allows you to 
fall back to timeout and go-back-n if things completely fall apart.  For 
better or worse, we have to deal with the spec the way it is.


Even if you made this assumption of "hard" SACKs, you still have to 
worry about large ACKs if SACK is disabled, though I guess you could say 
people running with large windows without SACK deserve what they get. :)



I haven't thought about this too hard, but can we approximate this by 
moving scaked data into a sacked queue, then if something bad happens 
merge this back into the retransmit queue?  The code will have to deal 
with non-contiguous data in the retransmit queue; I'm not sure offhand 
if that violates any assumptions.  You still have a single expensive ACK 
at the end of recovery, though I wonder how much this really hurts.  If 
you want to ameliorate this, you could save this sacked queue to be 
batch processed later, in application context for instance.


  -John


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ip[6]_tables.c: remove some inlines

2008-01-08 Thread Patrick McHardy

Denys Vlasenko wrote:

On Monday 31 December 2007 17:00, Patrick McHardy wrote:

Denys Vlasenko wrote:

ip[6]_tables.c seem to abuse inline.

This patch removes inlines except those which are used
by packet matching code and thus are performance-critical.
  

Some people also consider the ruleset replacement path performance
critical, but overall I agree with your patch. I'm travelling currently
though so it will take a few days until I'll apply it.

Do you have some numbers that show the actual difference these
changes make?


Before:

$ size */*/*/ip*tables*.o
   textdata bss dec hex filename
   6402 500  1669181b06 net/ipv4/netfilter/ip_tables.o
   7130 500  1676461dde net/ipv6/netfilter/ip6_tables.o

After:

$ size */*/*/ip*tables*.o
   textdata bss dec hex filename
   6307 500  1668231aa7 net/ipv4/netfilter/ip_tables.o
   7010 500  1675261d66 net/ipv6/netfilter/ip6_tables.o



Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-2.6.25 6/6][NETFILTER] Use the ctl paths instead of hand-made analogue

2008-01-08 Thread Patrick McHardy

Pavel Emelyanov wrote:

The conntracks subsystem has a similar infrastructure
to maintain ctl_paths, but since we already have it
on the generic level, I think it's OK to switch to
using it.

So, basically, this patch just replaces the ctl_table-s
with ctl_path-s, nf_register_sysctl_table with 
register_sysctl_paths() and removes no longer needed code.


Also looks good, thanks. But please remeber to CC netfilter-devel
on patches affecting netfilter.

After this the net/netfilter/nf_sysctl.c file contains 
the paths only.


That sounds like they should be moved to net/netfilter/core.c
instead and the file removed. I can take care of that once
your patches are in Dave's tree.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-2.6.25 5/6][NETFILTER] Switch to using ctl_paths in nf_queue and conntrack modules

2008-01-08 Thread Patrick McHardy

Pavel Emelyanov wrote:

This includes the most simple cases for netfilter.

The first part is tne queue modules for ipv4 and ipv6,
on which the net/ipv4/ and net/ipv6/ paths are reused 
from the appropriate ipv4 and ipv6 code.


The conntrack module is also patched, but this hunk is 
very small and simple.


Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>



Looks good to me.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-08 Thread Randy Dunlap
On Mon, 7 Jan 2008 19:26:12 -0800 (PST) Linus Torvalds wrote:

> On Mon, 7 Jan 2008, Kevin Winchester wrote:
> 
> > J. Bruce Fields wrote:
> > > 
> > > Is there any good basic documentation on this to point people at?
> > 
> > I would second this question.  I see people "decode" oops on lkml often 
> > enough, but I've never been entirely sure how its done.  Is it somewhere 
> > in Documentation?
> 
> It's actually not necessarily at all that trivial, unless you have a deep 
> understanding of the code generated for the architecture in question (and 
> even then, some oopses take more time to figure out than others, thanks 
> to inlining and tailcalls etc).
> 
> If the oops happened with a kernel you generated yourself, it's usually 
> rather easy. Especially if you said "y" to the "generate debugging info" 
> question at configuration time. Because, in that case, you really just do 
> a simple
> 
>   gdb vmlinux
> 
> and then you can do (for example) something like setting a breakpoint at 
> the EIP that was reported for the oops, and it will tell you what line it 
> came from.
> 
> However, if you don't have the exact binary - which is the common case for 
> random oopses reported on lkml - you will generally have to disassemble 
> the hex sequence given in the oops (the "Code:" line), and try to match it 
> up against the source code to try to figure out what is going on.
> 
> Even just the disassembly is not entirely trivial, since the oops will 
> give you the eip that it happened at, but you often want to also 
> disassemble *backwards* in order to get more of a context (the "Code:" 
> line will mark the particular EIP that starts the oopsing instruction by 
> enclosing it in , but with non-constant instruction lengths, you need 
> to use a bit of trial-and-error to figure it out.
> 
> I usually just compile a small program like
> 
>   const char array[]="\xnn\xnn\xnn...";
> 
>   int main(int argc, char **argv)
>   {
>   printf("%p\n", array);
>   *(int *)0=0;
>   }
> 
> and run it under gdb, and then when it gets the SIGSEGV (due to the 
> obvious NULL pointer dereference), I can just ask gdb to disassemble 
> around the array that contains the code[] stuff. Try a few offsets, to see 
> when the disassembly makes sense (and gives the reported EIP as the 
> beginning of one of the disassembled instructions).
> 
> (You can do it other and smarter ways too, I'm not claiming that's a 
> particularly good way to do it, and the old "ksymoops" program used to do 
> a pretty good job of this, but I'm used to that particular idiotic way 
> myself, since it's how I've basically always done it)

One other way to do it (at least for x86-32/64) is to use
$kerneltree/scripts/decodecode.  It may work on other $arches also,
but I haven't tested it on others.

---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.25 6/6][NETFILTER] Use the ctl paths instead of hand-made analogue

2008-01-08 Thread Pavel Emelyanov
The conntracks subsystem has a similar infrastructure
to maintain ctl_paths, but since we already have it
on the generic level, I think it's OK to switch to
using it.

So, basically, this patch just replaces the ctl_table-s
with ctl_path-s, nf_register_sysctl_table with 
register_sysctl_paths() and removes no longer needed code.

After this the net/netfilter/nf_sysctl.c file contains 
the paths only.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

 include/linux/netfilter.h|8 -
 include/net/netfilter/nf_conntrack_l3proto.h |2 
 net/netfilter/nf_conntrack_proto.c   |7 -
 net/netfilter/nf_sysctl.c|  127 +--
 4 files changed, 16 insertions(+), 128 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d190d56..c41f643 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -120,12 +120,8 @@ void nf_unregister_sockopt(struct nf_sockopt_ops *reg);
 
 #ifdef CONFIG_SYSCTL
 /* Sysctl registration */
-struct ctl_table_header *nf_register_sysctl_table(struct ctl_table *path,
- struct ctl_table *table);
-void nf_unregister_sysctl_table(struct ctl_table_header *header,
-   struct ctl_table *table);
-extern struct ctl_table nf_net_netfilter_sysctl_path[];
-extern struct ctl_table nf_net_ipv4_netfilter_sysctl_path[];
+extern struct ctl_path nf_net_netfilter_sysctl_path[];
+extern struct ctl_path nf_net_ipv4_netfilter_sysctl_path[];
 #endif /* CONFIG_SYSCTL */
 
 extern struct list_head nf_hooks[NPROTO][NF_MAX_HOOKS];
diff --git a/include/net/netfilter/nf_conntrack_l3proto.h 
b/include/net/netfilter/nf_conntrack_l3proto.h
index 15888fc..875c6d4 100644
--- a/include/net/netfilter/nf_conntrack_l3proto.h
+++ b/include/net/netfilter/nf_conntrack_l3proto.h
@@ -73,7 +73,7 @@ struct nf_conntrack_l3proto
 
 #ifdef CONFIG_SYSCTL
struct ctl_table_header *ctl_table_header;
-   struct ctl_table*ctl_table_path;
+   struct ctl_path *ctl_table_path;
struct ctl_table*ctl_table;
 #endif /* CONFIG_SYSCTL */
 
diff --git a/net/netfilter/nf_conntrack_proto.c 
b/net/netfilter/nf_conntrack_proto.c
index 6d94706..8595b59 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -36,11 +36,11 @@ static DEFINE_MUTEX(nf_ct_proto_mutex);
 
 #ifdef CONFIG_SYSCTL
 static int
-nf_ct_register_sysctl(struct ctl_table_header **header, struct ctl_table *path,
+nf_ct_register_sysctl(struct ctl_table_header **header, struct ctl_path *path,
  struct ctl_table *table, unsigned int *users)
 {
if (*header == NULL) {
-   *header = nf_register_sysctl_table(path, table);
+   *header = register_sysctl_paths(path, table);
if (*header == NULL)
return -ENOMEM;
}
@@ -55,7 +55,8 @@ nf_ct_unregister_sysctl(struct ctl_table_header **header,
 {
if (users != NULL && --*users > 0)
return;
-   nf_unregister_sysctl_table(*header, table);
+
+   unregister_sysctl_table(*header);
*header = NULL;
 }
 #endif
diff --git a/net/netfilter/nf_sysctl.c b/net/netfilter/nf_sysctl.c
index ee34589..d9fcc89 100644
--- a/net/netfilter/nf_sysctl.c
+++ b/net/netfilter/nf_sysctl.c
@@ -7,128 +7,19 @@
 #include 
 #include 
 
-static void
-path_free(struct ctl_table *path, struct ctl_table *table)
-{
-   struct ctl_table *t, *next;
-
-   for (t = path; t != NULL && t != table; t = next) {
-   next = t->child;
-   kfree(t);
-   }
-}
-
-static struct ctl_table *
-path_dup(struct ctl_table *path, struct ctl_table *table)
-{
-   struct ctl_table *t, *last = NULL, *tmp;
-
-   for (t = path; t != NULL; t = t->child) {
-   /* twice the size since path elements are terminated by an
-* empty element */
-   tmp = kmemdup(t, 2 * sizeof(*t), GFP_KERNEL);
-   if (tmp == NULL) {
-   if (last != NULL)
-   path_free(path, table);
-   return NULL;
-   }
-
-   if (last != NULL)
-   last->child = tmp;
-   else
-   path = tmp;
-   last = tmp;
-   }
-
-   if (last != NULL)
-   last->child = table;
-   else
-   path = table;
-
-   return path;
-}
-
-struct ctl_table_header *
-nf_register_sysctl_table(struct ctl_table *path, struct ctl_table *table)
-{
-   struct ctl_table_header *header;
-
-   path = path_dup(path, table);
-   if (path == NULL)
-   return NULL;
-   header = register_sysctl_table(path);
-   if (header == NULL)
-   path_free(path, table);
-   return header;
-}
-EXPORT_SYMBOL_GPL(nf_register_sysctl_table);
-
-void
-nf_unregister_sysctl_ta

[PATCH net-2.6.25 5/6][NETFILTER] Switch to using ctl_paths in nf_queue and conntrack modules

2008-01-08 Thread Pavel Emelyanov
This includes the most simple cases for netfilter.

The first part is tne queue modules for ipv4 and ipv6,
on which the net/ipv4/ and net/ipv6/ paths are reused 
from the appropriate ipv4 and ipv6 code.

The conntrack module is also patched, but this hunk is 
very small and simple.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

diff --git a/include/net/ip.h b/include/net/ip.h
index 8be48c8..2ad4d2f 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -177,6 +177,8 @@ extern void inet_get_local_port_range(int *low, int *high);
 extern int sysctl_ip_default_ttl;
 extern int sysctl_ip_nonlocal_bind;
 
+extern struct ctl_path net_ipv4_ctl_path[];
+
 /* From ip_fragment.c */
 struct inet_frags_ctl;
 extern struct inet_frags_ctl ip4_frags_ctl;
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f2adedf..e371f32 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -112,6 +112,8 @@ struct frag_hdr {
 extern int sysctl_ipv6_bindv6only;
 extern int sysctl_mld_max_msf;
 
+extern struct ctl_path net_ipv6_ctl_path[];
+
 #define _DEVINC(statname, modifier, idev, field)   \
 ({ \
struct inet6_dev *_idev = (idev);   \
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
index 68b12ce..7361315 100644
--- a/net/ipv4/netfilter/ip_queue.c
+++ b/net/ipv4/netfilter/ip_queue.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IPQ_QMAX_DEFAULT 1024
 #define IPQ_PROC_FS_NAME "ip_queue"
@@ -525,26 +526,6 @@ static ctl_table ipq_table[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table ipq_dir_table[] = {
-   {
-   .ctl_name   = NET_IPV4,
-   .procname   = "ipv4",
-   .mode   = 0555,
-   .child  = ipq_table
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table ipq_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = ipq_dir_table
-   },
-   { .ctl_name = 0 }
-};
-
 static int ip_queue_show(struct seq_file *m, void *v)
 {
read_lock_bh(&queue_lock);
@@ -610,7 +591,7 @@ static int __init ip_queue_init(void)
}
 
register_netdevice_notifier(&ipq_dev_notifier);
-   ipq_sysctl_header = register_sysctl_table(ipq_root_table);
+   ipq_sysctl_header = register_sysctl_paths(net_ipv4_ctl_path, ipq_table);
 
status = nf_register_queue_handler(PF_INET, &nfqh);
if (status < 0) {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a5a9f8e..45536a9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -846,17 +846,18 @@ static struct ctl_table ipv4_table[] = {
{ .ctl_name = 0 }
 };
 
-static __initdata struct ctl_path net_ipv4_path[] = {
+struct ctl_path net_ipv4_ctl_path[] = {
{ .procname = "net", .ctl_name = CTL_NET, },
{ .procname = "ipv4", .ctl_name = NET_IPV4, },
{ },
 };
+EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 
 static __init int sysctl_ipv4_init(void)
 {
struct ctl_table_header *hdr;
 
-   hdr = register_sysctl_paths(net_ipv4_path, ipv4_table);
+   hdr = register_sysctl_paths(net_ipv4_ctl_path, ipv4_table);
return hdr == NULL ? -ENOMEM : 0;
 }
 
diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
index e5b0059..a20db0b 100644
--- a/net/ipv6/netfilter/ip6_queue.c
+++ b/net/ipv6/netfilter/ip6_queue.c
@@ -529,26 +529,6 @@ static ctl_table ipq_table[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table ipq_dir_table[] = {
-   {
-   .ctl_name   = NET_IPV6,
-   .procname   = "ipv6",
-   .mode   = 0555,
-   .child  = ipq_table
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table ipq_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = ipq_dir_table
-   },
-   { .ctl_name = 0 }
-};
-
 static int ip6_queue_show(struct seq_file *m, void *v)
 {
read_lock_bh(&queue_lock);
@@ -614,7 +594,7 @@ static int __init ip6_queue_init(void)
}
 
register_netdevice_notifier(&ipq_dev_notifier);
-   ipq_sysctl_header = register_sysctl_table(ipq_root_table);
+   ipq_sysctl_header = register_sysctl_paths(net_ipv6_ctl_path, ipq_table);
 
status = nf_register_queue_handler(PF_INET6, &nfqh);
if (status < 0) {
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 0b5bec3..4ad8d9d 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -82,17 +82,19 @@ static ctl_table ipv6_table[] = {
{ .ctl_name = 0 }
 };
 
-static struct ctl_path ipv6_ctl_path[] = {
+struct ctl_path net_

[PATCH net-2.6.25 4/6][AX25] Switch to using ctl_paths.

2008-01-08 Thread Pavel Emelyanov
This one is almost the same as the hunks in the
first patch, but ax25 tables are created dynamically.

So this patch differs a bit to handle this case.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

 net/ax25/sysctl_net_ax25.c |   27 +--
 1 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/net/ax25/sysctl_net_ax25.c b/net/ax25/sysctl_net_ax25.c
index 443a836..f597987 100644
--- a/net/ax25/sysctl_net_ax25.c
+++ b/net/ax25/sysctl_net_ax25.c
@@ -31,25 +31,11 @@ static struct ctl_table_header *ax25_table_header;
 static ctl_table *ax25_table;
 static int ax25_table_size;
 
-static ctl_table ax25_dir_table[] = {
-   {
-   .ctl_name   = NET_AX25,
-   .procname   = "ax25",
-   .mode   = 0555,
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table ax25_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = ax25_dir_table
-   },
-   { .ctl_name = 0 }
+static struct ctl_path ax25_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "ax25", .ctl_name = NET_AX25, },
+   { }
 };
-
 static const ctl_table ax25_param_table[] = {
{
.ctl_name   = NET_AX25_IP_DEFAULT_MODE,
@@ -243,9 +229,7 @@ void ax25_register_sysctl(void)
}
spin_unlock_bh(&ax25_dev_lock);
 
-   ax25_dir_table[0].child = ax25_table;
-
-   ax25_table_header = register_sysctl_table(ax25_root_table);
+   ax25_table_header = register_sysctl_paths(ax25_path, ax25_table);
 }
 
 void ax25_unregister_sysctl(void)
@@ -253,7 +237,6 @@ void ax25_unregister_sysctl(void)
ctl_table *p;
unregister_sysctl_table(ax25_table_header);
 
-   ax25_dir_table[0].child = NULL;
for (p = ax25_table; p->ctl_name; p++)
kfree(p->child);
kfree(ax25_table);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.25 3/6][DECNET] Switch to using ctl_paths.

2008-01-08 Thread Pavel Emelyanov
The decnet includes two places to patch. The first one is 
the net/decnet table itself, and it is patched just like 
other subsystems in the first patch in this series.

The second place is a bit more complex - it is the 
net/decnet/conf/xxx entries,. similar to those in 
ipv4/devinet.c and ipv6/addrconf.c. This code is made similar 
to those in ipv[46].

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

 net/decnet/dn_dev.c|   52 +++--
 net/decnet/sysctl_net_decnet.c |   23 +++---
 2 files changed, 20 insertions(+), 55 deletions(-)

diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
index 39c89c6..fb884e5 100644
--- a/net/decnet/dn_dev.c
+++ b/net/decnet/dn_dev.c
@@ -173,10 +173,6 @@ static int dn_forwarding_sysctl(ctl_table *table, int 
__user *name, int nlen,
 static struct dn_dev_sysctl_table {
struct ctl_table_header *sysctl_header;
ctl_table dn_dev_vars[5];
-   ctl_table dn_dev_dev[2];
-   ctl_table dn_dev_conf_dir[2];
-   ctl_table dn_dev_proto_dir[2];
-   ctl_table dn_dev_root_dir[2];
 } dn_dev_sysctl = {
NULL,
{
@@ -224,30 +220,6 @@ static struct dn_dev_sysctl_table {
},
{0}
},
-   {{
-   .ctl_name = 0,
-   .procname = "",
-   .mode = 0555,
-   .child = dn_dev_sysctl.dn_dev_vars
-   }, {0}},
-   {{
-   .ctl_name = NET_DECNET_CONF,
-   .procname = "conf",
-   .mode = 0555,
-   .child = dn_dev_sysctl.dn_dev_dev
-   }, {0}},
-   {{
-   .ctl_name = NET_DECNET,
-   .procname = "decnet",
-   .mode = 0555,
-   .child = dn_dev_sysctl.dn_dev_conf_dir
-   }, {0}},
-   {{
-   .ctl_name = CTL_NET,
-   .procname = "net",
-   .mode = 0555,
-   .child = dn_dev_sysctl.dn_dev_proto_dir
-   }, {0}}
 };
 
 static void dn_dev_sysctl_register(struct net_device *dev, struct dn_dev_parms 
*parms)
@@ -255,6 +227,16 @@ static void dn_dev_sysctl_register(struct net_device *dev, 
struct dn_dev_parms *
struct dn_dev_sysctl_table *t;
int i;
 
+#define DN_CTL_PATH_DEV3
+   
+   struct ctl_path dn_ctl_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "decnet", .ctl_name = NET_DECNET, },
+   { .procname = "conf", .ctl_name = NET_DECNET_CONF, },
+   { /* to be set */ },
+   { },
+   };
+
t = kmemdup(&dn_dev_sysctl, sizeof(*t), GFP_KERNEL);
if (t == NULL)
return;
@@ -265,20 +247,16 @@ static void dn_dev_sysctl_register(struct net_device 
*dev, struct dn_dev_parms *
}
 
if (dev) {
-   t->dn_dev_dev[0].procname = dev->name;
-   t->dn_dev_dev[0].ctl_name = dev->ifindex;
+   dn_ctl_path[DN_CTL_PATH_DEV].procname = dev->name;
+   dn_ctl_path[DN_CTL_PATH_DEV].ctl_name = dev->ifindex;
} else {
-   t->dn_dev_dev[0].procname = parms->name;
-   t->dn_dev_dev[0].ctl_name = parms->ctl_name;
+   dn_ctl_path[DN_CTL_PATH_DEV].procname = parms->name;
+   dn_ctl_path[DN_CTL_PATH_DEV].ctl_name = parms->ctl_name;
}
 
-   t->dn_dev_dev[0].child = t->dn_dev_vars;
-   t->dn_dev_conf_dir[0].child = t->dn_dev_dev;
-   t->dn_dev_proto_dir[0].child = t->dn_dev_conf_dir;
-   t->dn_dev_root_dir[0].child = t->dn_dev_proto_dir;
t->dn_dev_vars[0].extra1 = (void *)dev;
 
-   t->sysctl_header = register_sysctl_table(t->dn_dev_root_dir);
+   t->sysctl_header = register_sysctl_paths(dn_ctl_path, t->dn_dev_vars);
if (t->sysctl_header == NULL)
kfree(t);
else
diff --git a/net/decnet/sysctl_net_decnet.c b/net/decnet/sysctl_net_decnet.c
index ae354a4..228067c 100644
--- a/net/decnet/sysctl_net_decnet.c
+++ b/net/decnet/sysctl_net_decnet.c
@@ -470,28 +470,15 @@ static ctl_table dn_table[] = {
{0}
 };
 
-static ctl_table dn_dir_table[] = {
-   {
-   .ctl_name = NET_DECNET,
-   .procname = "decnet",
-   .mode = 0555,
-   .child = dn_table},
-   {0}
-};
-
-static ctl_table dn_root_table[] = {
-   {
-   .ctl_name = CTL_NET,
-   .procname = "net",
-   .mode = 0555,
-   .child = dn_dir_table
-   },
-   {0}
+static struct ctl_path dn_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "decnet", .ctl_name = NET_DECNET, },
+   { }
 };
 
 void dn_register_sysctl(void)
 {
-   dn_table_header = register_sysctl_table(dn_root_table);
+   dn_table_header = register_sysctl_paths(dn_path, dn_table);
 }
 
 void dn_unregister_sysctl(void)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of

Re: 2.6.24-rc6-mm1

2008-01-08 Thread Ingo Molnar

* FUJITA Tomonori <[EMAIL PROTECTED]> wrote:

> The patches are available at:
> 
> http://www.kernel.org/pub/linux/kernel/people/tomo/iommu/
> 
> Or if you prefer the git tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git 
> iommu-sg-fixes

btw., these improvements to the IOMMU code are in -mm and will go into 
v2.6.25, right? The changes look robust to me.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.25 2/6][IPVS] Switch to using ctl_paths.

2008-01-08 Thread Pavel Emelyanov
The feature of ipvs ctls is that the net/ipv4/vs path
is common for core ipvs ctls and for two schedulers,
so I make it exported and re-use it in modules.

Two other .c files required linux/sysctl.h to make the
extern declaration of this path compile well.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

 include/net/ip_vs.h |1 +
 net/ipv4/ipvs/ip_vs_ctl.c   |   35 +++
 net/ipv4/ipvs/ip_vs_est.c   |1 +
 net/ipv4/ipvs/ip_vs_lblc.c  |   31 +--
 net/ipv4/ipvs/ip_vs_lblcr.c |   31 +--
 net/ipv4/ipvs/ip_vs_sched.c |1 +
 6 files changed, 12 insertions(+), 88 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 3de6d1e..02ab7ca 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -854,6 +854,7 @@ extern int sysctl_ip_vs_expire_quiescent_template;
 extern int sysctl_ip_vs_sync_threshold[2];
 extern int sysctl_ip_vs_nat_icmp_send;
 extern struct ip_vs_stats ip_vs_stats;
+extern struct ctl_path net_vs_ctl_path[];
 
 extern struct ip_vs_service *
 ip_vs_service_get(__u32 fwmark, __u16 protocol, __be32 vaddr, __be16 vport);
diff --git a/net/ipv4/ipvs/ip_vs_ctl.c b/net/ipv4/ipvs/ip_vs_ctl.c
index 693d924..9fecfe7 100644
--- a/net/ipv4/ipvs/ip_vs_ctl.c
+++ b/net/ipv4/ipvs/ip_vs_ctl.c
@@ -1591,34 +1591,13 @@ static struct ctl_table vs_vars[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table vs_table[] = {
-   {
-   .procname   = "vs",
-   .mode   = 0555,
-   .child  = vs_vars
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table ipvs_ipv4_table[] = {
-   {
-   .ctl_name   = NET_IPV4,
-   .procname   = "ipv4",
-   .mode   = 0555,
-   .child  = vs_table,
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table vs_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = ipvs_ipv4_table,
-   },
-   { .ctl_name = 0 }
+struct ctl_path net_vs_ctl_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "ipv4", .ctl_name = NET_IPV4, },
+   { .procname = "vs", },
+   { }
 };
+EXPORT_SYMBOL_GPL(net_vs_ctl_path);
 
 static struct ctl_table_header * sysctl_header;
 
@@ -2345,7 +2324,7 @@ int ip_vs_control_init(void)
proc_net_fops_create(&init_net, "ip_vs", 0, &ip_vs_info_fops);
proc_net_fops_create(&init_net, "ip_vs_stats",0, &ip_vs_stats_fops);
 
-   sysctl_header = register_sysctl_table(vs_root_table);
+   sysctl_header = register_sysctl_paths(net_vs_ctl_path, vs_vars);
 
/* Initialize ip_vs_svc_table, ip_vs_svc_fwm_table, ip_vs_rtable */
for(idx = 0; idx < IP_VS_SVC_TAB_SIZE; idx++)  {
diff --git a/net/ipv4/ipvs/ip_vs_est.c b/net/ipv4/ipvs/ip_vs_est.c
index efdd74e..dfa0d71 100644
--- a/net/ipv4/ipvs/ip_vs_est.c
+++ b/net/ipv4/ipvs/ip_vs_est.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
index bf8c04a..3888642 100644
--- a/net/ipv4/ipvs/ip_vs_lblc.c
+++ b/net/ipv4/ipvs/ip_vs_lblc.c
@@ -123,35 +123,6 @@ static ctl_table vs_vars_table[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table vs_table[] = {
-   {
-   .procname   = "vs",
-   .mode   = 0555,
-   .child  = vs_vars_table
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table ipvs_ipv4_table[] = {
-   {
-   .ctl_name   = NET_IPV4,
-   .procname   = "ipv4",
-   .mode   = 0555,
-   .child  = vs_table
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table lblc_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = ipvs_ipv4_table
-   },
-   { .ctl_name = 0 }
-};
-
 static struct ctl_table_header * sysctl_header;
 
 /*
@@ -582,7 +553,7 @@ static int __init ip_vs_lblc_init(void)
int ret;
 
INIT_LIST_HEAD(&ip_vs_lblc_scheduler.n_list);
-   sysctl_header = register_sysctl_table(lblc_root_table);
+   sysctl_header = register_sysctl_paths(net_vs_ctl_path, vs_vars_table);
ret = register_ip_vs_scheduler(&ip_vs_lblc_scheduler);
if (ret)
unregister_sysctl_table(sysctl_header);
diff --git a/net/ipv4/ipvs/ip_vs_lblcr.c b/net/ipv4/ipvs/ip_vs_lblcr.c
index f50da64..daa260e 100644
--- a/net/ipv4/ipvs/ip_vs_lblcr.c
+++ b/net/ipv4/ipvs/ip_vs_lblcr.c
@@ -311,35 +311,6 @@ static ctl_table vs_vars_table[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table vs_table[] = {
-   {
-   .procname   = "vs",
-   .mode   = 055

[PATCH net-2.6.25 1/6][NET] Simple ctl_table to ctl_path conversions.

2008-01-08 Thread Pavel Emelyanov
This patch includes many places, that only required
replacing the ctl_table-s with appropriate ctl_paths
and call register_sysctl_paths().

Nothing special was done with them.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

 net/appletalk/sysctl_net_atalk.c |   24 +---
 net/bridge/br_netfilter.c|   24 +---
 net/dccp/sysctl.c|   36 +++-
 net/ipx/sysctl_net_ipx.c |   24 +---
 net/irda/irsysctl.c  |   28 +---
 net/llc/sysctl_net_llc.c |   24 +---
 net/netrom/sysctl_net_netrom.c   |   24 +---
 net/rose/sysctl_net_rose.c   |   24 +---
 net/sctp/sysctl.c|   24 +---
 net/x25/sysctl_net_x25.c |   24 +---
 10 files changed, 52 insertions(+), 204 deletions(-)

diff --git a/net/appletalk/sysctl_net_atalk.c b/net/appletalk/sysctl_net_atalk.c
index 7df1778..621805d 100644
--- a/net/appletalk/sysctl_net_atalk.c
+++ b/net/appletalk/sysctl_net_atalk.c
@@ -49,31 +49,17 @@ static struct ctl_table atalk_table[] = {
{ 0 },
 };
 
-static struct ctl_table atalk_dir_table[] = {
-   {
-   .ctl_name   = NET_ATALK,
-   .procname   = "appletalk",
-   .mode   = 0555,
-   .child  = atalk_table,
-   },
-   { 0 },
-};
-
-static struct ctl_table atalk_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = atalk_dir_table,
-   },
-   { 0 },
+static struct ctl_path atalk_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "appletalk", .ctl_name = NET_ATALK, },
+   { }
 };
 
 static struct ctl_table_header *atalk_table_header;
 
 void atalk_register_sysctl(void)
 {
-   atalk_table_header = register_sysctl_table(atalk_root_table);
+   atalk_table_header = register_sysctl_paths(atalk_path, atalk_table);
 }
 
 void atalk_unregister_sysctl(void)
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 32ac035..91a180a 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -934,24 +934,10 @@ static ctl_table brnf_table[] = {
{ .ctl_name = 0 }
 };
 
-static ctl_table brnf_bridge_table[] = {
-   {
-   .ctl_name   = NET_BRIDGE,
-   .procname   = "bridge",
-   .mode   = 0555,
-   .child  = brnf_table,
-   },
-   { .ctl_name = 0 }
-};
-
-static ctl_table brnf_net_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = brnf_bridge_table,
-   },
-   { .ctl_name = 0 }
+static struct ctl_path brnf_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "bridge", .ctl_name = NET_BRIDGE, },
+   { }
 };
 #endif
 
@@ -963,7 +949,7 @@ int __init br_netfilter_init(void)
if (ret < 0)
return ret;
 #ifdef CONFIG_SYSCTL
-   brnf_sysctl_header = register_sysctl_table(brnf_net_table);
+   brnf_sysctl_header = register_sysctl_paths(brnf_path, brnf_table);
if (brnf_sysctl_header == NULL) {
printk(KERN_WARNING
   "br_netfilter: can't register to sysctl.\n");
diff --git a/net/dccp/sysctl.c b/net/dccp/sysctl.c
index c62c050..2129599 100644
--- a/net/dccp/sysctl.c
+++ b/net/dccp/sysctl.c
@@ -100,41 +100,19 @@ static struct ctl_table dccp_default_table[] = {
{ .ctl_name = 0, }
 };
 
-static struct ctl_table dccp_table[] = {
-   {
-   .ctl_name   = NET_DCCP_DEFAULT,
-   .procname   = "default",
-   .mode   = 0555,
-   .child  = dccp_default_table,
-   },
-   { .ctl_name = 0, },
-};
-
-static struct ctl_table dccp_dir_table[] = {
-   {
-   .ctl_name   = NET_DCCP,
-   .procname   = "dccp",
-   .mode   = 0555,
-   .child  = dccp_table,
-   },
-   { .ctl_name = 0, },
-};
-
-static struct ctl_table dccp_root_table[] = {
-   {
-   .ctl_name   = CTL_NET,
-   .procname   = "net",
-   .mode   = 0555,
-   .child  = dccp_dir_table,
-   },
-   { .ctl_name = 0, },
+static struct ctl_path dccp_path[] = {
+   { .procname = "net", .ctl_name = CTL_NET, },
+   { .procname = "dccp", .ctl_name = NET_DCCP, },
+   { .procname = "default", .ctl_name = NET_DCCP_DEFAULT, },
+   { }
 };
 
 static struct ctl_table_header *dccp_table_header;
 
 int __init dccp_sysctl_init(void)
 {
-   dccp_tabl

[PATCH net-2.6.25 0/6] Use ctl paths in the networking code.

2008-01-08 Thread Pavel Emelyanov
This set almost completes the ctl paths usage in the 
networking code. The first patches doing this were accepted 
in the last year :), but they tuned only core, ipv4 and ipv6.

I thought, that splitting this into many subsystem would
produce too many patches, so I splitted it so, that most 
subsystems that are patched in a very similar way are 
merged into one patch to make the review easier. Hope this 
is OK.

After this set the vmlinux size becomes almost 4Kb less (when
all patched files are built-in):

add/remove: 16/40 grow/shrink: 18/6 up/down: 486/-4394 (-3908)
function old new   delta
net_vs_ctl_path-  32 +32
dccp_path  -  32 +32
x25_path   -  24 +24
sctp_path  -  24 +24
rose_path  -  24 +24
nr_path-  24 +24
net_ipv6_ctl_path  -  24 +24
net_ipv4_ctl_path  -  24 +24
llc_path   -  24 +24
irda_path  -  24 +24
ipx_path   -  24 +24
dn_path-  24 +24
brnf_path  -  24 +24
ax25_path  -  24 +24
atalk_path -  24 +24
nf_ct_path -  16 +16
dn_dev_sysctl_register   164 173  +9
x25_register_sysctl   16  21  +5
sctp_sysctl_register  16  21  +5
rose_register_sysctl  16  21  +5
nr_register_sysctl16  21  +5
nf_conntrack_standalone_init 166 171  +5
llc_sysctl_init   24  29  +5
irda_sysctl_register  24  29  +5
ipx_register_sysctl   16  21  +5
ip_vs_lblcr_init  66  71  +5
ip_vs_control_init   209 214  +5
ip_queue_init288 293  +5
ip6_queue_init   288 293  +5
dn_register_sysctl16  21  +5
dccp_sysctl_init  24  29  +5
br_netfilter_init 91  96  +5
atalk_register_sysctl 16  21  +5
ax25_register_sysctl 294 290  -4
ax25_unregister_sysctl56  46 -10
nf_unregister_sysctl_table19   - -19
net_ipv4_path 48  24 -24
ipv6_ctl_path 24   - -24
path_free 27   - -27
nf_net_ipv4_netfilter_sysctl_path 88  32 -56
nf_net_netfilter_sysctl_path  88  24 -64
x25_root_table88   - -88
x25_dir_table 88   - -88
vs_root_table 88   - -88
sctp_root_table   88   - -88
sctp_net_table88   - -88
rose_root_table   88   - -88
rose_dir_table88   - -88
nr_root_table 88   - -88
nr_dir_table  88   - -88
nf_net_netfilter_table88   - -88
nf_net_ipv4_table 88   - -88
nf_net_ipv4_netfilter_table   88   - -88
nf_ct_net_table   88   - -88
llc_root_table88   - -88
llc_dir_table 88   - -88
lblcr_root_table  88   - -88
lblc_root_table   88   - -88
irda_root_table   88   - -88
irda_net_table88   - -88
ipx_root_table88   - -88
ipx_dir_table 88   - -88
dn_root_table 88   - -88
dn_dir_table  88   - -88
dccp_table88   - -88
dccp_root_table   88   - -88
dccp_dir_table88   - -88
brnf

[PATCH net-2.6.25] [BRIDGE] Remove unused macros from ebt_vlan.c

2008-01-08 Thread Rami Rosen
Hi,

 Remove two unused macros, INV_FLAG and SET_BITMASK
 from net/bridge/netfilter/ebt_vlan.c.

Regards,
Rami Rosen


Signed-off-by: Rami Rosen <[EMAIL PROTECTED]>
diff --git a/net/bridge/netfilter/ebt_vlan.c b/net/bridge/netfilter/ebt_vlan.c
index a43c697..0ddf749 100644
--- a/net/bridge/netfilter/ebt_vlan.c
+++ b/net/bridge/netfilter/ebt_vlan.c
@@ -37,9 +37,7 @@ MODULE_LICENSE("GPL");
 
 
 #define DEBUG_MSG(args...) if (debug) printk (KERN_DEBUG "ebt_vlan: " args)
-#define INV_FLAG(_inv_flag_) (info->invflags & _inv_flag_) ? "!" : ""
 #define GET_BITMASK(_BIT_MASK_) info->bitmask & _BIT_MASK_
-#define SET_BITMASK(_BIT_MASK_) info->bitmask |= _BIT_MASK_
 #define EXIT_ON_MISMATCH(_MATCH_,_MASK_) {if (!((info->_MATCH_ == 
_MATCH_)^!!(info->invflags & _MASK_))) return EBT_NOMATCH;}
 
 static int


[PATCH 2/4] [POWERPC][NET] ucc_geth_mii and users: get rid of device_type

2008-01-08 Thread Anton Vorontsov
device_type property is bogus, thus use proper compatible.

Also change compatible property to "fsl,ucc-mdio".

Per http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048388.html

Signed-off-by: Anton Vorontsov <[EMAIL PROTECTED]>
---
 arch/powerpc/boot/dts/mpc832x_mds.dts |3 +--
 arch/powerpc/boot/dts/mpc832x_rdb.dts |3 +--
 arch/powerpc/boot/dts/mpc836x_mds.dts |3 +--
 arch/powerpc/boot/dts/mpc8568mds.dts  |2 +-
 drivers/net/ucc_geth_mii.c|3 +++
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts 
b/arch/powerpc/boot/dts/mpc832x_mds.dts
index 010c8d9..cf41194 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -255,8 +255,7 @@
#address-cells = <1>;
#size-cells = <0>;
reg = <2320 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
phy3: [EMAIL PROTECTED] {
interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts 
b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 3a73134..09301c1 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -236,8 +236,7 @@
#address-cells = <1>;
#size-cells = <0>;
reg = <3120 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
phy00:[EMAIL PROTECTED] {
interrupt-parent = <&pic>;
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts 
b/arch/powerpc/boot/dts/mpc836x_mds.dts
index 2986860..3eb8f72 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -288,8 +288,7 @@
#address-cells = <1>;
#size-cells = <0>;
reg = <2120 18>;
-   device_type = "mdio";
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
phy0: [EMAIL PROTECTED] {
interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts 
b/arch/powerpc/boot/dts/mpc8568mds.dts
index 7440347..2bc147f 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -356,7 +356,7 @@
#address-cells = <1>;
#size-cells = <0>;
reg = <2120 18>;
-   compatible = "ucc_geth_phy";
+   compatible = "fsl,ucc-mdio";
 
/* These are the same PHYs as on
 * gianfar's MDIO bus */
diff --git a/drivers/net/ucc_geth_mii.c b/drivers/net/ucc_geth_mii.c
index df884f0..e3ba14a 100644
--- a/drivers/net/ucc_geth_mii.c
+++ b/drivers/net/ucc_geth_mii.c
@@ -256,6 +256,9 @@ static struct of_device_id uec_mdio_match[] = {
.type = "mdio",
.compatible = "ucc_geth_phy",
},
+   {
+   .compatible = "fsl,ucc-mdio",
+   },
{},
 };
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] [CCID2]: Add option-parsing code to process Ack Vectors at the HC-sender

2008-01-08 Thread Gerrit Renker
This is patch 2 in the set and uses the routines provided by the previous
patch to implement parsing of received Ack Vectors, replacing duplicate code.

Signed-off-by: Gerrit Renker <[EMAIL PROTECTED]>
---
 net/dccp/ccids/ccid2.c |  132 
 net/dccp/ccids/ccid2.h |2 +
 2 files changed, 46 insertions(+), 88 deletions(-)

--- a/net/dccp/ccids/ccid2.h
+++ b/net/dccp/ccids/ccid2.h
@@ -47,6 +47,7 @@ struct ccid2_seq {
  * @ccid2hctx_lastrtt -time RTT was last measured
  * @ccid2hctx_rpseq - last consecutive seqno
  * @ccid2hctx_rpdupack - dupacks since rpseq
+ * @ccid2hctx_parsed_ackvecs: list of Ack Vectors received on current skb
 */
 struct ccid2_hc_tx_sock {
u32 ccid2hctx_cwnd;
@@ -66,6 +67,7 @@ struct ccid2_hc_tx_sock {
int ccid2hctx_rpdupack;
unsigned long   ccid2hctx_last_cong;
u64 ccid2hctx_high_ack;
+   struct list_headccid2hctx_parsed_ackvecs;
 };
 
 struct ccid2_hc_rx_sock {
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -320,68 +320,6 @@ static void ccid2_hc_tx_packet_sent(struct sock *sk, int 
more, unsigned int len)
 #endif
 }
 
-/* XXX Lame code duplication!
- * returns -1 if none was found.
- * else returns the next offset to use in the function call.
- */
-static int ccid2_ackvector(struct sock *sk, struct sk_buff *skb, int offset,
-  unsigned char **vec, unsigned char *veclen)
-{
-   const struct dccp_hdr *dh = dccp_hdr(skb);
-   unsigned char *options = (unsigned char *)dh + dccp_hdr_len(skb);
-   unsigned char *opt_ptr;
-   const unsigned char *opt_end = (unsigned char *)dh +
-   (dh->dccph_doff * 4);
-   unsigned char opt, len;
-   unsigned char *value;
-
-   BUG_ON(offset < 0);
-   options += offset;
-   opt_ptr = options;
-   if (opt_ptr >= opt_end)
-   return -1;
-
-   while (opt_ptr != opt_end) {
-   opt   = *opt_ptr++;
-   len   = 0;
-   value = NULL;
-
-   /* Check if this isn't a single byte option */
-   if (opt > DCCPO_MAX_RESERVED) {
-   if (opt_ptr == opt_end)
-   goto out_invalid_option;
-
-   len = *opt_ptr++;
-   if (len < 3)
-   goto out_invalid_option;
-   /*
-* Remove the type and len fields, leaving
-* just the value size
-*/
-   len -= 2;
-   value   = opt_ptr;
-   opt_ptr += len;
-
-   if (opt_ptr > opt_end)
-   goto out_invalid_option;
-   }
-
-   switch (opt) {
-   case DCCPO_ACK_VECTOR_0:
-   case DCCPO_ACK_VECTOR_1:
-   *vec= value;
-   *veclen = len;
-   return offset + (opt_ptr - options);
-   }
-   }
-
-   return -1;
-
-out_invalid_option:
-   DCCP_BUG("Invalid option - this should not happen (previous parsing)!");
-   return -1;
-}
-
 static void ccid2_hc_tx_kill_rto_timer(struct sock *sk)
 {
struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk);
@@ -502,15 +440,30 @@ static void ccid2_congestion_event(struct sock *sk, 
struct ccid2_seq *seqp)
ccid2_change_l_ack_ratio(sk, hctx->ccid2hctx_cwnd);
 }
 
+static int ccid2_hc_tx_parse_options(struct sock *sk, unsigned char option,
+unsigned char len, u16 idx,
+unsigned char *value)
+{
+   struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk);
+
+   switch (option) {
+   case DCCPO_ACK_VECTOR_0:
+   return dccp_ackvec_parsed_add(&hctx->ccid2hctx_parsed_ackvecs,
+ value, len, 0);
+   case DCCPO_ACK_VECTOR_1:
+   return dccp_ackvec_parsed_add(&hctx->ccid2hctx_parsed_ackvecs,
+ value, len, 1);
+   }
+   return 0;
+}
+
 static void ccid2_hc_tx_packet_recv(struct sock *sk, struct sk_buff *skb)
 {
struct dccp_sock *dp = dccp_sk(sk);
struct ccid2_hc_tx_sock *hctx = ccid2_hc_tx_sk(sk);
+   struct dccp_ackvec_parsed *avp;
u64 ackno, seqno;
struct ccid2_seq *seqp;
-   unsigned char *vector;
-   unsigned char veclen;
-   int offset = 0;
int done = 0;
unsigned int maxincr = 0;
 
@@ -545,13 +498,13 @@ static void ccid2_hc_tx_packet_recv(struct sock *sk, 
struct sk_buff *skb)
}
 
/* check forward path congestion */
-   /* still didn't send out new data packets */
-   if (hctx->ccid2hctx_seqh == hctx->ccid2hctx_seqt)
-

[PATCH 1/3] [ACKVEC]: Schedule SyncAck when running out of space

2008-01-08 Thread Gerrit Renker
The problem with Ack Vectors is that

  i) their length is variable and can in principle grow quite large,
 ii) it is hard to predict exactly how large they will be.

 Due to the second point it seems not a good idea to reduce the MPS;
i particular when on average there is enough room for the Ack Vector
and an increase in length is momentarily due to some burst loss, after
which the Ack Vector returns to its normal/average length.

The solution taken by this patch to address the outstanding FIXME is
to schedule a separate Sync when running out of space on the skb, and to
log a warning into the syslog.

The mechanism can also be used for other out-of-band signalling: it does
quicker signalling than scheduling an Ack, since it does not need to wait
for new data.

 Additional Note:
 
 It is possible to lower MPS according to the average length of Ack Vectors;
 the following argues why this does not seem to be a good idea.

 When determining the average Ack Vector length, a moving-average is more
 useful than a normal average, since sudden peaks (burst losses) are better
 dampened. The Ack Vector buffer would have a field `av_avg_len' which tracks
 this moving average and MPS would be reduced by this value (plus 2 bytes for
 type/value for each full Ack Vector).

 However, this means that the MPS decreases in the middle of an established
 connection. For a user who has tuned his/her application to work with the
 MPS taken at the beginning of the connection this can be very counter-
 intuitive and annoying.

Signed-off-by: Gerrit Renker <[EMAIL PROTECTED]>
---
 include/linux/dccp.h |2 ++
 net/dccp/options.c   |   28 +++-
 net/dccp/output.c|8 
 3 files changed, 29 insertions(+), 9 deletions(-)

--- a/include/linux/dccp.h
+++ b/include/linux/dccp.h
@@ -475,6 +475,7 @@ struct dccp_ackvec;
  * @dccps_hc_rx_insert_options - receiver wants to add options when acking
  * @dccps_hc_tx_insert_options - sender wants to add options when sending
  * @dccps_server_timewait - server holds timewait state on close (RFC 4340, 
8.3)
+ * @dccps_sync_scheduled - flag which signals "send out-of-band message soon"
  * @dccps_xmit_timer - timer for when CCID is not ready to send
  * @dccps_syn_rtt - RTT sample from Request/Response exchange (in usecs)
  */
@@ -515,6 +516,7 @@ struct dccp_sock {
__u8dccps_hc_rx_insert_options:1;
__u8dccps_hc_tx_insert_options:1;
__u8dccps_server_timewait:1;
+   __u8dccps_sync_scheduled:1;
struct timer_list   dccps_xmit_timer;
 };
 
--- a/net/dccp/options.c
+++ b/net/dccp/options.c
@@ -426,7 +426,9 @@ static int dccp_insert_option_timestamp_echo(struct 
dccp_sock *dp,
 
 int dccp_insert_option_ackvec(struct sock *sk, struct sk_buff *skb)
 {
-   struct dccp_ackvec *av = dccp_sk(sk)->dccps_hc_rx_ackvec;
+   struct dccp_sock *dp = dccp_sk(sk);
+   struct dccp_ackvec *av = dp->dccps_hc_rx_ackvec;
+   struct dccp_skb_cb *dcb = DCCP_SKB_CB(skb);
const u16 buflen = dccp_ackvec_buflen(av);
/* Figure out how many options do we need to represent the ackvec */
const u16 nr_opts = DIV_ROUND_UP(buflen, DCCP_SINGLE_OPT_MAXLEN);
@@ -435,16 +437,24 @@ int dccp_insert_option_ackvec(struct sock *sk, struct 
sk_buff *skb)
const unsigned char *tail, *from;
unsigned char *to;
 
-   if (DCCP_SKB_CB(skb)->dccpd_opt_len + len > DCCP_MAX_OPT_LEN) {
-   /*
-* FIXME: when running out of option space while piggybacking on
-* DataAck, return 0 here and schedule a separate Ack instead.
-*/
+   if (dcb->dccpd_opt_len + len > DCCP_MAX_OPT_LEN) {
DCCP_WARN("Lacking space for %u bytes on %s packet\n", len,
- dccp_packet_name(DCCP_SKB_CB(skb)->dccpd_type));
+ dccp_packet_name(dcb->dccpd_type));
return -1;
}
-   DCCP_SKB_CB(skb)->dccpd_opt_len += len;
+   /*
+* Since Ack Vectors are variable-length, we can not always predict
+* their size. To catch exception cases where the space is running out
+* on the skb, a separate Sync is scheduled to carry the Ack Vector.
+*/
+   if (dcb->dccpd_opt_len + skb->len + len > dp->dccps_mss_cache) {
+   DCCP_WARN("No space left for Ack Vector (%u) on skb (%u+%u), "
+ "MPS=%u ==> reduce payload size?\n", len, skb->len,
+ dcb->dccpd_opt_len, dp->dccps_mss_cache);
+   dp->dccps_sync_scheduled = 1;
+   return 0;
+   }
+   dcb->dccpd_opt_len += len;
 
to   = skb_push(skb, len);
len  = buflen;
@@ -485,7 +495,7 @@ int dccp_insert_option_ackvec(struct sock *sk, struct 
sk_buff *skb)
/*
 * Each sent Ack Vector is recorded i

[PATCH 2/3] [ACKVEC]: Separate skb option parsing from CCID-specific code

2008-01-08 Thread Gerrit Renker
This patch replaces an almost identical reduplication of code; large parts of
dccp_parse_options() are repeated as ccid2_ackvector() in ccid2.c.

Two problems are involved, apart from the duplication:
 1. no separation of concerns: CCIDs should not need to be concerned with the
details of parsing header options;
 2. one can not assume that Ack Vectors appear as a contiguous area within an
skb, it is legal to insert other options and/or padding in between. The
current code would throw an error and stop reading in such a case. Not good.

Details:

 * received Ack Vectors are parsed by dccp_parse_options() alone, which passes
   the result on to the CCID-specific ccid_hc_tx_parse_options();
 * CCIDs interested in using/decoding Ack Vector information need to add code
   to receive parsed Ack Vectors via this interface;
 * a data structure, `struct dccp_ackvec_parsed' is provided, which arranges all
   Ack Vectors of a single skb into a list of parsed chunks;
 * a doubly-linked list was used since insertion needs to be at the tail end;
 * the associated list handling and list de-allocation routines.

Signed-off-by: Gerrit Renker <[EMAIL PROTECTED]>
---
 net/dccp/ackvec.c  |   28 
 net/dccp/ackvec.h  |   19 +++
 net/dccp/options.c |   17 ++---
 3 files changed, 57 insertions(+), 7 deletions(-)

--- a/net/dccp/ackvec.h
+++ b/net/dccp/ackvec.h
@@ -112,4 +112,23 @@ static inline bool dccp_ackvec_is_empty(const struct 
dccp_ackvec *av)
 {
return av->av_overflow == 0 && av->av_buf_head == av->av_buf_tail;
 }
+
+/**
+ * struct dccp_ackvec_parsed  -  Record offsets of Ack Vectors in skb
+ * @vec:   start of vector (offset into skb)
+ * @len:   length of @vec
+ * @nonce: whether @vec had an ECN nonce of 0 or 1
+ * @node:  FIFO - arranged in descending order of ack_ackno
+ * This structure is used by CCIDs to access Ack Vectors in a received skb.
+ */
+struct dccp_ackvec_parsed {
+   u8   *vec,
+len,
+nonce:1;
+   struct list_head node;
+};
+
+extern int dccp_ackvec_parsed_add(struct list_head *head,
+ u8 *vec, u8 len, u8 nonce);
+extern void dccp_ackvec_parsed_cleanup(struct list_head *parsed_chunks);
 #endif /* _ACKVEC_H */
--- a/net/dccp/options.c
+++ b/net/dccp/options.c
@@ -135,13 +135,6 @@ int dccp_parse_options(struct sock *sk, struct 
dccp_request_sock *dreq,
if (rc)
goto out_invalid_option;
break;
-   case DCCPO_ACK_VECTOR_0:
-   case DCCPO_ACK_VECTOR_1:
-   if (dccp_packet_without_ack(skb))   /* RFC 4340, 11.4 */
-   break;
-   dccp_pr_debug("%s Ack Vector (len=%u)\n", dccp_role(sk),
- len);
-   break;
case DCCPO_TIMESTAMP:
if (len != 4)
goto out_invalid_option;
@@ -229,6 +222,16 @@ int dccp_parse_options(struct sock *sk, struct 
dccp_request_sock *dreq,
 value) != 0)
goto out_invalid_option;
break;
+   case DCCPO_ACK_VECTOR_0:
+   case DCCPO_ACK_VECTOR_1:
+   if (dccp_packet_without_ack(skb))   /* RFC 4340, 11.4 */
+   break;
+   /*
+* Ack vectors are processed by the TX CCID if it is
+* interested. The RX CCID need not parse Ack Vectors,
+* since it is only interested in clearing old state.
+* Fall through.
+*/
case DCCPO_MIN_TX_CCID_SPECIFIC ... DCCPO_MAX_TX_CCID_SPECIFIC:
if (ccid_hc_tx_parse_options(dp->dccps_hc_tx_ccid, sk,
 opt, len, value - options,
--- a/net/dccp/ackvec.c
+++ b/net/dccp/ackvec.c
@@ -345,6 +345,34 @@ free_records:
}
 }
 
+/*
+ * Routines to keep track of Ack Vectors received in an skb
+ */
+int dccp_ackvec_parsed_add(struct list_head *head, u8 *vec, u8 len, u8 nonce)
+{
+   struct dccp_ackvec_parsed *new = kmalloc(sizeof(*new), GFP_ATOMIC);
+
+   if (new == NULL)
+   return -ENOBUFS;
+   new->vec   = vec;
+   new->len   = len;
+   new->nonce = nonce;
+
+   list_add_tail(&new->node, head);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(dccp_ackvec_parsed_add);
+
+void dccp_ackvec_parsed_cleanup(struct list_head *parsed_chunks)
+{
+   struct dccp_ackvec_parsed *cur, *next;
+
+   list_for_each_entry_safe(cur, next, parsed_chunks, node)
+   kfree(cur);
+   INIT_LIST_HEAD(parsed_chunks);
+}
+EXPORT_SYMBOL_GPL(dccp_ackvec_parsed_clea

[DCCP] [PATCH 0/3]: Finishing Ack Vector patch set

2008-01-08 Thread Gerrit Renker
This is the "new stuff" for Ack Vectors, completing the Ack Vector work.

All other patches are as before, with the single exception of the update sent
yesterday (the recovery strategy for dealing with suddenly large losses).

Arnaldo, can you please indicate whether I should resubmit the older patches.

After some more testing I am positive that these are now ready to be considered.

Patch #1: Addresses the pending FIXME, if Ack Vectors grow too large for a 
packet,
  their variable-length information is transported via a 
separate/scheduled
  Sync. This mechanism can also be used for other out-of-band signalling
  (e.g. slow-receiver option).

Patch #2: Since there is already a rich option-parsing infrastructure in 
options.c,
  it is not necessary to replicate the code to parse Ack Vectors. This 
patch
  adds utility routines to store parsed Ack Vectors, which are 
processed by
  the existing infrastructure via the hc_tx_parse_options() interface.

Patch #3: Integrates patch #3 for use with CCID2. 

All patches have been uploaded to git://eden-feed.erg.abdn.ac.uk/dccp_exp
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread linux
I take that back.  This patch does NOT fix the leak, at least if
ping: sendmsg: No buffer space available
is any indication...

I think I was reading slabinfo wrong.
kmalloc-2048   42111  42112   204842 : tunables000 : 
slabdata  10528  10528  0

Sorry for the false hope.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak

2008-01-08 Thread linux
Prompted by davem, this attempt at fixing the memory leak
actually appears to work.  At least, leaving ping -f -s1472 -l64
running doesn't drop packets and doesn't show up in /proc/slabinfo.
---
diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index dbd23bb..a0dfba5 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -1110,10 +1110,9 @@ enum {
Frame_WithStart_WithEnd = 11
 };
 
-inline void ipg_nic_rx_free_skb(struct net_device *dev)
+inline void ipg_nic_rx_free_skb(struct net_device *dev, unsigned entry)
 {
struct ipg_nic_private *sp = netdev_priv(dev);
-   unsigned int entry = sp->rx_current % IPG_RFDLIST_LENGTH;
 
if (sp->RxBuff[entry]) {
struct ipg_rx *rxfd = sp->rxd + entry;
@@ -1308,7 +1307,7 @@ static void ipg_nic_rx_with_end(struct net_device *dev,
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
 
-   ipg_nic_rx_free_skb(dev);
+   ipg_nic_rx_free_skb(dev, entry);
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
jumbo->FoundStart = 0;
@@ -1337,7 +1336,7 @@ static void ipg_nic_rx_no_start_no_end(struct net_device 
*dev,
}
}
dev->last_rx = jiffies;
-   ipg_nic_rx_free_skb(dev);
+   ipg_nic_rx_free_skb(dev, entry);
}
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] drivers/net/ipg.c: convert Jumbo.FoundStart to bool

2008-01-08 Thread linux
This is a fairly basic code cleanup that annoyed me while working
on the first patch.
---
diff --git a/drivers/net/ipg.h b/drivers/net/ipg.h
index d5d092c..5d7cc84 100644
--- a/drivers/net/ipg.h
+++ b/drivers/net/ipg.h
@@ -789,11 +789,6 @@ struct ipg_rx {
__le64 frag_info;
 };
 
-struct SJumbo {
-   int FoundStart;
-   int CurrentSize;
-   struct sk_buff *skb;
-};
 /* Structure of IPG NIC specific data. */
 struct ipg_nic_private {
void __iomem *ioaddr;
@@ -809,7 +804,11 @@ struct ipg_nic_private {
unsigned int rx_dirty;
 // Add by Grace 2005/05/19
 #ifdef JUMBO_FRAME
-   struct SJumbo Jumbo;
+   struct SJumbo {
+   bool FoundStart;
+   int CurrentSize;
+   struct sk_buff *skb;
+   } Jumbo;
 #endif
unsigned int rx_buf_sz;
struct pci_dev *pdev;
diff --git a/drivers/net/ipg.h b/drivers/net/ipg.h
index d5d092c..5d7cc84 100644
--- a/drivers/net/ipg.h
+++ b/drivers/net/ipg.h
@@ -789,11 +789,6 @@ struct ipg_rx {
__le64 frag_info;
 };
 
-struct SJumbo {
-   int FoundStart;
-   int CurrentSize;
-   struct sk_buff *skb;
-};
 /* Structure of IPG NIC specific data. */
 struct ipg_nic_private {
void __iomem *ioaddr;
@@ -809,7 +804,11 @@ struct ipg_nic_private {
unsigned int rx_dirty;
 // Add by Grace 2005/05/19
 #ifdef JUMBO_FRAME
-   struct SJumbo Jumbo;
+   struct SJumbo {
+   bool FoundStart;
+   int CurrentSize;
+   struct sk_buff *skb;
+   } Jumbo;
 #endif
unsigned int rx_buf_sz;
struct pci_dev *pdev;
diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
index a0dfba5..3860fcd 100644
--- a/drivers/net/ipg.c
+++ b/drivers/net/ipg.c
@@ -1206,7 +1206,7 @@ static void ipg_nic_rx_with_start_and_end(struct 
net_device *dev,
 
if (jumbo->FoundStart) {
IPG_DEV_KFREE_SKB(jumbo->skb);
-   jumbo->FoundStart = 0;
+   jumbo->FoundStart = false;
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
}
@@ -1257,7 +1257,7 @@ static void ipg_nic_rx_with_start(struct net_device *dev,
 
skb_put(skb, IPG_RXFRAG_SIZE);
 
-   jumbo->FoundStart = 1;
+   jumbo->FoundStart = true;
jumbo->CurrentSize = IPG_RXFRAG_SIZE;
jumbo->skb = skb;
 
@@ -1303,14 +1303,14 @@ static void ipg_nic_rx_with_end(struct net_device *dev,
}
 
dev->last_rx = jiffies;
-   jumbo->FoundStart = 0;
+   jumbo->FoundStart = false;
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
 
ipg_nic_rx_free_skb(dev, entry);
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
-   jumbo->FoundStart = 0;
+   jumbo->FoundStart = false;
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
}
@@ -1340,7 +1340,7 @@ static void ipg_nic_rx_no_start_no_end(struct net_device 
*dev,
}
} else {
IPG_DEV_KFREE_SKB(jumbo->skb);
-   jumbo->FoundStart = 0;
+   jumbo->FoundStart = false;
jumbo->CurrentSize = 0;
jumbo->skb = NULL;
}
@@ -1840,7 +1840,7 @@ static int ipg_nic_open(struct net_device *dev)
 
 #ifdef JUMBO_FRAME
/* initialize JUMBO Frame control variable */
-   sp->Jumbo.FoundStart = 0;
+   sp->Jumbo.FoundStart = false;
sp->Jumbo.CurrentSize = 0;
sp->Jumbo.skb = 0;
dev->mtu = IPG_TXFRAG_SIZE;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >