Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote:
 From: Jarek Poplawski [EMAIL PROTECTED]
 Date: Thu, 11 Jan 2007 08:24:28 +0100
 
  Yesterday I did what I should do earlier - checked
  this simple way, with printk, and now I have no doubts
  it's a bug: if you add or remove vlan devices with
  vconfig, register_vlan_device and unregister_vlan_dev
  are called by ioctl and they use and change rcu
  protected data without preemption disabled so vlan
  rcu hash lists could become corrupted or find results
  could be wrong.
 
 Those two operations do their modifications and changes under the RTNL
 semaphore, via rtnl_lock() and rtnl_unlock() which guarantees that no
 other modifications can occur.

Sure, but is this even legal to be preempted during
reading or modifying rcu list? Doesn't this disturb
rcu cycle and make possible memory release problems?

Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 09:29:58AM +0100, Jarek Poplawski wrote:
 On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote:
  From: Jarek Poplawski [EMAIL PROTECTED]
  Date: Thu, 11 Jan 2007 08:24:28 +0100
  
   Yesterday I did what I should do earlier - checked
   this simple way, with printk, and now I have no doubts
   it's a bug: if you add or remove vlan devices with
   vconfig, register_vlan_device and unregister_vlan_dev
   are called by ioctl and they use and change rcu
   protected data without preemption disabled so vlan
   rcu hash lists could become corrupted or find results
   could be wrong.
  
  Those two operations do their modifications and changes under the RTNL
  semaphore, via rtnl_lock() and rtnl_unlock() which guarantees that no
  other modifications can occur.
 
 Sure, but is this even legal to be preempted during

I should even say:

... is this even legal to be blocked during ...

 reading or modifying rcu list? Doesn't this disturb
 rcu cycle and make possible memory release problems?

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 09:35:26AM +0100, Jarek Poplawski wrote:
 On Thu, Jan 11, 2007 at 09:29:58AM +0100, Jarek Poplawski wrote:
  On Wed, Jan 10, 2007 at 11:40:35PM -0800, David Miller wrote:
   From: Jarek Poplawski [EMAIL PROTECTED]
   Date: Thu, 11 Jan 2007 08:24:28 +0100
   
Yesterday I did what I should do earlier - checked
this simple way, with printk, and now I have no doubts
it's a bug: if you add or remove vlan devices with
vconfig, register_vlan_device and unregister_vlan_dev
are called by ioctl and they use and change rcu
protected data without preemption disabled so vlan
rcu hash lists could become corrupted or find results
could be wrong.
   
   Those two operations do their modifications and changes under the RTNL
   semaphore, via rtnl_lock() and rtnl_unlock() which guarantees that no
   other modifications can occur.
  
  Sure, but is this even legal to be preempted during
 
 I should even say:
 
 ... is this even legal to be blocked during ...
 
  reading or modifying rcu list? Doesn't this disturb
  rcu cycle and make possible memory release problems?

Sorry, one more time:

Sure, but is this even legal to be preempted during
reading or modifying rcu list or be blocked while 
holding rcu protected pointer? Doesn't this disturb
rcu cycle and make possible memory release problems?

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-11 Thread Jarek Poplawski
On Thu, Jan 11, 2007 at 01:27:55AM -0800, David Miller wrote:
 From: Jarek Poplawski [EMAIL PROTECTED]
 Date: Thu, 11 Jan 2007 09:39:34 +0100
 
  Sure, but is this even legal to be preempted during
  reading or modifying rcu list or be blocked while 
  holding rcu protected pointer? Doesn't this disturb
  rcu cycle and make possible memory release problems?
 
 It's fine in this case.
 
 Since the list cannot be changed by anyone else, and the hash linked
 list (as seen by readers) is modified atomically by a single store, it
 all works out.
 
 Readers only look at foo->next in the hash traversal.  Since the
 preceding element cannot change outside of the current writer,
 the ->next pointer to update is protected.
 
 Readers therefore will either see the hash list with the entry or
 without.
 
 We then use call_rcu() to make sure any reading threads that happened
 to get a glimpse of the hash entry before the hlist_del_rcu()
 completed will go away and drop their references before we free that
 entry.
 
 I really don't see any problem here. :-)

Probably because you care more about internals and less
about docs examples. It seems I care too much about regulations.
 
OK, I take your word and will try to stop annoying this list
with imagined RCU bugs, sorry.

Thanks for your precious (sleeping?) time
and explanations. Best regards,

Jarek P.
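
[Editor's sketch] The pattern David describes above (one writer at a time,
readers that only follow ->next without taking a lock, and an unlink done
by a single store) can be modeled in userspace. This is a minimal sketch
under stated assumptions, not the vlan.c code: a pthread mutex stands in
for the RTNL semaphore, C11 release/acquire atomics stand in for
rcu_assign_pointer()/rcu_dereference(), and struct entry, entry_add(),
entry_del() and entry_find() are hypothetical names. The call_rcu() grace
period is not modeled at all; the sketch simply never frees a removed entry.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; the kernel code uses struct hlist_node. */
struct entry {
    int vlan_id;
    _Atomic(struct entry *) next;
};

static _Atomic(struct entry *) head;

/* Assumption: this mutex plays the role of rtnl_lock()/rtnl_unlock(). */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: fill in the new entry first, then publish it with one store. */
void entry_add(struct entry *e)
{
    assert(e != NULL);
    pthread_mutex_lock(&writer_lock);
    atomic_store_explicit(&e->next,
                          atomic_load_explicit(&head, memory_order_relaxed),
                          memory_order_relaxed);
    atomic_store_explicit(&head, e, memory_order_release);
    pthread_mutex_unlock(&writer_lock);
}

/* Writer: unlink by a single store to the predecessor's link.
 * e->next is left untouched so a reader already standing on e can
 * still walk forward. Returns 1 if an entry was unlinked, else 0. */
int entry_del(int vlan_id)
{
    int found = 0;
    _Atomic(struct entry *) *link = &head;
    struct entry *e;

    pthread_mutex_lock(&writer_lock);
    while ((e = atomic_load_explicit(link, memory_order_relaxed)) != NULL) {
        if (e->vlan_id == vlan_id) {
            atomic_store_explicit(link,
                                  atomic_load_explicit(&e->next,
                                                       memory_order_relaxed),
                                  memory_order_release);
            found = 1;
            break;
        }
        link = &e->next;
    }
    pthread_mutex_unlock(&writer_lock);
    return found;
}

/* Reader: no lock taken; it sees the list either with the entry or without. */
struct entry *entry_find(int vlan_id)
{
    struct entry *e = atomic_load_explicit(&head, memory_order_acquire);

    while (e != NULL && e->vlan_id != vlan_id)
        e = atomic_load_explicit(&e->next, memory_order_acquire);
    return e;
}
```

Because only the lock holder ever stores to a link, the writer may sleep or
be preempted between steps without readers observing a half-updated list.
What the sketch leaves out is the part call_rcu() handles in the kernel:
deciding when a removed entry may safely be freed.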


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Tue, Jan 09, 2007 at 09:10:45AM +0100, Jarek Poplawski wrote:
 On Mon, Jan 08, 2007 at 10:03:50AM -0800, Stephen Hemminger wrote:
...
 * Must be invoked with RCU read lock (no preempt)
 */
struct net_device *__find_vlan_dev(struct net_device *real_dev,
...
   
But later in this file no sign of disabling preemption
for these calls and for hlist_add_head_rcu and hlist_del_rcu.
   
I can't imagine how this works?
  
  Preempt is already disabled on the receive path.
 
 I'm not sure you're talking about the same thing -

Hello Stephen,

It looks like you're talking about the right thing
and I'm a fool again! Now I try to find why I even 
had to pay for this. I read again and again adequate
chapters from R. Love and C. Benvenuti's books, see
a lot about kernel preemption in 2.6, but can't see
anything about preemption disabled in ioctls - maybe
I'm blind or they are badly translated. Now I look
into Linux Device Drivers, see ch. 6 about ioctls,
blocking I/O and RCU, but nothing about preemption
disabled again. Maybe this is omitted because it's
obvious to people who started hacking with earlier
kernels?

When I added to this things like: "If the mutex is
not available right now, it will sleep until it can
get it." and "It is illegal to block while in an RCU
read-side critical section." I didn't even try to
think about mutex or malloc with GFP_KERNEL inside
RCU block.
 
I'm enormously grateful you didn't lose patience
in guiding me yet - I hope it'll save this list from
nervous breakdown.

Many thanks and regards as always,

Jarek P.

PS: probably you could profit from this some day
and write something like "Linux Internals for
Dummies" - it would be simple cut & paste of my
discoveries and your responses!


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 10:04:11AM +0100, Jarek Poplawski wrote:
...
 It looks like you're talking about the right thing
 and I'm a fool again! Now I try to find why I even 
 had to pay for this. I read again and again adequate
 chapters from R. Love and C. Benvenuti's books, see
 a lot about kernel preemption in 2.6, but can't see
 anything about preemption disabled in ioctls - maybe
 I'm blind or they are badly translated. Now I look
 into Linux Device Drivers, see ch. 6 about ioctls,
 blocking I/O and RCU, but nothing about preemption
 disabled again. Maybe this is omitted because it's
 obvious to people who started hacking with earlier
 kernels?

... or maybe it's even more complicated...

For the time being, I revoke my critique of these books.

Jarek P. 


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Stephen Hemminger
On Wed, 10 Jan 2007 13:50:48 +0100
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On Wed, Jan 10, 2007 at 10:04:11AM +0100, Jarek Poplawski wrote:
 ...
  It looks like you're talking about the right thing
  and I'm a fool again! Now I try to find why I even 
  had to pay for this. I read again and again adequate
  chapters from R. Love and C. Benvenuti's books, see
  a lot about kernel preemption in 2.6, but can't see
  anything about preemption disabled in ioctls - maybe
  I'm blind or they are badly translated. Now I look
  into Linux Device Drivers, see ch. 6 about ioctls,
  blocking I/O and RCU, but nothing about preemption
  disabled again. Maybe this is omitted because it's
  obvious to people who started hacking with earlier
  kernels?
 
 ... or maybe it's even more complicated...
 
 For the time being, I revoke my critique of these books.
 
 Jarek P. 

Don't rely on books too heavily, they can get out of date
with a simple code change.

The path that I am talking about is the receive skb path. All data
received goes through netif_receive_skb and that does rcu_read_lock().
This is done so that receive protocol list can be used with RCU (lock
free). Since receiving is a time critical path, we want to process
without having to do any locked operations; locked operations force a
processor cache miss and are one of the main CPU overheads.
RCU requires no locked operation, but does prevent preemption.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread Jarek Poplawski
On Wed, Jan 10, 2007 at 12:01:23PM -0800, Stephen Hemminger wrote:
...
 Don't rely on books too heavily, they can get out of date
 with a simple code change.

I've tried to find this in the code at the beginning
and got misled by the path with PREEMPT_BKL.
I think the books are necessary to get general ideas
and I tried to check why would I get so wrong ideas.

 The path that I am talking about is the receive skb path. All data
 received goes through netif_receive_skb and that does rcu_read_lock().
 This is done so that receive protocol list can be used with RCU (lock
 free). Since receiving is a time critical path, we want to process
 without having to do any locked operations; locked operations force a
 processor cache miss and are one of the main CPU overheads.
 RCU requires no locked operation, but does prevent preemption.

I again think we talk about different subjects. Maybe
it's because of this thread - but I don't talk about
Ben's original problem no more - it's a problem of
linux vlans.

Yesterday I did what I should do earlier - checked
this simple way, with printk, and now I have no doubts
it's a bug: if you add or remove vlan devices with
vconfig, register_vlan_device and unregister_vlan_dev
are called by ioctl and they use and change rcu
protected data without preemption disabled so vlan
rcu hash lists could become corrupted or find results
could be wrong.

Regards,
Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-10 Thread David Miller
From: Jarek Poplawski [EMAIL PROTECTED]
Date: Thu, 11 Jan 2007 08:24:28 +0100

 Yesterday I did what I should do earlier - checked
 this simple way, with printk, and now I have no doubts
 it's a bug: if you add or remove vlan devices with
 vconfig, register_vlan_device and unregister_vlan_dev
 are called by ioctl and they use and change rcu
 protected data without preemption disabled so vlan
 rcu hash lists could become corrupted or find results
 could be wrong.

Those two operations do their modifications and changes under the RTNL
semaphore, via rtnl_lock() and rtnl_unlock() which guarantees that no
other modifications can occur.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-09 Thread Jarek Poplawski
On Mon, Jan 08, 2007 at 10:03:50AM -0800, Stephen Hemminger wrote:
 On Mon, 08 Jan 2007 08:57:10 -0800
 Ben Greear [EMAIL PROTECTED] wrote:
 
  Jarek Poplawski wrote:
   On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote:
   ...
 
   So, I do believe this was the problem we were hitting, and it seems 
   fixed.
   
  
   Congratulations!
  
   But I can see one strange thing in vlan.c:
  
   /* Must be invoked with RCU read lock (no preempt) */
   static struct vlan_group *__vlan_find_group(int real_dev_ifindex)
   ...
* Must be invoked with RCU read lock (no preempt)
*/
   struct net_device *__find_vlan_dev(struct net_device *real_dev,
   ...
  
   But later in this file no sign of disabling preemption
   for these calls and for hlist_add_head_rcu and hlist_del_rcu.
  
   I can't imagine how this works?
 
 Preempt is already disabled on the receive path.

I'm not sure you're talking about the same thing -
there is blocking possible inside register_vlan_dev
and unregister_vlan_dev, grp pointer is held during
this blocking - I've thought it's only possible in
sleepable RCU...

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-08 Thread Ben Greear

Jarek Poplawski wrote:

On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote:
...
  

So, I do believe this was the problem we were hitting, and it seems fixed.



Congratulations!

But I can see one strange thing in vlan.c:

/* Must be invoked with RCU read lock (no preempt) */
static struct vlan_group *__vlan_find_group(int real_dev_ifindex)
...
 * Must be invoked with RCU read lock (no preempt)
 */
struct net_device *__find_vlan_dev(struct net_device *real_dev,
...

But later in this file no sign of disabling preemption
for these calls and for hlist_add_head_rcu and hlist_del_rcu.

I can't imagine how this works?
  

Perhaps...I didn't RCU-ify VLANs, but I can take a look.

For the record, the soft lockup was using MAC-VLANs, not 802.1Q VLANs,
so it wouldn't have been affected by bugs in VLANs one way or the other.

Ben

Jarek P. 
  



--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-08 Thread Stephen Hemminger
On Mon, 08 Jan 2007 08:57:10 -0800
Ben Greear [EMAIL PROTECTED] wrote:

 Jarek Poplawski wrote:
  On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote:
  ...

  So, I do believe this was the problem we were hitting, and it seems fixed.
  
 
  Congratulations!
 
  But I can see one strange thing in vlan.c:
 
  /* Must be invoked with RCU read lock (no preempt) */
  static struct vlan_group *__vlan_find_group(int real_dev_ifindex)
  ...
   * Must be invoked with RCU read lock (no preempt)
   */
  struct net_device *__find_vlan_dev(struct net_device *real_dev,
  ...
 
  But later in this file no sign of disabling preemption
  for these calls and for hlist_add_head_rcu and hlist_del_rcu.
 
  I can't imagine how this works?

Preempt is already disabled on the receive path.


 Perhaps...I didn't RCU-ify VLANs, but I can take a look.
 
 For the record, the soft lockup was using MAC-VLANs, not 802.1Q VLANs, 
 so it wouldn't
 have been affected by bugs in VLANs one way or the other.
 
 Ben
 
  Jarek P. 

 
 


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-07 Thread Jarek Poplawski
On Fri, Jan 05, 2007 at 12:33:43PM -0800, Ben Greear wrote:
...
 So, I do believe this was the problem we were hitting, and it seems fixed.

Congratulations!

But I can see one strange thing in vlan.c:

/* Must be invoked with RCU read lock (no preempt) */
static struct vlan_group *__vlan_find_group(int real_dev_ifindex)
...
 * Must be invoked with RCU read lock (no preempt)
 */
struct net_device *__find_vlan_dev(struct net_device *real_dev,
...

But later in this file no sign of disabling preemption
for these calls and for hlist_add_head_rcu and hlist_del_rcu.

I can't imagine how this works?

Jarek P. 


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-05 Thread Herbert Xu
On Fri, Jan 05, 2007 at 07:38:44AM +0100, Jarek Poplawski wrote:
 
 I'd only suggest to change "goto out;" to
 "return NULL;" at the end of inetdev_init because
 now RCU is engaged unnecessarily.

I agree.  The RCU assignment should come before the out label.
Can you send a patch?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
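
[Editor's sketch] The control-flow point in this exchange (publish the
pointer before the shared "out" label, so the failure path returns NULL
without touching it) can be illustrated with a small userspace model.
This is a sketch under assumptions, not inetdev_init() itself:
struct in_dev_sketch, struct net_dev_sketch, publish(), sketch_init() and
the simulate_failure knob are all hypothetical, and a C11 release store
stands in for rcu_assign_pointer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct in_device and struct net_device. */
struct in_dev_sketch {
    int refcnt;
};

struct net_dev_sketch {
    _Atomic(struct in_dev_sketch *) ip_ptr;
    int publish_count;  /* counts stores to ip_ptr, for illustration only */
};

/* Stand-in for rcu_assign_pointer(): a release store plus a counter. */
static void publish(struct net_dev_sketch *dev, struct in_dev_sketch *in_dev)
{
    atomic_store_explicit(&dev->ip_ptr, in_dev, memory_order_release);
    dev->publish_count++;
}

/* simulate_failure is a hypothetical knob that models a failed allocation. */
struct in_dev_sketch *sketch_init(struct net_dev_sketch *dev,
                                  int simulate_failure)
{
    struct in_dev_sketch *in_dev;

    assert(dev != NULL);
    in_dev = simulate_failure ? NULL : calloc(1, sizeof(*in_dev));
    if (in_dev == NULL)
        goto out;           /* failure: nothing to publish */

    in_dev->refcnt = 1;     /* finish all setup first */
    publish(dev, in_dev);   /* before the label: success path only */
out:
    return in_dev;
}
```

With the store placed after the label instead, the failure path would also
perform a publishing store of NULL, which is the unnecessary work being
pointed out here.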


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-05 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 09:04:29AM -0800, Ben Greear wrote:
 Jarek Poplawski wrote:
 On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote:
   
 On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote:
 
 Could you explain? I can see some inet_rtm_newaddr
 interrupted. For me it could be e.g.:
 
 after
 vconfig add eth0 9
 
 ip addr add dev eth0.9 ...
   
 Whether eth0.9 is up or not does not affect this at all.  The spin
 locks are initialised (and used) when the first IPv4 address is added,
 not when the device comes up.
 
 
 I understand this. I consider IFF_UP as a sign all 
 initialisations (open functions including) are
 completed and there is permission for working (so
 logically, if I would do eth0.9 down all traffic
 should be stopped, what probably isn't true now).
   
 It is certainly valid for an interface to be IF_UP, but have no IP 
 address.  My application
 does bring the network device up before it assigns the IP, for instance.

Yes, but I think in any case it isn't races safe
now with vlans. I thought more about the reverse
situation where skb->dev !IFF_UP could be
unnecessarily processed. But the same should be
valid according to the rest of initializations
which are done during address assigning. 

 There may be other issues with IF_UP, but that could be handled with a 
 different
 investigation.  If you have a particular test case that fails with 
 802.1Q VLANs, then
 I will be happy to work on it...

Sorry, I even didn't use this yet... 

Wish you sunny weekend,

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 07:29:30PM +1100, Herbert Xu wrote:
 On Thu, Jan 04, 2007 at 09:03:51AM +0100, Jarek Poplawski wrote:
  
  I doubt this is the right solution. It certainly
  could fix this particular situation but my main
  point was packets shouldn't get into kernel
  receive queues with skb->dev not IFF_UP.
 
 I think you misunderstood.  The device certainly is IFF_UP.  What
 happens is that the multicast spin locks are set up too late:

Could you explain? I can see some inet_rtm_newaddr
interrupted. For me it could be e.g.:

after
vconfig add eth0 9

ip addr add dev eth0.9 ...

before
ip link set dev eth0.9 up

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Herbert Xu
On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote:
 
 Could you explain? I can see some inet_rtm_newaddr
 interrupted. For me it could be e.g.:
 
 after
 vconfig add eth0 9
 
 ip addr add dev eth0.9 ...

Whether eth0.9 is up or not does not affect this at all.  The spin
locks are initialised (and used) when the first IPv4 address is added,
not when the device comes up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote:
 On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote:
  
  Could you explain? I can see some inet_rtm_newaddr
  interrupted. For me it could be e.g.:
  
  after
  vconfig add eth0 9
  
  ip addr add dev eth0.9 ...
 
 Whether eth0.9 is up or not does not affect this at all.  The spin
 locks are initialised (and used) when the first IPv4 address is added,
 not when the device comes up.

I understand this. I consider IFF_UP as a sign all 
initialisations (open functions including) are
completed and there is permission for working (so
logically, if I would do eth0.9 down all traffic
should be stopped, what probably isn't true now).

Jarek P. 


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Ben Greear

Jarek Poplawski wrote:

On Thu, Jan 04, 2007 at 09:27:07PM +1100, Herbert Xu wrote:
  

On Thu, Jan 04, 2007 at 09:50:14AM +0100, Jarek Poplawski wrote:


Could you explain? I can see some inet_rtm_newaddr
interrupted. For me it could be e.g.:

after
vconfig add eth0 9

ip addr add dev eth0.9 ...
  

Whether eth0.9 is up or not does not affect this at all.  The spin
locks are initialised (and used) when the first IPv4 address is added,
not when the device comes up.



I understand this. I consider IFF_UP as a sign all 
initialisations (open functions including) are

completed and there is permission for working (so
logically, if I would do eth0.9 down all traffic
should be stopped, what probably isn't true now).
  
It is certainly valid for an interface to be IF_UP, but have no IP
address.  My application does bring the network device up before it
assigns the IP, for instance.

There may be other issues with IF_UP, but that could be handled with a
different investigation.  If you have a particular test case that fails
with 802.1Q VLANs, then I will be happy to work on it...

Thanks,
Ben

Jarek P. 
  



--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Thu, 04 Jan 2007 17:26:27 +1100

 David Stevens [EMAIL PROTECTED] wrote:
 You're right, I don't know whether it'll fix the problem Ben saw
  or not, but it looks like the original code can do a receive before the
  in_device is fully initialized, and that, of course, is bad.
 If the device for ip_rcv() is not the same one we were
  initializing when the receive interrupted, then the patch should have
  no effect either way -- I don't think it'll hide other problems.
 If it's hard to reproduce (which I guess is true), then you're
  right, no soft lockup doesn't really tell us if it's fixed or not.
 
 Actually I missed your point that the multicast locks aren't even
 initialised at that point.  So this does explain the soft lock-up
 and therefore your patch is clearly the correct solution.

I agree too, therefore I've added David's patch as below.

I'll push this to the -stable branches as well.  This fix is
correct even if it does not entirely clear up the soft lockup
bug being discussed in this thread, but I think it will :-)

commit 30c4cf577fb5b68c16e5750d6bdbd7072e42b279
Author: David L Stevens [EMAIL PROTECTED]
Date:   Thu Jan 4 12:31:14 2007 -0800

[IPV4/IPV6]: Fix inet{,6} device initialization order.

It is important that we assign dev->ip{,6}_ptr
only after all portions of the inet{,6} are setup.

Otherwise we can receive packets before the multicast
spinlocks et al. are initialized.

Signed-off-by: David L Stevens [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 84bed40..25c8a42 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -165,9 +165,8 @@ struct in_device *inetdev_init(struct net_device *dev)
 			      NET_IPV4_NEIGH, "ipv4", NULL, NULL);
 #endif
 
-	/* Account for reference dev->ip_ptr */
+	/* Account for reference dev->ip_ptr (below) */
 	in_dev_hold(in_dev);
-	rcu_assign_pointer(dev->ip_ptr, in_dev);
 
 #ifdef CONFIG_SYSCTL
 	devinet_sysctl_register(in_dev, &in_dev->cnf);
@@ -176,6 +175,8 @@ struct in_device *inetdev_init(struct net_device *dev)
 	if (dev->flags & IFF_UP)
 		ip_mc_up(in_dev);
 out:
+	/* we can receive as soon as ip_ptr is set -- do this last */
+	rcu_assign_pointer(dev->ip_ptr, in_dev);
 	return in_dev;
 out_kfree:
 	kfree(in_dev);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 9b0a906..171e5b5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -413,8 +413,6 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev)
 	if (netif_carrier_ok(dev))
 		ndev->if_flags |= IF_READY;
 
-	/* protected by rtnl_lock */
-	rcu_assign_pointer(dev->ip6_ptr, ndev);
 
 	ipv6_mc_init_dev(ndev);
 	ndev->tstamp = jiffies;
@@ -425,6 +423,8 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev)
 			      NULL);
 	addrconf_sysctl_register(ndev, &ndev->cnf);
 #endif
+	/* protected by rtnl_lock */
+	rcu_assign_pointer(dev->ip6_ptr, ndev);
 	return ndev;
 }
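
[Editor's sketch] The ordering rule behind this patch (initialize
everything a receiver will touch, then make the pointer visible as the
last step) can be modeled in userspace. This is a minimal sketch under
assumptions, not kernel code: struct in_dev_sketch, struct net_dev_sketch,
sketch_inetdev_init() and sketch_ip_rcv() are hypothetical names, a
pthread mutex stands in for the multicast spinlock, and C11
release/acquire atomics stand in for rcu_assign_pointer()/rcu_dereference().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct in_device. */
struct in_dev_sketch {
    pthread_mutex_t mc_lock;   /* plays the role of the multicast lock */
    bool mc_ready;
};

/* Hypothetical stand-in for struct net_device. */
struct net_dev_sketch {
    _Atomic(struct in_dev_sketch *) ip_ptr;
};

/* Set up every field first; publish the pointer as the final step. */
struct in_dev_sketch *sketch_inetdev_init(struct net_dev_sketch *dev,
                                          struct in_dev_sketch *in_dev)
{
    assert(dev != NULL && in_dev != NULL);

    if (pthread_mutex_init(&in_dev->mc_lock, NULL) != 0)
        return NULL;           /* nothing was published */
    in_dev->mc_ready = true;

    /* The receive side can run as soon as ip_ptr is set: do this last. */
    atomic_store_explicit(&dev->ip_ptr, in_dev, memory_order_release);
    return in_dev;
}

/* Receive side: uses the lock only if the pointer is already visible. */
bool sketch_ip_rcv(struct net_dev_sketch *dev)
{
    struct in_dev_sketch *in_dev =
        atomic_load_explicit(&dev->ip_ptr, memory_order_acquire);
    bool ok;

    if (in_dev == NULL)
        return false;          /* not configured yet: drop */

    pthread_mutex_lock(&in_dev->mc_lock);
    ok = in_dev->mc_ready;
    pthread_mutex_unlock(&in_dev->mc_lock);
    return ok;
}
```

If the store to ip_ptr were moved above pthread_mutex_init(), a concurrent
sketch_ip_rcv() could lock an uninitialized mutex, which is the userspace
analogue of the receive path taking the multicast spinlocks before they
are set up.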
 


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-04 Thread Jarek Poplawski
On Thu, Jan 04, 2007 at 12:33:33PM -0800, David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Thu, 04 Jan 2007 17:26:27 +1100
 
  David Stevens [EMAIL PROTECTED] wrote:
  You're right, I don't know whether it'll fix the problem Ben saw
   or not, but it looks like the original code can do a receive before the
   in_device is fully initialized, and that, of course, is bad.
  If the device for ip_rcv() is not the same one we were
   initializing when the receive interrupted, then the patch should have
   no effect either way -- I don't think it'll hide other problems.
  If it's hard to reproduce (which I guess is true), then you're
   right, no soft lockup doesn't really tell us if it's fixed or not.
  
  Actually I missed your point that the multicast locks aren't even
  initialised at that point.  So this does explain the soft lock-up
  and therefore your patch is clearly the correct solution.
 
 I agree too, therefore I've added David's patch as below.
 
 I'll push this to the -stable branches as well.  This fix is
 correct even if it does not entirely clear up the soft lockup
 bug being discussed in this thread, but I think it will :-)

After rethinking I came to similar conclusion.  I've
thought the changes are done only to fix this particular
bug but now I see the previous order wasn't right
particularly considering RCU.

So, I apologize to David L Stevens for my harsh words.

I'd only suggest to change "goto out;" to
"return NULL;" at the end of inetdev_init because
now RCU is engaged unnecessarily.

Regards,
Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote:
 I've looked at this a little too -- it'd be nice to know who holds
 the write lock.

If you mean mc_list_lock - probably nobody - it's
not initialized (so the timers) for this in_device
and rtnl mutex is preempted by irq.

Actually I wonder if lockdep isn't masking (or even
spoiling) something, so I'd try with:
Lock debugging: ... options off
(CONFIG_DEBUG_SPINLOCK = y
CONFIG_DEBUG_LOCK_ALLOC = n).

Jarek P. 

PS: because of unknown changes from those patches
this is guessing only.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Jarek Poplawski
On Wed, Jan 03, 2007 at 09:07:11AM +0100, Jarek Poplawski wrote:
 On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote:
  I've looked at this a little too -- it'd be nice to know who holds
  the write lock.
 
 If you mean mc_list_lock - probably nobody - it's
 not initialized (so the timers) for this in_device

I should say: ... probably not initialized 

Jarek P. 


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear

Jarek Poplawski wrote:
 On Wed, Jan 03, 2007 at 09:07:11AM +0100, Jarek Poplawski wrote:
  On Tue, Jan 02, 2007 at 03:35:39PM -0800, David Stevens wrote:
   I've looked at this a little too -- it'd be nice to know who holds
   the write lock.
  
  If you mean mc_list_lock - probably nobody - it's
  not initialized (so the timers) for this in_device
 
 I should say: ... probably not initialized

That should print out the debugging when you access an un-initialized
lock, and I did not see that print-out in the logs.  I looked at the code
and could not explain how it could be accessed un-initialized, so I'm not
certain this is the problem.

If I can reproduce this in a controlled manner, I'll add debugging to
print out who is holding the lock (if anyone), as well as make sure it is
initialized before the blocking method initializes it.  It will likely be
a few days before we can set up something to reproduce it, however.

If you can explain any code path that could leave the lock uninitialized,
then that would be a big help...but it looked ok to me...

Ben

 Jarek P.
  



--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben & Jarek,
Your analysis looks correct to me. It seems to me the problem is that
we don't want the in_device to be searchable until after the
initialization is done.
What about moving the initialization of dev->ip_ptr in inetdev_init() to
after the out label?
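
A minimal userspace sketch of the ordering proposed above, not kernel code:
every field of the in_device is set up first and the pointer is published
last, so a receive path that loads the pointer sees either NULL or a fully
initialized object. The model_* names are hypothetical, and C11
release/acquire atomics stand in for rcu_assign_pointer() and
rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct in_device and struct net_device. */
struct model_in_device {
    int refcnt;
    int mc_lock_ready;   /* 1 once the multicast lock has been set up */
};

struct model_net_device {
    _Atomic(struct model_in_device *) ip_ptr;
};

/* A freshly registered device has no in_device published yet. */
static void model_netdev_init(struct model_net_device *dev)
{
    assert(dev != NULL);
    atomic_init(&dev->ip_ptr, NULL);
}

/* Initialize everything first, publish the pointer as the very last step. */
static void model_inetdev_init(struct model_net_device *dev,
                               struct model_in_device *in_dev)
{
    assert(dev != NULL && in_dev != NULL);
    in_dev->refcnt = 1;
    in_dev->mc_lock_ready = 1;   /* stands in for ip_mc_init_dev() */
    atomic_store_explicit(&dev->ip_ptr, in_dev, memory_order_release);
}

/* Receive side: returns 1 only if a fully initialized in_device is visible. */
static int model_rx_can_use(struct model_net_device *dev)
{
    struct model_in_device *in_dev;

    assert(dev != NULL);
    in_dev = atomic_load_explicit(&dev->ip_ptr, memory_order_acquire);
    if (in_dev == NULL)
        return 0;                /* not published yet: nothing to look at */
    return in_dev->mc_lock_ready;
}
```

With this ordering a receiver can never observe an in_device whose lock is
still zeroed; it simply finds no in_device until setup has finished.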

+-DLS



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben,
Here's a patch that I think will fix it, assuming the receive is on the
same device as the initialization. Can you try this out?

+-DLS
[inline for viewing, attached for applying]

Signed-off-by: David L Stevens [EMAIL PROTECTED]

diff -ruNp linux-2.6.19.1/net/ipv4/devinet.c linux-2.6.19.1T1/net/ipv4/devinet.c
--- linux-2.6.19.1/net/ipv4/devinet.c   2006-12-11 11:32:53.0 -0800
+++ linux-2.6.19.1T1/net/ipv4/devinet.c 2007-01-03 14:37:56.0 -0800
@@ -165,9 +165,8 @@ struct in_device *inetdev_init(struct ne
 			      NET_IPV4_NEIGH, "ipv4", NULL, NULL);
 #endif
 
-	/* Account for reference dev->ip_ptr */
+	/* Account for reference dev->ip_ptr (below) */
 	in_dev_hold(in_dev);
-	rcu_assign_pointer(dev->ip_ptr, in_dev);
 
 #ifdef CONFIG_SYSCTL
 	devinet_sysctl_register(in_dev, &in_dev->cnf);
@@ -176,6 +175,8 @@ struct in_device *inetdev_init(struct ne
 	if (dev->flags & IFF_UP)
 		ip_mc_up(in_dev);
 out:
+	/* we can receive as soon as ip_ptr is set -- do this last */
+	rcu_assign_pointer(dev->ip_ptr, in_dev);
 	return in_dev;
 out_kfree:
 	kfree(in_dev);
diff -ruNp linux-2.6.19.1/net/ipv6/addrconf.c linux-2.6.19.1T1/net/ipv6/addrconf.c
--- linux-2.6.19.1/net/ipv6/addrconf.c  2006-12-11 11:32:53.0 -0800
+++ linux-2.6.19.1T1/net/ipv6/addrconf.c        2007-01-03 14:47:07.0 -0800
@@ -413,8 +413,6 @@ static struct inet6_dev * ipv6_add_dev(s
 	if (netif_carrier_ok(dev))
 		ndev->if_flags |= IF_READY;
 
-	/* protected by rtnl_lock */
-	rcu_assign_pointer(dev->ip6_ptr, ndev);
 
 	ipv6_mc_init_dev(ndev);
 	ndev->tstamp = jiffies;
@@ -425,6 +423,8 @@ static struct inet6_dev * ipv6_add_dev(s
 			      NULL);
 	addrconf_sysctl_register(ndev, &ndev->cnf);
 #endif
+	/* protected by rtnl_lock */
+	rcu_assign_pointer(dev->ip6_ptr, ndev);
 	return ndev;
 }
 


initfix.patch
Description: Binary data


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear

David Stevens wrote:
 Ben,
 Here's a patch that I think will fix it, assuming the receive is on the
 same device as the initialization. Can you try this out?

We are attempting to reproduce this now...as soon as we can reproduce,
I'll apply this and see if that fixes the problem.  This race is evidently
quite difficult to hit, so I'm not sure how long this will take.

Perhaps someone like DaveM could review the patch for logical correctness
and go ahead and apply anyway if it is more correct?  I confuse myself often
enough trying to deal with the network stack locking that I should probably
not be the final arbiter of this patch :)

Thanks,
Ben






--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
OK, sounds good.

By the way, I think you can probably hit it more often if you have
something on the virtual network sending lots of multicast traffic while
you're creating the interface. That'll increase the odds that you'll
get into ip_check_mc() with a partially initialized in_dev.

You can use ping -I intfX 224.0.0.1 (e.g.) to generate multicast
traffic, though you'd want more than one. :-)

+-DLS



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Herbert Xu
David Stevens [EMAIL PROTECTED] wrote:
 
 Ben,
Here's a patch that I think will fix it, assuming the receive is 
 on the
 same device as the initialization. Can you try this out?

Hi David:

Your patch makes sense on its own but I don't see the direct connection
to the soft lock-up.  Sure it prevents the code path in question from
triggering.  However, if we don't understand why it's locking up in the
first place then this may just be hiding it rather than fixing it.

In particular, a soft lockup means that we're doing so much work in
the softirq handlers that processes are not getting run.  So what is
it exactly here that's causing us to get stuck in the softirq handlers?
Is it because we're somehow getting stuck in a net rx loop?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Ben Greear

Herbert Xu wrote:
 David Stevens [EMAIL PROTECTED] wrote:
  Ben,
  Here's a patch that I think will fix it, assuming the receive is on the
  same device as the initialization. Can you try this out?
 
 Hi David:
 
 Your patch makes sense on its own but I don't see the direct connection
 to the soft lock-up.  Sure it prevents the code path in question from
 triggering.  However, if we don't understand why it's locking up in the
 first place then this may just be hiding it rather than fixing it.
 
 In particular, a soft lockup means that we're doing so much work in
 the softirq handlers that processes are not getting run.  So what is
 it exactly here that's causing us to get stuck in the softirq handlers?
 Is it because we're somehow getting stuck in a net rx loop?

I'm not sure if it helps, but I did notice that 'ip' was using 99% of the
CPU on the system.  Could this be because it was spinning trying to acquire
the read-lock?  When I ran 'ifconfig -a', that process hung, and at that
point the system was rebooted.  Before I ran ifconfig, 'top' and 'ls' and
similar apps were responding fine, and I was logged in over ssh from the US
to Australia, so its basic networking was functioning.

What if the race is that the read-lock is only half initialized, so that
it doesn't trigger the uninitialized-lock-use debug message, but still screws
up and will not ever let the reader acquire the lock?

Thanks,
Ben

 Cheers,



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Herbert,
You're right, I don't know whether it'll fix the problem Ben saw
or not, but it looks like the original code can do a receive before the
in_device is fully initialized, and that, of course, is bad.
If the device for ip_rcv() is not the same one we were
initializing when the receive interrupted, then the patch should have
no effect either way -- I don't think it'll hide other problems.
If it's hard to reproduce (which I guess is true), then you're
right, no soft lockup doesn't really tell us if it's fixed or not.

+-DLS



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread David Stevens
Ben,
 If the ip_rcv() and the inetdev_init() are on the same
interface in your stack backtrace, it's a certainty at that point
that the lock value is still 0ed, because none of the initialization
occurs until after it has returned from the function it interrupted
to do the receive.
It'd have to be out of the register code and doing
ip_mc_init_dev() (after that call) to be a tight race with
lock creation.
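
A simplified, single-threaded model of why a zeroed rwlock never admits a
reader. It assumes the biased-counter scheme that i386 rwlocks used in that
era (unlocked value RW_LOCK_BIAS = 0x01000000, each reader subtracts 1, and
a negative result means the attempt failed). The model_* names are
hypothetical and no real atomic operations are used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same bias value as RW_LOCK_BIAS on i386 kernels of that era. */
#define MODEL_RW_LOCK_BIAS 0x01000000

struct model_rwlock {
    int32_t count;   /* BIAS when unlocked; each reader subtracts 1 */
};

/* What a proper lock initialization would leave behind. */
static void model_rwlock_init(struct model_rwlock *l)
{
    assert(l != NULL);
    l->count = MODEL_RW_LOCK_BIAS;
}

/* One reader attempt. Not atomic: this is a single-threaded model only. */
static bool model_read_trylock(struct model_rwlock *l)
{
    assert(l != NULL);
    l->count -= 1;
    if (l->count >= 0)
        return true;     /* reader admitted */
    l->count += 1;       /* back out, as the slow path would */
    return false;
}

/* Retry up to max_tries times; returns the attempt that succeeded, or -1. */
static int model_read_lock_bounded(struct model_rwlock *l, int max_tries)
{
    for (int i = 1; i <= max_tries; i++)
        if (model_read_trylock(l))
            return i;
    return -1;
}
```

In this model a lock left at 0 looks permanently write-held, so every
attempt fails no matter how long the reader retries, while an initialized
lock admits the first reader immediately.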

+-DLS


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-03 Thread Herbert Xu
David Stevens [EMAIL PROTECTED] wrote:
You're right, I don't know whether it'll fix the problem Ben saw
 or not, but it looks like the original code can do a receive before the
 in_device is fully initialized, and that, of course, is bad.
If the device for ip_rcv() is not the same one we were
 initializing when the receive interrupted, then the patch should have
 no effect either way -- I don't think it'll hide other problems.
If it's hard to reproduce (which I guess is true), then you're
 right, no soft lockup doesn't really tell us if it's fixed or not.

Actually I missed your point that the multicast locks aren't even
initialised at that point.  So this does explain the soft lock-up
and therefore your patch is clearly the correct solution.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 08:39:09AM +0100, Jarek Poplawski wrote:
...
 It is hard to say what kind of bug to expect
 because at the same time other net_rx_action
 with the same vlan dev could take place on
 other processor and this inetdev_init could
 do more.

Sorry! inetdev_init couldn't do more because
of rtnl lock but anyway the rest should be valid:

 The main thing is the possibility of processing
 skb with not entirely open source dev which isn't
 expected (and checked) by receive functions.
 I think the easiest way to convince yourself is
 to add temporarily IFF_UP flag checking with
 dropping at the beginning of netif_receive_skb and
 __vlan_hwaccel_rx.

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Jarek Poplawski
On Tue, Jan 02, 2007 at 09:23:02AM +0100, Jarek Poplawski wrote:
 On Tue, Jan 02, 2007 at 08:39:09AM +0100, Jarek Poplawski wrote:
 ...
  The main thing is the possibility of processing
  skb with not entirely open source dev which isn't
  expected (and checked) by receive functions.
  I think the easiest way to convince yourself is
  to add temporarily IFF_UP flag checking with
  dropping at the beginning of netif_receive_skb and
  __vlan_hwaccel_rx.

... and vlan_skb_recv also.

Jarek P.


Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread David Stevens
I've looked at this a little too -- it'd be nice to know who holds
the write lock.

I see ip_mc_destroy_dev() is bouncing through the lock for
each multicast address, though it starts at the beginning of
the list each time. I don't see a problem with it, but it'd be
simpler if it acquired the write lock once, grabbed and nulled
the list, released the lock and then called igmp_group_dropped()
& ip_ma_put() on each address from the local list copy.
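
The simplification described above can be sketched in userspace roughly as
follows. This is a model under stated assumptions, not the kernel function:
the model_* types are hypothetical, the lock is a counting stand-in rather
than a real rwlock, and setting the dropped flag stands in for the
igmp_group_dropped() and ip_ma_put() calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one multicast list entry. */
struct model_mc {
    struct model_mc *next;
    int dropped;
};

/* Counting stand-in for the write lock, so acquisitions can be observed. */
struct model_lock {
    int held;
    int acquisitions;
};

struct model_in_device {
    struct model_lock mc_list_lock;
    struct model_mc *mc_list;
};

static void model_write_lock(struct model_lock *l)
{
    assert(!l->held);
    l->held = 1;
    l->acquisitions++;
}

static void model_write_unlock(struct model_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* Detach the whole list under one lock hold, then process it unlocked. */
static int model_mc_destroy_dev(struct model_in_device *in_dev)
{
    struct model_mc *local, *next;
    int n = 0;

    assert(in_dev != NULL);
    model_write_lock(&in_dev->mc_list_lock);
    local = in_dev->mc_list;     /* grab the list */
    in_dev->mc_list = NULL;      /* and null it */
    model_write_unlock(&in_dev->mc_list_lock);

    for (; local != NULL; local = next) {
        next = local->next;      /* read before the entry is released */
        local->dropped = 1;      /* stands in for the per-address teardown */
        n++;
    }
    return n;                    /* number of entries processed */
}
```

The lock is taken exactly once regardless of list length, and the
per-address work runs on a private copy that no other path can reach.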

Are you destroying/creating interfaces or doing a lot of multicasting at
the time? How many group memberships do you have?

+-DLS



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-02 Thread Ben Greear

David Stevens wrote:
 I've looked at this a little too -- it'd be nice to know who holds
 the write lock.
 
 I see ip_mc_destroy_dev() is bouncing through the lock for
 each multicast address, though it starts at the beginning of
 the list each time. I don't see a problem with it, but it'd be
 simpler if it acquired the write lock once, grabbed and nulled
 the list, released the lock and then called igmp_group_dropped()
 & ip_ma_put() on each address from the local list copy.
 
 Are you destroying/creating interfaces or doing a lot of multicasting at
 the time? How many group memberships do you have?

Lots and lots of interfaces were being created...at least 200 mac-vlans
(out-of-tree patch somewhat similar to 802.1q vlans.)  The avahi-daemon
process was running, and it appears to be adding a multicast to each
interface.  It was spewing failure messages in /var/log/messages,
probably because it can't handle so many interfaces.

Other than that, there is no (known) multicast traffic being generated.

This bug was reported to me by a user in Australia, and we have not yet
attempted to recreate this locally, so I am not certain exactly what it
takes to trigger this bug.

Thanks,
Ben





+-DLS



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-01 Thread Ben Greear
I finally had time to look through the code in this backtrace in detail.
I think it *could* be a race between ip_rcv and inetdev_init, but I am not
certain.  Other than that, I'm real low on ideas.  I found a few more stack
trace debugging options to enable..perhaps that will give a better
backtrace if we can reproduce it again.

I do have lock-debugging enabled, so it should have caught this if it was
an un-initialized access problem, however.

More details below inline.

Ben Greear wrote:
 This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in
 active use.
 From the backtrace, I am thinking this might be a generic problem,
 however.
 
 Any ideas about what this could be?  It seems to be reproducible every
 day or two, but no known way to make it happen quickly...
 
 Kernel is SMP, PREEMPT.


Dec 19 04:49:33 localhost kernel: BUG: soft lockup detected on CPU#0!
Dec 19 04:49:33 localhost kernel:  [78104252] show_trace+0x12/0x20
Dec 19 04:49:33 localhost kernel:  [78104929] dump_stack+0x19/0x20
Dec 19 04:49:33 localhost kernel:  [7814c88b] softlockup_tick+0x9b/0xd0
Dec 19 04:49:33 localhost kernel:  [7812a992] 
run_local_timers+0x12/0x20
Dec 19 04:49:33 localhost kernel:  [7812ac08] 
update_process_times+0x38/0x80
Dec 19 04:49:33 localhost kernel:  [78112796] 
smp_apic_timer_interrupt+0x66/0x70
Dec 19 04:49:33 localhost kernel:  [78103baa] 
apic_timer_interrupt+0x2a/0x30

Dec 19 04:49:33 localhost kernel:  [78354e8c] _read_lock+0x3c/0x50

Dec 19 04:49:33 localhost kernel:  [78331f42] ip_check_mc+0x22/0xb0
This is blocked on:
igmp.c:read_lock(&in_dev->mc_list_lock);

Dec 19 04:49:33 localhost kernel:  [783068bf] ip_route_input+0x17f/0xef0
route.c:int our = ip_check_mc(in_dev, daddr, saddr, skb->nh.iph->protocol);

Dec 19 04:49:33 localhost kernel:  [78309c59] ip_rcv+0x349/0x580
?? Called by a macro maybe?  Can't find an obvious call to the
ip_route_input.
Dec 19 04:49:33 localhost kernel:  [782ec98d] 
netif_receive_skb+0x36d/0x3b0
Dec 19 04:49:33 localhost kernel:  [782ee50c] 
process_backlog+0x9c/0x130

Dec 19 04:49:33 localhost kernel:  [782ee795] net_rx_action+0xc5/0x1f0
Dec 19 04:49:33 localhost kernel:  [78125e58] __do_softirq+0x88/0x110
Dec 19 04:49:33 localhost kernel:  [78125f59] do_softirq+0x79/0x80
Dec 19 04:49:33 localhost kernel:  [781260ed] irq_exit+0x5d/0x60
Dec 19 04:49:33 localhost kernel:  [78105a6d] do_IRQ+0x4d/0xa0
Dec 19 04:49:33 localhost kernel:  [78103ae9] 
common_interrupt+0x25/0x2c

Dec 19 04:49:33 localhost kernel:  [78354c45] _spin_lock+0x35/0x50
Dec 19 04:49:33 localhost kernel:  [781aab1d] proc_register+0x2d/0x110
Dec 19 04:49:33 localhost kernel:  [781ab23d] 
create_proc_entry+0x5d/0xd0
Dec 19 04:49:33 localhost kernel:  [7812873b] 
register_proc_table+0x6b/0x110
Dec 19 04:49:33 localhost kernel:  [78128771] 
register_proc_table+0xa1/0x110

Dec 19 04:49:33 localhost last message repeated 3 times
Dec 19 04:49:33 localhost kernel:  [7812886d] 
register_sysctl_table+0x8d/0xc0
Dec 19 04:49:33 localhost kernel:  [7832f0c9] 
devinet_sysctl_register+0x109/0x150


This devinet_sysctl_register is called right before the ip_mc_init_dev
call is made, and that call is used to initialize the multicast lock that
is blocked on at the top of this backtrace.  This *could* be the race, but
only if the entities in question are the same thing.  I don't see any way
to determine whether they are or not based on the backtrace.

I looked through all of the uses of the mc_list_lock, and the places where
it does a write_lock are few and appear to be correct with no possibility
of deadlocking.  If a lock was un-initialized, then that could perhaps
explain why it is able to deadlock (though, that should have triggered a
different bug report since I have spin/rw-lock debugging enabled.)


Dec 19 04:49:33 localhost kernel:  [7832f2ea] inetdev_init+0xea/0x160
Dec 19 04:49:33 localhost kernel:  [7832fa2e] 
inet_rtm_newaddr+0x16e/0x190
Dec 19 04:49:33 localhost kernel:  [782f58a9] 
rtnetlink_rcv_msg+0x169/0x230
Dec 19 04:49:33 localhost kernel:  [78300ed0] 
netlink_run_queue+0x90/0x140

Dec 19 04:49:33 localhost kernel:  [782f56dc] rtnetlink_rcv+0x2c/0x50
Dec 19 04:49:33 localhost kernel:  [783014a5] 
netlink_data_ready+0x15/0x60

Dec 19 04:49:33 localhost kernel:  [78300167] netlink_sendskb+0x27/0x50
Dec 19 04:49:33 localhost kernel:  [78300bab] 
netlink_unicast+0x15b/0x1f0
Dec 19 04:49:33 localhost kernel:  [783013ab] 
netlink_sendmsg+0x20b/0x2f0

Dec 19 04:49:33 localhost kernel:  [782e12bc] sock_sendmsg+0xfc/0x120
Dec 19 04:49:33 localhost kernel:  [782e1a5a] sys_sendmsg+0x10a/0x220
Dec 19 04:49:33 localhost kernel:  [782e3311] 
sys_socketcall+0x261/0x290
Dec 19 04:49:33 localhost kernel:  [7810307d] 
sysenter_past_esp+0x56/0x8d
Dec 19 04:52:17 localhost sshd[32311]: gethostby*.getanswer: asked for 
203.60.60.10.in-addr.arpa IN PTR, got type A





--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com




Re: BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2007-01-01 Thread Jarek Poplawski
On Mon, Jan 01, 2007 at 09:00:05PM -0800, Ben Greear wrote:
 I finally had time to look through the code in this backtrace in 
 detail.  I think it *could*
 be a race between ip_rcv and inetdev_init, but I am not certain.  Other 
 than that, I'm real
 low on ideas.  I found a few more stack trace debugging options to 
 enable..perhaps that
 will give a better backtrace if we can reproduce it again.
 
 I do have lock-debugging enabled, so it should have caught this if was 
 an un-initialized access
 problem, however.
 
 More details below inline.
 
 Ben Greear wrote:
 This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
 active use.
 From the backtrace, I am thinking this might be a generic problem, 
 however.
 
 Any ideas about what this could be?  It seems to be reproducible every 
 day or
 two, but no known way to make it happen quickly...
 
 Kernel is SMP, PREEMPT.
 
 
 Dec 19 04:49:33 localhost kernel: BUG: soft lockup detected on CPU#0!
 Dec 19 04:49:33 localhost kernel:  [78104252] show_trace+0x12/0x20
 Dec 19 04:49:33 localhost kernel:  [78104929] dump_stack+0x19/0x20
 Dec 19 04:49:33 localhost kernel:  [7814c88b] softlockup_tick+0x9b/0xd0
 Dec 19 04:49:33 localhost kernel:  [7812a992] 
 run_local_timers+0x12/0x20
 Dec 19 04:49:33 localhost kernel:  [7812ac08] 
 update_process_times+0x38/0x80
 Dec 19 04:49:33 localhost kernel:  [78112796] 
 smp_apic_timer_interrupt+0x66/0x70
 Dec 19 04:49:33 localhost kernel:  [78103baa] 
 apic_timer_interrupt+0x2a/0x30
 Dec 19 04:49:33 localhost kernel:  [78354e8c] _read_lock+0x3c/0x50
  Dec 19 04:49:33 localhost kernel:  [78331f42] ip_check_mc+0x22/0xb0
 This is blocked on:
 igmp.c:read_lock(in_dev-mc_list_lock);
 
 Dec 19 04:49:33 localhost kernel:  [783068bf] 
 ip_route_input+0x17f/0xef0
 route.c:int our = ip_check_mc(in_dev, daddr, saddr, 
 skb-nh.iph-protocol);
 Dec 19 04:49:33 localhost kernel:  [78309c59] ip_rcv+0x349/0x580
 ?? Called by a macro maybe?  Can't find an obvious call to the 

Probably deliver_skb.

 ip_route_input.
 Dec 19 04:49:33 localhost kernel:  [782ec98d] 
 netif_receive_skb+0x36d/0x3b0
 Dec 19 04:49:33 localhost kernel:  [782ee50c] 
 process_backlog+0x9c/0x130
 Dec 19 04:49:33 localhost kernel:  [782ee795] net_rx_action+0xc5/0x1f0
 Dec 19 04:49:33 localhost kernel:  [78125e58] __do_softirq+0x88/0x110
 Dec 19 04:49:33 localhost kernel:  [78125f59] do_softirq+0x79/0x80
 Dec 19 04:49:33 localhost kernel:  [781260ed] irq_exit+0x5d/0x60
 Dec 19 04:49:33 localhost kernel:  [78105a6d] do_IRQ+0x4d/0xa0
 Dec 19 04:49:33 localhost kernel:  [78103ae9] 
 common_interrupt+0x25/0x2c
 Dec 19 04:49:33 localhost kernel:  [78354c45] _spin_lock+0x35/0x50
 Dec 19 04:49:33 localhost kernel:  [781aab1d] proc_register+0x2d/0x110
 Dec 19 04:49:33 localhost kernel:  [781ab23d] 
 create_proc_entry+0x5d/0xd0
 Dec 19 04:49:33 localhost kernel:  [7812873b] 
 register_proc_table+0x6b/0x110
 Dec 19 04:49:33 localhost kernel:  [78128771] 
 register_proc_table+0xa1/0x110
 Dec 19 04:49:33 localhost last message repeated 3 times
 Dec 19 04:49:33 localhost kernel:  [7812886d] 
 register_sysctl_table+0x8d/0xc0
 Dec 19 04:49:33 localhost kernel:  [7832f0c9] 
 devinet_sysctl_register+0x109/0x150
 
 This devinet_sysctl_register is called right before the ip_mc_init_dev 
 call is made, and
 that call is used to initialize the multicast lock that is blocked on at 
 the top of this backtrace.
 This *could* be the race, but only if the entities in question are the 
 same thing.  I don't see
 any way to determine whether they are or not based on the backtrace.
 
 I looked through all of the uses of the mc_list_lock, and the places 
 where it does a write_lock
 are few and appear to be correct with no possibility of deadlocking.  If 
 a lock was un-initialized, then
 that could perhaps explain why it is able to deadlock (though, that 
 should have triggered a different
 bug report since I have spin/rw-lock debugging enabled.)
 

It is hard to say what kind of bug to expect
because at the same time other net_rx_action
with the same vlan dev could take place on
other processor and this inetdev_init could
do more.

The main thing is the possibility of processing
skb with not entirely open source dev which isn't
expected (and checked) by receive functions.
I think the easiest way to convince yourself is
to add temporarily IFF_UP flag checking with
dropping at the beginning of netif_receive_skb and
__vlan_hwaccel_rx.
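
The temporary check suggested above (drop any skb whose source device is
not yet fully up) can be modelled in userspace as follows. This is a
sketch, not a kernel patch: the model_* names and return codes are
hypothetical, and only the flag value 0x1 matches the real IFF_UP.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as IFF_UP; redefined here to keep the sketch self-contained. */
#define MODEL_IFF_UP 0x1u

enum model_rx { MODEL_RX_SUCCESS = 0, MODEL_RX_DROP = 1 };

/* Hypothetical stand-in for the receiving device. */
struct model_net_device {
    unsigned int flags;
    unsigned long rx_dropped;
};

/* Check placed at the very top of the receive path, before any lookups. */
static enum model_rx model_netif_receive(struct model_net_device *dev)
{
    assert(dev != NULL);
    if (!(dev->flags & MODEL_IFF_UP)) {
        dev->rx_dropped++;       /* count it so the race becomes visible */
        return MODEL_RX_DROP;
    }
    return MODEL_RX_SUCCESS;     /* normal processing would continue here */
}
```

If the drop counter ever becomes non-zero during interface creation, that
would indicate packets are reaching the receive path before the device is
fully open.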

Jarek P.

 Dec 19 04:49:33 localhost kernel:  [7832f2ea] inetdev_init+0xea/0x160
 Dec 19 04:49:33 localhost kernel:  [7832fa2e] 
 inet_rtm_newaddr+0x16e/0x190
 Dec 19 04:49:33 localhost kernel:  [782f58a9] 
 rtnetlink_rcv_msg+0x169/0x230
 Dec 19 04:49:33 localhost kernel:  [78300ed0] 
 netlink_run_queue+0x90/0x140
 Dec 19 04:49:33 localhost kernel:  [782f56dc] rtnetlink_rcv+0x2c/0x50
 Dec 19 04:49:33 localhost kernel:  [783014a5] 
 netlink_data_ready+0x15/0x60
 Dec 19 04:49:33 localhost kernel:  [78300167] netlink_sendskb+0x27/0x50
 Dec 19 

BUG: soft lockup detected on CPU#0! (2.6.18.2 plus hacks)

2006-12-19 Thread Ben Greear

This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in
active use.
From the backtrace, I am thinking this might be a generic problem, however.

Any ideas about what this could be?  It seems to be reproducible every day or
two, but no known way to make it happen quickly...

Kernel is SMP, PREEMPT.


Dec 19 04:49:33 localhost kernel: BUG: soft lockup detected on CPU#0!
Dec 19 04:49:33 localhost kernel:  [78104252] show_trace+0x12/0x20
Dec 19 04:49:33 localhost kernel:  [78104929] dump_stack+0x19/0x20
Dec 19 04:49:33 localhost kernel:  [7814c88b] softlockup_tick+0x9b/0xd0
Dec 19 04:49:33 localhost kernel:  [7812a992] run_local_timers+0x12/0x20
Dec 19 04:49:33 localhost kernel:  [7812ac08] update_process_times+0x38/0x80
Dec 19 04:49:33 localhost kernel:  [78112796] smp_apic_timer_interrupt+0x66/0x70
Dec 19 04:49:33 localhost kernel:  [78103baa] apic_timer_interrupt+0x2a/0x30
Dec 19 04:49:33 localhost kernel:  [78354e8c] _read_lock+0x3c/0x50
Dec 19 04:49:33 localhost kernel:  [78331f42] ip_check_mc+0x22/0xb0
Dec 19 04:49:33 localhost kernel:  [783068bf] ip_route_input+0x17f/0xef0
Dec 19 04:49:33 localhost kernel:  [78309c59] ip_rcv+0x349/0x580
Dec 19 04:49:33 localhost kernel:  [782ec98d] netif_receive_skb+0x36d/0x3b0
Dec 19 04:49:33 localhost kernel:  [782ee50c] process_backlog+0x9c/0x130
Dec 19 04:49:33 localhost kernel:  [782ee795] net_rx_action+0xc5/0x1f0
Dec 19 04:49:33 localhost kernel:  [78125e58] __do_softirq+0x88/0x110
Dec 19 04:49:33 localhost kernel:  [78125f59] do_softirq+0x79/0x80
Dec 19 04:49:33 localhost kernel:  [781260ed] irq_exit+0x5d/0x60
Dec 19 04:49:33 localhost kernel:  [78105a6d] do_IRQ+0x4d/0xa0
Dec 19 04:49:33 localhost kernel:  [78103ae9] common_interrupt+0x25/0x2c
Dec 19 04:49:33 localhost kernel:  [78354c45] _spin_lock+0x35/0x50
Dec 19 04:49:33 localhost kernel:  [781aab1d] proc_register+0x2d/0x110
Dec 19 04:49:33 localhost kernel:  [781ab23d] create_proc_entry+0x5d/0xd0
Dec 19 04:49:33 localhost kernel:  [7812873b] register_proc_table+0x6b/0x110
Dec 19 04:49:33 localhost kernel:  [78128771] register_proc_table+0xa1/0x110
Dec 19 04:49:33 localhost last message repeated 3 times
Dec 19 04:49:33 localhost kernel:  [7812886d] register_sysctl_table+0x8d/0xc0
Dec 19 04:49:33 localhost kernel:  [7832f0c9] devinet_sysctl_register+0x109/0x150
Dec 19 04:49:33 localhost kernel:  [7832f2ea] inetdev_init+0xea/0x160
Dec 19 04:49:33 localhost kernel:  [7832fa2e] inet_rtm_newaddr+0x16e/0x190
Dec 19 04:49:33 localhost kernel:  [782f58a9] rtnetlink_rcv_msg+0x169/0x230
Dec 19 04:49:33 localhost kernel:  [78300ed0] netlink_run_queue+0x90/0x140
Dec 19 04:49:33 localhost kernel:  [782f56dc] rtnetlink_rcv+0x2c/0x50
Dec 19 04:49:33 localhost kernel:  [783014a5] netlink_data_ready+0x15/0x60
Dec 19 04:49:33 localhost kernel:  [78300167] netlink_sendskb+0x27/0x50
Dec 19 04:49:33 localhost kernel:  [78300bab] netlink_unicast+0x15b/0x1f0
Dec 19 04:49:33 localhost kernel:  [783013ab] netlink_sendmsg+0x20b/0x2f0
Dec 19 04:49:33 localhost kernel:  [782e12bc] sock_sendmsg+0xfc/0x120
Dec 19 04:49:33 localhost kernel:  [782e1a5a] sys_sendmsg+0x10a/0x220
Dec 19 04:49:33 localhost kernel:  [782e3311] sys_socketcall+0x261/0x290
Dec 19 04:49:33 localhost kernel:  [7810307d] sysenter_past_esp+0x56/0x8d
Dec 19 04:52:17 localhost sshd[32311]: gethostby*.getanswer: asked for 203.60.60.10.in-addr.arpa IN PTR, got type A

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com
