Re: [ofa-general] madeye kernel oops

2007-04-15 Thread Tziporet Koren

Ami Perlmutter wrote:
> seems to be OK
>
> On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote:
> > Can you see if this patch fixes your problem?
> >
> > (I'm not sure how I never hit this before.)
> >
> > - Sean

Was this applied to OFED?

Thanks,
Tziporet

___
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [ofa-general] madeye kernel oops

2007-04-04 Thread Ami Perlmutter
seems to be OK

On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote:
> Can you see if this patch fixes your problem?
> 
> (I'm not sure how I never hit this before.)
> 
> - Sean
> 
> ---
> 
> IB/madeye: Fix array subscript out of range error.
> 
> Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>
> 
> diff --git a/drivers/infiniband/util/madeye/madeye.c b/drivers/infiniband/util/madeye/madeye.c
> index f3d02d1..1b2c384 100644
> --- a/drivers/infiniband/util/madeye/madeye.c
> +++ b/drivers/infiniband/util/madeye/madeye.c
> @@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
>   goto out;
>  
>   reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
> - for (i = s; i <= e; i++) {
> + for (i = 0; i <= e - s; i++) {
>   port[i].smi_agent = ib_register_mad_snoop(device, i,
> IB_QPT_SMI,
> reg_flags,
> @@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
>   e = device->phys_port_cnt;
>   }
>  
> - for (i = s; i <= e; i++) {
> + for (i = 0; i <= e - s; i++) {
>   if (!IS_ERR(port[i].smi_agent))
>   ib_unregister_mad_agent(port[i].smi_agent);
>   if (!IS_ERR(port[i].gsi_agent))
> 



RE: [ofa-general] madeye kernel oops

2007-04-02 Thread Sean Hefty
Can you see if this patch fixes your problem?

(I'm not sure how I never hit this before.)

- Sean

---

IB/madeye: Fix array subscript out of range error.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

diff --git a/drivers/infiniband/util/madeye/madeye.c b/drivers/infiniband/util/madeye/madeye.c
index f3d02d1..1b2c384 100644
--- a/drivers/infiniband/util/madeye/madeye.c
+++ b/drivers/infiniband/util/madeye/madeye.c
@@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
 		goto out;
 
 	reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		port[i].smi_agent = ib_register_mad_snoop(device, i,
 							  IB_QPT_SMI,
 							  reg_flags,
@@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}
 
-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		if (!IS_ERR(port[i].smi_agent))
 			ib_unregister_mad_agent(port[i].smi_agent);
 		if (!IS_ERR(port[i].gsi_agent))
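
The patch switches the loop counter from port numbers to zero-based array slots. A minimal sketch of why that matters, under assumptions the hunks do not show: that s is the first port number, e is the last, and port[] holds e - s + 1 entries. All function names below are hypothetical and are not part of madeye.c.

```c
#include <assert.h>
#include <stddef.h>

/* Number of entries assumed to be allocated for port[]: one per port in [s, e]. */
size_t slot_count(unsigned int s, unsigned int e)
{
	assert(s <= e);
	return (size_t)(e - s) + 1;
}

/* Highest index touched by the old loop: for (i = s; i <= e; i++) port[i] */
size_t old_last_index(unsigned int s, unsigned int e)
{
	(void)s;	/* the old loop indexed by port number, so s does not matter here */
	return e;
}

/* Highest index touched by the patched loop: for (i = 0; i <= e - s; i++) port[i] */
size_t new_last_index(unsigned int s, unsigned int e)
{
	return e - s;
}

/* Port number that corresponds to a zero-based slot. */
unsigned int slot_to_port(size_t slot, unsigned int s)
{
	return (unsigned int)slot + s;
}
```

With s = 1 and e = 2, slot_count() is 2, the old loop reaches index 2 (one past the end), and the patched loop stops at index 1. With s = 0 the old and new loops touch the same indices. Separately, the context line in the first hunk still passes i as the port number; with a zero-based counter that would presumably need to become i + s (what slot_to_port() computes), but the thread does not show such a change.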



RE: [ofa-general] madeye kernel oops

2007-03-28 Thread Ami Perlmutter
On Wed, 2007-03-28 at 14:15 -0500, Hal Rosenstock wrote:
> On Wed, 2007-03-28 at 02:15, Ami Perlmutter wrote:
> > On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> > > How easily can you reproduce this?  I'm assuming that this is with OFED 
> > > 1.2 on
> > > 2.6.20, correct?
> > yes
> > > Can you describe what you were doing when this crash occurred?
> > opensm was running on the other computer
> > running SDP programs
> 
> So the node which oops'd was only running madeye and some SDP data
> transfer ?
I was using madeye to debug MAD losses in SDP connect,
so other than CM MADs there was no data being sent by SDP.
> Can you be more specific about the failure scenario ? What was going on
> on the node which failed ? It looks like you were removing madeye. Was
> this the first time ? Anything else going on ?
When I tried to remove the module, the node was not running anything;
opensm was running on the other machine.
The oops happened when I tried to remove madeye in order to reset the
driver.
It happened more than once, but not every time I removed the
module.
> Thanks.
> 
> -- Hal
> 
> > > Thanks,
> > > Sean
> > > 
> > > >Unable to handle kernel NULL pointer dereference at 0038
> > > >RIP:
> > > > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > >PGD 73387067 PUD 72844067 PMD 0
> > > >Oops:  [1] SMP
> > > >CPU 0
> > > >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> > > >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> > > >ib_mad ib_core
> > > >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> > > >RIP: 0010:[]
> > > >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > >RSP: :810071ee1e08  EFLAGS: 00010292
> > > >RAX:  RBX: 0020 RCX: 003f
> > > >RDX: 810077ebd6c0 RSI: 0202 RDI: 
> > > >RBP:  R08: 810077ebd728 R09: 0003
> > > >R10:  R11:  R12: 8100766c33c0
> > > >R13: 0002 R14: 0880 R15: 00503010
> > > >FS:  2b3d6689fb00() GS:80702000()
> > > >knlGS:
> > > >CS:  0010 DS:  ES:  CR0: 8005003b
> > > >CR2: 0038 CR3: 71086000 CR4: 06e0
> > > >Process rmmod (pid: 8917, threadinfo 810071ee, task
> > > >8100781aeee0)
> > > >Stack:  810071ee1e18 8022b92f 810071ee1e28
> > > >80538b43
> > > > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > > > 000f 0020 0002 8100766c33c0
> > > >Call Trace:
> > > > [] __cond_resched+0x1c/0x44
> > > > [] cond_resched+0x2e/0x39
> > > > [] wait_for_completion+0x1a/0xd0
> > > > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > > > [] :ib_core:ib_unregister_client+0x40/0xe2
> > > > [] sys_delete_module+0x1b4/0x1e5
> > > > [] add_uevent_var+0x40/0xe3
> > > > [] sys_munmap+0x4b/0x58
> > > > [] system_call+0x7e/0x83
> > > >
> > > >
> > > >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> > > >RIP  [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > > RSP 
> > > >CR2: 0038
> > 
> 



RE: [ofa-general] madeye kernel oops

2007-03-28 Thread Hal Rosenstock
On Wed, 2007-03-28 at 02:15, Ami Perlmutter wrote:
> On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> > How easily can you reproduce this?  I'm assuming that this is with OFED 1.2 
> > on
> > 2.6.20, correct?
> yes
> > Can you describe what you were doing when this crash occurred?
> opensm was running on the other computer
> running SDP programs

So the node which oops'd was only running madeye and some SDP data
transfer ?

Can you be more specific about the failure scenario ? What was going on
on the node which failed ? It looks like you were removing madeye. Was
this the first time ? Anything else going on ?

Thanks.

-- Hal

> > Thanks,
> > Sean
> > 
> > >Unable to handle kernel NULL pointer dereference at 0038
> > >RIP:
> > > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > >PGD 73387067 PUD 72844067 PMD 0
> > >Oops:  [1] SMP
> > >CPU 0
> > >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> > >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> > >ib_mad ib_core
> > >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> > >RIP: 0010:[]
> > >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > >RSP: :810071ee1e08  EFLAGS: 00010292
> > >RAX:  RBX: 0020 RCX: 003f
> > >RDX: 810077ebd6c0 RSI: 0202 RDI: 
> > >RBP:  R08: 810077ebd728 R09: 0003
> > >R10:  R11:  R12: 8100766c33c0
> > >R13: 0002 R14: 0880 R15: 00503010
> > >FS:  2b3d6689fb00() GS:80702000()
> > >knlGS:
> > >CS:  0010 DS:  ES:  CR0: 8005003b
> > >CR2: 0038 CR3: 71086000 CR4: 06e0
> > >Process rmmod (pid: 8917, threadinfo 810071ee, task
> > >8100781aeee0)
> > >Stack:  810071ee1e18 8022b92f 810071ee1e28
> > >80538b43
> > > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > > 000f 0020 0002 8100766c33c0
> > >Call Trace:
> > > [] __cond_resched+0x1c/0x44
> > > [] cond_resched+0x2e/0x39
> > > [] wait_for_completion+0x1a/0xd0
> > > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > > [] :ib_core:ib_unregister_client+0x40/0xe2
> > > [] sys_delete_module+0x1b4/0x1e5
> > > [] add_uevent_var+0x40/0xe3
> > > [] sys_munmap+0x4b/0x58
> > > [] system_call+0x7e/0x83
> > >
> > >
> > >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> > >RIP  [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > RSP 
> > >CR2: 0038
> 



RE: [ofa-general] madeye kernel oops

2007-03-27 Thread Ami Perlmutter
On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> How easily can you reproduce this?  I'm assuming that this is with OFED 1.2 on
> 2.6.20, correct?
yes
> Can you describe what you were doing when this crash occurred?
opensm was running on the other computer
running SDP programs
> Thanks,
> Sean
> 
> >Unable to handle kernel NULL pointer dereference at 0038
> >RIP:
> > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> >PGD 73387067 PUD 72844067 PMD 0
> >Oops:  [1] SMP
> >CPU 0
> >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> >ib_mad ib_core
> >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> >RIP: 0010:[]
> >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> >RSP: :810071ee1e08  EFLAGS: 00010292
> >RAX:  RBX: 0020 RCX: 003f
> >RDX: 810077ebd6c0 RSI: 0202 RDI: 
> >RBP:  R08: 810077ebd728 R09: 0003
> >R10:  R11:  R12: 8100766c33c0
> >R13: 0002 R14: 0880 R15: 00503010
> >FS:  2b3d6689fb00() GS:80702000()
> >knlGS:
> >CS:  0010 DS:  ES:  CR0: 8005003b
> >CR2: 0038 CR3: 71086000 CR4: 06e0
> >Process rmmod (pid: 8917, threadinfo 810071ee, task
> >8100781aeee0)
> >Stack:  810071ee1e18 8022b92f 810071ee1e28
> >80538b43
> > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > 000f 0020 0002 8100766c33c0
> >Call Trace:
> > [] __cond_resched+0x1c/0x44
> > [] cond_resched+0x2e/0x39
> > [] wait_for_completion+0x1a/0xd0
> > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > [] :ib_core:ib_unregister_client+0x40/0xe2
> > [] sys_delete_module+0x1b4/0x1e5
> > [] add_uevent_var+0x40/0xe3
> > [] sys_munmap+0x4b/0x58
> > [] system_call+0x7e/0x83
> >
> >
> >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> >RIP  [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > RSP 
> >CR2: 0038



RE: [ofa-general] madeye kernel oops

2007-03-27 Thread Sean Hefty
How easily can you reproduce this?  I'm assuming that this is with OFED 1.2 on
2.6.20, correct?

Can you describe what you were doing when this crash occurred?

Thanks,
Sean

>Unable to handle kernel NULL pointer dereference at 0038
>RIP:
> [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
>PGD 73387067 PUD 72844067 PMD 0
>Oops:  [1] SMP
>CPU 0
>Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
>ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
>ib_mad ib_core
>Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
>RIP: 0010:[]
>[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
>RSP: :810071ee1e08  EFLAGS: 00010292
>RAX:  RBX: 0020 RCX: 003f
>RDX: 810077ebd6c0 RSI: 0202 RDI: 
>RBP:  R08: 810077ebd728 R09: 0003
>R10:  R11:  R12: 8100766c33c0
>R13: 0002 R14: 0880 R15: 00503010
>FS:  2b3d6689fb00() GS:80702000()
>knlGS:
>CS:  0010 DS:  ES:  CR0: 8005003b
>CR2: 0038 CR3: 71086000 CR4: 06e0
>Process rmmod (pid: 8917, threadinfo 810071ee, task
>8100781aeee0)
>Stack:  810071ee1e18 8022b92f 810071ee1e28
>80538b43
> 810071ee1ea8 80538ea2 80690880 810071ee1e78
> 000f 0020 0002 8100766c33c0
>Call Trace:
> [] __cond_resched+0x1c/0x44
> [] cond_resched+0x2e/0x39
> [] wait_for_completion+0x1a/0xd0
> [] :ib_madeye:madeye_remove_one+0x56/0x88
> [] :ib_core:ib_unregister_client+0x40/0xe2
> [] sys_delete_module+0x1b4/0x1e5
> [] add_uevent_var+0x40/0xe3
> [] sys_munmap+0x4b/0x58
> [] system_call+0x7e/0x83
>
>
>Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
>RIP  [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> RSP 
>CR2: 0038
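
A note on reading the trace: the fault address is 0038, and the first bytes of the Code: line (83 7f 38 00) decode to a compare against offset 0x38 from %rdi, so ib_unregister_mad_agent() appears to have been called with a NULL agent pointer. A guard of the form !IS_ERR(ptr) would not stop that, because the kernel's IS_ERR() is true only for encoded errno values and is false for NULL. The sketch below imitates that convention in userspace to show the difference; the sketch_ names are made up for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of the kernel's error-pointer convention, for
 * illustration only. The kernel reserves the top 4095 addresses for
 * encoded errno values. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer, in the style of ERR_PTR(). */
void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True only for pointers in the reserved error range; NULL is not in it. */
int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Stricter guard: reject both error pointers and NULL. */
int sketch_is_usable(const void *ptr)
{
	return ptr != NULL && !sketch_is_err(ptr);
}
```

Here sketch_is_err(NULL) is 0, so code guarded only by the error-pointer test would still pass NULL on to the callee, while sketch_is_usable(NULL) is 0 and would skip it. Whether madeye needs such a check, or only the index fix, is not something the trace alone settles.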