Re: [ofa-general] madeye kernel oops
Ami Perlmutter wrote:
> seems to be OK
>
> On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote:
> > Can you see if this patch fixes your problem?
> >
> > (I'm not sure how I never hit this before.)
> >
> > - Sean

Was this applied to OFED?

Thanks,
Tziporet
___
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] madeye kernel oops
seems to be OK
On Mon, 2007-04-02 at 12:49 -0700, Sean Hefty wrote:
> Can you see if this patch fixes your problem?
>
> (I'm not sure how I never hit this before.)
>
> - Sean
>
> ---
>
> IB/madeye: Fix array subscript out of range error.
>
> Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>
>
> diff --git a/drivers/infiniband/util/madeye/madeye.c b/drivers/infiniband/util/madeye/madeye.c
> index f3d02d1..1b2c384 100644
> --- a/drivers/infiniband/util/madeye/madeye.c
> +++ b/drivers/infiniband/util/madeye/madeye.c
> @@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
>  		goto out;
>
>  	reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
> -	for (i = s; i <= e; i++) {
> +	for (i = 0; i <= e - s; i++) {
>  		port[i].smi_agent = ib_register_mad_snoop(device, i,
>  							  IB_QPT_SMI,
>  							  reg_flags,
> @@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
>  		e = device->phys_port_cnt;
>  	}
>
> -	for (i = s; i <= e; i++) {
> +	for (i = 0; i <= e - s; i++) {
>  		if (!IS_ERR(port[i].smi_agent))
>  			ib_unregister_mad_agent(port[i].smi_agent);
>  		if (!IS_ERR(port[i].gsi_agent))
>
RE: [ofa-general] madeye kernel oops
Can you see if this patch fixes your problem?
(I'm not sure how I never hit this before.)
- Sean
---
IB/madeye: Fix array subscript out of range error.
Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>
diff --git a/drivers/infiniband/util/madeye/madeye.c b/drivers/infiniband/util/madeye/madeye.c
index f3d02d1..1b2c384 100644
--- a/drivers/infiniband/util/madeye/madeye.c
+++ b/drivers/infiniband/util/madeye/madeye.c
@@ -533,7 +533,7 @@ static void madeye_add_one(struct ib_device *device)
 		goto out;

 	reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS;
-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		port[i].smi_agent = ib_register_mad_snoop(device, i,
 							  IB_QPT_SMI,
 							  reg_flags,
@@ -570,7 +570,7 @@ static void madeye_remove_one(struct ib_device *device)
 		e = device->phys_port_cnt;
 	}

-	for (i = s; i <= e; i++) {
+	for (i = 0; i <= e - s; i++) {
 		if (!IS_ERR(port[i].smi_agent))
 			ib_unregister_mad_agent(port[i].smi_agent);
 		if (!IS_ERR(port[i].gsi_agent))
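The index change in the patch above can be checked in isolation. The following is a minimal sketch, not the madeye source: it assumes the `port` array holds one entry per port in the inclusive range `s..e` (so `e - s + 1` slots) and that ports are numbered from 1 on a non-switch device; neither the allocation nor the assignment of `s` appears in the posted hunks. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Slots needed to hold one entry per port in the inclusive range s..e.
 * Assumption: the array is sized this way; the allocation is not shown
 * in the posted patch. */
int slot_count(int s, int e)
{
	assert(e >= s);
	return e - s + 1;
}

/* True if every index visited by the pre-patch loop (i = s .. e)
 * is a valid slot of that array. */
bool old_loop_in_bounds(int s, int e)
{
	int n = slot_count(s, e);

	for (int i = s; i <= e; i++)
		if (i >= n)
			return false;
	return true;
}

/* True if every index visited by the post-patch loop (i = 0 .. e - s)
 * is a valid slot of that array. */
bool new_loop_in_bounds(int s, int e)
{
	int n = slot_count(s, e);

	for (int i = 0; i <= e - s; i++)
		if (i >= n)
			return false;
	return true;
}

/* Hypothetical helper: the port number that slot i stands for when
 * ports are numbered starting at s. */
int slot_to_port(int s, int i)
{
	return s + i;
}
```

With `s = 1` and `e = 2`, the old loop visits indices 1 and 2 of a two-slot array, so index 2 is one past the end, while the new loop visits 0 and 1. With `s = 0` and `e = 0` both loops stay in bounds, which may be why the problem does not show up in every configuration. Note that the posted hunk still passes the same `i` as the second argument of `ib_register_mad_snoop`; whether that argument would need something like `slot_to_port(s, i)` is not discussed in the thread.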
RE: [ofa-general] madeye kernel oops
On Wed, 2007-03-28 at 14:15 -0500, Hal Rosenstock wrote:
> On Wed, 2007-03-28 at 02:15, Ami Perlmutter wrote:
> > On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> > > How easily can you reproduce this? I'm assuming that this is with OFED 1.2 on
> > > 2.6.20, correct?
> > yes
> > > Can you describe what you were doing when this crash occurred?
> > opensm was running on the other computer
> > running SDP programs
>
> So the node which oops'd was only running madeye and some SDP data
> transfer?

I was using madeye to debug MAD losses in SDP connect, so other than CM MADs
there was no data being sent by SDP.

> Can you be more specific about the failure scenario? What was going on
> on the node which failed? It looks like you were removing madeye. Was
> this the first time? Anything else going on?

When I tried to remove the module, the node was not running anything;
opensm was running on the other machine. The oops happened when I tried to
remove madeye in order to reset the driver. It happened more than once, but
not every time I removed the module.

> Thanks.
>
> -- Hal
>
> > > Thanks,
> > > Sean
> > >
> > > >Unable to handle kernel NULL pointer dereference at 0038
> > > >RIP:
> > > > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > >PGD 73387067 PUD 72844067 PMD 0
> > > >Oops: [1] SMP
> > > >CPU 0
> > > >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> > > >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> > > >ib_mad ib_core
> > > >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> > > >RIP: 0010:[]
> > > >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > >RSP: :810071ee1e08 EFLAGS: 00010292
> > > >RAX: RBX: 0020 RCX: 003f
> > > >RDX: 810077ebd6c0 RSI: 0202 RDI:
> > > >RBP: R08: 810077ebd728 R09: 0003
> > > >R10: R11: R12: 8100766c33c0
> > > >R13: 0002 R14: 0880 R15: 00503010
> > > >FS: 2b3d6689fb00() GS:80702000()
> > > >knlGS:
> > > >CS: 0010 DS: ES: CR0: 8005003b
> > > >CR2: 0038 CR3: 71086000 CR4: 06e0
> > > >Process rmmod (pid: 8917, threadinfo 810071ee, task
> > > >8100781aeee0)
> > > >Stack: 810071ee1e18 8022b92f 810071ee1e28
> > > >80538b43
> > > > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > > > 000f 0020 0002 8100766c33c0
> > > >Call Trace:
> > > > [] __cond_resched+0x1c/0x44
> > > > [] cond_resched+0x2e/0x39
> > > > [] wait_for_completion+0x1a/0xd0
> > > > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > > > [] :ib_core:ib_unregister_client+0x40/0xe2
> > > > [] sys_delete_module+0x1b4/0x1e5
> > > > [] add_uevent_var+0x40/0xe3
> > > > [] sys_munmap+0x4b/0x58
> > > > [] system_call+0x7e/0x83
> > > >
> > > >
> > > >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> > > >RIP [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > > RSP
> > > >CR2: 0038
RE: [ofa-general] madeye kernel oops
On Wed, 2007-03-28 at 02:15, Ami Perlmutter wrote:
> On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> > How easily can you reproduce this? I'm assuming that this is with OFED 1.2 on
> > 2.6.20, correct?
> yes
> > Can you describe what you were doing when this crash occurred?
> opensm was running on the other computer
> running SDP programs

So the node which oops'd was only running madeye and some SDP data
transfer?

Can you be more specific about the failure scenario? What was going on
on the node which failed? It looks like you were removing madeye. Was
this the first time? Anything else going on?

Thanks.

-- Hal

> > Thanks,
> > Sean
> >
> > >Unable to handle kernel NULL pointer dereference at 0038
> > >RIP:
> > > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > >PGD 73387067 PUD 72844067 PMD 0
> > >Oops: [1] SMP
> > >CPU 0
> > >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> > >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> > >ib_mad ib_core
> > >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> > >RIP: 0010:[]
> > >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > >RSP: :810071ee1e08 EFLAGS: 00010292
> > >RAX: RBX: 0020 RCX: 003f
> > >RDX: 810077ebd6c0 RSI: 0202 RDI:
> > >RBP: R08: 810077ebd728 R09: 0003
> > >R10: R11: R12: 8100766c33c0
> > >R13: 0002 R14: 0880 R15: 00503010
> > >FS: 2b3d6689fb00() GS:80702000()
> > >knlGS:
> > >CS: 0010 DS: ES: CR0: 8005003b
> > >CR2: 0038 CR3: 71086000 CR4: 06e0
> > >Process rmmod (pid: 8917, threadinfo 810071ee, task
> > >8100781aeee0)
> > >Stack: 810071ee1e18 8022b92f 810071ee1e28
> > >80538b43
> > > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > > 000f 0020 0002 8100766c33c0
> > >Call Trace:
> > > [] __cond_resched+0x1c/0x44
> > > [] cond_resched+0x2e/0x39
> > > [] wait_for_completion+0x1a/0xd0
> > > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > > [] :ib_core:ib_unregister_client+0x40/0xe2
> > > [] sys_delete_module+0x1b4/0x1e5
> > > [] add_uevent_var+0x40/0xe3
> > > [] sys_munmap+0x4b/0x58
> > > [] system_call+0x7e/0x83
> > >
> > >
> > >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> > >RIP [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > > RSP
> > >CR2: 0038
RE: [ofa-general] madeye kernel oops
On Tue, 2007-03-27 at 12:03 -0700, Sean Hefty wrote:
> How easily can you reproduce this? I'm assuming that this is with OFED 1.2 on
> 2.6.20, correct?
yes
> Can you describe what you were doing when this crash occurred?
opensm was running on the other computer
running SDP programs
> Thanks,
> Sean
>
> >Unable to handle kernel NULL pointer dereference at 0038
> >RIP:
> > [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> >PGD 73387067 PUD 72844067 PMD 0
> >Oops: [1] SMP
> >CPU 0
> >Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
> >ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
> >ib_mad ib_core
> >Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
> >RIP: 0010:[]
> >[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> >RSP: :810071ee1e08 EFLAGS: 00010292
> >RAX: RBX: 0020 RCX: 003f
> >RDX: 810077ebd6c0 RSI: 0202 RDI:
> >RBP: R08: 810077ebd728 R09: 0003
> >R10: R11: R12: 8100766c33c0
> >R13: 0002 R14: 0880 R15: 00503010
> >FS: 2b3d6689fb00() GS:80702000()
> >knlGS:
> >CS: 0010 DS: ES: CR0: 8005003b
> >CR2: 0038 CR3: 71086000 CR4: 06e0
> >Process rmmod (pid: 8917, threadinfo 810071ee, task
> >8100781aeee0)
> >Stack: 810071ee1e18 8022b92f 810071ee1e28
> >80538b43
> > 810071ee1ea8 80538ea2 80690880 810071ee1e78
> > 000f 0020 0002 8100766c33c0
> >Call Trace:
> > [] __cond_resched+0x1c/0x44
> > [] cond_resched+0x2e/0x39
> > [] wait_for_completion+0x1a/0xd0
> > [] :ib_madeye:madeye_remove_one+0x56/0x88
> > [] :ib_core:ib_unregister_client+0x40/0xe2
> > [] sys_delete_module+0x1b4/0x1e5
> > [] add_uevent_var+0x40/0xe3
> > [] sys_munmap+0x4b/0x58
> > [] system_call+0x7e/0x83
> >
> >
> >Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
> >RIP [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> > RSP
> >CR2: 0038
RE: [ofa-general] madeye kernel oops
How easily can you reproduce this? I'm assuming that this is with OFED 1.2 on
2.6.20, correct?

Can you describe what you were doing when this crash occurred?

Thanks,
Sean

>Unable to handle kernel NULL pointer dereference at 0038
>RIP:
> [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
>PGD 73387067 PUD 72844067 PMD 0
>Oops: [1] SMP
>CPU 0
>Modules linked in: ib_madeye i2c_dev i2c_core ib_sdp rdma_cm iw_cm
>ib_addr ib_local_sa ib_uverbs ib_umad ib_mthca ib_ipoib ib_cm ib_sa
>ib_mad ib_core
>Pid: 8917, comm: rmmod Not tainted 2.6.20 #1
>RIP: 0010:[]
>[] :ib_mad:ib_unregister_mad_agent+0x11/0x480
>RSP: :810071ee1e08 EFLAGS: 00010292
>RAX: RBX: 0020 RCX: 003f
>RDX: 810077ebd6c0 RSI: 0202 RDI:
>RBP: R08: 810077ebd728 R09: 0003
>R10: R11: R12: 8100766c33c0
>R13: 0002 R14: 0880 R15: 00503010
>FS: 2b3d6689fb00() GS:80702000()
>knlGS:
>CS: 0010 DS: ES: CR0: 8005003b
>CR2: 0038 CR3: 71086000 CR4: 06e0
>Process rmmod (pid: 8917, threadinfo 810071ee, task
>8100781aeee0)
>Stack: 810071ee1e18 8022b92f 810071ee1e28
>80538b43
> 810071ee1ea8 80538ea2 80690880 810071ee1e78
> 000f 0020 0002 8100766c33c0
>Call Trace:
> [] __cond_resched+0x1c/0x44
> [] cond_resched+0x2e/0x39
> [] wait_for_completion+0x1a/0xd0
> [] :ib_madeye:madeye_remove_one+0x56/0x88
> [] :ib_core:ib_unregister_client+0x40/0xe2
> [] sys_delete_module+0x1b4/0x1e5
> [] add_uevent_var+0x40/0xe3
> [] sys_munmap+0x4b/0x58
> [] system_call+0x7e/0x83
>
>
>Code: 83 7f 38 00 0f 84 fd 03 00 00 48 8d 44 24 20 4c 8d 67 f0 48
>RIP [] :ib_mad:ib_unregister_mad_agent+0x11/0x480
> RSP
>CR2: 0038
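A reading of the trace above, offered as an inference rather than something stated in the thread: the fault address is 0x38 (`CR2: 0038`), and the first `Code:` bytes `83 7f 38 00` appear to decode as a 32-bit compare of the value at `0x38(%rdi)` against zero. On x86-64 `%rdi` carries the first function argument, so this is consistent with `ib_unregister_mad_agent` having been called with a NULL agent pointer (the RDI value itself is not legible in the archived dump). The caller guards the call only with `!IS_ERR(...)`, and `IS_ERR` does not treat NULL as an error. The sketch below re-creates that convention in userspace; `err_ptr`, `is_err`, `is_err_or_null` and `field_address` are stand-ins written for this note, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-creation of the kernel's error-pointer convention
 * (linux/err.h): the top MAX_ERRNO addresses encode negative errno codes. */
#define MAX_ERRNO 4095

/* Stand-in for ERR_PTR(): wrap a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Stand-in for IS_ERR(): true only for pointers in the errno range. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Stricter guard that also rejects NULL; current kernels provide
 * IS_ERR_OR_NULL() for this purpose. */
static inline bool is_err_or_null(const void *ptr)
{
	return ptr == NULL || is_err(ptr);
}

/* Address touched when reading a field at `offset` bytes from `base`. */
static inline uintptr_t field_address(const void *base, size_t offset)
{
	return (uintptr_t)base + offset;
}
```

Here `is_err(NULL)` is false, so a slot that happens to hold NULL passes a `!IS_ERR()` guard, and reading a field 0x38 bytes into it touches address 0x38, matching the reported fault address. A slot read from outside the array could hold NULL on some runs and something else on others, which would fit the report that the oops did not occur on every module removal.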
