Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Gleb Natapov
On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> Gleb Natapov wrote:
> > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> >>>> While looking into another problem I ran into an issue which made ob1
> >>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
> >>>> test suite, if I limit the gm freelist by too much, I get a segfault.
> >>>> That is,
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >>>>
> >>>> works fine, but
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >>> I cannot, unfortunately, reproduce this with openib BTL.
> >>>
> >>>> segfaults. Here is the relevant output from gdb:
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> [Switching to Thread 1077541088 (LWP 15600)]
> >>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> >>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> >>>> sizeof(mca_pml_ob1_fin_hdr_t));
> >>> can you send me what's inside bml_btl?
> >> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> >> wrong. I fixed this in r15304. But now we hang instead of segfaulting,
> >> and have both processes just looping through opal_progress. I really
> >> don't know what to look for. Any hints?
> >>
> > Can you look in gdb at mca_pml_ob1.rdma_pending?
> Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.
Do you run both ranks on the same node? Can you try to run them on
different nodes?

--
Gleb.
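
[Editorial note for readers of the archive: the hang being probed here revolves
around ob1's pending-RDMA queue. The sketch below is purely illustrative C, not
Open MPI source; every name in it (frag_t, try_put, put_or_queue, on_completion,
POOL_MAX) is invented. It shows the queue-and-retry pattern that
mca_pml_ob1.rdma_pending and mca_pml_ob1_process_pending_rdma() imply: a fragment
that cannot get a BTL descriptor is parked on a pending list and retried from a
completion callback. If descriptors never come back, for example because the free
list is capped below what the traffic in flight needs, the pending list never
drains and both ranks simply spin in opal_progress, which matches the symptom
reported above.]

/* Illustrative sketch only, not Open MPI source: it mimics the
 * queue-and-retry pattern that mca_pml_ob1.rdma_pending implies. When the
 * BTL cannot hand out a descriptor, the fragment is parked on a pending
 * list and retried from a completion callback. All names are invented. */
#include <stdbool.h>
#include <stdio.h>

#define POOL_MAX    4     /* stand-in for a small btl_gm_free_list_max */
#define PENDING_MAX 64

typedef struct { int id; } frag_t;

static int     pool_in_use;               /* descriptors currently handed out */
static frag_t *pending[PENDING_MAX];      /* stand-in for rdma_pending        */
static int     pending_len;

static bool try_put(frag_t *frag)
{
    if (pool_in_use >= POOL_MAX)          /* no descriptor available          */
        return false;
    pool_in_use++;
    printf("started RDMA for frag %d\n", frag->id);
    return true;
}

/* Runs when a descriptor comes back; plays the role of
 * mca_pml_ob1_process_pending_rdma() retrying queued fragments. */
static void on_completion(void)
{
    pool_in_use--;
    while (pending_len > 0 && try_put(pending[0])) {
        for (int i = 1; i < pending_len; i++)   /* pop the list head          */
            pending[i - 1] = pending[i];
        pending_len--;
    }
}

static void put_or_queue(frag_t *frag)
{
    if (!try_put(frag) && pending_len < PENDING_MAX)
        pending[pending_len++] = frag;          /* park it and retry later    */
}

int main(void)
{
    frag_t frags[8];

    for (int i = 0; i < 8; i++) {
        frags[i].id = i;
        put_or_queue(&frags[i]);
    }
    printf("%d fragments pending\n", pending_len);  /* 4 queued               */

    on_completion();          /* one RDMA finishes, so one queued frag starts */
    printf("%d fragments pending\n", pending_len);  /* 3 remain               */
    return 0;
}

[If no completion callback ever fires, nothing drains the queue, so a rank with
dozens of entries on rdma_pending (48 here) just keeps polling progress.]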



Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Tim Prins

Gleb Natapov wrote:
> On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
>> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
>>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
>>>> While looking into another problem I ran into an issue which made ob1
>>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
>>>> test suite, if I limit the gm freelist by too much, I get a segfault.
>>>> That is,
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>>>>
>>>> works fine, but
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>>>
>>> I cannot, unfortunately, reproduce this with openib BTL.
>>>
>>>> segfaults. Here is the relevant output from gdb:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 1077541088 (LWP 15600)]
>>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
>>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
>>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
>>>> sizeof(mca_pml_ob1_fin_hdr_t));
>>>
>>> can you send me what's inside bml_btl?
>>
>> It turns out that the order of arguments to mca_pml_ob1_send_fin was
>> wrong. I fixed this in r15304. But now we hang instead of segfaulting,
>> and have both processes just looping through opal_progress. I really
>> don't know what to look for. Any hints?
>
> Can you look in gdb at mca_pml_ob1.rdma_pending?

Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.

Here is the first item on the list:
$7 = {
  super = {
super = {
  super = {
obj_magic_id = 16046253926196952813,
obj_class = 0x404f5980,
obj_reference_count = 1,
cls_init_file_name = 0x404f30f9 "pml_ob1_sendreq.c",
cls_init_lineno = 1134
  },
  opal_list_next = 0x8f5d680,
  opal_list_prev = 0x404f57c8,
  opal_list_item_refcount = 1,
  opal_list_item_belong_to = 0x404f57b0
},
registration = 0x0,
ptr = 0x0
  },
  rdma_bml = 0x8729098,
  rdma_hdr = {
hdr_common = {
  hdr_type = 8 '\b',
  hdr_flags = 4 '\004'
},
hdr_match = {
  hdr_common = {
hdr_type = 8 '\b',
hdr_flags = 4 '\004'
  },
  hdr_ctx = 5,
  hdr_src = 1,
  hdr_tag = 142418176,
  hdr_seq = 0,
  hdr_padding = "\000"
},
hdr_rndv = {
  hdr_match = {
hdr_common = {
  hdr_type = 8 '\b',
  hdr_flags = 4 '\004'
},
hdr_ctx = 5,
hdr_src = 1,
hdr_tag = 142418176,
hdr_seq = 0,
hdr_padding = "\000"
  },
  hdr_msg_length = 236982400,
  hdr_src_req = {
lval = 0,
ival = 0,
pval = 0x0,
sval = {
  uval = 0,
  lval = 0
}
  }
},
hdr_rget = {
  hdr_rndv = {
hdr_match = {
  hdr_common = {
hdr_type = 8 '\b',
hdr_flags = 4 '\004'
  },
  hdr_ctx = 5,
  hdr_src = 1,
  hdr_tag = 142418176,
  hdr_seq = 0,
  hdr_padding = "\000"
},
hdr_msg_length = 236982400,
hdr_src_req = {
  lval = 0,
  ival = 0,
  pval = 0x0,
  sval = {
uval = 0,
lval = 0
  }
}
  },
  hdr_seg_cnt = 1106481152,
  hdr_padding = "\000\000\000",
  hdr_des = {
lval = 32768,
ival = 32768,
pval = 0x8000,
sval = {
  uval = 32768,
  lval = 0
}
  },
  hdr_segs = {{
  seg_addr = {
lval = 0,
ival = 0,
pval = 0x0,
sval = {
  uval = 0,
  lval = 0
}
  },
  seg_len = 0,
  seg_padding = "\000\000\000",
  seg_key = {
key32 = {0, 0},
key64 = 0,
key8 = "\000\000\000\000\000\000\000"
  }
}}
},
hdr_frag = {
  hdr_common = {
hdr_type = 8 '\b',
hdr_flags = 4 '\004'
  },
  hdr_padding = "\005\000\001\000\000",
  hdr_frag_offset = 142418176,
  hdr_src_req = {
lval = 236982400,
ival = 236982400,
pval = 0xe201080,
sval = {
  uval = 236982400,
  lval = 0
}
  },
  hdr_dst_req = {
lval = 0,
ival = 0,
pval = 0x0,
sval = {
  uval = 0,
  lval = 0
}
  }
},
hdr_ack = {
  hdr_common = {
hdr_type = 8 '\b',
hdr_flags = 4 '\004'
  },
  hdr_padding = "\005\000\001\000\000",
  hdr_src_req = {
lval = 142418176,
ival = 142418176,
pval = 0x87d2100,
sval = {
  uval = 142418176,
  lval = 0
}
  },
  hdr_dst_req = {
lval = 236982400,
ival = 236982400,
pval = 0xe201080,
sval = {
  uval = 236982400,
  lval = 0
}
  },
  hdr_send_offset = 0
},
hdr_rdma = {
  hdr_common = {
hdr_type = 8 

Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Gleb Natapov
On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > > While looking into another problem I ran into an issue which made ob1
> > > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > > test suite, if I limit the gm freelist by too much, I get a segfault.
> > > That is,
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> > >
> > > works fine, but
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >
> > I cannot, unfortunately, reproduce this with openib BTL.
> >
> > > segfaults. Here is the relevant output from gdb:
> > >
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 1077541088 (LWP 15600)]
> > > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > > sizeof(mca_pml_ob1_fin_hdr_t));
> >
> > can you send me what's inside bml_btl?
> 
> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong.
> I fixed this in r15304. But now we hang instead of segfaulting, and have
> both processes just looping through opal_progress. I really don't know what
> to look for. Any hints?
> 
Can you look in gdb at mca_pml_ob1.rdma_pending?

--
Gleb.



Re: [OMPI devel] Ob1 segfault

2007-07-08 Thread Tim Prins
On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > While looking into another problem I ran into an issue which made ob1
> > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > test suite, if I limit the gm freelist by too much, I get a segfault.
> > That is,
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >
> > works fine, but
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>
> I cannot, unfortunately, reproduce this with openib BTL.
>
> > segfaults. Here is the relevant output from gdb:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 1077541088 (LWP 15600)]
> > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > sizeof(mca_pml_ob1_fin_hdr_t));
>
> can you send me what's inside bml_btl?

It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong.
I fixed this in r15304. But now we hang instead of segfaulting, and have both
processes just looping through opal_progress. I really don't know what to
look for. Any hints?

Thanks,

Tim
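
[Editorial note: the transposed-argument problem Tim reports fixing in r15304
is easy to reproduce in miniature. The sketch below is purely illustrative,
not the actual ob1 code or the r15304 change; btl_handle_t, descriptor_t and
send_fin are invented names. It shows how a call site whose argument order no
longer matches the callee can still compile, after which the callee interprets
or dereferences the wrong value, which is the kind of mix-up that shows up in
gdb as implausible argument values inside MCA_PML_OB1_DES_ALLOC.]

/* Hypothetical illustration of the bug class described above: a call site
 * whose argument order no longer matches the callee's parameter list.
 * This is NOT the actual ob1 code or the r15304 change. */
#include <stdint.h>
#include <stdio.h>

typedef struct { int descriptors_left; } btl_handle_t;  /* plays the role of bml_btl */
typedef struct { char bytes[32];       } descriptor_t;  /* plays the role of hdr_des */

/* The callee dereferences 'btl' right away to reserve a descriptor, much as
 * MCA_PML_OB1_DES_ALLOC dereferences bml_btl before anything else. */
static int send_fin(btl_handle_t *btl, descriptor_t *des, uint8_t order, int status)
{
    (void)des;
    printf("send_fin: order=%u status=%d, %d descriptors left\n",
           (unsigned)order, status, btl->descriptors_left);
    if (btl->descriptors_left <= 0)
        return -1;                 /* caller is expected to queue and retry */
    btl->descriptors_left--;
    return 0;
}

int main(void)
{
    btl_handle_t btl = { .descriptors_left = 4 };
    descriptor_t des = { { 0 } };

    /* Call that matches the prototype. */
    send_fin(&btl, &des, 255, 1);

    /* The trailing integer arguments transposed: this compiles without a
     * peep because both promote to int, yet 'order' and 'status' now carry
     * each other's values. Transposing the two pointer arguments would at
     * most draw a warning in C, after which the callee would read descriptor
     * bytes as if they were a btl_handle_t. */
    send_fin(&btl, &des, 1, 255);

    return 0;
}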


>
> > (gdb) bt
> > #0  0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > #1  0x404eef7a in mca_pml_ob1_send_request_put_frag (frag=0xa711f00)
> > at pml_ob1_sendreq.c:1141
> > #2  0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> > #3  0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38,
> > ep=0x9c42c78, des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> > #4  0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0,
> > context=0xb62ad00, status=GM_SUCCESS) at btl_gm.c:682
> > #5  0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_handle_sent_tokens.c:82
> > #6  0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:222
> > #7  0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:300
> > #8  0x40502708 in mca_btl_gm_component_progress () at btl_gm_component.c:649
> > #9  0x404f6fd6 in mca_bml_r2_progress () at bml_r2.c:110
> > #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> > #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> > at ../../../../opal/threads/condition.h:98
> > #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> > at osc_pt2pt_sync.c:142
> > #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at pwin_fence.c:57
> > #14 0x0804a2f3 in test_bandwidth1 (nbufsize=105, min_iterations=10,
> > max_iterations=1000, verbose=0) at test_dan1.c:282
> > #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> > #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> > #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> > (gdb)
> >
> > This is using the trunk. Any ideas?
> >
> > Thanks,
> >
> > Tim
> >
>
> --
>   Gleb.
>



Re: [OMPI devel] Ob1 segfault

2007-07-08 Thread Gleb Natapov
On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> While looking into another problem I ran into an issue which made ob1
> segfault on me. Using gm, and running the test test_dan1 in the onesided
> test suite, if I limit the gm freelist by too much, I get a segfault. That is,
> 
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> 
> works fine, but
> 
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
I cannot, unfortunately, reproduce this with openib BTL.

> 
> segfaults. Here is the relevant output from gdb:
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1077541088 (LWP 15600)]
> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580, 
> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, 
> sizeof(mca_pml_ob1_fin_hdr_t));
can you send me what's inside bml_btl?

> (gdb) bt
> #0  0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580, 
> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> #1  0x404eef7a in mca_pml_ob1_send_request_put_frag (frag=0xa711f00)
> at pml_ob1_sendreq.c:1141
> #2  0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> #3  0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38, ep=0x9c42c78, 
> des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> #4  0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0, context=0xb62ad00, 
> status=GM_SUCCESS) at btl_gm.c:682
> #5  0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_handle_sent_tokens.c:82
> #6  0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_unknown.c:222
> #7  0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_unknown.c:300
> #8  0x40502708 in mca_btl_gm_component_progress () at btl_gm_component.c:649
> #9  0x404f6fd6 in mca_bml_r2_progress () at bml_r2.c:110
> #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> at ../../../../opal/threads/condition.h:98
> #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> at osc_pt2pt_sync.c:142
> #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at pwin_fence.c:57
> #14 0x0804a2f3 in test_bandwidth1 (nbufsize=105, min_iterations=10, 
> max_iterations=1000, verbose=0) at test_dan1.c:282
> #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> (gdb) 
> 
> This is using the trunk. Any ideas?
> 
> Thanks,
> 
> Tim
> 
--
Gleb.
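
[Editorial note: the btl_gm_free_list_max parameter that triggers all of this
caps how large the BTL's descriptor free list may grow. The sketch below is a
minimal, invented model of that policy, not the ompi_free_list_t
implementation; free_list_t, fl_get and fl_return are illustrative names only.
The list grows on demand until the cap, then hands back NULL, and any caller
that neither checks for NULL nor queues the work for a later retry turns
exhaustion into the segfault or hang discussed in this thread.]

/* A minimal sketch of a capped, grow-on-demand free list. Illustrative
 * names only; this is not Open MPI code. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct item { struct item *next; } item_t;

typedef struct {
    item_t *head;       /* items returned and available for reuse    */
    size_t  allocated;  /* items created so far                      */
    size_t  max;        /* hard cap, like btl_gm_free_list_max       */
} free_list_t;

static item_t *fl_get(free_list_t *fl)
{
    if (fl->head != NULL) {               /* reuse a returned item           */
        item_t *it = fl->head;
        fl->head = it->next;
        return it;
    }
    if (fl->allocated >= fl->max)         /* at the cap: caller must cope    */
        return NULL;
    fl->allocated++;
    return calloc(1, sizeof(item_t));     /* grow the list on demand         */
}

static void fl_return(free_list_t *fl, item_t *it)
{
    it->next = fl->head;
    fl->head = it;
}

int main(void)
{
    free_list_t fl = { NULL, 0, 512 };    /* cf. the failing 512 case above  */
    item_t *it = NULL;

    /* More outstanding requests than the cap allows; earlier items are
     * deliberately never returned (and leak, for brevity). */
    for (size_t i = 0; i < 600; i++)
        it = fl_get(&fl);

    if (it == NULL)
        printf("list exhausted: queue and retry, never dereference blindly\n");
    else
        fl_return(&fl, it);
    return 0;
}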