Re: [OMPI devel] Ob1 segfault
On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> Gleb Natapov wrote:
> > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> While looking into another problem I ran into an issue which made ob1
> segfault on me. Using gm, and running the test test_dan1 in the onesided
> test suite, if I limit the gm freelist by too much, I get a segfault.
> That is,
>
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>
> works fine, but
>
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >>> I cannot, unfortunately, reproduce this with openib BTL.
> >>>
> segfaults. Here is the relevant output from gdb:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1077541088 (LWP 15600)]
> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> sizeof(mca_pml_ob1_fin_hdr_t));
> >>> can you send me what's inside bml_btl?
> >> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> >> wrong. I fixed this in r15304. But now we hang instead of segfault,
> >> and have both processes just looping through opal_progress. I really
> >> don't know what to look for. Any hints?
> >>
> > Can you look in gdb at mca_pml_ob1.rdma_pending?
> Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.

Do you run both ranks on the same node? Can you try to run them on
different nodes?

--
Gleb.
Re: [OMPI devel] Ob1 segfault
Gleb Natapov wrote:
> On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
>> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
>>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
>>>> While looking into another problem I ran into an issue which made ob1
>>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
>>>> test suite, if I limit the gm freelist by too much, I get a segfault.
>>>> That is,
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>>>>
>>>> works fine, but
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>>> I cannot, unfortunately, reproduce this with openib BTL.
>>>> segfaults. Here is the relevant output from gdb:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 1077541088 (LWP 15600)]
>>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
>>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
>>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
>>>> sizeof(mca_pml_ob1_fin_hdr_t));
>>> can you send me what's inside bml_btl?
>> It turns out that the order of arguments to mca_pml_ob1_send_fin was
>> wrong. I fixed this in r15304. But now we hang instead of segfault, and
>> have both processes just looping through opal_progress. I really don't
>> know what to look for. Any hints?
> Can you look in gdb at mca_pml_ob1.rdma_pending?

Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.
Here is the first item on the list:

$7 = {
  super = {
    super = {
      super = {
        obj_magic_id = 16046253926196952813,
        obj_class = 0x404f5980,
        obj_reference_count = 1,
        cls_init_file_name = 0x404f30f9 "pml_ob1_sendreq.c",
        cls_init_lineno = 1134
      },
      opal_list_next = 0x8f5d680,
      opal_list_prev = 0x404f57c8,
      opal_list_item_refcount = 1,
      opal_list_item_belong_to = 0x404f57b0
    },
    registration = 0x0,
    ptr = 0x0
  },
  rdma_bml = 0x8729098,
  rdma_hdr = {
    hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
    hdr_match = {
      hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
      hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0,
      hdr_padding = "\000"
    },
    hdr_rndv = {
      hdr_match = {
        hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
        hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0,
        hdr_padding = "\000"
      },
      hdr_msg_length = 236982400,
      hdr_src_req = {lval = 0, ival = 0, pval = 0x0, sval = {uval = 0, lval = 0}}
    },
    hdr_rget = {
      hdr_rndv = {
        hdr_match = {
          hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
          hdr_ctx = 5, hdr_src = 1, hdr_tag = 142418176, hdr_seq = 0,
          hdr_padding = "\000"
        },
        hdr_msg_length = 236982400,
        hdr_src_req = {lval = 0, ival = 0, pval = 0x0, sval = {uval = 0, lval = 0}}
      },
      hdr_seg_cnt = 1106481152,
      hdr_padding = "\000\000\000",
      hdr_des = {lval = 32768, ival = 32768, pval = 0x8000, sval = {uval = 32768, lval = 0}},
      hdr_segs = {{
        seg_addr = {lval = 0, ival = 0, pval = 0x0, sval = {uval = 0, lval = 0}},
        seg_len = 0,
        seg_padding = "\000\000\000",
        seg_key = {key32 = {0, 0}, key64 = 0, key8 = "\000\000\000\000\000\000\000"}
      }}
    },
    hdr_frag = {
      hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
      hdr_padding = "\005\000\001\000\000",
      hdr_frag_offset = 142418176,
      hdr_src_req = {lval = 236982400, ival = 236982400, pval = 0xe201080, sval = {uval = 236982400, lval = 0}},
      hdr_dst_req = {lval = 0, ival = 0, pval = 0x0, sval = {uval = 0, lval = 0}}
    },
    hdr_ack = {
      hdr_common = {hdr_type = 8 '\b', hdr_flags = 4 '\004'},
      hdr_padding = "\005\000\001\000\000",
      hdr_src_req = {lval = 142418176, ival = 142418176, pval = 0x87d2100, sval = {uval = 142418176, lval = 0}},
      hdr_dst_req = {lval = 236982400, ival = 236982400, pval = 0xe201080, sval = {uval = 236982400, lval = 0}},
      hdr_send_offset = 0
    },
    hdr_rdma = {
      hdr_common = {
        hdr_type = 8
Re: [OMPI devel] Ob1 segfault
On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > > While looking into another problem I ran into an issue which made ob1
> > > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > > test suite, if I limit the gm freelist by too much, I get a segfault.
> > > That is,
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> > >
> > > works fine, but
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >
> > I cannot, unfortunately, reproduce this with openib BTL.
> >
> > > segfaults. Here is the relevant output from gdb:
> > >
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 1077541088 (LWP 15600)]
> > > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > > sizeof(mca_pml_ob1_fin_hdr_t));
> >
> > can you send me what's inside bml_btl?
> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> wrong. I fixed this in r15304. But now we hang instead of segfault, and
> have both processes just looping through opal_progress. I really don't
> know what to look for. Any hints?

Can you look in gdb at mca_pml_ob1.rdma_pending?

--
Gleb.
Re: [OMPI devel] Ob1 segfault
On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > While looking into another problem I ran into an issue which made ob1
> > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > test suite, if I limit the gm freelist by too much, I get a segfault.
> > That is,
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >
> > works fine, but
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>
> I cannot, unfortunately, reproduce this with openib BTL.
>
> > segfaults. Here is the relevant output from gdb:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 1077541088 (LWP 15600)]
> > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > sizeof(mca_pml_ob1_fin_hdr_t));
>
> can you send me what's inside bml_btl?

It turns out that the order of arguments to mca_pml_ob1_send_fin was
wrong. I fixed this in r15304. But now we hang instead of segfault, and
have both processes just looping through opal_progress. I really don't
know what to look for. Any hints?

Thanks,

Tim

> > (gdb) bt
> > #0 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490,
> > bml_btl=0xd323580, hdr_des=0x9e54e78, order=255 '\377', status=1) at
> > pml_ob1.c:267
> > #1 0x404eef7a in mca_pml_ob1_send_request_put_frag (frag=0xa711f00)
> > at pml_ob1_sendreq.c:1141
> > #2 0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> > #3 0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38,
> > ep=0x9c42c78, des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> > #4 0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0,
> > context=0xb62ad00, status=GM_SUCCESS) at btl_gm.c:682
> > #5 0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_handle_sent_tokens.c:82
> > #6 0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:222
> > #7 0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:300
> > #8 0x40502708 in mca_btl_gm_component_progress () at btl_gm_component.c:649
> > #9 0x404f6fd6 in mca_bml_r2_progress () at bml_r2.c:110
> > #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> > #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> > at ../../../../opal/threads/condition.h:98
> > #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> > at osc_pt2pt_sync.c:142
> > #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at pwin_fence.c:57
> > #14 0x0804a2f3 in test_bandwidth1 (nbufsize=105, min_iterations=10,
> > max_iterations=1000, verbose=0) at test_dan1.c:282
> > #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> > #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> > #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> > (gdb)
> >
> > This is using the trunk. Any ideas?
> >
> > Thanks,
> >
> > Tim
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Gleb.
Re: [OMPI devel] Ob1 segfault
On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> While looking into another problem I ran into an issue which made ob1 segfault
> on me. Using gm, and running the test test_dan1 in the onesided test suite,
> if I limit the gm freelist by too much, I get a segfault. That is,
>
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>
> works fine, but
>
> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1

I cannot, unfortunately, reproduce this with openib BTL.

> segfaults. Here is the relevant output from gdb:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1077541088 (LWP 15600)]
> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> sizeof(mca_pml_ob1_fin_hdr_t));

can you send me what's inside bml_btl?

> (gdb) bt
> #0 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> #1 0x404eef7a in mca_pml_ob1_send_request_put_frag (frag=0xa711f00)
> at pml_ob1_sendreq.c:1141
> #2 0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> #3 0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38, ep=0x9c42c78,
> des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> #4 0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0, context=0xb62ad00,
> status=GM_SUCCESS) at btl_gm.c:682
> #5 0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_handle_sent_tokens.c:82
> #6 0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_unknown.c:222
> #7 0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> at ./libgm/gm_unknown.c:300
> #8 0x40502708 in mca_btl_gm_component_progress () at btl_gm_component.c:649
> #9 0x404f6fd6 in mca_bml_r2_progress () at bml_r2.c:110
> #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> at ../../../../opal/threads/condition.h:98
> #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> at osc_pt2pt_sync.c:142
> #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at pwin_fence.c:57
> #14 0x0804a2f3 in test_bandwidth1 (nbufsize=105, min_iterations=10,
> max_iterations=1000, verbose=0) at test_dan1.c:282
> #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> (gdb)
>
> This is using the trunk. Any ideas?
>
> Thanks,
>
> Tim

--
Gleb.