Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Yes, only the first segfault is fixed in the nightly builds. You can run
mx_endpoint_info to see how many endpoints are available and whether any are
in use.

As for the segfault you are seeing now, I am not sure what is causing it.
Hopefully someone who knows more about that area of the code than I do can
help.

Thanks,

Tim

On Apr 2, 2007, at 6:12 AM, de Almeida, Valmor F. wrote:

> Hi Tim,
>
> I installed the openmpi-1.2.1a0r14178 tarball (took this opportunity to
> use the Intel Fortran compiler instead of gfortran). With a simple test
> it seems to work, but note the same messages:
>
> ->mpirun -np 8 -machinefile mymachines a.out
> [x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> [x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> Hello, world! I am 4 of 7
> Hello, world! I am 0 of 7
> Hello, world! I am 1 of 7
> Hello, world! I am 5 of 7
> Hello, world! I am 2 of 7
> Hello, world! I am 7 of 7
> Hello, world! I am 6 of 7
> Hello, world! I am 3 of 7
>
> and the machinefile is
>
> x1 slots=4 max_slots=4
> x2 slots=4 max_slots=4
>
> However, with a realistic code it starts fine (same messages as above)
> and somewhere later:
>
> [x1:25947] *** Process received signal ***
> [x1:25947] Signal: Segmentation fault (11)
> [x1:25947] Signal code: Address not mapped (1)
> [x1:25947] Failing at address: 0x14
> [x1:25947] [ 0] [0xb7f00440]
> [x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
> [x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
> [x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
> [x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
> [x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
> [x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
> [x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
> [x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
> [x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
> [x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
> [x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
> [x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
> [x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
> [x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
> [x1:25947] [15] driver0(main+0x181) [0x8068c7f]
> [x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
> [x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
> [x1:25947] *** End of error message ***
> mpirun noticed that job rank 0 with PID 25945 on node x1 exited on
> signal 15 (Terminated).
> 7 additional processes aborted (not shown)
>
> This code does run to completion using ompi-1.2 if I use only 2 slots
> per machine.
>
> Thanks for any help.
>
> --
> Valmor
>
> > -----Original Message-----
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> > On Behalf Of Tim Prins
> > Sent: Friday, March 30, 2007 10:49 PM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> > with status=20
> >
> > Hi Valmor,
> >
> > What is happening here is that when Open MPI tries to create an MX
> > endpoint for communication, MX returns code 20, which is MX_BUSY.
> > At this point we should gracefully move on, but there is a bug in
> > Open MPI 1.2 which causes a segmentation fault in case of this type
> > of error. This will be fixed in 1.2.1, and the fix is available now
> > in the 1.2 nightly tarballs.
> >
> > Hope this helps,
> >
> > Tim
> >
> > On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > > Hello,
> > >
> > > I am getting this error any time the number of processes requested
> > > per machine is greater than the number of cpus. I suspect it is
> > > something in the configuration of mx/ompi that I am missing, since
> > > another machine I have without mx installed runs ompi correctly
> > > with oversubscription.
> > >
> > > Thanks for any help.
> > >
> > > --
> > > Valmor
> > >
> > > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > [x1:23624] *** Process received signal ***
> > > [x1:23624] Signal: Segmentation fault (11)
> > > [x1:23624] Signal code: Address not mapped (1)
> > > [x1:23624] Failing at address: 0x20
> > > [x1:23624] [ 0] [0xb7f7f440]
> > > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > > [x1:23624] [ 2]
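The graceful behavior Tim describes amounts to treating MX_BUSY as "skip
this transport" rather than as a fatal condition. Below is a minimal C
sketch of that control flow; it assumes the Myrinet MX API from
myriexpress.h (mx_open_endpoint(), mx_strerror(), MX_ANY_NIC,
MX_ANY_ENDPOINT) and is not the actual Open MPI mca_btl_mx source:

    /*
     * Sketch only: skip the MX transport when no endpoint is free,
     * instead of segfaulting. Assumes the Myrinet MX API
     * (myriexpress.h); not Open MPI's real mca_btl_mx code.
     */
    #include <stdio.h>
    #include <myriexpress.h>

    /* Returns 0 on success, -1 if the MX transport should be disabled. */
    static int try_open_mx_endpoint(mx_endpoint_t *endpoint)
    {
        mx_return_t rc;

        rc = mx_open_endpoint(MX_ANY_NIC, MX_ANY_ENDPOINT, 0 /* key */,
                              NULL, 0, endpoint);
        if (MX_SUCCESS == rc) {
            return 0;
        }

        /* Status 20 is MX_BUSY: every endpoint on the NIC is already
         * taken, e.g. when more local processes run than endpoints
         * exist. The fix is to fall back to the other BTLs (sm, tcp)
         * rather than continue with an invalid endpoint, which is
         * what crashed Open MPI 1.2. */
        fprintf(stderr, "mx_open_endpoint() failed: %s; disabling MX BTL\n",
                mx_strerror(rc));
        return -1;
    }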
Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (took this opportunity to use
the Intel Fortran compiler instead of gfortran). With a simple test it seems
to work, but note the same messages:

->mpirun -np 8 -machinefile mymachines a.out
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1 slots=4 max_slots=4
x2 slots=4 max_slots=4

However, with a realistic code it starts fine (same messages as above) and
somewhere later:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
[x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
[x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on signal 15
(Terminated).
7 additional processes aborted (not shown)

This code does run to completion using ompi-1.2 if I use only 2 slots per
machine.

Thanks for any help.

--
Valmor

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Tim Prins
> Sent: Friday, March 30, 2007 10:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> with status=20
>
> Hi Valmor,
>
> What is happening here is that when Open MPI tries to create an MX
> endpoint for communication, MX returns code 20, which is MX_BUSY.
>
> At this point we should gracefully move on, but there is a bug in Open
> MPI 1.2 which causes a segmentation fault in case of this type of error.
> This will be fixed in 1.2.1, and the fix is available now in the 1.2
> nightly tarballs.
>
> Hope this helps,
>
> Tim
>
> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > Hello,
> >
> > I am getting this error any time the number of processes requested per
> > machine is greater than the number of cpus. I suspect it is something
> > in the configuration of mx/ompi that I am missing, since another
> > machine I have without mx installed runs ompi correctly with
> > oversubscription.
> >
> > Thanks for any help.
> >
> > --
> > Valmor
> >
> > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [x1:23624] *** Process received signal ***
> > [x1:23624] Signal: Segmentation fault (11)
> > [x1:23624] Signal code: Address not mapped (1)
> > [x1:23624] Failing at address: 0x20
> > [x1:23624] [ 0] [0xb7f7f440]
> > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
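The "2 slots per machine" workaround mentioned near the end of the report
fits the endpoint-exhaustion picture: fewer local processes means fewer
simultaneous mx_open_endpoint() calls per NIC. One plausible machinefile
for that reduced run, reusing the node names from the thread (the exact
slot counts here are an assumption, not taken from the thread):

    x1 slots=2 max_slots=2
    x2 slots=2 max_slots=2

    ->mpirun -np 4 -machinefile mymachines a.out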
Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello Tim,

Thanks for the info. I also received this help from Myrinet:

    It looks like you are running out of endpoints. This discusses what
    endpoints are:

    http://www.myri.com/cgi-bin/fom.pl?file=421

    And this explains how to increase the limit:

    http://www.myri.com/cgi-bin/fom.pl?file=482

    Let us know if this doesn't address the problem.

I haven't had time to look into it.

--
Valmor

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Tim Prins
> Sent: Friday, March 30, 2007 10:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> with status=20
>
> Hi Valmor,
>
> What is happening here is that when Open MPI tries to create an MX
> endpoint for communication, MX returns code 20, which is MX_BUSY.
>
> At this point we should gracefully move on, but there is a bug in Open
> MPI 1.2 which causes a segmentation fault in case of this type of error.
> This will be fixed in 1.2.1, and the fix is available now in the 1.2
> nightly tarballs.
>
> Hope this helps,
>
> Tim
>
> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > Hello,
> >
> > I am getting this error any time the number of processes requested per
> > machine is greater than the number of cpus. I suspect it is something
> > in the configuration of mx/ompi that I am missing, since another
> > machine I have without mx installed runs ompi correctly with
> > oversubscription.
> >
> > Thanks for any help.
> >
> > --
> > Valmor
> >
> > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [x1:23624] *** Process received signal ***
> > [x1:23624] Signal: Segmentation fault (11)
> > [x1:23624] Signal code: Address not mapped (1)
> > [x1:23624] Failing at address: 0x20
> > [x1:23624] [ 0] [0xb7f7f440]
> > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
> > [x1:23624] [ 3] /opt/ompi/lib/libmpi.so.0(mca_btl_base_select+0x1a0) [0xb7f41900]
> > [x1:23624] [ 4] /opt/openmpi-1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x26) [0xb7ad1006]
> > [x1:23624] [ 5] /opt/ompi/lib/libmpi.so.0(mca_bml_base_init+0x78) [0xb7f41198]
> > [x1:23624] [ 6] /opt/openmpi-1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x7d) [0xb7af866d]
> > [x1:23624] [ 7] /opt/ompi/lib/libmpi.so.0(mca_pml_base_select+0x176) [0xb7f49b56]
> > [x1:23624] [ 8] /opt/ompi/lib/libmpi.so.0(ompi_mpi_init+0x4cf) [0xb7f0fe2f]
> > [x1:23624] [ 9] /opt/ompi/lib/libmpi.so.0(MPI_Init+0xab) [0xb7f3204b]
> > [x1:23624] [10] a.out(_ZN3MPI4InitERiRPPc+0x18) [0x8052cbe]
> > [x1:23624] [11] a.out(main+0x21) [0x804f4a7]
> > [x1:23624] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7be9824]
> >
> > content of mymachines-1 file
> >
> > x1 max_slots=4
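Following the Myrinet diagnosis, a quick way to see how many endpoints a
node will hand out is to keep opening endpoints until MX refuses. A
self-contained sketch, again assuming the MX API from myriexpress.h
(mx_init(), mx_open_endpoint(), mx_close_endpoint(), mx_finalize()); the
per-NIC endpoint limit itself is a driver setting covered by the second
FAQ link above:

    /*
     * Sketch: count how many MX endpoints this node hands out before
     * refusing, to confirm the "running out of endpoints" diagnosis.
     * Assumes the Myrinet MX API (myriexpress.h); link with the MX
     * library (e.g. -lmyriexpress, path may vary by install).
     */
    #include <stdio.h>
    #include <myriexpress.h>

    #define MAX_TRY 64  /* probe ceiling; arbitrary */

    int main(void)
    {
        mx_endpoint_t eps[MAX_TRY];
        int n = 0;

        if (mx_init() != MX_SUCCESS) {
            fprintf(stderr, "mx_init() failed\n");
            return 1;
        }

        /* Open endpoints until MX refuses (MX_BUSY once none are free). */
        while (n < MAX_TRY &&
               mx_open_endpoint(MX_ANY_NIC, MX_ANY_ENDPOINT, 0 /* key */,
                                NULL, 0, &eps[n]) == MX_SUCCESS) {
            n++;
        }
        printf("opened %d endpoint(s) before MX stopped handing them out\n",
               n);

        /* Release everything we grabbed. */
        while (n-- > 0) {
            mx_close_endpoint(eps[n]);
        }
        mx_finalize();
        return 0;
    }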