Yes, only the first segfault is fixed in the nightly builds. You can run mx_endpoint_info to see how many endpoints are available and whether any are in use.
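In the meantime, a possible stopgap (just a sketch, and it assumes the tcp and sm BTLs were built into your install) would be to tell Open MPI to skip the mx BTL entirely, so the oversubscribed ranks fall back to TCP/shared memory:

```
# The leading ^ excludes the listed BTL(s), so the job runs over
# the tcp and sm transports instead of MX.
mpirun --mca btl ^mx -np 8 -machinefile mymachines a.out
```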

As far as the segfault you are seeing now, I am unsure what is causing it. Hopefully someone who knows more about that area of the code than me can help.

Thanks,

Tim

On Apr 2, 2007, at 6:12 AM, de Almeida, Valmor F. wrote:


Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (and took this opportunity to use the Intel Fortran compiler instead of gfortran). With a simple test it seems to work, but I note the same messages:

->mpirun -np 8 -machinefile mymachines a.out
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1  slots=4 max_slots=4
x2  slots=4 max_slots=4

However with a realistic code, it starts fine (same messages as above)
and somewhere later:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
[x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
[x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on signal 15 (Terminated).
7 additional processes aborted (not shown)


This code does run to completion using ompi-1.2 if I use only 2 slots
per machine.

Thanks for any help.

--
Valmor

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
Sent: Friday, March 30, 2007 10:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20

Hi Valmor,

What is happening here is that when Open MPI tries to create an MX endpoint for communication, MX returns code 20, which is MX_BUSY.

At this point we should gracefully move on, but there is a bug in Open MPI 1.2 which causes a segmentation fault on this type of error. This will be fixed in 1.2.1, and the fix is available now in the 1.2 nightly tarballs.

Hope this helps,

Tim

On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
Hello,

I am getting this error any time the number of processes requested per machine is greater than the number of CPUs. I suspect it is something in the configuration of mx/ompi that I am missing, since another machine I have without mx installed runs ompi correctly with oversubscription.

Thanks for any help.

--
Valmor


->mpirun -np 3 --machinefile mymachines-1 a.out
[x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:23624] *** Process received signal ***
[x1:23624] Signal: Segmentation fault (11)
[x1:23624] Signal code: Address not mapped (1)
[x1:23624] Failing at address: 0x20
[x1:23624] [ 0] [0xb7f7f440]
[x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
[x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
[x1:23624] [ 3] /opt/ompi/lib/libmpi.so.0(mca_btl_base_select+0x1a0) [0xb7f41900]
[x1:23624] [ 4] /opt/openmpi-1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x26) [0xb7ad1006]
[x1:23624] [ 5] /opt/ompi/lib/libmpi.so.0(mca_bml_base_init+0x78) [0xb7f41198]
[x1:23624] [ 6] /opt/openmpi-1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x7d) [0xb7af866d]
[x1:23624] [ 7] /opt/ompi/lib/libmpi.so.0(mca_pml_base_select+0x176) [0xb7f49b56]
[x1:23624] [ 8] /opt/ompi/lib/libmpi.so.0(ompi_mpi_init+0x4cf) [0xb7f0fe2f]
[x1:23624] [ 9] /opt/ompi/lib/libmpi.so.0(MPI_Init+0xab) [0xb7f3204b]
[x1:23624] [10] a.out(_ZN3MPI4InitERiRPPc+0x18) [0x8052cbe]
[x1:23624] [11] a.out(main+0x21) [0x804f4a7]
[x1:23624] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7be9824]

The content of the mymachines-1 file is:

x1  max_slots=4



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
