Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20

2007-04-02 Thread Tim Prins
Yes, only the first segfault is fixed in the nightly builds. You can  
run mx_endpoint_info to see how many endpoints are available and if  
any are in use.
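
For example, a quick check on each node in the machinefile before
launching the job could look like this (a minimal sketch; the exact
output format depends on the MX release installed):

->mx_endpoint_info

and you can compare the number of free endpoints it reports against the
number of ranks you plan to place on that node.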


As for the segfault you are seeing now, I am unsure what is causing it.
Hopefully someone who knows more about that area of the code than I do
can help.


Thanks,

Tim

On Apr 2, 2007, at 6:12 AM, de Almeida, Valmor F. wrote:



Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (took this opportunity to
use the Intel Fortran compiler instead of gfortran). With a simple test
it seems to work, but note the same messages:

->mpirun -np 8 -machinefile mymachines a.out
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1  slots=4 max_slots=4
x2  slots=4 max_slots=4

However, with a realistic code it starts fine (same messages as above)
and then crashes somewhere later:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
[x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
[x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on
signal 15 (Terminated).
7 additional processes aborted (not shown)


This code does run to completion using ompi-1.2 if I use only 2 slots
per machine.

Thanks for any help.

--
Valmor


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Tim Prins
Sent: Friday, March 30, 2007 10:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
with status=20

Hi Valmor,

What is happening here is that when Open MPI tries to create an MX
endpoint for communication, MX returns code 20, which is MX_BUSY.

At this point we should gracefully move on, but there is a bug in Open
MPI 1.2 which causes a segmentation fault in case of this type of
error. This will be fixed in 1.2.1, and the fix is available now in
the 1.2 nightly tarballs.


Hope this helps,

Tim

On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:

Hello,

I am getting this error any time the number of processes requested per
machine is greater than the number of CPUs. I suspect it is something
in the configuration of mx / ompi that I am missing, since another
machine I have without mx installed runs ompi correctly with
oversubscription.

Thanks for any help.

--
Valmor


->mpirun -np 3 --machinefile mymachines-1 a.out
[x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:23624] *** Process received signal ***
[x1:23624] Signal: Segmentation fault (11)
[x1:23624] Signal code: Address not mapped (1)
[x1:23624] Failing at address: 0x20
[x1:23624] [ 0] [0xb7f7f440]
[x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
[x1:23624] [ 2]

Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20

2007-04-02 Thread de Almeida, Valmor F.

Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (took this opportunity to
use the Intel Fortran compiler instead of gfortran). With a simple test
it seems to work, but note the same messages:

->mpirun -np 8 -machinefile mymachines a.out 
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1  slots=4 max_slots=4
x2  slots=4 max_slots=4

However, with a realistic code it starts fine (same messages as above)
and then crashes somewhere later:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
[x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
[x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on
signal 15 (Terminated). 
7 additional processes aborted (not shown)


This code does run to completion using ompi-1.2 if I use only 2 slots
per machine.
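
For reference, that working configuration corresponds to a machinefile
along these lines (a sketch; the file name mymachines-2 and the launch
line below are only an illustration):

x1  slots=2 max_slots=2
x2  slots=2 max_slots=2

->mpirun -np 4 -machinefile mymachines-2 driver0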

Thanks for any help.

--
Valmor

> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Tim Prins
> Sent: Friday, March 30, 2007 10:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> with status=20
> 
> Hi Valmor,
> 
> What is happening here is that when Open MPI tries to create an MX
> endpoint for communication, MX returns code 20, which is MX_BUSY.
> 
> At this point we should gracefully move on, but there is a bug in Open
> MPI 1.2 which causes a segmentation fault in case of this type of
> error. This will be fixed in 1.2.1, and the fix is available now in
> the 1.2 nightly tarballs.
> 
> Hope this helps,
> 
> Tim
> 
> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > Hello,
> >
> > I am getting this error any time the number of processes requested
> > per machine is greater than the number of CPUs. I suspect it is
> > something in the configuration of mx / ompi that I am missing, since
> > another machine I have without mx installed runs ompi correctly with
> > oversubscription.
> >
> > Thanks for any help.
> >
> > --
> > Valmor
> >
> >
> > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [x1:23624] *** Process received signal ***
> > [x1:23624] Signal: Segmentation fault (11)
> > [x1:23624] Signal code: Address not mapped (1)
> > [x1:23624] Failing at address: 0x20
> > [x1:23624] [ 0] [0xb7f7f440]
> > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658

Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed with status=20

2007-04-01 Thread de Almeida, Valmor F.

Hello Tim,

Thanks for the info. I also received this help from Myrinet:



It looks like you are running out of endpoints.

This discusses what endpoints are:
 http://www.myri.com/cgi-bin/fom.pl?file=421 

And this explains how to increase the limit:
 http://www.myri.com/cgi-bin/fom.pl?file=482

Let us know if this doesn't address the problem.


I haven't had time to look into it.
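
In the meantime, one possible stopgap (untested here, and it assumes
the tcp and sm BTLs were built into this install) is to tell Open MPI
to skip the MX BTL for the oversubscribed runs, e.g.:

->mpirun --mca btl ^mx -np 8 -machinefile mymachines a.out

so that those runs fall back to shared memory and TCP instead of trying
to open more MX endpoints than the NIC allows.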

--
Valmor

> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Tim Prins
> Sent: Friday, March 30, 2007 10:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> with status=20
> 
> Hi Valmor,
> 
> What is happening here is that when Open MPI tries to create an MX
> endpoint for communication, MX returns code 20, which is MX_BUSY.
> 
> At this point we should gracefully move on, but there is a bug in Open
> MPI 1.2 which causes a segmentation fault in case of this type of
> error. This will be fixed in 1.2.1, and the fix is available now in
> the 1.2 nightly tarballs.
> 
> Hope this helps,
> 
> Tim
> 
> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > Hello,
> >
> > I am getting this error any time the number of processes requested
> > per machine is greater than the number of CPUs. I suspect it is
> > something in the configuration of mx / ompi that I am missing, since
> > another machine I have without mx installed runs ompi correctly with
> > oversubscription.
> >
> > Thanks for any help.
> >
> > --
> > Valmor
> >
> >
> > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [x1:23624] *** Process received signal ***
> > [x1:23624] Signal: Segmentation fault (11)
> > [x1:23624] Signal code: Address not mapped (1)
> > [x1:23624] Failing at address: 0x20
> > [x1:23624] [ 0] [0xb7f7f440]
> > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
> > [x1:23624] [ 3] /opt/ompi/lib/libmpi.so.0(mca_btl_base_select+0x1a0) [0xb7f41900]
> > [x1:23624] [ 4] /opt/openmpi-1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x26) [0xb7ad1006]
> > [x1:23624] [ 5] /opt/ompi/lib/libmpi.so.0(mca_bml_base_init+0x78) [0xb7f41198]
> > [x1:23624] [ 6] /opt/openmpi-1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x7d) [0xb7af866d]
> > [x1:23624] [ 7] /opt/ompi/lib/libmpi.so.0(mca_pml_base_select+0x176) [0xb7f49b56]
> > [x1:23624] [ 8] /opt/ompi/lib/libmpi.so.0(ompi_mpi_init+0x4cf) [0xb7f0fe2f]
> > [x1:23624] [ 9] /opt/ompi/lib/libmpi.so.0(MPI_Init+0xab) [0xb7f3204b]
> > [x1:23624] [10] a.out(_ZN3MPI4InitERiRPPc+0x18) [0x8052cbe]
> > [x1:23624] [11] a.out(main+0x21) [0x804f4a7]
> > [x1:23624] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7be9824]
> >
> > content of mymachines-1 file
> >
> > x1  max_slots=4
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users