Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?

2016-03-17 Thread Ralph Castain
Yeah, it looks like something is wrong with the mmap backend for some reason. 
It gets used by both vader and sm, so no help there.

I’m afraid I’ll have to defer to Nathan from here as he is more familiar with 
it than I.


> On Mar 17, 2016, at 4:55 PM, Lane, William  wrote:
> 
> I ran OpenMPI using the "-mca btl ^vader" switch Ralph recommended and I'm 
> still getting the same errors
> 
> qsub -q short.q -V -pe make 206 -b y mpirun -np 206 --prefix 
> /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^vader --mca 
> plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
> 
> [csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got update_proc_state for 
> job [42154,1]
> [csclprd3-6-12:30667] *** Process received signal ***
> [csclprd3-6-12:30667] Signal: Bus error (7)
> [csclprd3-6-12:30667] Signal code: Non-existant physical address (2)
> [csclprd3-6-12:30667] Failing at address: 0x2b1b18a2d000
> [csclprd3-6-12:30667] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b1b0e06c500]
> [csclprd3-6-12:30667] [ 1] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b1b0f5fd524]
> [csclprd3-6-12:30667] [ 2] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmca_common_sm.so.4(mca_common_sm_module_create_and_attach+0x56)[0x2b1b124daab6]
> [csclprd3-6-12:30667] [ 3] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x39cb)[0x2b1b12d749cb]
> [csclprd3-6-12:30667] [ 4] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x3f2a)[0x2b1b12d74f2a]
> [csclprd3-6-12:30667] [ 5] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b1b0ddfdb07]
> [csclprd3-6-12:30667] [ 6] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b1b126de7b2]
> [csclprd3-6-12:30667] [ 7] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b1b0ddfd309]
> [csclprd3-6-12:30667] [ 8] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b1b133a138c]
> [csclprd3-6-12:30667] [ 9] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b1b0de0e780]
> [csclprd3-6-12:30667] [10] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b1b0ddc017d]
> [csclprd3-6-12:30667] [11] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b1b0dddf820]
> [csclprd3-6-12:30667] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
> [csclprd3-6-12:30667] [13] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b1b0e298cdd]
> [csclprd3-6-12:30667] [14] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
> [csclprd3-6-12:30667] *** End of error message ***
> 
> -Bill L.
> 
> From: users [users-boun...@open-mpi.org] on behalf of Lane, William
> [william.l...@cshs.org]
> Sent: Thursday, March 17, 2016 4:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 
> 4096 still required?
> 
> I apologize Ralph, I forgot to include my command line for invoking OpenMPI 
> on SoGE:
> 
> qsub -q short.q -V -pe make 87 -b y mpirun -np 87 --prefix 
> /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca 
> plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
> 
> a_1_10_1.out is my OpenMPI test code binary compiled under OpenMPI 1.10.1.
> 
> Thanks for the quick response!
> 
> -Bill L.
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain
> [r...@open-mpi.org]
> Sent: Thursday, March 17, 2016 4:44 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 
> 4096 still required?
> 
> No, that shouldn’t be the issue any more - and that isn’t what the backtrace 
> indicates. It looks instead like there was a problem with the shared memory 
> backing file on a remote node, and that caused the vader shared memory BTL to 
> segfault.
> 
> Try turning vader off and see if that helps - I’m not sure what you are 
> using, but maybe “-mca btl ^vader” will suffice
> 
> Nathan - any other suggestions?
> 
> 
>> On Mar 17, 2016, at 4:40 PM, Lane, William  wrote:
>> 
>> I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
>> files limits be >= 4096 in order to function when large numbers of slots
>> were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
>> still present for OpenMPI versions 1.10.1 and greater?
>> 
>> I'm having some issues now with OpenMPI version 1.10.1 that remind me
>> of the issues I had w/1.3.3 where OpenMPI worked fine as long as I don't
>> request too many slots.
>> 
>> When I look at ulimit -a (soft limit) I see:
>> open files  (-n) 1024
>> 
>> ulimit -Ha (hard limit) gives:
>> open files  (-n) 4096
>> 
>> I'm getting errors of the form:
>> [csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_st

Re: [OMPI users] Dynamically throttle/scale processes

2016-03-17 Thread Gilles Gouaillardet

Brian,

Unlike Ralph, I will assume all your processes are MPI tasks.

At first glance, the MPI philosophy is the other way around:
start with mpirun -np 1 traffic_cop, and then MPI_Comm_spawn("child") 
when you need more workers.
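
For illustration, a minimal sketch of that spawn-on-demand pattern on the
traffic-cop side (the program name and worker count are placeholders; the
spawned children would retrieve the intercommunicator with
MPI_Comm_get_parent(), not shown here):

/* traffic_cop sketch: start alone, spawn workers only when load demands it */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Launch 4 copies of ./child; 'children' is an intercommunicator
       linking this traffic cop to the new workers. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* ... hand out work over 'children' with MPI_Send/MPI_Recv ... */

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}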


That being said, if you are fine with having idle children (e.g. 
children that consume no CPU resources, but do keep memory, network and 
other system resources allocated), then you can start 256 MPI tasks, 
either with

mpirun -np 256 cop_children
or
mpirun -np 1 traffic_cop : -np 255 children
/* I am not 100% sure about the syntax here ... */

There is no MPI way to signal a task, but you can have your children 
wait for a message from the master.
Unless you are using a TCP interconnect, I do not think OpenMPI is 
production ready for MPI_THREAD_MULTIPLE,
so one option is to have your children MPI_Recv() information from the 
traffic cop in the main thread and do the real work in another pthread 
(so the main thread can kill the working thread when MPI_Recv returns).
Another option is to MPI_Irecv(), do the work, and periodically 
MPI_Test() to check if there is any order from the traffic cop.
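
For illustration, a minimal sketch of that second option as seen from a child
rank (the control tag and the idle/resume/shutdown encoding are assumptions
made up for this example, not anything MPI or Open MPI defines):

/* child-side polling loop: work in slices, check for orders from rank 0 */
#include <mpi.h>

#define CTRL_TAG 99                 /* assumed control-message tag */

static void process_one_line(void)  /* placeholder for the real work */
{
}

int main(int argc, char **argv)
{
    int rank, order = 0, flag = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {                /* rank 0 is the traffic cop */
        MPI_Irecv(&order, 1, MPI_INT, 0, CTRL_TAG, MPI_COMM_WORLD, &req);
        for (;;) {
            process_one_line();     /* one slice of real work */
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            if (flag) {             /* the traffic cop sent an order */
                if (order == 0)     /* 0 = go idle: block until the next order */
                    MPI_Recv(&order, 1, MPI_INT, 0, CTRL_TAG, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                if (order < 0)      /* negative = shut down */
                    break;
                MPI_Irecv(&order, 1, MPI_INT, 0, CTRL_TAG, MPI_COMM_WORLD, &req);
                flag = 0;
            }
        }
    }
    MPI_Finalize();
    return 0;
}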


Cheers,

Gilles

On 3/18/2016 8:38 AM, Ralph Castain wrote:
Hmmm….I haven’t heard of that specific use-case, but I have seen some 
similar things. Did you want the processes to be paused, or killed, 
when you scale down? Obviously, I’m assuming they are not MPI procs, yes?


I can certainly see a way to make mpirun do it without too much fuss, 
though it would require a message as opposed to a signal so you can 
indicate how many procs to “idle/kill”.



On Mar 17, 2016, at 3:22 PM, Andrus, Brian Contractor [bdand...@nps.edu] wrote:


All,
I have an mpi-based program that has a master process that acts as a 
‘traffic cop’ in that it hands out work to child processes.
I want to be able to dynamically throttle how many child processes 
are in use at any given time.
For instance, if I start it with “mpirun -n 512” I could send a 
signal to tell it to only use 256 of them for a bit and then tell it 
to scale back up. The upper limit being the number of processes 
mpirun was launched with.

Has anyone done anything like this? Maybe a better way to do it?
Basically my program is crunching through a large text file, 
examining each line for various things.

Thanks in advance for any insight,
Brian Andrus
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/03/28744.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28745.php




Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?

2016-03-17 Thread Lane, William
I ran OpenMPI using the "-mca btl ^vader" switch Ralph recommended and I'm 
still getting the same errors

qsub -q short.q -V -pe make 206 -b y mpirun -np 206 --prefix 
/hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^vader --mca 
plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out

[csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got update_proc_state for 
job [42154,1]
[csclprd3-6-12:30667] *** Process received signal ***
[csclprd3-6-12:30667] Signal: Bus error (7)
[csclprd3-6-12:30667] Signal code: Non-existant physical address (2)
[csclprd3-6-12:30667] Failing at address: 0x2b1b18a2d000
[csclprd3-6-12:30667] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b1b0e06c500]
[csclprd3-6-12:30667] [ 1] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b1b0f5fd524]
[csclprd3-6-12:30667] [ 2] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmca_common_sm.so.4(mca_common_sm_module_create_and_attach+0x56)[0x2b1b124daab6]
[csclprd3-6-12:30667] [ 3] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x39cb)[0x2b1b12d749cb]
[csclprd3-6-12:30667] [ 4] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x3f2a)[0x2b1b12d74f2a]
[csclprd3-6-12:30667] [ 5] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b1b0ddfdb07]
[csclprd3-6-12:30667] [ 6] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b1b126de7b2]
[csclprd3-6-12:30667] [ 7] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b1b0ddfd309]
[csclprd3-6-12:30667] [ 8] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b1b133a138c]
[csclprd3-6-12:30667] [ 9] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b1b0de0e780]
[csclprd3-6-12:30667] [10] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b1b0ddc017d]
[csclprd3-6-12:30667] [11] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b1b0dddf820]
[csclprd3-6-12:30667] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
[csclprd3-6-12:30667] [13] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b1b0e298cdd]
[csclprd3-6-12:30667] [14] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
[csclprd3-6-12:30667] *** End of error message ***

-Bill L.


From: users [users-boun...@open-mpi.org] on behalf of Lane, William 
[william.l...@cshs.org]
Sent: Thursday, March 17, 2016 4:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 
4096 still required?

I apologize Ralph, I forgot to include my command line for invoking OpenMPI on 
SoGE:

qsub -q short.q -V -pe make 87 -b y mpirun -np 87 --prefix 
/hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca 
plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out

a_1_10_1.out is my OpenMPI test code binary compiled under OpenMPI 1.10.1.

Thanks for the quick response!

-Bill L.


From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Thursday, March 17, 2016 4:44 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 
4096 still required?

No, that shouldn’t be the issue any more - and that isn’t what the backtrace 
indicates. It looks instead like there was a problem with the shared memory 
backing file on a remote node, and that caused the vader shared memory BTL to 
segfault.

Try turning vader off and see if that helps - I’m not sure what you are using, 
but maybe “-mca btl ^vader” will suffice

Nathan - any other suggestions?


On Mar 17, 2016, at 4:40 PM, Lane, William [william.l...@cshs.org] wrote:

I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
files limits be >= 4096 in order to function when large numbers of slots
were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
still present for OpenMPI versions 1.10.1 and greater?

I'm having some issues now with OpenMPI version 1.10.1 that remind me
of the issues I had w/1.3.3 where OpenMPI worked fine as long as I don't
request too many slots.

When I look at ulimit -a (soft limit) I see:
open files  (-n) 1024

ulimit -Ha (hard limit) gives:
open files  (-n) 4096

I'm getting errors of the form:
[csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for 
job [40732,1]
[csclprd3-6-12:30567] *** Process received signal ***
[csclprd3-6-12:30567] Signal: Bus error (7)
[csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
[csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
[csclprd3-6-12:30568] *** Process received signal ***
[csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
[csclprd3-6-12:30567] [ 1] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
[csclprd3-6-12:30567] [ 2] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.

Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?

2016-03-17 Thread Lane, William
I apologize Ralph, I forgot to include my command line for invoking OpenMPI on 
SoGE:

qsub -q short.q -V -pe make 87 -b y mpirun -np 87 --prefix 
/hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca 
plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out

a_1_10_1.out is my OpenMPI test code binary compiled under OpenMPI 1.10.1.

Thanks for the quick response!

-Bill L.


From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Thursday, March 17, 2016 4:44 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 
4096 still required?

No, that shouldn’t be the issue any more - and that isn’t what the backtrace 
indicates. It looks instead like there was a problem with the shared memory 
backing file on a remote node, and that caused the vader shared memory BTL to 
segfault.

Try turning vader off and see if that helps - I’m not sure what you are using, 
but maybe “-mca btl ^vader” will suffice

Nathan - any other suggestions?


On Mar 17, 2016, at 4:40 PM, Lane, William [william.l...@cshs.org] wrote:

I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
files limits be >= 4096 in order to function when large numbers of slots
were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
still present for OpenMPI versions 1.10.1 and greater?

I'm having some issues now with OpenMPI version 1.10.1 that remind me
of the issues I had w/1.3.3 where OpenMPI worked fine as long as I don't
request too many slots.

When I look at ulimit -a (soft limit) I see:
open files  (-n) 1024

ulimit -Ha (hard limit) gives:
open files  (-n) 4096

I'm getting errors of the form:
[csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for 
job [40732,1]
[csclprd3-6-12:30567] *** Process received signal ***
[csclprd3-6-12:30567] Signal: Bus error (7)
[csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
[csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
[csclprd3-6-12:30568] *** Process received signal ***
[csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
[csclprd3-6-12:30567] [ 1] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
[csclprd3-6-12:30567] [ 2] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
[csclprd3-6-12:30567] [ 3] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
[csclprd3-6-12:30567] [ 4] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
[csclprd3-6-12:30567] [ 5] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
[csclprd3-6-12:30567] [ 6] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
[csclprd3-6-12:30567] [ 7] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
[csclprd3-6-12:30567] [ 8] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
[csclprd3-6-12:30567] [ 9] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
[csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
[csclprd3-6-12:30567] [11] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
[csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
[csclprd3-6-12:30567] *** End of error message ***

Ugh.

Bill L.
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28746.php



Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?

2016-03-17 Thread Ralph Castain
No, that shouldn’t be the issue any more - and that isn’t what the backtrace 
indicates. It looks instead like there was a problem with the shared memory 
backing file on a remote node, and that caused the vader shared memory BTL to 
segfault.

Try turning vader off and see if that helps - I’m not sure what you are using, 
but maybe “-mca btl ^vader” will suffice

Nathan - any other suggestions?


> On Mar 17, 2016, at 4:40 PM, Lane, William  wrote:
> 
> I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
> files limits be >= 4096 in order to function when large numbers of slots
> were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
> still present for OpenMPI versions 1.10.1 and greater?
> 
> I'm having some issues now with OpenMPI version 1.10.1 that remind me
> of the issues I had w/1.3.3 where OpenMPI worked fine as long as I don't
> request too many slots.
> 
> When I look at ulimit -a (soft limit) I see:
> open files  (-n) 1024
> 
> ulimit -Ha (hard limit) gives:
> open files  (-n) 4096
> 
> I'm getting errors of the form:
> [csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for 
> job [40732,1]
> [csclprd3-6-12:30567] *** Process received signal ***
> [csclprd3-6-12:30567] Signal: Bus error (7)
> [csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
> [csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
> [csclprd3-6-12:30568] *** Process received signal ***
> [csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
> [csclprd3-6-12:30567] [ 1] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
> [csclprd3-6-12:30567] [ 2] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
> [csclprd3-6-12:30567] [ 3] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
> [csclprd3-6-12:30567] [ 4] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
> [csclprd3-6-12:30567] [ 5] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
> [csclprd3-6-12:30567] [ 6] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
> [csclprd3-6-12:30567] [ 7] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
> [csclprd3-6-12:30567] [ 8] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
> [csclprd3-6-12:30567] [ 9] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
> [csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
> [csclprd3-6-12:30567] [11] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
> [csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
> [csclprd3-6-12:30567] *** End of error message ***
> 
> Ugh.
> 
> Bill L.
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28746.php 
> 


[OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?

2016-03-17 Thread Lane, William
I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
files limits be >= 4096 in order to function when large numbers of slots
were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
still present for OpenMPI versions 1.10.1 and greater?

I'm having some issues now with OpenMPI version 1.10.1 that remind me
of the issues I had w/1.3.3 where OpenMPI worked fine as long as I don't
request too many slots.

When I look at ulimit -a (soft limit) I see:
open files  (-n) 1024

ulimit -Ha (hard limit) gives:
open files  (-n) 4096

I'm getting errors of the form:
[csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for 
job [40732,1]
[csclprd3-6-12:30567] *** Process received signal ***
[csclprd3-6-12:30567] Signal: Bus error (7)
[csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
[csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
[csclprd3-6-12:30568] *** Process received signal ***
[csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
[csclprd3-6-12:30567] [ 1] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
[csclprd3-6-12:30567] [ 2] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
[csclprd3-6-12:30567] [ 3] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
[csclprd3-6-12:30567] [ 4] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
[csclprd3-6-12:30567] [ 5] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
[csclprd3-6-12:30567] [ 6] 
/hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
[csclprd3-6-12:30567] [ 7] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
[csclprd3-6-12:30567] [ 8] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
[csclprd3-6-12:30567] [ 9] 
/hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
[csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
[csclprd3-6-12:30567] [11] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
[csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
[csclprd3-6-12:30567] *** End of error message ***

Ugh.

Bill L.
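
For reference, those limits are usually raised in /etc/security/limits.conf
(or whatever mechanism your launch path actually honors; see the caveats about
daemon-started processes in the locked-memory thread below). A sketch, assuming
4096 is the target:

* soft nofile 4096
* hard nofile 4096

The effective values can then be re-checked on each node with ulimit -n and
ulimit -Hn.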


Re: [OMPI users] Dynamically throttle/scale processes

2016-03-17 Thread Ralph Castain
Hmmm….I haven’t heard of that specific use-case, but I have seen some similar 
things. Did you want the processes to be paused, or killed, when you scale 
down? Obviously, I’m assuming they are not MPI procs, yes?

I can certainly see a way to make mpirun do it without too much fuss, though it 
would require a message as opposed to a signal so you can indicate how many 
procs to “idle/kill”.


> On Mar 17, 2016, at 3:22 PM, Andrus, Brian Contractor  
> wrote:
> 
> All,
>  
> I have an mpi-based program that has a master process that acts as a ‘traffic 
> cop’ in that it hands out work to child processes.
>  
> I want to be able to dynamically throttle how many child processes are in use 
> at any given time.
>  
> For instance, if I start it with “mpirun -n 512” I could send a signal to 
> tell it to only use 256 of them for a bit and then tell it to scale back up. 
> The upper limit being the number of processes mpirun was launched with.
>  
>  
> Has anyone done anything like this? Maybe a better way to do it?
> Basically my program is crunching through a large text file, examining each 
> line for various things.
>  
> Thanks in advance for any insight,
>  
> Brian Andrus
>  
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28744.php 
> 


[OMPI users] Dynamically throttle/scale processes

2016-03-17 Thread Andrus, Brian Contractor
All,

I have an mpi-based program that has a master process that acts as a 'traffic 
cop' in that it hands out work to child processes.

I want to be able to dynamically throttle how many child processes are in use 
at any given time.

For instance, if I start it with "mpirun -n 512" I could send a signal to tell 
it to only use 256 of them for a bit and then tell it to scale back up. The 
upper limit being the number of processes mpirun was launched with.


Has anyone done anything like this? Maybe a better way to do it?
Basically my program is crunching through a large text file, examining each 
line for various things.

Thanks in advance for any insight,

Brian Andrus



Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:15 PM, Cabral, Matias A
 wrote:
> I was looking for lines like" [nodexyz:17085] selected cm best priority 40" 
> and  " [nodexyz:17099] select: component psm selected"

This may have turned up more than I expected. I recompiled OpenMPI
v1.8.4 as a test and reran the tests, which seemed to run just fine.
Looking at the debug output, I can clearly see a difference in the psm
calls. I performed the same test using 1.10.2 and it works as well.

I've sent a message off to the user to have him rerun and see where we're at.

I suspect my system-level build of OpenMPI might be misconfigured
with respect to psm. I didn't see anything off in the configure
output, but I must have missed something. I'll report back.


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-17 Thread Xavier Besseron
On Thu, Mar 17, 2016 at 3:17 PM, Ralph Castain  wrote:
> Just to clarify: I am not aware of any MPI that will allow you to relocate a
> process while it is running. You have to checkpoint the job, terminate it,
> and then restart the entire thing with the desired process on the new node.
>


Dear all,

For your information, MVAPICH2 supports live migration of MPI
processes, without the need to terminate and restart the whole job.

All the details are in the MVAPICH2 user guide:
  - How to configure MVAPICH2 for migration

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2b-userguide.html#x1-120004.4
  - How to trigger process migration

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2b-userguide.html#x1-760006.14.3

You can also check the paper "High Performance Pipelined Process
Migration with RDMA"
http://mvapich.cse.ohio-state.edu/static/media/publications/abstract/ouyangx-2011-ccgrid.pdf


Best regards,

Xavier



>
> On Mar 16, 2016, at 3:15 AM, Husen R  wrote:
>
> In the case of MPI application (not gromacs), How do I relocate MPI
> application from one node to another node while it is running ?
> I'm sorry, as far as I know the ompi-restart command is used to restart
> application, based on checkpoint file, once the application already
> terminated (no longer running).
>
> Thanks
>
> regards,
>
> Husen
>
> On Wed, Mar 16, 2016 at 4:29 PM, Jeff Hammond 
> wrote:
>>
>> Just checkpoint-restart the app to relocate. The overhead will be lower
>> than trying to do with MPI.
>>
>> Jeff
>>
>>
>> On Wednesday, March 16, 2016, Husen R  wrote:
>>>
>>> Hi Jeff,
>>>
>>> Thanks for the reply.
>>>
>>> After consulting the Gromacs docs, as you suggested, Gromacs already
>>> supports checkpoint/restart. thanks for the suggestion.
>>>
>>> Previously, I asked about checkpoint/restart in Open MPI because I want
>>> to checkpoint MPI Application and restart/migrate it while it is running.
>>> For the example, I run MPI application in node A,B and C in a cluster and
>>> I want to migrate process running in node A to other node, let's say to node
>>> C.
>>> is there a way to do this with open MPI ? thanks.
>>>
>>> Regards,
>>>
>>> Husen
>>>
>>>
>>>
>>>
>>> On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond 
>>> wrote:

 Why do you need OpenMPI to do this? Molecular dynamics trajectories are
 trivial to checkpoint and restart at the application level. I'm sure 
 Gromacs
 already supports this. Please consult the Gromacs docs or user support for
 details.

 Jeff


 On Tuesday, March 15, 2016, Husen R  wrote:
>
> Dear Open MPI Users,
>
>
> Does the current stable release of Open MPI (v1.10 series) support
> fault tolerant feature ?
> I got the information from Open MPI FAQ that The checkpoint/restart
> support was last released as part of the v1.6 series.
> I just want to make sure about this.
>
> and by the way, does Open MPI able to checkpoint or restart mpi
> application/GROMACS automatically ?
> Please, I really need help.
>
> Regards,
>
>
> Husen



 --
 Jeff Hammond
 jeff.scie...@gmail.com
 http://jeffhammond.github.io/

 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2016/03/28705.php
>>>
>>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> http://jeffhammond.github.io/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/03/28709.php
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28710.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28731.php


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:52 PM, Jeff Squyres (jsquyres)
 wrote:
> Can you send all the information listed here?
>
> https://www.open-mpi.org/community/help/
>
> (including the full output from the run with the PML/BTL/MTL/etc. verbosity)
>
> This will allow Matias to look through all the relevant info, potentially 
> with fewer back-n-forth emails.

Understood, but unfortunately I cannot pull large dumps from the
system; it's isolated.


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Jeff Squyres (jsquyres)
Michael --

Can you send all the information listed here?

https://www.open-mpi.org/community/help/

(including the full output from the run with the PML/BTL/MTL/etc. verbosity)

This will allow Matias to look through all the relevant info, potentially with 
fewer back-n-forth emails.

Thanks!


> On Mar 17, 2016, at 12:47 PM, Michael Di Domenico  
> wrote:
> 
> On Thu, Mar 17, 2016 at 12:15 PM, Cabral, Matias A
>  wrote:
>> I was looking for lines like" [nodexyz:17085] selected cm best priority 40" 
>> and  " [nodexyz:17099] select: component psm selected"
> 
> I see cm best priority 20, which seems to relate to ob1 being
> selected. I don't see a mention of psm anywhere (I am NOT doing --mca
> mtl ^psm), but I did compile OpenMPI with psm support.
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28739.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Thu, Mar 17, 2016 at 12:15 PM, Cabral, Matias A
 wrote:
> I was looking for lines like" [nodexyz:17085] selected cm best priority 40" 
> and  " [nodexyz:17099] select: component psm selected"

I see cm best priority 20, which seems to relate to ob1 being
selected. I don't see a mention of psm anywhere (I am NOT doing --mca
mtl ^psm), but I did compile OpenMPI with psm support.


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Cabral, Matias A
I was looking for lines like" [nodexyz:17085] selected cm best priority 40" and 
 " [nodexyz:17099] select: component psm selected"

_MAC


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di Domenico
Sent: Thursday, March 17, 2016 5:52 AM
To: Open MPI Users 
Subject: Re: [OMPI users] locked memory and queue pairs

On Wed, Mar 16, 2016 at 4:49 PM, Cabral, Matias A  
wrote:
> I didn't go into the code to see who is actually calling this error message, 
> but I suspect this may be a generic error for "out of memory" kind of thing 
> and not specific to the que pair. To confirm please add  -mca 
> pml_base_verbose 100 and add  -mca mtl_base_verbose 100  to see what is being 
> selected.

This didn't spit out anything overly useful, just lots of lines:

[node001:00909] mca: base: components_register: registering pml components
[node001:00909] mca: base: components_register: found loaded component v
[node001:00909] mca: base: components_register: component v register function successful
[node001:00909] mca: base: components_register: found loaded component bfo
[node001:00909] mca: base: components_register: component bfo register function successful
[node001:00909] mca: base: components_register: found loaded component cm
[node001:00909] mca: base: components_register: component cm register function successful
[node001:00909] mca: base: components_register: found loaded component ob1
[node001:00909] mca: base: components_register: component ob1 register function successful

> I'm trying to remember some details of IMB  and alltoallv to see if it is 
> indeed requiring more resources that the other micro benchmarks.

I'm using IMB for my tests, but this issue came up because a researcher isn't 
able to run large alltoall codes, so I don't believe it's specific to IMB.

> BTW, did you confirm the limits setup? Also do the nodes have all the same 
> amount of mem?

Yes, all nodes have the limits set to unlimited and each node has 256GB of 
memory.
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28726.php


Re: [OMPI users] How to link the statically compiled OpenMPI library ?

2016-03-17 Thread Nathan Hjelm

Instead of --static try using -Wl,-Bstatic. I do not think you can
safely mix --static with -Wl,-Bdynamic.
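
For example, something along these lines, adapted from the link line quoted
below (a sketch only; the exact library list and which libraries can actually
be resolved statically depend on the local installation):

gcc mm.c -o mm.out -I/usr/local/include/openmpi \
    -L/usr/local/lib -L/usr/local/lib/openmpi \
    -Wl,-Bstatic -lmpi -lopen-rte -lopen-pal \
    -Wl,-Bdynamic -libverbs -lrt -lnsl -lutil -lm -ldl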

-Nathan
HPC-ENV, LANL

On Thu, Mar 17, 2016 at 03:54:33PM +0100, evelina dumitrescu wrote:
>hello,
> 
>I unsuccessfully tried to link the statically compiled OpenMPI library.
>I used for compilation:
> 
>./configure --enable-static -disable-shared
>make -j 4
>make install
> 
>When I try to link the library to my executable, I get the following
>error:
> 
>gcc mm.c --static -I/usr/local/include/openmpi mm.c -o mm.out
>-L/usr/local/lib -L/usr/local/lib/openmpi -lmpi -lopen-rte -lopen-pal
>-Wl,--whole-archive -libverbs  -Wl,--no-whole-archive -lrt
>-Wl,--export-dynamic -Wl,-Bdynamic -ldl -lc -lnsl -lutil -lm -ldl -fPIE
>-pie
> 
>/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: relocation
>R_X86_64_32 against `__TMC_END__' can not be used when making a shared
>object; recompile with -fPIC
>/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: error adding symbols: Bad
>value
>collect2: error: ld returned 1 exit status
> 
>I use openmpi-1.10.2 and Ubuntu 14.04.
>    What am I doing wrong?
> 
>Evelina

> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28735.php





Re: [OMPI users] How to link the statically compiled OpenMPI library ?

2016-03-17 Thread Jeff Squyres (jsquyres)
On Mar 17, 2016, at 10:54 AM, evelina dumitrescu 
 wrote:
> 
> hello,
> 
> I unsuccessfully tried to link the statically compiled OpenMPI library.
> I used for compilation:
> 
> ./configure --enable-static -disable-shared
> make -j 4
> make install
> 
> When I try to link the library to my executable, I get the following error:
> 
> gcc mm.c --static -I/usr/local/include/openmpi mm.c -o mm.out 
> -L/usr/local/lib -L/usr/local/lib/openmpi -lmpi -lopen-rte -lopen-pal 
> -Wl,--whole-archive -libverbs  -Wl,--no-whole-archive -lrt 
> -Wl,--export-dynamic -Wl,-Bdynamic -ldl -lc -lnsl -lutil -lm -ldl -fPIE -pie 

Looks like you found the FAQ item about compiling statically -- good!

I think the dl library is only available for dynamic builds, not static builds 
(i.e., this is not an Open MPI thing; it's an OS/library thing).

Have you tried removing the -ldl's from your link line?

Also, is there a reason you list mm.c twice on your compile line?  That seems 
incorrect.

> /usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: relocation 
> R_X86_64_32 against `__TMC_END__' can not be used when making a shared 
> object; recompile with -fPIC
> /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: error adding symbols: Bad value
> collect2: error: ld returned 1 exit status
> 
> I use openmpi-1.10.2 and Ubuntu 14.04.
> 
> What am I doing wrong?
> 
> Evelina
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28735.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] How to link the statically compiled OpenMPI library ?

2016-03-17 Thread evelina dumitrescu
hello,

I unsuccessfully tried to link the statically compiled OpenMPI library.
I used for compilation:

./configure --enable-static -disable-shared
make -j 4
make install

When I try to link the library to my executable, I get the following error:

gcc mm.c --static -I/usr/local/include/openmpi mm.c -o mm.out
-L/usr/local/lib -L/usr/local/lib/openmpi -lmpi -lopen-rte -lopen-pal
-Wl,--whole-archive -libverbs  -Wl,--no-whole-archive -lrt
-Wl,--export-dynamic -Wl,-Bdynamic -ldl -lc -lnsl -lutil -lm -ldl -fPIE
-pie

/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: relocation
R_X86_64_32 against `__TMC_END__' can not be used when making a shared
object; recompile with -fPIC
/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbeginT.o: error adding symbols: Bad
value
collect2: error: ld returned 1 exit status

I use openmpi-1.10.2 and Ubuntu 14.04.

What am I doing wrong?

Evelina


Re: [OMPI users] Issue about cm PML

2016-03-17 Thread dpchoudh .
Thank you everybody. With your help I was able to resolve the issue. For
the sake of completeness, this is what I had to do:

infinipath-psm was already installed in my system when OpenMPI was built
from source. However, infinipath-psm-devel was NOT installed. I suppose
that's why openMPI could not add support for PSM when built from source,
and, following Jeff's advice, I ran

ompi_info | grep psm

which showed no output.

I had to install infinipath-psm-devel and rebuild OpenMPI. That fixed it.
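
A hedged sketch of that sequence on CentOS 7 (the package name comes from the
message above; the configure invocation is a placeholder, and rebuilding from
the source RPM would look different):

yum install infinipath-psm-devel
cd openmpi-1.10.2 && ./configure && make && make install   # rebuild Open MPI
ompi_info | grep psm                                       # psm should now be listed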

Durga

Life is complex. It has real and imaginary parts.

On Thu, Mar 17, 2016 at 9:17 AM, Jeff Squyres (jsquyres)  wrote:

> Additionally, if you run
>
>   ompi_info | grep psm
>
> Do you see the PSM MTL listed?
>
> To force the CM MTL, you can run:
>
>   mpirun --mca pml cm ...
>
> That won't let any BTLs be selected (because only ob1 uses the BTLs).
>
>
> > On Mar 17, 2016, at 8:07 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> > can you try to add
> > --mca mtl psm
> > to your mpirun command line ?
> >
> > you might also have to blacklist the openib btl
> >
> > Cheers,
> >
> > Gilles
> >
> > On Thursday, March 17, 2016, dpchoudh .  wrote:
> > Hello all
> > I have a simple test setup, consisting of two Dell workstation nodes
> with similar hardware profile.
> >
> > Both the nodes have (identical)
> > 1. Qlogic 4x DDR infiniband
> > 2. Chelsio C310 iWARP ethernet.
> >
> > Both of these cards are connected back to back, without a switch.
> >
> > With this setup, I can run OpenMPI over TCP and openib BTL. However, if
> I try to use the PSM MTL (excluding the Chelsio NIC, of course, since it
> does not support PSM), I get an error from one of the nodes (details
> below), which makes me think that a required library or package is not
> installed, but I can't figure out what it might be.
> >
> > Note that the test program is a simple 'hello world' program.
> >
> > The following work:
> >   mpirun -np 2 --hostfile ~/hostfile -mca btl tcp,self ./mpitest
> > mpirun -np 2 --hostfile ~/hostfile -mca btl self,openib -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
> >
> > (I had to exclude the Chelsio card because of this issue:
> > https://www.open-mpi.org/community/lists/users/2016/03/28661.php  )
> >
> > Here is what does NOT work:
> > mpirun -np 2 --hostfile ~/hostfile -mca mtl psm -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
> >
> > The error (from both nodes) is:
> >  mca: base: components_open: component pml / cm open function failed
> >
> > However, I still see the "Hello, world" output indicating that the
> program ran to completion.
> >
> > Here is also another command that does NOT work:
> >
> > mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
> >
> > The error is: (from the root node)
> > PML cm cannot be selected
> >
> > However, this time, I see no output from the program, indicating it did
> not run.
> >
> > The following command also fails in a similar way:
> >  mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca mtl psm -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
> >
> > I have verified that infinipath-psm is installed on both nodes. Both
> nodes run identical CentOS 7 and the libraries were installed from the
> CentOS repositories (i.e. were not compiled from source)
> >
> > Both nodes run OMPI 1.10.2, compiled from the source RPM.
> >
> > What am I doing wrong?
> >
> > Thanks
> > Durga
> >
> >
> >
> >
> > Life is complex. It has real and imaginary parts.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28725.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28727.php
>


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-17 Thread Bland, Wesley
Presumably Adaptive MPI would allow you to do that. I don’t know all the 
details of how that works there though.

From: users  on behalf of Ralph Castain 

Reply-To: Open MPI Users 
Date: Thursday, March 17, 2016 at 9:17 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Fault tolerant feature in Open MPI

Just to clarify: I am not aware of any MPI that will allow you to relocate a 
process while it is running. You have to checkpoint the job, terminate it, and 
then restart the entire thing with the desired process on the new node.


On Mar 16, 2016, at 3:15 AM, Husen R [hus...@gmail.com] wrote:

In the case of MPI application (not gromacs), How do I relocate MPI application 
from one node to another node while it is running ?
I'm sorry, as far as I know the ompi-restart command is used to restart 
application, based on checkpoint file, once the application already terminated 
(no longer running).
Thanks
regards,
Husen

On Wed, Mar 16, 2016 at 4:29 PM, Jeff Hammond [jeff.scie...@gmail.com] wrote:
Just checkpoint-restart the app to relocate. The overhead will be lower than 
trying to do with MPI.

Jeff


On Wednesday, March 16, 2016, Husen R [hus...@gmail.com] wrote:
Hi Jeff,
Thanks for the reply.
After consulting the Gromacs docs, as you suggested, Gromacs already supports 
checkpoint/restart. thanks for the suggestion.

Previously, I asked about checkpoint/restart in Open MPI because I want to 
checkpoint MPI Application and restart/migrate it while it is running.
For the example, I run MPI application in node A,B and C in a cluster and I 
want to migrate process running in node A to other node, let's say to node C.
is there a way to do this with open MPI ? thanks.
Regards,
Husen



On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond  wrote:
Why do you need OpenMPI to do this? Molecular dynamics trajectories are trivial 
to checkpoint and restart at the application level. I'm sure Gromacs already 
supports this. Please consult the Gromacs docs or user support for details.

Jeff


On Tuesday, March 15, 2016, Husen R  wrote:
Dear Open MPI Users,

Does the current stable release of Open MPI (v1.10 series) support fault 
tolerant feature ?
I got the information from Open MPI FAQ that The checkpoint/restart support was 
last released as part of the v1.6 series.
I just want to make sure about this.
and by the way, does Open MPI able to checkpoint or restart mpi 
application/GROMACS automatically ?
Please, I really need help.
Regards,

Husen


--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28705.php



--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28709.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/03/28710.php



Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Gilles Gouaillardet
Also, limits.conf is applied when an ssh session starts.
It is not used for services started at boot time, so
ulimit -l unlimited
should be added in the startup script
/etc/init.d/xxx
or
/etc/sysconfig/xxx
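
For example (file and daemon names are placeholders):

# near the top of /etc/init.d/xxx, or in the /etc/sysconfig/xxx file it sources
ulimit -l unlimited    # applies to the daemon and everything it forks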

Cheers,

Gilles

On Thursday, March 17, 2016, Dave Love  wrote:

> Michael Di Domenico > writes:
>
> > On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom  > wrote:
> >> Hi Mike,
> >>
> >> In this file,
> >> $ cat /etc/security/limits.conf
> >> ...
> >> < do you see at the end ... >
> >>
> >> * hard memlock unlimited
> >> * soft memlock unlimited
> >> # -- All InfiniBand Settings End here --
> >> ?
> >
> > Yes.  I double checked that it's set on all compute nodes in the
> > actual file and through the ulimit command
>
> Is limits.conf actually relevant to your job launch?  It's normally used
> by pam_limits (on GNU/Linux) which won't necessarily be run.  [In the
> case of SGE, you specify the resource limit as a parameter of the
> execution daemon (execd), at least with "builtin" remote startup.]
>
> I'd verify it by executing something like "procenv -l" under mpirun.
> (procenv is packaged for the major GNU/Linux distributions.)
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28728.php
>


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-17 Thread Ralph Castain
Just to clarify: I am not aware of any MPI that will allow you to relocate a 
process while it is running. You have to checkpoint the job, terminate it, and 
then restart the entire thing with the desired process on the new node.


> On Mar 16, 2016, at 3:15 AM, Husen R  wrote:
> 
> In the case of MPI application (not gromacs), How do I relocate MPI 
> application from one node to another node while it is running ?
> I'm sorry, as far as I know the ompi-restart command is used to restart 
> application, based on checkpoint file, once the application already 
> terminated (no longer running).
> 
> Thanks
> 
> regards,
> 
> Husen
> 
> On Wed, Mar 16, 2016 at 4:29 PM, Jeff Hammond  > wrote:
> Just checkpoint-restart the app to relocate. The overhead will be lower than 
> trying to do with MPI. 
> 
> Jeff
> 
> 
> On Wednesday, March 16, 2016, Husen R  > wrote:
> Hi Jeff,
> 
> Thanks for the reply.
> 
> After consulting the Gromacs docs, as you suggested, Gromacs already supports 
> checkpoint/restart. thanks for the suggestion.
> 
> Previously, I asked about checkpoint/restart in Open MPI because I want to 
> checkpoint MPI Application and restart/migrate it while it is running.
> For the example, I run MPI application in node A,B and C in a cluster and I 
> want to migrate process running in node A to other node, let's say to node C.
> is there a way to do this with open MPI ? thanks.
> 
> Regards,
> 
> Husen
> 
> 
> 
> 
> On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond > 
> wrote:
> Why do you need OpenMPI to do this? Molecular dynamics trajectories are 
> trivial to checkpoint and restart at the application level. I'm sure Gromacs 
> already supports this. Please consult the Gromacs docs or user support for 
> details. 
> 
> Jeff
> 
> 
> On Tuesday, March 15, 2016, Husen R > wrote:
> Dear Open MPI Users,
> 
> 
> Does the current stable release of Open MPI (v1.10 series) support fault 
> tolerant feature ?
> I got the information from Open MPI FAQ that The checkpoint/restart support 
> was last released as part of the v1.6 series. 
> I just want to make sure about this.
> 
> and by the way, does Open MPI able to checkpoint or restart mpi 
> application/GROMACS automatically ? 
> Please, I really need help.
> 
> Regards,
> 
> 
> Husen 
> 
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com <>
> http://jeffhammond.github.io/ 
> 
> ___
> users mailing list
> us...@open-mpi.org <>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28705.php 
> 
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com 
> http://jeffhammond.github.io/ 
> 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28709.php 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28710.php



Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-17 Thread Dave Love
Husen R  writes:

> Dear Open MPI Users,
>
>
> Does the current stable release of Open MPI (v1.10 series) support fault
> tolerant feature ?
> I got the information from Open MPI FAQ that The checkpoint/restart support
> was last released as part of the v1.6 series.
> I just want to make sure about this.

Orthogonal to Jeff's comments:  dmtcp  is
advertised as able to checkpoint OMPI, at least over TCP and IB (for
some value of "IB").

Does anyone here have experience with that?


Re: [OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-17 Thread Dave Love
Ralph Castain  writes:

> That’s an SGE error message - looks like your tmp file system on one
> of the remote nodes is full.

Yes; surely that just needs to be fixed, and I'd expect the host not to
accept jobs in that state.  It's not just going to break ompi.

> We don’t control where SGE puts its
> files, but it might be that your backend nodes are having issues with
> us doing a tree-based launch (i.e., where each backend daemon launches
> more daemons along the tree).

I doubt that's relevant.  You just need space for the SGE tmpdir, which
is where the ompi session directory will go, for instance.  Also, too
many things don't recognize TMPDIR and will fail if they can't write to
/tmp specifically, even if there's reason to avoid /tmp for tmpdir.


Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Dave Love
Michael Di Domenico  writes:

> On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom  wrote:
>> Hi Mike,
>>
>> In this file,
>> $ cat /etc/security/limits.conf
>> ...
>> < do you see at the end ... >
>>
>> * hard memlock unlimited
>> * soft memlock unlimited
>> # -- All InfiniBand Settings End here --
>> ?
>
> Yes.  I double checked that it's set on all compute nodes in the
> actual file and through the ulimit command

Is limits.conf actually relevant to your job launch?  It's normally used
by pam_limits (on GNU/Linux) which won't necessarily be run.  [In the
case of SGE, you specify the resource limit as a parameter of the
execution daemon (execd), at least with "builtin" remote startup.]

I'd verify it by executing something like "procenv -l" under mpirun.
(procenv is packaged for the major GNU/Linux distributions.)
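
For example (host file and slot count are placeholders):

mpirun -np 2 --hostfile ~/hostfile procenv -l

which reports the resource limits the launched processes actually inherit,
rather than whatever limits.conf says.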


Re: [OMPI users] Issue about cm PML

2016-03-17 Thread Jeff Squyres (jsquyres)
Additionally, if you run

  ompi_info | grep psm

Do you see the PSM MTL listed?

To force the CM MTL, you can run:

  mpirun --mca pml cm ...

That won't let any BTLs be selected (because only ob1 uses the BTLs).
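
Combined with the commands quoted below, a hedged example for this setup would
be something like:

mpirun -np 2 --hostfile ~/hostfile --mca pml cm --mca mtl psm ./mpitest

(host file, slot count, and program name are taken from the quoted commands;
this only works once the PSM MTL is actually built in, as resolved elsewhere in
this thread.)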


> On Mar 17, 2016, at 8:07 AM, Gilles Gouaillardet 
>  wrote:
> 
> can you try to add
> --mca mtl psm
> to your mpirun command line ?
> 
> you might also have to blacklist the openib btl
> 
> Cheers,
> 
> Gilles
> 
> On Thursday, March 17, 2016, dpchoudh .  wrote:
> Hello all
> I have a simple test setup, consisting of two Dell workstation nodes with 
> similar hardware profile.
> 
> Both the nodes have (identical)
> 1. Qlogic 4x DDR infiniband
> 2. Chelsio C310 iWARP ethernet.
> 
> Both of these cards are connected back to back, without a switch.
> 
> With this setup, I can run OpenMPI over TCP and openib BTL. However, if I try 
> to use the PSM MTL (excluding the Chelsio NIC, of course, since it does not 
> support PSM), I get an error from one of the nodes (details below), which 
> makes me think that a required library or package is not installed, but I 
> can't figure out what it might be.
> 
> Note that the test program is a simple 'hello world' program.
> 
> The following work:
>   mpirun -np 2 --hostfile ~/hostfile -mca btl tcp,self ./mpitest
> mpirun -np 2 --hostfile ~/hostfile -mca btl self,openib -mca 
> btl_openib_if_exclude cxgb3_0 ./mpitest
> 
> (I had to exclude the Chelsio card because of this issue:
> https://www.open-mpi.org/community/lists/users/2016/03/28661.php  )
> 
> Here is what does NOT work:
> mpirun -np 2 --hostfile ~/hostfile -mca mtl psm -mca btl_openib_if_exclude 
> cxgb3_0 ./mpitest
> 
> The error (from both nodes) is: 
>  mca: base: components_open: component pml / cm open function failed
> 
> However, I still see the "Hello, world" output indicating that the program 
> ran to completion.
> 
> Here is also another command that does NOT work:
> 
> mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca btl_openib_if_exclude 
> cxgb3_0 ./mpitest
> 
> The error is: (from the root node)
> PML cm cannot be selected
> 
> However, this time, I see no output from the program, indicating it did not 
> run.
> 
> The following command also fails in a similar way:
>  mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca mtl psm -mca 
> btl_openib_if_exclude cxgb3_0 ./mpitest
> 
> I have verified that infinipath-psm is installed on both nodes. Both nodes 
> run identical CentOS 7 and the libraries were installed from the CentOS 
> repositories (i.e. were not compiled from source)
> 
> Both nodes run OMPI 1.10.2, compiled from the source RPM.
> 
> What am I doing wrong?
> 
> Thanks
> Durga
> 
> 
> 
> 
> Life is complex. It has real and imaginary parts.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] locked memory and queue pairs

2016-03-17 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 4:49 PM, Cabral, Matias A
 wrote:
> I didn't go into the code to see who is actually calling this error message,
> but I suspect this may be a generic error for an "out of memory" kind of thing
> and not specific to the queue pair. To confirm, please add -mca
> pml_base_verbose 100 and -mca mtl_base_verbose 100 to see what is being
> selected.

This didn't spit out anything overly useful, just lots of lines like:

[node001:00909] mca: base: components_register: registering pml components
[node001:00909] mca: base: components_register: found loaded component v
[node001:00909] mca: base: components_register: component v register
function successful
[node001:00909] mca: base: components_register: found loaded component bfo
[node001:00909] mca: base: components_register: component bfo register
function successful
[node001:00909] mca: base: components_register: found loaded component cm
[node001:00909] mca: base: components_register: component cm register
function successful
[node001:00909] mca: base: components_register: found loaded component ob1
[node001:00909] mca: base: components_register: component ob1 register
function successful

> I'm trying to remember some details of IMB and alltoallv to see if it is
> indeed requiring more resources than the other micro-benchmarks.

I'm using IMB for my tests, but this issue came up because a
researcher isn't able to run large alltoall codes, so I don't believe
it's specific to IMB.

> BTW, did you confirm the limits setup? Also do the nodes have all the same 
> amount of mem?

Yes, all nodes have the limits set to unlimited, and each node has
256 GB of memory.
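
For completeness, the full selection trace, including the BTL side, should come
from something like the line below (the benchmark executable name follows the
usual IMB convention and may differ in your build):

  mpirun -np 2 --mca pml_base_verbose 100 --mca mtl_base_verbose 100 \
      --mca btl_base_verbose 100 ./IMB-MPI1 Alltoallv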


Re: [OMPI users] Issue about cm PML

2016-03-17 Thread Gilles Gouaillardet
can you try to add
--mca mtl psm
to your mpirun command line ?

you might also have to blacklist the openib btl

Cheers,

Gilles

On Thursday, March 17, 2016, dpchoudh .  wrote:

> Hello all
> I have a simple test setup, consisting of two Dell workstation nodes with
> similar hardware profile.
>
> Both the nodes have (identical)
> 1. Qlogic 4x DDR infiniband
> 2. Chelsio C310 iWARP ethernet.
>
> Both of these cards are connected back to back, without a switch.
>
> With this setup, I can run OpenMPI over TCP and openib BTL. However, if I
> try to use the PSM MTL (excluding the Chelsio NIC, of course, since it does
> not support PSM), I get an error from one of the nodes (details below),
> which makes me think that a required library or package is not installed,
> but I can't figure out what it might be.
>
> Note that the test program is a simple 'hello world' program.
>
> The following work:
>   mpirun -np 2 --hostfile ~/hostfile -mca btl tcp,self ./mpitest
> mpirun -np 2 --hostfile ~/hostfile -mca btl self,openib -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
>
> (I had to exclude the Chelsio card because of this issue:
> https://www.open-mpi.org/community/lists/users/2016/03/28661.php  )
>
> Here is what does NOT work:
> mpirun -np 2 --hostfile ~/hostfile -mca mtl psm -mca btl_openib_if_exclude
> cxgb3_0 ./mpitest
>
> The error (from both nodes) is:
>  mca: base: components_open: component pml / cm open function failed
>
> However, I still see the "Hello, world" output indicating that the program
> ran to completion.
>
> Here is also another command that does NOT work:
>
> mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca btl_openib_if_exclude
> cxgb3_0 ./mpitest
>
> The error is: (from the root node)
> PML cm cannot be selected
>
> However, this time, I see no output from the program, indicating it did
> not run.
>
> The following command also fails in a similar way:
>  mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca mtl psm -mca
> btl_openib_if_exclude cxgb3_0 ./mpitest
>
> I have verified that infinipath-psm is installed on both nodes. Both nodes
> run identical CentOS 7 and the libraries were installed from the CentOS
> repositories (i.e. were not compiled from source)
>
> Both nodes run OMPI 1.10.2, compiled from the source RPM.
>
> What am I doing wrong?
>
> Thanks
> Durga
>
>
>
>
> Life is complex. It has real and imaginary parts.
>


Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-17 Thread Ralph Castain
Just some thoughts offhand:

* what version of OMPI are you using?

* are you saying that after the warm reboot, all 48 procs are running on a 
subset of cores?

* it sounds like some of the cores have been marked as “offline” for some 
reason. Make sure you have hwloc installed on the machine, and run “lstopo” and 
see if that is the case

Ralph
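
(A quick way to check for offline cores after the warm boot, as a sketch using
standard tools; lscpu comes with util-linux and lstopo with hwloc, so the exact
output format may differ:

  cat /sys/devices/system/cpu/online
  lscpu | grep -iE 'cpu\(s\)|on-line|off-line'
  lstopo --of console | grep -ci core

If those counts differ between a cold boot and a warm boot, the firmware or
kernel is bringing up fewer CPUs on the warm boot, and Open MPI is presumably
just binding to what it sees.)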

> On Mar 17, 2016, at 2:00 AM, Rainer Koenig  
> wrote:
> 
> Hi,
> 
> I'm experiencing a strange problem running LIGGGHTS on a 48-core
> workstation running Ubuntu 14.04.4 LTS.
> 
> If I cold boot the workstation and start one of the examples from LIGGGHTS,
> then everything looks fine:
> 
> $ mpirun -np 48 liggghts < in.chute_wear
> 
> launches the example on all 48 cores, htop in a second window shows that
> all cores are occupied and run at nearly 100% workload.
> 
> So far so good. Now I just reboot the workstation and do the exact same
> steps as above.
> 
> This time the job just runs on a few cores (16 to 20) and the cores
> don't even run at 100% load.
> 
> So now I'm trying to find out what is wrong. Bad luck is that I can't
> just ask the vendor of the workstation since I'm working for that vendor
> and trying to solve this issue. :-)
> 
> I guess that something that OpenMPI needs is initialized differently
> depending on whether I do a cold boot or a warm boot. But how can I find
> out what is wrong?
> 
> Already tried to look for differences in the Ubuntu boot logs, but there
> is nothing different.
> 
> ompi_info --all (or even the parsable format) doesn't show any difference
> between a cold boot and a warm boot.
> 
> Any ideas what could be wrong after the reboot to cause such behaviour?
> 
> Thanks,
> Rainer
> -- 
> Dipl.-Inf. (FH) Rainer Koenig
> Project Manager Linux Clients
> Dept. PDG WPS R&D SW OSE
> 
> Fujitsu Technology Solutions
> Bürgermeister-Ullrich-Str. 100
> 86199 Augsburg
> Germany
> 
> Telephone: +49-821-804-3321
> Telefax:   +49-821-804-2131
> Mail:  mailto:rainer.koe...@ts.fujitsu.com
> 
> Internet ts.fujtsu.com
> Company Details  ts.fujitsu.com/imprint.html



Re: [OMPI users] Open SHMEM Error

2016-03-17 Thread RYAN RAY
Dear Gilles

Thanks for the reply.

Regards

Ryan

On Wed, 16 Mar 2016 11:39:49 +0530 Gilles Gouaillardet wrote:

> Ray,
>
> from the shmem_ptr man page:
>
> RETURN VALUES
>    shmem_ptr returns a pointer to the data object on the specified remote
>    PE. If the target is not remotely accessible, a NULL pointer is returned.
>
> since you are running your application on two hosts and one task per host,
> the target is not remotely accessible, and hence the NULL pointer.
>
> if you run two tasks on the same node, then the test should be fine.
>
> note openshmem does not provide a virtual shared memory system.
> if you want to run across nodes, then you need to shmem_{get,put}
>
> Cheers,
>
> Gilles
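
To illustrate the shmem_{get,put} point above, here is a minimal sketch of a
cross-node-safe variant. It uses only standard OpenSHMEM calls, it is not the
attached test code, and the variable names are made up for the example:

/* Minimal sketch (not the attached test code): read an int that lives on
 * PE 0 from PE 1 with shmem_int_get(), which also works when the two PEs
 * are on different nodes, unlike dereferencing a shmem_ptr() result.
 * Compile with oshcc, run with e.g. "oshrun -np 2 ./a.out".            */
#include <stdio.h>
#include <shmem.h>

int src = 0;                     /* global variable => symmetric data object */

int main(void)
{
    int dst = -1;

    shmem_init();                /* rather than the deprecated start_pes(0) */
    int me = shmem_my_pe();

    if (me == 0)
        src = 42;
    shmem_barrier_all();         /* make sure PE 0 has written src */

    if (me == 1) {
        /* shmem_ptr(&src, 0) would return NULL here if PE 0 is on another
         * node; shmem_int_get() works either way.                        */
        shmem_int_get(&dst, &src, 1, 0);
        printf("PE 1 read %d from PE 0\n", dst);
    }

    shmem_finalize();
    return 0;
}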



> On 3/16/2016 2:59 PM, RYAN RAY wrote:
>
> Dear Gilles
>
> I have attached the source code and the hostfile.
>
> Regards
>
> Ryan
>
> From: Gilles Gouaillardet
> Sent: Tue, 15 Mar 2016 15:44:48
> To: Open MPI Users
> Subject: Re: [OMPI users] Open SHMEM Error
>
> Ryan,
>
> can you please post your source code and hostfile ?
>
> Cheers,
>
> Gilles
>
> On Tuesday, March 15, 2016, RYAN RAY wrote:
>
> Dear Gilles,
>
> Thanks for the reply. After executing the code as you told I get the output
> as shown in the attached snapshot. So I am understanding that the code
> cannot remotely access the array at PE1 from PE0. Can you please explain
> why this is happening?
>
> Regards,
> Ryan
>
> From: Gilles Gouaillardet
> Sent: Fri, 04 Mar 2016 11:16:38
> To: Open MPI Users
> Subject: Re: [OMPI users] Open SHMEM Error
>
> Ryan,
>
> do you really get a segmentation fault ?
>
> here is the message i have :
>
> ---
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> ---
> --------------------------------------------------------------------------
> oshrun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[23403,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
>
> the root cause is that the test program ends with
>   return 1;
> instead of
>   return 0;
>
> /* and i cannot figure out a rationale for that, i just replaced this with
> return 0; and that was fine */
>
> fwiw, this example uses the deprecated start_pes(0) instead of shmem_init(),
> and there is no shmem_finalize();
>
> Cheers,
>
> Gilles
>
> On 3/3/2016 4:15 PM, RYAN RAY wrote:
>
> From: "RYAN RAY" ryan@rediffmail.com
> Sent: Thu, 03 Mar 2016 12:26:19 +0530
> To: "announce" annou...@open-mpi.org, "ryan.ray" ryan@rediffmail.com
> Subject: Open SHMEM Error
>
> On trying a code specified in the manual "OpenSHMEM Specification Draft",
> section 8.16 example code, we are facing a problem.
>
> The code is the C version of the example code for the call SHMEM_PTR.
>
> We have written the code exactly as it is in the manual, but we are getting
> a segmentation fault.
>
> The code, manual and error snapshots are attached in this mail.
>
> I will be grateful if you can provide any solution to this problem.
>
> RYAN SAPTARSHI RAY













[OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-17 Thread Rainer Koenig
Hi,

I'm experiencing a strange problem running LIGGGHTS on a 48-core workstation
running Ubuntu 14.04.4 LTS.

If I cold boot the workstation and start one of the examples from LIGGGHTS,
then everything looks fine:

$ mpirun -np 48 liggghts < in.chute_wear

launches the example on all 48 cores, htop in a second window shows that
all cores are occupied and run at nearly 100% workload.

So far so good. Now I just reboot the workstation and do the exact same
steps as above.

This time the job just runs on a few cores (16 to 20) and the cores
don't even run at 100% load.

So now I'm trying to find out what is wrong. Bad luck is that I can't
just ask the vendor of the workstation since I'm working for that vendor
and trying to solve this issue. :-)

I guess that something that OpenMPI needs is initialized differently depending
on whether I do a cold boot or a warm boot. But how can I find out what is
wrong?

Already tried to look for differences in the Ubuntu boot logs, but there
is nothing different.

ompi_info --all (or even the parsable format) doesn't show any difference
between a cold boot and a warm boot.

Any ideas what could be wrong after the reboot to cause such behaviour?

Thanks,
Rainer
-- 
Dipl.-Inf. (FH) Rainer Koenig
Project Manager Linux Clients
Dept. PDG WPS R&D SW OSE

Fujitsu Technology Solutions
Bürgermeister-Ullrich-Str. 100
86199 Augsburg
Germany

Telephone: +49-821-804-3321
Telefax:   +49-821-804-2131
Mail:  mailto:rainer.koe...@ts.fujitsu.com

Internet ts.fujtsu.com
Company Details  ts.fujitsu.com/imprint.html