[OMPI users] Issue about cm PML

2016-03-16 Thread dpchoudh .
Hello all
I have a simple test setup consisting of two Dell workstation nodes with
similar hardware profiles.

Both nodes have the following (identical) hardware:
1. QLogic 4x DDR InfiniBand
2. Chelsio C310 iWARP Ethernet

Both of these cards are connected back to back, without a switch.

With this setup, I can run Open MPI over the TCP and openib BTLs. However, if I
try to use the PSM MTL (excluding the Chelsio NIC, of course, since it does
not support PSM), I get an error from one of the nodes (details below),
which makes me think that a required library or package is not installed,
but I can't figure out what it might be.

Note that the test program is a simple 'hello world' program.
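
For reference, here is a minimal sketch of the kind of test program involved
(illustrative only; the actual mpitest source is not included in this post):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                   /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* my rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total ranks */
    printf("Hello, world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}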

The following work:
  mpirun -np 2 --hostfile ~/hostfile -mca btl tcp,self ./mpitest
mpirun -np 2 --hostfile ~/hostfile -mca btl self,openib -mca btl_openib_if_exclude cxgb3_0 ./mpitest

(I had to exclude the Chelsio card because of this issue:
https://www.open-mpi.org/community/lists/users/2016/03/28661.php  )

Here is what does NOT work:
mpirun -np 2 --hostfile ~/hostfile -mca mtl psm -mca btl_openib_if_exclude cxgb3_0 ./mpitest

The error (from both nodes) is:
 mca: base: components_open: component pml / cm open function failed

However, I still see the "Hello, world" output indicating that the program
ran to completion.

Here is another command that does NOT work:

mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca btl_openib_if_exclude cxgb3_0 ./mpitest

The error (from the root node) is:
PML cm cannot be selected

However, this time, I see no output from the program, indicating it did not
run.

The following command also fails in a similar way:
mpirun -np 2 --hostfile ~/hostfile -mca pml cm -mca mtl psm -mca btl_openib_if_exclude cxgb3_0 ./mpitest

I have verified that infinipath-psm is installed on both nodes. Both nodes
run identical CentOS 7 installations, and the libraries were installed from the CentOS
repositories (i.e. they were not compiled from source).

Both nodes run OMPI 1.10.2, compiled from the source RPM.

What am I doing wrong?

Thanks
Durga




Life is complex. It has real and imaginary parts.


Re: [OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Ralph Castain
That’s an SGE error message - looks like your tmp file system on one of the 
remote nodes is full. We don’t control where SGE puts its files, but it might 
be that your backend nodes are having issues with us doing a tree-based launch 
(i.e., where each backend daemon launches more daemons along the tree).

You could try turning the tree-based launch “off” and see if that helps: "-mca 
plm_rsh_no_tree_spawn 1"


> On Mar 16, 2016, at 3:50 PM, Lane, William  wrote:
> 
> I'm getting an error message early on:
> [csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
> [csclprd3-6-10:18352] [[36373,0],21] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> 
> According to the OpenMPI FAQ:
> 
> 'You may want to alter other parameters, but the important one is 
> "control_slaves", specifying that the environment has "tight integration". 
> Note also the lack of a start or stop procedure. The tight integration means 
> that mpirun automatically picks up the slot count to use as a default in 
> place of the '-np' argument, picks up a host file, spawns remote processes 
> via 'qrsh' so that SGE can control and monitor them, and creates and destroys 
> a per-job temporary directory ($TMPDIR), in which Open MPI's directory will 
> be created (by default).'
> 
> When I look at my OpenMPI environment there is no $TMPDIR environment 
> variable.
> 
> How does OpenMPI determine where it's going to put the "per-job temporary 
> directory ($TMPDIR)"? Does it use an SoGE defined environment variable? Is 
> the host file used by OpenMPI spawned in this $TMPDIR temporary directory?
> 
> Bill L.


[OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Lane, William
I'm getting an error message early on:
[csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
[csclprd3-6-10:18352] [[36373,0],21] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching

According to the OpenMPI FAQ:

'You may want to alter other parameters, but the important one is 
"control_slaves", specifying that the environment has "tight integration". Note 
also the lack of a start or stop procedure. The tight integration means that 
mpirun automatically picks up the slot count to use as a default in place of 
the '-np' argument, picks up a host file, spawns remote processes via 'qrsh' so 
that SGE can control and monitor them, and creates and destroys a per-job 
temporary directory ($TMPDIR), in which Open MPI's directory will be created 
(by default).'

When I look at my OpenMPI environment there is no $TMPDIR environment variable.

How does OpenMPI determine where it's going to put the "per-job temporary 
directory ($TMPDIR)"? Does it use an SoGE defined environment variable? Is the 
host file used by OpenMPI spawned in this $TMPDIR temporary directory?

Bill L.


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Cabral, Matias A
I didn't go into the code to see what is actually emitting this error message, 
but I suspect it may be a generic "out of memory" kind of error and 
not specific to the queue pair. To confirm, please add -mca pml_base_verbose 100 
and -mca mtl_base_verbose 100 to see what is being selected.
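
For example, something along these lines (illustrative; adjust to your usual command line):

mpirun -mca pml_base_verbose 100 -mca mtl_base_verbose 100 -n 512 ./IMB-MPI1 alltoallv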

I'm trying to remember some details of IMB and alltoallv to see if it 
indeed requires more resources than the other micro benchmarks. 

BTW, did you confirm the limits setup? Also, do all the nodes have the same 
amount of memory? 

_MAC


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di Domenico
Sent: Wednesday, March 16, 2016 1:25 PM
To: Open MPI Users 
Subject: Re: [OMPI users] locked memory and queue pairs

On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A  
wrote:
> Hi Michael,
>
> I may be missing some context, if you are using the qlogic cards you will 
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not openib btl. 
> As Tom suggest, confirm the limits are setup on every node: could it be the 
> alltoall is reaching a node that "others" are not? Please share the command 
> line and the error message.



Yes, under normal circumstances, I use PSM.  i only disabled to see if it 
affected any kind of change.

the test i'm running is

mpirun -n 512 ./IMB-MPI1 alltoallv

when the system gets to 128 ranks, it freezes and errors out with

---

A process failed to create a queue pair. This usually means either the device 
has run out of queue pairs (too many connections) or there are insufficient 
resources available to allocate a queue pair (out of memory). The latter can 
happen if either 1) insufficient memory is available, or 2) no more physical 
memory can be registered with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node001
Local device:   qib0
Queue pair type:Reliable connected (RC)

---

i've also tried various nodes across the cluster (200+).  i think i ruled out 
errant switch (qlogic single 12800-120) problems, bad cables, and bad nodes.  
that's not to say they may not be present, i've just not been able to find it.


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A
 wrote:
> Hi Michael,
>
> I may be missing some context, if you are using the qlogic cards you will 
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not openib btl. 
> As Tom suggest, confirm the limits are setup on every node: could it be the 
> alltoall is reaching a node that "others" are not? Please share the command 
> line and the error message.



Yes, under normal circumstances, I use PSM.  i only disabled it to see if
it made any kind of difference.

the test i'm running is

mpirun -n 512 ./IMB-MPI1 alltoallv

when the system gets to 128 ranks, it freezes and errors out with

---

A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node001
Local device:   qib0
Queue pair type:Reliable connected (RC)

---

i've also tried various nodes across the cluster (200+).  i think i
ruled out errant switch (qlogic single 12800-120) problems, bad
cables, and bad nodes.  that's not to say they may not be present,
i've just not been able to find it


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Cabral, Matias A
Hi Michael,

I may be missing some context, but if you are using the qlogic cards you will 
always want to use the psm mtl (-mca pml cm -mca mtl psm) and not the openib btl. 
As Tom suggests, confirm the limits are set up on every node: could it be that the 
alltoall is reaching a node that the "others" are not? Please share the command 
line and the error message.  

Thanks, 

_MAC

>> Begin forwarded message:
>> 
>> From: Michael Di Domenico 
>> Subject: Re: [OMPI users] locked memory and queue pairs
>> Date: March 16, 2016 at 11:32:01 AM EDT
>> To: Open MPI Users 
>> Reply-To: Open MPI Users 
>> 
>> On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico 
>>  wrote:
>>> when i try to run an openmpi job with >128 ranks (16 ranks per node) 
>>> using alltoall or alltoallv, i'm getting an error that the process 
>>> was unable to get a queue pair.
>>> 
>>> i've checked the max locked memory settings across my machines;
>>> 
>>> using ulimit -l in and outside of mpirun and they're all set to 
>>> unlimited pam modules to ensure pam_limits.so is loaded and working 
>>> the /etc/security/limits.conf is set for soft/hard mem to unlimited
>>> 
>>> i tried a couple of quick mpi config settings i could think of;
>>> 
>>> -mca mtl ^psm no affect
>>> -mca btl_openib_flags 1 no affect
>>> 
>>> the openmpi faq says to tweak some mtt values in /sys, but since i'm 
>>> not on mellanox that doesn't apply to me
>>> 
>>> the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled 
>>> ofed), running on qlogic single-port infiniband cards, psm is 
>>> enabled
>>> 
>>> other collectives seem to run okay, it seems to only be alltoall 
>>> comms that fail and only at scale
>>> 
>>> i believe (but can't prove) that this worked at one point, but i 
>>> can't recall when i last tested it.  so it's reasonable to assume 
>>> that some change to the system is preventing this.
>>> 
>>> the question is, where should i start poking to find it?
>> 
>> bump?


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom  wrote:
> Hi Mike,
>
> In this file,
> $ cat /etc/security/limits.conf
> ...
> < do you see at the end ... >
>
> * hard memlock unlimited
> * soft memlock unlimited
> # -- All InfiniBand Settings End here --
> ?

Yes.  I double-checked that it's set on all compute nodes, both in the
actual file and through the ulimit command.
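
For example, one illustrative way to report the effective limit on every node
under the launcher (adjust the hostfile and node count to your setup):

mpirun --hostfile <hosts> --map-by node -np <number_of_nodes> sh -c 'ulimit -l'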


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Elken, Tom
Hi Mike,

In this file, 
$ cat /etc/security/limits.conf
...
< do you see at the end ... >

* hard memlock unlimited
* soft memlock unlimited
# -- All InfiniBand Settings End here --
?

-Tom

> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di
> Domenico
> Sent: Thursday, March 10, 2016 8:55 AM
> To: Open MPI Users
> Subject: [OMPI users] locked memory and queue pairs
> 
> when i try to run an openmpi job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, i'm getting an error that the process was
> unable to get a queue pair.
> 
> i've checked the max locked memory settings across my machines;
> 
> using ulimit -l in and outside of mpirun and they're all set to unlimited
> pam modules to ensure pam_limits.so is loaded and working
> the /etc/security/limits.conf is set for soft/hard mem to unlimited
> 
> i tried a couple of quick mpi config settings i could think of;
> 
> -mca mtl ^psm no affect
> -mca btl_openib_flags 1 no affect
> 
> the openmpi faq says to tweak some mtt values in /sys, but since i'm
> not on mellanox that doesn't apply to me
> 
> the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled ofed),
> running on qlogic single-port infiniband cards, psm is enabled
> 
> other collectives seem to run okay, it seems to only be alltoall comms
> that fail and only at scale
> 
> i believe (but can't prove) that this worked at one point, but i can't
> recall when i last tested it.  so it's reasonable to assume that some
> change to the system is preventing this.
> 
> the question is, where should i start poking to find it?


Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico
 wrote:
> when i try to run an openmpi job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, i'm getting an error that the process was
> unable to get a queue pair.
>
> i've checked the max locked memory settings across my machines;
>
> using ulimit -l in and outside of mpirun and they're all set to unlimited
> pam modules to ensure pam_limits.so is loaded and working
> the /etc/security/limits.conf is set for soft/hard mem to unlimited
>
> i tried a couple of quick mpi config settings i could think of;
>
> -mca mtl ^psm no effect
> -mca btl_openib_flags 1 no effect
>
> the openmpi faq says to tweak some mtt values in /sys, but since i'm
> not on mellanox that doesn't apply to me
>
> the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled ofed),
> running on qlogic single-port infiniband cards, psm is enabled
>
> other collectives seem to run okay, it seems to only be alltoall comms
> that fail and only at scale
>
> i believe (but can't prove) that this worked at one point, but i can't
> recall when i last tested it.  so it's reasonable to assume that some
> change to the system is preventing this.
>
> the question is, where should i start poking to find it?

bump?


Re: [OMPI users] Error with MPI_Register_datarep

2016-03-16 Thread Edgar Gabriel

On 3/16/2016 7:06 AM, Éric Chamberland wrote:

Le 16-03-14 15:07, Rob Latham a écrit :

On mpich's discussion list the point was made that libraries like HDF5
and (Parallel-)NetCDF provide not only the sort of platform
portability Eric desires, but also provide a self-describing file format.

==rob


But I do not agree with that.

If MPI can provide me a simple solution like user datarep, why in the
world would I bind my code to another library?

Instead of re-coding all my I/O in my code, I would prefer to contribute
to MPI I/O implementations out there...  :)

So, the never answered question: How big is that task


Just speaking for OMPIO: there is a simple solution, which would 
basically perform the necessary conversion of the user buffer as a first 
step. This implementation would be fairly straightforward, but it would 
require a temporary buffer that is basically the same size as (or 
larger than, depending on the format) your input buffer, which would be a 
problem for many application scenarios.
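
For reference, the user-facing side of this looks roughly like the following
minimal sketch; the conversion callbacks are illustrative stubs and not OMPIO
internals:

#include <mpi.h>

/* Illustrative stub: convert 'count' items of 'datatype' from the file
   representation in filebuf (starting at 'position') into the native
   representation in userbuf. */
static int my_read_fn(void *userbuf, MPI_Datatype datatype, int count,
                      void *filebuf, MPI_Offset position, void *extra_state)
{
    return MPI_SUCCESS;
}

/* Illustrative stub: convert from the native userbuf into the file
   representation in filebuf. */
static int my_write_fn(void *userbuf, MPI_Datatype datatype, int count,
                       void *filebuf, MPI_Offset position, void *extra_state)
{
    return MPI_SUCCESS;
}

/* Extent of a datatype in the file representation (identity here). */
static int my_extent_fn(MPI_Datatype datatype, MPI_Aint *file_extent,
                        void *extra_state)
{
    MPI_Aint lb;
    return MPI_Type_get_extent(datatype, &lb, file_extent);
}

/* Then, inside the application:
   MPI_Register_datarep("my_datarep", my_read_fn, my_write_fn, my_extent_fn, NULL);
   MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "my_datarep", MPI_INFO_NULL);
*/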


The problem with trying to perform the conversion at a later step is 
that all the buffers are treated as byte sequences internally, so the 
notion of data types is lost at some point. This is especially 
important for collective I/O, since the aggregation step might, in some 
extreme situations, even break up a datatype so that it is written in different 
cycles (or by different aggregators) internally.


That being said, I admit that I haven't spent too much time thinking 
about solutions to this problem. If there is interest, I would be 
happy to work on it - and happy to accept help :-)


Edgar


Also, in 2012, I can state that having looked at HDF5, there was no
functions that used collective MPI I/O for *randomly distributed*
data...  Collective I/O was available only for "structured" data. So I
coded it all directly into MPI natives calls... and it works like a charm!

Thanks,

Eric



Re: [OMPI users] Error with MPI_Register_datarep

2016-03-16 Thread Éric Chamberland



Le 16-03-14 15:07, Rob Latham a écrit :


On mpich's discussion list the point was made that libraries like HDF5 
and (Parallel-)NetCDF provide not only the sort of platform 
portability Eric desires, but also provide a self-describing file format.


==rob


But I do not agree with that.

If MPI can provide me a simple solution like user datarep, why in the 
world would I bind my code to another library?


Instead of re-coding all my I/O in my code, I would prefer to contribute 
to MPI I/O implementations out there...  :)


So, the never-answered question: how big is that task?

Also, I can state that having looked at HDF5 in 2012, there were no 
functions that used collective MPI I/O for *randomly distributed* 
data...  Collective I/O was available only for "structured" data. So I 
coded it all directly with native MPI calls... and it works like a charm!


Thanks,

Eric



Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
In the case of an MPI application (not Gromacs), how do I relocate an MPI
application from one node to another node while it is running?
I'm sorry, but as far as I know the *ompi-restart* command is used to restart
an application, based on a checkpoint file, once the application has already
terminated (is no longer running).

Thanks

regards,

Husen

On Wed, Mar 16, 2016 at 4:29 PM, Jeff Hammond 
wrote:

> Just checkpoint-restart the app to relocate. The overhead will be lower
> than trying to do with MPI.
>
> Jeff
>
>
> On Wednesday, March 16, 2016, Husen R  wrote:
>
>> Hi Jeff,
>>
>> Thanks for the reply.
>>
>> After consulting the Gromacs docs, as you suggested, Gromacs already
>> supports checkpoint/restart. thanks for the suggestion.
>>
>> Previously, I asked about checkpoint/restart in Open MPI because I want
>> to checkpoint MPI Application and restart/migrate it while it is running.
>> For the example, I run MPI application in node A,B and C in a cluster and
>> I want to migrate process running in node A to other node, let's say to
>> node C.
>> is there a way to do this with open MPI ? thanks.
>>
>> Regards,
>>
>> Husen
>>
>>
>>
>>
>> On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond 
>> wrote:
>>
>>> Why do you need OpenMPI to do this? Molecular dynamics trajectories are
>>> trivial to checkpoint and restart at the application level. I'm sure
>>> Gromacs already supports this. Please consult the Gromacs docs or user
>>> support for details.
>>>
>>> Jeff
>>>
>>>
>>> On Tuesday, March 15, 2016, Husen R  wrote:
>>>
 Dear Open MPI Users,


 Does the current stable release of Open MPI (v1.10 series) support
 fault tolerant feature ?
 I got the information from Open MPI FAQ that The checkpoint/restart
 support was last released as part of the v1.6 series.
 I just want to make sure about this.

 and by the way, does Open MPI able to checkpoint or restart mpi
 application/GROMACS automatically ?
 Please, I really need help.

 Regards,


 Husen

>>>
>>>
>>> --
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> http://jeffhammond.github.io/
>>>
>>>
>>
>>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Jeff Hammond
Just checkpoint-restart the app to relocate it. The overhead will be lower
than trying to do it with MPI.

Jeff

On Wednesday, March 16, 2016, Husen R  wrote:

> Hi Jeff,
>
> Thanks for the reply.
>
> After consulting the Gromacs docs, as you suggested, Gromacs already
> supports checkpoint/restart. thanks for the suggestion.
>
> Previously, I asked about checkpoint/restart in Open MPI because I want to
> checkpoint MPI Application and restart/migrate it while it is running.
> For the example, I run MPI application in node A,B and C in a cluster and
> I want to migrate process running in node A to other node, let's say to
> node C.
> is there a way to do this with open MPI ? thanks.
>
> Regards,
>
> Husen
>
>
>
>
> On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond  > wrote:
>
>> Why do you need OpenMPI to do this? Molecular dynamics trajectories are
>> trivial to checkpoint and restart at the application level. I'm sure
>> Gromacs already supports this. Please consult the Gromacs docs or user
>> support for details.
>>
>> Jeff
>>
>>
>> On Tuesday, March 15, 2016, Husen R > > wrote:
>>
>>> Dear Open MPI Users,
>>>
>>>
>>> Does the current stable release of Open MPI (v1.10 series) support fault
>>> tolerant feature ?
>>> I got the information from Open MPI FAQ that The checkpoint/restart
>>> support was last released as part of the v1.6 series.
>>> I just want to make sure about this.
>>>
>>> and by the way, does Open MPI able to checkpoint or restart mpi
>>> application/GROMACS automatically ?
>>> Please, I really need help.
>>>
>>> Regards,
>>>
>>>
>>> Husen
>>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> 
>> http://jeffhammond.github.io/
>>
>
>

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
Hi Jeff,

Thanks for the reply.

After consulting the Gromacs docs, as you suggested, I found that Gromacs already
supports checkpoint/restart. Thanks for the suggestion.

Previously, I asked about checkpoint/restart in Open MPI because I want to
checkpoint an MPI application and restart/migrate it while it is running.
For example, I run an MPI application on nodes A, B and C in a cluster, and I
want to migrate the process running on node A to another node, let's say to node
C.
Is there a way to do this with Open MPI? Thanks.

Regards,

Husen




On Wed, Mar 16, 2016 at 12:37 PM, Jeff Hammond 
wrote:

> Why do you need OpenMPI to do this? Molecular dynamics trajectories are
> trivial to checkpoint and restart at the application level. I'm sure
> Gromacs already supports this. Please consult the Gromacs docs or user
> support for details.
>
> Jeff
>
>
> On Tuesday, March 15, 2016, Husen R  wrote:
>
>> Dear Open MPI Users,
>>
>>
>> Does the current stable release of Open MPI (v1.10 series) support fault
>> tolerant feature ?
>> I got the information from Open MPI FAQ that The checkpoint/restart
>> support was last released as part of the v1.6 series.
>> I just want to make sure about this.
>>
>> and by the way, does Open MPI able to checkpoint or restart mpi
>> application/GROMACS automatically ?
>> Please, I really need help.
>>
>> Regards,
>>
>>
>> Husen
>>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>


Re: [OMPI users] Open SHMEM Error

2016-03-16 Thread Gilles Gouaillardet

Ray,

from shmem_ptr man page :

RETURN VALUES
   shmem_ptr returns a pointer to the data object on the specified 
remote PE. If target is not remotely accessible, a NULL pointer is returned.


since you are running your application on two hosts and one task per 
host, the target is not remotely accessible, and hence the NULL pointer.

if you run two tasks on the same node, then the test should be fine.

note openshmem does not provide a virtual shared memory system.
if you want to run across nodes, then you need to use shmem_{get,put}.
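
for example, here is a minimal sketch of that pattern (illustrative, not taken
from your attached code; run it with at least two PEs):

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    static long src = 42;                       /* symmetric data object */
    long dst = 0;
    shmem_init();
    if (shmem_my_pe() == 0) {
        long *p = (long *) shmem_ptr(&src, 1);  /* try direct access to PE 1 */
        if (p != NULL)
            dst = *p;                           /* same node: load/store works */
        else
            shmem_long_get(&dst, &src, 1, 1);   /* remote node: use a get */
        printf("PE 0 read %ld from PE 1\n", dst);
    }
    shmem_finalize();
    return 0;
}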

Cheers,

Gilles

On 3/16/2016 2:59 PM, RYAN RAY wrote:

Dear Gilles

I have attached the source code and the hostfile.

Regards

Ryan

From: Gilles Gouaillardet 
Sent: Tue, 15 Mar 2016 15:44:48
To: Open MPI Users 
Subject: Re: [OMPI users] Open SHMEM Error
Ryan,

can you please post your source code and hostfile ?

Cheers,

Gilles

On Tuesday, March 15, 2016, RYAN RAY  wrote:

Dear Gilles,
Thanks for the reply. After executing the code as you told I get
the output as shown in the attached snapshot.
So I am understanding that the code cannot remotely access the
array at PE1 from PE0. Can you please explain why this is happenning?

Regards,
Ryan

From: Gilles Gouaillardet >
Sent: Fri, 04 Mar 2016 11:16:38
To: Open MPI Users >
Subject: Re: [OMPI users] Open SHMEM Error
Ryan,

do you really get a segmentation fault ?

here is the message i have :

---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
oshrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23403,1],0]
  Exit code:1
--

the root cause is the test program ends with
return 1;
instead of
return 0;

/* and i cannot figure out a rationale for that, i just replaced
this with return 0; and that was fine*/

fwiw, this example uses the deprecated start_pes(0) instead of
shmem_init();
and there is no shmem_finalize();

Cheers,

Gilles

On 3/3/2016 4:15 PM, RYAN RAY wrote:




From: "RYAN RAY" ryan@rediffmail.com

Sent: Thu, 03 Mar 2016 12:26:19 +0530
To: "announce " annou...@open-mpi.org
,
"ryan.ray " ryan@rediffmail.com

Subject: Open SHMEM Error


On trying a code specified in the manual "OpenSHMEM Specification Draft"
(section 8.16 example code), we are facing a problem.
The code is the C version of the example code for the call SHMEM_PTR.
We have written the code exactly as it is in the manual, but we are
getting a segmentation fault.
The code, manual and error snapshots are attached in this mail.

I will be grateful if you can provide any solution to this problem.

RYAN SAPTARSHI RAY




Re: [OMPI users] Open SHMEM Error

2016-03-16 Thread RYAN RAY
Dear Gilles
I have attached the source code and the hostfile.
Regards
Ryan

From: Gilles Gouaillardet gilles.gouaillar...@gmail.com
Sent: Tue, 15 Mar 2016 15:44:48 
To: Open MPI Users us...@open-mpi.org
Subject: Re: [OMPI users] Open SHMEM Error
Ryan,
can you please post your source code and hostfile ?
Cheers,
Gilles

On Tuesday, March 15, 2016, RYAN RAY ryan@rediffmail.com wrote:
Dear Gilles,
Thanks for the reply. After executing the code as you told me, I get the
output shown in the attached snapshot. So I understand that the code cannot
remotely access the array at PE1 from PE0. Can you please explain why this
is happening?
Regards,
Ryan

From: Gilles Gouaillardet gil...@rist.or.jp
Sent: Fri, 04 Mar 2016 11:16:38 
To: Open MPI Users us...@open-mpi.org
Subject: Re: [OMPI users]  Open SHMEM Error





Ryan,

do you really get a segmentation fault ?

here is the message i have :

---
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
oshrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[23403,1],0]
  Exit code: 1
--

the root cause is the test program ends with
return 1;
instead of
return 0;

/* and i cannot figure out a rationale for that, i just replaced this with
return 0; and that was fine */

fwiw, this example uses the deprecated start_pes(0) instead of shmem_init();
and there is no shmem_finalize();

Cheers,

Gilles



On 3/3/2016 4:15 PM, RYAN RAY wrote:


From: "RYAN RAY" ryan@rediffmail.com
Sent: Thu, 03 Mar 2016 12:26:19 +0530
To: "announce " annou...@open-mpi.org, "ryan.ray " ryan@rediffmail.com
Subject: Open SHMEM Error

On trying a code specified in the manual "OpenSHMEM Specification Draft"
(section 8.16 example code), we are facing a problem.
The code is the C version of the example code for the call SHMEM_PTR.
We have written the code exactly as it is in the manual, but we are
getting a segmentation fault.
The code, manual and error snapshots are attached in this mail.

I will be grateful if you can provide any solution to this problem.

RYAN SAPTARSHI RAY



sourcecode.doc
Description: MS-Word document


hostfile.doc
Description: MS-Word document


Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Jeff Hammond
Why do you need OpenMPI to do this? Molecular dynamics trajectories are
trivial to checkpoint and restart at the application level. I'm sure
Gromacs already supports this. Please consult the Gromacs docs or user
support for details.
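
(For example, mdrun writes a checkpoint file periodically and a run can be
continued from it with something like "gmx mdrun -s topol.tpr -cpi state.cpt";
that command line is illustrative only, so check the Gromacs documentation for
the exact invocation.)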

Jeff

On Tuesday, March 15, 2016, Husen R  wrote:

> Dear Open MPI Users,
>
>
> Does the current stable release of Open MPI (v1.10 series) support fault
> tolerant feature ?
> I got the information from Open MPI FAQ that The checkpoint/restart
> support was last released as part of the v1.6 series.
> I just want to make sure about this.
>
> and by the way, does Open MPI able to checkpoint or restart mpi
> application/GROMACS automatically ?
> Please, I really need help.
>
> Regards,
>
>
> Husen
>


-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


[OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
Dear Open MPI Users,


Does the current stable release of Open MPI (the v1.10 series) support the fault
tolerance feature?
I got the information from the Open MPI FAQ that the checkpoint/restart support
was last released as part of the v1.6 series.
I just want to make sure about this.

And by the way, is Open MPI able to checkpoint or restart an MPI
application/GROMACS automatically?
Please, I really need help.

Regards,


Husen