Re: [OMPI users] ConnectX with InfiniHost IB HCAs

2011-08-27 Thread Yevgeny Kliteynik
Egor,

If updating OFED doesn't solve the problem (and I kinda have the
feeling that it will), you might want to try this mailing list
for IB interoperability questions:
linux-r...@vger.kernel.org
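
In the meantime, it may also be worth double-checking that the verbs stack
itself can open the ConnectX device and query the port, independently of the
OFED test programs. Below is a minimal sketch using the standard libibverbs
calls (compile with -libverbs; error handling is mostly omitted):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int i, num;
        struct ibv_device **devs = ibv_get_device_list(&num);

        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        for (i = 0; i < num; i++) {
            struct ibv_context  *ctx = ibv_open_device(devs[i]);
            struct ibv_port_attr attr;

            if (!ctx)
                continue;
            /* query port 1; the MT26428 below has a single port */
            if (ibv_query_port(ctx, 1, &attr) == 0)
                printf("%s port 1: state=%d width=%d speed=%d\n",
                       ibv_get_device_name(devs[i]), (int)attr.state,
                       (int)attr.active_width, (int)attr.active_speed);
            ibv_close_device(ctx);
        }
        ibv_free_device_list(devs);
        return 0;
    }

If even this misbehaves on the ConnectX host, the problem is below MPI.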

-- YK

On 26-Aug-11 4:42 PM, Shamis, Pavel wrote:
> You may try to update your OFED version. I think 1.5.3 is the latest one.
> 
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> 
> 
> 
> 
> 
> 
> On Aug 25, 2011, at 7:46 PM, wrote:
> 
>>
>> Hi all,
>>
>> This is more of a hardware or system configuration question, but
>> I hope people on this list have some experience with it.
>> I have just added a new ConnectX IB card to a cluster with InfiniHost cards,
>> and now no MPI programs work. Even OFED's tests do not work.
>> For example, ib_send_* and ib_write_* just segfault on the host with the
>> ConnectX card and keep waiting on the hosts with InfiniHost cards. The
>> rdma_lat/bw tests segfault too, but on the InfiniHost hosts they print
>> messages like this:
>> server read: No such file or directory
>> 5924:pp_server_exch_dest: 0/45 Couldn't read remote address
>>
>> pp_read_keys: No such file or directory
>> Couldn't read remote address
>>
>> Other diagnostic tools like ibv_device, ibchecknet, ibstat, ibstatus... show
>> no errors and show the ConnectX card in the system. All modules (mlx4_*,
>> rdma_*) are loaded. IPoIB is configured. The openibd and opensmd services
>> started without errors.
>>
>> 08:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 
>> 5GT/s - IB QDR / 10GigE] (rev a0)
>> OFED is 1.3.1, CentOS 5.2.
>>
>> ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 1
>> Firmware version: 2.7.0
>> Hardware version: a0
>> Node GUID: 0x0002c903000cad14
>> System image GUID: 0x0002c903000cad17
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 20
>> Base lid: 60
>> LMC: 0
>> SM lid: 60
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c903000cad15
>>
>> Where is the problem?
>>
>> Thanx in advance,
>> Egor.
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] Related to project ideas in OpenMPI

2011-08-27 Thread Ralph Castain
Let's chat off-list about it - I don't see exactly how this works, but it may 
be similar enough. 


On Aug 27, 2011, at 8:30 AM, Joshua Hursey wrote:

> There is a 'self' checkpointer (CRS component) that does application-level 
> checkpointing - exposed at the MPI level. I don't know how different it is 
> from what you are working on, but maybe something like that could be 
> harnessed. Note that I have not tested the 'self' checkpointer with the 
> process migration support; it -should- work, but there might be some bugs 
> to work out.
> 
> Documentation and examples at the link below:
>  http://osl.iu.edu/research/ft/ompi-cr/examples.php#example-self
> 
> -- Josh
> 
> On Aug 26, 2011, at 6:17 PM, Ralph Castain wrote:
> 
>> FWIW: I'm in the process of porting some code from a branch that allows apps 
>> to do on-demand checkpoint/recovery style operations at the app level. 
>> Specifically, it provides the ability to:
>> 
>> * request a "recovery image" - an application-level blob containing state 
>> info required for the app to recover its state.
>> 
>> * register a callback point for providing a "recovery image", either to 
>> store for later use (separate API is used to indicate when to acquire it) or 
>> to provide to another process upon request
>> 
>> This is at the RTE level, so someone would have to expose it via an 
>> appropriate MPI call if someone wants to use it at that layer (I'm open to 
>> changes to support that use, if someone is interested).
>> 
>> 
>> On Aug 26, 2011, at 3:16 PM, Josh Hursey wrote:
>> 
>>> There are some great comments in this thread. Process migration (like
>>> many topics in systems) can get complex fast.
>>> 
>>> The Open MPI process migration implementation is checkpoint/restart
>>> based (currently using BLCR), and uses an 'eager' style of migration.
>>> This style of migration stops a process completely on the source
>>> machine, checkpoints/terminates it, restarts it on the destination
>>> machine, then rejoins it with the other running processes. I think the
>>> only documentation that we have is at the webpage below (and my PhD
>>> thesis, if you want the finer details):
>>> http://osl.iu.edu/research/ft/ompi-cr/
>>> 
>>> We have wanted to experiment with a 'pre-copy' or 'live' migration
>>> style, but have not had the necessary support from the underlying
>>> checkpointer or time to devote to making it happen. I think BLCR is
>>> working on including the necessary pieces in a future release (there
>>> are papers where a development version of BLCR has done this with
>>> LAM/MPI). So that might be something of interest.
>>> 
>>> Process migration techniques can benefit from fault prediction and
>>> 'good' target destination selection. Fault prediction allows us to
>>> move processes away from soon-to-fail locations, but it can be
>>> difficult to accurately predict failures. Open MPI has some hooks in
>>> the runtime layer that support 'sensors' which might help here. Good
>>> target destination selection is equally complex, but the idea here is
>>> to move processes to a machine where they can continue supporting the
>>> efficient execution of the application. So this might mean moving to
>>> the least loaded machine, or moving to a machine with other processes
>>> to reduce interprocess communication (something like dynamic load
>>> balancing).
>>> 
>>> So there are some ideas to get you started.
>>> 
>>> -- Josh
>>> 
>>> On Thu, Aug 25, 2011 at 12:06 PM, Rayson Ho  wrote:
 Don't know which SSI project you are referring to... I only know the
 OpenSSI project, and I was one of the first to subscribe to its
 mailing list (back in 2001).
 
 http://openssi.org/cgi-bin/view?page=openssi.html
 
 I don't think those OpenSSI clusters are designed for tens of
 thousands of nodes, and I'm not sure they scale well to even a thousand
 nodes -- so IMO they have limited use for HPC clusters.
 
 Rayson
 
 
 
 On Thu, Aug 25, 2011 at 11:45 AM, Durga Choudhury  
 wrote:
> Also, in 2005 there was an attempt to implement SSI (Single System
> Image) functionality in the then-current 2.6.10 kernel. The proposal
> was very detailed and covered most of the bases of task creation, PID
> allocation, etc. across a loosely tied cluster (without using fancy
> hardware such as an RDMA fabric). Does anybody know if it was ever
> implemented? Any pointers in this direction?
> 
> Thanks and regards
> Durga
> 
> 
> On Thu, Aug 25, 2011 at 11:08 AM, Rayson Ho  wrote:
>> Srinivas,
>> 
>> There's also Kernel-Level Checkpointing vs. User-Level Checkpointing -
>> if you can checkpoint an MPI task and restart it on a new node, then
>> this is also "process migration".
>> 
>> Of course, doing a checkpoint & restart can be slower than pure
>> in-kernel process migration, but the advantage is that you don't need
>> any kernel support, and can in fact do all of it in user-space.

Re: [OMPI users] How to add nodes while running job

2011-08-27 Thread Ralph Castain

On Aug 27, 2011, at 8:28 AM, Rayson Ho wrote:

> On Sat, Aug 27, 2011 at 9:12 AM, Ralph Castain  wrote:
>> OMPI has no way of knowing that you will turn the node on at some future
>> point. All it can do is try to launch the job on the provided node, which
>> fails because the node doesn't respond.
>> You'll have to come up with some scheme for telling the node to turn on in
>> anticipation of starting a job - a resource manager is typically used for
>> that purpose.
> 
> Hi Ralph,
> 
> Are you referring to a specific resource manager/batch system? AFAIK,
> no common batch systems support MPI_Spawn properly...

Usually, resource managers "turn on" nodes when allocating them for use by a 
job - SLURM is an example that does this. This helps the cluster save energy 
when not in use. I believe almost all the RMs out there now support this to 
some degree.

Support for MPI_Comm_spawn (i.e., dynamically allocating new nodes as required 
by a running MPI job and turning them on) doesn't exist (to my knowledge) at 
this time, mostly because this MPI feature is so rarely used. I've helped 
(integrating from the OMPI side) several groups that were adding such support 
to various RMs (typically Torque), but I don't think that code has hit a 
distribution yet.
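
For reference, here is roughly what the application side of MPI_Comm_spawn
looks like today (a sketch only -- the "add-host" info key is an Open MPI
extension, the host name is made up, and the node must already be booted and
reachable or part of the allocation; check the MPI_Comm_spawn man page for
your version):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;
        int errcodes[2];

        MPI_Init(&argc, &argv);

        /* Ask the runtime to place the spawned processes on an additional
         * host. This does NOT power the node on - it has to be up already. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "add-host", "node23");   /* hypothetical host name */

        /* Collective over MPI_COMM_WORLD: every rank calls it, rank 0 is root */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }

The point stands: even with the info key, the runtime can only use resources
it already knows about and can reach.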


> 
> Rayson
> 
> 
> 
> 
>> On Aug 27, 2011, at 6:58 AM, Rafael Braga wrote:
>> 
>> I would like to know how to add nodes during a job execution.
>> Right now my hostfile has the node 10.0.0.23, which is off.
>> I would like to start this node during the execution so that the job can use it.
>> When I run the command:
>> 
>> mpirun -np 2 -hostfile /tmp/hosts application
>> 
>> the following message appears:
>> 
>> ssh: connect to host 10.0.0.23 port 22: No route to host
>> --
>> A daemon (pid 10773) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> mpirun: clean termination accomplished
>> 
>> thanks a lot,
>> 
>> --
>> Rafael Braga
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> 
> 
> -- 
> Rayson
> 
> ==
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Related to project ideas in OpenMPI

2011-08-27 Thread Joshua Hursey
There is a 'self' checkpointer (CRS component) that does application-level 
checkpointing - exposed at the MPI level. I don't know how different it is from 
what you are working on, but maybe something like that could be harnessed. Note 
that I have not tested the 'self' checkpointer with the process migration 
support; it -should- work, but there might be some bugs to work out.

Documentation and examples at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/examples.php#example-self
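
To give a flavor of what application-level checkpointing means, here is a
bare-bones, self-contained sketch of a program that saves and restores its own
state (illustration only -- the actual hook names and signatures used by the
'self' CRS component are described in the documentation above):

    #include <stdio.h>

    /* The only state this toy application needs to resume where it left off. */
    struct app_state {
        long   iteration;
        double partial_sum;
    };

    static int save_state(const struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%ld %.17g\n", s->iteration, s->partial_sum);
        fclose(f);
        return 0;
    }

    static int restore_state(struct app_state *s, const char *path)
    {
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;                    /* no checkpoint yet: start fresh */
        if (fscanf(f, "%ld %lf", &s->iteration, &s->partial_sum) != 2) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        struct app_state s = { 0, 0.0 };

        restore_state(&s, "app.ckpt");    /* resume if a checkpoint exists */
        for (; s.iteration < 1000000; s.iteration++) {
            if (s.iteration % 100000 == 0)
                save_state(&s, "app.ckpt");   /* periodic checkpoint */
            s.partial_sum += 1.0 / (double)(s.iteration + 1);
        }
        printf("sum = %.10f\n", s.partial_sum);
        return 0;
    }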

-- Josh

On Aug 26, 2011, at 6:17 PM, Ralph Castain wrote:

> FWIW: I'm in the process of porting some code from a branch that allows apps 
> to do on-demand checkpoint/recovery style operations at the app level. 
> Specifically, it provides the ability to:
> 
> * request a "recovery image" - an application-level blob containing state 
> info required for the app to recover its state.
> 
> * register a callback point for providing a "recovery image", either to store 
> for later use (separate API is used to indicate when to acquire it) or to 
> provide to another process upon request
> 
> This is at the RTE level, so someone would have to expose it via an 
> appropriate MPI call if someone wants to use it at that layer (I'm open to 
> changes to support that use, if someone is interested).
> 
> 
> On Aug 26, 2011, at 3:16 PM, Josh Hursey wrote:
> 
>> There are some great comments in this thread. Process migration (like
>> many topics in systems) can get complex fast.
>> 
>> The Open MPI process migration implementation is checkpoint/restart
>> based (currently using BLCR), and uses an 'eager' style of migration.
>> This style of migration stops a process completely on the source
>> machine, checkpoints/terminates it, restarts it on the destination
>> machine, then rejoins it with the other running processes. I think the
>> only documentation that we have is at the webpage below (and my PhD
>> thesis, if you want the finer details):
>> http://osl.iu.edu/research/ft/ompi-cr/
>> 
>> We have wanted to experiment with a 'pre-copy' or 'live' migration
>> style, but have not had the necessary support from the underlying
>> checkpointer or time to devote to making it happen. I think BLCR is
>> working on including the necessary pieces in a future release (there
>> are papers where a development version of BLCR has done this with
>> LAM/MPI). So that might be something of interest.
>> 
>> Process migration techniques can benefit from fault prediction and
>> 'good' target destination selection. Fault prediction allows us to
>> move processes away from soon-to-fail locations, but it can be
>> difficult to accurately predict failures. Open MPI has some hooks in
>> the runtime layer that support 'sensors' which might help here. Good
>> target destination selection is equally complex, but the idea here is
>> to move processes to a machine where they can continue supporting the
>> efficient execution of the application. So this might mean moving to
>> the least loaded machine, or moving to a machine with other processes
>> to reduce interprocess communication (something like dynamic load
>> balancing).
>> 
>> So there are some ideas to get you started.
>> 
>> -- Josh
>> 
>> On Thu, Aug 25, 2011 at 12:06 PM, Rayson Ho  wrote:
>>> Don't know which SSI project you are referring to... I only know the
>>> OpenSSI project, and I was one of the first to subscribe to its
>>> mailing list (back in 2001).
>>> 
>>> http://openssi.org/cgi-bin/view?page=openssi.html
>>> 
>>> I don't think those OpenSSI clusters are designed for tens of
>>> thousands of nodes, and I'm not sure they scale well to even a thousand
>>> nodes -- so IMO they have limited use for HPC clusters.
>>> 
>>> Rayson
>>> 
>>> 
>>> 
>>> On Thu, Aug 25, 2011 at 11:45 AM, Durga Choudhury  
>>> wrote:
 Also, in 2005 there was an attempt to implement SSI (Single System
 Image) functionality in the then-current 2.6.10 kernel. The proposal
 was very detailed and covered most of the bases of task creation, PID
 allocation, etc. across a loosely tied cluster (without using fancy
 hardware such as an RDMA fabric). Does anybody know if it was ever
 implemented? Any pointers in this direction?
 
 Thanks and regards
 Durga
 
 
 On Thu, Aug 25, 2011 at 11:08 AM, Rayson Ho  wrote:
> Srinivas,
> 
> There's also Kernel-Level Checkpointing vs. User-Level Checkpointing -
> if you can checkpoint an MPI task and restart it on a new node, then
> this is also "process migration".
> 
> Of course, doing a checkpoint & restart can be slower than pure
> in-kernel process migration, but the advantage is that you don't need
> any kernel support, and can in fact do all of it in user-space.
> 
> Rayson
> 
> 
> On Thu, Aug 25, 2011 at 10:26 AM, Ralph Castain  wrote:
>> It also depends on what part of migration interests you - are you 
>> wanting to look at the MPI part of the problem (reconnecting MPI 
>> transports

Re: [OMPI users] How to add nodes while running job

2011-08-27 Thread Rayson Ho
On Sat, Aug 27, 2011 at 9:12 AM, Ralph Castain  wrote:
> OMPI has no way of knowing that you will turn the node on at some future
> point. All it can do is try to launch the job on the provided node, which
> fails because the node doesn't respond.
> You'll have to come up with some scheme for telling the node to turn on in
> anticipation of starting a job - a resource manager is typically used for
> that purpose.

Hi Ralph,

Are you referring to a specific resource manager/batch system? AFAIK,
no common batch systems support MPI_Spawn properly...

Rayson




> On Aug 27, 2011, at 6:58 AM, Rafael Braga wrote:
>
> I would like to know how to add nodes during a job execution.
> Right now my hostfile has the node 10.0.0.23, which is off.
> I would like to start this node during the execution so that the job can use it.
> When I run the command:
>
> mpirun -np 2 -hostfile /tmp/hosts application
>
> the following message appears:
>
> ssh: connect to host 10.0.0.23 port 22: No route to host
> --
> A daemon (pid 10773) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
> thanks a lot,
>
> --
> Rafael Braga
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/


Re: [OMPI users] How to add nodes while running job

2011-08-27 Thread Ralph Castain
OMPI has no way of knowing that you will turn the node on at some future point. 
All it can do is try to launch the job on the provided node, which fails 
because the node doesn't respond.

You'll have to come up with some scheme for telling the node to turn on in 
anticipation of starting a job - a resource manager is typically used for that 
purpose.

On Aug 27, 2011, at 6:58 AM, Rafael Braga wrote:

> I would like to know how to add nodes during a job execution. 
> Right now my hostfile has the node 10.0.0.23, which is off. 
> I would like to start this node during the execution so that the job can use it.
> When I run the command:
> 
> mpirun -np 2 -hostfile /tmp/hosts application
> 
> the following message appears:
> 
> ssh: connect to host 10.0.0.23 port 22: No route to host
> --
> A daemon (pid 10773) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
> 
> thanks a lot,
> 
> -- 
> Rafael Braga
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] How to add nodes while running job

2011-08-27 Thread Rafael Braga
I would like to know how to add nodes during a job execution.
Right now my hostfile has the node 10.0.0.23, which is off.
I would like to start this node during the execution so that the job can use it.
When I run the command:

mpirun -np 2 -hostfile /tmp/hosts application

the following message appears:

ssh: connect to host 10.0.0.23 port 22: No route to host
--
A daemon (pid 10773) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

thanks a lot,

-- 
Rafael Braga