Re: [OMPI users] Newbie question

2011-01-10 Thread pooja varshneya
You can use mpirun.
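
To make that a little more concrete, here is a minimal sketch in Python that only builds a hostfile and the matching mpirun command line; it does not run anything. The host names, slot counts, and the file name hosts.txt are made up for illustration. It assumes the common pattern of starting a single R master under mpirun, so that slaves spawned later by mpi.spawn.Rslaves() can be placed on the hosts listed in the hostfile. Whether they actually land there depends on how the R MPI package was built against Open MPI.

```python
import shlex


def hostfile_text(hosts):
    """Render (hostname, slots) pairs in Open MPI hostfile syntax."""
    return "".join(f"{name} slots={slots}\n" for name, slots in hosts)


def mpirun_command(hostfile_path):
    """Build an mpirun argv that starts one interactive R master process."""
    return ["mpirun", "-np", "1", "--hostfile", hostfile_path,
            "R", "--no-save", "-q"]


# Hypothetical hosts and slot counts; replace with real machine names.
hosts = [("compute-0-0", 4), ("compute-0-1", 8), ("compute-0-2", 8)]

print(hostfile_text(hosts), end="")
print(shlex.join(mpirun_command("hosts.txt")))
```

The first lines printed are what would go into hosts.txt, and the last line is the command to type at the shell. Inside that R session, mpi.spawn.Rslaves( nslaves=20 ) would then be issued as before.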

On Mon, Jan 10, 2011 at 8:04 PM, Tena Sakai  wrote:

>  Hi,
>
> I am an MPI newbie.  My Open MPI is v1.4.3, which I compiled
> on a Linux machine.
>
> I am using a language called R, which has an MPI interface/package.
> On the surface, it appears to be happy with the Open MPI I installed.
>
> There is an R function called mpi.spawn.Rslaves().  An argument to
> this function is nslaves.  I can issue, for example,
>   mpi.spawn.Rslaves( nslaves=20 )
> And it spawns 20 slave processes.  The trouble is that they all run on
> the same node as the master.  I want, instead, these 20 (or more)
> slaves spawned on other machines on the network.
>
> It so happens that mpi.spawn.Rslaves() has an extra argument called
> hosts.  Here’s the definition of hosts from the API document: “NULL or
> LAM node numbers to specify where R slaves to be spawned.”  I have
> no idea what a LAM node is, but there is a function called lamhosts(),
> which returns a rather verbose message:
>
>   It seems that there is no lamd running on the host compute-0-0.local.
>
>   This indicates that the LAM/MPI runtime environment is not operating.
>   The LAM/MPI runtime environment is necessary for the "lamnodes" command.
>
>   Please run the "lamboot" command the start the LAM/MPI runtime
>   environment.  See the LAM/MPI documentation for how to invoke
>   "lamboot" across multiple machines.
>
> Here’s my question.  Is there a command like lamboot in Open MPI 1.4.3?
> Or am I using the wrong MPI software?  In a FAQ I read that there are other
> MPI implementations (FT-MPI, LA-MPI, LAM/MPI), but I had the notion that
> Open MPI is meant to have the functionality of all of them.  Is this a
> wrong impression?
>
> Thank you for your help.
>
> Tena Sakai
> tsa...@gallo.ucsf.edu


[OMPI users] Newbie question

2011-01-10 Thread Tena Sakai
Hi,

I am an MPI newbie.  My Open MPI is v1.4.3, which I compiled
on a Linux machine.

I am using a language called R, which has an MPI interface/package.
On the surface, it appears to be happy with the Open MPI I installed.

There is an R function called mpi.spawn.Rslaves().  An argument to
this function is nslaves.  I can issue, for example,
  mpi.spawn.Rslaves( nslaves=20 )
And it spawns 20 slave processes.  The trouble is that they all run on
the same node as the master.  I want, instead, these 20 (or more)
slaves spawned on other machines on the network.

It so happens that mpi.spawn.Rslaves() has an extra argument called
hosts.  Here’s the definition of hosts from the API document: “NULL or
LAM node numbers to specify where R slaves to be spawned.”  I have
no idea what a LAM node is, but there is a function called lamhosts(),
which returns a rather verbose message:

  It seems that there is no lamd running on the host compute-0-0.local.

  This indicates that the LAM/MPI runtime environment is not operating.
  The LAM/MPI runtime environment is necessary for the "lamnodes" command.

  Please run the "lamboot" command the start the LAM/MPI runtime
  environment.  See the LAM/MPI documentation for how to invoke
  "lamboot" across multiple machines.

Here’s my question.  Is there a command like lamboot in Open MPI 1.4.3?
Or am I using the wrong MPI software?  In a FAQ I read that there are other
MPI implementations (FT-MPI, LA-MPI, LAM/MPI), but I had the notion that
Open MPI is meant to have the functionality of all of them.  Is this a
wrong impression?

Thank you for your help.

Tena Sakai
tsa...@gallo.ucsf.edu


Re: [OMPI users] CQ errors

2011-01-10 Thread Michael Di Domenico
2011/1/10 Peter Kjellström :
> On Monday, January 10, 2011 03:06:06 pm Michael Di Domenico wrote:
>> I'm not sure if these are being reported from OpenMPI or through
>> OpenMPI from OpenFabrics, but I figured this would be a good place to
>> start.
>>
>> On one node we received the errors below.  I'm not sure I understand
>> the error sequence; hopefully someone can shed some light on what
>> happened.
>>
>> [[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
> ...
>> network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)
>
> Not really addressing your problem, but, with qlogic you should be using psm,
> not verbs (btl_openib).
>
> That said, openib should work (slowly).

Yes, you are correct.  We're running via verbs at the moment because
of a SLURM interop issue.  I have a patch from Ralph but have not
tested it yet.

So far the only noticeable effect of running non-psm is a 5 usec hit
on each packet.  Otherwise, functionally we seem okay.



Re: [OMPI users] CQ errors

2011-01-10 Thread Peter Kjellström
On Monday, January 10, 2011 03:06:06 pm Michael Di Domenico wrote:
> I'm not sure if these are being reported from OpenMPI or through
> OpenMPI from OpenFabrics, but I figured this would be a good place to
> start.
>
> On one node we received the errors below.  I'm not sure I understand
> the error sequence; hopefully someone can shed some light on what
> happened.
> 
> [[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
...
> network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)

Not really addressing your problem, but, with qlogic you should be using psm, 
not verbs (btl_openib).

That said, openib should work (slowly).

/Peter




[OMPI users] CQ errors

2011-01-10 Thread Michael Di Domenico
I'm not sure if these are being reported from OpenMPI or through
OpenMPI from OpenFabrics, but I figured this would be a good place to
start.

On one node we received the errors below.  I'm not sure I understand
the error sequence; hopefully someone can shed some light on what
happened.

[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id c30b100 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node26 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1755c900 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1779b180 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node20 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 8e1aa80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node24 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1164b600 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 118c3f80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node12 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1b8f0080 opcode 128 vendor error 0 qp_idx 0
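
For reading a wrapped log like the one above, a small script can help separate the two kinds of entries. The following is a rough sketch, not an Open MPI tool: the regular expression is an assumption based only on the message layout shown here, and the sample is the first three entries copied from the log. In the verbs API, status number 12 corresponds to IBV_WC_RETRY_EXC_ERR and status number 5 to IBV_WC_WR_FLUSH_ERR. Flush errors are usually a consequence of a queue pair having already gone into an error state, so the retry-exceeded entries are probably the ones worth looking at first.

```python
import re
from collections import Counter

# First three entries from the log, wrapped exactly as in the mail.
LOG = """\
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id c30b100 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node26 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1755c900 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1779b180 opcode 128 vendor error 0 qp_idx 0
"""

# Assumed layout, inferred from the messages above (not an official format).
PATTERN = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>HP|LP) CQ "
    r"with status (?P<status>.+?) status number (?P<num>\d+)"
)


def parse_cq_errors(text):
    """Undo the line wrapping, then pull out one dict per error entry."""
    flat = " ".join(text.split())
    return [m.groupdict() for m in PATTERN.finditer(flat)]


records = parse_cq_errors(LOG)
by_status = Counter(r["status"] for r in records)

for status, count in by_status.items():
    print(f"{count} x {status}")
for r in records:
    if r["num"] == "12":  # retry exceeded
        print(f"retry exceeded: {r['src']} -> {r['dst']} on {r['cq']} CQ")
```

Run against the full log, the same function would list which peers reported RETRY EXCEEDED, which may narrow down the link or node to inspect.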

It was the only node out of a 75-node run that spat out the error.  I
rechecked the node, found no symbol/link recovery errors on the network,
and ran Pallas between it and several other machines with no errors.

network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)

thanks