Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-26 Thread TERRY DONTJE



On 4/25/2012 1:00 PM, Jeff Squyres wrote:

> On Apr 25, 2012, at 12:51 PM, Ralph Castain wrote:
>
>> Sounds rather bizarre. Do you have lstopo on your machine? Might be useful to
>> see the output of that so we can understand what it thinks the topology is like
>> as this underpins the binding code.
>>
>> The -nooversubscribe option is a red herring here - it has nothing to do with
>> the problem, nor will it help.
>>
>> FWIW: if you aren't adding --bind-to-core, then OMPI isn't launching your process on any
>> specific core at all - we are simply launching it on the node. It sounds to me like your
>> code is incorrectly identifying "sharing" when a process isn't bound to a
>> specific core.
>
> +1
>
> Put differently: if you're not binding your processes to processor cores, then
> it's quite likely/possible that multiple processes *are* running on the same
> processor cores, at least intermittently, because the OS is allowed to migrate
> processes to whatever processor cores it wants to.

However, Kyle mentioned previously that he was using the -bind-to-core
option.  I would suggest adding -report-bindings to the mpirun command
line to see what mpirun really thinks it is binding to, if it is binding at all.


There is one piece of information that seems to be missing, and it is confusing me.
Kyle, how is your code determining that it is the only process bound to a core
or, conversely, that another process is bound to the same core?
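
For reference, one way a process can check its own binding on Linux is to ask the OS for its affinity mask. The sketch below is only an illustration, not what Kyle's code does (that has not been shared). It assumes Linux and Python's os.sched_getaffinity; the helper names allowed_cpus and is_bound_to_single_cpu are made up for this example. Note that the ids reported are logical CPUs (hardware threads), which are not necessarily physical cores.

```python
import os


def allowed_cpus():
    """Return the sorted list of logical CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return sorted(os.sched_getaffinity(0))  # 0 means "the calling process"
    # Assumption for other platforms: no affinity API, so treat all CPUs as allowed.
    return list(range(os.cpu_count() or 1))


def is_bound_to_single_cpu():
    """True only if the OS restricts this process to exactly one logical CPU."""
    return len(allowed_cpus()) == 1


if __name__ == "__main__":
    # OMPI_COMM_WORLD_RANK is set by Open MPI for each launched process;
    # it is absent when the script is run outside mpirun.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    print(f"rank {rank}: pid {os.getpid()} allowed on CPUs {allowed_cpus()}")
```

If every rank prints the full list of CPUs, the processes are unbound; that is different from all of them being forced onto one core.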


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Jeff Squyres
On Apr 25, 2012, at 12:51 PM, Ralph Castain wrote:

> Sounds rather bizarre. Do you have lstopo on your machine? Might be useful to 
> see the output of that so we can understand what it thinks the topology is 
> like as this underpins the binding code.
> 
> The -nooversubscribe option is a red herring here - it has nothing to do with 
> the problem, nor will it help.
> 
> FWIW: if you aren't adding --bind-to-core, then OMPI isn't launching your 
> process on any specific core at all - we are simply launching it on the node. 
> It sounds to me like your code is incorrectly identifying "sharing" when a 
> process isn't bound to a specific core.

+1

Put differently: if you're not binding your processes to processor cores, then 
it's quite likely/possible that multiple processes *are* running on the same 
processor cores, at least intermittently, because the OS is allowed to migrate 
processes to whatever processor cores it wants to.
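
To make "bound" versus "unbound" concrete at the OS level, here is a minimal sketch, assuming Linux and Python's os.sched_setaffinity. It is not how Open MPI implements --bind-to-core; the helper name pin_to_cpu is hypothetical. It only shows that an unbound process is allowed on many CPUs, and that binding shrinks that set to one.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to one logical CPU and return the new mask.

    Returns None where the affinity API is unavailable (non-Linux platforms).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # 0 means "the calling process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        before = os.sched_getaffinity(0)  # unbound: usually every CPU
        target = min(before)              # pick a CPU we are allowed to use
        after = pin_to_cpu(target)
        print(f"before: {sorted(before)}  after: {sorted(after)}")
        os.sched_setaffinity(0, before)   # restore the original mask
    else:
        print("CPU affinity calls are not available on this platform")
```

Until something like this is done for each process, the scheduler is free to move it between any of the CPUs in the "before" set.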

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Ralph Castain
Sounds rather bizarre. Do you have lstopo on your machine? Might be useful to 
see the output of that so we can understand what it thinks the topology is like 
as this underpins the binding code.

The -nooversubscribe option is a red herring here - it has nothing to do with 
the problem, nor will it help.

FWIW: if you aren't adding --bind-to-core, then OMPI isn't launching your 
process on any specific core at all - we are simply launching it on the node. 
It sounds to me like your code is incorrectly identifying "sharing" when a 
process isn't bound to a specific core.

On Apr 25, 2012, at 10:39 AM, Kyle Boe wrote:

> >I just re-read the thread. I think there's a little confusion between the 
> >terms "processor" and "MPI process" here. You said "As a pre-processing 
> >step, each processor must figure out which other processors it must 
> >communicate with by virtue of sharing neighboring grid points." Did you mean 
> >"MPI process" instead of "processor"? 
> 
> The code is designed to be run using only one MPI process per 
> core/slot/whatever word you want to use. I believe what is happening here is 
> that OMPI is launching all MPI processes on a single slot. This is why my code 
> is freaking out and telling me that a slot is asking for information it 
> already owns. So, in order to answer your second point:
> 
> >Secondly, if you're just running on a single machine with no scheduler and 
> >no hostfile, you should be able to: mpirun -np  
> >your_program_name When you get the "There are not enough slots available in 
> >the system..." message, that usually means that *something* is telling Open 
> >MPI a maximum number of processes that can be run, and your -np value is 
> >greater than that. This is *usually* a scheduler, but can also be a hostfile 
> >and/or an environment variable or file-based MCA parameter. 
> 
> I wanted to force MPI to assign only a single process per slot, so I 
> used the -nooversubscribe option. This is when I get the error about there 
> not being enough slots in the system to fulfill my request. I can use mpirun 
> with np set to whatever I want and it will launch successfully, but then my 
> code kills itself because the processes are being oversubscribed to a single 
> slot, which doesn't do me or my code any good at all.
> 
> So the problem is that even though I have 8, 24, and 48 core machines, OMPI 
> thinks each one of them only has a single core, and will launch all MPI 
> processes on that one core.
> 
> Kyle
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Kyle Boe
>I just re-read the thread. I think there's a little confusion between the
terms "processor" and "MPI process" here. You said "As a pre-processing
step, each processor must figure out which other processors it must
communicate with by virtue of sharing neighboring grid points." Did you
mean "MPI process" instead of "processor"?

The code is designed to be run using only one MPI process per
core/slot/whatever word you want to use. I believe what is happening here
is that OMPI is launching all MPI processes on a single slot. This is why
my code is freaking out and telling me that a slot is asking for
information it already owns. So, in order to answer your second point:

>Secondly, if you're just running on a single machine with no scheduler and
no hostfile, you should be able to: mpirun -np 
your_program_name When you get the "There are not enough slots available in
the system..." message, that usually means that *something* is telling Open
MPI a maximum number of processes that can be run, and your -np value is
greater than that. This is *usually* a scheduler, but can also be a hostfile
and/or an environment variable or file-based MCA parameter.

I wanted to force MPI to assign only a single process per slot, so I
used the -nooversubscribe option. This is when I get the error about there
not being enough slots in the system to fulfill my request. I can use
mpirun with np set to whatever I want and it will launch successfully, but
then my code kills itself because the processes are being oversubscribed to
a single slot, which doesn't do me or my code any good at all.

So the problem is that even though I have 8, 24, and 48 core machines, OMPI
thinks each one of them only has a single core, and will launch all MPI
processes on that one core.

Kyle


Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Jeff Squyres
I just re-read the thread.

I think there's a little confusion between the terms "processor" and "MPI 
process" here.  

You said "As a pre-processing step, each processor must figure out which other 
processors it must communicate with by virtue of sharing neighboring grid 
points."

Did you mean "MPI process" instead of "processor"?

Secondly, if you're just running on a single machine with no scheduler and no 
hostfile, you should be able to:

  mpirun -np  your_program_name

When you get the "There are not enough slots available in the system..." 
message, that usually means that *something* is telling Open MPI a maximum 
number of processes that can be run, and your -np value is greater than that.  
This is *usually* a scheduler, but can also be a hostfile and/or an environment 
variable or file-based MCA parameter.
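
As a rough illustration of that slot accounting, here is a toy model in Python. It is not Open MPI's actual implementation: the function names are hypothetical, only the "hostname slots=N" hostfile form is handled, and treating a host listed without "slots=" as one slot per listing is an assumption of this sketch.

```python
def parse_hostfile(text):
    """Map host name -> slot count from hostfile-style text (toy model)."""
    slots = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        count = 1  # assumption: a bare host name contributes one slot
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                count = int(value)
        slots[host] = slots.get(host, 0) + count
    return slots


def fits_without_oversubscribing(np, slots):
    """True if np processes fit in the declared slots (the no-oversubscribe case)."""
    return np <= sum(slots.values())


if __name__ == "__main__":
    slots = parse_hostfile("localhost slots=8  # e.g. a 2x4-core machine\n")
    for np in (2, 8, 9):
        print(np, fits_without_oversubscribing(np, slots))
```

In this model, a request for 2 processes fails only when something has capped the total at fewer than 2 slots, which is why it is worth finding out where that cap comes from.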


On Apr 25, 2012, at 12:24 PM, Kyle Boe wrote:

> > Any chance you could upgrade to Open MPI 1.5.5? It has a better version of 
> > the processor affinity stuff than the 1.4 series. 
> 
> Did this and recompiled everything that depended on OMPI. No difference 
> whatsoever. It still tells me, if I specify -np 2 for example, that "There 
> are not enough slots available in the system to satisfy the 2 slots 
> that were requested by the application." 
> 
> >My bad. I did not read the bottom part of the email. Not sure if this would 
> >help, but can you try --mca btl sm,self? 
> 
> This also does not change anything...
> 
> Really confused what is going on here!
> 
> Kyle


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Kyle Boe
> Any chance you could upgrade to Open MPI 1.5.5? It has a better version
of the processor affinity stuff than the 1.4 series.

Did this and recompiled everything that depended on OMPI. No difference
whatsoever. It still tells me, if I specify -np 2 for example, that "There
are not enough slots available in the system to satisfy the 2 slots
that were requested by the application."

>My bad. I did not read the bottom part of the email. Not sure if this
would help, but can you try --mca btl sm,self?

This also does not change anything...

Really confused what is going on here!

Kyle


Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Jingcha Joba
My bad. I did not read the bottom part of the email. 
Not sure if this would help, but can you try --mca btl sm,self?

--
Sent from my iPhone

On Apr 24, 2012, at 3:46 PM, Kyle Boe <boex0...@umn.edu> wrote:

> Right, I tried using a hostfile, and it made no difference. This is running 
> OpenMPI 1.4.4 on CentOS 5.x machines. The original issue was an error trap 
> built into my code, where it said one of the cores was asking for information 
> it already owned. I'm sorry to be vague, but I can't share anything from the 
> code in this forum. Basically, it is a CFD code, parallelized by splitting 
> the grid points in the simulation up amongst the processors assigned to the 
> job. As a pre-processing step, each processor must figure out which other 
> processors it must communicate with by virtue of sharing neighboring 
> grid points. The error I received told me that the grid points were not being 
> split amongst different processors. I have used this exact same code using 
> OpenMPI on other (larger) architectures, which, combined with the MPI error I 
> shared before, leads me to believe I must have something not configured 
> correctly, or there is some run time option I'm not setting properly, etc.
> 
> Thanks
> 
> Kyle
> 
> On Tue, Apr 24, 2012 at 4:15 PM, <users-requ...@open-mpi.org> wrote:
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] MPI doesn't recognize multiple cores
> available on multicore machines
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <f9d4fce5-9974-4814-9bcf-a39124961...@open-mpi.org>
> Content-Type: text/plain; charset=us-ascii
> 
> You don't need a hostfile to run multiple procs on the localhost.
> 
> What version of OMPI are you using? What was the original issue?
> 


Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-25 Thread Kyle Boe
I tried this and got the same result. Any other thing I might be missing...?

>Did you tell it --bind-to-core? If not, then the procs would be unbound to
any particular core - so your code might well think they are "sharing"
cores.


Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-24 Thread Kyle Boe
Right, I tried using a hostfile, and it made no difference. This is running
OpenMPI 1.4.4 on CentOS 5.x machines. The original issue was an error trap
built into my code, where it said one of the cores was asking for
information it already owned. I'm sorry to be vague, but I can't share
anything from the code in this forum. Basically, it is a CFD code,
parallelized by splitting the grid points in the simulation up amongst the
processors assigned to the job. As a pre-processing step, each processor
must figure out which other processors it must communicate with by virtue
of sharing neighboring grid points. The error I received told me that the
grid points were not being split amongst different processors. I have used
this exact same code using OpenMPI on other (larger) architectures, which,
combined with the MPI error I shared before, leads me to believe I must
have something not configured correctly, or there is some run time option
I'm not setting properly, etc.

Thanks

Kyle

On Tue, Apr 24, 2012 at 4:15 PM, <users-requ...@open-mpi.org> wrote:

> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] MPI doesn't recognize multiple cores
> available on multicore machines
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <f9d4fce5-9974-4814-9bcf-a39124961...@open-mpi.org>
> Content-Type: text/plain; charset=us-ascii
>
> You don't need a hostfile to run multiple procs on the localhost.
>
> What version of OMPI are you using? What was the original issue?
>
>


Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-24 Thread Ralph Castain
You don't need a hostfile to run multiple procs on the localhost.

What version of OMPI are you using? What was the original issue?

On Apr 24, 2012, at 4:07 PM, Jingcha Joba wrote:

> Try using slots in hostfile ?
> 
> --
> Sent from my iPhone
> 
> On Apr 24, 2012, at 2:52 PM, Kyle Boe wrote:
> 
>> I'm having a problem trying to use OpenMPI on some multicore machines I 
>> have. The code I am running was giving me errors which suggested that MPI 
>> was assigning multiple processes to the same core (which I do not want). So, 
>> I tried launching my job using the -nooversubscribe option, and I get this 
>> error:
>> 
>> bash-3.2$ mpirun -np 2 -nooversubscribe 
>> --
>> There are not enough slots available in the system to satisfy the 2 slots 
>> that were requested by the application:
>>  
>> 
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> mpirun: clean termination accomplished
>> 
>> I am just trying to run on the localhost, not on any remote machines. This 
>> happens on both my 8 (2*4) core and 24 (4*6) core machines. Relevant info: I 
>> am not using any type of scheduler here, although from the searching I've 
>> done that doesn't seem like a requirement. The only thing I can think of is 
>> that there must be some configuration or option I'm not setting for use 
>> on shared-memory machines (either at compile time or run time), but I can't 
>> find anyone else who has come across this error. Any thoughts?
>> 
>> Thanks,
>> 
>> Kyle




Re: [OMPI users] MPI doesn't recognize multiple cores available on multicore machines

2012-04-24 Thread Jingcha Joba
Try using slots in hostfile ?

--
Sent from my iPhone

On Apr 24, 2012, at 2:52 PM, Kyle Boe wrote:

> I'm having a problem trying to use OpenMPI on some multicore machines I have. 
> The code I am running was giving me errors which suggested that MPI was 
> assigning multiple processes to the same core (which I do not want). So, I 
> tried launching my job using the -nooversubscribe option, and I get this 
> error:
> 
> bash-3.2$ mpirun -np 2 -nooversubscribe 
> --
> There are not enough slots available in the system to satisfy the 2 slots 
> that were requested by the application:
>   
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
> 
> I am just trying to run on the localhost, not on any remote machines. This 
> happens on both my 8 (2*4) core and 24 (4*6) core machines. Relevant info: I 
> am not using any type of scheduler here, although from the searching I've 
> done that doesn't seem like a requirement. The only thing I can think of is 
> that there must be some configuration or option I'm not setting for use 
> on shared-memory machines (either at compile time or run time), but I can't 
> find anyone else who has come across this error. Any thoughts?
> 
> Thanks,
> 
> Kyle