Re: [OMPI users] compilation error with pgcc Unknown switch

2012-03-01 Thread Abhinav Sarje
yes, I did a full autogen, configure, make clean and make all


On Thu, Mar 1, 2012 at 10:03 PM, Jeffrey Squyres  wrote:
> Did you do a full autogen / configure / make clean / make all ?
>
>
> On Mar 1, 2012, at 8:53 AM, Abhinav Sarje wrote:
>
>> Thanks Ralph. That did help, but only till the next hurdle. Now the
>> build fails at the following point with an 'undefined reference':
>> ---
>> Making all in tools/ompi_info
>> make[2]: Entering directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
>>  CC     ompi_info.o
>>  CC     output.o
>>  CC     param.o
>>  CC     components.o
>>  CC     version.o
>>  CCLD   ompi_info
>> ../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64'
>> make[2]: *** [ompi_info] Error 2
>> make[2]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>> make: *** [all-recursive] Error 1
>> ---
>>
>>
>>
>>
>>
>>
>> On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castain  wrote:
>>> You need to update your source code - this was identified and fixed on Wed. 
>>> Unfortunately, our trunk is a developer's environment. While we try hard to 
>>> keep it fully functional, bugs do occasionally work their way into the code.
>>>
>>> On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote:
>>>
 Hi Nathan,

 I tried building on an internal login node, and it did not fail at the
 previous point. But, after compiling for a very long time, it failed
 while building libmpi.la, with a multiple definition error:
 --
 ...
  CC     mpiext/mpiext.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
  CC     
 mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
  CCLD   mpi/f77/base/libmpi_f77_base.la
  CCLD   libmpi.la
 mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o):
 In function `local_heap_sort':
 /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c::
 multiple definition of `local_heap_sort'
 mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929:
 first defined here
 make[2]: *** [libmpi.la] Error 2
 make[2]: Leaving directory
 `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory
 `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
 make: *** [all-recursive] Error 1
 --

 Any idea why this is happening, and how to fix it? Again, I am using
 the XE6 platform configuration file.

 Abhinav.

 On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm  wrote:
>
>
> On Mon, 27 Feb 2012, Abhinav Sarje wrote:
>
>> Hi Nathan, Gus, Manju,
>>
>> I got a chance to try out the XE6 support build, but with no success.
>> First I was getting this error: "PGC-F-0010-File write error occurred
>> (temporary pragma .s file)". After searching online about this error,
>> I saw that there is a patch at
>>
>> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch;
>> for this particular error.
>>
>> With the patched version, I did not get this error anymore, but got
>> the unknown switch flag error for the flag "-march=amdfam10"
>> (specified in the XE6 configuration in the dev trunk) at a particular
>> point even if I use the '-noswitcherror' flag with the pgcc compiler.
>>
>> If I remove this flag (-march=amdfam10), the build fails later at the
>> following point:
>> -
>> Making all in mca/ras/alps
>> make[2]: Entering directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>>  CC     ras_alps_component.lo
>>  CC     ras_alps_module.lo
>> PGC-F-0206-Can't find include file alps/apInfo.h
>> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
>> PGC/x86-64 Linux 11.10-0: compilation aborted
>> make[2]: *** [ras_alps_module.lo] Error 1
>> make[2]: Leaving directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
>> make: *** [all-recursive] Error 1
>> 

Re: [OMPI users] run orterun with more than 200 processes

2012-03-01 Thread Ralph Castain
You might try putting that list of hosts in a hostfile instead of on the cmd 
line - you may be hitting some limits there.
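
For example, a hostfile is just one host per line (optionally with a slot count) and is passed with --hostfile; the file name and slot counts below are made up:

---
# myhosts  (hypothetical hostfile)
hostname1.domain.com slots=8
hostname2.domain.com slots=8
...

orterun --prefix ./ -np 200 -wd ./ --hostfile myhosts CMD
---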

I also don't believe that you can add an orted in that manner - orterun will 
have no idea how it got there and is likely to abort.

On Mar 1, 2012, at 3:20 PM, Jianzhang He wrote:

> Hi,
>  
> I am not sure if this is the right place to post this question. If you know 
> where it is appropriate, please let me know.
>  
> I need to run an application that launches 200 processes with the command:
> 1)orterun --prefix ./ -np 200 -wd ./ -host 
> hostname1.domain.com,1,2,3,4,5,6,7,8,9,…..,196,197,198,199  CMD
>  
> Later, I will communicate with 1) using a command like:
> 2)orted -mca ess env -mca orte_ess_ -mca orte_ess_vpid 100 -mca 
> orte_ess_num_procs 200 --hnp-uri "job#;tcp:/ hostname1.domain.com /:port#"
>  
> The problem I have is I can only run with about 100 nodes. If the number is 
> higher, 1) will not invoke CMD and the total number of processes is about 130 
> or so.
>  
> My question is how to remove that limit?
>  
> Thanks in advance.
>  
> Jianzhang
>  



Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Ralph Castain
I don't know - I didn't write the app file code, and I've never seen anything 
defining its behavior. So I guess you could say it is intended - or not! :-/


On Mar 1, 2012, at 2:53 PM, Jeffrey Squyres wrote:

> Actually, I should say that I discovered that if you put --prefix on each 
> line of the app context file, then the first case (running the app context 
> file) works fine; it adheres to the --prefix behavior.
> 
> Ralph: is this intended behavior?  (I don't know if I have an opinion either 
> way)
> 
> 
> On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:
> 
>> I see the problem.
>> 
>> It looks like the use of the app context file is triggering different 
>> behavior, and that behavior is erasing the use of --prefix.  If I replace 
>> the app context file with a complete command line, it works and the --prefix 
>> behavior is observed.
>> 
>> Specifically:
>> 
>> $mpirunfile $mcaparams --app addmpw-hostname
>> 
>> ^^ This one seems to ignore --prefix behavior.
>> 
>> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
>> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 
>> -np 1 hostname
>> 
>> ^^ These two seem to adhere to the proper --prefix behavior.
>> 
>> Ralph -- can you have a look?
>> 
>> 
>> 
>> 
>> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
>> 
>>> Hi Ralph,
>>> 
>>> Thanks, here is what I did as suggested by Jeff:
>>> 
 What did this command line look like? Can you provide the configure line 
 as well? 
>>> 
>>> As in my previous post, the script as following:
>>> 
>>> (1) debug messages:
>> 
>>> yiguang@gulftown testdmp]$ ./test.bash
>>> [gulftown:28340] mca: base: components_open: Looking for plm components
>>> [gulftown:28340] mca: base: components_open: opening plm components
>>> [gulftown:28340] mca: base: components_open: found loaded component rsh
>>> [gulftown:28340] mca: base: components_open: component rsh has no register 
>>> function
>>> [gulftown:28340] mca: base: components_open: component rsh open function 
>>> successful
>>> [gulftown:28340] mca: base: components_open: found loaded component slurm
>>> [gulftown:28340] mca: base: components_open: component slurm has no 
>>> register function
>>> [gulftown:28340] mca: base: components_open: component slurm open function 
>>> successful
>>> [gulftown:28340] mca: base: components_open: found loaded component tm
>>> [gulftown:28340] mca: base: components_open: component tm has no register 
>>> function
>>> [gulftown:28340] mca: base: components_open: component tm open function 
>>> successful
>>> [gulftown:28340] mca:base:select: Auto-selecting plm components
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
>>> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
>>> [gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
>>> failed to return a module
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
>>> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query 
>>> failed to return a module
>>> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
>>> [gulftown:28340] mca: base: close: component slurm closed
>>> [gulftown:28340] mca: base: close: unloading component slurm
>>> [gulftown:28340] mca: base: close: component tm closed
>>> [gulftown:28340] mca: base: close: unloading component tm
>>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
>>> 3546479048
>>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
>>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
>>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
>>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
>>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
>>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
>>> shell
>>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
>>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>>>  /usr/bin/rsh   orted --daemonize -mca ess env -mca 
>>> orte_ess_jobid 1142816768 -mca 
>>> orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri 
>>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>>  -
>>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>>> btl openib,sm,self --mca 
>>> orte_tmpdir_base /tmp --mca plm_base_verbose 100
>>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node 
>>> gulftown
>>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
>>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
>>> [[17438,0],1]
>>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
>>> [/usr/bin/rsh ibnode001  orted --daemonize -mca 
>>> ess env -mca orte_ess_jobid 1142816768 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan

> Actually, I should say that I discovered that if you put --prefix on each 
> line of the app context file, then the first
> case (running the app context file) works fine; it adheres to the --prefix 
> behavior. 

Yes, I confirmed this on our cluster. It works with --prefix on each line of 
the app file.
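
For the record, each line of the app file then carries its own --prefix; roughly like this, with the hosts taken from Jeff's example and the install path being just a placeholder:

---
# addmpw-hostname  (hypothetical contents)
-np 1 --host svbu-mpi    --prefix /opt/openmpi hostname
-np 1 --host svbu-mpi001 --prefix /opt/openmpi hostname
---

and launched with: $mpirunfile $mcaparams --app addmpw-hostname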


[OMPI users] run orterun with more than 200 processes

2012-03-01 Thread Jianzhang He
Hi,

I am not sure if this is the right place to post this question. If you know 
where it is appropriate, please let me know.

I need to run an application that launches 200 processes with the command:

1)orterun --prefix ./ -np 200 -wd ./ -host 
hostname1.domain.com,1,2,3,4,5,6,7,8,9,…..,196,197,198,199  CMD

Later, I will communicate with 1) using a command like:

2)orted -mca ess env -mca orte_ess_ -mca orte_ess_vpid 100 -mca 
orte_ess_num_procs 200 --hnp-uri "job#;tcp:/ hostname1.domain.com /:port#"

The problem I have is I can only run with about 100 nodes. If the number is 
higher, 1) will not invoke CMD and the total number of processes is about 130 
or so.

My question is how to remove that limit?

Thanks in advance.

Jianzhang



Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Jeffrey Squyres
Actually, I should say that I discovered that if you put --prefix on each line 
of the app context file, then the first case (running the app context file) 
works fine; it adheres to the --prefix behavior.

Ralph: is this intended behavior?  (I don't know if I have an opinion either 
way)


On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:

> I see the problem.
> 
> It looks like the use of the app context file is triggering different 
> behavior, and that behavior is erasing the use of --prefix.  If I replace the 
> app context file with a complete command line, it works and the --prefix 
> behavior is observed.
> 
> Specifically:
> 
> $mpirunfile $mcaparams --app addmpw-hostname
> 
> ^^ This one seems to ignore --prefix behavior.
> 
> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 
> -np 1 hostname
> 
> ^^ These two seem to adhere to the proper --prefix behavior.
> 
> Ralph -- can you have a look?
> 
> 
> 
> 
> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
> 
>> Hi Ralph,
>> 
>> Thanks, here is what I did as suggested by Jeff:
>> 
>>> What did this command line look like? Can you provide the configure line as 
>>> well? 
>> 
>> As in my previous post, the script as following:
>> 
>> (1) debug messages:
> 
>> yiguang@gulftown testdmp]$ ./test.bash
>> [gulftown:28340] mca: base: components_open: Looking for plm components
>> [gulftown:28340] mca: base: components_open: opening plm components
>> [gulftown:28340] mca: base: components_open: found loaded component rsh
>> [gulftown:28340] mca: base: components_open: component rsh has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component rsh open function 
>> successful
>> [gulftown:28340] mca: base: components_open: found loaded component slurm
>> [gulftown:28340] mca: base: components_open: component slurm has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component slurm open function 
>> successful
>> [gulftown:28340] mca: base: components_open: found loaded component tm
>> [gulftown:28340] mca: base: components_open: component tm has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component tm open function 
>> successful
>> [gulftown:28340] mca:base:select: Auto-selecting plm components
>> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
>> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
>> [gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
>> failed to return a module
>> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
>> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query 
>> failed to return a module
>> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
>> [gulftown:28340] mca: base: close: component slurm closed
>> [gulftown:28340] mca: base: close: unloading component slurm
>> [gulftown:28340] mca: base: close: component tm closed
>> [gulftown:28340] mca: base: close: unloading component tm
>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
>> 3546479048
>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
>> shell
>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>>   /usr/bin/rsh   orted --daemonize -mca ess env -mca 
>> orte_ess_jobid 1142816768 -mca 
>> orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri 
>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>  -
>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>> btl openib,sm,self --mca 
>> orte_tmpdir_base /tmp --mca plm_base_verbose 100
>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node 
>> gulftown
>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
>> [[17438,0],1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
>> [/usr/bin/rsh ibnode001  orted --daemonize -mca 
>> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
>> orte_ess_num_procs 4 --hnp-uri 
>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>  -
>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>> btl openib,sm,self --mca 
>> orte_tmpdir_base /tmp --mca 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan
Hi Ralph,

Thanks, here is what I did as suggested by Jeff:

> What did this command line look like? Can you provide the configure line as 
> well? 

As in my previous post, the script as following:

(1) debug messages:
>>>
yiguang@gulftown testdmp]$ ./test.bash
[gulftown:28340] mca: base: components_open: Looking for plm components
[gulftown:28340] mca: base: components_open: opening plm components
[gulftown:28340] mca: base: components_open: found loaded component rsh
[gulftown:28340] mca: base: components_open: component rsh has no register 
function
[gulftown:28340] mca: base: components_open: component rsh open function 
successful
[gulftown:28340] mca: base: components_open: found loaded component slurm
[gulftown:28340] mca: base: components_open: component slurm has no register 
function
[gulftown:28340] mca: base: components_open: component slurm open function 
successful
[gulftown:28340] mca: base: components_open: found loaded component tm
[gulftown:28340] mca: base: components_open: component tm has no register 
function
[gulftown:28340] mca: base: components_open: component tm open function 
successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
[gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[gulftown:28340] mca:base:select:(  plm) Querying component [tm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query failed 
to return a module
[gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
/usr/bin/rsh   orted --daemonize -mca ess env -mca 
orte_ess_jobid 1142816768 -mca 
orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
 -
-mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl 
openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node 
gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode001  orted --daemonize -mca 
ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
 -
-mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl 
openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode002  orted --daemonize -mca 
ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
 -
-mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl 
openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode003  orted --daemonize -mca 
ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
 -
-mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl 
openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100]
[gulftown:28340] [[17438,0],0] 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Ralph Castain
What did this command line look like? Can you provide the configure line as 
well?

On Mar 1, 2012, at 12:46 PM, Yiguang Yan wrote:

> Hi Jeff,
> 
> Here I made a developer build, and then got the following message 
> with plm_base_verbose:
> 
 
> [gulftown:28340] mca: base: components_open: Looking for plm 
> components
> [gulftown:28340] mca: base: components_open: opening plm 
> components
> [gulftown:28340] mca: base: components_open: found loaded 
> component rsh
> [gulftown:28340] mca: base: components_open: component rsh 
> has no register function
> [gulftown:28340] mca: base: components_open: component rsh 
> open function successful
> [gulftown:28340] mca: base: components_open: found loaded 
> component slurm
> [gulftown:28340] mca: base: components_open: component slurm 
> has no register function
> [gulftown:28340] mca: base: components_open: component slurm 
> open function successful
> [gulftown:28340] mca: base: components_open: found loaded 
> component tm
> [gulftown:28340] mca: base: components_open: component tm 
> has no register function
> [gulftown:28340] mca: base: components_open: component tm 
> open function successful
> [gulftown:28340] mca:base:select: Auto-selecting plm components
> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [gulftown:28340] mca:base:select:(  plm) Querying component 
> [slurm]
> [gulftown:28340] mca:base:select:(  plm) Skipping component 
> [slurm]. Query failed to return a module
> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
> [gulftown:28340] mca: base: close: component slurm closed
> [gulftown:28340] mca: base: close: unloading component slurm
> [gulftown:28340] mca: base: close: component tm closed
> [gulftown:28340] mca: base: close: unloading component tm
> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 
> nodename hash 3546479048
> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote 
> shell as local shell
> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>/usr/bin/rsh   orted --daemonize -mca ess env -
> mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -
> mca orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100
> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already 
> exists on node gulftown
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode001
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
> [[17438,0],1]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode001  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode002
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
> [[17438,0],2]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode002  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode003
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode003  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan
Hi Jeff,

Here I made a developer build, and then got the following message 
with plm_base_verbose:

>>>
[gulftown:28340] mca: base: components_open: Looking for plm 
components
[gulftown:28340] mca: base: components_open: opening plm 
components
[gulftown:28340] mca: base: components_open: found loaded 
component rsh
[gulftown:28340] mca: base: components_open: component rsh 
has no register function
[gulftown:28340] mca: base: components_open: component rsh 
open function successful
[gulftown:28340] mca: base: components_open: found loaded 
component slurm
[gulftown:28340] mca: base: components_open: component slurm 
has no register function
[gulftown:28340] mca: base: components_open: component slurm 
open function successful
[gulftown:28340] mca: base: components_open: found loaded 
component tm
[gulftown:28340] mca: base: components_open: component tm 
has no register function
[gulftown:28340] mca: base: components_open: component tm 
open function successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
[gulftown:28340] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[gulftown:28340] mca:base:select:(  plm) Querying component 
[slurm]
[gulftown:28340] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[gulftown:28340] mca:base:select:(  plm) Querying component [tm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 
nodename hash 3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote 
shell as local shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
/usr/bin/rsh   orted --daemonize -mca ess env -
mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -
mca orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already 
exists on node gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
[[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode001  orted --daemonize -mca ess env -mca 
orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
[[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode002  orted --daemonize -mca ess env -mca 
orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
[/usr/bin/rsh ibnode003  orted --daemonize -mca ess env -mca 
orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca 
orte_ess_num_procs 4 --hnp-uri 
"1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
plm_base_verbose 100]
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
[[17438,0],3]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:base:daemon_callback
<<<


It 

Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Ralph Castain
Also, the sequential mapper may be of help - it allows you to specify the node each
rank is to be placed on, one line per rank.
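
A sketch of that usage, assuming the seq mapper is selected with "--mca rmaps seq" and reads the hostfile one line per rank, in rank order (file and node names are placeholders):

---
# hosts.seq  (hypothetical)
node1
node2
node1
node2

mpirun -np 4 --hostfile hosts.seq --mca rmaps seq ./app
---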


On Mar 1, 2012, at 12:40 PM, Gustavo Correa wrote:

> Hi Claudio
> 
> Check 'man mpirun'.  
> You will find examples of the
> '-byslot', '-bynode', '-loadbalance', and rankfile options, 
> which allow some control of how ranks are mapped into processors/cores.
> 
> I hope this helps,
> Gus Correa
> 
> On Mar 1, 2012, at 2:34 PM, Claudio Pastorino wrote:
> 
>> Hi, thanks for the answer.
>> You are right is not the rank what matters but how do I arrange
>> the physical procs in the cartesian topology. I don't care about the label.
>> So, how do I achieve that?
>> 
>> Regards,
>> Claudio
>> 
>> 
>> 
>> 2012/3/1, Ralph Castain :
>>> Is it really the rank that matters, or where the rank is located? For
>>> example, you could leave the ranks as assigned by the cartesian topology,
>>> but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node,
>>> etc.
>>> 
>>> Is that what you are trying to achieve?
>>> 
>>> 
>>> On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote:
>>> 
 Dear all,
 I apologize in advance if this is not the right list to post this. I
 am a newcomer and please let me know if I should be sending this to
 another list.
 
 I program MPI trying to do HPC parallel programs. In particular I
 wrote a parallel code
 for molecular dynamics simulations. The program splits the work in a
 matrix of procs and
 I send messages along rows and columns in an equal basis. I learnt
 that the typical
 arrangement of  cartesian  topology is not usually  the best option,
 because in a matrix, let's  say of 4x4 procs   with quad procs, the
 procs are arranged so that
 through columns one stays inside the same quad proc and through rows
 you are always going out to the network.  This means procs are
 arranged as one quad per row.
 
 I try to explain this for a 2x2 case. The cartesian topology does this
 assignment, typically:
 cartesian    mpi_comm_world
 0,0 -->  0
 0,1 -->  1
 1,0 -->  2
 1,1 -->  3
 The question is, how do I get a "user defined" assignment such as:
 0,0 -->  0
 0,1 -->  2
 1,0 -->  1
 1,1 -->  3
 
 ?
 
 Thanks in advance and I hope to have  made this more or less
 understandable.
 Claudio




Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Gustavo Correa
Hi Claudio

Check 'man mpirun'.  
You will find examples of the
'-byslot', '-bynode', '-loadbalance', and rankfile options, 
which allow some control of how ranks are mapped into processors/cores.
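
For instance, a rankfile pins each rank to an explicit host (and optionally a slot); the file and node names below are placeholders, and the exact syntax is in 'man mpirun':

---
# myrankfile  (hypothetical)
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1

mpirun -np 4 -rf myrankfile ./app
---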

I hope this helps,
Gus Correa

On Mar 1, 2012, at 2:34 PM, Claudio Pastorino wrote:

> Hi, thanks for the answer.
> You are right: it is not the rank that matters but how I arrange
> the physical procs in the cartesian topology. I don't care about the label.
> So, how do I achieve that?
> 
> Regards,
> Claudio
> 
> 
> 
> 2012/3/1, Ralph Castain :
>> Is it really the rank that matters, or where the rank is located? For
>> example, you could leave the ranks as assigned by the cartesian topology,
>> but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node,
>> etc.
>> 
>> Is that what you are trying to achieve?
>> 
>> 
>> On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote:
>> 
>>> Dear all,
>>> I apologize in advance if this is not the right list to post this. I
>>> am a newcomer and please let me know if I should be sending this to
>>> another list.
>>> 
>>> I program MPI trying to do HPC parallel programs. In particular I
>>> wrote a parallel code
>>> for molecular dynamics simulations. The program splits the work in a
>>> matrix of procs and
>>> I send messages along rows and columns in an equal basis. I learnt
>>> that the typical
>>> arrangement of  cartesian  topology is not usually  the best option,
>>> because in a matrix, let's  say of 4x4 procs   with quad procs, the
>>> procs are arranged so that
>>> through columns one stays inside the same quad proc and through rows
>>> you are always going out to the network.  This means procs are
>>> arranged as one quad per row.
>>> 
>>> I try to explain this for a 2x2 case. The cartesian topology does this
>>> assignment, typically:
>>> cartesian    mpi_comm_world
>>> 0,0 -->  0
>>> 0,1 -->  1
>>> 1,0 -->  2
>>> 1,1 -->  3
>>> The question is, how do I get a "user defined" assignment such as:
>>> 0,0 -->  0
>>> 0,1 -->  2
>>> 1,0 -->  1
>>> 1,1 -->  3
>>> 
>>> ?
>>> 
>>> Thanks in advance and I hope to have  made this more or less
>>> understandable.
>>> Claudio




Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Claudio Pastorino
Probably yes,
do I have a more systematic way?
Thanks
Claudio


2012/3/1, Jingcha Joba :
> mpirun -np 4 --host node1,node2,node1,node2 ./app
>
> Is this what you want?
>
> On Thu, Mar 1, 2012 at 10:57 AM, Claudio Pastorino <
> claudio.pastor...@gmail.com> wrote:
>
>> Dear all,
>> I apologize in advance if this is not the right list to post this. I
>> am a newcomer and please let me know if I should be sending this to
>> another list.
>>
>> I program MPI trying to do HPC parallel programs. In particular I
>> wrote a parallel code
>> for molecular dynamics simulations. The program splits the work in a
>> matrix of procs and
>> I send messages along rows and columns in an equal basis. I learnt
>> that the typical
>> arrangement of  cartesian  topology is not usually  the best option,
>> because in a matrix, let's  say of 4x4 procs   with quad procs, the
>> procs are arranged so that
>> through columns one stays inside the same quad proc and through rows
>> you are always going out to the network.  This means procs are
>> arranged as one quad per row.
>>
>> I try to explain this for a 2x2 case. The cartesian topology does this
>> assignment, typically:
>> cartesian    mpi_comm_world
>> 0,0 -->  0
>> 0,1 -->  1
>> 1,0 -->  2
>> 1,1 -->  3
>> The question is, how do I get a "user defined" assignment such as:
>> 0,0 -->  0
>> 0,1 -->  2
>> 1,0 -->  1
>> 1,1 -->  3
>>
>> ?
>>
>> Thanks in advance and I hope to have  made this more or less
>> understandable.
>> Claudio


Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Claudio Pastorino
Hi, thanks for the answer.
You are right: it is not the rank that matters but how I arrange
the physical procs in the cartesian topology. I don't care about the label.
So, how do I achieve that?

Regards,
Claudio



2012/3/1, Ralph Castain :
> Is it really the rank that matters, or where the rank is located? For
> example, you could leave the ranks as assigned by the cartesian topology,
> but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node,
> etc.
>
> Is that what you are trying to achieve?
>
>
> On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote:
>
>> Dear all,
>> I apologize in advance if this is not the right list to post this. I
>> am a newcomer and please let me know if I should be sending this to
>> another list.
>>
>> I program MPI trying to do HPC parallel programs. In particular I
>> wrote a parallel code
>> for molecular dynamics simulations. The program splits the work in a
>> matrix of procs and
>> I send messages along rows and columns in an equal basis. I learnt
>> that the typical
>> arrangement of  cartesian  topology is not usually  the best option,
>> because in a matrix, let's  say of 4x4 procs   with quad procs, the
>> procs are arranged so that
>> through columns one stays inside the same quad proc and through rows
>> you are always going out to the network.  This means procs are
>> arranged as one quad per row.
>>
>> I try to explain this for a 2x2 case. The cartesian topology does this
>> assignment, typically:
>> cartesian    mpi_comm_world
>> 0,0 -->  0
>> 0,1 -->  1
>> 1,0 -->  2
>> 1,1 -->  3
>> The question is, how do I get a "user defined" assignment such as:
>> 0,0 -->  0
>> 0,1 -->  2
>> 1,0 -->  1
>> 1,1 -->  3
>>
>> ?
>>
>> Thanks in advance and I hope to have  made this more or less
>> understandable.
>> Claudio


Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Jingcha Joba
mpirun -np 4 --host node1,node2,node1,node2 ./app

Is this what you want?

On Thu, Mar 1, 2012 at 10:57 AM, Claudio Pastorino <
claudio.pastor...@gmail.com> wrote:

> Dear all,
> I apologize in advance if this is not the right list to post this. I
> am a newcomer and please let me know if I should be sending this to
> another list.
>
> I program MPI trying to do HPC parallel programs. In particular I
> wrote a parallel code
> for molecular dynamics simulations. The program splits the work in a
> matrix of procs and
> I send messages along rows and columns in an equal basis. I learnt
> that the typical
> arrangement of  cartesian  topology is not usually  the best option,
> because in a matrix, let's  say of 4x4 procs   with quad procs, the
> procs are arranged so that
> through columns one stays inside the same quad proc and through rows
> you are always going out to the network.  This means procs are
> arranged as one quad per row.
>
> I try to explain this for a 2x2 case. The cartesian topology does this
> assignment, typically:
> cartesian    mpi_comm_world
> 0,0 -->  0
> 0,1 -->  1
> 1,0 -->  2
> 1,1 -->  3
> The question is, how do I get a "user defined" assignment such as:
> 0,0 -->  0
> 0,1 -->  2
> 1,0 -->  1
> 1,1 -->  3
>
> ?
>
> Thanks in advance and I hope to have  made this more or less
> understandable.
> Claudio


Re: [OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Ralph Castain
Is it really the rank that matters, or where the rank is located? For example, 
you could leave the ranks as assigned by the cartesian topology, but then map 
them so that ranks 0 and 2 share a node, 1 and 3 share a node, etc.

Is that what you are trying to achieve?
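
If it is, one way to pin the assignment down yourself (a sketch, not the only option) is to reorder the ranks explicitly with MPI_Comm_split and then call MPI_Cart_create with reorder = 0 so MPI keeps your ordering:

---
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Desired cartesian rank for each world rank, for the 2x2 case in
       the original post: world 0 -> cart 0, 1 -> 2, 2 -> 1, 3 -> 3.   */
    int key[] = {0, 2, 1, 3};

    /* Same color for everyone; the key dictates the new ordering.     */
    MPI_Comm reordered;
    MPI_Comm_split(MPI_COMM_WORLD, 0, key[world_rank], &reordered);

    /* reorder = 0 so the cartesian ranks follow 'reordered' exactly.  */
    int dims[2] = {2, 2}, periods[2] = {0, 0}, coords[2], cart_rank;
    MPI_Comm cart;
    MPI_Cart_create(reordered, 2, dims, periods, 0, &cart);

    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    printf("world %d -> coords (%d,%d)\n", world_rank, coords[0], coords[1]);

    MPI_Comm_free(&cart);
    MPI_Comm_free(&reordered);
    MPI_Finalize();
    return 0;
}
---

Run it with -np 4; combined with a placement option (rankfile, -bynode, etc.) that gives control over both the coordinates and where each rank physically lands.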


On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote:

> Dear all,
> I apologize in advance if this is not the right list to post this. I
> am a newcomer and please let me know if I should be sending this to
> another list.
> 
> I program MPI trying to do HPC parallel programs. In particular I
> wrote a parallel code
> for molecular dynamics simulations. The program splits the work in a
> matrix of procs and
> I send messages along rows and columns in an equal basis. I learnt
> that the typical
> arrangement of  cartesian  topology is not usually  the best option,
> because in a matrix, let's  say of 4x4 procs   with quad procs, the
> procs are arranged so that
> through columns one stays inside the same quad proc and through rows
> you are always going out to the network.  This means procs are
> arranged as one quad per row.
> 
> I try to explain this for a 2x2 case. The cartesian topology does this
> assignment, typically:
> cartesian    mpi_comm_world
> 0,0 -->  0
> 0,1 -->  1
> 1,0 -->  2
> 1,1 -->  3
> The question is, how do I get a "user defined" assignment such as:
> 0,0 -->  0
> 0,1 -->  2
> 1,0 -->  1
> 1,1 -->  3
> 
> ?
> 
> Thanks in advance and I hope to have  made this more or less understandable.
> Claudio




[OMPI users] Redefine proc in cartesian topologies

2012-03-01 Thread Claudio Pastorino
Dear all,
I apologize in advance if this is not the right list to post this. I
am a newcomer and please let me know if I should be sending this to
another list.

I program MPI trying to do HPC parallel programs. In particular I
wrote a parallel code
for molecular dynamics simulations. The program splits the work in a
matrix of procs and
I send messages along rows and columns in an equal basis. I learnt
that the typical
arrangement of  cartesian  topology is not usually  the best option,
because in a matrix, let's  say of 4x4 procs   with quad procs, the
procs are arranged so that
through columns one stays inside the same quad proc and through rows
you are always going out to the network.  This means procs are
arranged as one quad per row.

I try to explain this for a 2x2 case. The cartesian topology does this
assignment, typically:
cartesian    mpi_comm_world
0,0 -->  0
0,1 -->  1
1,0 -->  2
1,1 -->  3
The question is, how do I get a "user defined" assignment such as:
0,0 -->  0
0,1 -->  2
1,0 -->  1
1,1 -->  3

?

Thanks in advance and I hope to have  made this more or less understandable.
Claudio


Re: [OMPI users] compilation error with pgcc Unknown switch

2012-03-01 Thread Jeffrey Squyres
Did you do a full autogen / configure / make clean / make all ?
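
I.e., from the top of the freshly updated checkout, something along these lines (the configure arguments are whatever you normally use, e.g. your XE6 platform file):

---
./autogen.pl          # or ./autogen.sh on older trees
./configure [your usual options]
make clean
make all
---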


On Mar 1, 2012, at 8:53 AM, Abhinav Sarje wrote:

> Thanks Ralph. That did help, but only till the next hurdle. Now the
> build fails at the following point with an 'undefined reference':
> ---
> Making all in tools/ompi_info
> make[2]: Entering directory
> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
>  CC ompi_info.o
>  CC output.o
>  CC param.o
>  CC components.o
>  CC version.o
>  CCLD   ompi_info
> ../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64'
> make[2]: *** [ompi_info] Error 2
> make[2]: Leaving directory
> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
> make: *** [all-recursive] Error 1
> ---
> 
> 
> 
> 
> 
> 
> On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castain  wrote:
>> You need to update your source code - this was identified and fixed on Wed. 
>> Unfortunately, our trunk is a developer's environment. While we try hard to 
>> keep it fully functional, bugs do occasionally work their way into the code.
>> 
>> On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote:
>> 
>>> Hi Nathan,
>>> 
>>> I tried building on an internal login node, and it did not fail at the
>>> previous point. But, after compiling for a very long time, it failed
>>> while building libmpi.la, with a multiple definition error:
>>> --
>>> ...
>>>  CC mpiext/mpiext.lo
>>>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
>>>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
>>>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
>>>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
>>>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
>>>  CCLD   mpi/f77/base/libmpi_f77_base.la
>>>  CCLD   libmpi.la
>>> mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o):
>>> In function `local_heap_sort':
>>> /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c::
>>> multiple definition of `local_heap_sort'
>>> mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929:
>>> first defined here
>>> make[2]: *** [libmpi.la] Error 2
>>> make[2]: Leaving directory
>>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory
>>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>>> make: *** [all-recursive] Error 1
>>> --
>>> 
>>> Any idea why this is happening, and how to fix it? Again, I am using
>>> the XE6 platform configuration file.
>>> 
>>> Abhinav.
>>> 
>>> On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm  wrote:
 
 
 On Mon, 27 Feb 2012, Abhinav Sarje wrote:
 
> Hi Nathan, Gus, Manju,
> 
> I got a chance to try out the XE6 support build, but with no success.
> First I was getting this error: "PGC-F-0010-File write error occurred
> (temporary pragma .s file)". After searching online about this error,
> I saw that there is a patch at
> 
> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch;
> for this particular error.
> 
> With the patched version, I did not get this error anymore, but got
> the unknown switch flag error for the flag "-march=amdfam10"
> (specified in the XE6 configuration in the dev trunk) at a particular
> point even if I use the '-noswitcherror' flag with the pgcc compiler.
> 
> If I remove this flag (-march=amdfam10), the build fails later at the
> following point:
> -
> Making all in mca/ras/alps
> make[2]: Entering directory
> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>  CC ras_alps_component.lo
>  CC ras_alps_module.lo
> PGC-F-0206-Can't find include file alps/apInfo.h
> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
> PGC/x86-64 Linux 11.10-0: compilation aborted
> make[2]: *** [ras_alps_module.lo] Error 1
> make[2]: Leaving directory
> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
> make: *** [all-recursive] Error 1
> --
 
 
 This is a known issue with Cray's frontend environment. Build on one of the
 internal login nodes.
 
 
 -Nathan
 

Re: [OMPI users] compilation error with pgcc Unknown switch

2012-03-01 Thread Abhinav Sarje
Thanks Ralph. That did help, but only till the next hurdle. Now the
build fails at the following point with an 'undefined reference':
---
Making all in tools/ompi_info
make[2]: Entering directory
`/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
  CC ompi_info.o
  CC output.o
  CC param.o
  CC components.o
  CC version.o
  CCLD   ompi_info
../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64'
make[2]: *** [ompi_info] Error 2
make[2]: Leaving directory
`/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
make: *** [all-recursive] Error 1
---






On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castain  wrote:
> You need to update your source code - this was identified and fixed on Wed. 
> Unfortunately, our trunk is a developer's environment. While we try hard to 
> keep it fully functional, bugs do occasionally work their way into the code.
>
> On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote:
>
>> Hi Nathan,
>>
>> I tried building on an internal login node, and it did not fail at the
>> previous point. But, after compiling for a very long time, it failed
>> while building libmpi.la, with a multiple definition error:
>> --
>> ...
>>  CC     mpiext/mpiext.lo
>>  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
>>  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
>>  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
>>  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
>>  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
>>  CCLD   mpi/f77/base/libmpi_f77_base.la
>>  CCLD   libmpi.la
>> mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o):
>> In function `local_heap_sort':
>> /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c::
>> multiple definition of `local_heap_sort'
>> mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929:
>> first defined here
>> make[2]: *** [libmpi.la] Error 2
>> make[2]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>> make: *** [all-recursive] Error 1
>> --
>>
>> Any idea why this is happening, and how to fix it? Again, I am using
>> the XE6 platform configuration file.
>>
>> Abhinav.
>>
>> On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm  wrote:
>>>
>>>
>>> On Mon, 27 Feb 2012, Abhinav Sarje wrote:
>>>
 Hi Nathan, Gus, Manju,

 I got a chance to try out the XE6 support build, but with no success.
 First I was getting this error: "PGC-F-0010-File write error occurred
 (temporary pragma .s file)". After searching online about this error,
 I saw that there is a patch at

 "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch;
 for this particular error.

 With the patched version, I did not get this error anymore, but got
 the unknown switch flag error for the flag "-march=amdfam10"
 (specified in the XE6 configuration in the dev trunk) at a particular
 point even if I use the '-noswitcherror' flag with the pgcc compiler.

 If I remove this flag (-march=amdfam10), the build fails later at the
 following point:
 -
 Making all in mca/ras/alps
 make[2]: Entering directory
 `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
  CC     ras_alps_component.lo
  CC     ras_alps_module.lo
 PGC-F-0206-Can't find include file alps/apInfo.h
 (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
 PGC/x86-64 Linux 11.10-0: compilation aborted
 make[2]: *** [ras_alps_module.lo] Error 1
 make[2]: Leaving directory
 `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
 make: *** [all-recursive] Error 1
 --
>>>
>>>
>>> This is a known issue with Cray's frontend environment. Build on one of the
>>> internal login nodes.
>>>
>>>
>>> -Nathan
>>>
>
>
> 

Re: [OMPI users] Simple question on GRID

2012-03-01 Thread Mohamed Adel
You can use CyberIntegrator (http://isda.ncsa.uiuc.edu/cyberintegrator/) 
developed by NCSA, or UNICORE (http://www.unicore.eu/) developed by Julich to 
integrate resources.

best,
madel

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Shaandar Nyamtulga
Sent: Thursday, March 01, 2012 7:10 AM
To: us...@open-mpi.org
Subject: [OMPI users] Simple question on GRID

Hi
I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is MPICH2).
They run separately in their local network environments. I know there is a way to
integrate them over the Internet, presumably with Grid software,
I guess. Is there any tutorial on how to do this?




Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-01 Thread Jeffrey Squyres
On Mar 1, 2012, at 1:17 AM, Jingcha Joba wrote:

> Aah...
> So when openMPI is compile with OFED, and run on a Infiniband/RoCE devices, I 
> would use the mpi would simply direct to ofed to do point to point calls in 
> the ofed way?

I'm not quite sure how to parse that.  :-)

The openib BTL uses verbs functions to effect data transfers between MPI 
process peers.  The BTL is one of the lower layers in Open MPI for 
point-to-point communication; BTL plugins are used to effect the 
device-specific transport stuff for MPI_SEND, MPI_RECV, MPI_PUT, ...etc.  
Hence, when you run with the openib BTL and call MPI_SEND (assumedly to a peer 
that is reachable via an OpenFabrics device), the openib BTL will eventually be 
called to actually send the message.  The openib BTL will send the message to 
the peer via some combination of verbs function calls.
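
To make that concrete, here is a minimal, hypothetical sketch (not taken from 
the Open MPI sources; the file name is made up) of what the application side 
looks like. The program only calls MPI_Send / MPI_Recv; whichever BTL Open MPI 
selects at run time (openib, tcp, sm, etc.) decides how the bytes actually move:

-----
/* sendrecv_sketch.c - hypothetical example, not part of Open MPI */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one int to rank 1; the selected BTL picks the wire protocol. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
-----

Nothing in that code names a transport; the transport choice happens inside the 
library when the BTLs are selected.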

Mellanox has also introduced a library called "MXM" that can also be used for 
underlying MPI message transport (as opposed to using the openib BTL).  See the 
Open MPI README for some explanations about the different transports that Open 
MPI can use (specifically: "ob1" vs. "cm").

> > More specifically: all things being equal, you don't care which is used.  
> > You just want your message to get to the receiver/target as fast as 
> > possible.  One of the main ideas of MPI is to hide those kinds of details 
> > from the user.  I.e., you call MPI_SEND.  A miracle occurs.  The message is 
> > received on the other side.
> 
> True. It's just that I am digging into the OFED source code and the ompi 
> source code, and trying to understand the way these two interact.

The openib BTL is probably one of the most complex sections of Open MPI, 
unfortunately.  :-\  The verbs API is *quite* complex, and has many different 
options that do not work on all types of OpenFabrics hardware.  This leads to 
many different blocks of code, not all of which are executed on all platforms.  
The verbs model of registering memory also leads to a lot of complications, 
especially since, for performance reasons, MPI has to cache memory 
registrations and interpose itself in the memory subsystem to catch when 
registered memory is freed (see the README for some details here).  

If you have any specific questions about the implementation, post over on the 
devel list.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-01 Thread Jeffrey Squyres
I would just ignore these tests:

1. The use of MPI one-sided functionality is extremely rare out in the real 
world.
2. Brian said there were probably bugs in Open MPI's implementation of the MPI 
one-sided functionality itself, and he's in the middle of re-writing the 
one-sided functionality anyway.
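
For reference, the "one-sided" calls those benchmarks exercise look roughly like 
the sketch below. This is a hypothetical illustration (made-up file name, not 
the OSU benchmark code): rank 1 writes directly into a window of memory exposed 
by rank 0, and rank 0 makes no matching receive call.

-----
/* put_sketch.c - hypothetical illustration of MPI-2 one-sided communication */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int local = 0;          /* value rank 1 will PUT into rank 0's window */
    int window_buf = -1;    /* memory exposed to remote ranks             */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every rank exposes one int; only rank 0's window is targeted here. */
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 1 && nprocs > 1) {
        local = 42;
        /* One-sided write into rank 0's buffer: target rank 0, displacement 0. */
        MPI_Put(&local, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);   /* closes the epoch; the PUT is now visible */

    if (rank == 0) {
        printf("rank 0's window now holds %d\n", window_buf);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
-----

Whether that MPI_Put is implemented with verbs RDMA, send/receive emulation, or 
something else entirely is up to the MPI library, which is exactly why a bug 
there is hard for a user to diagnose.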


On Mar 1, 2012, at 1:26 AM, Jingcha Joba wrote:

> Well, as Jeff says, it looks like it's to do with the one-sided comm. 
> 
> But the reason I said that was because of what I experienced a couple of 
> months ago: when I had a Myri-10G and an Intel gigabit Ethernet card lying 
> around, I wanted to test kernel bypass using the Open-MX stack and I ran the 
> OSU benchmarks.
> Though all the tests worked fine with the Myri-10G, I seemed to get this 
> "hanging" issue when running over the Intel gigabit Ethernet, especially for 
> sizes larger than 1K on put/get/bcast. I tried the TCP stack instead of MX, 
> and it seemed to work fine, though with bad latency numbers (which is kind of 
> obvious, considering the CPU overhead of TCP).
> I never really got a chance to dig deep, but I was pretty sure that this 
> was to do with Open-MX. 
> 
> 
> On Wed, Feb 29, 2012 at 9:13 PM, Venkateswara Rao Dokku  
> wrote:
> Hi,
>   I tried executing those tests with other devices, like tcp instead 
> of ib, with the same Open MPI 1.4.3. That went fine, although it took time to 
> execute; when I tried to execute the same test on the customized OFED, the 
> tests hang at the same message size.
> 
> Can you please tell me what the possible issue could be, so that you 
> can narrow it down?
> i.e., do I have to move to the Open MPI 1.5 tree, or is there an issue with the 
> customized OFED (in the RDMA scenarios or anything else, if you can specify)?
> 
> 
> On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres  wrote:
> On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote:
> 
> > So if I understand correctly, if a message size is smaller than it will use 
> > the MPI way (non-RDMA, 2 way communication), if its larger, then it would 
> > use the Open Fabrics, by using the ibverbs (and ofed stack) instead of 
> > using the MPI's stack?
> 
> Er... no.
> 
> So let's talk MPI-over-OpenFabrics-verbs specifically.
> 
> All MPI communication calls will use verbs under the covers.  They may use 
> verbs send/receive semantics in some cases, and RDMA semantics in other 
> cases.  "It depends" -- on a lot of things, actually.  It's hard to come up 
> with a good rule of thumb for when it uses one or the other; this is one of 
> the reasons that the openib BTL code is so complex.  :-)
> 
> The main points here are:
> 
> 1. you can trust the openib BTL to do the Best thing possible to get the 
> message to the other side.  Regardless of whether that message is an MPI_SEND 
> or an MPI_PUT (for example).
> 
> 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET 
> does not necessarily == verbs RDMA read).
> 
> > If so, could that be the reason why the MPI_Put "hangs" when sending a 
> > message more than 512KB (or may be 1MB)?
> 
> No.  I'm guessing that there's some kind of bug in the MPI_PUT implementation.
> 
> > Also is there a way to know if for a particular MPI call, OF uses send/recv 
> > or RDMA exchange?
> 
> Not really.
> 
> More specifically: all things being equal, you don't care which is used.  You 
> just want your message to get to the receiver/target as fast as possible.  
> One of the main ideas of MPI is to hide those kinds of details from the user. 
>  I.e., you call MPI_SEND.  A miracle occurs.  The message is received on the 
> other side.
> 
> :-)
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> -- 
> Thanks & Regards,
> D.Venkateswara Rao,
> Software Engineer,One Convergence Devices Pvt Ltd.,
> Jubille Hills,Hyderabad.
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Very slow MPI_GATHER

2012-03-01 Thread Jeffrey Squyres
On Mar 1, 2012, at 3:33 AM, Pinero, Pedro_jose wrote:

> I am launching 200 light processes on two computers with 8 cores each 
> (Intel i7 processor). They are dedicated and are interconnected through a 
> point-to-point Gigabit Ethernet link.
>  
> I read about oversubscribing nodes in the open-mpi documentation, and for 
> that reason I am using the option
>  
> -Mca mpi_yield_when_idle 1

That's still going to give you terrible performance.

Open MPI was designed to run basically at one process per processor (usually a 
core).  The easiest reason to cite here is that Open MPI busy-polls while 
blocking for message passing progress.  The yield_when_idle option *helps* (in 
some versions of Linux, at least), but it doesn't change that fact that MPI 
processes will be extremely aggressive in clamoring for CPU cycles.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-03-01 Thread Syed Ahsan Ali
I am able to run the application with LSF now; it's strange, because I wasn't
able to trace any error.

On Thu, Mar 1, 2012 at 11:34 AM, PukkiMonkey  wrote:

> What Jeff means is that because you wrote echo mpirun...>>outfile
> rather than
> echo "mpirun...>>outfile",
> the shell redirected echo's output into the outfile instead of printing it to stdout.
>
> Sent from my iPhone
>
> On Feb 29, 2012, at 8:44 PM, Syed Ahsan Ali  wrote:
>
> Sorry Jeff, I couldn't get your point.
>
> On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres wrote:
>
>> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:
>>
>> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
>> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
>> ./hrm >> ${OUTFILE}_hrm 2>&1
>> > [pmdtest@pmd02 d00_dayfiles]$
>>
>> Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm
>> file, not stdout.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Very slow MPI_GATHER

2012-03-01 Thread Ralph Castain
Wow - with that heavy an oversubscription, the performance you are seeing is 
certainly to be expected. Not much you can do about it except reduce the 
oversubscription, either by increasing the number of computers or reducing the 
number of processes.


On Mar 1, 2012, at 1:33 AM, Pinero, Pedro_jose wrote:

> Thank you for your fast response.
>  
> I am launching 200 light processes on two computers with 8 cores each 
> (Intel i7 processor). They are dedicated and are interconnected through a 
> point-to-point Gigabit Ethernet link.
>  
> I read about oversubscribing nodes in the open-mpi documentation, and for 
> that reason I am using the option
>  
> -Mca mpi_yield_when_idle 1
>  
> Regards
>  
> Pedro
>  
>  
>  
> >>On Feb 29, 2012, at 11:01 AM, Pinero, Pedro_jose wrote:
>  
> >> I am using OMPI v.1.5.5 to communicate 200 processes in a 2-computer 
> >> cluster connected through Ethernet, obtaining very poor performance.
>  
> >Let me make sure I'm parsing this statement properly: are you launching 
> >200 MPI processes on 2 computers?  If so, do those computers each have 100 
> >cores?
>  
> >I ask because oversubscribing MPI processes (i.e., putting more than 1 
> >process per core) will be disastrous to performance.
>  
> >> I have measured each operation's time and I have realised that the 
> >> MPI_Gather operation takes about 1 second in each synchronization (only 
> >> an integer is sent in each case). Is this time range normal or do I have a 
> >> synchronization problem?  Is there any way to improve this performance?
>  
> >I'm afraid I can't say more without more information about your hardware and 
> >software setup.  Is this a dedicated HPC cluster?  Are you oversubscribing 
> >the cores?  What kind of Ethernet switching gear do you have?  ...etc.
>  
> >--
> >Jeff Squyres
> >jsquy...@cisco.com
> >For corporate legal information go to: 
> >http://www.cisco.com/web/about/doing_business/legal/cri/
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] compilation error with pgcc Unknown switch

2012-03-01 Thread Ralph Castain
You need to update your source code - this was identified and fixed on Wed. 
Unfortunately, our trunk is a developer's environment. While we try hard to 
keep it fully functional, bugs do occasionally work their way into the code.
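
For context, the link failure quoted below is a plain C symbol clash: the 
fcoll/dynamic and fcoll/static components each define a non-static function 
named local_heap_sort, both static archives export that symbol, and the final 
link of libmpi sees it twice. Whatever the actual trunk fix was, the usual 
remedy for this pattern is to make such helpers file-local (or rename one of 
them). A minimal sketch, with a made-up file name rather than the real fcoll 
sources:

-----
/* fcoll_example.c - hypothetical sketch, not the actual Open MPI fix.
 * If two translation units in the same link both define a non-static
 * local_heap_sort, the linker reports "multiple definition".  Declaring
 * the helper `static` keeps it private to its own file, so identically
 * named helpers in other components cannot clash. */
#include <stdio.h>

static void local_heap_sort(int *a, int n)   /* file-local: not exported */
{
    /* stand-in sort body (insertion sort) just so the sketch compiles */
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

int main(void)
{
    int v[] = { 3, 1, 2 };
    local_heap_sort(v, 3);
    printf("%d %d %d\n", v[0], v[1], v[2]);
    return 0;
}
-----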

On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote:

> Hi Nathan,
> 
> I tried building on an internal login node, and it did not fail at the
> previous point. But, after compiling for a very long time, it failed
> while building libmpi.la, with a multiple definition error:
> --
> ...
>  CC mpiext/mpiext.lo
>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
>  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
>  CCLD   mpi/f77/base/libmpi_f77_base.la
>  CCLD   libmpi.la
> mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o):
> In function `local_heap_sort':
> /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c::
> multiple definition of `local_heap_sort'
> mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929:
> first defined here
> make[2]: *** [libmpi.la] Error 2
> make[2]: Leaving directory
> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
> make: *** [all-recursive] Error 1
> --
> 
> Any idea why this is happening, and how to fix it? Again, I am using
> the XE6 platform configuration file.
> 
> Abhinav.
> 
> On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm  wrote:
>> 
>> 
>> On Mon, 27 Feb 2012, Abhinav Sarje wrote:
>> 
>>> Hi Nathan, Gus, Manju,
>>> 
>>> I got a chance to try out the XE6 support build, but with no success.
>>> First I was getting this error: "PGC-F-0010-File write error occurred
>>> (temporary pragma .s file)". After searching online about this error,
>>> I saw that there is a patch at
>>> 
>>> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch;
>>> for this particular error.
>>> 
>>> With the patched version, I did not get this error anymore, but got
>>> the unknown switch flag error for the flag "-march=amdfam10"
>>> (specified in the XE6 configuration in the dev trunk) at a particular
>>> point even if I use the '-noswitcherror' flag with the pgcc compiler.
>>> 
>>> If I remove this flag (-march=amdfam10), the build fails later at the
>>> following point:
>>> -
>>> Making all in mca/ras/alps
>>> make[2]: Entering directory
>>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>>>  CC ras_alps_component.lo
>>>  CC ras_alps_module.lo
>>> PGC-F-0206-Can't find include file alps/apInfo.h
>>> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
>>> PGC/x86-64 Linux 11.10-0: compilation aborted
>>> make[2]: *** [ras_alps_module.lo] Error 1
>>> make[2]: Leaving directory
>>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
>>> make: *** [all-recursive] Error 1
>>> --
>> 
>> 
>> This is a known issue with Cray's frontend environment. Build on one of the
>> internal login nodes.
>> 
>> 
>> -Nathan
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] compilation error with pgcc Unknown switch

2012-03-01 Thread Abhinav Sarje
Hi Nathan,

I tried building on an internal login node, and it did not fail at the
previous point. But, after compiling for a very long time, it failed
while building libmpi.la, with a multiple definition error:
--
...
  CC mpiext/mpiext.lo
  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
  CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
  CCLD   mpi/f77/base/libmpi_f77_base.la
  CCLD   libmpi.la
mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o):
In function `local_heap_sort':
/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c::
multiple definition of `local_heap_sort'
mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929:
first defined here
make[2]: *** [libmpi.la] Error 2
make[2]: Leaving directory
`/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
make: *** [all-recursive] Error 1
--

Any idea why this is happening, and how to fix it? Again, I am using
the XE6 platform configuration file.

Abhinav.

On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm  wrote:
>
>
> On Mon, 27 Feb 2012, Abhinav Sarje wrote:
>
>> Hi Nathan, Gus, Manju,
>>
>> I got a chance to try out the XE6 support build, but with no success.
>> First I was getting this error: "PGC-F-0010-File write error occurred
>> (temporary pragma .s file)". After searching online about this error,
>> I saw that there is a patch at
>>
>> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch;
>> for this particular error.
>>
>> With the patched version, I did not get this error anymore, but got
>> the unknown switch flag error for the flag "-march=amdfam10"
>> (specified in the XE6 configuration in the dev trunk) at a particular
>> point even if I use the '-noswitcherror' flag with the pgcc compiler.
>>
>> If I remove this flag (-march=amdfam10), the build fails later at the
>> following point:
>> -
>> Making all in mca/ras/alps
>> make[2]: Entering directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>>  CC     ras_alps_component.lo
>>  CC     ras_alps_module.lo
>> PGC-F-0206-Can't find include file alps/apInfo.h
>> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
>> PGC/x86-64 Linux 11.10-0: compilation aborted
>> make[2]: *** [ras_alps_module.lo] Error 1
>> make[2]: Leaving directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
>> make: *** [all-recursive] Error 1
>> --
>
>
> This is a known issue with Cray's frontend environment. Build on one of the
> internal login nodes.
>
>
> -Nathan
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Very slow MPI_GATHER

2012-03-01 Thread Pinero, Pedro_jose
Thank you for your fast response.

I am launching 200 light processes on two computers with 8 cores each
(Intel i7 processor). They are dedicated and are interconnected
through a point-to-point Gigabit Ethernet link.

I read about oversubscribing nodes in the open-mpi documentation, and
for that reason I am using the option

-Mca mpi_yield_when_idle 1

Regards

Pedro


>>On Feb 29, 2012, at 11:01 AM, Pinero, Pedro_jose wrote:

>> I am using OMPI v.1.5.5 to communicate 200 processes in a 2-computer
>> cluster connected through Ethernet, obtaining very poor performance.

>Let me make sure I'm parsing this statement properly: are you
>launching 200 MPI processes on 2 computers?  If so, do those computers
>each have 100 cores?

>I ask because oversubscribing MPI processes (i.e., putting more than 1
>process per core) will be disastrous to performance.

>> I have measured each operation's time and I have realised that the
>> MPI_Gather operation takes about 1 second in each synchronization
>> (only an integer is sent in each case). Is this time range normal or do I
>> have a synchronization problem?  Is there any way to improve this
>> performance?

>I'm afraid I can't say more without more information about your
>hardware and software setup.  Is this a dedicated HPC cluster?  Are you
>oversubscribing the cores?  What kind of Ethernet switching gear do you
>have?  ...etc.

>-- 
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-03-01 Thread PukkiMonkey
What Jeff means is that because you wrote echo mpirun...>>outfile rather than
echo "mpirun...>>outfile",
the shell redirected echo's output into the outfile instead of printing it to stdout. 

Sent from my iPhone

On Feb 29, 2012, at 8:44 PM, Syed Ahsan Ali  wrote:

> Sorry Jeff, I couldn't get your point.
> 
> On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres  wrote:
> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:
> 
> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile 
> > $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1 
> > ./hrm >> ${OUTFILE}_hrm 2>&1
> > [pmdtest@pmd02 d00_dayfiles]$
> 
> > Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm file, 
> not stdout.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> 
> -- 
> Syed Ahsan Ali Bokhari 
> Electronic Engineer (EE)
> 
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
> 


Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-01 Thread Jingcha Joba
Well, as Jeff says, it looks like it's to do with the one-sided comm.

But the reason I said that was because of what I experienced a couple of
months ago: when I had a Myri-10G and an Intel gigabit Ethernet card lying
around, I wanted to test kernel bypass using the Open-MX stack and I ran
the OSU benchmarks.
Though all the tests worked fine with the Myri-10G, I seemed to get this
"hanging" issue when running over the Intel gigabit Ethernet, especially for
sizes larger than 1K on put/get/bcast. I tried the TCP stack instead of MX,
and it seemed to work fine, though with bad latency numbers (which is kind
of obvious, considering the CPU overhead of TCP).
I never really got a chance to dig deep, but I was pretty sure that
this was to do with Open-MX.


On Wed, Feb 29, 2012 at 9:13 PM, Venkateswara Rao Dokku  wrote:

> Hi,
>   I tried executing those tests with other devices, like tcp
> instead of ib, with the same Open MPI 1.4.3. That went fine, although it took time
> to execute; when I tried to execute the same test on the customized OFED,
> the tests hang at the same message size.
>
> Can you please tell me what the possible issue could be, so that
> you can narrow it down?
> i.e., do I have to move to the Open MPI 1.5 tree, or is there an issue with the
> customized OFED (in the RDMA scenarios or anything else, if you can specify)?
>
>
> On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres wrote:
>
>> On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote:
>>
>> > So if I understand correctly, if a message size is smaller than it will
>> use the MPI way (non-RDMA, 2 way communication), if its larger, then it
>> would use the Open Fabrics, by using the ibverbs (and ofed stack) instead
>> of using the MPI's stack?
>>
>> Er... no.
>>
>> So let's talk MPI-over-OpenFabrics-verbs specifically.
>>
>> All MPI communication calls will use verbs under the covers.  They may
>> use verbs send/receive semantics in some cases, and RDMA semantics in other
>> cases.  "It depends" -- on a lot of things, actually.  It's hard to come up
>> with a good rule of thumb for when it uses one or the other; this is one of
>> the reasons that the openib BTL code is so complex.  :-)
>>
>> The main points here are:
>>
>> 1. you can trust the openib BTL to do the Best thing possible to get the
>> message to the other side.  Regardless of whether that message is an
>> MPI_SEND or an MPI_PUT (for example).
>>
>> 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise,
>> MPI_GET does not necessarily == verbs RDMA read).
>>
>> > If so, could that be the reason why the MPI_Put "hangs" when sending a
>> message more than 512KB (or may be 1MB)?
>>
>> No.  I'm guessing that there's some kind of bug in the MPI_PUT
>> implementation.
>>
>> > Also is there a way to know if for a particular MPI call, OF uses
>> send/recv or RDMA exchange?
>>
>> Not really.
>>
>> More specifically: all things being equal, you don't care which is used.
>>  You just want your message to get to the receiver/target as fast as
>> possible.  One of the main ideas of MPI is to hide those kinds of details
>> from the user.  I.e., you call MPI_SEND.  A miracle occurs.  The message is
>> received on the other side.
>>
>> :-)
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Thanks & Regards,
> D.Venkateswara Rao,
> Software Engineer,One Convergence Devices Pvt Ltd.,
> Jubille Hills,Hyderabad.
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Simple question on GRID

2012-03-01 Thread Alexander Beck-Ratzka

Hi Shaandar,

this is not a simple question! If you want to bring your cluster into 
the Grid, you first have to decide which Grid, because the different 
Grids use different Grid software. Having made that decision, I would 
recommend looking at the web page of that Grid community; you can usually 
find instructions there on how to integrate your cluster into their 
Grid. Depending on the Grid software used, these instructions can be 
very different, so I cannot be more precise here and now. 
If you decide on a Grid that uses the Globus software, feel 
free to contact me with further questions. In the case of Globus I can 
help you...


Best wishes

Alexander


Hi
I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is 
MPICH2).
They run separately in their local network environments. I know there is 
a way to integrate them through the Internet, presumably by Grid software. 
Is there any tutorial on how to do this?




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-01 Thread Jingcha Joba
Aah...
So when openMPI is compile with OFED, and run on a Infiniband/RoCE devices,
I would use the mpi would simply direct to ofed to do point to point calls
in the ofed way?

>
> More specifically: all things being equal, you don't care which is used.
>  You just want your message to get to the receiver/target as fast as
> possible.  One of the main ideas of MPI is to hide those kinds of details
> from the user.  I.e., you call MPI_SEND.  A miracle occurs.  The message is
> received on the other side.
>
True. It's just that I am digging into the OFED source code and the ompi
source code, and trying to understand the way these two interact.

>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-01 Thread Venkateswara Rao Dokku
Hi,
  I tried executing those tests with other devices, like tcp instead
of ib, with the same Open MPI 1.4.3. That went fine, although it took time to
execute; when I tried to execute the same test on the customized OFED,
the tests hang at the same message size.

Can you please tell me what the possible issue could be, so that
you can narrow it down?
i.e., do I have to move to the Open MPI 1.5 tree, or is there an issue with the
customized OFED (in the RDMA scenarios or anything else, if you can specify)?

On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres  wrote:

> On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote:
>
> > So if I understand correctly, if a message size is smaller than it will
> use the MPI way (non-RDMA, 2 way communication), if its larger, then it
> would use the Open Fabrics, by using the ibverbs (and ofed stack) instead
> of using the MPI's stack?
>
> Er... no.
>
> So let's talk MPI-over-OpenFabrics-verbs specifically.
>
> All MPI communication calls will use verbs under the covers.  They may use
> verbs send/receive semantics in some cases, and RDMA semantics in other
> cases.  "It depends" -- on a lot of things, actually.  It's hard to come up
> with a good rule of thumb for when it uses one or the other; this is one of
> the reasons that the openib BTL code is so complex.  :-)
>
> The main points here are:
>
> 1. you can trust the openib BTL to do the Best thing possible to get the
> message to the other side.  Regardless of whether that message is an
> MPI_SEND or an MPI_PUT (for example).
>
> 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET
> does not necessarily == verbs RDMA read).
>
> > If so, could that be the reason why the MPI_Put "hangs" when sending a
> message more than 512KB (or may be 1MB)?
>
> No.  I'm guessing that there's some kind of bug in the MPI_PUT
> implementation.
>
> > Also is there a way to know if for a particular MPI call, OF uses
> send/recv or RDMA exchange?
>
> Not really.
>
> More specifically: all things being equal, you don't care which is used.
>  You just want your message to get to the receiver/target as fast as
> possible.  One of the main ideas of MPI is to hide those kinds of details
> from the user.  I.e., you call MPI_SEND.  A miracle occurs.  The message is
> received on the other side.
>
> :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Thanks & Regards,
D.Venkateswara Rao,
Software Engineer,One Convergence Devices Pvt Ltd.,
Jubille Hills,Hyderabad.


[OMPI users] Simple question on GRID

2012-03-01 Thread Shaandar Nyamtulga

Hi
I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is MPICH2).
They run separately in their local network environments. I know there is a way to 
integrate them through the Internet, presumably by Grid software. Is there any 
tutorial on how to do this?