Re: [OMPI users] compilation error with pgcc Unknown switch
yes, I did a full autogen, configure, make clean and make all

On Thu, Mar 1, 2012 at 10:03 PM, Jeffrey Squyres wrote:
> Did you do a full autogen / configure / make clean / make all?
>
> On Mar 1, 2012, at 8:53 AM, Abhinav Sarje wrote:
>
>> Thanks Ralph. That did help, but only till the next hurdle. Now the
>> build fails at the following point with an 'undefined reference':
>> ---
>> Making all in tools/ompi_info
>> make[2]: Entering directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
>>   CC     ompi_info.o
>>   CC     output.o
>>   CC     param.o
>>   CC     components.o
>>   CC     version.o
>>   CCLD   ompi_info
>> ../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64'
>> make[2]: *** [ompi_info] Error 2
>> make[2]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
>> make: *** [all-recursive] Error 1
>> ---
>>
>> On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castain wrote:
>>> You need to update your source code - this was identified and fixed on Wed.
>>> Unfortunately, our trunk is a developer's environment. While we try hard to
>>> keep it fully functional, bugs do occasionally work their way into the code.
>>>
>>> On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote:

Hi Nathan, I tried building on an internal login node, and it did not fail at
the previous point. But, after compiling for a very long time, it failed while
building libmpi.la, with a multiple definition error:
--
...
  CC     mpiext/mpiext.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo
  CC     mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo
  CCLD   mpi/f77/base/libmpi_f77_base.la
  CCLD   libmpi.la
mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o): In function `local_heap_sort':
/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c:: multiple definition of `local_heap_sort'
mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929: first defined here
make[2]: *** [libmpi.la] Error 2
make[2]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi'
make: *** [all-recursive] Error 1
--

Any idea why this is happening, and how to fix it? Again, I am using the XE6
platform configuration file. Abhinav.

On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm wrote:
>
> On Mon, 27 Feb 2012, Abhinav Sarje wrote:
>
>> Hi Nathan, Gus, Manju,
>>
>> I got a chance to try out the XE6 support build, but with no success.
>> First I was getting this error: "PGC-F-0010-File write error occurred
>> (temporary pragma .s file)". After searching online about this error,
>> I saw that there is a patch at
>> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch"
>> for this particular error.
>>
>> With the patched version, I did not get this error anymore, but got
>> the unknown switch flag error for the flag "-march=amdfam10"
>> (specified in the XE6 configuration in the dev trunk) at a particular
>> point even if I use the '-noswitcherror' flag with the pgcc compiler.
>>
>> If I remove this flag (-march=amdfam10), the build fails later at the
>> following point:
>> -
>> Making all in mca/ras/alps
>> make[2]: Entering directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>>   CC     ras_alps_component.lo
>>   CC     ras_alps_module.lo
>> PGC-F-0206-Can't find include file alps/apInfo.h
>> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37)
>> PGC/x86-64 Linux 11.10-0: compilation aborted
>> make[2]: *** [ras_alps_module.lo] Error 1
>> make[2]: Leaving directory
>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte'
>> make: *** [all-recursive] Error 1
>>
Re: [OMPI users] run orterun with more than 200 processes
You might try putting that list of hosts in a hostfile instead of on the cmd line - you may be hitting some limits there. I also don't believe that you can add an orted in that manner - orterun will have no idea how it got there and is likely to abort. On Mar 1, 2012, at 3:20 PM, Jianzhang He wrote: > Hi, > > I am not sure if this is the right place to post this question. If you know > where it is appropriate, please let me know. > > I need to run application that launches 200 processes with the command: > 1)orterun --prefix ./ -np 200 -wd ./ -host > hostname1.domain.com,1,2,3,4,5,6,7,8,9,…..,196,197,198,199 CMD > > Later, I will run a command to communicate with 1) with a command like: > 2)orted -mca ess env -mca orte_ess_ -mca orte_ess_vpid 100 -mca > orte_ess_num_procs 200 --hnp-uri "job#;tcp:/ hostname1.domain.com /:port#" > > The problem I have is I can only run with about 100 nodes. If the number is > higher, 1) will not invoke CMD and the total number of processes is about 130 > or so. > > My question is how to remove that limit? > > Thanks in advance. > > Jianzhang > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
I don't know - I didn't write the app file code, and I've never seen anything defining its behavior. So I guess you could say it is intended - or not! :-/ On Mar 1, 2012, at 2:53 PM, Jeffrey Squyres wrote: > Actually, I should say that I discovered that if you put --prefix on each > line of the app context file, then the first case (running the app context > file) works fine; it adheres to the --prefix behavior. > > Ralph: is this intended behavior? (I don't know if I have an opinion either > way) > > > On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote: > >> I see the problem. >> >> It looks like the use of the app context file is triggering different >> behavior, and that behavior is erasing the use of --prefix. If I replace >> the app context file with a complete command line, it works and the --prefix >> behavior is observed. >> >> Specifically: >> >> $mpirunfile $mcaparams --app addmpw-hostname >> >> ^^ This one seems to ignore --prefix behavior. >> >> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname >> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 >> -np 1 hostname >> >> ^^ These two seem to adhere to the proper --prefix behavior. >> >> Ralph -- can you have a look? >> >> >> >> >> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote: >> >>> Hi Ralph, >>> >>> Thanks, here is what I did as suggested by Jeff: >>> What did this command line look like? Can you provide the configure line as well? 
>>> >>> As in my previous post, the script as following: >>> >>> (1) debug messages: >> >>> yiguang@gulftown testdmp]$ ./test.bash >>> [gulftown:28340] mca: base: components_open: Looking for plm components >>> [gulftown:28340] mca: base: components_open: opening plm components >>> [gulftown:28340] mca: base: components_open: found loaded component rsh >>> [gulftown:28340] mca: base: components_open: component rsh has no register >>> function >>> [gulftown:28340] mca: base: components_open: component rsh open function >>> successful >>> [gulftown:28340] mca: base: components_open: found loaded component slurm >>> [gulftown:28340] mca: base: components_open: component slurm has no >>> register function >>> [gulftown:28340] mca: base: components_open: component slurm open function >>> successful >>> [gulftown:28340] mca: base: components_open: found loaded component tm >>> [gulftown:28340] mca: base: components_open: component tm has no register >>> function >>> [gulftown:28340] mca: base: components_open: component tm open function >>> successful >>> [gulftown:28340] mca:base:select: Auto-selecting plm components >>> [gulftown:28340] mca:base:select:( plm) Querying component [rsh] >>> [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set >>> priority to 10 >>> [gulftown:28340] mca:base:select:( plm) Querying component [slurm] >>> [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query >>> failed to return a module >>> [gulftown:28340] mca:base:select:( plm) Querying component [tm] >>> [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. 
Query >>> failed to return a module >>> [gulftown:28340] mca:base:select:( plm) Selected component [rsh] >>> [gulftown:28340] mca: base: close: component slurm closed >>> [gulftown:28340] mca: base: close: unloading component slurm >>> [gulftown:28340] mca: base: close: component tm closed >>> [gulftown:28340] mca: base: close: unloading component tm >>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash >>> 3546479048 >>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438 >>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm >>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1] >>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1] >>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash) >>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local >>> shell >>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash) >>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv: >>> /usr/bin/rsh orted --daemonize -mca ess env -mca >>> orte_ess_jobid 1142816768 -mca >>> orte_ess_vpid -mca orte_ess_num_procs 4 --hnp-uri >>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" >>> - >>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca >>> btl openib,sm,self --mca >>> orte_tmpdir_base /tmp --mca plm_base_verbose 100 >>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node >>> gulftown >>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001 >>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon >>> [[17438,0],1] >>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) >>> [/usr/bin/rsh ibnode001 orted --daemonize -mca >>> ess env -mca orte_ess_jobid 1142816768
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
> Actually, I should say that I discovered that if you put --prefix on each > line of the app context file, then the first > case (running the app context file) works fine; it adheres to the --prefix > behavior. Yes, I confirmed this on our cluster. It works with --prefix on each line of the app file.
[OMPI users] run orterun with more than 200 processes
Hi,

I am not sure if this is the right place to post this question. If you know
where it is appropriate, please let me know.

I need to run an application that launches 200 processes with the command:

1) orterun --prefix ./ -np 200 -wd ./ -host hostname1.domain.com,1,2,3,4,5,6,7,8,9,.,196,197,198,199 CMD

Later, I will run a command to communicate with 1), like:

2) orted -mca ess env -mca orte_ess_ -mca orte_ess_vpid 100 -mca orte_ess_num_procs 200 --hnp-uri "job#;tcp:/ hostname1.domain.com /:port#"

The problem I have is that I can only run with about 100 nodes. If the number
is higher, 1) will not invoke CMD and the total number of processes is about
130 or so.

My question is how to remove that limit?

Thanks in advance.

Jianzhang
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Actually, I should say that I discovered that if you put --prefix on each line of the app context file, then the first case (running the app context file) works fine; it adheres to the --prefix behavior. Ralph: is this intended behavior? (I don't know if I have an opinion either way) On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote: > I see the problem. > > It looks like the use of the app context file is triggering different > behavior, and that behavior is erasing the use of --prefix. If I replace the > app context file with a complete command line, it works and the --prefix > behavior is observed. > > Specifically: > > $mpirunfile $mcaparams --app addmpw-hostname > > ^^ This one seems to ignore --prefix behavior. > > $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname > $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 > -np 1 hostname > > ^^ These two seem to adhere to the proper --prefix behavior. > > Ralph -- can you have a look? > > > > > On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote: > >> Hi Ralph, >> >> Thanks, here is what I did as suggested by Jeff: >> >>> What did this command line look like? Can you provide the configure line as >>> well? 
>> >> As in my previous post, the script as following: >> >> (1) debug messages: > >> yiguang@gulftown testdmp]$ ./test.bash >> [gulftown:28340] mca: base: components_open: Looking for plm components >> [gulftown:28340] mca: base: components_open: opening plm components >> [gulftown:28340] mca: base: components_open: found loaded component rsh >> [gulftown:28340] mca: base: components_open: component rsh has no register >> function >> [gulftown:28340] mca: base: components_open: component rsh open function >> successful >> [gulftown:28340] mca: base: components_open: found loaded component slurm >> [gulftown:28340] mca: base: components_open: component slurm has no register >> function >> [gulftown:28340] mca: base: components_open: component slurm open function >> successful >> [gulftown:28340] mca: base: components_open: found loaded component tm >> [gulftown:28340] mca: base: components_open: component tm has no register >> function >> [gulftown:28340] mca: base: components_open: component tm open function >> successful >> [gulftown:28340] mca:base:select: Auto-selecting plm components >> [gulftown:28340] mca:base:select:( plm) Querying component [rsh] >> [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set >> priority to 10 >> [gulftown:28340] mca:base:select:( plm) Querying component [slurm] >> [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query >> failed to return a module >> [gulftown:28340] mca:base:select:( plm) Querying component [tm] >> [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. 
Query >> failed to return a module >> [gulftown:28340] mca:base:select:( plm) Selected component [rsh] >> [gulftown:28340] mca: base: close: component slurm closed >> [gulftown:28340] mca: base: close: unloading component slurm >> [gulftown:28340] mca: base: close: component tm closed >> [gulftown:28340] mca: base: close: unloading component tm >> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash >> 3546479048 >> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438 >> [gulftown:28340] [[17438,0],0] plm:base:receive start comm >> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1] >> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1] >> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash) >> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local >> shell >> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash) >> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv: >> /usr/bin/rsh orted --daemonize -mca ess env -mca >> orte_ess_jobid 1142816768 -mca >> orte_ess_vpid -mca orte_ess_num_procs 4 --hnp-uri >> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" >> - >> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca >> btl openib,sm,self --mca >> orte_tmpdir_base /tmp --mca plm_base_verbose 100 >> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node >> gulftown >> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001 >> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon >> [[17438,0],1] >> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) >> [/usr/bin/rsh ibnode001 orted --daemonize -mca >> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca >> orte_ess_num_procs 4 --hnp-uri >> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" >> - >> 
-mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca >> btl openib,sm,self --mca >> orte_tmpdir_base /tmp --mca
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Ralph, Thanks, here is what I did as suggested by Jeff: > What did this command line look like? Can you provide the configure line as > well? As in my previous post, the script as following: (1) debug messages: >>> yiguang@gulftown testdmp]$ ./test.bash [gulftown:28340] mca: base: components_open: Looking for plm components [gulftown:28340] mca: base: components_open: opening plm components [gulftown:28340] mca: base: components_open: found loaded component rsh [gulftown:28340] mca: base: components_open: component rsh has no register function [gulftown:28340] mca: base: components_open: component rsh open function successful [gulftown:28340] mca: base: components_open: found loaded component slurm [gulftown:28340] mca: base: components_open: component slurm has no register function [gulftown:28340] mca: base: components_open: component slurm open function successful [gulftown:28340] mca: base: components_open: found loaded component tm [gulftown:28340] mca: base: components_open: component tm has no register function [gulftown:28340] mca: base: components_open: component tm open function successful [gulftown:28340] mca:base:select: Auto-selecting plm components [gulftown:28340] mca:base:select:( plm) Querying component [rsh] [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10 [gulftown:28340] mca:base:select:( plm) Querying component [slurm] [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [gulftown:28340] mca:base:select:( plm) Querying component [tm] [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. 
Query failed to return a module [gulftown:28340] mca:base:select:( plm) Selected component [rsh] [gulftown:28340] mca: base: close: component slurm closed [gulftown:28340] mca: base: close: unloading component slurm [gulftown:28340] mca: base: close: component tm closed [gulftown:28340] mca: base: close: unloading component tm [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048 [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438 [gulftown:28340] [[17438,0],0] plm:base:receive start comm [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1] [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1] [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash) [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash) [gulftown:28340] [[17438,0],0] plm:rsh: final template argv: /usr/bin/rsh orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" - -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100 [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001 [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1] [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" - -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100] bash: orted: command not found [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002 [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2] [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" - -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100] bash: orted: command not found [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003 [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" - -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100] [gulftown:28340] [[17438,0],0]
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
What did this command line look like? Can you provide the configure line as well? On Mar 1, 2012, at 12:46 PM, Yiguang Yan wrote: > Hi Jeff, > > Here I made a developer build, and then got the following message > with plm_base_verbose: > > [gulftown:28340] mca: base: components_open: Looking for plm > components > [gulftown:28340] mca: base: components_open: opening plm > components > [gulftown:28340] mca: base: components_open: found loaded > component rsh > [gulftown:28340] mca: base: components_open: component rsh > has no register function > [gulftown:28340] mca: base: components_open: component rsh > open function successful > [gulftown:28340] mca: base: components_open: found loaded > component slurm > [gulftown:28340] mca: base: components_open: component slurm > has no register function > [gulftown:28340] mca: base: components_open: component slurm > open function successful > [gulftown:28340] mca: base: components_open: found loaded > component tm > [gulftown:28340] mca: base: components_open: component tm > has no register function > [gulftown:28340] mca: base: components_open: component tm > open function successful > [gulftown:28340] mca:base:select: Auto-selecting plm components > [gulftown:28340] mca:base:select:( plm) Querying component [rsh] > [gulftown:28340] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [gulftown:28340] mca:base:select:( plm) Querying component > [slurm] > [gulftown:28340] mca:base:select:( plm) Skipping component > [slurm]. Query failed to return a module > [gulftown:28340] mca:base:select:( plm) Querying component [tm] > [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. 
> Query failed to return a module > [gulftown:28340] mca:base:select:( plm) Selected component [rsh] > [gulftown:28340] mca: base: close: component slurm closed > [gulftown:28340] mca: base: close: unloading component slurm > [gulftown:28340] mca: base: close: component tm closed > [gulftown:28340] mca: base: close: unloading component tm > [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 > nodename hash 3546479048 > [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438 > [gulftown:28340] [[17438,0],0] plm:base:receive start comm > [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1] > [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1] > [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash) > [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote > shell as local shell > [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash) > [gulftown:28340] [[17438,0],0] plm:rsh: final template argv: >/usr/bin/rsh orted --daemonize -mca ess env - > mca orte_ess_jobid 1142816768 -mca orte_ess_vpid - > mca orte_ess_num_procs 4 --hnp-uri > "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t > cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca > plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix > 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca > plm_base_verbose 100 > [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already > exists on node gulftown > [gulftown:28340] [[17438,0],0] plm:rsh: launching on node > ibnode001 > [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon > [[17438,0],1] > [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) > [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca > orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca > orte_ess_num_procs 4 --hnp-uri > "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t > cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca > plm_rsh_agent rsh:ssh --mca 
btl_openib_warn_default_gid_prefix > 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca > plm_base_verbose 100] > bash: orted: command not found > [gulftown:28340] [[17438,0],0] plm:rsh: launching on node > ibnode002 > [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon > [[17438,0],2] > [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) > [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca > orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca > orte_ess_num_procs 4 --hnp-uri > "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t > cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca > plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix > 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca > plm_base_verbose 100] > bash: orted: command not found > [gulftown:28340] [[17438,0],0] plm:rsh: launching on node > ibnode003 > [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) > [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca > orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca > orte_ess_num_procs 4 --hnp-uri > "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t > cp://172.23.10.1:43159;tcp://172.33.10.1:43159"
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Jeff, Here I made a developer build, and then got the following message with plm_base_verbose: >>> [gulftown:28340] mca: base: components_open: Looking for plm components [gulftown:28340] mca: base: components_open: opening plm components [gulftown:28340] mca: base: components_open: found loaded component rsh [gulftown:28340] mca: base: components_open: component rsh has no register function [gulftown:28340] mca: base: components_open: component rsh open function successful [gulftown:28340] mca: base: components_open: found loaded component slurm [gulftown:28340] mca: base: components_open: component slurm has no register function [gulftown:28340] mca: base: components_open: component slurm open function successful [gulftown:28340] mca: base: components_open: found loaded component tm [gulftown:28340] mca: base: components_open: component tm has no register function [gulftown:28340] mca: base: components_open: component tm open function successful [gulftown:28340] mca:base:select: Auto-selecting plm components [gulftown:28340] mca:base:select:( plm) Querying component [rsh] [gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10 [gulftown:28340] mca:base:select:( plm) Querying component [slurm] [gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [gulftown:28340] mca:base:select:( plm) Querying component [tm] [gulftown:28340] mca:base:select:( plm) Skipping component [tm]. 
Query failed to return a module [gulftown:28340] mca:base:select:( plm) Selected component [rsh] [gulftown:28340] mca: base: close: component slurm closed [gulftown:28340] mca: base: close: unloading component slurm [gulftown:28340] mca: base: close: component tm closed [gulftown:28340] mca: base: close: unloading component tm [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048 [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438 [gulftown:28340] [[17438,0],0] plm:base:receive start comm [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1] [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1] [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash) [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash) [gulftown:28340] [[17438,0],0] plm:rsh: final template argv: /usr/bin/rsh orted --daemonize -mca ess env - mca orte_ess_jobid 1142816768 -mca orte_ess_vpid - mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100 [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001 [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1] [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca 
orte_tmpdir_base /tmp --mca plm_base_verbose 100] bash: orted: command not found [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002 [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2] [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100] bash: orted: command not found [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003 [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100] [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],3] bash: orted: command not found [gulftown:28340] [[17438,0],0] plm:base:daemon_callback <<< It
Re: [OMPI users] Redefine proc in cartesian topologies
Also the sequential mapper may be of help - allows you to specify the node each rank is to be place on, one line/rank. On Mar 1, 2012, at 12:40 PM, Gustavo Correa wrote: > Hi Claudio > > Check 'man mpirun'. > You will find examples of the > '-byslot', '-bynode', '-loadbalance', and rankfile options, > which allow some control of how ranks are mapped into processors/cores. > > I hope this helps, > Gus Correa > > On Mar 1, 2012, at 2:34 PM, Claudio Pastorino wrote: > >> Hi, thanks for the answer. >> You are right is not the rank what matters but how do I arrange >> the physical procs in the cartesian topology. I don't care about the label. >> So, how do I achieve that? >> >> Regards, >> Claudio >> >> >> >> 2012/3/1, Ralph Castain: >>> Is it really the rank that matters, or where the rank is located? For >>> example, you could leave the ranks as assigned by the cartesian topology, >>> but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node, >>> etc. >>> >>> Is that what you are trying to achieve? >>> >>> >>> On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote: >>> Dear all, I apologize in advance if this is not the right list to post this. I am a newcomer and please let me know if I should be sending this to another list. I program MPI trying to do HPC parallel programs. In particular I wrote a parallel code for molecular dynamics simulations. The program splits the work in a matrix of procs and I send messages along rows and columns in an equal basis. I learnt that the typical arrangement of cartesian topology is not usually the best option, because in a matrix, let's say of 4x4 procs with quad procs, the procs are arranged so that through columns one stays inside the same quad proc and through rows you are always going out to the network. This means procs are arranged as one quad per row. I try to explain this for a 2x2 case. 
The cartesian topology does this assignment, typically: cartesianmpi_comm_world 0,0 --> 0 0,1 --> 1 1,0 --> 2 1,1 --> 3 The question is, how do I get a "user defined" assignment such as: 0,0 --> 0 0,1 --> 2 1,0 --> 1 1,1 --> 3 ? Thanks in advance and I hope to have made this more or less understandable. Claudio ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Redefine proc in cartesian topologies
Hi Claudio Check 'man mpirun'. You will find examples of the '-byslot', '-bynode', '-loadbalance', and rankfile options, which allow some control of how ranks are mapped into processors/cores. I hope this helps, Gus Correa On Mar 1, 2012, at 2:34 PM, Claudio Pastorino wrote: > Hi, thanks for the answer. > You are right is not the rank what matters but how do I arrange > the physical procs in the cartesian topology. I don't care about the label. > So, how do I achieve that? > > Regards, > Claudio > > > > 2012/3/1, Ralph Castain: >> Is it really the rank that matters, or where the rank is located? For >> example, you could leave the ranks as assigned by the cartesian topology, >> but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node, >> etc. >> >> Is that what you are trying to achieve? >> >> >> On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote: >> >>> Dear all, >>> I apologize in advance if this is not the right list to post this. I >>> am a newcomer and please let me know if I should be sending this to >>> another list. >>> >>> I program MPI trying to do HPC parallel programs. In particular I >>> wrote a parallel code >>> for molecular dynamics simulations. The program splits the work in a >>> matrix of procs and >>> I send messages along rows and columns in an equal basis. I learnt >>> that the typical >>> arrangement of cartesian topology is not usually the best option, >>> because in a matrix, let's say of 4x4 procs with quad procs, the >>> procs are arranged so that >>> through columns one stays inside the same quad proc and through rows >>> you are always going out to the network. This means procs are >>> arranged as one quad per row. >>> >>> I try to explain this for a 2x2 case. 
The cartesian topology does this >>> assignment, typically: >>> cartesianmpi_comm_world >>> 0,0 --> 0 >>> 0,1 --> 1 >>> 1,0 --> 2 >>> 1,1 --> 3 >>> The question is, how do I get a "user defined" assignment such as: >>> 0,0 --> 0 >>> 0,1 --> 2 >>> 1,0 --> 1 >>> 1,1 --> 3 >>> >>> ? >>> >>> Thanks in advance and I hope to have made this more or less >>> understandable. >>> Claudio >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
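[Editor's note] To make the rankfile option mentioned above concrete: a sketch of a rankfile (hypothetical hostnames and slot numbers — adjust to your cluster), passed via `mpirun -np 4 -rf myrankfile ./app`, that pins ranks 0 and 2 to one node and ranks 1 and 3 to the other, which is the placement Ralph describes:

```
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1
```

This keeps the cartesian rank numbering untouched and changes only where each rank physically runs.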
Re: [OMPI users] Redefine proc in cartesian topologies
Probably yes, do I have a more systematic way? Thanks Claudio 2012/3/1, Jingcha Joba: > mpirun -np 4 --host node1,node2,node1,node2 ./app > > Is this what you want? > > On Thu, Mar 1, 2012 at 10:57 AM, Claudio Pastorino < > claudio.pastor...@gmail.com> wrote: > >> Dear all, >> I apologize in advance if this is not the right list to post this. I >> am a newcomer and please let me know if I should be sending this to >> another list. >> >> I program MPI trying to do HPC parallel programs. In particular I >> wrote a parallel code >> for molecular dynamics simulations. The program splits the work in a >> matrix of procs and >> I send messages along rows and columns in an equal basis. I learnt >> that the typical >> arrangement of cartesian topology is not usually the best option, >> because in a matrix, let's say of 4x4 procs with quad procs, the >> procs are arranged so that >> through columns one stays inside the same quad proc and through rows >> you are always going out to the network. This means procs are >> arranged as one quad per row. >> >> I try to explain this for a 2x2 case. The cartesian topology does this >> assignment, typically: >> cartesianmpi_comm_world >> 0,0 --> 0 >> 0,1 --> 1 >> 1,0 --> 2 >> 1,1 --> 3 >> The question is, how do I get a "user defined" assignment such as: >> 0,0 --> 0 >> 0,1 --> 2 >> 1,0 --> 1 >> 1,1 --> 3 >> >> ? >> >> Thanks in advance and I hope to have made this more or less >> understandable. >> Claudio >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >
Re: [OMPI users] Redefine proc in cartesian topologies
Hi, thanks for the answer. You are right is not the rank what matters but how do I arrange the physical procs in the cartesian topology. I don't care about the label. So, how do I achieve that? Regards, Claudio 2012/3/1, Ralph Castain: > Is it really the rank that matters, or where the rank is located? For > example, you could leave the ranks as assigned by the cartesian topology, > but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node, > etc. > > Is that what you are trying to achieve? > > > On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote: > >> Dear all, >> I apologize in advance if this is not the right list to post this. I >> am a newcomer and please let me know if I should be sending this to >> another list. >> >> I program MPI trying to do HPC parallel programs. In particular I >> wrote a parallel code >> for molecular dynamics simulations. The program splits the work in a >> matrix of procs and >> I send messages along rows and columns in an equal basis. I learnt >> that the typical >> arrangement of cartesian topology is not usually the best option, >> because in a matrix, let's say of 4x4 procs with quad procs, the >> procs are arranged so that >> through columns one stays inside the same quad proc and through rows >> you are always going out to the network. This means procs are >> arranged as one quad per row. >> >> I try to explain this for a 2x2 case. The cartesian topology does this >> assignment, typically: >> cartesianmpi_comm_world >> 0,0 --> 0 >> 0,1 --> 1 >> 1,0 --> 2 >> 1,1 --> 3 >> The question is, how do I get a "user defined" assignment such as: >> 0,0 --> 0 >> 0,1 --> 2 >> 1,0 --> 1 >> 1,1 --> 3 >> >> ? >> >> Thanks in advance and I hope to have made this more or less >> understandable. >> Claudio >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Redefine proc in cartesian topologies
mpirun -np 4 --host node1,node2,node1,node2 ./app Is this what you want? On Thu, Mar 1, 2012 at 10:57 AM, Claudio Pastorino < claudio.pastor...@gmail.com> wrote: > Dear all, > I apologize in advance if this is not the right list to post this. I > am a newcomer and please let me know if I should be sending this to > another list. > > I program MPI trying to do HPC parallel programs. In particular I > wrote a parallel code > for molecular dynamics simulations. The program splits the work in a > matrix of procs and > I send messages along rows and columns in an equal basis. I learnt > that the typical > arrangement of cartesian topology is not usually the best option, > because in a matrix, let's say of 4x4 procs with quad procs, the > procs are arranged so that > through columns one stays inside the same quad proc and through rows > you are always going out to the network. This means procs are > arranged as one quad per row. > > I try to explain this for a 2x2 case. The cartesian topology does this > assignment, typically: > cartesianmpi_comm_world > 0,0 --> 0 > 0,1 --> 1 > 1,0 --> 2 > 1,1 --> 3 > The question is, how do I get a "user defined" assignment such as: > 0,0 --> 0 > 0,1 --> 2 > 1,0 --> 1 > 1,1 --> 3 > > ? > > Thanks in advance and I hope to have made this more or less > understandable. > Claudio > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Redefine proc in cartesian topologies
Is it really the rank that matters, or where the rank is located? For example, you could leave the ranks as assigned by the cartesian topology, but then map them so that ranks 0 and 2 share a node, 1 and 3 share a node, etc. Is that what you are trying to achieve? On Mar 1, 2012, at 11:57 AM, Claudio Pastorino wrote: > Dear all, > I apologize in advance if this is not the right list to post this. I > am a newcomer and please let me know if I should be sending this to > another list. > > I program MPI trying to do HPC parallel programs. In particular I > wrote a parallel code > for molecular dynamics simulations. The program splits the work in a > matrix of procs and > I send messages along rows and columns in an equal basis. I learnt > that the typical > arrangement of cartesian topology is not usually the best option, > because in a matrix, let's say of 4x4 procs with quad procs, the > procs are arranged so that > through columns one stays inside the same quad proc and through rows > you are always going out to the network. This means procs are > arranged as one quad per row. > > I try to explain this for a 2x2 case. The cartesian topology does this > assignment, typically: > cartesianmpi_comm_world > 0,0 --> 0 > 0,1 --> 1 > 1,0 --> 2 > 1,1 --> 3 > The question is, how do I get a "user defined" assignment such as: > 0,0 --> 0 > 0,1 --> 2 > 1,0 --> 1 > 1,1 --> 3 > > ? > > Thanks in advance and I hope to have made this more or less understandable. > Claudio > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Redefine proc in cartesian topologies
Dear all,
I apologize in advance if this is not the right list to post this. I am a newcomer, so please let me know if I should be sending this to another list.

I write MPI programs for HPC. In particular, I wrote a parallel code for molecular dynamics simulations. The program splits the work across a matrix of procs, and I send messages along rows and columns on an equal basis. I have learnt that the typical arrangement of the cartesian topology is not usually the best option, because in a matrix of, say, 4x4 procs on quad-core processors, the procs are arranged so that along a column you stay inside the same quad-core processor, while along a row you always go out to the network. This means the procs are arranged as one quad per row.

Let me try to explain this with a 2x2 case. The cartesian topology typically does this assignment:

cartesian   mpi_comm_world
0,0    -->  0
0,1    -->  1
1,0    -->  2
1,1    -->  3

The question is: how do I get a "user defined" assignment such as:

0,0    -->  0
0,1    -->  2
1,0    -->  1
1,1    -->  3

?

Thanks in advance, and I hope to have made this more or less understandable.
Claudio
Re: [OMPI users] compilation error with pgcc Unknown switch
Did you do a full autogen / configure / make clean / make all ? On Mar 1, 2012, at 8:53 AM, Abhinav Sarje wrote: > Thanks Ralph. That did help, but only till the next hurdle. Now the > build fails at the following point with an 'undefined reference': > --- > Making all in tools/ompi_info > make[2]: Entering directory > `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info' > CC ompi_info.o > CC output.o > CC param.o > CC components.o > CC version.o > CCLD ompi_info > ../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64' > make[2]: *** [ompi_info] Error 2 > make[2]: Leaving directory > `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory > `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' > make: *** [all-recursive] Error 1 > --- > > > > > > > On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castainwrote: >> You need to update your source code - this was identified and fixed on Wed. >> Unfortunately, our trunk is a developer's environment. While we try hard to >> keep it fully functional, bugs do occasionally work their way into the code. >> >> On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote: >> >>> Hi Nathan, >>> >>> I tried building on an internal login node, and it did not fail at the >>> previous point. But, after compiling for a very long time, it failed >>> while building libmpi.la, with a multiple definition error: >>> -- >>> ... 
>>> CC mpiext/mpiext.lo >>> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo >>> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo >>> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo >>> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo >>> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo >>> CCLD mpi/f77/base/libmpi_f77_base.la >>> CCLD libmpi.la >>> mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o): >>> In function `local_heap_sort': >>> /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c:: >>> multiple definition of `local_heap_sort' >>> mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929: >>> first defined here >>> make[2]: *** [libmpi.la] Error 2 >>> make[2]: Leaving directory >>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' >>> make[1]: *** [all-recursive] Error 1 >>> make[1]: Leaving directory >>> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' >>> make: *** [all-recursive] Error 1 >>> -- >>> >>> Any idea why this is happening, and how to fix it? Again, I am using >>> the XE6 platform configuration file. >>> >>> Abhinav. >>> >>> On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm wrote: On Mon, 27 Feb 2012, Abhinav Sarje wrote: > Hi Nathan, Gus, Manju, > > I got a chance to try out the XE6 support build, but with no success. > First I was getting this error: "PGC-F-0010-File write error occurred > (temporary pragma .s file)". After searching online about this error, > I saw that there is a patch at > > "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch; > for this particular error. 
> > With the patched version, I did not get this error anymore, but got > the unknown switch flag error for the flag "-march=amdfam10" > (specified in the XE6 configuration in the dev trunk) at a particular > point even if I use the '-noswitcherror' flag with the pgcc compiler. > > If I remove this flag (-march=amdfam10), the build fails later at the > following point: > - > Making all in mca/ras/alps > make[2]: Entering directory > `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' > CC ras_alps_component.lo > CC ras_alps_module.lo > PGC-F-0206-Can't find include file alps/apInfo.h > (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37) > PGC/x86-64 Linux 11.10-0: compilation aborted > make[2]: *** [ras_alps_module.lo] Error 1 > make[2]: Leaving directory > `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte' > make: *** [all-recursive] Error 1 > -- This is a known issue with Cray's frontend environment. Build on one of the internal login nodes. -Nathan ___ users mailing
Re: [OMPI users] compilation error with pgcc Unknown switch
Thanks Ralph. That did help, but only till the next hurdle. Now the build fails at the following point with an 'undefined reference': --- Making all in tools/ompi_info make[2]: Entering directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info' CC ompi_info.o CC output.o CC param.o CC components.o CC version.o CCLD ompi_info ../../../ompi/.libs/libmpi.so: undefined reference to `opal_atomic_swap_64' make[2]: *** [ompi_info] Error 2 make[2]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/tools/ompi_info' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' make: *** [all-recursive] Error 1 --- On Thu, Mar 1, 2012 at 5:25 PM, Ralph Castainwrote: > You need to update your source code - this was identified and fixed on Wed. > Unfortunately, our trunk is a developer's environment. While we try hard to > keep it fully functional, bugs do occasionally work their way into the code. > > On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote: > >> Hi Nathan, >> >> I tried building on an internal login node, and it did not fail at the >> previous point. But, after compiling for a very long time, it failed >> while building libmpi.la, with a multiple definition error: >> -- >> ... 
>> CC mpiext/mpiext.lo >> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo >> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo >> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo >> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo >> CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo >> CCLD mpi/f77/base/libmpi_f77_base.la >> CCLD libmpi.la >> mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o): >> In function `local_heap_sort': >> /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c:: >> multiple definition of `local_heap_sort' >> mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929: >> first defined here >> make[2]: *** [libmpi.la] Error 2 >> make[2]: Leaving directory >> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' >> make[1]: *** [all-recursive] Error 1 >> make[1]: Leaving directory >> `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' >> make: *** [all-recursive] Error 1 >> -- >> >> Any idea why this is happening, and how to fix it? Again, I am using >> the XE6 platform configuration file. >> >> Abhinav. >> >> On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm wrote: >>> >>> >>> On Mon, 27 Feb 2012, Abhinav Sarje wrote: >>> Hi Nathan, Gus, Manju, I got a chance to try out the XE6 support build, but with no success. First I was getting this error: "PGC-F-0010-File write error occurred (temporary pragma .s file)". After searching online about this error, I saw that there is a patch at "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch; for this particular error. 
With the patched version, I did not get this error anymore, but got the unknown switch flag error for the flag "-march=amdfam10" (specified in the XE6 configuration in the dev trunk) at a particular point even if I use the '-noswitcherror' flag with the pgcc compiler. If I remove this flag (-march=amdfam10), the build fails later at the following point: - Making all in mca/ras/alps make[2]: Entering directory `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' CC ras_alps_component.lo CC ras_alps_module.lo PGC-F-0206-Can't find include file alps/apInfo.h (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37) PGC/x86-64 Linux 11.10-0: compilation aborted make[2]: *** [ras_alps_module.lo] Error 1 make[2]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte' make: *** [all-recursive] Error 1 -- >>> >>> >>> This is a known issue with Cray's frontend environment. Build on one of the >>> internal login nodes. >>> >>> >>> -Nathan >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > >
Re: [OMPI users] Simple question on GRID
You can use CyberIntegrator (http://isda.ncsa.uiuc.edu/cyberintegrator/), developed by NCSA, or UNICORE (http://www.unicore.eu/), developed by Julich, to integrate the resources. best, madel From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Shaandar Nyamtulga Sent: Thursday, March 01, 2012 7:10 AM To: us...@open-mpi.org Subject: [OMPI users] Simple question on GRID Hi, I have two Beowulf clusters (both Ubuntu 10.10, one running Open MPI, one MPICH2). They run separately in their local network environments. I know there is a way to integrate them through the Internet, presumably with grid software. Is there any tutorial for doing this?
Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1
On Mar 1, 2012, at 1:17 AM, Jingcha Joba wrote: > Aah... > So when openMPI is compile with OFED, and run on a Infiniband/RoCE devices, I > would use the mpi would simply direct to ofed to do point to point calls in > the ofed way? I'm not quite sure how to parse that. :-) The openib BTL uses verbs functions to effect data transfers between MPI process peers. The BTL is one of the lower layers in Open MPI for point-to-point communication; BTL plugins are used to effect the device-specific transport stuff for MPI_SEND, MPI_RECV, MPI_PUT, ...etc. Hence, when you run with the openib BTL and call MPI_SEND (assumedly to a peer that is reachable via an OpenFabrics device), the openib BTL will eventually be called to actually send the message. The openib BTL will send the message to the peer via calls to some combination of calls to verbs functions. Mellanox has also introduced a library called "MXM" that can also be used for underlying MPI message transport (as opposed to using the openib BTL). See the Open MPI README for some explanations about the different transports that Open MPI can use (specifically: "ob1" vs. "cm"). > > More specifically: all things being equal, you don't care which is used. > > You just want your message to get to the receiver/target as fast as > > possible. One of the main ideas of MPI is to hide those kinds of details > > from the user. I.e., you call MPI_SEND. A miracle occurs. The message is > > received on the other side. > > True. Its just that I am digging into the OFED source code and the ompi > source code,and trying to understand the way these two interact.. The openib BTL is probably one of the most complex sections of Open MPI, unfortunately. :-\ The verbs API is *quite* complex, and has many different options that do not work on all types of OpenFabrics hardware. This leads to many different blocks of code, not all of which are executed on all platforms. 
The verbs model of registering memory also leads to a lot of complications, especially since, for performance reasons, MPI has to cache memory registrations and interpose itself in the memory subsystem to catch when registered memory is freed (see the README for some details here). If you have any specific questions about the implementation, post over on the devel list. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1
I would just ignore these tests: 1. The use of MPI one-sided functionality is extremely rare out in the real world. 2. Brian said there were probably bugs in Open MPI's implementation of the MPI one-sided functionality itself, and he's in the middle of re-writing the one-sided functionality anyway. On Mar 1, 2012, at 1:26 AM, Jingcha Joba wrote: > Well, as Jeff says, looks like its to do with the 1 sided comm. > > But the reason why I said was because of what I experienced a couple of > months ago: When I had a Myri-10G and an Intel gigabit ethernet card lying > around, I wanted to test the kernel bypass using open-mx stack and I ran the > osu benchmark. > Though all the tests worked fine with the Myri 10g, I seemed to get this > "hanging" issue when running using Intel Gigabit ethernet, esp for a size > more than 1K on put/get / bcast. I tried with the tcp stack instead of mx, > and it seemed to work fine, though with bad latency numbers (which is kind of > obvious, considering that cpu overhead due to tcp). > I never really got a change to dig deep, but I was pretty much sure that this > is to do with the open-mx. > > > On Wed, Feb 29, 2012 at 9:13 PM, Venkateswara Rao Dokku> wrote: > Hi, > I tried executing those tests with the other devices like tcp instead > of ib with the same open-mpi 1.4.3.. It went fine but it took time to > execute, when i tried to execute the same test on the customized OFED ,tests > are hanging at the same message size.. > > Can u please tel me, what could me the possible issue over there, so that you > can narrow down the issue.. > i.e.. Do i have to move to open-mpi 1.5 tree or there is a issue with the > customized OFED ( in RDMA scenario's or anything (if u can specify)). 
> > > On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres wrote: > On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote: > > > So if I understand correctly, if a message size is smaller than it will use > > the MPI way (non-RDMA, 2 way communication), if its larger, then it would > > use the Open Fabrics, by using the ibverbs (and ofed stack) instead of > > using the MPI's stack? > > Er... no. > > So let's talk MPI-over-OpenFabrics-verbs specifically. > > All MPI communication calls will use verbs under the covers. They may use > verbs send/receive semantics in some cases, and RDMA semantics in other > cases. "It depends" -- on a lot of things, actually. It's hard to come up > with a good rule of thumb for when it uses one or the other; this is one of > the reasons that the openib BTL code is so complex. :-) > > The main points here are: > > 1. you can trust the openib BTL to do the Best thing possible to get the > message to the other side. Regardless of whether that message is an MPI_SEND > or an MPI_PUT (for example). > > 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET > does not necessarily == verbs RDMA read). > > > If so, could that be the reason why the MPI_Put "hangs" when sending a > > message more than 512KB (or may be 1MB)? > > No. I'm guessing that there's some kind of bug in the MPI_PUT implementation. > > > Also is there a way to know if for a particular MPI call, OF uses send/recv > > or RDMA exchange? > > Not really. > > More specifically: all things being equal, you don't care which is used. You > just want your message to get to the receiver/target as fast as possible. > One of the main ideas of MPI is to hide those kinds of details from the user. > I.e., you call MPI_SEND. A miracle occurs. The message is received on the > other side. 
> > :-) > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > -- > Thanks & Regards, > D.Venkateswara Rao, > Software Engineer,One Convergence Devices Pvt Ltd., > Jubille Hills,Hyderabad. > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Very slow MPI_GATHER
On Mar 1, 2012, at 3:33 AM, Pinero, Pedro_jose wrote: > I am launching 200 light processes in two computers with 8 cores each one > (Intel i7 processor). They are dedicated and are interconnected through a > point-to-point Gigabit Ethernet link. > > I read about oversubscribing nodes in the open-mpi documentation, and for > that reason I am using the option > > -Mca mpi_yield_when_idle 1 That's still going to give you terrible performance. Open MPI was designed to run basically at one process per processor (usually a core). The easiest reason to cite here is that Open MPI busy-polls while blocking for message passing progress. The yield_when_idle option *helps* (in some versions of Linux, at least), but it doesn't change that fact that MPI processes will be extremely aggressive in clamoring for CPU cycles. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format errorI
I am able to run the application with LSF now. It's strange, because I wasn't able to trace any error. On Thu, Mar 1, 2012 at 11:34 AM, PukkiMonkey wrote: > What Jeff means is that because u didn't have echo "mpirun...>>outfile" > but > echo mpirun>>outfile , > you were piping the output to the outfile instead of stdout. > > Sent from my iPhone > > On Feb 29, 2012, at 8:44 PM, Syed Ahsan Ali wrote: > > Sorry Jeff I couldn't get you point. > > On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres wrote: > >> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote: >> >> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile >> ${ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1 >> ./hrm >> ${OUTFILE}_hrm 2>&1 >> > [pmdtest@pmd02 d00_dayfiles]$ >> >> Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm >> file, not stdout. >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> > -- Syed Ahsan Ali Bokhari Electronic Engineer (EE) Research & Development Division Pakistan Meteorological Department H-8/4, Islamabad. Phone # off +92518358714 Cell # +923155145014
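[Editor's note] Jeff's redirection point above, illustrated as a quick shell sketch (hypothetical scratch-file name): with `>> file 2>&1`, both stdout and stderr are appended to the file, so nothing appears on the terminal even though the command ran.

```shell
# With '>> file 2>&1', stdout and stderr both go to the file,
# so the terminal shows nothing -- the command did run.
out=/tmp/redir_demo.txt                     # hypothetical scratch file
rm -f "$out"
echo "application output" >> "$out" 2>&1    # nothing printed here
cat "$out"                                  # prints: application output
rm -f "$out"
```

This is why the echoed mpirun command appeared to produce no output: it had landed in `${OUTFILE}_hrm` instead of on stdout.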
Re: [OMPI users] Very slow MPI_GATHER
Wow - with that heavy an oversubscription, your performance experience certainly is reasonable. Not much you can do about it except reduce the oversubscription, either by increasing the number of computers or reducing the number of processes. On Mar 1, 2012, at 1:33 AM, Pinero, Pedro_jose wrote: > Thank you for your fast response. > > I am launching 200 light processes in two computers with 8 cores each one > (Intel i7 processor). They are dedicated and are interconnected through a > point-to-point Gigabit Ethernet link. > > I read about oversubscribing nodes in the open-mpi documentation, and for > that reason I am using the option > > -Mca mpi_yield_when_idle 1 > > Regards > > Pedro > > > > >>On Feb 29, 2012, at 11:01 AM, Pinero, Pedro_jose wrote: > > >> I am using OMPI v.1.5.5 to communicate 200 Processes in a 2-Computers > >> cluster connected through Ethernet, obtaining a very poor performance. > > >Let me make sure I'm parsing this statement properly: are you launching > >200 MPI processes on 2 computers? If so, do >those computers each have 100 > >cores? > > >I ask because oversubscribing MPI processes (i.e., putting more than 1 > >process per core) will be disastrous to >performance. > > >> I have measured each operation time and I have realised that the > >> MPI_Gather operation takes about 1 second in each >>synchronization (only > >> an integer is sent in each case). Is this time range normal or do I have a > >> synchronization >>problem? Is there any way to improve this performance? > > >I'm afraid I can't say more without more information about your hardware and > >software setup. Is this a dedicated HPC >cluster? Are you oversubscribing > >the cores? What kind of Ethernet switching gear do you have? ...etc. > > >-- > >Jeff Squyres > >jsquy...@cisco.com > >For corporate legal information go to: > >http://www.cisco.com/web/about/doing_business/legal/cri/
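The arithmetic behind that point, as a quick sketch: 200 ranks spread over 2 nodes with 8 cores each leaves every core time-slicing roughly a dozen busy-polling processes.

```shell
# Oversubscription factor for the setup described in this thread.
procs=200; nodes=2; cores_per_node=8
total_cores=$((nodes * cores_per_node))
factor=$((procs / total_cores))
echo "$procs ranks on $total_cores cores: ~${factor}x oversubscribed"
# → 200 ranks on 16 cores: ~12x oversubscribed
```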
Re: [OMPI users] compilation error with pgcc Unknown switch
You need to update your source code - this was identified and fixed on Wed. Unfortunately, our trunk is a developer's environment. While we try hard to keep it fully functional, bugs do occasionally work their way into the code. On Mar 1, 2012, at 1:37 AM, Abhinav Sarje wrote: > Hi Nathan, > > I tried building on an internal login node, and it did not fail at the > previous point. But, after compiling for a very long time, it failed > while building libmpi.la, with a multiple definition error: > -- > ... > CC mpiext/mpiext.lo > CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo > CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo > CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo > CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo > CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo > CCLD mpi/f77/base/libmpi_f77_base.la > CCLD libmpi.la > mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o): > In function `local_heap_sort': > /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c:: > multiple definition of `local_heap_sort' > mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929: > first defined here > make[2]: *** [libmpi.la] Error 2 > make[2]: Leaving directory > `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory > `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' > make: *** [all-recursive] Error 1 > -- > > Any idea why this is happening, and how to fix it? Again, I am using > the XE6 platform configuration file. > > Abhinav. 
> > On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm wrote: >> >> >> On Mon, 27 Feb 2012, Abhinav Sarje wrote: >> >>> Hi Nathan, Gus, Manju, >>> >>> I got a chance to try out the XE6 support build, but with no success. >>> First I was getting this error: "PGC-F-0010-File write error occurred >>> (temporary pragma .s file)". After searching online about this error, >>> I saw that there is a patch at >>> >>> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch" >>> for this particular error. >>> >>> With the patched version, I did not get this error anymore, but got >>> the unknown switch flag error for the flag "-march=amdfam10" >>> (specified in the XE6 configuration in the dev trunk) at a particular >>> point even if I use the '-noswitcherror' flag with the pgcc compiler. >>> >>> If I remove this flag (-march=amdfam10), the build fails later at the >>> following point: >>> - >>> Making all in mca/ras/alps >>> make[2]: Entering directory >>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' >>> CC ras_alps_component.lo >>> CC ras_alps_module.lo >>> PGC-F-0206-Can't find include file alps/apInfo.h >>> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37) >>> PGC/x86-64 Linux 11.10-0: compilation aborted >>> make[2]: *** [ras_alps_module.lo] Error 1 >>> make[2]: Leaving directory >>> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' >>> make[1]: *** [all-recursive] Error 1 >>> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte' >>> make: *** [all-recursive] Error 1 >>> -- >> >> >> This is a known issue with Cray's frontend environment. Build on one of the >> internal login nodes. >> >> >> -Nathan
Re: [OMPI users] compilation error with pgcc Unknown switch
Hi Nathan, I tried building on an internal login node, and it did not fail at the previous point. But, after compiling for a very long time, it failed while building libmpi.la, with a multiple definition error: -- ... CC mpiext/mpiext.lo CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-attr_fn_f.lo CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-conversion_fn_null_f.lo CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-f90_accessors.lo CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-strings.lo CC mpi/f77/base/mpi_f77_base_libmpi_f77_base_la-test_constants_f.lo CCLD mpi/f77/base/libmpi_f77_base.la CCLD libmpi.la mca/fcoll/dynamic/.libs/libmca_fcoll_dynamic.a(fcoll_dynamic_file_write_all.o): In function `local_heap_sort': /global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/dynamic/../../../../../ompi/mca/fcoll/dynamic/fcoll_dynamic_file_write_all.c:: multiple definition of `local_heap_sort' mca/fcoll/static/.libs/libmca_fcoll_static.a(fcoll_static_file_write_all.o):/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi/mca/fcoll/static/../../../../../ompi/mca/fcoll/static/fcoll_static_file_write_all.c:929: first defined here make[2]: *** [libmpi.la] Error 2 make[2]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/global/u1/a/asarje/hopper/openmpi-dev-trunk/build/ompi' make: *** [all-recursive] Error 1 -- Any idea why this is happening, and how to fix it? Again, I am using the XE6 platform configuration file. Abhinav. On Wed, Feb 29, 2012 at 12:13 AM, Nathan Hjelm wrote: > > > On Mon, 27 Feb 2012, Abhinav Sarje wrote: > >> Hi Nathan, Gus, Manju, >> >> I got a chance to try out the XE6 support build, but with no success. >> First I was getting this error: "PGC-F-0010-File write error occurred >> (temporary pragma .s file)". 
After searching online about this error, >> I saw that there is a patch at >> >> "https://svn.open-mpi.org/trac/ompi/attachment/ticket/2913/openmpi-trunk-ident_string.patch" >> for this particular error. >> >> With the patched version, I did not get this error anymore, but got >> the unknown switch flag error for the flag "-march=amdfam10" >> (specified in the XE6 configuration in the dev trunk) at a particular >> point even if I use the '-noswitcherror' flag with the pgcc compiler. >> >> If I remove this flag (-march=amdfam10), the build fails later at the >> following point: >> - >> Making all in mca/ras/alps >> make[2]: Entering directory >> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' >> CC ras_alps_component.lo >> CC ras_alps_module.lo >> PGC-F-0206-Can't find include file alps/apInfo.h >> (../../../../../orte/mca/ras/alps/ras_alps_module.c: 37) >> PGC/x86-64 Linux 11.10-0: compilation aborted >> make[2]: *** [ras_alps_module.lo] Error 1 >> make[2]: Leaving directory >> `/{mydir}/openmpi-dev-trunk/build/orte/mca/ras/alps' >> make[1]: *** [all-recursive] Error 1 >> make[1]: Leaving directory `/{mydir}/openmpi-dev-trunk/build/orte' >> make: *** [all-recursive] Error 1 >> -- > > > This is a known issue with Cray's frontend environment. Build on one of the > internal login nodes. > > > -Nathan
Re: [OMPI users] Very slow MPI_GATHER
Thank you for your fast response. I am launching 200 light processes in two computers with 8 cores each one (Intel i7 processor). They are dedicated and are interconnected through a point-to-point Gigabit Ethernet link. I read about oversubscribing nodes in the open-mpi documentation, and for that reason I am using the option -Mca mpi_yield_when_idle 1 Regards Pedro >>On Feb 29, 2012, at 11:01 AM, Pinero, Pedro_jose wrote: >> I am using OMPI v.1.5.5 to communicate 200 Processes in a 2-Computers cluster connected through Ethernet, obtaining a very poor performance. >Let me make sure I'm parsing this statement properly: are you launching 200 MPI processes on 2 computers? If so, do >those computers each have 100 cores? >I ask because oversubscribing MPI processes (i.e., putting more than 1 process per core) will be disastrous to >performance. >> I have measured each operation time and I have realised that the MPI_Gather operation takes about 1 second in each >>synchronization (only an integer is sent in each case). Is this time range normal or do I have a synchronization >>problem? Is there any way to improve this performance? >I'm afraid I can't say more without more information about your hardware and software setup. Is this a dedicated HPC >cluster? Are you oversubscribing the cores? What kind of Ethernet switching gear do you have? ...etc. >-- >Jeff Squyres >jsquy...@cisco.com >For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error
What Jeff means is that because you didn't have echo "mpirun...>>outfile" but echo mpirun>>outfile , you were piping the output to the outfile instead of stdout. Sent from my iPhone On Feb 29, 2012, at 8:44 PM, Syed Ahsan Ali wrote: > Sorry Jeff I couldn't get your point. > > On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres wrote: > On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote: > > > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile > > $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1 > > ./hrm >> ${OUTFILE}_hrm 2>&1 > > [pmdtest@pmd02 d00_dayfiles]$ > > Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm file, > not stdout. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > -- > Syed Ahsan Ali Bokhari > Electronic Engineer (EE) > > Research & Development Division > Pakistan Meteorological Department H-8/4, Islamabad. > Phone # off +92518358714 > Cell # +923155145014 >
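The point can be sketched in a couple of shell lines: an unquoted >> ... 2>&1 on the echo line redirects echo's own output into the file, leaving stdout empty, while a quoted redirection is mere text inside the printed string.

```shell
outfile=$(mktemp)
# Unquoted redirection: applies to echo itself, so stdout captures nothing.
captured=$(echo "mpirun -np 4 ./hrm" >> "$outfile" 2>&1)
in_file=$(cat "$outfile")
# Quoted redirection: just characters inside the printed string.
printed=$(echo "mpirun -np 4 ./hrm >> outfile_hrm 2>&1")
echo "stdout=[$captured] file=[$in_file] printed=[$printed]"
```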
Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1
Well, as Jeff says, looks like it's to do with the one-sided comm. But the reason I said that was because of what I experienced a couple of months ago: When I had a Myri-10G and an Intel gigabit ethernet card lying around, I wanted to test the kernel bypass using the open-mx stack and I ran the osu benchmark. Though all the tests worked fine with the Myri 10g, I seemed to get this "hanging" issue when running using Intel Gigabit ethernet, esp. for sizes more than 1K on put/get/bcast. I tried with the tcp stack instead of mx, and it seemed to work fine, though with bad latency numbers (which is kind of obvious, considering the cpu overhead due to tcp). I never really got a chance to dig deep, but I was pretty much sure that this is to do with the open-mx. On Wed, Feb 29, 2012 at 9:13 PM, Venkateswara Rao Dokku wrote: > Hi, > I tried executing those tests with the other devices like tcp > instead of ib with the same open-mpi 1.4.3.. It went fine but it took time > to execute; when I tried to execute the same test on the customized OFED, > tests are hanging at the same message size.. > > Can you please tell me what could be the possible issue there, so that > we can narrow it down, > i.e., do I have to move to the open-mpi 1.5 tree, or is there an issue with the > customized OFED (in RDMA scenarios or anything else, if you can specify)? > > > On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres wrote: > >> On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote: >> >> > So if I understand correctly, if a message size is smaller than it will >> use the MPI way (non-RDMA, 2 way communication), if its larger, then it >> would use the Open Fabrics, by using the ibverbs (and ofed stack) instead >> of using the MPI's stack? >> >> Er... no. >> >> So let's talk MPI-over-OpenFabrics-verbs specifically. >> >> All MPI communication calls will use verbs under the covers. They may >> use verbs send/receive semantics in some cases, and RDMA semantics in other >> cases. 
"It depends" -- on a lot of things, actually. It's hard to come up >> with a good rule of thumb for when it uses one or the other; this is one of >> the reasons that the openib BTL code is so complex. :-) >> >> The main points here are: >> >> 1. you can trust the openib BTL to do the Best thing possible to get the >> message to the other side. Regardless of whether that message is an >> MPI_SEND or an MPI_PUT (for example). >> >> 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, >> MPI_GET does not necessarily == verbs RDMA read). >> >> > If so, could that be the reason why the MPI_Put "hangs" when sending a >> message more than 512KB (or may be 1MB)? >> >> No. I'm guessing that there's some kind of bug in the MPI_PUT >> implementation. >> >> > Also is there a way to know if for a particular MPI call, OF uses >> send/recv or RDMA exchange? >> >> Not really. >> >> More specifically: all things being equal, you don't care which is used. >> You just want your message to get to the receiver/target as fast as >> possible. One of the main ideas of MPI is to hide those kinds of details >> from the user. I.e., you call MPI_SEND. A miracle occurs. The message is >> received on the other side. >> >> :-) >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > > -- > Thanks & Regards, > D.Venkateswara Rao, > Software Engineer,One Convergence Devices Pvt Ltd., > Jubille Hills,Hyderabad. > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Simple question on GRID
Hi Shaandar, this is not a simple question! If you want to bring your cluster into the Grid, you first have to decide which Grid, because the different Grids use different Grid software. Having taken this decision, I would recommend looking at the web page of that Grid community; usually you can find instructions there on how to integrate your cluster into their Grid. Depending on the Grid software used, these instructions can be very different, therefore I cannot be more precise here. If you decide on a Grid which is using the Globus software, feel free to contact me for further questions. In the case of Globus I can help you... Best wishes Alexander Hi I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is MPICH2). They run separately in their local network environment. I know there is a way to integrate them through Internet, presumably by Grid software, I guess. Is there any tutorial to do this?
Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1
Aah... So when Open MPI is compiled with OFED and run on InfiniBand/RoCE devices, the MPI calls would simply be directed to OFED to do point-to-point calls in the OFED way? > > More specifically: all things being equal, you don't care which is used. > You just want your message to get to the receiver/target as fast as > possible. One of the main ideas of MPI is to hide those kinds of details > from the user. I.e., you call MPI_SEND. A miracle occurs. The message is > received on the other side. > > True. It's just that I am digging into the OFED source code and the ompi source code, and trying to understand the way these two interact.. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1
Hi, I tried executing those tests with the other devices like tcp instead of ib with the same open-mpi 1.4.3.. It went fine but it took time to execute; when I tried to execute the same test on the customized OFED, tests are hanging at the same message size.. Can you please tell me what could be the possible issue there, so that we can narrow it down, i.e., do I have to move to the open-mpi 1.5 tree, or is there an issue with the customized OFED (in RDMA scenarios or anything else, if you can specify)? On Thu, Mar 1, 2012 at 1:45 AM, Jeffrey Squyres wrote: > On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote: > > > So if I understand correctly, if a message size is smaller than it will > use the MPI way (non-RDMA, 2 way communication), if its larger, then it > would use the Open Fabrics, by using the ibverbs (and ofed stack) instead > of using the MPI's stack? > > Er... no. > > So let's talk MPI-over-OpenFabrics-verbs specifically. > > All MPI communication calls will use verbs under the covers. They may use > verbs send/receive semantics in some cases, and RDMA semantics in other > cases. "It depends" -- on a lot of things, actually. It's hard to come up > with a good rule of thumb for when it uses one or the other; this is one of > the reasons that the openib BTL code is so complex. :-) > > The main points here are: > > 1. you can trust the openib BTL to do the Best thing possible to get the > message to the other side. Regardless of whether that message is an > MPI_SEND or an MPI_PUT (for example). > > 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET > does not necessarily == verbs RDMA read). > > > If so, could that be the reason why the MPI_Put "hangs" when sending a > message more than 512KB (or may be 1MB)? > > No. I'm guessing that there's some kind of bug in the MPI_PUT > implementation. > > > Also is there a way to know if for a particular MPI call, OF uses > send/recv or RDMA exchange? > > Not really. 
> > More specifically: all things being equal, you don't care which is used. > You just want your message to get to the receiver/target as fast as > possible. One of the main ideas of MPI is to hide those kinds of details > from the user. I.e., you call MPI_SEND. A miracle occurs. The message is > received on the other side. > > :-) > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Thanks & Regards, D.Venkateswara Rao, Software Engineer,One Convergence Devices Pvt Ltd., Jubille Hills,Hyderabad.
[OMPI users] Simple question on GRID
Hi I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is MPICH2). They run separately in their local network environment. I know there is a way to integrate them through Internet, presumably by Grid software. Is there any tutorial to do this?