On Wed, Dec 29, 2010 at 10:10 AM, Advanced Computing Group University of Padova <acg.un...@gmail.com> wrote:
> Thank you Ralph,
> Your suspicion seems quite interesting :)
> I tried running the same program from node 192.168.1/2.11, also using
> 192.168.2.12, and "tracing" the activity on .12.
> I attach the two files (_succ: using --debug-daemons, _fail: without
> --debug-daemons).
> I notice that the orted daemon on the second node is invoked in a
> different way.
> Moreover, when I launch without --debug-daemons, a process called orted...
> remains active on the second node after I kill (Ctrl+C) the command on the
> first node.
>
> Can you continue to help me?
>
>
> On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> All --debug-daemons really does is keep the ssh session open after
>> launching the remote daemon and turn on some output. Otherwise, we close
>> that session as most systems only allow a limited number of concurrent ssh
>> sessions to be open.
>>
>> I suspect you have a system setting that kills any running job upon ssh
>> close. It would be best if you removed that restriction. If you cannot, then
>> you can always run your MPI jobs with --no-daemonize. This will keep the ssh
>> session open, but without all the debug output.
>>
>> That flag is just shorthand for an MCA param, so you can set it in your
>> environ or put it in your default MCA param file.
>>
>>
>> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of Padova
>> wrote:
>>
>> Yes, I've tested them.
>> In fact, using the --debug-daemons switch, everything works fine! (And I see
>> that on the nodes a process called orted... is started whenever I launch a
>> test application.)
>> I believe this is an environment variable problem....
>>
>> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang <solarbik...@gmail.com> wrote:
>>
>>> Have you tested your ssh key setup, firewall, and switch settings to
>>> ensure all nodes are talking to each other?
>>>
>>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of
>>> Padova <acg.un...@gmail.com> wrote:
>>>
>>>> using openmpi 1.4.2
>>>>
>>>>
>>>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University of
>>>> Padova <acg.un...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am building a small 16-node, Gentoo-based cluster.
>>>>> I successfully installed openmpi and successfully ran some small,
>>>>> simple parallel test programs on a single host, but
>>>>> I can't run a parallel program on more than one node.
>>>>>
>>>>>
>>>>> The nodes are cloned (so they are identical).
>>>>> The mpiuser (and its ssh certificates) uses /home/mpiuser, which is an
>>>>> NFS share.
>>>>> I modified .bashrc:
>>>>>
>>>>> -------------------------
>>>>> PATH=/usr/bin:$PATH ; export PATH ;
>>>>> LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>
>>>>> # already present below
>>>>> if [[ $- != *i* ]] ; then
>>>>>     # Shell is non-interactive. Be done now!
>>>>>     return
>>>>> fi
>>>>> ---------------------
>>>>>
>>>>> The very strange behaviour is that using --debug-daemons lets
>>>>> my program run successfully.....
>>>>>
>>>>> Thank you in advance, and sorry for my bad English.
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>> --
>>> David Zhang
>>> University of California, San Diego
>>>
>>
>
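Ralph's last suggestion above (setting the MCA param behind --no-daemonize in the environment or in a default MCA param file) might look roughly like the sketch below. The thread never names the parameter, so orte_no_daemonize is only a placeholder assumption, as are the helper names mca_export and mca_set_default; check the real name on your install first (for example with ompi_info --param all all). The OMPI_MCA_ variable prefix and $HOME/.openmpi/mca-params.conf are Open MPI's usual conventions.

```shell
# Sketch only. The parameter name is an ASSUMPTION: the thread never names it.
# Look up the real name first, e.g.:  ompi_info --param all all | grep -i daemon
MCA_PARAM="${MCA_PARAM:-orte_no_daemonize}"   # hypothetical placeholder

# Option 1: export the parameter for the current shell.
# Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
mca_export() {
    export "OMPI_MCA_$1=$2"
}

# Option 2: append "<param> = <value>" to an MCA parameter file.
# The per-user default location is $HOME/.openmpi/mca-params.conf;
# an optional third argument overrides the file path.
mca_set_default() {
    conf="${3:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$conf")"
    printf '%s = %s\n' "$1" "$2" >> "$conf"
}
```

With these helpers, running mca_export "$MCA_PARAM" 1 before mpirun affects only the current shell, while mca_set_default "$MCA_PARAM" 1 makes the setting persistent for the user. Because /home/mpiuser is an NFS share here, the file-based variant would be seen from every node.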
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State       PID/Program name    Timer
tcp        0      0 192.168.1.12:37279   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:22      192.168.1.1:47833    ESTABLISHED 11747/0             keepalive (6520.89/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50074   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37283   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:37280   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:44065   192.168.1.11:52888   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37284   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:900     192.168.1.10:2049    ESTABLISHED -                   off (0.00/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50079   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50073   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:707     192.168.1.1:2049     ESTABLISHED -                   off (0.00/0/0)
tcp        0      0 192.168.1.12:37285   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:37281   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50082   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37282   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50072   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:45825   192.168.1.11:51518   ESTABLISHED 12246/orted         off (0.00/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50078   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 127.0.0.1:7634       127.0.0.1:47481      TIME_WAIT   -                   timewait (44.55/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50080   TIME_WAIT   -                   timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37278   192.168.1.12:57255   TIME_WAIT   -                   timewait (58.39/0/0)
tcp        0      0 192.168.2.12:22      192.168.2.11:39690   ESTABLISHED 12243/sshd: root@no keepalive (7196.73/0/0)
tcp        0      0 192.168.1.12:41590   192.168.1.12:50081   TIME_WAIT   -                   timewait (42.12/0/0)

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3892   640 ?        Ss   Dec24   0:02 init [3]
root     11747  0.0  0.0  67256  2992 ?        Ss   09:43   0:00 sshd: root@pts/0
root     11749  0.0  0.0  17980  2032 pts/0    Ss   09:43   0:00 -bash
root     12243  0.0  0.0  67256  2876 ?        Ss   09:54   0:00 sshd: root@notty
root     12245  0.0  0.0   9320  1124 ?        Ss   09:54   0:00 bash -c PATH=/usr/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 3283288064 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "3283288064.0;tcp://192.168.1.11:51518;tcp://192.168.2.11:51518"
root     12246  0.0  0.0  55952  2092 ?        Sl   09:54   0:00 /usr/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 3283288064 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 3283288064.0;tcp://192.168.1.11:51518;tcp://192.168.2.11:51518
root     12299  0.0  0.0  14808   976 pts/0    R+   09:55   0:00 ps aux

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State       PID/Program name    Timer
tcp        0      0 192.168.1.12:22      192.168.1.1:47833    ESTABLISHED 11747/0             keepalive (6405.96/0/0)
tcp        0      0 192.168.1.12:37705   192.168.1.11:44889   ESTABLISHED 12347/orted         off (0.00/0/0)
tcp        0      0 127.0.0.1:7634       127.0.0.1:58811      ESTABLISHED 15817/hddtemp       off (0.00/0/0)
tcp        0      0 192.168.1.12:900     192.168.1.10:2049    ESTABLISHED -                   off (0.00/0/0)
tcp        0      0 192.168.1.12:707     192.168.1.1:2049     ESTABLISHED -                   off (0.00/0/0)
tcp        0      0 127.0.0.1:58811      127.0.0.1:7634       ESTABLISHED 15936/gkrellmd      off (0.00/0/0)

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3892   640 ?        Ss   Dec24   0:02 init [3]
root     11747  0.0  0.0  67256  2992 ?        Ss   09:43   0:00 sshd: root@pts/0
root     11749  0.0  0.0  17980  2036 pts/0    Ss   09:43   0:00 -bash
root     12347  0.0  0.0  55952  1016 ?        Ss   09:56   0:00 /usr/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 3286827008 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 3286827008.0;tcp://192.168.1.11:44889;tcp://192.168.2.11:44889
root     12349  0.0  0.0  14808   976 pts/0    R+   09:56   0:00 ps aux