On Wed, Dec 29, 2010 at 10:10 AM, Advanced Computing Group University of
Padova <acg.un...@gmail.com> wrote:

> Thank you Ralph,
> Your suspicion seems quite interesting :)
> I ran the same program from node 192.168.1/2.11, also using
> 192.168.2.12, and traced the activity on .12.
> I attach the two files (_succ: with --debug-daemons, _fail: without
> --debug-daemons).
> I notice that the orted daemon on the second node is invoked in a
> different way.
> Moreover, when I launch without --debug-daemons, a process called orted...
> remains active on the second node after I kill (Ctrl+C) the command on the
> first node.
>
> Can you continue to help me?
>
>
> On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> All --debug-daemons really does is keep the ssh session open after
>> launching the remote daemon and turn on some output. Otherwise, we close
>> that session as most systems only allow a limited number of concurrent ssh
>> sessions to be open.
>>
>> I suspect you have a system setting that kills any running job upon ssh
>> close. It would be best if you removed that restriction. If you cannot, then
>> you can always run your MPI jobs with --no-daemonize. This will keep the ssh
>> session open, but without all the debug output.
>>
>> That flag is just shorthand for an MCA param, so you can set it in your
>> environ or put it in your default MCA param file.
>>
>>
>> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of Padova
>> wrote:
>>
>> Yes, I've tested them.
>> In fact, with the --debug-daemons switch everything works fine! (And I see
>> that a process called orted... is started on the nodes whenever I launch a
>> test application.)
>> I believe this is an environment variable problem.
>>
>> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang <solarbik...@gmail.com> wrote:
>>
>>> Have you tested your ssh key setup, firewall, and switch settings to
>>> ensure all nodes are talking to each other?
>>>
>>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of
>>> Padova <acg.un...@gmail.com> wrote:
>>>
>>>> I am using Open MPI 1.4.2.
>>>>
>>>>
>>>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University of
>>>> Padova <acg.un...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am building a small 16-node Gentoo-based cluster.
>>>>> I successfully installed Open MPI and successfully ran some simple
>>>>> small parallel test programs on a single host, but
>>>>> I can't run a parallel program on more than one node.
>>>>>
>>>>>
>>>>> The nodes are cloned (so they are identical).
>>>>> The mpiuser account (and its ssh certificates) uses /home/mpiuser,
>>>>> which is an NFS share.
>>>>> I modified .bashrc:
>>>>>
>>>>> -------------------------
>>>>> PATH=/usr/bin:$PATH ; export PATH ;
>>>>> LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>
>>>>> # already present below
>>>>> if [[ $- != *i* ]] ; then
>>>>>         # Shell is non-interactive.  Be done now!
>>>>>         return
>>>>> fi
>>>>> ---------------------
>>>>>
>>>>> The very strange behaviour is that using --debug-daemons lets
>>>>> my program run successfully.
>>>>>
>>>>> Thank you in advance, and sorry for my bad English.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>
>>>
>>> --
>>> David Zhang
>>> University of California, San Diego
>>>
>>>
>>
>
>
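Ralph's remark that the flag is just shorthand for an MCA parameter can be
sketched as follows. This is only a minimal illustration under stated
assumptions: "example_param" is a placeholder name that does not come from the
thread, and the real parameter behind --no-daemonize should be looked up on
the installed version (for instance with "ompi_info --param all all"). The
OMPI_MCA_ environment prefix and the per-user file
$HOME/.openmpi/mca-params.conf are Open MPI's usual mechanisms for the two
options he mentions.

```shell
#!/bin/sh
# Sketch: two ways to set an Open MPI MCA parameter without a command-line flag.
# PARAM is a PLACEHOLDER (assumption), not the confirmed name behind
# --no-daemonize; look up the real name with ompi_info before using this.
PARAM="${PARAM:-example_param}"
VALUE="${VALUE:-1}"
CONF_DIR="${CONF_DIR:-$HOME/.openmpi}"
CONF_FILE="$CONF_DIR/mca-params.conf"

# 1) Environment: Open MPI picks up variables named OMPI_MCA_<param>.
export "OMPI_MCA_${PARAM}=${VALUE}"

# 2) Per-user default file: one "name = value" pair per line.
#    Append only if the parameter is not already set there.
mkdir -p "$CONF_DIR"
touch "$CONF_FILE"
if ! grep -q "^${PARAM}[[:space:]]*=" "$CONF_FILE"; then
    printf '%s = %s\n' "$PARAM" "$VALUE" >> "$CONF_FILE"
fi

# Show what the file now contains for this parameter.
grep "^${PARAM}[[:space:]]*=" "$CONF_FILE"
```

Since /home/mpiuser is an NFS share in this setup, the per-user file would be
visible on every node, whereas an exported variable only affects the shell in
which mpirun is started.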
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.1.12:37279      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:22         192.168.1.1:47833       ESTABLISHED 11747/0              keepalive (6520.89/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50074      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37283      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:37280      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:44065      192.168.1.11:52888      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37284      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:900        192.168.1.10:2049       ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50079      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50073      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:707        192.168.1.1:2049        ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.1.12:37285      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:37281      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50082      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37282      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50072      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:45825      192.168.1.11:51518      ESTABLISHED 12246/orted          off (0.00/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50078      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 127.0.0.1:7634          127.0.0.1:47481         TIME_WAIT   -                    timewait (44.55/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50080      TIME_WAIT   -                    timewait (42.12/0/0)
tcp        0      0 192.168.1.12:37278      192.168.1.12:57255      TIME_WAIT   -                    timewait (58.39/0/0)
tcp        0      0 192.168.2.12:22         192.168.2.11:39690      ESTABLISHED 12243/sshd: root@no  keepalive (7196.73/0/0)
tcp        0      0 192.168.1.12:41590      192.168.1.12:50081      TIME_WAIT   -                    timewait (42.12/0/0)

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3892   640 ?        Ss   Dec24   0:02 init [3]
root     11747  0.0  0.0  67256  2992 ?        Ss   09:43   0:00 sshd: root@pts/0
root     11749  0.0  0.0  17980  2032 pts/0    Ss   09:43   0:00 -bash
root     12243  0.0  0.0  67256  2876 ?        Ss   09:54   0:00 sshd: root@notty
root     12245  0.0  0.0   9320  1124 ?        Ss   09:54   0:00 bash -c  PATH=/usr/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 3283288064 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "3283288064.0;tcp://192.168.1.11:51518;tcp://192.168.2.11:51518"
root     12246  0.0  0.0  55952  2092 ?        Sl   09:54   0:00 /usr/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 3283288064 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 3283288064.0;tcp://192.168.1.11:51518;tcp://192.168.2.11:51518
root     12299  0.0  0.0  14808   976 pts/0    R+   09:55   0:00 ps aux

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0      0 192.168.1.12:22         192.168.1.1:47833       ESTABLISHED 11747/0              keepalive (6405.96/0/0)
tcp        0      0 192.168.1.12:37705      192.168.1.11:44889      ESTABLISHED 12347/orted          off (0.00/0/0)
tcp        0      0 127.0.0.1:7634          127.0.0.1:58811         ESTABLISHED 15817/hddtemp        off (0.00/0/0)
tcp        0      0 192.168.1.12:900        192.168.1.10:2049       ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 192.168.1.12:707        192.168.1.1:2049        ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 127.0.0.1:58811         127.0.0.1:7634          ESTABLISHED 15936/gkrellmd       off (0.00/0/0)

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3892   640 ?        Ss   Dec24   0:02 init [3]
root     11747  0.0  0.0  67256  2992 ?        Ss   09:43   0:00 sshd: root@pts/0
root     11749  0.0  0.0  17980  2036 pts/0    Ss   09:43   0:00 -bash
root     12347  0.0  0.0  55952  1016 ?        Ss   09:56   0:00 /usr/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 3286827008 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 3286827008.0;tcp://192.168.1.11:44889;tcp://192.168.2.11:44889
root     12349  0.0  0.0  14808   976 pts/0    R+   09:56   0:00 ps aux
