By the way, what is the rationale for running in a chroot environment? Is it a Docker-like environment?

does "ibv_devinfo -v" works for you from chroot env?



On Tue, May 26, 2015 at 7:08 AM, Rahul Yadav <robora...@gmail.com> wrote:

> Yes Ralph, MXM cards are on the node. The command runs fine if I run it
> outside of the chroot environment.
>
> Thanks
> Rahul
>
> On Mon, May 25, 2015 at 9:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Well, it isn’t finding any MXM cards on NAE27 - do you have any there?
>>
>> You can’t use yalla without MXM cards on all nodes
>>
>>
>> On May 25, 2015, at 8:51 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>
>> We were able to solve the ssh problem.
>>
>> But now MPI is not able to use the yalla component. We are running the
>> following command:
>>
>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>
>> The command is run in a chroot environment on JARVICENAE27; the other node
>> is JARVICENAE125. JARVICENAE125 is able to select yalla since it is a
>> remote node and thus is not running the job in the chroot environment.
>> But JARVICENAE27 is throwing a few MXM-related errors and yalla is not
>> selected.
>>
>> The following are the verbose logs of the command.
>>
>> Any idea what might be wrong?
>>
>> [1432283901.548917]         sys.c:719  MXM  WARN  Conflicting CPU
>> frequencies detected, using: 2601.00
>> [JARVICENAE125:00909] mca: base: components_register: registering pml
>> components
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component v
>> [JARVICENAE125:00909] mca: base: components_register: component v
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component bfo
>> [JARVICENAE125:00909] mca: base: components_register: component bfo
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component cm
>> [JARVICENAE125:00909] mca: base: components_register: component cm
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component ob1
>> [JARVICENAE125:00909] mca: base: components_register: component ob1
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component yalla
>> [JARVICENAE125:00909] mca: base: components_register: component yalla
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_open: opening pml components
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
>> [JARVICENAE125:00909] mca: base: components_open: component v open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> bfo
>> [JARVICENAE125:00909] mca: base: components_open: component bfo open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> cm
>> [JARVICENAE125:00909] mca: base: components_open: component cm open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> ob1
>> [JARVICENAE125:00909] mca: base: components_open: component ob1 open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> yalla
>> [JARVICENAE125:00909] mca: base: components_open: component yalla open
>> function successful
>> [JARVICENAE125:00909] select: component v not in the include list
>> [JARVICENAE125:00909] select: component bfo not in the include list
>> [JARVICENAE125:00909] select: initializing pml component cm
>> [JARVICENAE27:06474] mca: base: components_register: registering pml
>> components
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component v
>> [JARVICENAE27:06474] mca: base: components_register: component v register
>> function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component bfo
>> [JARVICENAE27:06474] mca: base: components_register: component bfo
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component cm
>> [JARVICENAE27:06474] mca: base: components_register: component cm
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component ob1
>> [JARVICENAE27:06474] mca: base: components_register: component ob1
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component yalla
>> [JARVICENAE27:06474] mca: base: components_register: component yalla
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_open: opening pml components
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
>> [JARVICENAE27:06474] mca: base: components_open: component v open
>> function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component
>> bfo
>> [JARVICENAE27:06474] mca: base: components_open: component bfo open
>> function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component cm
>> libibverbs: Warning: no userspace device-specific driver found for
>> /sys/class/infiniband_verbs/uverbs0
>> [1432283901.559929]         sys.c:719  MXM  WARN  Conflicting CPU
>> frequencies detected, using: 2601.00
>> [1432283901.561294] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
>> There are no Mellanox cards detected.
>> [JARVICENAE27:06474] mca: base: close: component cm closed
>> [JARVICENAE27:06474] mca: base: close: unloading component cm
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component
>> ob1
>> [JARVICENAE27:06474] mca: base: components_open: component ob1 open
>> function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component
>> yalla
>> [1432283901.561732] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
>> There are no Mellanox cards detected.
>> [JARVICENAE27:06474] mca: base: components_open: component yalla open
>> function failed
>> [JARVICENAE27:06474] mca: base: close: component yalla closed
>> [JARVICENAE27:06474] mca: base: close: unloading component yalla
>> [JARVICENAE27:06474] select: component v not in the include list
>> [JARVICENAE27:06474] select: component bfo not in the include list
>> [JARVICENAE27:06474] select: initializing pml component ob1
>> [JARVICENAE27:06474] select: init returned priority 20
>> [JARVICENAE27:06474] selected ob1 best priority 20
>> [JARVICENAE27:06474] select: component ob1 selected
>> [JARVICENAE27:06474] mca: base: close: component v closed
>> [JARVICENAE27:06474] mca: base: close: unloading component v
>> [JARVICENAE27:06474] mca: base: close: component bfo closed
>> [JARVICENAE27:06474] mca: base: close: unloading component bfo
>> [JARVICENAE125:00909] select: init returned priority 30
>> [JARVICENAE125:00909] select: initializing pml component ob1
>> [JARVICENAE125:00909] select: init returned failure for component ob1
>> [JARVICENAE125:00909] select: initializing pml component yalla
>> [JARVICENAE125:00909] select: init returned priority 50
>> [JARVICENAE125:00909] selected yalla best priority 50
>> [JARVICENAE125:00909] select: component cm not selected / finalized
>> [JARVICENAE125:00909] select: component yalla selected
>> [JARVICENAE125:00909] mca: base: close: component v closed
>> [JARVICENAE125:00909] mca: base: close: unloading component v
>> [JARVICENAE125:00909] mca: base: close: component bfo closed
>> [JARVICENAE125:00909] mca: base: close: unloading component bfo
>> [JARVICENAE125:00909] mca: base: close: component cm closed
>> [JARVICENAE125:00909] mca: base: close: unloading component cm
>> [JARVICENAE125:00909] mca: base: close: component ob1 closed
>> [JARVICENAE125:00909] mca: base: close: unloading component ob1
>> [JARVICENAE27:06474] check:select: modex not reqd
>>
>>
>> On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Okay, so we see two nodes have been allocated:
>>>
>>> 1. JARVICENAE27 - appears to be the node where mpirun is running
>>>
>>> 2. 10.3.0.176
>>>
>>> Does that match what you expected?
>>>
>>> If you cannot ssh (without a password) between machines, then we will
>>> not be able to run.
>>>
>>>
>>> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>>>
>>> I get the following output with verbose enabled:
>>>
>>> [JARVICENAE27:00654] mca: base: components_register: registering ras
>>> components
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>>> component loadleveler
>>> [JARVICENAE27:00654] mca: base: components_register: component
>>> loadleveler register function successful
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>>> component simulator
>>> [JARVICENAE27:00654] mca: base: components_register: component simulator
>>> register function successful
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded
>>> component slurm
>>> [JARVICENAE27:00654] mca: base: components_register: component slurm
>>> register function successful
>>> [JARVICENAE27:00654] mca: base: components_open: opening ras components
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>>> loadleveler
>>> [JARVICENAE27:00654] mca: base: components_open: component loadleveler
>>> open function successful
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>>> simulator
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component
>>> slurm
>>> [JARVICENAE27:00654] mca: base: components_open: component slurm open
>>> function successful
>>> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component
>>> [loadleveler]
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
>>> [loadleveler]. Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component
>>> [simulator]
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
>>> [simulator]. Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [slurm]
>>> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [slurm].
>>> Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:(  ras) No component selected!
>>>
>>> ======================   ALLOCATED NODES   ======================
>>>        JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>        10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>>
>>> Also, I am not able to ssh from one machine to the other in the chroot
>>> environment. Can that be a problem?
>>>
>>> Thanks
>>> Rahul
>>>
>>> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Try adding --mca ras_base_verbose 10 to your command line and let's see what
>>>> it thinks it is doing. Which OMPI version are you using - master?
>>>>
>>>>
>>>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have been trying to run MPI jobs (consisting of two different
>>>> binaries, one on each node) on two nodes, using the hostfile option as follows:
>>>>
>>>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>>>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>>>
>>>> We are doing this in a chroot environment. We have set up the HPCX
>>>> environment inside the chroot itself. /root/host1 and /root/host2 (inside the
>>>> chroot) contain the IPs of the two nodes, respectively.
>>>>
>>>> We are getting the following error:
>>>>
>>>> " all nodes which are allocated for this job are already filled "
>>>>
>>>> However, when we use chroot but don't use the hostfile option (both
>>>> processes run on the same node), or when we use the hostfile option outside
>>>> the chroot, it works.
>>>>
>>>> Does anyone have any idea whether chroot can cause the above error, and
>>>> how to solve it?
>>>>
>>>> Thanks
>>>> Rahul



-- 

Kind Regards,

M.
