The long story is that you always need a subnet manager to initialize
the fabric.
That means you can run the subnet manager once and stop it after each
HCA has been assigned a LID.
In that case, the commands that interact with the SM (ibhosts,
ibdiagnet) will obviously fail.
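A minimal sketch of that one-shot sequence (assuming the opensm package is installed and you have root; the port name mlx5_0 is a placeholder for your HCA):

```shell
# Run a single configuration sweep so every HCA on the fabric gets a
# LID, then exit (opensm's -o/--once flag does one sweep and terminates).
opensm --once

# Verify: the port should now report an Active state and a non-zero LID.
ibstat mlx5_0
```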
Cheers,
Gilles
Xie, as far as I know you need to run OpenSM even with just two hosts.
On 15 May 2018 at 03:29, Blade Shieh wrote:
> Hi, John:
>
> You are right on the network framework. I do have no IB switch and just
> connect the servers with an IB cable. I did not even open the opensmd
>
Hi Gilles,
Thank you for pointing out my error on *-N*.
And you are right: I had started the opensmd service earlier, so the link
could be set up correctly. But many IB-related commands, like ibhosts and
ibdiagnet, could not be executed correctly.
As for pml, I am pretty sure I was using ob1, because
Xie Bin,
According to the man page, -N is equivalent to npernode, which is
equivalent to --map-by ppr:N:node.
This is *not* equivalent to --map-by node:
the former packs tasks onto the same node, and the latter scatters tasks
across the nodes.
[gilles@login ~]$ mpirun --host n0:2,n1:2 -N
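To make the contrast concrete, a sketch of the two mappings (the host names n0/n1 and rank counts are illustrative; `hostname` just prints where each rank landed):

```shell
# Pack ranks: -N 2 (i.e. --map-by ppr:2:node) fills each node in turn,
# so ranks 0 and 1 share n0 while ranks 2 and 3 share n1.
mpirun --host n0:2,n1:2 --map-by ppr:2:node hostname

# Scatter ranks: --map-by node deals ranks out round-robin across nodes,
# so ranks 0 and 2 land on n0 while ranks 1 and 3 land on n1.
mpirun --host n0:2,n1:2 -np 4 --map-by node hostname
```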
Hi, George:
My command lines are:
1) single node
mpirun --allow-run-as-root -mca btl self,tcp(or openib) -mca
btl_tcp_if_include eth2 -mca btl_openib_if_include mlx5_0 -x
OMP_NUM_THREADS=2 -n 32 myapp
2) 2-node cluster
mpirun --allow-run-as-root -mca btl ^tcp(or ^openib) -mca
btl_tcp_if_include
Hi, John:
You are right on the network framework. I do have no IB switch and just
connect the servers with an IB cable. I did not even open the opensmd
service because it seems unnecessary in this situation. Can this be the
reason why IB performs poorer?
Interconnection details are in the
Shared memory communication is important for multi-core platforms,
especially when you have multiple processes per node. But this is only part
of your issue here.
You haven't specified how your processes will be mapped onto your resources.
As a result, ranks 0 and 1 will be on the same node, so you
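One way to see and control the placement George is describing (a sketch; --report-bindings is an Open MPI mpirun option that prints where each rank is bound):

```shell
# Print each rank's binding so you can tell whether ranks 0 and 1
# ended up on the same node (and therefore talk over shared memory).
mpirun -np 2 --report-bindings ./osu_latency

# Force the two ranks onto different nodes so the benchmark actually
# exercises the interconnect instead of shared memory.
mpirun -np 2 --map-by node --report-bindings ./osu_latency
```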
Xie Bin, I do hate to ask this. You say "in a two-node cluster (IB
direct-connected)."
Does that mean that you have no IB switch, and that there is a single IB
cable joining up these two servers?
If so, please run: ibstatus, ibhosts, ibdiagnet
I am trying to check if the IB fabric is
Hi, Nathan:
Thanks for your reply.
1) It was my mistake not to notice the usage of osu_latency. Now it works
well, but latency is still poorer with openib.
2) I did not use sm or vader because I wanted to compare performance between
tcp and openib. Besides, I will run the application in a cluster, so vader is
not
I see several problems
1) osu_latency only works with two procs.
2) You explicitly excluded shared-memory support by specifying only self and
openib (or tcp). If you want to just disable tcp or openib, use --mca btl ^tcp
or --mca btl ^openib
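The selections Nathan describes would look like this (a sketch; vader is Open MPI's shared-memory BTL in this era's releases, and osu_latency stands in for the benchmark binary):

```shell
# Disable only tcp, leaving shared memory and openib available:
mpirun -np 2 --mca btl ^tcp ./osu_latency

# Or disable only openib:
mpirun -np 2 --mca btl ^openib ./osu_latency

# If you list transports explicitly, include a shared-memory component
# (vader) so ranks sharing a node do not fall back to a slower path:
mpirun -np 2 --mca btl self,vader,openib ./osu_latency
```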
Also, it looks like you have multiple ports active