My 0.02 US$
first, the root cause of the problem was that a default gateway was
configured on the node, but this gateway was unreachable (a quick check
is sketched below).
imho, this is an incorrect system setting that can lead to unpredictable
results:
- openmpi 1.8.1 works (you are lucky, good for you)
- openmpi 1.8.3 fails (no luck)
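a quick way to check this on a Linux node (a minimal sketch; the gateway
address shown is illustrative):

$ ip route show default
default via 192.168.1.254 dev eth0
$ ping -c 1 -W 2 192.168.1.254
# if the gateway does not answer, fix it or remove the route (as root):
# ip route del default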
Ralph Castain writes:
>> I think there's a problem with the documentation at least not being
>> explicit, and it would really help to have it clarified, unless I'm
>> missing something.
>
> Not quite sure I understand this comment - the problem is that we
> aren’t correctly reading the
Ralph Castain writes:
cn6050 16 par6.q@cn6050
cn6045 16 par6.q@cn6045
>>
>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>
> Hey Reuti
>
> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine
> module, and it
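For reference, a Grid Engine PE_HOSTFILE normally holds one line per
host: hostname, slot count, queue instance, and a processor range (often
UNDEFINED), roughly like this (a sketch per the sge_pe(5) man page; the
fourth column is an assumption about this site's file):

cn6050 16 par6.q@cn6050 UNDEFINED
cn6045 16 par6.q@cn6045 UNDEFINED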
On 11/13/2014 11:14 AM, Ralph Castain wrote:
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
assign different hostnames to their interfaces - I’ve seen it in the
Hadoop world, but not in HPC. Still, no law against it.
No, not so unusual.
I have clusters from respectable
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to assign
different hostnames to their interfaces - I’ve seen it in the Hadoop world, but
not in HPC. Still, no law against it.
This will take a little thought to figure out a solution. One problem that
immediately occurs is
On 13.11.2014 at 00:34, Ralph Castain wrote:
>> On Nov 12, 2014, at 2:45 PM, Reuti wrote:
>>
>> On 12.11.2014 at 17:27, Reuti wrote:
>>
>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>
Another thing you can do is (a) ensure you built with --enable-debug,
Gus,
On 13.11.2014 at 02:59, Gus Correa wrote:
> On 11/12/2014 05:45 PM, Reuti wrote:
>> On 12.11.2014 at 17:27, Reuti wrote:
>>
>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>
Another thing you can do is (a) ensure you built with --enable-debug,
>> and then (b) run it with -mca
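The usual pattern for such a debug run looks roughly like this (a sketch
only: the verbosity parameter is one common choice, not necessarily the
one meant above, and the paths are illustrative):

$ ./configure --prefix=$HOME/openmpi-1.8.3 --enable-debug
$ make -j4 install
$ mpirun --mca plm_base_verbose 5 --host compute-01-01,compute-01-06 ring_c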
On 13.11.2014 at 00:55, Gilles Gouaillardet wrote:
> Could you please send the output of netstat -nr on both the head and
> compute nodes?
Head node:
annemarie:~ # netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0
netstat doesn't show the loopback interface even on the head node, while
ifconfig shows loopback up and running on the compute nodes as well as
the master node.
[root@pmd ~]# netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.3.0 0.0.0.0
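Note that on many Linux systems netstat -nr does not list a loopback
route even when lo is up, because it prints only the main routing table
while the loopback routes live in the local table. The interface itself
can be checked with, e.g.:

$ ip addr show lo            # interface state and the 127.0.0.1/8 address
$ ip route show table local  # where the loopback routes actually live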
but it is running on your head node, isn't it?
you might want to double check why there is no loopback interface on
your compute nodes.
in the meantime, you can disable the lo and ib0 interfaces.
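for example (a sketch; btl_tcp_if_exclude takes a comma-separated list
of interface names or CIDR blocks):

$ mpirun --mca btl_tcp_if_exclude lo,ib0 --host compute-01-01,compute-01-06 ring_c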
Cheers,
Gilles
On 2014/11/13 16:59, Syed Ahsan Ali wrote:
> I don't see it running
>
>
I don't see it running
[pmdtest@compute-01-01 ~]$ netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.108.0 0.0.0.0 255.255.255.0 U 0 0 0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U
This is really weird.
is the loopback interface up and running on both nodes, and with the
same ip?
can you run the following on both compute nodes?
netstat -nr
On 2014/11/13 16:50, Syed Ahsan Ali wrote:
> Now it looks through the loopback address
>
> [pmdtest@pmd ~]$ mpirun --host
Ok ok I can disable that as well.
Thank you guys. :)
On Thu, Nov 13, 2014 at 12:50 PM, Syed Ahsan Ali wrote:
> Now it looks through the loopback address
>
> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_exclude ib0 ring_c
> Process 0 sending
Now it looks through the loopback address
[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
btl_tcp_if_exclude ib0 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
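For anyone reproducing this: ring_c is the ring-passing test program
shipped as examples/ring_c.c in the Open MPI source tree, built with the
wrapper compiler:

$ mpicc examples/ring_c.c -o ring_c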
--mca btl ^openib
disables the openib btl, which is native infiniband only.
ib0 is treated like any TCP interface and is then handled by the tcp btl.
another option is for you to use
--mca btl_tcp_if_exclude ib0
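one caveat (and likely why the traffic showed up on the loopback above):
overriding btl_tcp_if_exclude replaces the default exclude list, which
normally covers the loopback, so lo should be listed as well, as in the
example further up. the built-in default can be checked with:

$ ompi_info --param btl tcp --level 9 | grep btl_tcp_if_exclude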
On 2014/11/13 16:43, Syed Ahsan Ali wrote:
> You are right it is running on 10.0.0.0
You are right, it is running on the 10.0.0.0 interface.

[pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0
mpirun complains about the 192.168.108.10 ip address, but ping reports a
10.0.0.8 address.
is the 192.168.* network a point-to-point network (for example between a
host and a MIC), so that two nodes cannot ping each other via this
address?
/* e.g. from compute-01-01 can you ping the 192.168.108.* ip
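for example (the peer address is hypothetical; substitute the actual
192.168.108.x address of compute-01-06):

[pmdtest@compute-01-01 ~]$ ping -c 1 192.168.108.11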
Same result in both cases
[pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
compute-01-01,compute-01-06 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Hi,
it seems you messed up the command line.
could you try
$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
can you also try to run mpirun from a compute node instead of the head
node?
Cheers,
Gilles
On 2014/11/13 16:07, Syed Ahsan Ali wrote:
> Here is what I see when
Here is what I see when disabling openib support.
[pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c
ssh: orted: Temporary failure in name resolution
ssh: orted: Temporary failure in name resolution
Hi Jeff
No firewall is enabled. Running the diagnostics, I found that the
non-communicating mpi job runs, while ring_c remains stuck. There are of
course warnings for OpenFabrics, but in my case I am running the
application with openib disabled. Please see below:
[pmdtest@pmd ~]$ mpirun --host