Re: [OMPI users] Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran

2015-07-28 Thread Syed Ahsan Ali
Thanks Gilles

It solved my issue. Your support is much appreciated.

Ahsan
On Tue, Jul 28, 2015 at 10:15 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Hi,
>
> You can run "zcat mpi.mod | head" to confirm which gfortran was used
> to build the application:
>
> GFORTRAN module version '10' => gcc 4.8.3
> GFORTRAN module version '12' => gcc 4.9.2
> GFORTRAN module version '14' => gcc 5.1.0
>
> I assume the failing command is mpifort ...
> so you can run
> mpifort -showme ...
> to see how gfortran is invoked.
>
> It is likely that mpifort simply runs gfortran, and your PATH does not point to
> gfortran 4.9.2.
>
> Cheers,
>
> Gilles
>
>
> On 7/28/2015 1:47 PM, Syed Ahsan Ali wrote:
>>
>> I am getting this error during the installation of an application.
>> Apparently the error complains about Open MPI being compiled with a
>> different version of GNU Fortran, but I am sure that it was compiled
>> with gcc-4.9.2. The same compiler is also being used for the
>> application compilation.
>>
>> I am using openmpi-1.8.4
>>
>> Ahsan
>
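For reference, a minimal way to check which gfortran produced an existing
mpi.mod and which compiler the mpifort wrapper will invoke (a sketch: the
commands are the ones suggested above, the output is illustrative, and zcat
assumes a gcc >= 4.9 module file, which is gzip-compressed):

  $ zcat mpi.mod | head -1        # prints e.g. GFORTRAN module version '12' ...
  $ mpifort -showme               # shows the underlying gfortran command line
  $ which gfortran
  $ gfortran --version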


[OMPI users] Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran

2015-07-28 Thread Syed Ahsan Ali
I am getting this error during the installation of an application.
Apparently the error complains about Open MPI being compiled with a
different version of GNU Fortran, but I am sure that it was compiled
with gcc-4.9.2. The same compiler is also being used for the
application compilation.

I am using openmpi-1.8.4

Ahsan


Re: [OMPI users] mpirun fails across cluster

2015-02-27 Thread Syed Ahsan Ali
Oh, sorry. That is related to the application. I need to recompile the
application too, I guess.

On Fri, Feb 27, 2015 at 10:44 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> Dear Gus
>
> Thanks once again for the suggestion. Yes, I did that before installing
> to the new path. I am now getting an error about a library:
> tstint2lm: error while loading shared libraries:
> libmpi_usempif08.so.0: cannot open shared object file: No such file or
> directory
>
> While the library is present:
> [pmdtest@hpc bin]$ locate libmpi_usempif08.so.0
> /state/partition1/apps/openmpi-1.8.4_gcc-4.9.2/lib/libmpi_usempif08.so.0
> /state/partition1/apps/openmpi-1.8.4_gcc-4.9.2/lib/libmpi_usempif08.so.0.6.0
> and it is in the path as well:
>
> echo $LD_LIBRARY_PATH
> /share/apps/openmpi-1.8.4_gcc-4.9.2/lib:/share/apps/libpng-1.6.16/lib:/share/apps/netcdf-fortran-4.4.1_gcc-4.9.2_wo_hdf5/lib:/share/apps/netcdf-4.3.2_gcc_wo_hdf5/lib:/share/apps/grib_api-1.11.0/lib:/share/apps/jasper-1.900.1/lib:/share/apps/zlib-1.2.8_gcc-4.9.2/lib:/share/apps/gcc-4.9.2/lib64:/share/apps/gcc-4.9.2/lib:/usr/lib64:/usr/share/Modules/lib:/opt/python/lib
> [pmdtest@hpc bin]$
>
> Ahsan
>
> On Fri, Feb 27, 2015 at 10:17 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>> Hi Syed Ahsan Ali
>>
>> To avoid any leftovers and further confusion,
>> I suggest that you completely delete the old installation directory.
>> Then start fresh from the configure step with the prefix pointing to
>> --prefix=/share/apps/openmpi-1.8.4_gcc-4.9.2
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 02/27/2015 12:11 PM, Syed Ahsan Ali wrote:
>>>
>>> Hi Gus
>>>
>>> Thanks for the prompt response. Well judged; I compiled with the /export/apps
>>> prefix, so that is most probably the reason. I'll check and update you.
>>>
>>> Best wishes
>>> Ahsan
>>>
>>> On Fri, Feb 27, 2015 at 10:07 PM, Gus Correa <g...@ldeo.columbia.edu>
>>> wrote:
>>>>
>>>> Hi Syed
>>>>
>>>> This really sounds like a problem specific to Rocks Clusters,
>>>> not an issue with Open MPI:
>>>> a confusion related to mount points and soft links used by Rocks.
>>>>
>>>> I haven't used Rocks Clusters in a while,
>>>> and I don't remember the details anymore, so please take my
>>>> suggestions with a grain of salt, and check them out
>>>> before committing to them.
>>>>
>>>> Which --prefix did you use when you configured Open MPI?
>>>> My suggestion is that you don't use "/export/apps" as a prefix
>>>> (and this goes for any application that you install),
>>>> but instead use a /share/apps subdirectory, something like:
>>>>
>>>> --prefix=/share/apps/openmpi-1.8.4_gcc-4.9.2
>>>>
>>>> This is because /export/apps is just a mount point on the
>>>> frontend/head node, whereas /share/apps is a mount point
>>>> across all nodes in the cluster (and, IIRR, a soft link on the
>>>> head node).
>>>>
>>>> My recollection is that the Rocks documentation was obscure
>>>> about this, not making clear the difference between
>>>> /export/apps and /share/apps.
>>>>
>>>> Issuing the Rocks commands:
>>>> "tentakel 'ls -d /export/apps'"
>>>> "tentakel 'ls -d /share/apps'"
>>>> may show something useful.
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>>
>>>> On 02/27/2015 11:47 AM, Syed Ahsan Ali wrote:
>>>>>
>>>>>
>>>>> I am trying to run an Open MPI application on my cluster, but mpirun
>>>>> fails; a simple hostname command gives this error:
>>>>>
>>>>> [pmdtest@hpc bin]$ mpirun --host compute-0-0 hostname
>>>>>
>>>>> --
>>>>> Sorry!  You were supposed to get help about:
>>>>>   opal_init:startup:internal-failure
>>>>> But I couldn't open the help file:
>>>>>
>>>>>
>>>>> /export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
>>>>> No such file or directory.  Sorry!
>>>>>
>>>>> --
>>>>>
>>>>> --
>>>>> Sorry!  You were supposed to
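To double-check a shared-library problem like the one above, it can help to ask
the runtime linker on a compute node which MPI libraries the binary actually
resolves (a sketch: the binary name tstint2lm and the paths are the ones quoted
in this thread):

  [pmdtest@compute-0-0 ~]$ ldd ./tstint2lm | grep libmpi
  [pmdtest@compute-0-0 ~]$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep openmpi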

Re: [OMPI users] mpirun fails across cluster

2015-02-27 Thread Syed Ahsan Ali
Hi Gus

Thanks for the prompt response. Well judged; I compiled with the /export/apps
prefix, so that is most probably the reason. I'll check and update you.

Best wishes
Ahsan

On Fri, Feb 27, 2015 at 10:07 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Syed
>
> This really sounds like a problem specific to Rocks Clusters,
> not an issue with Open MPI:
> a confusion related to mount points and soft links used by Rocks.
>
> I haven't used Rocks Clusters in a while,
> and I don't remember the details anymore, so please take my
> suggestions with a grain of salt, and check them out
> before committing to them.
>
> Which --prefix did you use when you configured Open MPI?
> My suggestion is that you don't use "/export/apps" as a prefix
> (and this goes for any application that you install),
> but instead use a /share/apps subdirectory, something like:
>
> --prefix=/share/apps/openmpi-1.8.4_gcc-4.9.2
>
> This is because /export/apps is just a mount point on the
> frontend/head node, whereas /share/apps is a mount point
> across all nodes in the cluster (and, IIRR, a soft link on the
> head node).
>
> My recollection is that the Rocks documentation was obscure
> about this, not making clear the difference between
> /export/apps and /share/apps.
>
> Issuing the Rocks commands:
> "tentakel 'ls -d /export/apps'"
> "tentakel 'ls -d /share/apps'"
> may show something useful.
>
> I hope this helps,
> Gus Correa
>
>
> On 02/27/2015 11:47 AM, Syed Ahsan Ali wrote:
>>
>> I am trying to run an Open MPI application on my cluster, but mpirun
>> fails; a simple hostname command gives this error:
>>
>> [pmdtest@hpc bin]$ mpirun --host compute-0-0 hostname
>> --
>> Sorry!  You were supposed to get help about:
>>  opal_init:startup:internal-failure
>> But I couldn't open the help file:
>>
>> /export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
>> No such file or directory.  Sorry!
>> --
>> --
>> Sorry!  You were supposed to get help about:
>>  orte_init:startup:internal-failure
>> But I couldn't open the help file:
>>  /export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-orte-runtime:
>> No such file or directory.  Sorry!
>> --
>> [compute-0-0.local:03410] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in
>> file orted/orted_main.c at line 369
>> --
>> ORTE was unable to reliably start one or more daemons.
>>
>> I am using Environment Modules to load Open MPI 1.8.4, and PATH and
>> LD_LIBRARY_PATH point to the same Open MPI on the nodes:
>>
>> [pmdtest@hpc bin]$ which mpirun
>> /share/apps/openmpi-1.8.4_gcc-4.9.2/bin/mpirun
>> [pmdtest@hpc bin]$ ssh compute-0-0
>> Last login: Sat Feb 28 02:15:50 2015 from hpc.local
>> Rocks Compute Node
>> Rocks 6.1.1 (Sand Boa)
>> Profile built 01:53 28-Feb-2015
>> Kickstarted 01:59 28-Feb-2015
>> [pmdtest@compute-0-0 ~]$ which mpirun
>> /share/apps/openmpi-1.8.4_gcc-4.9.2/bin/mpirun
>>
>> The only thing I notice that seems important is that the error refers to
>>
>> /export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
>>
>> While it should have shown
>> /share/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
>> which is the path compute nodes see.
>>
>> Please help!
>> Ahsan
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
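Following Gus's suggestion, a fresh build with the prefix on the shared
filesystem might look roughly like this (a sketch: only the prefix comes from
the thread; the compilers and -j value are illustrative):

  $ ./configure --prefix=/share/apps/openmpi-1.8.4_gcc-4.9.2 CC=gcc CXX=g++ FC=gfortran
  $ make -j 8 all
  $ make install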


[OMPI users] mpirun fails across cluster

2015-02-27 Thread Syed Ahsan Ali
I am trying to run an Open MPI application on my cluster, but mpirun
fails; a simple hostname command gives this error:

[pmdtest@hpc bin]$ mpirun --host compute-0-0 hostname
--
Sorry!  You were supposed to get help about:
opal_init:startup:internal-failure
But I couldn't open the help file:
/export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
No such file or directory.  Sorry!
--
--
Sorry!  You were supposed to get help about:
orte_init:startup:internal-failure
But I couldn't open the help file:
/export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-orte-runtime:
No such file or directory.  Sorry!
--
[compute-0-0.local:03410] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in
file orted/orted_main.c at line 369
--
ORTE was unable to reliably start one or more daemons.

I am using Environment Modules to load Open MPI 1.8.4, and PATH and
LD_LIBRARY_PATH point to the same Open MPI on the nodes:

[pmdtest@hpc bin]$ which mpirun
/share/apps/openmpi-1.8.4_gcc-4.9.2/bin/mpirun
[pmdtest@hpc bin]$ ssh compute-0-0
Last login: Sat Feb 28 02:15:50 2015 from hpc.local
Rocks Compute Node
Rocks 6.1.1 (Sand Boa)
Profile built 01:53 28-Feb-2015
Kickstarted 01:59 28-Feb-2015
[pmdtest@compute-0-0 ~]$ which mpirun
/share/apps/openmpi-1.8.4_gcc-4.9.2/bin/mpirun

The only thing I notice that seems important is that the error refers to
/export/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:

While it should have shown
/share/apps/openmpi-1.8.4_gcc-4.9.2/share/openmpi/help-opal-runtime.txt:
which is the path compute nodes see.

Please help!
Ahsan
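A quick way to see whether the prefix compiled into Open MPI matches what the
compute nodes can actually reach is to query ompi_info on a node (a sketch:
hostnames and paths are the ones used above, and the output line is
illustrative):

  [pmdtest@hpc bin]$ ssh compute-0-0 'ls -d /export/apps /share/apps'
  [pmdtest@hpc bin]$ ssh compute-0-0 '/share/apps/openmpi-1.8.4_gcc-4.9.2/bin/ompi_info | grep Prefix'
  Prefix: /export/apps/openmpi-1.8.4_gcc-4.9.2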


Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
netstat doesn't show the loopback interface even on the head node, while
ifconfig shows loopback up and running on the compute nodes as well as the
master node.

[root@pmd ~]# netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
192.168.3.0 0.0.0.0 255.255.255.0   U 0 0  0 eth1
192.168.108.0   0.0.0.0 255.255.255.0   U 0 0  0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 ib0
239.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 eth0
10.0.0.00.0.0.0 255.0.0.0   U 0 0  0 eth0
0.0.0.0 192.168.3.1 0.0.0.0 UG0 0  0 eth1

[root@compute-01-01 ~]# ifconfig
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:880 errors:0 dropped:0 overruns:0 frame:0
  TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:150329 (146.8 KiB)  TX bytes:150329 (146.8 KiB)



On Thu, Nov 13, 2014 at 1:02 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> but it is running on your head node, isn't it?
>
> You might want to double-check why there is no loopback interface on
> your compute nodes.
> In the meantime, you can disable the lo and ib0 interfaces.
>
> Cheers,
>
> Gilles
>
> On 2014/11/13 16:59, Syed Ahsan Ali wrote:
>>  I don't see it running
>>
>> [pmdtest@compute-01-01 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination Gateway Genmask Flags   MSS Window  irtt 
>> Iface
>> 192.168.108.0   0.0.0.0 255.255.255.0   U 0 0  0 ib0
>> 169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 ib0
>> 239.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 eth0
>> 10.0.0.00.0.0.0 255.0.0.0   U 0 0  0 eth0
>> 0.0.0.0 10.0.0.10.0.0.0 UG0 0  0 eth0
>> [pmdtest@compute-01-01 ~]$ exit
>> logout
>> Connection to compute-01-01 closed.
>> [pmdtest@pmd ~]$ ssh compute-01-06
>> Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
>> [pmdtest@compute-01-06 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination Gateway Genmask Flags   MSS Window  irtt 
>> Iface
>> 192.168.108.0   0.0.0.0 255.255.255.0   U 0 0  0 ib0
>> 169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 ib0
>> 239.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 eth0
>> 10.0.0.00.0.0.0 255.0.0.0   U 0 0  0 eth0
>> 0.0.0.0 10.0.0.10.0.0.0 UG0 0  0 eth0
>> [pmdtest@compute-01-06 ~]$
>> 
>>
>> On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> This is really weird.
>>>
>>> Is the loopback interface up and running on both nodes, and with the same
>>> IP?
>>>
>>> Can you run the following on both compute nodes?
>>> netstat -nr
>>>
>>>
>>> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>>>> Now it tries to go through the loopback address:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
>>>> btl_tcp_if_exclude ib0 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 127.0.0.1 failed: Connection refused (111)
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [pmd.pakmet.com:30867] 1 more process has sent help message
>>>> help-mpi-btl-openib.txt / no active ports found
>>>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
>>>>
>>>>
>>>>
>>>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> --mca btl ^openib
>>>>> disables the openib btl, which is native infiniband only.
>>>>>
>>>>> ib0 is treated as any TCP interface and then handled by the tcp btl
>>>>>
>>>>> another option is for you to use
>>>>> --mca btl_tcp_if_exclude ib0
>>>>>
>>>>> 
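Following Gilles's suggestion to disable both the lo and ib0 interfaces for the
TCP BTL, the run might look like this (a sketch; note that setting
btl_tcp_if_exclude replaces the default exclude list, which is why lo has to be
listed explicitly):

  $ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude lo,ib0 ring_c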

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
 I don't see it running

[pmdtest@compute-01-01 ~]$ netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0 255.255.255.0   U 0 0  0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 ib0
239.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 eth0
10.0.0.00.0.0.0 255.0.0.0   U 0 0  0 eth0
0.0.0.0 10.0.0.10.0.0.0 UG0 0  0 eth0
[pmdtest@compute-01-01 ~]$ exit
logout
Connection to compute-01-01 closed.
[pmdtest@pmd ~]$ ssh compute-01-06
Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
[pmdtest@compute-01-06 ~]$ netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0 255.255.255.0   U 0 0  0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 ib0
239.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 eth0
10.0.0.00.0.0.0 255.0.0.0   U 0 0  0 eth0
0.0.0.0 10.0.0.10.0.0.0 UG0 0  0 eth0
[pmdtest@compute-01-06 ~]$


On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> This is really weird.
>
> Is the loopback interface up and running on both nodes, and with the same
> IP?
>
> Can you run the following on both compute nodes?
> netstat -nr
>
>
> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>> Now it tries to go through the loopback address:
>>
>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
>> btl_tcp_if_exclude ib0 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 127.0.0.1 failed: Connection refused (111)
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [pmd.pakmet.com:30867] 1 more process has sent help message
>> help-mpi-btl-openib.txt / no active ports found
>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
>> 0 to see all help / error messages
>>
>>
>>
>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> --mca btl ^openib
>>> disables the openib btl, which is native infiniband only.
>>>
>>> ib0 is treated as any TCP interface and then handled by the tcp btl
>>>
>>> another option is for you to use
>>> --mca btl_tcp_if_exclude ib0
>>>
>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>> You are right, it is running on the 10.0.0.0 interface:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> Process 0 decremented value: 8
>>>> Process 0 decremented value: 7
>>>> Process 0 decremented value: 6
>>>> Process 1 exiting
>>>> Process 0 decremented value: 5
>>>> Process 0 decremented value: 4
>>>> Process 0 decremented value: 3
>>>> Process 0 decremented value: 2
>>>> Process 0 decremented value: 1
>>>> Process 0 decremented value: 0
>>>> Process 0 exiting
>>>> [pmdtest@pmd ~]$
>>>>
>>>> While the IP addresses 192.168.108.* are for the IB interface:
>>>>
>>>>  [root@compute-01-01 ~]# ifconfig
>>>> eth0  Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>   inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>   inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>   RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>   TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>   collisions:0 txqueuelen:1000
>>>>   RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>   Interrupt:169 Memory:dc00-dc012100
>>>> ib0   Link encap:InfiniBand  HWaddr
>>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>   inet addr:192.168.108.14  Bcast:192.168.108.255  
>>>> Mask:255.255.255.0
>>>>   UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>> 

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
OK, I can disable that as well.
Thank you guys. :)

On Thu, Nov 13, 2014 at 12:50 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> Now it tries to go through the loopback address:
>
> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_exclude ib0 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 127.0.0.1 failed: Connection refused (111)
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [pmd.pakmet.com:30867] 1 more process has sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
>
>
>
> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>> --mca btl ^openib
>> disables the openib btl, which is native infiniband only.
>>
>> ib0 is treated as any TCP interface and then handled by the tcp btl
>>
>> another option is for you to use
>> --mca btl_tcp_if_exclude ib0
>>
>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>> You are right, it is running on the 10.0.0.0 interface:
>>>
>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> Process 0 decremented value: 8
>>> Process 0 decremented value: 7
>>> Process 0 decremented value: 6
>>> Process 1 exiting
>>> Process 0 decremented value: 5
>>> Process 0 decremented value: 4
>>> Process 0 decremented value: 3
>>> Process 0 decremented value: 2
>>> Process 0 decremented value: 1
>>> Process 0 decremented value: 0
>>> Process 0 exiting
>>> [pmdtest@pmd ~]$
>>>
>>> While the IP addresses 192.168.108.* are for the IB interface:
>>>
>>>  [root@compute-01-01 ~]# ifconfig
>>> eth0  Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>   inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>   inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>   RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:1000
>>>   RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>   Interrupt:169 Memory:dc00-dc012100
>>> ib0   Link encap:InfiniBand  HWaddr
>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>   inet addr:192.168.108.14  Bcast:192.168.108.255  
>>> Mask:255.255.255.0
>>>   UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:256
>>>   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>
>>>
>>>
>>> So the point is: why is mpirun following the IB path when it has
>>> been disabled? Possible solutions?
>>>
>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>>>> 10.0.0.8 address
>>>>
>>>> is the 192.168.* network a point to point network (for example between a
>>>> host and a mic) so two nodes
>>>> cannot ping each other via this address ?
>>>> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
>>>> compute-01-06 ? */
>>>>
>>>> could you also run
>>>>
>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>
>>>> and see whether it helps ?
>>>>
>>>>
>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>> Same result in both cases
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>>> compute-01-01,compute-01-06 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Now it tries to go through the loopback address:

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
btl_tcp_if_exclude ib0 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 127.0.0.1 failed: Connection refused (111)
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[pmd.pakmet.com:30867] 1 more process has sent help message
help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages



On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> --mca btl ^openib
> disables the openib btl, which is native infiniband only.
>
> ib0 is treated as any TCP interface and then handled by the tcp btl
>
> another option is for you to use
> --mca btl_tcp_if_exclude ib0
>
> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>> You are right, it is running on the 10.0.0.0 interface:
>>
>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>> btl_tcp_if_include 10.0.0.0/8 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 1 exiting
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> [pmdtest@pmd ~]$
>>
>> While the IP addresses 192.168.108.* are for the IB interface:
>>
>>  [root@compute-01-01 ~]# ifconfig
>> eth0  Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>   inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>   inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:1000
>>   RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>   Interrupt:169 Memory:dc00-dc012100
>> ib0   Link encap:InfiniBand  HWaddr
>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>   inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>   UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:256
>>   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>>
>>
>> So the point is: why is mpirun following the IB path when it has
>> been disabled? Possible solutions?
>>
>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>>> 10.0.0.8 address
>>>
>>> is the 192.168.* network a point to point network (for example between a
>>> host and a mic) so two nodes
>>> cannot ping each other via this address ?
>>> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
>>> compute-01-06 ? */
>>>
>>> could you also run
>>>
>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>
>>> and see whether it helps ?
>>>
>>>
>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>> Same result in both cases
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>> compute-01-01,compute-01-06 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>
>>>>
>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>>> compute-01-01,compute-01-06 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> [compute

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
You are right, it is running on the 10.0.0.0 interface:

[pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
btl_tcp_if_include 10.0.0.0/8 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 1 exiting
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
[pmdtest@pmd ~]$

While the IP addresses 192.168.108.* are for the IB interface:

 [root@compute-01-01 ~]# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
  inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
  inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
  TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
  Interrupt:169 Memory:dc00-dc012100
ib0   Link encap:InfiniBand  HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
  inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
  UP BROADCAST MULTICAST  MTU:65520  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)



So the point is: why is mpirun following the IB path when it has
been disabled? Possible solutions?

On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> mpirun complains about the 192.168.108.10 IP address, but ping reports a
> 10.0.0.8 address.
>
> Is the 192.168.* network a point-to-point network (for example between a
> host and a MIC), so that two nodes
> cannot ping each other via this address?
> /* e.g. from compute-01-01, can you ping the 192.168.108.* IP address of
> compute-01-06? */
>
> could you also run
>
> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_include 10.0.0.0/8 ring_c
>
> and see whether it helps ?
>
>
> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>> Same result in both cases
>>
>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>> compute-01-01,compute-01-06 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>>
>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>> compute-01-01,compute-01-06 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>>
>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> Hi,
>>>
>>> it seems you messed up the command line
>>>
>>> could you try
>>>
>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>
>>>
>>> can you also try to run mpirun from a compute node instead of the head
>>> node ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>> Here is what I see when disabling openib support:
>>>>
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>> compute-01-01,compute-01-06 ring_c
>>>> ssh:  orted: Temporary failure in name resolution
>>>> ssh:  orted: Temporary failure in name resolution
>>>> ----------
>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> While nodes can still ssh each other
>>>>
>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>> [pmdtest@compute-01-06 ~]$
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 
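For reference, the combination that worked here (disabling the openib BTL and
restricting the TCP BTL to the 10.0.0.0/8 Ethernet network) can also be made
the per-user default, so that every mpirun picks it up without extra flags; a
minimal sketch, assuming the standard per-user MCA parameter file:

  $ mkdir -p ~/.openmpi
  $ echo "btl = ^openib"                   >> ~/.openmpi/mca-params.conf
  $ echo "btl_tcp_if_include = 10.0.0.0/8" >> ~/.openmpi/mca-params.conf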

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Same result in both cases

[pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
compute-01-01,compute-01-06 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
[compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)


[pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
compute-01-01,compute-01-06 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
[compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)


On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> Hi,
>
> it seems you messed up the command line
>
> could you try
>
> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>
>
> can you also try to run mpirun from a compute node instead of the head
> node ?
>
> Cheers,
>
> Gilles
>
> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>> Here is what I see when disabling openib support:
>>
>>
>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>> compute-01-01,compute-01-06 ring_c
>> ssh:  orted: Temporary failure in name resolution
>> ssh:  orted: Temporary failure in name resolution
>> --
>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> While nodes can still ssh each other
>>
>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>> [pmdtest@compute-01-06 ~]$
>>
>>
>>
>>
>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>> wrote:
>>> Hi Jeff,
>>>
>>> No firewall is enabled. Running the diagnostics, I found that the
>>> non-communicating MPI job runs, while ring_c remains stuck. There
>>> are of course warnings for OpenFabrics, but in my case I am running
>>> the application with openib disabled. Please see below:
>>>
>>>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>> --
>>> WARNING: There is at least one OpenFabrics device found but there are
>>> no active ports detected (or Open MPI was unable to use them).  This
>>> is most certainly not what you wanted.  Check your cables, subnet
>>> manager configuration, etc.  The openib BTL will be ignored for this
>>> job.
>>>   Local host: compute-01-01.private.dns.zone
>>> --
>>> Hello, world, I am 0 of 2
>>> Hello, world, I am 1 of 2
>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / no active ports found
>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>>
>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>> --
>>> WARNING: There is at least one OpenFabrics device found but there are
>>> no active ports detected (or Open MPI was unable to use them).  This
>>> is most certainly not what you wanted.  Check your cables, subnet
>>> manager configuration, etc.  The openib BTL will be ignored for this
>>> job.
>>>   Local host: compute-01-01.private.dns.zone
>>> --
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.10 failed: No route to host (113)
>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / no active ports found
>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>> 
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>> <jsquy...@cisco.com> wrote:
>>>> Do you have firewalling enab

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Here is what I see when disabling openib support:


[pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
compute-01-01,compute-01-06 ring_c
ssh:  orted: Temporary failure in name resolution
ssh:  orted: Temporary failure in name resolution
--
A daemon (pid 7608) died unexpectedly with status 255 while attempting
to launch so we are aborting.

While nodes can still ssh each other

[pmdtest@compute-01-01 ~]$ ssh compute-01-06
Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
[pmdtest@compute-01-06 ~]$




On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> Hi Jeff,
>
> No firewall is enabled. Running the diagnostics, I found that the
> non-communicating MPI job runs, while ring_c remains stuck. There
> are of course warnings for OpenFabrics, but in my case I am running
> the application with openib disabled. Please see below:
>
>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
> --
> WARNING: There is at least one OpenFabrics device found but there are
> no active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>   Local host: compute-01-01.private.dns.zone
> --
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
> [pmd.pakmet.com:06386] 1 more process has sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
>
> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
> --
> WARNING: There is at least one OpenFabrics device found but there are
> no active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>   Local host: compute-01-01.private.dns.zone
> --
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [pmd.pakmet.com:15965] 1 more process has sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> 
>
>
>
>
>
> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> Do you have firewalling enabled on either server?
>>
>> See this FAQ item:
>>
>> 
>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>
>>
>>
>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>
>>> Dear All
>>>
>>> I need your advice. While trying to run an mpirun job across nodes, I get
>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>> compute-01-06, are not able to communicate with each other, although the
>>> nodes can see each other with ping.
>>>
>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>
>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.10 failed: No route to host (113)
>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect

Re: [OMPI users] mpirun fails across nodes

2014-11-13 Thread Syed Ahsan Ali
Hi Jeff,

No firewall is enabled. Running the diagnostics, I found that the
non-communicating MPI job runs, while ring_c remains stuck. There are of
course warnings for OpenFabrics, but in my case I am running the
application with openib disabled. Please see below:

 [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
--
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.
  Local host: compute-01-01.private.dns.zone
--
Hello, world, I am 0 of 2
Hello, world, I am 1 of 2
[pmd.pakmet.com:06386] 1 more process has sent help message
help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
--
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.
  Local host: compute-01-01.private.dns.zone
--
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
[compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[pmd.pakmet.com:15965] 1 more process has sent help message
help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages






On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> Do you have firewalling enabled on either server?
>
> See this FAQ item:
>
> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>
>
>
> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> Dear All
>>
>> I need your advice. While trying to run an mpirun job across nodes, I get
>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>> compute-01-06, are not able to communicate with each other, although the
>> nodes can see each other with ping.
>>
>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>> ^openib ../bin/regcmMPICLM45 regcm.in
>>
>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.14 failed: No route to host (113)
>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>> mpirun: killing job...
>>
>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>> ttl=64 time=0.108 ms
>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>> ttl=64 time=0.088 ms
>>
>> --- compute-01-06.private.dns.zone ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>> [pmdtest@compute-01-01 ~]$
>>
>> Thanks in advance.
>>
>> Ahsan

[OMPI users] mpirun fails across nodes

2014-11-12 Thread Syed Ahsan Ali
Dear All

I need your advice. While trying to run an mpirun job across nodes, I get the
following error. It seems that the two nodes, i.e. compute-01-01 and
compute-01-06, are not able to communicate with each other, although the
nodes can see each other with ping.

[pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
^openib ../bin/regcmMPICLM45 regcm.in

[compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
connect() to 192.168.108.10 failed: No route to host (113)

mpirun: killing job...

[pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
[pmdtest@compute-01-01 ~]$ ping compute-01-06
PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
ttl=64 time=0.108 ms
64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
ttl=64 time=0.088 ms

--- compute-01-06.private.dns.zone ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
[pmdtest@compute-01-01 ~]$

Thanks in advance.

Ahsan


Re: [OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Syed Ahsan Ali
Hi Jeff and Ralph

I could have figured out the issue, but the problem was that I could not find
the exact error line in config.log that you identified. The shared
library libquadmath is present in the lib64 directory, so adding that path to
the environment removed the error.

Thank you guys for helping me :)
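For reference, the fix described above amounts to putting the 64-bit GCC
runtime libraries on the library path before re-running configure; a minimal
sketch, with an illustrative gcc installation path:

  $ export LD_LIBRARY_PATH=/opt/gcc-4.9.2/lib64:$LD_LIBRARY_PATH
  $ ./configure ...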



On Tue, Aug 26, 2014 at 7:29 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Just to elaborate: as the error message implies, this error message was
> put there specifically to ensure that the Fortran compiler works before
> continuing any further.  If the Fortran compiler is busted, configure exits
> with this help message.
>
> You can either fix your Fortran compiler, or use --disable-mpi-fortran to
> disable all Fortran support from Open MPI (and therefore this "test whether
> the Fortran compiler works" test will be skipped).
>
> Here's the specific log section showing the failure:
>
> -
> configure:32389: checking if Fortran compiler works
> configure:32418: gfortran -o conftest    conftest.f  >&5
> configure:32418: $? = 0
> configure:32418: ./conftest
> ./conftest: error while loading shared libraries: libquadmath.so.0: wrong
> ELF class: ELFCLASS32
> configure:32418: $? = 127
> configure: program exited with status 127
> configure: failed program was:
> |   program main
> |
> |   end
> configure:32434: result: no
> configure:32448: error: Could not run a simple Fortran program.  Aborting.
> -
>
>
> On Aug 26, 2014, at 10:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Looks like there is something wrong with your gfortran install:
> >
> > *** Fortran compiler
> > checking for gfortran... gfortran
> > checking whether we are using the GNU Fortran compiler... yes
> > checking whether gfortran accepts -g... yes
> > checking whether ln -s works... yes
> > checking if Fortran compiler works... no
> > **
> > * It appears that your Fortran compiler is unable to produce working
> > * executables.  A simple test application failed to properly
> > * execute.  Note that this is likely not a problem with Open MPI,
> > * but a problem with the local compiler installation.  More
> > * information (including exactly what command was given to the
> > * compiler and what error resulted when the command was executed) is
> > * available in the config.log file in the Open MPI build directory.
> > **
> > configure: error: Could not run a simple Fortran program.  Aborting.
> >
> >
> > FWIW: I can compile and run on my CentOS6.5 system just fine. I have
> gfortran 4.4.7 installed on it
> >
> > On Aug 26, 2014, at 2:59 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
> >
> >>
> >>
> >> I have problems in compilation of openmpi-1.8.1 on Linux machine.
> Kindly see the logs attached.
> >>
> >>
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
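Jeff's alternative above, building Open MPI without any Fortran support, would
look roughly like this (a sketch: only --disable-mpi-fortran comes from the
message; the prefix is illustrative):

  $ ./configure --disable-mpi-fortran --prefix=/share/apps/openmpi-1.8.1
  $ make all install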


[OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Syed Ahsan Ali
I have problems compiling openmpi-1.8.1 on a Linux machine. Kindly
see the logs attached.


configure.bz2
Description: BZip2 compressed data


Re: [OMPI users] openmpi 1.8.1 error with gfortran

2014-08-06 Thread Syed Ahsan Ali
Issue resolved.


On Wed, Aug 6, 2014 at 2:48 PM, Syed Ahsan Ali <ahsansha...@gmail.com>
wrote:

> I have the following error while compiling:
>
>
> *** Fortran compiler
> checking whether we are using the GNU Fortran compiler... yes
> checking whether /opt/gcc-4.9.1/bin/gfortran accepts -g... yes
> configure: WARNING: Open MPI now ignores the F77 and FFLAGS environment
> variables; only the FC and FCFLAGS environment variables are used.
> checking whether ln -s works... yes
> checking if Fortran compiler works... no
> **
> * It appears that your Fortran compiler is unable to produce working
> * executables.  A simple test application failed to properly
> * execute.  Note that this is likely not a problem with Open MPI,
> * but a problem with the local compiler installation.  More
> * information (including exactly what command was given to the
> * compiler and what error resulted when the command was executed) is
> * available in the config.log file in the Open MPI build directory.
> **
> configure: error: Could not run a simple Fortran program.  Aborting.
> [root@rcm openmpi-1.8.1]#
>
>
> Ahsan
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] How to keep multiple installations at same time

2014-08-05 Thread Syed Ahsan Ali
I want to compile Open MPI with both the Intel and GNU compilers. How can
I install both at the same time and then specify which one to use during
job submission?


Regards
Ahsan
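One common approach (a sketch, not an answer taken from the list; prefixes and
compiler names are illustrative) is to build each variant into its own prefix,
each from a clean build tree, and then select one at submission time by
adjusting the environment:

  $ ./configure --prefix=/share/apps/openmpi_gcc CC=gcc CXX=g++ FC=gfortran && make -j 8 all install
  $ ./configure --prefix=/share/apps/openmpi_intel CC=icc CXX=icpc FC=ifort && make -j 8 all install

  # at job submission time, pick one:
  $ export PATH=/share/apps/openmpi_gcc/bin:$PATH
  $ export LD_LIBRARY_PATH=/share/apps/openmpi_gcc/lib:$LD_LIBRARY_PATH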


Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Hi Josh

It was my mistake. The status of the error-generating node is pasted below:

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80::::0018:8b90:97fe:94fe
base lid:0x0
sm lid:  0x0
state:   1: DOWN
phys state:  4: PortConfigurationTraining
rate:10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80::::0018:8b90:97fe:94ff
base lid:0x29
sm lid:  0x15

As you can see, one port is down. I have all sysadmin rights, as I am managing
the cluster, but my level of knowledge is not expert. Can you explain a bit
about ports? Does each InfiniBand card in a system have 2 physical ports?
What should I look for if one port's status is down?

Ahsan

On Tue, Jul 22, 2014 at 6:14 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:

> Syed,
>
> You might try this link (or have your sysadmin do it if you do not have
> admin privileges.) To me it looks like your second port is in the "INIT"
> state but has not been added by the subnet manager.
>
>
> https://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-ofed-tools
>
> You might also try try running only over port 1 with the mca parameter:
>
> -mca btl_openib_if_include mlx4_0:1
>
> Hope this helps.
>
> Josh
>
>
>  On Tue, Jul 22, 2014 at 12:10 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
>
>> And where can I find run/job/submission?
>>
>>  On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>>
>>>
>>> You have to check the port states on *all* nodes in the
>>> run/job/submission. Checking on a single node is not enough.
>>> My guess is that 01-00 tries to connect to 01-01 and the ports are down on
>>> 01-01.
>>>
>>> You may disable support for infiniband by adding --mca btl ^openib.
>>>
>>> Best,
>>> Pavel (Pasha) Shamis
>>> ---
>>> Computer Science Research Group
>>> Computer Science and Math Division
>>> Oak Ridge National Laboratory
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>
>>> Dear All
>>>
>>> I need your help to solve this cluster-related issue causing mpirun to
>>> malfunction. I get the following warning for some of the nodes, and then a
>>> route failure message comes up, causing mpirun to fail.
>>>
>>>
>>> WARNING: There is at least one OpenFabrics device found but there are no
>>> active ports detected (or Open MPI was unable to use them).  This
>>> is most certainly not what you wanted.  Check your cables, subnet
>>> manager configuration, etc.  The openib BTL will be ignored for this
>>> job.
>>>Local host: compute-01-01.private.dns.zone
>>>
>>> --
>>>SETUP OF THE LM
>>>  INITIALIZATIONS
>>>  INPUT OF THE NAMELISTS
>>> [pmd.pakmet.com:30198] 7 more processes
>>> have sent help message help-mpi-btl-openib.txt / no active ports found
>>> [pmd.pakmet.com:30198] Set MCA parameter
>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> My questions are:
>>> I don't include flags for running Open MPI over InfiniBand, so why does it
>>> still give the warning? If the InfiniBand ports are not active, then it should
>>> start the job over the cluster's gigabit Ethernet. Why is it unable to find
>>> the route, while the node can be pinged and reached by ssh from the other
>>> nodes and the master node as well?
>>> The ibstatus of the above node (for which I was getting the error) shows
>>> that both ports are up. What is causing the error then?
>>>
>>> [root@compute-01-00 ~]# ibstatus
>>> Infiniband device 'mlx4_0' port 1 status:
>>> default gid: fe80::::0024:e890:97ff:1c61
>>> base lid:0x5
>>> sm lid:  0x1
>>>  
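Following Josh's suggestion above to run over only one of the InfiniBand ports,
the original invocation might look like this (a sketch: the executable,
hostfile and process count are the ones quoted in this thread):

  $ mpirun -np 16 -hostfile hostlist --mca btl_openib_if_include mlx4_0:1 ../bin/regcmMPICLM45 regcm.in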

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-23 Thread Syed Ahsan Ali
Dear Pasha

The ibstatus output is not from two different machines; it is from the same
machine. There are two InfiniBand ports showing up on all nodes. I checked on
all the nodes that one of the ports is always in the INIT state and the other
one is active. Now please see below the ibstatus of the problem-causing node
(compute-01-01). One of its ports is down. Maybe this is the reason for the
error? Is it a physical port?

[root@compute-01-01 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80::::0018:8b90:97fe:94fe
base lid:0x0
sm lid:  0x0
state:   1: DOWN
phys state:  4: PortConfigurationTraining
rate:10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80::::0018:8b90:97fe:94ff
base lid:0x29
sm lid:  0x15
state:   4: ACTIVE
phys state:  5: LinkUp
rate:20 Gb/sec (4X DDR)
On Tue, Jul 22, 2014 at 6:50 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

> Hmm, this does not make sense.
> Your copy-n-paste shows that both machines (00 and 01) have the same
> guid/lid (sort of equivalent of mac address in ethernet world).
> As you can guess these two can not be identical for two different machines
> (unless you moved the card around).
>
> Best,
> Pasha
>
> On Jul 21, 2014, at 11:26 PM, Syed Ahsan Ali <ahsansha...@gmail.com
> <mailto:ahsansha...@gmail.com>> wrote:
>
> Yes I had checked running mpirun on all nodes one by one to see the
> problematic one. I had already mentioned that compute-01-01 is causing
> problem, when I remove it from the hostlist mpirun works fine. Here is
> ibstatus of compute-01-01.
>
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80::::0024:e890:97ff:1c61
> base lid:0x5
> sm lid:  0x1
> state:   4: ACTIVE
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
> default gid: fe80::::0024:e890:97ff:1c62
> base lid:0x0
> sm lid:  0x0
> state:   2: INIT
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
>
>
> On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov sham...@ornl.gov>> wrote:
>
> You have to check the ports states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is the 01-00 tries to connect 01-01 and the ports are down on
> 01-01.
>
> You may disable support for infiniband by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
>
>
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com ahsansha...@gmail.com><mailto:ahsansha...@gmail.com ahsansha...@gmail.com>>> wrote:
>
> Dear All
>
> I need your help to solve this cluster related issue causing mpirun
> malfunction. I get following warning for some of the nodes and then the
> route failure message comes causing failure to mpirun.
>
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>Local host: compute-01-01.private.dns.zone
> --
>SETUP OF THE LM
>  INITIALIZATIONS
>  INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/><
> http://pmd.pakmet.com:30198/>] 7 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/><
> http://pmd.pakmet.com:30198/>] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
>  
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> My questions are.
> I don't include flags for running openmpi over infiniband then why it
> still gives warning. If the infiniband ports are not active then it should
> start the job over gigabit ethernet of cluster. Why it is unable to find
> the route while the node can be pinged and ssh from other nodes and master
> node as well.

Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
And where can I find the run/job/submission?

On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

>
> You have to check the ports states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is the 01-00 tries to connect 01-01 and the ports are down on
> 01-01.
>
> You may disable support for infiniband by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
>
>
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com ahsansha...@gmail.com>> wrote:
>
> Dear All
>
> I need your help to solve this cluster related issue causing mpirun
> malfunction. I get following warning for some of the nodes and then the
> route failure message comes causing failure to mpirun.
>
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>Local host: compute-01-01.private.dns.zone
> --
>SETUP OF THE LM
>  INITIALIZATIONS
>  INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/>] 7 more processes
> have sent help message help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/>] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> My questions are.
> I don't include flags for running openmpi over infiniband then why it
> still gives warning. If the infiniband ports are not active then it should
> start the job over gigabit ethernet of cluster. Why it is unable to find
> the route while the node can be pinged and ssh from other nodes and master
> node as well.
> The ibstatus of the above node (for which I was getting error) shows that
> both ports are up. What is causing error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80::::0024:e890:97ff:1c61
> base lid:0x5
> sm lid:  0x1
> state:   4: ACTIVE
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
> default gid: fe80::::0024:e890:97ff:1c62
> base lid:0x0
> sm lid:  0x0
> state:   2: INIT
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
>
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
> ___
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24835.php
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Errors for openib, mpirun fails

2014-07-22 Thread Syed Ahsan Ali
Yes, I had checked by running mpirun on all nodes one by one to find the
problematic one. I had already mentioned that compute-01-01 is causing the
problem; when I remove it from the hostlist, mpirun works fine. Here is the
ibstatus of compute-01-01.

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80::::0024:e890:97ff:1c61
base lid:0x5
sm lid:  0x1
state:   4: ACTIVE
phys state:  5: LinkUp
rate:20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80::::0024:e890:97ff:1c62
base lid:0x0
sm lid:  0x0
state:   2: INIT
phys state:  5: LinkUp
rate:20 Gb/sec (4X DDR)


On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

>
> You have to check the ports states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is the 01-00 tries to connect 01-01 and the ports are down on
> 01-01.
>
> You may disable support for infiniband by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
>
>
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com ahsansha...@gmail.com>> wrote:
>
> Dear All
>
> I need your help to solve this cluster related issue causing mpirun
> malfunction. I get following warning for some of the nodes and then the
> route failure message comes causing failure to mpirun.
>
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>Local host: compute-01-01.private.dns.zone
> --
>SETUP OF THE LM
>  INITIALIZATIONS
>  INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/>] 7 more processes
> have sent help message help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198<http://pmd.pakmet.com:30198/>] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> My questions are.
> I don't include flags for running openmpi over infiniband then why it
> still gives warning. If the infiniband ports are not active then it should
> start the job over gigabit ethernet of cluster. Why it is unable to find
> the route while the node can be pinged and ssh from other nodes and master
> node as well.
> The ibstatus of the above node (for which I was getting error) shows that
> both ports are up. What is causing error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80::::0024:e890:97ff:1c61
> base lid:0x5
> sm lid:  0x1
> state:   4: ACTIVE
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
> default gid: fe80::::0024:e890:97ff:1c62
> base lid:0x0
> sm lid:  0x0
> state:   2: INIT
> phys state:  5: LinkUp
> rate:20 Gb/sec (4X DDR)
>
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
> ___
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24835.php
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Errors for openib, mpirun fails

2014-07-21 Thread Syed Ahsan Ali
Dear All

I need your help to solve this cluster-related issue causing mpirun to
malfunction. I get the following warning for some of the nodes and then the
route failure message comes, causing mpirun to fail.



WARNING: There is at least one OpenFabrics device found but there are no
active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.
   Local host: compute-01-01.private.dns.zone
--
   SETUP OF THE LM
 INITIALIZATIONS
 INPUT OF THE NAMELISTS
[pmd.pakmet.com:30198] 7 more processes have sent help message
help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.14 failed: No route to host (113)

My questions are:

I don't include flags for running openmpi over infiniband, so why does it still
give the warning? If the infiniband ports are not active then it should start
the job over the gigabit ethernet of the cluster. Why is it unable to find the
route, while the node can be pinged and ssh'd from the other nodes and the
master node as well?

The ibstatus of the above node (for which I was getting the error) shows that
both ports are up. What is causing the error then?


[root@compute-01-00 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80::::0024:e890:97ff:1c61
base lid:0x5
sm lid:  0x1
state:   4: ACTIVE
phys state:  5: LinkUp
rate:20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80::::0024:e890:97ff:1c62
base lid:0x0
sm lid:  0x0
state:   2: INIT
phys state:  5: LinkUp
rate:20 Gb/sec (4X DDR)


Thank you in advance for your guidance and support.

Regards

-- 
Ahsan


Re: [OMPI users] Error message related to infiniband

2014-01-20 Thread Syed Ahsan Ali
My email was a mixture of error messages/warnings.

The IB card on compute-01-10 shows as faulty in ibstatus.

ibstat on the other nodes, as well as on compute-01-15, shows dual ports,
as I can see the status of both ports in ibstat.

The firewall is not a problem, I am sure about that. How can I check for a bad
ethernet port? I can ping between the master and the compute nodes.
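
(One quick way to test an ethernet port itself -- a sketch, assuming the
cluster interface is eth0 on each node:

  ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
  ip -s link show eth0      # look for growing error/drop counters

A port that negotiates the wrong speed or keeps accumulating errors shows
up here even though ping may still work.)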

/etc/hosts is ok for name resolution.

Thank you very much for responding and helping me out.


Ahsan



On Mon, Jan 20, 2014 at 9:27 AM, Gustavo Correa <g...@ldeo.columbia.edu>wrote:

> Is your IB card in compute-01-10.private.dns.zone working?
> Did you check it with ibstat?
>
> Do you have a dual port IB card in compute-01-15.private.dns.zone?
> Did you connect both ports to the same switch on the same subnet?
>
> TCP "no route to host":
> If it is not a firewall problem, could it bad Ethernet port on a node
> perhaps?
>
> Also, if you use host names in your hostfile, I guess they need to be able
> to
> resolve the names into IP addresses.
> Check if your /etc/hosts file, DNS server, or whatever you
> use for name resolution, is correct and consistent across the cluster.
>
> On Jan 19, 2014, at 10:18 PM, Syed Ahsan Ali wrote:
>
> > I agree with you and still struglling with subnet ID settings because I
> couldn't find /var/cache/opensm/opensm.opts file.
> >
> > Secondly, if OMPI is going for TCP then it should be able to find as
> compute nodes are available via ping and ssh
> >
> >
> > On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > If OMPI finds infiniband support on the node, it will attempt to use it.
> In this case, it would appear you have an incorrectly configured IB adaptor
> on the node, so you get the additional warning about that fact.
> >
> > OMPI then falls back to look for another transport, in this case TCP.
> However, the TCP transport is unable to create a socket to the remote host.
> The most likely cause is a firewall, so you might want to check that and
> turn it off.
> >
> >
> > On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
> >
> >> Dear All
> >>
> >> I am getting infiniband errors while running mpirun applications on
> cluster. I get these errors even when I don't include infiniband usage
> flags in mpirun command. Please guide
> >>
> >> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
> >>
> >>
> --
> >> [[59183,1],24]: A high-performance Open MPI point-to-point messaging
> module
> >> was unable to find any relevant network interfaces:
> >> Module: OpenFabrics (openib)
> >>   Host: compute-01-10.private.dns.zone
> >>
> >> Another transport will be used instead, although this may result in
> >> lower performance.
> >>
> --
> >>
> --
> >> WARNING: There are more than one active ports on host
> 'compute-01-15.private.dns.zone', but the
> >> default subnet GID prefix was detected on more than one of these
> >> ports.  If these ports are connected to different physical IB
> >> networks, this configuration will fail in Open MPI.  This version of
> >> Open MPI requires that every physically separate IB subnet that is
> >> used between connected MPI processes must have different subnet ID
> >> values.
> >>
> >> Please see this FAQ entry for more details:
> >>
> >>
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >>   btl_openib_warn_default_gid_prefix to 0.
> >>
> --
> >>
> >>   This is RegCM trunk
> >>SVN Revision: tag 4.3.5.6 compiled at: data : Sep  3 2013  time:
> 05:10:53
> >>
> >> [pmd.pakmet.com:03309] 15 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> >> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> >> [pmd.pakmet.com:03309] 47 more processes have sent help message
> help-mpi-btl-openib.txt / default subnet prefix
> >>
> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>

Re: [OMPI users] Error message related to infiniband

2014-01-19 Thread Syed Ahsan Ali
I agree with you and am still struggling with the subnet ID settings because I
couldn't find the /var/cache/opensm/opensm.opts file.

Secondly, if OMPI is falling back to TCP then it should be able to find the
compute nodes, as they are reachable via ping and ssh.
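
(A sketch of forcing the job onto the gigabit network, assuming that
network is eth0 on every node; btl_tcp_if_include simply tells the TCP BTL
which interface to use, and the rest is the same command quoted further
below:

  mpirun -np 72 -hostfile hostlist --mca btl tcp,self,sm \
    --mca btl_tcp_if_include eth0 ../bin/regcmMPI regcm.in

The interface name is an assumption; substitute whatever ifconfig reports
for the gigabit ethernet on the compute nodes.)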


On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <r...@open-mpi.org> wrote:

> If OMPI finds infiniband support on the node, it will attempt to use it.
> In this case, it would appear you have an incorrectly configured IB adaptor
> on the node, so you get the additional warning about that fact.
>
> OMPI then falls back to look for another transport, in this case TCP.
> However, the TCP transport is unable to create a socket to the remote host.
> The most likely cause is a firewall, so you might want to check that and
> turn it off.
>
>
> On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear All
>
> I am getting infiniband errors while running mpirun applications on
> cluster. I get these errors even when I don't include infiniband usage
> flags in mpirun command. Please guide
>
> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
>
> --
> [[59183,1],24]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
>   Host: compute-01-10.private.dns.zone
>
> Another transport will be used instead, although this may result in
> lower performance.
> --
> --
> WARNING: There are more than one active ports on host
> 'compute-01-15.private.dns.zone', but the
> default subnet GID prefix was detected on more than one of these
> ports.  If these ports are connected to different physical IB
> networks, this configuration will fail in Open MPI.  This version of
> Open MPI requires that every physically separate IB subnet that is
> used between connected MPI processes must have different subnet ID
> values.
>
> Please see this FAQ entry for more details:
>
>   http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_default_gid_prefix to 0.
> --
>
>   This is RegCM trunk
>SVN Revision: tag 4.3.5.6 compiled at: data : Sep  3 2013  time:
> 05:10:53
>
> [pmd.pakmet.com:03309] 15 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [pmd.pakmet.com:03309] 47 more processes have sent help message
> help-mpi-btl-openib.txt / default subnet prefix
> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> [compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> [compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
>
> Ahsan
>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Error message related to infiniband

2014-01-19 Thread Syed Ahsan Ali
Dear All

I am getting infiniband errors while running mpirun applications on
cluster. I get these errors even when I don't include infiniband usage
flags in mpirun command. Please guide

mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in

--
[[59183,1],24]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: compute-01-10.private.dns.zone

Another transport will be used instead, although this may result in
lower performance.
--
--
WARNING: There are more than one active ports on host
'compute-01-15.private.dns.zone', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_default_gid_prefix to 0.
--

  This is RegCM trunk
   SVN Revision: tag 4.3.5.6 compiled at: data : Sep  3 2013  time: 05:10:53

[pmd.pakmet.com:03309] 15 more processes have sent help message
help-mpi-btl-base.txt / btl:no-nics
[pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[pmd.pakmet.com:03309] 47 more processes have sent help message
help-mpi-btl-openib.txt / default subnet prefix
[compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)
[compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.108.10 failed: No route to host (113)

Ahsan


Re: [OMPI users] compilation aborted for Handler.cpp (code 2)

2013-09-27 Thread Syed Ahsan Ali
Thank you very much, Jeff. It worked now. But I got an error when I ran make check:

1 of 5 tests failed
Please report to http://www.open-mpi.org/community/help/

make[3]: *** [check-TESTS] Error 1
make[3]: Leaving directory `/home/openmpi-1.6.5/test/datatype'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/home/openmpi-1.6.5/test/datatype'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/home/openmpi-1.6.5/test'
make: *** [check-recursive] Error 1
[root@r720 openmpi-1.6.5]#

But I find openmpi installed. Will it work fine? Secondly, will missing the
optional VT package affect the functionality of openmpi, or will it still
work fine?
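
(For reference, a sketch of the earlier configure command with Jeff's
--disable-vt suggestion folded in; the exact options used for the build
that finally worked are not shown in this thread:

  ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort --disable-vt \
    --prefix=/home/openmpi_gfortran --enable-mpi-f90 --enable-mpi-f77

Skipping VT only drops the optional VampirTrace tracing layer; the MPI
library itself is built as usual.)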

On Tue, Sep 24, 2013 at 2:49 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> I suspect that something is wrong with your Intel C++ compiler installation, 
> but you can simply avoid the issue if you add --disable-vt to Open MPI's 
> ./configure command line.  This will skip building the (optional) Vampir 
> Trace package, which is where you are running into this problem (VT is 
> written in C++; the vast majority of the rest of Open MPI is written in C).
>
>
> On Sep 22, 2013, at 10:40 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> Its ok Jeff.
>> I am not sure about other C++ codes and STL with icpc because it never
>> happened and I don't know anything about STL.(pardon my less
>> knowledge). What do you suggest in this case? installation of
>> different version of openmpi or intel compilers? or any other
>> solution.
>>
>> On Fri, Sep 20, 2013 at 8:35 PM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>> Sorry for the delay replying -- I actually replied on the original thread 
>>> yesterday, but it got hung up in my outbox and I didn't notice that it 
>>> didn't actually go out until a few moments ago.  :-(
>>>
>>> I'm *guessing* that this is a problem with your local icpc installation.
>>>
>>> Can you compile / run other C++ codes that use the STL with icpc?
>>>
>>>
>>> On Sep 20, 2013, at 6:59 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>
>>>> Output of make V=1 is attached. Again same error. If intel compiler is
>>>> using C++ headers from gfortran then how can we avoid this.
>>>>
>>>> On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg
>>>> <bert.wes...@googlemail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>>>> wrote:
>>>>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>>>>> but getting the subject error. config.out and make.out is attached.
>>>>>> Following command was used for configure
>>>>>>
>>>>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>>>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>>>>> tee config.out
>>>>>
>>>>> could you also run make with 'make V=1' and send the output. Anyway it
>>>>> looks like the intel compiler uses the C++ headers from GCC 4.6.3 and
>>>>> I don't know if this is supported.
>>>>>
>>>>> Bert
>>>>>
>>>>>> Please help/advise.
>>>>>> Thank you and best regards
>>>>>> Ahsan
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Syed Ahsan Ali Bokhari
>>>> Electronic Engineer (EE)
>>>>
>>>> Research & Development Division
>>>> Pakistan Meteorological Department H-8/4, Islamabad.
>>>> Phone # off  +92518358714
>>>> Cell # +923155145014
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> --
>> Syed Ahsan Ali Bokhari
>> Electronic Engineer (EE)
>>
>> Research & Development Division
>> Pakistan Meteorological Department H-8/4, Islamabad.
>> Phone # off  +92518358714
>> Cell # +923155145014
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] compilation aborted for Handler.cpp (code 2)

2013-09-22 Thread Syed Ahsan Ali
It's ok, Jeff.
I am not sure about other C++ codes and the STL with icpc because it has never
come up, and I don't know anything about the STL (pardon my limited
knowledge). What do you suggest in this case? Installing a
different version of openmpi or of the intel compilers? Or any other
solution?

On Fri, Sep 20, 2013 at 8:35 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> Sorry for the delay replying -- I actually replied on the original thread 
> yesterday, but it got hung up in my outbox and I didn't notice that it didn't 
> actually go out until a few moments ago.  :-(
>
> I'm *guessing* that this is a problem with your local icpc installation.
>
> Can you compile / run other C++ codes that use the STL with icpc?
>
>
> On Sep 20, 2013, at 6:59 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> Output of make V=1 is attached. Again same error. If intel compiler is
>> using C++ headers from gfortran then how can we avoid this.
>>
>> On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg
>> <bert.wes...@googlemail.com> wrote:
>>> Hi,
>>>
>>> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>> wrote:
>>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>>> but getting the subject error. config.out and make.out is attached.
>>>> Following command was used for configure
>>>>
>>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>>> tee config.out
>>>
>>> could you also run make with 'make V=1' and send the output. Anyway it
>>> looks like the intel compiler uses the C++ headers from GCC 4.6.3 and
>>> I don't know if this is supported.
>>>
>>> Bert
>>>
>>>> Please help/advise.
>>>> Thank you and best regards
>>>> Ahsan
>>>>
>>
>>
>>
>> --
>> Syed Ahsan Ali Bokhari
>> Electronic Engineer (EE)
>>
>> Research & Development Division
>> Pakistan Meteorological Department H-8/4, Islamabad.
>> Phone # off  +92518358714
>> Cell # +923155145014
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Fwd: compilation aborted for Handler.cpp (code 2)

2013-09-20 Thread Syed Ahsan Ali
The output of make V=1 is attached. Again the same error. If the intel compiler
is using the C++ headers from GCC, how can we avoid this?

On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg
<bert.wes...@googlemail.com> wrote:
> Hi,
>
> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>> but getting the subject error. config.out and make.out is attached.
>> Following command was used for configure
>>
>>  ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>> tee config.out
>
> could you also run make with 'make V=1' and send the output. Anyway it
> looks like the intel compiler uses the C++ headers from GCC 4.6.3 and
> I don't know if this is supported.
>
> Bert
>
>> Please help/advise.
>> Thank you and best regards
>> Ahsan
>>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Fwd: compilation aborted for Handler.cpp (code 2)

2013-09-19 Thread Syed Ahsan Ali
I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
but am getting the subject error. config.out and make.out are attached.
The following command was used for configure:

 ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
--prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
tee config.out
Please help/advise.
Thank you and best regards
Ahsan


Re: [OMPI users] compilation aborted for Handler.cpp (code 2)

2013-09-18 Thread Syed Ahsan Ali
Please find attached again.

On Tue, Sep 17, 2013 at 11:35 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> On Sep 16, 2013, at 9:00 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>> but getting the subject error. config.out and make.out is attached.
>> Following command was used for configure
>>
>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>> tee config.out
>
> I'm sorry; I can't open a .rar file.  Can you send the logs compressed with a 
> conventional compression program like gzip, bzip2, or zip?
>
> Thanks.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Fwd: compilation aborted for Handler.cpp (code 2)

2013-09-16 Thread Syed Ahsan Ali
 Dear All


I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
but am getting the subject error. config.out and make.out are attached.
The following command was used for configure:

 ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
--prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
tee config.out
Please help/advise.
Thank you and best regards
Ahsan


logs.rar
Description: application/rar


Re: [OMPI users] Running openmpi jobs on two system-librdmacm: couldn't read ABI version

2013-03-26 Thread Syed Ahsan Ali
It may be because the other system is running an upgraded version of Linux
which does not have the infiniband drivers. Any solution?
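
(The usual workaround when one of the hosts has no RDMA devices at all is
to keep Open MPI off openib explicitly -- a sketch, reusing the same launch
line as below:

  mpirun --mca btl tcp,self,sm -np 40 /home/MET/hrm/bin/hrm

This restricts the job to the TCP, shared-memory and self transports;
whether TCP can actually route between the two machines still has to be
checked separately.)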


On Tue, Mar 26, 2013 at 12:42 PM, Syed Ahsan Ali <ahsansha...@gmail.com>wrote:

> Tried this but mpirun exits with this error
>
> mpirun -np 40 /home/MET/hrm/bin/hrm
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> librdmacm: couldn't read ABI version.
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> --
> [[33095,1],8]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> Module: OpenFabrics (openib)
>   Host: pmd04.pakmet.com
> Another transport will be used instead, although this may result in
> lower performance.
> --
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[33095,1],28]) is on host:
> compute-02-00.private02.pakmet.com
>   Process 2 ([[33095,1],0]) is on host: pmd02
>   BTLs attempted: openib self sm
> Your MPI job is now going to abort; sorry.
> ----------
>
>
> Ahsan
>
> On Fri, Mar 22, 2013 at 7:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>
>> On Mar 22, 2013, at 3:42 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
>> wrote:
>>
>> Actually due to some data base corruption I am not able to add any new
>> node to cluster from the installer node. So I want to run parallel job on
>> more nodes without adding them to existing cluster.
>> You are right the binaries must be present on the remote node as well.
>> Is this possible throught nfs? just as the compute nodes are nfs mounted
>> with the installer node.
>>
>>
>> Sure - OMPI doesn't care how the binaries got there. Just so long as they
>> are present on the compute node.
>>
>>
>> Ahsan
>>
>>
>> On Fri, Mar 22, 2013 at 3:33 PM, Reuti <re...@staff.uni-marburg.de>wrote:
>>
>>> Am 22.03.2013 um 10:14 schrieb Syed Ahsan Ali:
>>>
>>> > I have a very basic question. If we want to run mpirun job on two
>>> systems which are not part of cluster, then how we can make it possible.
>>> Can the host be specifiend on mpirun which is not compute node, rather a
>>> stand alone system.
>>>
>>> Sure, the machines can be specified as argument to `mpiexec`. But do you
>>> want to run applications just between these two machines, or should they
>>> participate on a larger parallel job with machines of the cluster: then a
>>> direct network connection between outside and inside of the cluster is
>>> necessary by some kind of forwarding in case these are separated networks.
>>>
>>> Also the paths to the started binaries may be different, in case the two
>>> machines are not sharing the same /home with the cluster and this needs to
>>> be honored.
>>>
>>> In case you are using a queuing system and want to route jobs to outside
>>> machines of the set up cluster: it's necessary to negotiate with the admin
>>> to allow jobs being scheduled thereto.
>>>
>>> -- Reuti
>>>
>>>
>>> > Thanks
>>> > Ahsan
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>

Re: [OMPI users] Running openmpi jobs on two system-librdmacm: couldn't read ABI version

2013-03-26 Thread Syed Ahsan Ali
Tried this but mpirun exits with this error

mpirun -np 40 /home/MET/hrm/bin/hrm
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--
[[33095,1],8]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
  Host: pmd04.pakmet.com
Another transport will be used instead, although this may result in
lower performance.
--
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.
  Process 1 ([[33095,1],28]) is on host: compute-02-00.private02.pakmet.com
  Process 2 ([[33095,1],0]) is on host: pmd02
  BTLs attempted: openib self sm
Your MPI job is now going to abort; sorry.
--


Ahsan

On Fri, Mar 22, 2013 at 7:09 PM, Ralph Castain <r...@open-mpi.org> wrote:

>
> On Mar 22, 2013, at 3:42 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Actually due to some data base corruption I am not able to add any new
> node to cluster from the installer node. So I want to run parallel job on
> more nodes without adding them to existing cluster.
> You are right the binaries must be present on the remote node as well.
> Is this possible throught nfs? just as the compute nodes are nfs mounted
> with the installer node.
>
>
> Sure - OMPI doesn't care how the binaries got there. Just so long as they
> are present on the compute node.
>
>
> Ahsan
>
>
> On Fri, Mar 22, 2013 at 3:33 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Am 22.03.2013 um 10:14 schrieb Syed Ahsan Ali:
>>
>> > I have a very basic question. If we want to run mpirun job on two
>> systems which are not part of cluster, then how we can make it possible.
>> Can the host be specifiend on mpirun which is not compute node, rather a
>> stand alone system.
>>
>> Sure, the machines can be specified as argument to `mpiexec`. But do you
>> want to run applications just between these two machines, or should they
>> participate on a larger parallel job with machines of the cluster: then a
>> direct network connection between outside and inside of the cluster is
>> necessary by some kind of forwarding in case these are separated networks.
>>
>> Also the paths to the started binaries may be different, in case the two
>> machines are not sharing the same /home with the cluster and this needs to
>> be honored.
>>
>> In case you are using a queuing system and want to route jobs to outside
>> machines of the set up cluster: it's necessary to negotiate with the admin
>> to allow jobs being scheduled thereto.
>>
>> -- Reuti
>>
>>
>> > Thanks
>> > Ahsan
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Running openmpi jobs on two system

2013-03-22 Thread Syed Ahsan Ali
Actually, due to some database corruption I am not able to add any new node
to the cluster from the installer node. So I want to run a parallel job on more
nodes without adding them to the existing cluster.
You are right, the binaries must be present on the remote node as well.
Is this possible through nfs, just as the compute nodes are nfs-mounted
from the installer node?
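
(As an illustration only -- assuming the outside machine is called node-x,
has passwordless ssh set up, and mounts the same /home over nfs -- the
extra host can simply be listed on the command line:

  mpirun -np 40 --host pmd02,node-x /home/MET/hrm/bin/hrm

node-x is a made-up name here; the only real requirements are that ssh
works and that the binary sits at the same path on every listed host.)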

Ahsan


On Fri, Mar 22, 2013 at 3:33 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 22.03.2013 um 10:14 schrieb Syed Ahsan Ali:
>
> > I have a very basic question. If we want to run mpirun job on two
> systems which are not part of cluster, then how we can make it possible.
> Can the host be specifiend on mpirun which is not compute node, rather a
> stand alone system.
>
> Sure, the machines can be specified as argument to `mpiexec`. But do you
> want to run applications just between these two machines, or should they
> participate on a larger parallel job with machines of the cluster: then a
> direct network connection between outside and inside of the cluster is
> necessary by some kind of forwarding in case these are separated networks.
>
> Also the paths to the started binaries may be different, in case the two
> machines are not sharing the same /home with the cluster and this needs to
> be honored.
>
> In case you are using a queuing system and want to route jobs to outside
> machines of the set up cluster: it's necessary to negotiate with the admin
> to allow jobs being scheduled thereto.
>
> -- Reuti
>
>
> > Thanks
> > Ahsan
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Running openmpi jobs on two system

2013-03-22 Thread Syed Ahsan Ali
I have a very basic question. If we want to run an mpirun job on two systems
which are not part of a cluster, how can we make that possible? Can a
host which is not a compute node, but rather a stand-alone system, be
specified on mpirun?
Thanks
Ahsan


Re: [OMPI users] error while loading shared libraries: libhdf5.so.7:

2013-02-07 Thread Syed Ahsan Ali
Dear John
Looking into the output of ldd on the master and compute nodes solved my problem.
Thanks for such a simple solution. :)
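
(For anyone hitting the same thing, the check boils down to something like
this -- a sketch:

  ldd rca.x | grep 'not found'
  export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH    # where libhdf5.so.7 lives here
  mpirun -x LD_LIBRARY_PATH -np 32 -hostfile hostlist rca.x # export it to the remote ranks

The /usr/local/lib path matches the ldd output quoted further down; adjust
it to wherever the hdf5 libraries actually live on the nodes.)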

On Thu, Feb 7, 2013 at 9:37 PM, Syed Ahsan Ali <ahsansha...@gmail.com>wrote:

> Dear John
> Thanks for the reply. I'll need help of you people to solve this problem.
> I am not expert in HPC and this would be my learning as well. Let me add
> that the cluster is based on Platform Cluster Manager (PCM) by IBM
> Computing. The compute nodes are NFS mounted with the installer node.
> Therefore the directory containing binary rca.x is also present in the
> compute nodes. Unfortunately I was trying to copy gfortran libraries from
> installer node to compute nodes using rsync but something went wrong and
> the model binary rca.x stopped working. I have recompiled the binary after
> reinstalling hdf as well as netcdf which model uses during compilation. All
> path are set in bashrc as well.
> Below is the output of ldd on master as well as compute nodes
>
>
>
> [pmdtest@pmd HadGEM]$ ldd rca.x
>
> libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x2b6a9503c000)
>
> libnetcdff.so.5 => /usr/local/lib/libnetcdff.so.5 (0x2b6a95344000)
>
> libnetcdf.so.7 => /usr/local/lib/libnetcdf.so.7 (0x2b6a95798000)
>
> libhdf5.so.7 => /usr/local/lib/libhdf5.so.7 (0x2b6a95aa1000)
>
> libhdf5_hl.so.7 => /usr/local/lib/libhdf5_hl.so.7 (0x2b6a95f5c000)
>
> libsz.so.2 => /usr/local/lib/libsz.so.2 (0x2b6a9618b000)
>
> libz.so.1 => /usr/local/lib/libz.so.1 (0x2b6a9639f000)
>
> libmpi_f90.so.0 => /home/openmpi/lib/libmpi_f90.so.0 (0x2b6a965b4000)
>
> libmpi_f77.so.0 => /home/openmpi/lib/libmpi_f77.so.0 (0x2b6a967b7000)
>
> libmpi.so.0 => /home/openmpi/lib/libmpi.so.0 (0x2b6a969ee000)
>
> libopen-rte.so.0 => /home/openmpi/lib/libopen-rte.so.0 (0x2b6a96cb6000)
>
> libopen-pal.so.0 => /home/openmpi/lib/libopen-pal.so.0 (0x2b6a96f16000)
>
> libdl.so.2 => /lib64/libdl.so.2 (0x0033e0e0)
>
> libnsl.so.1 => /lib64/libnsl.so.1 (0x0033e220)
>
> libutil.so.1 => /lib64/libutil.so.1 (0x0033ee40)
>
> libm.so.6 => /lib64/libm.so.6 (0x0033e120)
>
> libpthread.so.0 => /lib64/libpthread.so.0 (0x0033e160)
>
> libc.so.6 => /lib64/libc.so.6 (0x0033e0a0)
>
> libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x2b6a971a)
>
> /lib64/ld-linux-x86-64.so.2 (0x0033e060)
>
> librt.so.1 => /lib64/librt.so.1 (0x00362ac0)
>
> libifport.so.5 => /opt/intel/Compiler/11.1/064/lib/intel64/libifport.so.5
> (0x2b6a973b5000)
>
> libifcore.so.5 => /opt/intel/Compiler/11.1/064/lib/intel64/libifcore.so.5
> (0x2b6a974ef000)
>
> libimf.so =>
> /opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libimf.so
> (0x2b6a97765000)
>
> libsvml.so =>
> /opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libsvml.so
> (0x2b6a97c2f000)
>
> libintlc.so.5 =>
> /opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libintlc.so.5
> (0x2b6a984f5000)
>
> libifcoremt.so.5 =>
> /opt/intel/Compiler/11.1/064/lib/intel64/libifcoremt.so.5
> (0x2b6a98743000)
>
> libirng.so =>
> /opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libirng.so
> (0x2b6a989e8000)
>
> [pmdtest@pmd HadGEM]$ ssh compute-01-18
>
> ssh: connect to host compute-01-18 port 22: No route to host
>
> [pmdtest@pmd HadGEM]$ ssh compute-01-13
>
> Last login: Mon Jan 28 07:48:08 2013 from pmd-eth0.private.dns.zone
>
> [pmdtest@compute-01-13 ~]$ ldd rca.x
>
> ldd: ./rca.x: No such file or directory
>
> [pmdtest@compute-01-13 ~]$ ls
> /home/pmdtest/RCA4_CORDEX/RCA4_CORDEX_SAsia/HadGEM/rca.x
> Regards
> Ahsan
>
>
>  On Thu, Feb 7, 2013 at 7:40 PM, John Hearns <hear...@googlemail.com>wrote:
>
>>  ldd rca.x
>>
>> Try logging in to each node and run this command.
>> Even better use pdsh
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
>


-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] error while loading shared libraries: libhdf5.so.7:

2013-02-07 Thread Syed Ahsan Ali
Dear John
Thanks for the reply. I'll need the help of you people to solve this problem. I
am not an expert in HPC and this will be a learning experience for me as well.
Let me add that the cluster is based on Platform Cluster Manager (PCM) by IBM
Platform Computing. The compute nodes are NFS-mounted from the installer node,
therefore the directory containing the binary rca.x is also present on the
compute nodes. Unfortunately I was trying to copy gfortran libraries from the
installer node to the compute nodes using rsync, but something went wrong and
the model binary rca.x stopped working. I have recompiled the binary after
reinstalling hdf as well as netcdf, which the model uses during compilation.
All paths are set in bashrc as well.
Below is the output of ldd on the master as well as the compute nodes:



[pmdtest@pmd HadGEM]$ ldd rca.x

libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x2b6a9503c000)

libnetcdff.so.5 => /usr/local/lib/libnetcdff.so.5 (0x2b6a95344000)

libnetcdf.so.7 => /usr/local/lib/libnetcdf.so.7 (0x2b6a95798000)

libhdf5.so.7 => /usr/local/lib/libhdf5.so.7 (0x2b6a95aa1000)

libhdf5_hl.so.7 => /usr/local/lib/libhdf5_hl.so.7 (0x2b6a95f5c000)

libsz.so.2 => /usr/local/lib/libsz.so.2 (0x2b6a9618b000)

libz.so.1 => /usr/local/lib/libz.so.1 (0x2b6a9639f000)

libmpi_f90.so.0 => /home/openmpi/lib/libmpi_f90.so.0 (0x2b6a965b4000)

libmpi_f77.so.0 => /home/openmpi/lib/libmpi_f77.so.0 (0x2b6a967b7000)

libmpi.so.0 => /home/openmpi/lib/libmpi.so.0 (0x2b6a969ee000)

libopen-rte.so.0 => /home/openmpi/lib/libopen-rte.so.0 (0x2b6a96cb6000)

libopen-pal.so.0 => /home/openmpi/lib/libopen-pal.so.0 (0x2b6a96f16000)

libdl.so.2 => /lib64/libdl.so.2 (0x0033e0e0)

libnsl.so.1 => /lib64/libnsl.so.1 (0x0033e220)

libutil.so.1 => /lib64/libutil.so.1 (0x0033ee40)

libm.so.6 => /lib64/libm.so.6 (0x0033e120)

libpthread.so.0 => /lib64/libpthread.so.0 (0x0033e160)

libc.so.6 => /lib64/libc.so.6 (0x0033e0a0)

libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x2b6a971a)

/lib64/ld-linux-x86-64.so.2 (0x0033e060)

librt.so.1 => /lib64/librt.so.1 (0x00362ac0)

libifport.so.5 => /opt/intel/Compiler/11.1/064/lib/intel64/libifport.so.5
(0x2b6a973b5000)

libifcore.so.5 => /opt/intel/Compiler/11.1/064/lib/intel64/libifcore.so.5
(0x2b6a974ef000)

libimf.so =>
/opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libimf.so
(0x2b6a97765000)

libsvml.so =>
/opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libsvml.so
(0x2b6a97c2f000)

libintlc.so.5 =>
/opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libintlc.so.5
(0x2b6a984f5000)

libifcoremt.so.5 =>
/opt/intel/Compiler/11.1/064/lib/intel64/libifcoremt.so.5
(0x2b6a98743000)

libirng.so =>
/opt/intel/composer_xe_2013.0.079/compiler/lib/intel64/libirng.so
(0x2b6a989e8000)

[pmdtest@pmd HadGEM]$ ssh compute-01-18

ssh: connect to host compute-01-18 port 22: No route to host

[pmdtest@pmd HadGEM]$ ssh compute-01-13

Last login: Mon Jan 28 07:48:08 2013 from pmd-eth0.private.dns.zone

[pmdtest@compute-01-13 ~]$ ldd rca.x

ldd: ./rca.x: No such file or directory

[pmdtest@compute-01-13 ~]$ ls
/home/pmdtest/RCA4_CORDEX/RCA4_CORDEX_SAsia/HadGEM/rca.x
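
(The ldd failure just above is only because of the relative path; running
ldd against the full path on every node, e.g. with pdsh as John suggested,
is what shows whether the libraries resolve there -- a sketch, with the
node range assumed:

  pdsh -w compute-01-[00-18] "ldd /home/pmdtest/RCA4_CORDEX/RCA4_CORDEX_SAsia/HadGEM/rca.x | grep 'not found'"

Any library reported as 'not found' on a compute node is the one to fix.)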
Regards
Ahsan


On Thu, Feb 7, 2013 at 7:40 PM, John Hearns  wrote:

> ldd rca.x
>
> Try logging in to each node and run this command.
> Even better use pdsh
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] error while loading shared libraries: libhdf5.so.7:

2013-02-07 Thread Syed Ahsan Ali
I have been running this program successfully before, but some copy
operation involving the /usr/ directory has caused this error.

The program runs fine on the cores of the same machine.
libhdf5.so.7 is also present.

[pmdtest@pmd HadGEM]$ mpirun -np 32 -hostfile hostlist rca.x
rca.x: error while loading shared libraries: libhdf5.so.7: cannot open
shared object file: No such file or directory

Please advise!
Ahsan


Re: [OMPI users] configure: error: Could not run a simple Fortran 77 program. Aborting.

2013-02-01 Thread Syed Ahsan Ali
Dear Jeff

Thanks for the reply. You are always very helpful.

Please note that the openmpi version is 1.6.3.
The rest of the log files are attached.
On Fri, Feb 1, 2013 at 4:51 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com>wrote:

> configure is not finding a working Fortran compiler.  Please send all the
> information listed here:
>
> http://www.open-mpi.org/community/help/
>
>
> On Feb 1, 2013, at 5:58 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> >
> > I am getting following error while bulding openmpi
> >
> > *** Fortran 90/95 compiler
> > checking whether we are using the GNU Fortran compiler... yes
> > checking whether gfortran accepts -g... yes
> > checking if Fortran 77 compiler works... no
> > **
> > * It appears that your Fortran 77 compiler is unable to produce working
> > * executables.  A simple test application failed to properly
> > * execute.  Note that this is likely not a problem with Open MPI,
> > * but a problem with the local compiler installation.  More
> > * information (including exactly what command was given to the
> > * compiler and what error resulted when the command was executed) is
> > * available in the config.log file in this directory.
> > **
> > configure: error: Could not run a simple Fortran 77 program.  Aborting.
> > make: *** No targets specified and no makefile found.  Stop.
> > make: *** No rule to make target `install'.  Stop.
> > [root@pmd openmpi-1.6.3]#
> >
> > System has gfortran installed
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014






[OMPI users] configure: error: Could not run a simple Fortran 77 program. Aborting.

2013-02-01 Thread Syed Ahsan Ali
I am getting the following error while building openmpi

*** Fortran 90/95 compiler
checking whether we are using the GNU Fortran compiler... yes
checking whether gfortran accepts -g... yes
checking if Fortran 77 compiler works... no
**
* It appears that your Fortran 77 compiler is unable to produce working
* executables.  A simple test application failed to properly
* execute.  Note that this is likely not a problem with Open MPI,
* but a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compiler and what error resulted when the command was executed) is
* available in the config.log file in this directory.
**
configure: error: Could not run a simple Fortran 77 program.  Aborting.
make: *** No targets specified and no makefile found.  Stop.
make: *** No rule to make target `install'.  Stop.
[root@pmd openmpi-1.6.3]#

The system has gfortran installed.
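
config.log records the exact command that failed, and the same test can be
reproduced by hand; a small sketch, assuming gfortran is the compiler that
configure picked up:

  grep -A 20 "checking if Fortran 77 compiler works" config.log
  printf '      program main\n      end\n' > conftest.f
  gfortran conftest.f -o conftest && ./conftest && echo "Fortran compiler OK"

If the hand test fails too, the problem is the compiler installation (or a
missing library package) rather than Open MPI.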


Re: [OMPI users] Infiniband errors

2012-12-20 Thread Syed Ahsan Ali
 /misc autofs
rw,fd=6,pgrp=3876,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs
rw,fd=12,pgrp=3876,timeout=300,minproto=5,maxproto=5,indirect 0 0

Thanks and Regards


On Wed, Dec 19, 2012 at 8:38 PM, Yann Droneaud <ydrone...@opteya.com> wrote:

> On Wednesday 19 December 2012 at 12:12 +0500, Syed Ahsan Ali wrote:
> > Dear John
> >
> > I found this output of ibstatus on some nodes (most probably the
> > problem causing)
> > [root@compute-01-08 ~]# ibstatus
> >
> > Fatal error:  device '*': sys files not found
> > (/sys/class/infiniband/*/ports)
> >
> > Does this show any hardware or software issue?
> >
>
> This is a software issue.
>
> Which Linux (lsb_release --all or cat /etc/redhat-release) and kernel
> (uname -a) version are you using ?
>
> Which modules are loaded (lsmod) ?
>
> Is /sys mounted (mount and/or cat /proc/mounts) ?
>
> Regards.
>
> --
> Yann Droneaud
> OPTEYA
>
>
>


-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
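
That ibstatus message usually means nothing is registered under
/sys/class/infiniband, i.e. the HCA driver stack is not loaded on that node. A
few checks worth running on the affected node (a sketch; the command and
module names are the usual ones from the OFED/Mellanox stack):

  lspci | grep -i mellanox            # is the HCA visible on the PCI bus at all?
  lsmod | grep -E 'mlx4|ib_core'      # are the ConnectX and core IB modules loaded?
  ls /sys/class/infiniband/           # should list a device such as mlx4_0
  modprobe mlx4_ib                    # if the module is missing, loading it may be enough
  /etc/init.d/openibd restart         # or restart the whole OFED stack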


Re: [OMPI users] Infiniband errors

2012-12-19 Thread Syed Ahsan Ali
Dear John

I found this output of ibstatus on some nodes (most probably the ones causing
the problem):
[root@compute-01-08 ~]# ibstatus
Fatal error:  device '*': sys files not found
(/sys/class/infiniband/*/ports)

Does this show any hardware or software issue?

Thanks




On Wed, Nov 28, 2012 at 3:17 PM, John Hearns  wrote:

> Those diagnostics are from Openfabrics.
> What type of infiniband card do you have?
> What drivers are you using?
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

2012-11-29 Thread Syed Ahsan Ali
I receive the following error while running an application.
Does this represent a hardware issue?

[compute-01-01.private.dns.zone][[60090,1],10][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
[compute-01-01.private.dns.zone][[60090,1],13][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
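
These readv timeouts come from the TCP BTL and usually point at a flaky or
misrouted Ethernet interface rather than at Open MPI itself. One way to narrow
it down (a sketch; the interface name and process count are examples) is to
pin the TCP BTL to a known-good interface, or to take TCP out of the picture
entirely on an InfiniBand cluster:

  mpirun -np 32 -hostfile hostlist --mca btl_tcp_if_include eth0 ./app
  mpirun -np 32 -hostfile hostlist --mca btl sm,openib,self ./app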


Re: [OMPI users] Infiniband errors

2012-11-28 Thread Syed Ahsan Ali
I am not sure about the drivers because those were installed by someone else
during cluster setup. I see the following information about the InfiniBand
card: it is a DDR InfiniBand Mellanox ConnectX.


On Wed, Nov 28, 2012 at 3:17 PM, John Hearns <hear...@googlemail.com> wrote:

> Those diagnostics are from Openfabrics.
> What type of infiniband card do you have?
> What drivers are you using?
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Infiniband errors

2012-11-28 Thread Syed Ahsan Ali
Does ibstats come with some other distribution? I don't have this command
available right now.


On Wed, Nov 28, 2012 at 1:14 PM, John Hearns <hear...@googlemail.com> wrote:

> Short answer. Run ibstats or ibstatus.
> Look also at the logs of your subnet manager.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Infiniband errors

2012-11-28 Thread Syed Ahsan Ali
Dear All

I have an application which is run using Open MPI with InfiniBand flags.
The application is a forecast model simulation. A frequent problem is that the
InfiniBand mezzanine cards of the servers become faulty (I don't know why it
happens so frequently), and the model simulation becomes very slow or even
remains stuck. I then have to manually remove the nodes from the hostlist one
by one to check which node has the faulty InfiniBand, so that I can run the
model on the rest of the nodes. Is there any way to check, while the job is
running, which node is having a communication problem over InfiniBand or is
delaying the application?

Thanks!
Ahsan
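
One low-tech way to spot the bad node while a job is running is to poll the
port state and error counters on every host in the hostlist (a sketch,
assuming passwordless ssh, one hostname per line in the hostlist, and the
standard infiniband-diags tools installed on each node):

  for node in $(awk '{print $1}' hostlist | sort -u); do
    echo "== $node =="
    ssh $node 'ibstatus | grep -E "state|rate"; perfquery 2>/dev/null | grep -E "SymbolError|LinkDowned|RcvErrors"'
  done

A node whose port is not ACTIVE, or whose error counters keep climbing, is the
one to pull from the hostlist.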


Re: [OMPI users] Need solution- nodes can't find the paths.

2012-10-03 Thread Syed Ahsan Ali
The data is large and cannot be copied to the local drives of the compute
nodes. The second option is good, but the thing I don't understand is: when
everything else is NFS mounted to the compute nodes, why can't the external
SAN drives be mounted too? I don't know how to export the SAN volume from the
head node. Is there any other solution?

On Wed, Oct 3, 2012 at 1:13 PM, John Hearns <hear...@googlemail.com> wrote:

> You need to either copy the data to storage which the cluster nodes have
> mounted. Surely your cluster vendor included local storage?
>
> Or you can configure the cluster head node to export the SAN volume by NFS
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
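
Exporting the SAN volume from the head node over NFS normally takes only a few
lines; a rough sketch, assuming the SAN is mounted on the head node at /san
and the compute nodes sit on 10.1.0.0/24 (adjust the path and network to your
site):

  # on the head node
  echo '/san 10.1.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
  exportfs -ra
  # on each compute node, mount it at the same path so the program sees identical paths
  mkdir -p /san
  mount -t nfs headnode:/san /san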


[OMPI users] Need solution- nodes can't find the paths.

2012-10-03 Thread Syed Ahsan Ali
Dear All

I have a Dell cluster running Platform Cluster Manager (PCM); the compute
nodes are NFS mounted with the master node. The storage (SAN) is mounted on
the installer node only. The problem is that I am running a programme which
uses data residing on the storage, so running the program on the master node
is no problem, but when I mpirun across the other nodes they are not able to
find the paths (as the storage partitions are not mounted on the compute
nodes). I have made symbolic links on the installer node, but on the compute
nodes those symbolic links show up red (broken).
Please advise how to resolve this issue.

Best Regards
Ahsan


Re: [OMPI users] UC Permission denied, please try again.

2012-08-03 Thread Syed Ahsan Ali
Dear Reuti

Platform Cluster Manager (PCM) was used to build the cluster, and I am not
familiar with the queuing system. However, ssh works without a passphrase
between the nodes and between the master node and the other compute nodes; the
problem is with one node.

On Thu, Aug 2, 2012 at 9:07 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 02.08.2012 um 17:57 schrieb Syed Ahsan Ali:
>
> > Yes all the compute nodes are NFS mounted with the master node, so
> everthing is same, all other nodes are accessible on ssh without password.
>
> Are you using a queuing system?
>
> SSH could be setup to work from the master node without passphrase, but
> not between the nodes.
>
> -- Reuti
>
> > On Thu, Aug 2, 2012 at 1:09 PM, John Hearns <hear...@googlemail.com>
> wrote:
> > On 02/08/2012, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> > > Yes the issue has been diagnosed. I can ssh them but they are asking
> for
> > > passwords
> >
> > You need to configure 'passwordless ssh'
> >
> > Can we assume that your home directory is shared across all cluster
> nodes?
> > That means when you log into a cluster node the directory which is
> > your Unix 'home directory' has exactly the same files as on the
> > cluster head node?
> >
> > Then you need to configure passwordless ssh
> > Use this as a starting point:
> > http://www.open-mpi.org/faq/?category=rsh#ssh-keys
> > _______
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> > --
> > Syed Ahsan Ali Bokhari
> > Electronic Engineer (EE)
> >
> > Research & Development Division
> > Pakistan Meteorological Department H-8/4, Islamabad.
> > Phone # off  +92518358714
> > Cell # +923155145014
> >
> > _______
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] UC Permission denied, please try again.

2012-08-02 Thread Syed Ahsan Ali
Yes, all the compute nodes are NFS mounted with the master node, so everything
is the same; all other nodes are accessible over ssh without a password.

On Thu, Aug 2, 2012 at 1:09 PM, John Hearns <hear...@googlemail.com> wrote:

> On 02/08/2012, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> > Yes the issue has been diagnosed. I can ssh them but they are asking for
> > passwords
>
> You need to configure 'passwordless ssh'
>
> Can we assume that your home directory is shared across all cluster nodes?
> That means when you log into a cluster node the directory which is
> your Unix 'home directory' has exactly the same files as on the
> cluster head node?
>
> Then you need to configure passwordless ssh
> Use this as a starting point:
> http://www.open-mpi.org/faq/?category=rsh#ssh-keys
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
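
Since the home directory is shared over NFS, passwordless ssh only needs a key
generated once and appended to your own authorized_keys; a minimal sketch (the
node name is just an example):

  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
  ssh compute-02-02 hostname          # should now return without asking for a password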


Re: [OMPI users] UC Permission denied, please try again.

2012-08-02 Thread Syed Ahsan Ali
Yes, the issue has been diagnosed. I can ssh to them, but they are asking for
passwords.



On Wed, Aug 1, 2012 at 2:02 PM, Rushton Martin <jmrush...@qinetiq.com>wrote:

>  That looks like a login issue to compute-02-02, -00 and -03.  Can you
> ssh to them?
>
> *From:* users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] *On
> Behalf Of *Syed Ahsan Ali
> *Sent:* 01 August 2012 08:45
> *To:* Open MPI Users
> *Subject:* [OMPI users] Permission denied, please try again.
>
> Dear All
>
> I am having a problem while running an application on the cluster. The
> application was working fine, but now this error has arisen. We used to run
> the application the same way with the user pmdtest and there was no error. I
> don't know which permission it is asking for. Please help!
>
> [pmdtest@pmd02 d00_dayfiles]$ less *_hrm
> mpirun -np 32 /home/MET/hrm/bin/hrm
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-with-mic,password).
> --
> A daemon (pid 25164) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> compute-02-02 - daemon did not report back when launched
> compute-02-00 - daemon did not report back when launched
> compute-02-03 - daemon did not report back when launched
>
> Best Regards
> Ahsan
>
> This email and any attachments to it may be confidential and are intended
> solely for the use of the individual to whom it is addressed. If you are
> not the intended recipient of this email, you must neither take any action
> based upon its contents, nor copy or show it to anyone. Please contact the
> sender if you believe you have received this email in error. QinetiQ may
> monitor email traffic data and also the content of email for the purposes
> of security. QinetiQ Limited (Registered in England & Wales: Company
> Number: 3796233) Registered office: Cody Technology Park, Ively Road,
> Farnborough, Hampshire, GU14 0LX 
> http://www.qinetiq.com
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Permission denied, please try again.

2012-08-01 Thread Syed Ahsan Ali
Dear All

I am having a problem while running an application on the cluster. The
application was working fine, but now this error has arisen. We used to run
the application the same way with the user pmdtest and there was no error. I
don't know which permission it is asking for. Please help!

 [pmdtest@pmd02 d00_dayfiles]$ less *_hrm
mpirun -np 32 /home/MET/hrm/bin/hrm
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
--
A daemon (pid 25164) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
compute-02-02 - daemon did not report back when launched
compute-02-00 - daemon did not report back when launched
compute-02-03 - daemon did not report back when launched

Best Regards
Ahsan


Re: [OMPI users] undefined reference to `netcdf_mp_nf90_open_'

2012-06-27 Thread Syed Ahsan Ali
Dear Tim

I built netcdf with mpif90 using this option:

./configure MPIF90=/home/openmpi/bin/mpif90

and compiled the application using the mpif90 compiler option.

Regards
Ahsan

>  If your mpif90 is properly built and set up with the same Fortran
> compiler you are using, it appears that either you didn't build the netcdf
> Fortran 90 modules with that compiler, or you didn't set the include path
> for the netcdf modules.  This would work the same with mpif90 as with the
> underlying Fortran compiler.
>
>
> --
> Tim Prince
>
> __**_
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
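
If the netcdf Fortran 90 modules really were built with the same compiler, the
remaining piece is usually the link line; a sketch of the flags that would
typically go into the Fopts/LDFLAGS for this setup (the /home/netcdf path is
only an example, and on older netcdf builds the Fortran symbols live in
-lnetcdf alone rather than -lnetcdff):

  NETCDF=/home/netcdf
  FFLAGS="$FFLAGS -I$NETCDF/include"                    # so 'use netcdf' finds netcdf.mod
  LDFLAGS="$LDFLAGS -L$NETCDF/lib -lnetcdff -lnetcdf"   # Fortran bindings first, then the C library
  mpif90 $FFLAGS -o cosmo *.o $LDFLAGS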


Re: [OMPI users] undefined reference to `netcdf_mp_nf90_open_'

2012-06-27 Thread Syed Ahsan Ali
Dear Dima

I was not sure, but it seemed to be something related to netcdf and mpif90. I
have been struggling with the compilation of cosmo; int2lm parallel has been
installed successfully, but cosmo is giving this error. I don't have any
-lnetcdf or -lnetcdff in the linker options of the Fopts file. I tried
recompiling netcdf with the mpif90 option. Your help with the configuration of
cosmo would be highly appreciated.

Best Regards
Ahsan

On Tue, Jun 26, 2012 at 6:21 PM, Dmitry N. Mikushin <maemar...@gmail.com>wrote:

> Dear Syed,
>
> Why do you think it is related to MPI?
>
> You seem to be compiling the COSMO model, which depends on netcdf lib, but
> the symbols are not passed to linker by some reason. Two main reasons are:
> (1) the library linking flag is missing (check you have something like
> -lnetcdf -lnetcdff in your linker command line), (2) The netcdf Fortran
> bindings are compiled with a different naming notation (check names in the
> lib really contain the expected number of final underscores).
>
> I compiled cosmo 4.22 with openmpi and netcdf not long ago without any
> problems.
>
> Best,
> - Dima.
>
> 2012/6/26 Syed Ahsan Ali <ahsansha...@gmail.com>
>
>> Dear All
>>
>> I am getting following error while compilation of an application. Seems
>> like something related to netcdf and mpif90. Although I have compiled
>> netcdf with mpif90 option, dont why this error is happening. Any hint would
>> be highly appreciated.
>>
>>
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
>> function `src_obs_proc_cdf_mp_obs_cdf_read_org_':
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x17aa):
>> undefined reference to `netcdf_mp_nf90_open_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
>> function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1000e):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10039):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10064):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1008b):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x100c8):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10227):
>> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x102eb):
>> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x103af):
>> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10473):
>> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10559):
>> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10890):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108bb):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108e2):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10909):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10930):
>> undefined reference to `netcdf_mp_nf90_inq_varid_'
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o:/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x109e8):
>> more undefined references to `netcdf_mp_nf90_inq_varid_' follow
>>
>> /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
>> function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':
>>
>> /home/

Re: [OMPI users] undefined reference to `netcdf_mp_nf90_open_'

2012-06-27 Thread Syed Ahsan Ali
Dear Jeff

Can you explain a little how to do this: "you can take the mpif90 command that
is being used to generate these errors and add "--showme" to the end of it,
and you'll see what underlying compiler command is being executed under the
covers."
Regards
Ahsan

On Tue, Jun 26, 2012 at 6:20 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Sorry, this looks like an application issue -- i.e., the linker error
> you're getting doesn't look like it's coming from Open MPI.  Perhaps it's a
> missing application/middleware library.
>
> More specifically, you can take the mpif90 command that is being used to
> generate these errors and add "--showme" to the end of it, and you'll see
> what underlying compiler command is being executed under the covers.  That
> might help you understand exactly what is going on.
>
>
>
> On Jun 26, 2012, at 7:13 AM, Syed Ahsan Ali wrote:
>
> > Dear All
> >
> > I am getting following error while compilation of an application. Seems
> like something related to netcdf and mpif90. Although I have compiled
> netcdf with mpif90 option, dont why this error is happening. Any hint would
> be highly appreciated.
> >
> >
> > /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
> function `src_obs_proc_cdf_mp_obs_cdf_read_org_':
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x17aa):
> undefined reference to `netcdf_mp_nf90_open_'
> >
> > /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
> function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1000e):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10039):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10064):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1008b):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x100c8):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10227):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x102eb):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x103af):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10473):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10559):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10890):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108bb):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108e2):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10909):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10930):
> undefined reference to `netcdf_mp_nf90_inq_varid_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o:/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x109e8):
> more undefined references to `netcdf_mp_nf90_inq_varid_' follow
> >
> > /home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
> function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10abc):
> undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'
> >
> >
> /home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.tex
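
For reference, the --showme suggestion above is simply appended to the same
wrapper command; a short sketch (the cosmo object name is only an example from
this thread):

  mpif90 --showme                  # print the full underlying compiler command line
  mpif90 --showme:compile          # only the compile-time flags
  mpif90 --showme:link             # only the link-time flags
  mpif90 -o cosmo obj/src_obs_proc_cdf.o --showme   # show how this link step would be run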

[OMPI users] undefined reference to `netcdf_mp_nf90_open_'

2012-06-26 Thread Syed Ahsan Ali
Dear All

I am getting the following error while compiling an application. It seems like
something related to netcdf and mpif90. Although I have compiled netcdf with
the mpif90 option, I don't know why this error is happening. Any hint would be
highly appreciated.



/home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
function `src_obs_proc_cdf_mp_obs_cdf_read_org_':

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x17aa):
undefined reference to `netcdf_mp_nf90_open_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1000e):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10039):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10064):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1008b):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x100c8):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10227):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x102eb):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x103af):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10473):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10559):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10890):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108bb):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x108e2):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10909):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10930):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o:/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x109e8):
more undefined references to `netcdf_mp_nf90_inq_varid_' follow

/home/pmdtest/cosmo/source/cosmo_110525_4.18/obj/src_obs_proc_cdf.o: In
function `src_obs_proc_cdf_mp_obs_cdf_read_temp_pilot_':

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10abc):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10b8c):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10c5c):
undefined reference to `netcdf_mp_nf90_get_var_1d_eightbytereal_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10d2c):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10dfc):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10ecc):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10ef3):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x10fbb):
undefined reference to `netcdf_mp_nf90_get_var_1d_fourbyteint_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x1105a):
undefined reference to `netcdf_mp_nf90_inq_varid_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x110cb):
undefined reference to `netcdf_mp_nf90_inquire_variable_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x11102):
undefined reference to `netcdf_mp_nf90_inquire_dimension_'

/home/pmdtest/cosmo/source/cosmo_110525_4.18/src/src_obs_proc_cdf.f90:(.text+0x118be):
undefined reference to `netcdf_mp_nf90_inq_varid_'


Re: [OMPI users] HRM problem

2012-04-24 Thread Syed Ahsan Ali
I am not familiar with attaching a debugger to the processes. The other things
you asked are answered as follows:

  Is this the first time you've run it (with Open MPI? with any MPI?)
  No. We have been running this and other models, but this problem has arisen now.

  How many processes is the job using? Are you oversubscribing your processors?
  I have tried to run on the cluster (184 cores) as well as on 8 cores of the same server.

  What version of Open MPI are you using?
  openmpi 1.4.2

  Have you tested all network connections?
  Yes.

  It might help us to know the size of cluster you are running and what type of network.
  The cluster has 32 Dell PowerEdge blade server nodes; connectivity is Gigabit Ethernet and InfiniBand.


On Tue, Apr 24, 2012 at 3:02 PM, TERRY DONTJE <terry.don...@oracle.com>wrote:

> To determine if an MPI process is waiting for a message do what Rayson
> suggested and attach a debugger to the processes and see if any of them are
> stuck in MPI.  Either internally in a MPI_Recv or MPI_Wait call or looping
> on a MPI_Test call.
>
> Other things to consider.
>   Is this the first time you've ran it (with Open MPI? with any MPI?)?
>   How many processes is the job using?  Are you oversubscribing your
> processors?
>   What version of Open MPI are you using?
>   Have you tested all network connections?
>   It might help us to know the size of cluster you are running and what
> type of network?
>
> --td
>
> On 4/24/2012 2:42 AM, Syed Ahsan Ali wrote:
>
> Dear Rayson,
>
> That is a Nuemrical model that is written by National weather service of a
> country. The logs of the model show every detail about the simulation
> progress. I have checked on the remote nodes as well the application binary
> is running but the logs show no progress, it is just waiting at a point.
> The input data is correct everything is fine. How can I check if the MPI
> task is waiting for a message?
> Ahsan
>
> On Tue, Apr 24, 2012 at 11:03 AM, Rayson Ho <raysonlo...@gmail.com> wrote:
>
>> Seems like there's a bug in the application. Did you or someone else
>> write it, or did you get it from an ISV??
>>
>> You can log onto one of the nodes, attach a debugger, and see if the
>> MPI task is waiting for a message (looping in one of the MPI receive
>> functions)...
>>
>> Rayson
>>
>> =
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>> On Tue, Apr 24, 2012 at 12:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
>> wrote:
>> > Dear All,
>> >
>> > I am having problem with running an application on Dell cluster . The
>> model
>> > starts well but no further progress is shown. It just stuck. I have
>> checked
>> > the systems, no apparent hardware error is there. Other open mpi
>> > applications are running well on the same cluster. I have tried running
>> the
>> > application on cores of the same server as well but the problem is
>> same. The
>> > application just don't move further. The same application is also
>> running
>> > well on a backup cluster. Please help.
>> >
>> >
>> > Thanks and Best Regards
>> >
>> > Ahsan
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> --
>> ==
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
>
>
> ___
> users mailing 
> listusers@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
>   Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
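
Attaching a debugger to one of the running ranks is usually enough to tell
whether it is stuck inside an MPI receive; a sketch, assuming gdb is installed
on the compute node, the node name is only an example, and <pid> stands for
the pid found with ps:

  ssh compute-01-01 'ps -ef | grep [h]rm'   # find the pid of one rank
  ssh -t compute-01-01 gdb -p <pid>         # attach to it
  # inside gdb:
  #   bt        <- a backtrace ending in MPI_Recv / MPI_Wait / opal_progress means it is waiting for a message
  #   detach
  #   quit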


Re: [OMPI users] HRM problem

2012-04-24 Thread Syed Ahsan Ali
Dear Rayson,

That is a numerical model written by the national weather service of a
country. The logs of the model show every detail of the simulation progress. I
have checked on the remote nodes as well: the application binary is running,
but the logs show no progress; it is just waiting at a point. The input data
is correct and everything is fine. How can I check if the MPI task is waiting
for a message?
Ahsan

On Tue, Apr 24, 2012 at 11:03 AM, Rayson Ho <raysonlo...@gmail.com> wrote:

> Seems like there's a bug in the application. Did you or someone else
> write it, or did you get it from an ISV??
>
> You can log onto one of the nodes, attach a debugger, and see if the
> MPI task is waiting for a message (looping in one of the MPI receive
> functions)...
>
> Rayson
>
> =
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
> On Tue, Apr 24, 2012 at 12:49 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
> > Dear All,
> >
> > I am having problem with running an application on Dell cluster . The
> model
> > starts well but no further progress is shown. It just stuck. I have
> checked
> > the systems, no apparent hardware error is there. Other open mpi
> > applications are running well on the same cluster. I have tried running
> the
> > application on cores of the same server as well but the problem is same.
> The
> > application just don't move further. The same application is also running
> > well on a backup cluster. Please help.
> >
> >
> > Thanks and Best Regards
> >
> > Ahsan
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> ==
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] HRM problem

2012-04-24 Thread Syed Ahsan Ali
Dear All,

I am having a problem running an application on a Dell cluster. The model
starts well but no further progress is shown; it is just stuck. I have checked
the systems and there is no apparent hardware error. Other Open MPI
applications are running well on the same cluster. I have tried running the
application on the cores of a single server as well, but the problem is the
same: the application just doesn't move further. The same application is also
running well on a backup cluster. Please help.


Thanks and Best Regards

Ahsan


Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format errorI

2012-03-01 Thread Syed Ahsan Ali
I am able to run the application with LSF now. It is strange, because I wasn't
able to trace any error.

On Thu, Mar 1, 2012 at 11:34 AM, PukkiMonkey <pukkimon...@gmail.com> wrote:

> What Jeff means is that because u didn't have echo "mpirun...>>outfile"
> but
> echo mpirun>>outfile ,
> you were piping the output to the outfile instead of stdout.
>
> Sent from my iPhone
>
> On Feb 29, 2012, at 8:44 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Sorry Jeff I couldn't get you point.
>
> On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres <jsquy...@cisco.com>wrote:
>
>> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:
>>
>> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
>> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
>> ./hrm >> ${OUTFILE}_hrm 2>&1
>> > [pmdtest@pmd02 d00_dayfiles]$
>>
>> Because you used >> and 2>&1, the output when to your ${OUTFILE}_hrm
>> file, not stdout.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
Sorry Jeff, I couldn't get your point.

On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres <jsquy...@cisco.com> wrote:

> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:
>
> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
> ./hrm >> ${OUTFILE}_hrm 2>&1
> > [pmdtest@pmd02 d00_dayfiles]$
>
> Because you used >> and 2>&1, the output when to your ${OUTFILE}_hrm file,
> not stdout.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
I tried to echo but it returns nothing.

[pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
$i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
./hrm >> ${OUTFILE}_hrm 2>&1
[pmdtest@pmd02 d00_dayfiles]$


On Wed, Feb 29, 2012 at 12:01 PM, Jingcha Joba <pukkimon...@gmail.com>wrote:

> Just to be sure, can u try
> echo "${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1"
> and check if you are indeed getting the correct argument.
>
> If that looks fine, can u add --mca btl_openib_verbose 1 to the mpirun
> argument list, and see what it says?
>
>
>
> On Tue, Feb 28, 2012 at 10:15 PM, Syed Ahsan Ali <ahsansha...@gmail.com>wrote:
>
>> After creating new hostlist and making the scripts again it is working
>> now and picking up the hostlist as u can see :
>>
>> *
>> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>> (The above command is used to submit job)*
>>
>> *
>> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>> mpirun -np 32 /home/MET/hrm/bin/hrm
>> *
>> but it just stays on this command and the model simulation don't start
>> further. I can't understand this behavior because the simulation works
>> fine when hostlist is not given as follows:
>>
>> *${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1*
>>
>> **
>> **
>> * *
>>
>> On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres <jsquy...@cisco.com>wrote:
>>
>>> Yes, this is known behavior for our CLI parser.  We could probably
>>> improve that a bit...
>>>
>>> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
>>>
>>> >
>>> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
>>> >
>>> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
>>> >>
>>> >>> Afraid I have to agree with the prior reply - sounds like NPROC
>>> isn't getting defined, which causes your cmd line to look like your
>>> original posting.
>>> >>
>>> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
>>> >>
>>> >> But: is this the intended behavior of mpirun? It looks like -np is
>>> eating -hostlist as a numeric argument? Shouldn't it complain about:
>>> argument for -np missing or argument not being numeric?
>>> >
>>> > Probably - I'm sure that the atol is returning zero, which should
>>> cause an error output. I'll check.
>>> >
>>> >
>>> >>
>>> >> -- Reuti
>>> >>
>>> >>
>>> >>>
>>> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
>>> >>>
>>> >>>> The following command in used in script for job submission
>>> >>>>
>>> >>>> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>>> >>>> where NPROC in defined in someother file. The same application is
>>> running on the other system with same configuration.
>>> >>>>
>>> >>>> On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey <
>>> pukkimon...@gmail.com> wrote:
>>> >>>> No of processes missing after -np
>>> >>>> Should be something like:
>>> >>>> mpirun -np 256 ./exec
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Sent from my iPhone
>>> >>>>
>>> >>>> On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali <ahsansha...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>>> Dear All,
>>> >>>>>
>>> >>>>> I am running an application with mpirun but it gives following
>>> error, it is not picking up hostlist, there are other applications which
>>> run well with hostlist but it just gives following error with
>>> >>>>>
>>> >>>>>
>>> >>>>> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>>> >>>>> mpirun -np  /home/MET/hrm/bin/hrm
>>> >>>>>
>>> --
>>> >>>&g

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
After creating a new hostlist and making the scripts again, it is working now
and picking up the hostlist, as you can see:

${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
(The above command is used to submit the job.)

[pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
mpirun -np 32 /home/MET/hrm/bin/hrm

but it just stays at this command and the model simulation doesn't start
further. I can't understand this behavior, because the simulation works fine
when the hostlist is not given, as follows:

${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1

On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres <jsquy...@cisco.com> wrote:

> Yes, this is known behavior for our CLI parser.  We could probably improve
> that a bit...
>
> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
>
> >
> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
> >
> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
> >>
> >>> Afraid I have to agree with the prior reply - sounds like NPROC isn't
> getting defined, which causes your cmd line to look like your original
> posting.
> >>
> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
> >>
> >> But: is this the intended behavior of mpirun? It looks like -np is
> eating -hostlist as a numeric argument? Shouldn't it complain about:
> argument for -np missing or argument not being numeric?
> >
> > Probably - I'm sure that the atol is returning zero, which should cause
> an error output. I'll check.
> >
> >
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
> >>>
> >>>> The following command in used in script for job submission
> >>>>
> >>>> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
> >>>> where NPROC in defined in someother file. The same application is
> running on the other system with same configuration.
> >>>>
> >>>> On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey <pukkimon...@gmail.com>
> wrote:
> >>>> No of processes missing after -np
> >>>> Should be something like:
> >>>> mpirun -np 256 ./exec
> >>>>
> >>>>
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>> On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
> >>>>
> >>>>> Dear All,
> >>>>>
> >>>>> I am running an application with mpirun but it gives following
> error, it is not picking up hostlist, there are other applications which
> run well with hostlist but it just gives following error with
> >>>>>
> >>>>>
> >>>>> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
> >>>>> mpirun -np  /home/MET/hrm/bin/hrm
> >>>>>
> --
> >>>>> Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec
> format error
> >>>>>
> >>>>> This could mean that your PATH or executable name is wrong, or that
> you do not
> >>>>> have the necessary permissions.  Please ensure that the executable
> is able to be
> >>>>> found and executed.
> >>>>>
> >>>>>
> --
> >>>>>
> >>>>> Following the permission of the hostlist directory. Please help me
> to remove this error.
> >>>>>
> >>>>> [pmdtest@pmd02 bin]$ ll
> >>>>> total 7570
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
> >>>>> -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
> >>>>>
> >>>>>
> >>>>> Thank you and Regards
> >>>>> Ahsan
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>&g

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-28 Thread Syed Ahsan Ali
The following command is used in the script for job submission:

${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl sm,openib,self
--mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
where NPROC in defined in someother file. The same application is running
on the other system with same configuration.

On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey <pukkimon...@gmail.com> wrote:

>   No of processes missing after -np
> Should be something like:
> mpirun -np 256 ./exec
>
>
>
> Sent from my iPhone
>
> On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>  Dear All,
>
> I am running an application with mpirun but it gives following error, it
> is not picking up hostlist, there are other applications which run well
> with hostlist but it just gives following error with
>
>
>  [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
> mpirun -np  /home/MET/hrm/bin/hrm
> --
> Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format
> error
>
> This could mean that your PATH or executable name is wrong, or that you do
> not
> have the necessary permissions.  Please ensure that the executable is able
> to be
> found and executed.
>
> --
>
> Following the permission of the hostlist directory. Please help me to
> remove this error.
>
>  [pmdtest@pmd02 bin]$ ll
> total 7570
> -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
> -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
> *-rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist*
> -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
> -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
> -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
>
>
> Thank you and Regards
> Ahsan
>
>
>
>
>
>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


[OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-27 Thread Syed Ahsan Ali
Dear All,

I am running an application with mpirun, but it gives the following error: it
is not picking up the hostlist. There are other applications which run well
with the hostlist, but this one just gives the following error:


 [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
mpirun -np  /home/MET/hrm/bin/hrm
--
Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format
error

This could mean that your PATH or executable name is wrong, or that you do
not
have the necessary permissions.  Please ensure that the executable is able
to be
found and executed.

--

The permissions of the hostlist directory are shown below. Please help me to
remove this error.

 [pmdtest@pmd02 bin]$ ll
total 7570
-rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
-rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
*-rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist*
-rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
-rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
-rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts


Thank you and Regards
Ahsan
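
The symptom (mpirun -np with no number, and the hostlist being treated as the
executable) is exactly what happens when NPROC expands to an empty string. A
quick sanity check to put in the script just before the mpirun line (a sketch):

  echo "NPROC=[${NPROC}]  MPIRUN=[${MPIRUN}]  ABSDIR=[${ABSDIR}]"
  ls -l ${ABSDIR}/hostlist
  # note the quotes: without them the >> inside the string is executed instead of printed
  echo "${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist ./hrm >> ${OUTFILE}_hrm 2>&1"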


[OMPI users] Error building Openmpi (configure: error: C compiler cannot create executables)

2012-02-02 Thread Syed Ahsan Ali
s Layer release date... May 04, 2010
checking Open Portable Access Layer Subversion repository version... r23093

*** Initialization, setup
configure: builddir: /home/precis/opemmpi/openmpi-1.4.2
configure: srcdir: /home/precis/opemmpi/openmpi-1.4.2
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking for prefix by checking for ompi_clean... no
installing to directory "/usr/local"

*** Configuration options
checking whether to run code coverage... no
checking whether to compile with branch probabilities... no
checking whether to debug memory usage... no
checking whether to profile memory usage... no
checking if want developer-level compiler pickyness... no
checking if want developer-level debugging code... no
checking if want sparse process groups... no
checking if want Fortran 77 bindings... yes
checking if want Fortran 90 bindings... yes
checking desired Fortran 90 bindings "size"... small
checking whether to enable PMPI... yes
checking if want C++ bindings... yes
checking if want MPI::SEEK_SET support... yes
checking if want to enable weak symbol support... yes
checking if want run-time MPI parameter checking... runtime
checking if want to install OMPI header files... no
checking if want pretty-print stacktrace... yes
checking if peruse support is required... no
checking max supported array dimension in F90 MPI bindings... 4
checking if pty support should be enabled... yes
checking if user wants dlopen support... yes
checking if heterogeneous support should be enabled... no
checking if want trace file debugging... no
checking if want full RTE support... yes
checking if want fault tolerance... Disabled fault tolerance
checking if want IPv6 support... yes (if underlying system supports it)
checking if want orterun "--prefix" behavior to be enabled by default... no
checking for package/brand string... Open MPI pre...@precis2.pakmet.com.pk Distribution
checking for ident string... 1.4.2
checking whether to add padding to the openib control header... no
checking whether to use an alternative checksum algo for messages... no


== Compiler and preprocessor tests


*** C compiler and preprocessor
checking for style of include used by make... GNU
checking for gcc... gcc
checking for C compiler default output file name...
configure: error: in `/home/precis/opemmpi/openmpi-1.4.2':
configure: error: C compiler cannot create executables
See `config.log' for more details.

-- 
Kind Regards

Syed Ahsan Ali Bokhari
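
As in the Fortran 77 case above, config.log records the exact gcc command that
failed, and reproducing it with a one-line test program usually shows whether
gcc itself or a missing package (e.g. the glibc development headers) is the
problem. A minimal sketch:

  grep -B 5 "C compiler cannot create executables" config.log
  printf '#include <stdio.h>\nint main(void){printf("hello\\n");return 0;}\n' > conftest.c
  gcc conftest.c -o conftest && ./conftest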