[OMPI users] Problem executing mpic++ for LAMMPS installation

2012-10-12 Thread Rafael Antonio Soler-Crespo

Hello everyone,
 
I'm a new student at my university, and I need to install the LAMMPS software to 
perform some molecular dynamics simulations for my work. The cluster I am 
working on gives me no root access (obviously), so I am installing everything 
under my home account, and I'm having some difficulty getting LAMMPS built 
there. I downloaded and installed openmpi, and had to edit ~/.bashrc to 
add the line:
 
export PATH=/home/ras536/bin/openmpi/bin/:${PATH}
 
This gets the shell to recognize the newly installed mpic++ and the other 
wrappers. Upon doing this, I run:
 
$ mpic++
 
And I successfully obtain the message:
 
g++: no input files
 
So I think everything is fine with my openmpi 1.1 installation (LAMMPS 
requires this version). However, when I try to make LAMMPS using:

$ make openmpi 
 
I get warnings like this:
 
mpic++ -O2 -funroll-loops -fstrict-aliasing -Wall -W -Wno-uninitialized  
-DLAMMPS_GZIP   -DFFT_FFTW   -c memory.cpp
mpic++ -O2 -funroll-loops -fstrict-aliasing -Wall -W -Wno-uninitialized  
-DLAMMPS_GZIP   -DFFT_FFTW   -c min_cg.cpp
mpic++ -O2 -funroll-loops -fstrict-aliasing -Wall -W -Wno-uninitialized  
-DLAMMPS_GZIP   -DFFT_FFTW   -c min.cpp
min.cpp: In member function 'void LAMMPS_NS::Min::force_clear()':
min.cpp:547: warning: unused variable 'i'

Furthermore, upon trying to run the executable:
 
./lmp_yotta 
 
I get this:
 
./lmp_yotta: error while loading shared libraries: liborte.so.0: cannot open 
shared object file: No such file or directory
 
Any idea what might be going on? Am I missing some linking setup so that the 
LAMMPS build can proceed cleanly?
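
My guess is the runtime linker cannot find the openmpi shared libraries, so 
perhaps I also need something like this in ~/.bashrc (assuming the libraries 
landed under the same prefix as bin):

export LD_LIBRARY_PATH=/home/ras536/bin/openmpi/lib/:${LD_LIBRARY_PATH}

but I'm not sure if that's the whole story.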
 
Thanks for the help,  

Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Ralph Castain
Something doesn't make sense here. If you direct launch with srun, there is no 
orted involved. The orted only gets launched if you start with mpirun.
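
To illustrate the difference (a sketch, reusing the binary from your example):

# direct launch: slurm starts the MPI processes itself - no orted anywhere
srun -n 1024 --ntasks-per-node 12 xhpl

# mpirun launch: mpirun starts one orted per node, and each orted
# fork/execs its local MPI processes
salloc -n 1024 mpirun xhpl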

Did you configure --with-pmi and point it at where that include file resides? 
Otherwise, the procs will all think they are singletons.
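
A PMI-enabled build would look something like this (the slurm prefix here is 
only illustrative - point it at wherever pmi.h actually lives on your system):

./configure --with-pmi=/opt/slurm --prefix=/opt/openmpi-1.6.1
make && make install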

Sent from my iPhone

On Oct 12, 2012, at 7:27 AM, Michael Di Domenico  wrote:

> what isn't working is when i fire off an MPI job with over 800 ranks,
> they don't all actually start up a process
> 
> e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
> 
> and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
> all of them have actually started xhpl
> 
> most will read 12 started processes, but an inconsistent list of nodes
> will fail to actually start xhpl and stall the whole job
> 
> if i look at all the nodes allocated to my job, it does start the orte
> process though
> 
> what i need to figure out, is why the orte process starts, but fails
> to actually start xhpl on some of the nodes
> 
> unfortunately, the list of nodes that don't start xhpl during my runs
> changes each time and no hardware errors are being detected.  if i
> cancel the job and restart the job over and over, eventually one will
> actually kick off and run to completion.
> 
> if i run the process outside of slurm just using openmpi, it seems to
> behave correctly, so i'm leaning towards a slurm interacting with
> openmpi problem.
> 
> what i'd like to do is instrument a debug in openmpi that will tell me
> what openmpi is waiting on in order to kick off the xhpl binary
> 
> i'm testing to see whether it's a psm related problem now, i'll check
> back if i can narrow the scope a little more
> 
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain  wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>> 
>> 
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>>  wrote:
>>> 
>>> adding some additional info
>>> 
>>> did an strace on an orted process where xhpl failed to start, i did
>>> this after the mpirun execution, so i probably missed some output, but
>>> it keeps scrolling
>>> 
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>> 
>>> i didn't see anything useful in /proc under those file descriptors,
>>> but perhaps i missed something i don't know to look for
>>> 
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>>  wrote:
 to add a little more detail, it looks like xhpl is not actually
 starting on all nodes when i kick off the mpirun
 
 each time i cancel and restart the job, the nodes that do not start
 change, so i can't call it a bad node
 
 if i disable infiniband with --mca btl self,sm,tcp on occasion i can
 get xhpl to actually run, but it's not consistent
 
 i'm going to check my ethernet network and make sure there's no
 problems there (could this be an OOB error with mpirun?), on the nodes
 that fail to start xhpl, i do see the orte process, but nothing in the
 logs about why it failed to launch xhpl
 
 
 
 On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
  wrote:
> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
> start when the rank count gets fairly high into the thousands.
> 
> My symptom is the job fires up via slurm, and I can see all the xhpl
> processes on the nodes, but it never kicks over to the next process.
> 
> My question is, what debugs should I turn on to tell me what the
> system might be waiting on?
> 
> I've checked a bunch of things, but I'm probably overlooking something
> trivial (which is par for me).
> 
> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
> Infiniband/PSM



Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Gus Correa

Hi

I don't use Slurm, and our clusters are fairly small (a few tens of nodes, a 
few hundred cores).
Having said that, I know that Torque, which we use here,
requires specific system configuration changes on large clusters,
like increasing the maximum number of open files,
increasing the ARP cache size, etc.
Apparently Slurm also needs some system tweaking on large clusters:
https://computing.llnl.gov/linux/slurm/big_sys.html
Could this be the problem?
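
Just to give a flavor, the tweaks that page describes look roughly like 
this (the numbers are only illustrative, not recommendations):

# raise the per-process open file limit
ulimit -n 8192
# enlarge the kernel ARP cache thresholds
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768
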
Anyway, just a thought.

Gus Correa

On 10/12/2012 09:27 AM, Michael Di Domenico wrote:

what isn't working is when i fire off an MPI job with over 800 ranks,
they don't all actually start up a process

e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl

and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
all of them have actually started xhpl

most will read 12 started processes, but an inconsistent list of nodes
will fail to actually start xhpl and stall the whole job

if i look at all the nodes allocated to my job, it does start the orte
process though

what i need to figure out, is why the orte process starts, but fails
to actually start xhpl on some of the nodes

unfortunately, the list of nodes that don't start xhpl during my runs
changes each time and no hardware errors are being detected.  if i
cancel the job and restart the job over and over, eventually one will
actually kick off and run to completion.

if i run the process outside of slurm just using openmpi, it seems to
behave correctly, so i'm leaning towards a slurm interacting with
openmpi problem.

what i'd like to do is instrument a debug in openmpi that will tell me
what openmpi is waiting on in order to kick off the xhpl binary

i'm testing to see whether it's a psm related problem now, i'll check
back if i can narrow the scope a little more

On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain  wrote:

I'm afraid I'm confused - I don't understand what is and isn't working. What
"next process" isn't starting?


On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
  wrote:


adding some additional info

did an strace on an orted process where xhpl failed to start, i did
this after the mpirun execution, so i probably missed some output, but
it keeps scrolling

poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
events=POLLIN}], 9, 1000) = 0 (Timeout)

i didn't see anything useful in /proc under those file descriptors,
but perhaps i missed something i don't know to look for

On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
  wrote:

to add a little more detail, it looks like xhpl is not actually
starting on all nodes when i kick off the mpirun

each time i cancel and restart the job, the nodes that do not start
change, so i can't call it a bad node

if i disable infiniband with --mca btl self,sm,tcp on occasion i can
get xhpl to actually run, but it's not consistent

i'm going to check my ethernet network and make sure there's no
problems there (could this be an OOB error with mpirun?), on the nodes
that fail to start xhpl, i do see the orte process, but nothing in the
logs about why it failed to launch xhpl



On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
  wrote:

I'm trying to diagnose an MPI job (in this case xhpl), that fails to
start when the rank count gets fairly high into the thousands.

My symptom is the job fires up via slurm, and I can see all the xhpl
processes on the nodes, but it never kicks over to the next process.

My question is, what debugs should I turn on to tell me what the
system might be waiting on?

I've checked a bunch of things, but I'm probably overlooking something
trivial (which is par for me).

I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
Infiniband/PSM





Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Michael Di Domenico
turned on the daemon debugs for orted and noticed this difference
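
(for the record, i enabled the daemon debugs with something along these 
lines - the exact flags depend on how the job gets launched:

mpirun --debug-daemons --mca plm_base_verbose 5 ... xhpl
)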

--- i get this on all the good nodes (ones that actually started xhpl)

Daemon was launched on node08 - beginning to initialize
[node08:21230] [[64354,0],1] orted_cmd: received add_local_procs
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],84]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],85]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],86]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],87]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],88]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],89]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],90]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],91]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],92]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],93]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],94]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],95]
[node08:21230] [[64354,0],1] orted: up and running - waiting for commands!
[node08:21230] procdir: /tmp/openmpi-sessions-user@node08_0/28/1/1
[node08:21230] jobdir: /tmp/openmpi-sessions-user@node08_/44228/1
[node08:21230] top: openmpi-sessions-user@node08_0
[node08:21230] tmp: /tmp
[...repeats the above five lines a bunch of times...]

--- and this on the ones that do not start xhpl

Daemon was launched on node06 - beginning to initialize
[node06:11230] [[46344,0],1] orted: up and running - waiting for commands!
[node06:11230] procdir: /tmp/openmpi-sessions-user@node06_0/28/1/1
[node06:11230] jobdir: /tmp/openmpi-sessions-user@node06_/44228/1
[node06:11230] top: openmpi-sessions-user@node06_0
[node06:11230] tmp: /tmp
[...above lines only come out once...]

On Fri, Oct 12, 2012 at 9:27 AM, Michael Di Domenico
 wrote:
> what isn't working is when i fire off an MPI job with over 800 ranks,
> they don't all actually start up a process
>
> e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
>
> and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
> all of them have actually started xhpl
>
> most will read 12 started processes, but an inconsistent list of nodes
> will fail to actually start xhpl and stall the whole job
>
> if i look at all the nodes allocated to my job, it does start the orte
> process though
>
> what i need to figure out, is why the orte process starts, but fails
> to actually start xhpl on some of the nodes
>
> unfortunately, the list of nodes that don't start xhpl during my runs
> changes each time and no hardware errors are being detected.  if i
> cancel the job and restart the job over and over, eventually one will
> actually kick off and run to completion.
>
> if i run the process outside of slurm just using openmpi, it seems to
> behave correctly, so i'm leaning towards a slurm interacting with
> openmpi problem.
>
> what i'd like to do is instrument a debug in openmpi that will tell me
> what openmpi is waiting on in order to kick off the xhpl binary
>
> i'm testing to see whether it's a psm related problem now, i'll check
> back if i can narrow the scope a little more
>
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain  wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>>
>>
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>>  wrote:
>>>
>>> adding some additional info
>>>
>>> did an strace on an orted process where xhpl failed to start, i did
>>> this after the mpirun execution, so i probably missed some output, but
>>> it keeps scrolling
>>>
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>>
>>> i didn't see anything useful in /proc under those file descriptors,
>>> but perhaps i missed something i don't know to look for
>>>
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>>  wrote:
>>> > to add a little more detail, it looks like xhpl is not actually
>>> > starting on all nodes when i kick off the mpirun
>>> >
>>> > each time i cancel and restart the job, the nodes that do not start
>>> > change, so i can't call it a bad node
>>> >
>>> > if i disable infiniband with --mca btl self,sm,tcp on occasion i can
>>> > get xhpl to actually run, but it's not consistent
>>> >
>>> > i'm going to check my ethernet network and make sure there's no
>>> > problems there (could this be an OOB error with mpirun?), on the nodes
>>> > that fail to start xhpl, i do see the orte process, but nothing in the
>>> > logs about why it failed to launch xhpl

Re: [OMPI users] debugs for jobs not starting

2012-10-12 Thread Michael Di Domenico
what isn't working is when i fire off an MPI job with over 800 ranks,
they don't all actually start up a process

e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl

and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
all of them have actually started xhpl

most will read 12 started processes, but an inconsistent list of nodes
will fail to actually start xhpl and stall the whole job
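
(the per-node count above comes from a quick loop along these lines, 
assuming passwordless ssh to the compute nodes:

for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
  echo "$node: $(ssh $node pgrep -c xhpl)"
done
)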

if i look at all the nodes allocated to my job, it does start the orte
process though

what i need to figure out, is why the orte process starts, but fails
to actually start xhpl on some of the nodes

unfortunately, the list of nodes that don't start xhpl during my runs
changes each time and no hardware errors are being detected.  if i
cancel the job and restart the job over and over, eventually one will
actually kick off and run to completion.

if i run the process outside of slurm just using openmpi, it seems to
behave correctly, so i'm leaning towards a slurm interacting with
openmpi problem.

what i'd like to do is instrument a debug in openmpi that will tell me
what openmpi is waiting on in order to kick off the xhpl binary

i'm testing to see whether it's a psm related problem now, i'll check
back if i can narrow the scope a little more

On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain  wrote:
> I'm afraid I'm confused - I don't understand what is and isn't working. What
> "next process" isn't starting?
>
>
> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>  wrote:
>>
>> adding some additional info
>>
>> did an strace on an orted process where xhpl failed to start, i did
>> this after the mpirun execution, so i probably missed some output, but
>> it keeps scrolling
>>
>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>
>> i didn't see anything useful in /proc under those file descriptors,
>> but perhaps i missed something i don't know to look for
>>
>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>  wrote:
>> > to add a little more detail, it looks like xhpl is not actually
>> > starting on all nodes when i kick off the mpirun
>> >
>> > each time i cancel and restart the job, the nodes that do not start
>> > change, so i can't call it a bad node
>> >
>> > if i disable infiniband with --mca btl self,sm,tcp on occasion i can
>> > get xhpl to actually run, but it's not consistent
>> >
>> > i'm going to check my ethernet network and make sure there's no
>> > problems there (could this be an OOB error with mpirun?), on the nodes
>> > that fail to start xhpl, i do see the orte process, but nothing in the
>> > logs about why it failed to launch xhpl
>> >
>> >
>> >
>> > On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>> >  wrote:
>> >> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
>> >> start when the rank count gets fairly high into the thousands.
>> >>
>> >> My symptom is the job fires up via slurm, and I can see all the xhpl
>> >> processes on the nodes, but it never kicks over to the next process.
>> >>
>> >> My question is, what debugs should I turn on to tell me what the
>> >> system might be waiting on?
>> >>
>> >> I've checked a bunch of things, but I'm probably overlooking something
>> >> trivial (which is par for me).
>> >>
>> >> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
>> >> Infiniband/PSM


Re: [OMPI users] [1.6.2] Compilation Error (at vtfilter) with Intel Compiler

2012-10-12 Thread Christian Krause
Thanks for the link - also setting CXX, F77, and FC did the trick :)

./configure CC=icc CXX=icpc F77=ifort FC=ifort
--prefix=/usr/local/openmpi/1.6.2_intel_12.0.4 --with-sge
--with-hwloc=/usr/local/hwloc/1.5_intel_12.0.4
--with-openib-libdir=/usr/lib64 --with-udapl-libdir=/usr/lib64

works.

On Thu, Oct 11, 2012 at 8:11 PM, wookietreiber
 wrote:
> Hi,
>
> I couldn't find the error I get in the mails from your link. But I also
> didn't set CXX, F77 and FC. I'll try that tomorrow and we'll see if it
> changes anything.
>
> I find the error weird because some file is not found, which I guess
> should not occur just from switching compilers ...
>
>
> On Thu, Oct 11, 2012 at 01:09:28PM -0400, Gus Correa wrote:
>> Hi Christian
>>
>> Would your problem be similar to the one reported two days ago on
>> this thread? [It also failed to compile the VampirTrace tools, and
>> it also didn't have the Intel C++ compiler specified to configure.]
>>
>> http://www.open-mpi.org/community/lists/users/2012/10/20449.php
>>
>> Have you tried to specify the Intel C++ compiler
>> to the configure script?
>>
>> ./configure CC=icc CXX=icpc  ... etc, etc ...
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>>
>> On 10/11/2012 11:00 AM, Christian Krause wrote:
>> >Hi,
>> >
>> >I tried to compile the current OpenMPI 1.6.2 with the Intel Compiler
>> >
>> ># icc --version
>> >icc (ICC) 12.0.4 20110427
>> >
>> >
>> >The error I get is the following (I changed directly in the vtfilter
>> >directory where the error occurs to reduce output for this mail):
>> >
>> ># cd ompi/contrib/vt/vt/tools/vtfilter/
>> ># make
>> >Making all in .
>> >make[1]: Entering directory
>> >`/gpfs0/global/local/src/xxx-mpi/openmpi-1.6.2/ompi/contrib/vt/vt/tools/vtfilter'
>> >   CXX    vtfilter-vt_filter.o
>> >cc1plus: error: vtfilter-vt_filter.d: No such file or directory
>> >make[1]: *** [vtfilter-vt_filter.o] Error 1
>> >make[1]: Leaving directory
>> >`/gpfs0/global/local/src/xxx-mpi/openmpi-1.6.2/ompi/contrib/vt/vt/tools/vtfilter'
>> >make: *** [all-recursive] Error 1
>> >
>> >
>> >configure options from config.log:
>> >
>> >./configure CC=icc --prefix=/usr/local/openmpi/1.6.2_intel_12.0.4
>> >--with-sge --with-hwloc=/usr/local/hwloc/1.5_intel_12.0.4
>> >--with-openib-libdir=/usr/lib64 --with-udapl-libdir=/usr/lib64
>> >
>> >
>> >I have already built hwloc and pciutils locally using icc. Also I
>> >recently compiled OpenMPI 1.6.2 with gcc 4.7.1 with hwloc and pciutils
>> >too which worked without problems (configure basically the same, i.e.
>> >not setting CC and using different hwloc). That's why I'm assuming the
>> >error is somehow icc's fault ... I'm new to this mailing list and I
>> >already received some mails concerning the Intel Compiler so I figure
>> >there may be others who've experienced this problem?
>> >
>> >
>> >Thanks for any help in advance.
>> >
>> >Regards
>> >Christian
>
> --
>
> Best Regards
> Christian Krause aka wookietreiber
>