I'm having problems w/an old OpenMPI test program which I re-compiled using
OpenMPI 1.8.7 for CentOS 6.3 running Univa Grid Engine 8.6.4.
1. Are the special PE requirements for Son of Grid Engine needed for Univa Grid
Engine 8.6.4 (in particular qsort_args and/or control_slaves both being pre
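A quick way to see what a PE actually defines is to dump it and grep for the
fields in question (a sketch; "orte" stands in for whatever PE name your queue
references):

    qconf -sp orte | egrep 'control_slaves|qsort_args|start_proc_args'
    # control_slaves TRUE is what lets mpirun launch its daemons via qrsh -inherit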
The strange thing is OpenMPI isn't mentioned anywhere as being a dependency for
Caffe! I haven't read anything that suggests OpenMPI is supported in Caffe
either. This is why I figure it must be a dependency of Caffe (of which there
are 15) that relies on OpenMPI.
I tried setting the compiler
I know this could possibly be off-topic, but the errors are OpenMPI errors, and
if anyone could shed light on the nature of these errors, I figure it would be
this group:
CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
g++ .build_release/tools/upgrade_solver_proto_text.o -o
.build_re
kend for some reason.
It gets used by both vader and sm, so no help there.
I’m afraid I’ll have to defer to Nathan from here as he is more familiar with
it than I.
On Mar 17, 2016, at 4:55 PM, Lane, William <william.l...@cshs.org> wrote:
I ran OpenMPI using the "-mca btl ^vader"
lib64/libc.so.6(__libc_start_main+0xfd)[0x2b1b0e298cdd]
[csclprd3-6-12:30667] [14] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
[csclprd3-6-12:30667] *** End of error message ***
-Bill L.
From: users [users-boun...@open-mpi.org] on behalf of Lane, William
[willia
ory BTL to
segfault.
Try turning vader off and see if that helps - I’m not sure what you are using,
but maybe “-mca btl ^vader” will suffice
Nathan - any other suggestions?
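For reference, a minimal sketch of that workaround (the binary name is taken
from the trace above; substitute your own):

    mpirun -np 132 --mca btl ^vader ./a_1_10_1.out
    # "^" excludes the named BTL; to also rule out the older shared-memory BTL:
    mpirun -np 132 --mca btl ^vader,sm ./a_1_10_1.out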
On Mar 17, 2016, at 4:40 PM, Lane, William <william.l...@cshs.org> wrote:
I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open-files
limits to be >= 4096 in order to function when large numbers of slots
were requested (with 1.3.3 this hit at roughly 85 slots). Is this requirement
still present for OpenMPI versions 1.10.1 and greater?
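For anyone checking, a minimal sketch of inspecting and raising those limits on
CentOS (the values are illustrative):

    ulimit -n     # current soft limit on open files
    ulimit -Hn    # hard limit
    # to raise them persistently, add to /etc/security/limits.conf (as root):
    #   *   soft   nofile   4096
    #   *   hard   nofile   8192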
I'm having some is
I'm getting an error message early on:
[csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh
-inherit -nostdin -V -verbose" for launching
unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
[csclprd3-6-10:18352] [[36373,0],21] plm:rsh: usi
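The write failure is in the job's SGE tmpdir, so the obvious first checks on
that node would be (a sketch; the path comes from the error above):

    df -h /tmp        # is the filesystem full?
    df -i /tmp        # or out of inodes?
    ls /tmp | wc -l   # leftover session directories from dead jobs?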
Re: [OMPI users] Son of Grid Engine, Parallel Environments and
OpenMPI 1.8.7
"Lane, William" writes:
> Our issues with OpenMPI 1.8.7 and Son-of-Gridengine turned out to be
> down to using the wrong Parallel Environment. Having a PE with
> control_slaves set to TRUE and start
checked, and ompi transparently
falls back to --hetero-nodes if needed.
Bottom line: on a heterogeneous cluster, it is safer (and may be required) to
use the --hetero-nodes option.
Cheers,
Gilles
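A minimal sketch of the option in context (hostfile and binary names assumed):

    mpirun -np 132 --hetero-nodes --hostfile hostfile-single ./a.out
    # tells mpirun not to assume every node has the same topology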
On Wednesday, August 12, 2015, Dave Love <d.l...@liverpool.ac.uk> wrote:
"Lane, William" &g
From: users [users-boun...@open-mpi.org] on behalf of Dave Love
[d.l...@liverpool.ac.uk]
Sent: Tuesday, August 11, 2015 9:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and
OpenMPI 1.8.7
"Lane, William" writes:
> I
ng and what the daemons think is wrong.
On Wed, Aug 5, 2015 at 3:17 PM, Lane, William <william.l...@cshs.org> wrote:
Actually, we're still having problems submitting OpenMPI 1.8.7 jobs
to the cluster thru SGE (which we need to do in order to track usage
stats on the cluster). I
there for qsort, but I
haven't checked it against SGE. Let us know if you hit a problem and we'll try
to figure it out.
Glad to hear your cluster is working - nice to have such challenges to shake
the cobwebs out :-)
On Wed, Aug 5, 2015 at 12:43 PM, Lane, William <william.l...
I read @
https://www.open-mpi.org/faq/?category=sge
that for OpenMPI Parallel Environments there's
a special consideration for Son of Grid Engine:
'"qsort_args" is necessary with the Son of Grid Engine distribution,
version 8.1.1 and later, and probably only applicable to it. For
very
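For context, a sketch of a complete PE definition along the lines the FAQ
suggests (the name and slot count are placeholders; load it with qconf -Ap):

    pe_name            orte
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary TRUE
    qsort_args         NONE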
I'm running OpenMPI 1.8.7 tests on a mixed-bag cluster of various systems
under CentOS 6.3. I've been intermittently getting warnings about not having
the proper NUMA libraries installed. Which NUMA libraries should be installed
for CentOS 6.3 and OpenMPI 1.8.7?
Here's what I currently have instal
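For what it's worth, the packages OMPI's affinity support typically wants on
CentOS 6 are (a sketch; the exact set depends on how 1.8.7 was configured):

    yum install numactl numactl-devel hwloc hwloc-devel
    # numactl provides libnuma; hwloc is what the 1.8 series uses for binding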
I'm running a mixed cluster of Blades (HS21 and HS22 chassis), x3550-M3 and
X3550-M4 systems, some of which support hyperthreading, while others
don't (specifically the HS21 blades) all on CentOS 6.3 w/SGE.
I have no problems running my simple OpenMPI 1.8.7 test code outside of SGE
(with or with
ckages are
installed on compute nodes?
Thank you for all your help w/this Ralph, now I can move on and
get the Linpack benchmark running.
-Bill L.
From: users [users-boun...@open-mpi.org] on behalf of Lane, William
[william.l...@cshs.org]
Sent: Thursday, July 16, 20
I'm just curious, if we run an OpenMPI job and it makes use of non-local memory
(i.e. memory tied to another socket) what kind of effects are seen on
performance?
How would you go about testing the above? I can't think of any command line
parameter that
would allow one to split an OpenMPI proces
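One way to probe the remote-memory penalty outside of mpirun is to pin the
process and its memory to different NUMA nodes with numactl and compare
timings (a sketch; the binary name is assumed):

    numactl --cpunodebind=0 --membind=0 ./a.out   # local memory
    numactl --cpunodebind=0 --membind=1 ./a.out   # remote memory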
users-boun...@open-mpi.org] on behalf of Ralph Castain
[r...@open-mpi.org]
Sent: Wednesday, July 15, 2015 10:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
Stable 1.8.7 has been released - please let me know if the problem is resolved.
On
: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
Can you give it a try? I’m skeptical, but it might work. The rc is out on the
web site:
http://www.open-mpi.org/software/ompi/v1.8/
On Jul 14, 2015, at 11:17 AM, Lane, William <william.l...@cshs.o
ut at least one problem.
We'll just have to see what happens and attack it next.
Ralph
On Tue, Jul 7, 2015 at 8:07 PM, Lane, William <william.l...@cshs.org> wrote:
I'm sorry I haven't been able to get the lstopo information for
all the nodes, but I had to get t
's
> only one PU per core, then hyperthreading is disabled.
>
>
>> On Jun 29, 2015, at 4:42 PM, Lane, William wrote:
>>
>> Would the output of dmidecode -t processor and/or lstopo tell me conclusively
>> if hyperthreading is enabled or not? Hyperthreading
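Any of these should answer it (a sketch):

    lscpu | egrep '^(Thread|Core|Socket)'    # "Thread(s) per core: 2" means HT is on
    lstopo --of console | grep -c PU         # more PUs than cores means HT is on
    dmidecode -t processor | grep -i count   # compare Core Count vs Thread Count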
its OS, are you?
Cheers,
Gilles
mca_btl_sm_add_procs(
int mca_btl_sm_add_procs(
On Wednesday, June 24, 2015, Lane, William <william.l...@cshs.org> wrote:
Gilles,
All the blades only have two-core Xeons (without hyperthreading) populating
both their sockets. All
the x3550 nodes
e
the issue :-(
can you try to reproduce the issue with the smallest hostfile, and then run
lstopo on all the nodes?
btw, you are not mixing 32-bit and 64-bit OS, are you?
Cheers,
Gilles
mca_btl_sm_add_procs(
int mca_btl_sm_add_procs(
On Wednesday, June 24, 2015, Lane, Willia
If there is a problem with
placement, that is where it would exist.
On Tue, Jun 23, 2015 at 5:12 PM, Lane, William <william.l...@cshs.org> wrote:
Ralph,
There is something funny going on
Cheers,
Gilles
On Friday, June 19, 2015, Lane, William <william.l...@cshs.org> wrote:
Ralph,
I created a hostfile that just has the names of the hosts while
specifying no slot information whatsoever (e.g. csclprd3-0-0)
and received the following errors:
mpirun -np 132 -report-binding
Ralph,
There is something funny going on, the trace from the
runs w/the debug build aren't showing any differences from
what I got earlier. However, I did do a run w/the --bind-to core
switch and was surprised to see that hyperthreading cores were
sometimes being used.
Here are the traces that I ha
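For anyone reproducing this, the invocation was along these lines (a sketch;
hostfile and binary names assumed):

    mpirun -np 132 --bind-to core --report-bindings --hostfile hostfile-single ./a.out
    # each rank reports the cpuset it was bound to, so hyperthread PUs show up
    # directly in the per-rank binding maps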
e the issue?
do all the nodes have the same configuration?
if yes, what happens without --hetero-nodes?
Cheers,
Gilles
On Friday, June 19, 2015, Lane, William wrote:
Ralph,
I created a hostfile that just has the names of the hosts while
specifying no slot information whatsoever (e.g. cs
l?
I also suspect that you would have no problems if you -bind-to none - does that
in fact work?
On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
I'm having a strange problem w/OpenMPI 1.8.6. If I run
my OpenMPI test code (compiled against OpenMPI 1.8.
I'm having a strange problem w/OpenMPI 1.8.6. If I run
my OpenMPI test code (compiled against OpenMPI 1.8.6
libraries) on < 131 slots I get no issues. Anything over 131
errors out:
mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
--hostfile hostfile-single --mca btl_tcp_if_in
Is there any fixed date at which time OpenMPI stable 1.8.6 will
be released? I know the pre-release version is available but I'd
rather wait for the stable version.
Thank you,
-Bill Lane
I've compiled the Linpack benchmark using OpenMPI 1.8.5 libraries
and include files on CentOS 6.4.
I've tested the binary on the one Intel node (some
sort of 4-core Xeon) and it runs, but when I try to run it on any of
the old Sunfire opteron compute nodes it appears to hang (although
top indicate
he behavior from this particular cmd line.
Looks like we are binding-to-core by default, even if you specify
use-hwthread-cpus. I’ll fix that one - still don’t understand the segfaults.
Bill: can you shed some light on those?
On Apr 9, 2015, at 8:28 PM, Lane, William <william.l...@cshs.
On Apr 8, 2015, at 10:55 AM, Lane, William <william.l...@cshs.org> wrote:
Ralph,
I added one of the newer LGA2011 nodes to my hostfile and
re-ran the benchmark successfully and saw some strange results WRT the
binding directives. Why are hyperthreading cores being used
on the LGA
Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
On Apr 8, 2015, at 9:29 AM, Lane, William <william.l...@cshs.org> wrote:
Ralph,
Thanks for YOUR help, I never
would've managed to get the LAPACK
benchmark running on more than one
node in our cluster without your help.
Ralph, i
ybe just NUMA nodes that also have hyperthreading capabilities?
Bill L.
From: users [users-boun...@open-mpi.org] on behalf of Lane, William
[william.l...@cshs.org]
Sent: Wednesday, April 08, 2015 9:29 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2
at 1:17 PM, Lane, William <william.l...@cshs.org> wrote:
Ralph,
I've finally had some luck using the following:
$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix
$MPI_DI
lso help.
On Apr 6, 2015, at 12:24 PM, Lane, William <william.l...@cshs.org> wrote:
Ralph,
For the following two different commandline invocations of the LAPACK benchmark
$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots
--mca btl_tcp_if_include eth0 --h
Sent: Sunday, April 05, 2015 6:09 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
On Apr 5, 2015, at 5:58 PM, Lane, William <william.l...@cshs.org> wrote:
I think some of the Intel Blade systems in the cluster are
dual core, but don't suppor
't require a swap region be mounted - I didn't see anything in your
original message indicating that OMPI had actually crashed, but just wasn't
launching due to the above issue. Were you actually seeing crashes as well?
On Wed, Apr 1, 2015 at 8:31 AM, Lane, William <william.l
upport one proc/core - i.e., we
didn't find at least 128 cores in the sum of the nodes you told us about. I take it
you were expecting that there were that many or more?
Ralph
On Wed, Apr 1, 2015 at 12:54 AM, Lane, William <william.l...@cshs.org> wrote:
I'm having problems ru
I'm having problems running OpenMPI jobs
(using a hostfile) on an HPC cluster running
ROCKS on CentOS 6.3. I'm running OpenMPI
outside of Sun Grid Engine (i.e. it is not submitted
as a job to SGE). The program being run is a LAPACK
benchmark. The commandline parameter I'm
using to run the jobs is:
@open-mpi.org] on behalf of Ralph Castain
[r...@open-mpi.org]
Sent: Tuesday, September 02, 2014 11:03 AM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
(updated findings)
On Sep 2, 2014, at 10:48 AM, Lane, William wrote:
> Ralph,
>
> T
e between the "to" and the "none". You'll take
a performance hit, but it should at least run.
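I.e., the 1.8-series spelling (a sketch):

    mpirun -np 132 --bind-to none ./a.out
    # the hyphenated --bind-to-none is the older, deprecated form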
On Aug 29, 2014, at 11:29 PM, Lane, William wrote:
> The --bind-to-none switch didn't help, I'm still getting the same errors.
>
> The only NUMA pa
uest > 28 slots
(updated findings)
On 28.08.2014 at 10:09, Lane, William wrote:
> I have some updates on these issues and some test results as well.
>
> We upgraded OpenMPI to the latest version 1.8.2, but when submitting jobs via
> the SGE orte parallel environment received
>
users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres)
[jsquy...@cisco.com]
Sent: Friday, August 08, 2014 5:25 AM
To: Open MPI User's List
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
On Aug 8, 2014, at 1:24 AM, Lane, William wrote:
> Using the "
hich may be a
core or a hyperthread).
So if you're running in a bind-to-core situation with a "before OMPI
supported HT properly" version, then you'll bind 2 MPI processes to a single
core, and that will likely be pretty terrible for overall performance.
Does that help?
ject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
Hmmm...that's not a "bug", but just a packaging issue with the way CentOS
distributed some variants of OMPI that requires you install/update things in a
specific order.
On Jul 20, 2014, at 11:34 PM, Lane, Will
28 slots
I'm unaware of any CentOS-OMPI bug, and I've been using CentOS throughout the
6.x series running OMPI 1.6.x and above.
I can't speak to the older versions of CentOS and/or the older versions of OMPI.
On Jul 19, 2014, at 8:14 PM, Lane, William <william.l...@cshs
OMPI users] Mpirun 1.5.4 problems when request > 28 slots
Not for this test case size. You should be just fine with the default values.
If I understand you correctly, you've run this app at scale before on another
cluster without problem?
On Jul 19, 2014, at 1:34 PM, Lane, William
r app, which is what I'd expect given
your description
* try using gdb (or pick your debugger) to look at the corefile and see where
it is failing
I'd also suggest updating OMPI to the 1.6.5 or 1.8.1 versions, but I doubt
that's the issue behind this problem.
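A sketch of that corefile workflow (the binary name is assumed; the PID is
from the error below):

    ulimit -c unlimited    # enable core dumps, then re-run to reproduce
    gdb ./a.out core.802   # open the corefile from the failed rank
    # then "bt" at the gdb prompt for the backtrace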
On Jul 19, 201
I'm getting consistent errors of the form:
"mpirun noticed that process rank 3 with PID 802 on node csclprd3-0-8 exited on
signal 11 (Segmentation fault)."
whenever I request more than 28 slots. These
errors even occur when I run mpirun locally
on a compute node that has 32 slots (8 cores, 16 wi
Please unsubscribe me from this mailing list.
Thank you,
-Bill Lane
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ole
Nielsen [ole.moller.niel...@gmail.com]
Sent: Monday, September 19, 2011 1:39 AM
To: us...@open-mpi.org
Subject: Re: [OMP
Thank you for your help Ralph and Reuti,
The problem turned out to be that the number of file descriptors was insufficient.
The reason given by a sysadmin was that since SGE isn't a regular user, it
wasn't initially picking up the new upper bound on the number of file
descriptors.
-Bill Lane
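For the record, the limit can also be raised on the Grid Engine side rather
than system-wide; a sketch, assuming your SGE build supports the sge_conf(5)
descriptor parameters:

    qconf -mconf   # edit the global config and set, e.g.:
    #   execd_params   S_DESCRIPTORS=4096 H_DESCRIPTORS=8192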
descriptors on that node, assuming your application does a complete wireup
across all procs.
Updating to 1.4.3 would be a good idea as it is more stable, but it may not
resolve this problem if the issue is one of the above.
HTH
Ralph
On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
> Please h
Please help me resolve the following problems with a 306-node Rocks cluster
using SGE. Please note I can run the
job successfully on <87 slots, but not any more than that.
We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub and
the following lines from a script:
mpirun -n $
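For comparison, a minimal submission-script sketch of the same pattern (queue
and PE names are placeholders; SGE fills in $NSLOTS):

    #!/bin/bash
    #$ -N ompi_test
    #$ -pe orte 64             # a PE with control_slaves TRUE
    #$ -cwd -j y
    mpirun -n $NSLOTS ./a.out  # $NSLOTS is the granted slot count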