Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Gus Correa
If that could help, Greg: on the compute nodes I normally add "* - memlock -1", "* - stack -1", and "* - nofile 32768" to /etc/security/limits.conf, and add "ulimit -n 32768", "ulimit -l unlimited", and "ulimit -s unlimited" to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
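The settings Gus describes, laid out as a sketch (values taken from his message; whether your site uses /etc/init.d/pbs_mom or /etc/sysconfig/pbs_mom is distribution-dependent):

```shell
# Fragment 1 - /etc/security/limits.conf (not shell; shown for reference).
# Raises the limits for all users, soft and hard:
#   * - memlock -1
#   * - stack   -1
#   * - nofile  32768

# Fragment 2 - lines to add to /etc/init.d/pbs_mom or /etc/sysconfig/pbs_mom
# so the daemon, and every job it spawns, starts with raised limits:
#   ulimit -n 32768      # max open file descriptors
#   ulimit -l unlimited  # max locked memory (openib registration needs this)
#   ulimit -s unlimited  # max stack size
```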

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Martin Siegert
It isn't really Torque that is imposing those constraints: the torque_mom initscript inherits from the OS whatever ulimits are in effect at that time, and each job then inherits its ulimits from the pbs_mom. Thus, you need to change the ulimits from whatever is set at startup time, e.g., in
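Martin's inheritance point can be demonstrated outside Torque with a plain shell (a standalone sketch; pbs_mom passes limits to its jobs the same way any parent process does):

```shell
# A child process inherits the parent's ulimits - exactly how jobs end up
# with whatever limits pbs_mom had at startup. Lowering a soft limit is
# always permitted, so this runs unprivileged.
ulimit -S -n 1024

# A child shell (standing in for a job spawned by pbs_mom) sees the new limit:
child_nofile=$(bash -c 'ulimit -S -n')
echo "child sees nofile soft limit: $child_nofile"
```

The fix therefore belongs wherever pbs_mom itself is started, not in the job script.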

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
+1 On Jun 11, 2014, at 6:01 PM, Ralph Castain wrote: > Yeah, I think we've seen that somewhere before too... > > > On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote: > >> Agreed. The problem is not with UDCM. I don't think anything is wrong with >>

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Ralph Castain
Yeah, I think we've seen that somewhere before too... On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote: > Agreed. The problem is not with UDCM. I don't think anything is wrong with > the system. I think his Torque is imposing major constraints on the maximum > size that

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
Agreed. The problem is not with UDCM. I don't think anything is wrong with the system. I think his Torque is imposing major constraints on the maximum size that can be locked into memory. Josh On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm wrote: > Probably won't help to use

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Nathan Hjelm
Probably won't help to use RDMACM, though, as you will just see the resource failure somewhere else. UDCM is not the problem. Something is wrong with the system. Allocating a 512-entry CQ should not fail. -Nathan On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote: >I'm guessing it's a
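Before concluding that the system is broken, it is worth checking what locked-memory limit the failing process actually sees; a minimal diagnostic, assuming Linux's /proc interface (run it from inside a Torque job to see the limits a job really inherits):

```shell
# Print the locked-memory limit as the kernel reports it for this process.
# Under pbs_mom this is often far lower than in an interactive shell, which
# would make even a small (512-entry) CQ allocation fail.
grep "Max locked memory" /proc/self/limits
ulimit -l   # the same soft limit, as the shell reports it
```

If the job reports a small value (e.g. 64 kB) while your login shell reports "unlimited", the limit is being inherited from the pbs_mom daemon.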

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
Okay, let me poke around some more. It is clearly tied to the coprocessors, but I'm not yet sure just why. One thing you might do is try the nightly 1.8.2 tarball - there have been a number of fixes, and this may well have been caught there. Worth taking a look. On Jun 11, 2014, at 6:44 AM,

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
I'm guessing it's a resource limitation issue coming from Torque. Hmm... I found something interesting on the interwebs that looks awfully similar: http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html Greg, if the suggestion from the Torque users doesn't resolve your

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
Mellanox -- What would cause a CQ to fail to be created? On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." wrote: > Is there any other work around that I might try? Something that avoids UDCM? > > -Original Message- > From: Fischer, Greg A. > Sent: Tuesday,

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other workaround that I might try? Something that avoids UDCM? -Original Message- From: Fischer, Greg A. Sent: Tuesday, June 10, 2014 2:59 PM To: Nathan Hjelm Cc: Open MPI Users; Fischer, Greg A. Subject: RE: [OMPI users] openib segfaults with Torque [binf316:fischega] $
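The connection manager the thread circles around as an alternative to UDCM is RDMACM; a sketch of selecting it, assuming the btl_openib_cpc_include MCA parameter of the openib BTL in this Open MPI series (elsewhere in the thread Nathan cautions that this may only move the resource failure somewhere else):

```shell
# Ask the openib BTL to use RDMACM instead of UDCM for connection setup,
# either per run (application name is hypothetical):
#   mpirun --mca btl_openib_cpc_include rdmacm -np 4 ./a.out
# or persistently, in $HOME/.openmpi/mca-params.conf:
#   btl_openib_cpc_include = rdmacm
```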

Re: [hwloc-users] misleading cache size on AMD Opteron 6348?

2014-06-11 Thread Brice Goglin
The hwloc version will likely not change much regarding this hardware bug. Since your hardware/BIOS looks buggy, we can't do much about it except create a valid XML that you could force to override the normal hardware-based discovery. Brice On 11/06/2014 21:16, Yury Vorobyov wrote: > I do
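Brice's XML-override workaround, sketched (assumes hwloc's lstopo is installed and that hwloc honors the HWLOC_XMLFILE environment variable; the file name is hypothetical):

```shell
# Export the discovered (buggy) topology to XML, fix it by hand, then tell
# hwloc-based tools to load the corrected XML instead of re-discovering:
#   lstopo topo.xml                  # export current discovery to XML
#   ...hand-edit topo.xml to correct the bogus L3 cache attributes...
#   HWLOC_XMLFILE=topo.xml lstopo    # hwloc clients now read the fixed XML
```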

Re: [hwloc-users] misleading cache size on AMD Opteron 6348?

2014-06-11 Thread Yury Vorobyov
I do not see a big difference... This time I used the upstream version of hwloc (not git live). $ lstopo * hwloc has encountered what looks like an error from the operating system. * * L3 (P#6 cpuset 0x03f0) intersects

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Dan Dietz
Sorry - it crashes with both the Torque and rsh launchers. The output from a gdb backtrace on the core files looks identical. Dan On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote: > Afraid I'm a little confused now - are you saying it works fine under Torque, > but segfaults
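A way to capture the backtrace Dan mentions from a core file (file names are hypothetical; 'hello' is the binary from his command line):

```shell
# Allow core dumps, re-run, then extract the backtrace non-interactively:
#   ulimit -c unlimited
#   mpirun -mca plm rsh -np 4 -machinefile ./nodes ./hello
#   gdb -batch -ex bt ./hello core.<pid>
```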

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
Afraid I'm a little confused now - are you saying it works fine under Torque, but segfaults under rsh? Could you please clarify your current situation? On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote: > It looks like it is still segfaulting with the rsh launcher: > >

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Dan Dietz
It looks like it is still segfaulting with the rsh launcher: ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh -np 4 -machinefile ./nodes ./hello [conte-a084:51113] *** Process received signal *** [conte-a084:51113] Signal: Segmentation fault (11) [conte-a084:51113] Signal