If that could help, Greg: on the compute nodes I normally add this to /etc/security/limits.conf:

* - memlock -1
* - stack -1
* - nofile 32768

and add

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which is sourced by the initscript).
It isn't really Torque that is imposing those constraints:

- the torque_mom initscript inherits from the OS whatever ulimits are in effect at that time;
- each job inherits its ulimits from the pbs_mom.

Thus, you need to change the ulimits from whatever is set at startup time, e.g., in the pbs_mom initscript.
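As a quick sanity check (a sketch, not from the thread; run it inside a Torque job, e.g. via `qsub -I`, to see the limits pbs_mom actually passed down):

```shell
# Report the limits the job inherited from pbs_mom; with the settings
# above, memlock and stack should both come back as "unlimited".
echo "memlock: $(ulimit -l)"
echo "stack:   $(ulimit -s)"
echo "nofile:  $(ulimit -n)"
```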
On Jun 11, 2014, at 6:01 PM, Ralph Castain wrote:
Yeah, I think we've seen that somewhere before too...
On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote:
Agreed. The problem is not with UDCM. I don't think something is wrong with
the system. I think his Torque is imposing major constraints on the maximum
size that can be locked into memory.
On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm wrote:
Probably won't help to use RDMACM though as you will just see the
resource failure somewhere else. UDCM is not the problem. Something is
wrong with the system. Allocating a 512 entry CQ should not fail.
On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>I'm guessing it's a
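To help narrow that down, a couple of quick checks run from inside an allocated job (a sketch; `ibv_devinfo` ships with the libibverbs utilities, and the grep pattern is an assumption about its output format):

```shell
# A too-small "max locked memory" is the usual culprit when verbs
# resource allocations such as a 512-entry CQ fail inside a job.
ulimit -l
# If the verbs utilities are installed, also confirm the HCA port is up.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | grep -E 'hca_id|state'
fi
```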
Okay, let me poke around some more. It is clearly tied to the coprocessors, but
I'm not yet sure just why.
One thing you might do is try the nightly 1.8.2 tarball - there have been a
number of fixes, and this may well have been caught there. Worth taking a look.
On Jun 11, 2014, at 6:44 AM,
I'm guessing it's a resource limitation issue coming from Torque.
Hmm...I found something interesting on the interwebs that looks awfully
Greg, if the suggestion from the Torque users doesn't resolve your
What would cause a CQ to fail to be created?
On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." wrote:
Is there any other work around that I might try? Something that avoids UDCM?
From: Fischer, Greg A.
Sent: Tuesday, June 10, 2014 2:59 PM
To: Nathan Hjelm
Cc: Open MPI Users; Fischer, Greg A.
Subject: RE: [OMPI users] openib segfaults with Torque
The hwloc version will likely not change much regarding this hardware bug.
Since your hardware/BIOS looks buggy, we can't do much about it except
create a valid XML that you could force to override the normal
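The override workflow can be sketched as follows (the file path is an example; `HWLOC_XMLFILE` and `HWLOC_THISSYSTEM` are hwloc's standard environment variables for loading a topology from XML):

```shell
# Export the detected topology to XML, hand-edit the overlapping L3
# cpusets, then point hwloc consumers at the corrected file.
if command -v lstopo >/dev/null 2>&1; then
    lstopo /tmp/topo.xml
fi
export HWLOC_XMLFILE=/tmp/topo.xml
export HWLOC_THISSYSTEM=1   # assert the XML describes this machine
```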
On 11/06/2014 21:16, Yury Vorobyov wrote:
I do not see a big difference... This time I used the upstream version of hwloc
(not git live).
* hwloc has encountered what looks like an error from the operating system.
* L3 (P#6 cpuset 0x03f0) intersects
Sorry - it crashes with both torque and rsh launchers. The output from
a gdb backtrace on the core files looks identical.
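For anyone reproducing this, a backtrace like that can be pulled from a core file non-interactively (a sketch; the helper name and file names are examples, not from the thread):

```shell
# Print a backtrace from a core dump without an interactive gdb session.
bt_from_core() {
    # $1 = executable, $2 = matching core file
    gdb "$1" "$2" -batch -ex bt
}
# Example: bt_from_core ./hello core.51113
```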
On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote:
Afraid I'm a little confused now - are you saying it works fine under Torque,
but segfaults under rsh? Could you please clarify your current situation?
On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote:
It looks like it is still segfaulting with the rsh launcher:
ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
-np 4 -machinefile ./nodes ./hello
[conte-a084:51113] *** Process received signal ***
[conte-a084:51113] Signal: Segmentation fault (11)