On 18/06/15 12:03, Wiegand, Paul wrote:

> Thanks Chris, these are useful.
No worries!

[...]

> My build parameters are a lot like yours. The only one I wonder
> about is --without-scif. We have a mixture of Phi and non-Phi nodes,
> so I wasn't sure how to set this one and right now I take the default
> (which I gather includes SCIF). Do you have any insight as to your
> choice on this?
>
> For giggles I tried again just now, focusing on various nodes (both
> with Phi and without), and the results are all the same (segfault).

Ah, interesting. There *was* a time when having SCIF enabled could
cause crashes (even on nodes without Phis). Could you humour me and
try it with that disabled as a BTL (^scif)? At least that would remove
it as a possibility. Looking back at ompi-devel, it looks like those
crashes should have been fixed last year, around 1.8.2.

> We also --disable-vt and --disable-pty-support on our OpenMPI build,
> but I don't think these would cause the problem I'm seeing. Any
> disagreement with that?

To be honest I have no idea; given that it works outside of Slurm, it
would appear not to be the cause.

> As to Uwe's suggestion about the PMI plugin, I've built a number of
> different ways, including with the PMI plugin. The libs are built
> and present, and I can run as root without setting the resv-ports.
> When I build without PMI, I set the resv-ports, and it still doesn't
> work. So I don't think PMI is the issue, but I appreciate the
> suggestion.

We have PMI and resv-ports set:

  MpiParams=ports=12000-22999

> The fact that I can run as root and that I can run without openib
> (that is, using --mca btl ^openib on the mpirun call) suggests to me
> that there's some kind of permissions / resource access problem to
> the IB. But I can't understand why this would work fine outside of
> slurm but be a problem under slurm.

Well, it depends on where slurmd is inheriting its limits from. I've
seen people on this list have issues when starting slurmd on boot,
where it's not inheriting the limits they think they have set.
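For concreteness, the runtime tests above would look something like the
following ("your_app" is just a placeholder for your MPI binary; adjust
to taste):

```shell
# Illustrative commands only; "your_app" stands in for your MPI binary.
# Exclude the SCIF BTL at runtime (no rebuild of Open MPI needed):
#   mpirun --mca btl ^scif ./your_app
# Or exclude SCIF and openib together (the caret negates the whole list):
#   mpirun --mca btl ^scif,openib ./your_app
# Quick check of the locked-memory limit that openib depends on; run it
# via Slurm (see below) to see what slurmd-spawned processes inherit:
ulimit -l
```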
The best way to see what your limits really are is to do (note the
builtin is "ulimit", not "ulimits"):

  sbatch --wrap 'ulimit -a'

> Someone at SSERCA suggested setting PropagateResourceLimits=NONE in
> the slurm.conf file and opening up more than just memlock limits in
> the /etc/sysconfig/slurm file. I did all that, but none of that
> solved anything.

We do the same there. How are you starting slurmd, out of interest:
via systemd, or by logging in and running a non-systemd init script by
hand? We're on RHEL6 (no systemd) and don't start slurmd on boot;
instead we only start it by hand (if a node reboots due to a fault, we
want to go and check it out first).

All the best,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected]      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
