Hi,

I would love to say I have had success, but I am in midstride with the same issue. It looks like the simple path is to upgrade to v16, but until now I have been pleased with v15.08.

Here is a post describing the fix:
https://groups.google.com/forum/#!searchin/slurm-devel/dropbox/slurm-devel/WF4a36l0Y9g/U_XcpgKGBQAJ

Here is the link containing the lost.pl reference, which will help you see the scope of the problem:
https://groups.google.com/forum/#!searchin/slurm-devel/lost.pl/slurm-devel/TQcerLLEKAU/6QtpxZ2PBgAJ

Good luck,
Doug

On Tue, Oct 17, 2017 at 8:58 AM, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:

> You probably have a core file in the directory where slurmdbd logs to; a
> backtrace from gdb would be most telling.
>
> On Oct 17, 2017 08:17, "Loris Bennett" <loris.benn...@fu-berlin.de> wrote:
>
>>
>> Hi,
>>
>> We have been having some problems with NFS mounts via Infiniband getting
>> dropped by nodes. We ended up switching our main admin server, which
>> provides NFS and Slurm, from one machine to another.
>>
>> Now, however, if slurmdbd is started, as soon as slurmctld starts,
>> slurmdbd seg faults. In the slurmdbd.log we have
>>
>> slurmdbd: error: We have more allocated time than is possible (7724741
>> > 7012800) for cluster soroban(1948) from 2017-10-17T16:00:00 -
>> 2017-10-17T17:00:00 tres 1
>> slurmdbd: error: We have more time than is possible
>> (7012800+36720+0)(7049520) > 7012800 for cluster soroban(1948) from
>> 2017-10-17T16:00:00 - 2017-10-17T17:00:00 tres 1
>> slurmdbd: Warning: Note very large processing time from hourly_rollup
>> for soroban: usec=46390426 began=17:08:17.777
>> Segmentation fault (core dumped)
>>
>> and the corresponding output of strace is
>>
>> fstat(3, {st_mode=S_IFREG|0600, st_size=871270, ...}) = 0
>> write(3, "[2017-10-17T17:09:04.168] Warnin"..., 132) = 132
>> +++ killed by SIGSEGV (core dumped) +++
>>
>> We're running 17.02.7. Any ideas?
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>>
>
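
A side note on the numbers in the quoted log: the limit of 7012800 matches 1948 x 3600, so it is presumably the count shown in parentheses after the cluster name multiplied by the seconds in the one-hour rollup window. That reading is an assumption drawn from the log text, not something confirmed in this thread. A small Python sketch that only checks the arithmetic (all variable names are made up for illustration):

```python
# Figures copied from the quoted slurmdbd.log lines.
# ASSUMPTION: 1948 (the number after the cluster name) is the TRES count,
# and the rollup window 16:00:00 - 17:00:00 is 3600 seconds long.
tres_count = 1948
period_seconds = 3600

# Time that is "possible" in one hourly window under that assumption.
capacity = tres_count * period_seconds

# First error line: reported allocated time versus the limit.
reported_allocated = 7724741
allocated_excess = reported_allocated - capacity

# Second error line: the three addends as printed in the log, and their sum.
addends = (7012800, 36720, 0)
total_used = sum(addends)
total_excess = total_used - capacity

print("capacity:", capacity)
print("allocated over capacity by:", allocated_excess)
print("sum of addends:", total_used)
print("sum over capacity by:", total_excess)
```

If the assumption holds, both error lines say the same thing: the accounting records for that hour add up to more time than the cluster could have delivered, which fits the stale or runaway job records discussed in the linked posts.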