[slurm-dev] Slurmdbd logging huge lines and sacct doesn't work

2014-05-13 Thread Mario Kadastik
… I planned to attach a file with the last 3 lines, but that's 21 MB so I decided against it :) Mario Kadastik, PhD, Senior researcher --- Physics is like sex, sure it may have practical reasons, but that's not why we do it -- Richard P. Feynman

[slurm-dev] thread count over limit

2014-05-13 Thread Mario Kadastik
… on send/recv operation. Any ideas how to debug what the 256 threads are in fact doing, to understand the underlying cause? I doubt it's normal that we're exhausting the thread count on a 5000-jobslot cluster… Mario Kadastik
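A hedged sketch of one way to answer the question above: dump every thread's stack from the live daemon (e.g. `gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' > bt.txt`) and then group threads by the function they are blocked in. The grouping step is shown on a made-up sample dump; the frame layout is the usual gdb format, but the file contents and counts are illustrative, not from the poster's cluster.

```shell
# Group the threads in a gdb "thread apply all bt" dump by their innermost
# frame (#0), to see e.g. how many of the ~256 threads sit in recv().
summarize_threads() {
  # "#0  0xADDR in FUNC () from LIB" -> FUNC is field 4 (simple heuristic)
  grep '^#0 ' "$1" | awk '{print $4}' | sort | uniq -c | sort -rn
}

# Made-up sample dump standing in for a real slurmctld backtrace:
cat > /tmp/bt-sample.txt <<'EOF'
#0  0x00007f00 in recv () from /lib64/libc.so.6
#1  0x00000042 in do_work ()
#0  0x00007f00 in recv () from /lib64/libc.so.6
#0  0x00007f01 in pthread_cond_wait () from /lib64/libpthread.so.0
EOF

summarize_threads /tmp/bt-sample.txt
```

If most threads share one blocked frame (say, `recv`), that points at the operation exhausting the thread pool.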

[slurm-dev] Re: Segfault on SL5.7

2014-05-09 Thread Mario Kadastik
… to a null pointer or some such, which is really bizarre. And turning off optimizations (as a side effect of debugging) fixes it. Mario Kadastik

[slurm-dev] Re: Segfault on SL5.7

2014-05-08 Thread Mario Kadastik
Anyone? This is blocking one of our main nodes from submissions. Any ideas on what might cause this or how to debug further are welcome. On 07.05.2014, at 11:42, Mario Kadastik mario.kadas...@cern.ch wrote: Hi, yesterday I upgraded Slurm from 2.5.3 to 14.11 pre-1 (i.e. the current git

[slurm-dev] Segfault on SL5.7

2014-05-07 Thread Mario Kadastik
are welcome, as the SL5.7 node is one of the main user nodes where users create code and submit to the cluster, so it has to work even though the rest of the cluster works fine. The config, by the way, is shared over NFS, so it is identical on all nodes. Mario Kadastik

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Mario Kadastik
I have seen slurmctld use more than 20 GB of virtual memory, but the RSS is less than 1 GB. I am not sure whether this is OK or whether there is some leak. On Tue, 2013-06-25 at 11:56 -0700, Mario Kadastik wrote: The OOM kill: Jun 25 18:21:32 slurm-1 kernel: [5463683.553994] OOM killed process

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Mario Kadastik
as a function of cores/jobs or their flux? Thanks, Mario Kadastik

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Mario Kadastik
is probably too large. Well then I guess this is bad:

[root@slurm-1 ~]# ps -eao pid,user,rss,cmd | grep slurm
21613 slurm  6735956 /usr/sbin/slurmctld

It's already using 6.4 GB of RSS… Mario Kadastik
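For context on the numbers above: `ps` reports RSS in kilobytes, so the 6735956 figure is the quoted ~6.4 GB. A small helper for watching this over time (a sketch: the process name comes from the thread, everything else is an assumption):

```shell
# Convert the kB RSS column that ps prints into GB for readability.
rss_gb() {
  awk '{sum += $1} END {printf "%.1f GB\n", sum/1024/1024}'
}

# Sum the RSS of all slurmctld processes (prints "0.0 GB" if none running):
ps -C slurmctld -o rss= | rss_gb
```

Running this from cron or a watch loop would show whether the RSS grows steadily (suggesting a leak) or plateaus.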

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Mario Kadastik
3372 /usr/sbin/slurmdbd

And I just ran sreport commands to check and got nice reports back, so the accounting DB is running. Mario Kadastik

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Mario Kadastik
jobs ended hours ago. Mario Kadastik

[slurm-dev] Re: Resubmit on failure

2013-06-20 Thread Mario Kadastik
or not). Thanks, Mario Kadastik

[slurm-dev] Resubmit on failure

2013-06-19 Thread Mario Kadastik
… The user would know whether the filesystems etc. are fine with that, and in our case they mostly are. Is such a feature already in Slurm or not? If yes, can you point me to the documentation? Thanks, Mario Kadastik

[slurm-dev] Re: Overview

2013-04-26 Thread Mario Kadastik
The thread has somewhat branched off into the RAM requirements. Any useful comments on #2-#4? I can probably summarize this by running scontrol show host over all compute nodes, but that may not be too efficient… On 25.04.2013, at 17:28, Mario Kadastik mario.kadas...@cern.ch wrote
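On the "run over all compute nodes" idea: the multi-line per-node records that `scontrol show node` prints can be collapsed into a one-line-per-node summary with a little awk. A sketch under the assumption that the fields of interest are `NodeName` and `RealMemory`; the sample records below are made up, and on a live cluster you would pipe `scontrol show node` in instead.

```shell
# Reduce multi-line "scontrol show node" records to "node realmemory" lines.
summarize_nodes() {
  awk '/NodeName=/ { split($1, a, "="); name = a[2] }
       /RealMemory=/ { for (i = 1; i <= NF; i++)
                         if ($i ~ /^RealMemory=/) { split($i, m, "="); print name, m[2] } }'
}

# Made-up sample records in the scontrol output format:
cat <<'EOF' | summarize_nodes
NodeName=wn-1 Arch=x86_64 CoresPerSocket=4
   RealMemory=24000 AllocMem=0 Sockets=2
NodeName=wn-2 Arch=x86_64 CoresPerSocket=4
   RealMemory=48000 AllocMem=0 Sockets=2
EOF
```

For aggregate questions, `sinfo` with a format string is usually cheaper than parsing scontrol at all.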

[slurm-dev] Re: Disable black hole nodes automatically

2013-02-08 Thread Mario Kadastik
…non-zero exit code, the job is rescheduled automatically elsewhere. In the best-case scenario this would imply no failed jobs except those that were running at the time of the failure, if they are impacted. Will see if this works. Mario Kadastik
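The automatic rescheduling described above is documented Slurm behaviour when a Prolog script fails: the node is drained and the job is requeued. A hedged slurm.conf sketch of that, plus a periodic health check; the parameter names exist in stock Slurm, but the paths and the interval are assumptions, and the thread does not confirm this is the exact setup used.

```
# slurm.conf fragment (illustrative):
# Periodic check script that can take a "black hole" node out of service
# (e.g. by calling scontrol to drain itself when a test fails).
HealthCheckProgram=/usr/local/sbin/node-health.sh   # assumed path
HealthCheckInterval=300                             # seconds (assumed)
# A Prolog that exits non-zero drains the node and requeues the job.
Prolog=/usr/local/sbin/slurm-prolog.sh              # assumed path
```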

[slurm-dev] Re: Pre-empting short jobs

2013-02-04 Thread Mario Kadastik
to be drained, as the jobs started at about the same time and have about the same length, so waiting for a whole node to free up might take a day or so… And this wastes resources. On 01.02.2013, at 16:42, Mario Kadastik mario.kadas...@cern.ch wrote: Hi, we would like to configure our cluster

[slurm-dev] Re: SLURM without shared home?

2012-11-24 Thread Mario Kadastik
slurm
[root@slurm-1 ~]# rpmbuild -ta slurm-2.5.0-rc-mario.tar.bz2
error: line 93: Tag takes single token only: Name:see META file
[root@slurm-1 ~]#

I'm guessing the problem lies between the keyboard and the chair, but just in case I thought to ask :) Mario Kadastik

[slurm-dev] Re: SLURM without shared home?

2012-11-19 Thread Mario Kadastik
to swap Torque for Slurm, hoping the commands work :) Mario Kadastik