[slurm-dev] Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 23/07/13 17:06, Christopher Samuel wrote:
> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test
> case and noticed that if I run it with srun rather than mpirun it
> goes over 20% slower.

Following on from this issue, we've found that whilst mpirun gives
acceptable performance, the memory accounting doesn't appear to be
correct. Has anyone seen anything similar, or any ideas on what could
be going on?

Here are two identical NAMD jobs running across 69 nodes with 16
cores per node, this one launched with mpirun (Open-MPI 1.6.5):

== slurm-94491.out ==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94491
94491.batch    6504068K  11167820K
94491.0        5952048K   9028060K

This one launched with srun (about 60% slower):

== slurm-94505.out ==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94505
94505.batch       7248K   1582692K
94505.0        1022744K   1307112K

cheers!
Chris

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
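For context, the only difference between the two runs is the launcher
line in the batch script; a minimal sketch of the two variants (the
NAMD binary and input file names are illustrative, not taken from the
report):

    #!/bin/bash
    #SBATCH --nodes=69
    #SBATCH --ntasks-per-node=16

    # Variant 1: Open-MPI's mpirun reads the allocation from Slurm and
    # starts one orted daemon per node, which then forks the MPI ranks.
    mpirun namd2 test.namd

    # Variant 2: srun starts every MPI rank directly as a Slurm task
    # (69 x 16 = 1104 tasks).
    # srun namd2 test.namd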
[slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:19, Christopher Samuel wrote:
> Has anyone seen anything similar, or any ideas on what could be
> going on?

Sorry, this was with:

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

Since those initial tests we've started enforcing memory limits (the
system is not yet in full production) and found that this causes jobs
to get killed. We tried the cgroups gathering method, but jobs still
die with mpirun, and now the numbers don't seem right for either
mpirun or srun:

mpirun (killed):

[samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94564
94564.batch    -523362K          0
94564.0         394525K          0

srun:

[samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
94565
94565.batch        998K          0
94565.0          88663K          0

All the best,
Chris

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
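For reference, the "cgroups gathering method" mentioned above is
normally selected with a plugin combination along these lines; this is
a generic sketch, not the poster's actual configuration:

    # slurm.conf -- switch accounting gathering to cgroups
    JobAcctGatherType=jobacct_gather/cgroup
    JobAcctGatherFrequency=30
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf -- have the enforced memory limits use cgroups too
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes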
[slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 2013-08-07 09:19, Christopher Samuel wrote:
> On 23/07/13 17:06, Christopher Samuel wrote:
>> Bringing up a new IBM SandyBridge cluster I'm running a NAMD test
>> case and noticed that if I run it with srun rather than mpirun it
>> goes over 20% slower.
>
> Following on from this issue, we've found that whilst mpirun gives
> acceptable performance the memory accounting doesn't appear to be
> correct. Has anyone seen anything similar, or any ideas on what
> could be going on?

See my message from yesterday,
https://groups.google.com/d/msg/slurm-devel/BlZ2-NwwCCg/03DnMEWYHqUJ
for what I think is the reason. That is, the memory accounting is per
task, and when launching with mpirun the number of tasks does not
correspond to the number of MPI processes but rather to the number of
orted processes (one per node).

> [quoted sacct output snipped]

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS BECS
+358503841576 || janne.blomqv...@aalto.fi
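A quick way to confirm this task-count explanation is to include the
NTasks column in the sacct query; a sketch using the job IDs from the
quoted post:

    # NTasks shows what Slurm treated as a task in each step: one
    # orted per node under mpirun versus one namd2 process per MPI
    # rank under srun.
    sacct -j 94491,94505 -o JobID,NTasks,MaxRSS,MaxVMSize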
[slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 07/08/13 16:59, Janne Blomqvist wrote:
> That is, the memory accounting is per task, and when launching with
> mpirun the number of tasks does not correspond to the number of MPI
> processes but rather to the number of orted processes (one per node).

That appears to be correct: I see 1 task for the batch step and 68
orted tasks when I use mpirun, whilst I see 1 task for the batch step
and 1104 namd2 tasks when I use srun.

I can understand how that might result in Slurm (wrongly) thinking
that a single task is using more than its allowed memory per task,
but I'm not sure I understand how it could lead to Slurm thinking the
job is using vastly more memory than it actually is.

cheers,
Chris

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
[slurm-dev] Re: 2.6.0 html documentation says CR_Core_Memory Not yet implemented
That was old documentation; we'll fix it with the next web page update.

Quoting Jeff Tan jeffe...@au1.ibm.com:

> Dear SchedMD,
>
> Just verifying what might be no more than a missed doc update: is
> CR_Core_Memory definitely implemented in 2.6.0? The cons_res.html
> that comes with it says it isn't, but it seems to work when
> activated.
>
> Regards,
> Jeff
>
> Dr. Jeff Tan
> High Performance Computing Specialist
> IBM Research Collaboratory for Life Sciences (Melbourne, Australia)
> Phone: +61 3 90 354392
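For anyone checking their own setup, CR_Core_Memory is selected
through the consumable-resources plugin; a minimal slurm.conf sketch:

    # slurm.conf -- allocate by core and treat memory as a consumable
    # resource; jobs then request memory with --mem or --mem-per-cpu
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory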
[slurm-dev] Re: cons_res: Can't use Partition SelectType
Hi Magnus,

Thanks for the reply. I am using, in slurm.conf:

SelectTypeParameters=CR_CPU_Memory

to manage memory on the nodes, and in partition.conf:

SelectTypeParameters=CR_Core

in one partition, to allow GPU jobs to run without memory problems.
The documentation states that this is a valid set-up, but the log
shows it's not implemented? How do I monitor memory in all but one
partition?

Thanks,
Eva

On Wed, 7 Aug 2013, Magnus Jonsson wrote:

> Hi!
>
> To use a per-partition SelectTypeParameters you must use CR_Socket
> or (CR_Core and CR_ALLOCATE_FULL_SOCKET) as the default
> SelectTypeParameters. You are using CR_CPU_Memory.
>
> Best regards,
> Magnus
>
> On 2013-08-05 23:13, Eva Hocks wrote:
>> I am getting spam messages in the logs:
>>
>> [2013-08-05T14:04:32.000] cons_res: Can't use Partition SelectType
>> unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET
>>
>> The slurm.conf settings are:
>>
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>>
>> and I have set one partition in partition.conf to
>> SelectTypeParameters=CR_Core
>>
>> Why does slurm complain? I HAVE set CR_Core and I even checked the
>> spelling.
>>
>> Thanks
>> Eva
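Going by the requirement in the logged message, a global setting that
keeps memory tracking while permitting the per-partition override
would look roughly like the sketch below. This is an untested
assumption based on the error text, and the partition name and node
list are illustrative:

    # slurm.conf -- sketch only: core-based allocation with memory,
    # plus the flag the log message asks for
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory,CR_ALLOCATE_FULL_SOCKET

    # per-partition override for the GPU partition (hypothetical names)
    PartitionName=gpu Nodes=gpu[01-04] SelectTypeParameters=CR_Core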
[slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
Just a note: if srun isn't used to launch a task, the odds of the
accounting for that step being correct are very low. Using srun is
the only known way to always guarantee that accounting for steps is
accurate. The same goes for enforcing memory limits.

Danny

On 08/07/13 00:43, Christopher Samuel wrote:
> That appears to be correct: I see 1 task for the batch step and 68
> orted tasks when I use mpirun, whilst I see 1 task for the batch
> step and 1104 namd2 tasks when I use srun.
> [snip]
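As a corollary, live accounting for a running step can be checked
with sstat, which likewise only sees what Slurm itself launched; a
sketch using one of the earlier job IDs (the step must still be
running for sstat to report anything):

    # For an srun-launched step this shows the ranks' memory; for an
    # mpirun job it would only show the orted daemons.
    sstat -j 94565.0 -o JobID,MaxRSS,MaxVMSize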
[slurm-dev] Re: Jobs not queued in SLURM 2.3
Hi Carles,

Thanks for your reply. I don't see any errors in the log files; I
used the -v flag and there aren't any errors.

Thanks for your explanation about backfill. I used backfilling
because it was the default option, but I had to eliminate the
walltime limits due to some user complaints.

This strange behaviour only affects one specific user. He can send
several jobs to the queue (and they are dispatched and work fine),
but it seems like there is a limit on the number of running jobs or
resources for this user: after that point his jobs aren't dispatched
(they don't appear), while other users' jobs keep running, and he can
still submit other jobs that use fewer resources. All this even
though there are enough free resources.

Thanks in advance. Regards.

Date: Wed, 7 Aug 2013 01:13:44 -0700
From: mini...@gmail.com
To: slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Jobs not queued in SLURM 2.3

> Hi José Manuel,
>
> Do you see any errors in the controller logfile? Backfilling with
> unlimited-time jobs is useless because it will never backfill a
> job, but that should not cause the non-appearing jobs issue. Are
> any of the user's jobs dispatched?
>
> Regards,
> Carles Fenoy
> Barcelona Supercomputing Center
>
> On Wed, Aug 7, 2013 at 9:57 AM, José Manuel Molero
> jml...@hotmail.com wrote:
>> Hi,
>>
>> I have a cluster working fine with Slurm 2.3, but I noticed that a
>> specific user cannot queue (or dispatch) many jobs at the same
>> time. When this user queues many jobs, the subsequent jobs are
>> submitted correctly (they are given a job number) but not added to
>> the queue, even though the cluster has free resources.
>>
>> https://computing.llnl.gov/linux/slurm/faq.html#time
>>
>> This is not the problem, because other users can allocate jobs
>> without problems. The scheduler type is backfill, and all the jobs
>> are time-unlimited. Could this be the problem? Why do the jobs
>> disappear (not get added to the queue) rather than go to the PD
>> state if there are free resources?
>>
>> Thanks in advance. Best regards.
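Symptoms like these (jobs accepted for one user but never shown as
pending) are often caused by per-association limits in the accounting
database; a hedged sketch of commands to check, assuming accounting
is configured (the username is a placeholder):

    # MaxJobs caps running jobs per association; MaxSubmitJobs caps
    # queued+running jobs. Either can keep extra submissions from
    # ever appearing in the queue.
    sacctmgr show assoc user=jmolero format=User,Account,MaxJobs,MaxSubmitJobs

    # Ask the controller about a specific "missing" job
    scontrol show job <jobid>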
[slurm-dev] Increase user priority on a specific partition
Hello,

Is there a way to increase a user's priority on a specific partition
without using QOS? When this user runs jobs on this partition, I want
their jobs to have a greater priority than all of the other jobs on
the partition.

Thanks for your help,
Neil Van Lysel
[slurm-dev] Re: Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
Hi Danny,

On 08/08/13 04:08, Danny Auble wrote:
> Just a note: if srun isn't used to launch a task, the odds of the
> accounting for that step being correct are very low. Using srun is
> the only known way to always guarantee that accounting for steps is
> accurate. The same goes for enforcing memory limits.

Thanks, and I understand why that is; it's just a shame that the
performance penalty for using srun with Open-MPI makes it
unusable. :-(

cheers!
Chris

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci