Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Blomqvist Janne
Hi, I was perhaps a bit imprecise, sorry about that. The point of the "datasync" tool and the "datasync-reaper" cronjob would be to replace or augment the per-job /tmp that is cleaned at the end of each job. Datasets would then be left on the node-local disks until they are deleted by
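(To make the idea concrete, here is a rough sketch of what such a "datasync" wrapper could look like; the paths, the .datasync/LAST_SYNCED marker layout, and all names are guesses based on this thread, not Janne's actual tool:

    #!/usr/bin/perl
    # datasync (sketch): copy a dataset to node-local disk and record
    # when it was last synced, so a reaper cron job can prune old copies.
    use strict;
    use warnings;
    use File::Path qw(make_path);

    my ($src, $name) = @ARGV;
    die "usage: datasync SRC NAME\n" unless defined $src && defined $name;

    my $dest = "/local/datasets/$name";    # hypothetical node-local root
    make_path("$dest/.datasync");

    # rsync -a only transfers changes, so repeat jobs reuse the local copy
    system('rsync', '-a', "$src/", "$dest/") == 0
        or die "rsync failed: $?\n";

    # update the marker file; the reaper deletes datasets oldest-marker-first
    open my $fh, '>', "$dest/.datasync/LAST_SYNCED" or die $!;
    print $fh time(), "\n";
    close $fh;

)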

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Goetz, Patrick G
But rsync -a will only help you if people are using identical or at least overlapping data sets? And you don't need rsync to prune out old files.

On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have
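(As an aside, the pruning part really needs nothing more than find; the path and retention period here are made up for illustration:

    $ find /local/datasets -type f -mtime +30 -delete

rsync only pays off when the same dataset is reused across jobs or nodes.)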

[slurm-users] sacct end time for failed jobs

2019-02-26 Thread Brian Andrus
All, So I am using sacct to generate daily reports of job run times that are imported into an external DB for cost and projected-use planning. One thing I have noticed is that the END field for jobs with a state of FAILED is "Unknown", but the ELAPSED field has the time it ran. It seems to
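(A workaround sketch until the cause is clear; the date range and field list below are only examples. Pull Start and Elapsed alongside End:

    $ sacct -X -P -S 2019-02-25 -E 2019-02-27 -o JobID,State,Start,End,Elapsed

For FAILED jobs where End comes back as "Unknown", the end time can be reconstructed as Start + Elapsed before importing into the external DB.)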

Re: [slurm-users] pmix and ucx versions compatibility with slurm

2019-02-26 Thread Christopher Samuel
On 2/26/19 5:13 AM, Daniel Letai wrote:
> I couldn't find any documentation regarding which API from pmix or ucx Slurm is using, and how stable those APIs are.
There is information about PMIx at least on the SchedMD website: https://slurm.schedmd.com/mpi_guide.html#pmix For UCX I'd suggest
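(Two quick checks on an existing installation, with ./myapp as a placeholder binary:

    $ srun --mpi=list          # MPI plugin types this Slurm build supports
    $ srun --mpi=pmix ./myapp  # launch through the PMIx plugin

If pmix does not appear in the first list, that build of Slurm was compiled without PMIx support.)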

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Loris Bennett
Hi Chris, I had

  JobAcctGatherType=jobacct_gather/linux
  TaskPlugin=task/affinity
  ProctrackType=proctrack/cgroup

ProctrackType was actually unset but cgroup is the default. I have now changed the settings to

  JobAcctGatherType=jobacct_gather/cgroup
  TaskPlugin=task/affinity,task/cgroup
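(Summarized as a slurm.conf excerpt, a sketch of just the accounting-related lines with everything else omitted:

    JobAcctGatherType=jobacct_gather/cgroup
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup

As far as I know, plugin changes like these require restarting the slurmctld and slurmd daemons; scontrol reconfigure is not enough.)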

Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Marcus Wagner
Hi Merlin, thanks for the answer, but our user is not just after a high index; he actually needs 100k task IDs. Best Marcus

On 2/26/19 3:50 PM, Merlin Hartley wrote:
> *max_array_tasks* Specify the maximum number of tasks that can be included in a job array. The default limit is

Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Merlin Hartley
max_array_tasks
Specify the maximum number of tasks that can be included in a job array. The default limit is MaxArraySize, but this option can be used to set a lower limit. For example, max_array_tasks=1000 and MaxArraySize=100001 would permit a maximum task ID of 100000, but limit the number of
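(In slurm.conf terms, the man page's example would look like this; the numbers come straight from the quoted documentation:

    MaxArraySize=100001
    SchedulerParameters=max_array_tasks=1000

That is, task IDs may go up to 100000, but no single array may contain more than 1000 tasks.)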

Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Jeffrey Frey
Also see "https://slurm.schedmd.com/slurm.conf.html; for MaxArraySize/MaxJobCount. We just went through a user-requested adjustment to MaxArraySize to bump it from 1000 to 1; as the documentation states, since each index of an array job is essentially "a job," you must be sure to also

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Marcus Wagner
Hi Loris, ok, THAT really seems like a lot. What do you use for gathering these values? jobacct_gather/cgroup? If I remember right, there was a discussion recently on this list regarding the JobAcctGatherType, yet I do not remember the outcome. I remember, though, that someone pointed to SLUG18 (or

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Loris Bennett
Hi Marcus, Thanks for the response, but that doesn't seem to be the issue. The problem seems to be that the raw data are incorrect:

    Slurm data: ... Ncpus Nnodes Ntasks Reqmem PerNode   Cput Walltime Mem ExitStatus
    Slurm data: ...    50      2      1  10240       0 503611

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Marcus Wagner
Hi Loris, I assume this job used FAIRLY little memory, in the KB range; might that be true? Replace

    sub kbytes2str {
        my $kbytes = shift;
        if ($kbytes == 0) {
            return sprintf("%.2f %sB", 0.0, 'M');
        }
        my $mul = 1024;
        my $exp = int(log($kbytes) / log($mul));
        my @pre
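(For anyone who wants to try it, here is one way the truncated function could continue; the @pre table and the final sprintf are my reconstruction, not necessarily Marcus's exact code:

    sub kbytes2str {
        my $kbytes = shift;
        if ($kbytes == 0) {
            return sprintf("%.2f %sB", 0.0, 'M');
        }
        my $mul = 1024;
        my $exp = int(log($kbytes) / log($mul));
        my @pre = ('K', 'M', 'G', 'T', 'P', 'E');   # input is already in KB
        return sprintf("%.2f %sB", $kbytes / $mul**$exp, $pre[$exp]);
    }

With this, 2048 KB prints as "2.00 MB" and values below 1 MB keep a KB suffix instead of collapsing to zero.)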

[slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Loris Bennett
Hi, With seff 18.08.5-2 we have been getting spurious results regarding memory usage:

    $ seff 1230_27
    Job ID: 1234
    Array Job ID: 1230_27
    Cluster: curta
    User/Group: x/x
    State: COMPLETED (exit code 0)
    Nodes: 4
    Cores per node: 25
    CPU Utilized: 9-16:49:18
    CPU

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Ansgar Esztermann-Kirchner
Hi, I'd like to share our set-up as well, even though it's very specialized and thus probably won't work in most places. However, it's also very efficient in terms of budget when it does. Our users don't usually have shared data sets, so we don't need high bandwidth at any particular point --

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Adam Podstawka
On 26.02.19 09:20, Tru Huynh wrote:
> On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote:
>> On 2/22/19 3:54 PM, Aaron Jackson wrote:
>>
>>> Happy to answer any questions about our setup.
>>
>> Email me directly to get added (I had to disable the Mailman web

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Raymond Wan
Hi Janne,

On Tue, Feb 26, 2019 at 3:56 PM Janne Blomqvist wrote:
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
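(A rough sketch of what such a reaper could look like; the directory layout, threshold, and all names are guesses from this thread, not the real tool:

    #!/usr/bin/perl
    # datasync-reaper (sketch): delete whole datasets, oldest
    # LAST_SYNCED first, until disk usage drops below a threshold.
    use strict;
    use warnings;
    use File::Path qw(remove_tree);

    my $root  = '/local/datasets';   # hypothetical node-local root
    my $depth = 2;                   # configurable recursion depth

    # breadth-first search for .datasync markers, at most $depth levels down
    my (@datasets, @queue);
    push @queue, [$root, 0];
    while (my $item = shift @queue) {
        my ($dir, $d) = @$item;
        if (-e "$dir/.datasync/LAST_SYNCED") {
            open my $fh, '<', "$dir/.datasync/LAST_SYNCED" or next;
            chomp(my $ts = <$fh>);
            close $fh;
            push @datasets, [$ts, $dir];
            next;                    # dataset root found; don't descend into it
        }
        next if $d >= $depth;
        opendir my $dh, $dir or next;
        push @queue, map { ["$dir/$_", $d + 1] }
                     grep { !/^\.\.?$/ && -d "$dir/$_" } readdir $dh;
        closedir $dh;
    }

    # reap oldest datasets first until the filesystem has room again
    for my $ds (sort { $a->[0] <=> $b->[0] } @datasets) {
        last if usage_pct($root) < 80;   # arbitrary 80% high-water mark
        print "reaping $ds->[1]\n";
        remove_tree($ds->[1]);
    }

    sub usage_pct {
        my ($path) = @_;
        my @lines = split /\n/, qx(df -P $path);
        my $pct = (split ' ', $lines[-1])[4];   # "Capacity" column, e.g. "73%"
        $pct =~ s/%//;
        return $pct;
    }

Running it from cron with a sensible high-water mark would approximate the behaviour Janne describes.)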

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Tru Huynh
On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote:
> On 2/22/19 3:54 PM, Aaron Jackson wrote:
>
>> Happy to answer any questions about our setup.
>
> Email me directly to get added (I had to disable the Mailman web

Could you add me to that list? Thanks, Tru
--
Dr

Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Ole Holm Nielsen
On 2/26/19 9:07 AM, Marcus Wagner wrote:
> Does anyone know why, by default, the number of array elements is limited to 1000? We have one user who would like to have 100k array elements! What is more difficult for the scheduler, one array job with 100k elements or 100k non-array jobs?

[slurm-users] maximum size of array jobs

2019-02-26 Thread Marcus Wagner
Hello everyone, I have another question ;) Does anyone know why, by default, the number of array elements is limited to 1000? We have one user who would like to have 100k array elements! What is more difficult for the scheduler, one array job with 100k elements or 100k non-array jobs?