Re: [lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

2015-05-08 Thread Drokin, Oleg
Hello!

On Apr 30, 2015, at 6:02 PM, Scott Nolin wrote:

 Has anyone been working with the lustre jobstats feature and SLURM? We have 
 been, and it's OK. But now that I'm working on systems that run a lot of 
 array jobs and a fairly recent slurm version we found some ugly stuff.
 
 Array jobs do report their SLURM_JOBID as a variable, and it's unique for 
 every job. But they also use other IDs that appear only for array jobs.
 
 http://slurm.schedmd.com/job_array.html
 
 However, that unique SLURM_JOBID as far as I can tell is only truly exposed 
 in command line tools via 'scontrol' - which is only valid while the job is 
 running. If you want to look at older jobs with sacct for example, things are 
 troublesome.
 
 Here's what my coworker and I have figured out:
 
 - You submit a (non-array) job that gets jobid 100.
 - The next job gets jobid 101.
 - Then submit a 10 task array job. That gets jobid 102. The sub tasks get 9 
 more job ids. If nothing else is happening with the system, that means you 
 use jobid 102 to 111.
 
 If things were that orderly, you could cope with using SLURM_JOB_ID in lustre 
 jobstats pretty easily. Use sacct and you see job 102_2 - you know that is 
 jobid 103 in lustre jobstats.
 
 But, if other jobs get submitted during set up (as of course they do), they 
 can take jobid 103. So, you've got problems.
 
 I think we may try to set a magic variable in the slurm prolog and use that 
 for the jobid_var, but who knows.

There's another method planned for doing jobid stuff. It currently lives mainly 
in the kernel staging tree, but it will make its way to the lustre tree too.

It's to just write your jobid directly into lustre from your prologue script 
(and clear from epilogue).

That way you can set it to whatever you like without ugly messings with shell 
variables (and equally ugly parsing of those variables from the kernel!).
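
As a rough sketch of what such a prologue step might look like, the snippet below only builds the commands a prologue could run. It assumes a Lustre client that supports `jobid_var=nodelocal` together with a `jobid_name` parameter set through `lctl set_param`; the thread itself says this was not yet in the lustre tree, so treat both parameter names as assumptions to check against your Lustre version. The helper names `jobid_commands` and `run_all` are made up for illustration.

```python
import subprocess


def jobid_commands(label):
    """Return the lctl invocations a prologue might run to tag this node.

    Assumption: the client supports jobid_var=nodelocal and jobid_name.
    """
    return [
        ["lctl", "set_param", "jobid_var=nodelocal"],
        ["lctl", "set_param", f"jobid_name={label}"],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order and fail loudly if one returns non-zero."""
    for cmd in commands:
        runner(cmd, check=True)


# In a real prologue (as root, on a Lustre client) one might call:
#   run_all(jobid_commands("102_2"))
```

The epilogue would be the mirror image; what value counts as "cleared" is not stated in this thread, so that part is left out.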

For some reason I cannot find the corresponding master patch, though I have a 
passing memory of writing it, so this needs to be addressed separately.

Bye,
Oleg
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

2015-04-30 Thread Scott Nolin
Has anyone been working with the lustre jobstats feature and SLURM? We 
have been, and it's OK. But now that I'm working on systems that run a 
lot of array jobs and a fairly recent slurm version we found some ugly 
stuff.


Array jobs do report their SLURM_JOBID as a variable, and it's unique 
for every job. But they also use other IDs that appear only for array jobs.


http://slurm.schedmd.com/job_array.html

However, that unique SLURM_JOBID as far as I can tell is only truly 
exposed in command line tools via 'scontrol' - which is only valid while 
the job is running. If you want to look at older jobs with sacct for 
example, things are troublesome.


Here's what my coworker and I have figured out:

- You submit a (non-array) job that gets jobid 100.
- The next job gets jobid 101.
- Then submit a 10 task array job. That gets jobid 102. The sub tasks 
get 9 more job ids. If nothing else is happening with the system, that 
means you use jobid 102 to 111.


If things were that orderly, you could cope with using SLURM_JOB_ID in 
lustre jobstats pretty easily. Use sacct and you see job 102_2 - you 
know that is jobid 103 in lustre jobstats.


But, if other jobs get submitted during set up (as of course they do), 
they can take jobid 103. So, you've got problems.
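
To make the interleaving concrete, here is a toy model of the behaviour described above. It is not Slurm's allocator, just sequential numbering in arrival order; the function name `assign_job_ids` and the labels are invented for the illustration.

```python
from itertools import count


def assign_job_ids(submissions, first_id=100):
    """Hand out sequential job ids in the order job records are created.

    Toy model only: real Slurm id assignment is more involved.
    """
    ids = count(first_id)
    return {label: next(ids) for label in submissions}


array_tasks = [f"array_{i}" for i in range(1, 11)]

# Orderly case: nothing else arrives while the array tasks get their ids.
orderly = assign_job_ids(["a", "b"] + array_tasks)

# Busy case: someone else's job arrives right after the first array task.
busy = assign_job_ids(["a", "b", "array_1", "other"] + array_tasks[1:])

print(orderly["array_2"])              # 103
print(busy["other"], busy["array_2"])  # 103 104
```

In the busy case the jobid seen in jobstats for the second task is 104, so "array job id plus task offset" no longer identifies it.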


I think we may try to set a magic variable in the slurm prolog and use 
that for the jobid_var, but who knows.
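
One possible shape for that magic variable is sketched below: a label that matches what sacct prints for array tasks (for example 102_2) and falls back to the plain job id otherwise. `SLURM_ARRAY_JOB_ID` and `SLURM_ARRAY_TASK_ID` are standard Slurm variables, but whether they are visible in a given prolog context is an assumption to verify on your Slurm version. The function name `jobstats_label` is hypothetical.

```python
import os


def jobstats_label(env=None):
    """Build an identifier that lines up with sacct's array notation.

    Returns "<array_job>_<task>" for array tasks, else SLURM_JOB_ID,
    else an empty string when no Slurm variables are present.
    """
    env = os.environ if env is None else env
    array_job = env.get("SLURM_ARRAY_JOB_ID")
    array_task = env.get("SLURM_ARRAY_TASK_ID")
    if array_job and array_task:
        return f"{array_job}_{array_task}"
    return env.get("SLURM_JOB_ID", "")
```

A prolog could export the result under some variable name of its own choosing and point the Lustre jobid variable setting at that name.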


Scott


