It's also worth looking at AcctGatherProfileType with the HDF5 plugin (acct_gather_profile/hdf5) for more detailed job profiling.

Cheers,
Roshan
________________________________________
From: Christopher Samuel <[email protected]>
Sent: 01 December 2014 22:38
To: slurm-dev
Subject: [slurm-dev] Re: Job Resource Report

On 02/12/14 06:45, Will French wrote:

> Does anyone know of a Slurm equivalent to the Torque command tracejob
> (http://docs.adaptivecomputing.com/torque/4-1-7/Content/topics/11-troubleshooting/usingTracejobToLocateFailures.htm)?
> This command allows you to easily compare requested resources to actual
> usage, and is useful for troubleshooting when a user's job dies.

I think sacct (if you've set up accounting) will give you a lot of
that.  Here's an example from a trivial job of mine that just does
a sleep 60 and exit 1.

Apologies for the very long lines!

[samuel@barcoo BARCOO]$ sacct -j 2633455 -l
       JobID    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreq     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite
------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- --------------
2633455      failjob.sh       main                                                                                                                                                                                                               1   00:01:00     FAILED      1:0                              2Gc
2633455.bat+      batch               134884K      barcoo001              0    106056K       316K  barcoo001          0       316K        0    barcoo001              0          0   00:00:00  barcoo001          0   00:00:00        1          1   00:01:00     FAILED      1:0      2.70G          0        2Gc              0        0.01M       barcoo001               0          0.01M        0.00M        barcoo001                0          0.00M
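
If the wide layout is hard to read, sacct can also emit pipe-delimited
output with -P (--parsable2), which is easy to reshape into one field per
line. What follows is only a rough sketch, not anything that ships with
Slurm. The function names (run_sacct, parse_parsable2, to_kib, report) and
the SAMPLE text are made up for illustration; SAMPLE is a hand-shortened
excerpt loosely based on the job above. It also assumes that the K/M/G
suffixes are 1024-based and that a trailing c or n on ReqMem means per-CPU
or per-node, so check both against your Slurm version.

```python
import subprocess

# Hypothetical, hand-shortened sample of `sacct -j 2633455 -l -P` output,
# included so the sketch runs on a machine without Slurm.
SAMPLE = """\
JobID|JobName|Partition|MaxRSS|AllocCPUS|Elapsed|State|ExitCode|ReqMem
2633455|failjob.sh|main||1|00:01:00|FAILED|1:0|2Gc
2633455.batch|batch||316K|1|00:01:00|FAILED|1:0|2Gc
"""

# Assumption: sacct size suffixes are 1024-based.
KIB_PER_UNIT = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}


def run_sacct(jobid):
    """Return the long, pipe-delimited sacct report for one job as text."""
    result = subprocess.run(
        ["sacct", "-j", str(jobid), "-l", "-P"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_parsable2(text):
    """Return one dict per output row, keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def to_kib(value):
    """Convert a size such as '316K' or '2Gc' to KiB; None if the field is empty."""
    value = value.strip()
    if value and value[-1] in "cn":  # assumed per-CPU / per-node marker on ReqMem
        value = value[:-1]
    if not value:
        return None
    if value[-1] not in KIB_PER_UNIT:
        raise ValueError(f"no recognised unit suffix in {value!r}")
    return float(value[:-1]) * KIB_PER_UNIT[value[-1]]


def report(rows):
    """Format the rows as one 'field: value' line per non-empty field."""
    out = []
    for row in rows:
        out.append(f"== {row['JobID']} ==")
        for name, value in row.items():
            if value:
                out.append(f"{name:>12}: {value}")
    return "\n".join(out)


if __name__ == "__main__":
    rows = parse_parsable2(SAMPLE)
    print(report(rows))
    batch = rows[1]
    print("requested KiB (per CPU):", to_kib(batch["ReqMem"]))
    print("peak RSS KiB:", to_kib(batch["MaxRSS"]))
```

On a real cluster, replace SAMPLE with run_sacct("2633455") (or your own
job ID). Comparing to_kib(ReqMem) with to_kib(MaxRSS) for the batch step
gives a quick requested-versus-used check.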


--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
