Nothing abnormal showing up in ps — that was what I had suspected as well.
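
For reference, the check was along these lines (a sketch; the exact flags may have differed), sorting by elapsed time so anything long-lived and left over from the job floats to the top:

```shell
# Sketch: list the longest-running processes on the compute node,
# sorted by elapsed time, to spot anything the job left behind.
# (Run on the batch host, e.g. via ssh udc-ba33-28c.)
ps -eo pid,user,etime,comm --sort=-etime | head -20
```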

Cheers,

Alden

On Jun 8, 2017, at 10:31 PM, Wensheng Deng <[email protected]> wrote:

Hi Alden,

The CPU time is probably the sum across all 8 CPU cores. By any chance is there
a runaway process from the job still on the node, such as an epilogue script?
I am guessing...


On Thu, Jun 8, 2017 at 2:39 PM Stradling, Alden Reid (ars9ac)
<[email protected]> wrote:
I have a job whose workload finished yesterday (successfully, no issues, output 
files good), but the SLURM job is still accumulating time. I just suspended it, 
but I’d like to know how it’s getting away with billing many extra hours.

The other 26 in this batch completed normally. The job script completed on
June 7th at 7:53:19.

   JobName      State               Start    Elapsed    CPUTime
---------- ---------- ------------------- ---------- ----------
PBMC_5c_0+  COMPLETED 2017-06-06T16:39:51   02:53:56   23:11:28
     batch  COMPLETED 2017-06-06T16:39:51   02:53:56   23:11:28
PBMC_6a_0+  COMPLETED 2017-06-06T16:39:51   04:54:06 1-15:12:48
     batch  COMPLETED 2017-06-06T16:39:51   04:54:06 1-15:12:48
PBMC_6b_0+  COMPLETED 2017-06-06T16:39:51   03:04:41 1-00:37:28
     batch  COMPLETED 2017-06-06T16:39:51   03:04:41 1-00:37:28
PBMC_6c_0+  SUSPENDED 2017-06-06T16:39:51 1-21:12:55 15-01:43:20

That 15 days of CPU time shouldn't be possible, since the job only started two days ago.

sstat has nothing to say. scontrol shows me nothing out of the ordinary:

[root@udc-ba34-37:~] scontrol show jobid -dd  665155
JobId=665155 JobName=PBMC_6c_020917_ATACseq.py
   JobState=SUSPENDED Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=1-21:12:39 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2017-06-06T16:39:49 EligibleTime=2017-06-06T16:39:49
   StartTime=2017-06-06T16:39:51 EndTime=2017-06-09T16:39:51
   PreemptTime=None SuspendTime=2017-06-08T13:52:30 SecsPreSuspend=162759
   Partition=serial AllocNode:Sid=udc-ba34-37:199211
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=udc-ba33-28c
   BatchHost=udc-ba33-28c
   NumNodes=1 NumCPUs=8 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
     Nodes=udc-ba33-28c CPU_IDs=2-9 Mem=32000
   MinCPUsNode=8 MinMemoryNode=32000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.sub
   WorkDir=/sfs/lustre/allocations/shefflab/processed/cphg_atac/results_pipeline
   StdErr=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log
   StdIn=/dev/null
   StdOut=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log
   BatchScript=
#!/bin/bash
#SBATCH --job-name='PBMC_6c_020917_ATACseq.py'
#SBATCH --output='/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log'
#SBATCH --mem='32000'
#SBATCH --cpus-per-task='8'
#SBATCH --time='3-00:00:00'
#SBATCH --partition='serial'
#SBATCH -m block
#SBATCH --ntasks=1

echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

/home/ns5bc/code/ATACseq/pipelines/ATACseq.py \
  --input2 /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L001_R2_001.fastq.gz \
           /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L002_R2_001.fastq.gz \
           /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L003_R2_001.fastq.gz \
           /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L004_R2_001.fastq.gz \
  --genome hg38 --single-or-paired paired --sample-name PBMC_6c_020917 \
  --input /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L001_R1_001.fastq.gz \
          /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L002_R1_001.fastq.gz \
          /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L003_R1_001.fastq.gz \
          /sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L004_R1_001.fastq.gz \
  --prealignments rCRSd --genome-size hs -D \
  --frip-ref-peaks /home/ns5bc/code/cphg_atac/metadata/CD4_hotSpot_liftedhg19tohg38.bed \
  -O /sfs/lustre/allocations/shefflab/processed/cphg_atac/results_pipeline -P 8 -M 32000

Thanks!

———————
Alden Stradling
Research Computing Infrastructure
University of Virginia
[email protected]

