It’s not the NFS traffic; that is fairly minimal, since most of our storage traffic goes to a DDN GPFS cluster.
The 139k jobs were not array jobs. The user refuses to do the work to make them array jobs because he thinks snakemake makes it easier for him to handle dependencies with DRMAA and regular jobs. schedd_job_info is already false.

Best regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


On 29.06.17, 15:47, "Mark Dixon" <[email protected]> wrote:

    On Tue, 27 Jun 2017, [email protected] wrote:
    > Never mind. One of my users submitted a job with 139k subjobs.
    ...

    Hi,

    I don't think I have all the messages from this thread for some reason.
    No doubt I'm going to repeat things someone else has suggested -
    apologies in advance :)

    Firstly, make sure you've optimised your disk I/O. This typically means
    making sure $SGE_ROOT is on a filesystem local to your qmaster, and
    reducing NFS traffic from your compute nodes by making their spools
    local to each node (they end up in $SGE_ROOT/$SGE_CELL/spool/<HOSTNAME>
    by default, but you can choose somewhere else at install time - replace
    with a symlink later if it's an existing install). It is useful,
    though, for the messages file in there to stay central (again, a
    symlink, and its NFS traffic doesn't seem to slow things down). People
    seem to get good results doing this and sticking with classic spooling.
    Certainly I do :)

    Secondly, you talk about 139k subjobs - so this is a task array, right?
    That is large. You should find that the qmaster handles its memory
    better with large task arrays if you can live with setting
    schedd_job_info to false in 'qconf -msconf' - it stops the qmaster from
    collecting the info shown under 'scheduling info' in the output of
    'qstat -j <jid>'.

    Hope this helps,

    Mark

_______________________________________________
SGE-discuss mailing list
[email protected]
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
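As a rough sketch of the array-job approach Mark describes (the script name, job name, task count and input layout below are illustrative assumptions, not taken from the thread), the 139k individual submissions could be collapsed into a single SGE task array:

    #!/bin/bash
    #$ -N big_analysis        # hypothetical job name
    #$ -t 1-139000            # one task per input, instead of 139k separate jobs
    #$ -cwd

    # Each task picks its own input via $SGE_TASK_ID, so the qmaster tracks
    # one array job record rather than 139k independent job records.
    INPUT="inputs/sample_${SGE_TASK_ID}.dat"   # hypothetical input naming scheme
    ./process_one "$INPUT"                     # hypothetical per-sample command

Submitted once (e.g. 'qsub run_array.sh', filename assumed), this gives the scheduler a single job with 139,000 tasks to iterate over.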
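For the scheduler setting Mark mentions, a quick way to check and change it (the grep is just for illustration):

    # Show the current scheduler configuration and look for the setting
    qconf -ssconf | grep schedd_job_info

    # 'qconf -msconf' opens the scheduler configuration in an editor;
    # the relevant line should read:
    #   schedd_job_info    false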
