It’s not the NFS traffic; that is fairly minimal, since most of our storage traffic goes to a DDN GPFS cluster.
The 139k jobs were not array jobs. The user refuses to do the work to make them array jobs because he thinks snakemake makes it easier for him to handle dependencies with DRMAA and regular jobs. schedd_job_info is already false.

Best regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800


On 29.06.17, 15:47, "Mark Dixon" <[email protected]> wrote:

    On Tue, 27 Jun 2017, [email protected] wrote:
    > Never mind. One of my users submitted a job with 139k subjobs.
    ...

    Hi,

    I don't think I have all the messages from this thread for some reason.
    No doubt I'm going to repeat things someone else has suggested -
    apologies in advance :)

    Firstly, make sure you've optimised your disk I/O. This typically means
    making sure $SGE_ROOT is on a filesystem local to your qmaster, and
    reducing NFS traffic from your compute nodes by making their spools
    local to each node (they end up in $SGE_ROOT/$SGE_CELL/spool/<HOSTNAME>
    by default, but you can choose somewhere else at install time - replace
    with a symlink later if it's an existing install). It is useful,
    though, for the messages file in there to stay central (again, a
    symlink, and its NFS traffic doesn't seem to slow things down). People
    seem to get good results doing this and sticking with classic spooling.
    Certainly I do :)

    Secondly, you talk about 139k subjobs - so this is a task array, right?
    That is large. You should find that the qmaster handles its memory
    better with large task arrays if you can live with setting
    schedd_job_info to false in 'qconf -msconf' - it stops the qmaster from
    collecting the info shown under 'scheduling info' in the output of
    'qstat -j <jid>'.

    Hope this helps,

    Mark

_______________________________________________
SGE-discuss mailing list
[email protected]
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
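As a rough sketch of the array-job approach Mark describes (the script name, job name, task count and input layout below are illustrative assumptions, not taken from the thread), the 139k individual submissions could be collapsed into a single SGE task array:

    #!/bin/bash
    #$ -N big_analysis        # hypothetical job name
    #$ -t 1-139000            # one task per input, instead of 139k separate jobs
    #$ -cwd

    # Each task picks its own input via $SGE_TASK_ID, so the qmaster tracks
    # one array job record rather than 139k independent job records.
    INPUT="inputs/sample_${SGE_TASK_ID}.dat"   # hypothetical input naming scheme
    ./process_one "$INPUT"                     # hypothetical per-sample command

Submitted once (e.g. 'qsub run_array.sh', filename assumed), this gives the scheduler a single job with 139,000 tasks to iterate over.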
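For the scheduler setting Mark mentions, a quick way to check and change it (the grep is just for illustration):

    # Show the current scheduler configuration and look for the setting
    qconf -ssconf | grep schedd_job_info

    # 'qconf -msconf' opens the scheduler configuration in an editor;
    # the relevant line should read:
    #   schedd_job_info    false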
