Hi,

this doesn't look good. Why do you have such a high load when all processes are 
sleeping. Are they using some hidden threads (the main thread should then show 
running AFAIK). How many (real) cores do you have in the system? Maybe some 
jobs just ask the machine for number of cores and start threads automatically. 
When each job tries to use all cores, this could explain it.

export OMP_NUM_THREADS=$NSLOTS

in the jobscript can be used with some applications, to limit them to the 
number of started threads to the number of requested slots.

-- Reuti


Am 08.07.2011 um 17:48 schrieb Peskin, Eric:

> On Jul 8, 2011, at 5:52 AM, Reuti wrote:
>> 
>> can you please check on the node what is generating this high load?
>> 
>> $ ps -e f
> 
>  PID TTY      STAT   TIME COMMAND
>    1 ?        Ss     0:01 init [3]
>    2 ?        S<     0:00 [migration/0]
>    3 ?        SN     0:00 [ksoftirqd/0]
>    4 ?        S<     0:00 [watchdog/0]
>    5 ?        S<     0:00 [migration/1]
>    6 ?        SN     0:00 [ksoftirqd/1]
>    7 ?        S<     0:00 [watchdog/1]
>    8 ?        S<     0:00 [migration/2]
>    9 ?        SN     0:00 [ksoftirqd/2]
>   10 ?        S<     0:00 [watchdog/2]
>   11 ?        S<     0:00 [migration/3]
>   12 ?        SN     0:00 [ksoftirqd/3]
>   13 ?        S<     0:00 [watchdog/3]
>   14 ?        S<     0:00 [migration/4]
>   15 ?        SN     0:00 [ksoftirqd/4]
>   16 ?        S<     0:00 [watchdog/4]
>   17 ?        S<     0:00 [migration/5]
>   18 ?        SN     0:00 [ksoftirqd/5]
>   19 ?        S<     0:00 [watchdog/5]
>   20 ?        S<     0:00 [migration/6]
>   21 ?        SN     0:00 [ksoftirqd/6]
>   22 ?        S<     0:00 [watchdog/6]
>   23 ?        S<     0:00 [migration/7]
>   24 ?        SN     0:00 [ksoftirqd/7]
>   25 ?        S<     0:00 [watchdog/7]
>   26 ?        S<     0:00 [migration/8]
>   27 ?        SN     0:00 [ksoftirqd/8]
>   28 ?        S<     0:00 [watchdog/8]
>   29 ?        S<     0:00 [migration/9]
>   30 ?        SN     0:00 [ksoftirqd/9]
>   31 ?        S<     0:00 [watchdog/9]
>   32 ?        S<     0:00 [migration/10]
>   33 ?        SN     0:00 [ksoftirqd/10]
>   34 ?        S<     0:00 [watchdog/10]
>   35 ?        S<     0:00 [migration/11]
>   36 ?        SN     0:00 [ksoftirqd/11]
>   37 ?        S<     0:00 [watchdog/11]
>   38 ?        S<     0:00 [events/0]
>   39 ?        S<     0:00 [events/1]
>   40 ?        S<     0:00 [events/2]
>   41 ?        S<     0:00 [events/3]
>   42 ?        S<     0:00 [events/4]
>   43 ?        S<     0:00 [events/5]
>   44 ?        S<     0:00 [events/6]
>   45 ?        S<     0:00 [events/7]
>   46 ?        S<     0:00 [events/8]
>   47 ?        S<     0:00 [events/9]
>   48 ?        S<     0:00 [events/10]
>   49 ?        S<     0:00 [events/11]
>   50 ?        S<     0:00 [khelper]
>  443 ?        S<     0:00 [kthread]
>  459 ?        S<     0:00  \_ [kblockd/0]
>  460 ?        S<     0:00  \_ [kblockd/1]
>  461 ?        S<     0:00  \_ [kblockd/2]
>  462 ?        S<     0:00  \_ [kblockd/3]
>  463 ?        S<     0:00  \_ [kblockd/4]
>  464 ?        S<     0:00  \_ [kblockd/5]
>  465 ?        S<     0:00  \_ [kblockd/6]
>  466 ?        S<     0:00  \_ [kblockd/7]
>  467 ?        S<     0:00  \_ [kblockd/8]
>  468 ?        S<     0:00  \_ [kblockd/9]
>  469 ?        S<     0:00  \_ [kblockd/10]
>  470 ?        S<     0:00  \_ [kblockd/11]
>  471 ?        S<     0:00  \_ [kacpid]
>  633 ?        S<     0:00  \_ [cqueue/0]
>  634 ?        S<     0:00  \_ [cqueue/1]
>  635 ?        S<     0:00  \_ [cqueue/2]
>  636 ?        S<     0:00  \_ [cqueue/3]
>  637 ?        S<     0:00  \_ [cqueue/4]
>  638 ?        S<     0:00  \_ [cqueue/5]
>  639 ?        S<     0:00  \_ [cqueue/6]
>  640 ?        S<     0:00  \_ [cqueue/7]
>  641 ?        S<     0:00  \_ [cqueue/8]
>  642 ?        S<     0:00  \_ [cqueue/9]
>  643 ?        S<     0:00  \_ [cqueue/10]
>  644 ?        S<     0:00  \_ [cqueue/11]
>  647 ?        S<     0:00  \_ [khubd]
>  649 ?        S<     0:00  \_ [kseriod]
>  804 ?        S      0:00  \_ [pdflush]
>  805 ?        S      0:19  \_ [pdflush]
>  806 ?        S<     0:00  \_ [kswapd0]
>  807 ?        S<     0:00  \_ [kswapd1]
>  808 ?        S<     0:00  \_ [aio/0]
>  809 ?        S<     0:00  \_ [aio/1]
>  810 ?        S<     0:00  \_ [aio/2]
>  811 ?        S<     0:00  \_ [aio/3]
>  812 ?        S<     0:00  \_ [aio/4]
>  813 ?        S<     0:00  \_ [aio/5]
>  814 ?        S<     0:00  \_ [aio/6]
>  815 ?        S<     0:00  \_ [aio/7]
>  816 ?        S<     0:00  \_ [aio/8]
>  817 ?        S<     0:00  \_ [aio/9]
>  818 ?        S<     0:00  \_ [aio/10]
>  819 ?        S<     0:00  \_ [aio/11]
>  971 ?        S<     0:00  \_ [kpsmoused]
> 1083 ?        S<     0:00  \_ [ata/0]
> 1084 ?        S<     0:00  \_ [ata/1]
> 1085 ?        S<     0:00  \_ [ata/2]
> 1086 ?        S<     0:00  \_ [ata/3]
> 1087 ?        S<     0:00  \_ [ata/4]
> 1088 ?        S<     0:00  \_ [ata/5]
> 1089 ?        S<     0:00  \_ [ata/6]
> 1090 ?        S<     0:00  \_ [ata/7]
> 1091 ?        S<     0:00  \_ [ata/8]
> 1092 ?        S<     0:00  \_ [ata/9]
> 1093 ?        S<     0:00  \_ [ata/10]
> 1094 ?        S<     0:00  \_ [ata/11]
> 1095 ?        S<     0:00  \_ [ata_aux]
> 1109 ?        S<     0:00  \_ [scsi_eh_0]
> 1110 ?        S<     0:00  \_ [scsi_eh_1]
> 1111 ?        S<     0:00  \_ [scsi_eh_2]
> 1112 ?        S<     0:00  \_ [scsi_eh_3]
> 1113 ?        S<     0:00  \_ [scsi_eh_4]
> 1114 ?        S<     0:00  \_ [scsi_eh_5]
> 1141 ?        S<     0:00  \_ [kstriped]
> 1194 ?        S<     0:10  \_ [kjournald]
> 1219 ?        S<     0:01  \_ [kauditd]
> 2703 ?        S<     0:00  \_ [kmpathd/0]
> 2704 ?        S<     0:00  \_ [kmpathd/1]
> 2705 ?        S<     0:00  \_ [kmpathd/2]
> 2706 ?        S<     0:00  \_ [kmpathd/3]
> 2707 ?        S<     0:00  \_ [kmpathd/4]
> 2708 ?        S<     0:00  \_ [kmpathd/5]
> 2709 ?        S<     0:00  \_ [kmpathd/6]
> 2710 ?        S<     0:00  \_ [kmpathd/7]
> 2711 ?        S<     0:00  \_ [kmpathd/8]
> 2712 ?        S<     0:00  \_ [kmpathd/9]
> 2713 ?        S<     0:00  \_ [kmpathd/10]
> 2714 ?        S<     0:00  \_ [kmpathd/11]
> 2715 ?        S<     0:00  \_ [kmpath_handlerd]
> 2754 ?        S<     0:05  \_ [kjournald]
> 2756 ?        S<     0:04  \_ [kjournald]
> 3807 ?        S<     0:00  \_ [rpciod/0]
> 3808 ?        S<     0:08  \_ [rpciod/1]
> 3809 ?        S<     0:02  \_ [rpciod/2]
> 3810 ?        S<     0:03  \_ [rpciod/3]
> 3811 ?        S<     0:01  \_ [rpciod/4]
> 3812 ?        S<     0:11  \_ [rpciod/5]
> 3813 ?        S<     0:00  \_ [rpciod/6]
> 3814 ?        S<     0:00  \_ [rpciod/7]
> 3815 ?        S<     0:00  \_ [rpciod/8]
> 3816 ?        S<     0:03  \_ [rpciod/9]
> 3817 ?        S<     0:00  \_ [rpciod/10]
> 3818 ?        S<     0:58  \_ [rpciod/11]
> 4047 ?        SN   348:07  \_ [kipmi0]
> 1250 ?        S<s    0:00 /sbin/udevd -d
> 3598 ?        S<sl   0:05 auditd
> 3600 ?        S<sl   0:02  \_ /sbin/audispd
> 3722 ?        Sl   207:37 /opt/rocks/bin/python /opt/rocks/bin/greceptor
> 3734 ?        Ss     0:00 syslogd -m 0
> 3737 ?        Ss     0:00 klogd -x
> 3750 ?        Ss     0:12 irqbalance
> 3767 ?        Ss     0:00 portmap
> 3835 ?        Ss     0:00 rpc.statd
> 3866 ?        Ss     0:00 rpc.idmapd
> 3887 ?        Ss     0:00 dbus-daemon --system
> 3951 ?        S      0:00 [lockd]
> 3965 ?        Ss     0:00 /usr/sbin/acpid
> 3977 ?        Ss     0:01 hald
> 3978 ?        S      0:00  \_ hald-runner
> 3987 ?        S      0:00      \_ hald-addon-acpi: listening on acpid socket 
> /var/run/acpid.socket
> 3995 ?        S      0:00      \_ hald-addon-keyboard: listening on 
> /dev/input/event0
> 4125 ?        Ssl    0:00 automount
> 4189 ?        Rl    13:30 /opt/gridengine/bin/lx26-amd64/sge_execd
> 19334 ?        Z      0:00  \_ [sge_shepherd] <defunct>
> 20252 ?        S      0:00  \_ sge_shepherd-127467 -bg
> 20253 ?        Ss     0:00  |   \_ -bash 
> /opt/gridengine/default/spool/compute-0-0/job_scripts/127467 ohscal_4 
> UAF_X_squared1 svm_poly_optimize_c
> 20385 ?        Sl    31:47  |       \_ 
> /usr/local/MATLAB/R2011a/bin/glnxa64/MATLAB -nodisplay -r control_script_cl 
> ohscal_4 UAF_X_squared1 svm_poly_optimize_c; quit; -nojvm
> 20588 ?        S      0:00  \_ sge_shepherd-127468 -bg
> 20589 ?        Ss     0:00  |   \_ -bash 
> /opt/gridengine/default/spool/compute-0-0/job_scripts/127468 ohscal_4 
> UAF_X_squared1 krr_poly_optimize_c
> 20721 ?        Sl    81:14  |       \_ 
> /usr/local/MATLAB/R2011a/bin/glnxa64/MATLAB -nodisplay -r control_script_cl 
> ohscal_4 UAF_X_squared1 krr_poly_optimize_c; quit; -nojvm
> 20829 ?        S      0:00  \_ sge_shepherd-127468 -bg
> 20830 ?        Ss     0:00  |   \_ -bash 
> /opt/gridengine/default/spool/compute-0-0/job_scripts/127468 ohscal_4 
> UAF_X_squared1 krr_poly_optimize_c
> 20962 ?        Sl    84:24  |       \_ 
> /usr/local/MATLAB/R2011a/bin/glnxa64/MATLAB -nodisplay -r control_script_cl 
> ohscal_4 UAF_X_squared1 krr_poly_optimize_c; quit; -nojvm
> 21135 ?        S      0:00  \_ sge_shepherd-127468 -bg
> 21136 ?        Ss     0:00  |   \_ -bash 
> /opt/gridengine/default/spool/compute-0-0/job_scripts/127468 ohscal_4 
> UAF_X_squared1 krr_poly_optimize_c
> 21268 ?        Sl    71:47  |       \_ 
> /usr/local/MATLAB/R2011a/bin/glnxa64/MATLAB -nodisplay -r control_script_cl 
> ohscal_4 UAF_X_squared1 krr_poly_optimize_c; quit; -nojvm
> 21371 ?        S      0:00  \_ sge_shepherd-127468 -bg
> 21372 ?        Ss     0:00      \_ -bash 
> /opt/gridengine/default/spool/compute-0-0/job_scripts/127468 ohscal_4 
> UAF_X_squared1 krr_poly_optimize_c
> 21504 ?        Sl    74:30          \_ 
> /usr/local/MATLAB/R2011a/bin/glnxa64/MATLAB -nodisplay -r control_script_cl 
> ohscal_4 UAF_X_squared1 krr_poly_optimize_c; quit; -nojvm
> 4211 ?        Sl     0:06 /usr/sbin/snmpd -Lsd -Lf /dev/null -p 
> /var/run/snmpd.pid -a
> 4227 ?        Ss     0:00 /usr/sbin/sshd
> 23976 ?        Ss     0:00  \_ sshd: root@notty
> 23978 ?        Rs     0:00      \_ ps -e f
> 4244 ?        Ss     0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
> 4332 ?        Ss     0:00 /usr/libexec/postfix/master
> 4352 ?        S      0:00  \_ qmgr -l -t fifo -u
> 23354 ?        S      0:00  \_ pickup -l -t fifo -u
> 4344 ?        Ss     0:01 crond
> 4378 ?        Ss     0:00 xfs -droppriv -daemon
> 4401 ?        Ss     0:00 /usr/sbin/atd
> 4457 ?        S      0:00 /usr/sbin/smartd -q never
> 4461 tty1     Ss+    0:00 /sbin/mingetty tty1
> 4463 tty2     Ss+    0:00 /sbin/mingetty tty2
> 4465 tty3     Ss+    0:00 /sbin/mingetty tty3
> 4466 tty4     Ss+    0:00 /sbin/mingetty tty4
> 4467 tty5     Ss+    0:00 /sbin/mingetty tty5
> 4469 tty6     Ss+    0:00 /sbin/mingetty tty6
> 28118 ?        SLs    0:00 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid
> 6784 ?        Ss     2:53 /usr/sbin/gmond
> 
> 
>> (f w/o -) will generate a readable output. Are all jobs bound to the 
>> sge_execd and the sge_shepherds?
> 
> All the heavy processes are.
> 
> 
>> Are there kernel tasks in state D?
> 
> No, nothing seems to be in state D.
> 
> 
> For what it is worth, here is the output of top -n 1 on the same node:
> 
> top - 11:41:59 up 32 days,  1:22,  1 user,  load average: 37.85, 39.53, 38.43
> Tasks: 203 total,   2 running, 201 sleeping,   0 stopped,   0 zombie
> Cpu(s): 32.2%us, 11.3%sy,  0.0%ni, 56.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  49449280k total, 24527668k used, 24921612k free,    43444k buffers
> Swap:  1020116k total,        0k used,  1020116k free,   519884k cached
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 21268 liz06     25   0 5350m 4.2g  34m S 425.1  8.8  69:36.86 MATLAB
> 20721 liz06     25   0 5300m 4.2g  34m S 311.9  8.8  77:36.80 MATLAB
> 21504 liz06     25   0 5410m 4.2g  34m S 224.5  8.9  71:42.39 MATLAB
> 20962 liz06     25   0 5343m 4.2g  34m S 139.1  8.8  81:30.63 MATLAB
> 19600 liz06     25   0 3307m 2.7g  33m S 63.6  5.8  35:21.36 MATLAB
> 20385 liz06     21   0 3666m 2.6g  33m S 31.8  5.6  31:07.20 MATLAB
>    1 root      15   0 10348  696  584 S  0.0  0.0   0:01.93 init
>    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.52 migration/0
>    3 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/0
>    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.59 migration/1
>    6 root      34  19     0    0    0 S  0.0  0.0   0:00.10 ksoftirqd/1
>    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
>    8 root      RT  -5     0    0    0 S  0.0  0.0   0:00.44 migration/2
>    9 root      34  19     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/2
>   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
>   11 root      RT  -5     0    0    0 S  0.0  0.0   0:00.32 migration/3
>   12 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/3
>   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3
>   14 root      RT  -5     0    0    0 S  0.0  0.0   0:00.30 migration/4
>   15 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/4
>   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4
>   17 root      RT  -5     0    0    0 S  0.0  0.0   0:00.24 migration/5
>   18 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/5
>   19 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/5
>   20 root      RT  -5     0    0    0 S  0.0  0.0   0:00.60 migration/6
>   21 root      34  19     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/6
>   22 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/6
>   23 root      RT  -5     0    0    0 S  0.0  0.0   0:00.85 migration/7
>   24 root      34  19     0    0    0 S  0.0  0.0   0:00.22 ksoftirqd/7
>   25 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/7
>   26 root      RT  -5     0    0    0 S  0.0  0.0   0:00.35 migration/8
>   27 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/8
>   28 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/8
>   29 root      RT  -5     0    0    0 S  0.0  0.0   0:00.20 migration/9
>   30 root      34  19     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/9
>   31 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/9
>   32 root      RT  -5     0    0    0 S  0.0  0.0   0:00.18 migration/10
>   33 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/10
>   34 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/10
>   35 root      RT  -5     0    0    0 S  0.0  0.0   0:00.17 migration/11
>   36 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/11
>   37 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/11
>   38 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 events/0
>   39 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/1
>   40 root      10  -5     0    0    0 S  0.0  0.0   0:00.01 events/2
>   41 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/3
>   42 root      10  -5     0    0    0 S  0.0  0.0   0:00.01 events/4
> 
> 
> ------------------------------------------------------------
> This email message, including any attachments, is for the sole use of the 
> intended recipient(s) and may contain information that is proprietary, 
> confidential, and exempt from disclosure under applicable law. Any 
> unauthorized review, use, disclosure, or distribution is prohibited. If you 
> have received this email in error please notify the sender by return email 
> and delete the original message. Please note, the recipient should check this 
> email and any attachments for the presence of viruses. The organization 
> accepts no liability for any damage caused by any virus transmitted by this 
> email.
> =================================
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to