Hi,

On 17.06.2014 at 03:51, Sangmin Park wrote:
> It looks okay. But the usage reporting still does not work.
> This is the 'ps -e f' result:
>
> 11151 ? Sl 0:14 /opt/sge/bin/lx24-amd64/sge_execd
> 16851 ? S 0:00 \_ sge_shepherd-46865 -bg
> 16877 ? Ss 0:00 | \_ bash
> /opt/sge/default/spool/lion20/job_scripts/46865
> 16884 ? S 0:00 | \_ /bin/bash
> /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12
> /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIB
> 16895 ? S 0:00 | \_ mpiexec.hydra -machinefile
> /tmp/sge_machinefile_16884 -np 12
> /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.M
> 16896 ? S 0:00 | \_ /opt/sge/bin/lx24-amd64/qrsh
> -inherit lion20 /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port
> li
> 16906 ? S 0:00 | \_ /usr/bin/ssh -p 42593
> lion20 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
> '/opt/sge/default/spool/lion
> 16904 ? S 0:00 \_ sge_shepherd-46865 -bg
> 16905 ? Ss 0:00 \_ sshd: p012chm [priv]
> 16911 ? S 0:00 \_ sshd: p012chm@notty

Aha, you are using SSH. Please have a look here to enable proper accounting,
section "SSH TIGHT INTEGRATION":

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html

The location in OpenSSH is now covered here:

http://gridengine.org/pipermail/users/2013-December/006974.html

(An illustrative configuration sketch is appended at the end of this mail.)

> 16912 ? Ss 0:00 \_
> /opt/sge/utilbin/lx24-amd64/qrsh_starter
> /opt/sge/default/spool/lion20/active_jobs/46865.1/1.lion20
> 17001 ? S 0:00 \_
> /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion20:57442
> --pmi-connect lazy-cache --pmi-agg
> 17002 ? Rl 0:11 \_
> /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x

<snip>

> > queuename qtype resv/used/tot. load_avg arch states
> > ---------------------------------------------------------------------------------
> > all.q@lion01 BIP 0/0/12 2.03 lx24-amd64
> > ---------------------------------------------------------------------------------
> > all.q@lion02 BIP 0/0/12 0.00 lx24-amd64
> > ---------------------------------------------------------------------------------
> > all.q@lion03 BIP 0/0/12 12.00 lx24-amd64

Why is the load 12, when there are no slots used?

-- Reuti

> > ---------------------------------------------------------------------------------
> > all.q@lion04 BIP 0/0/12 0.03 lx24-amd64
> >
> >
> > FYI,
> > Our cluster has 37 computing nodes, lion01 ~ lion37.
> > SGE is installed in the /opt directory on the master node called 'lion',
> > and only the master node is a 'submit host'.
>
> Good, but does it now work correctly according to the tree output of the
> processes?
>
> -- Reuti
> >
> >
> > --Sangmin
> >
> >
> > On Fri, Jun 13, 2014 at 4:11 PM, Reuti <[email protected]> wrote:
> > On 13.06.2014 at 06:50, Sangmin Park wrote:
> >
> > > Hi,
> > >
> > > I've checked his job while it was running.
> > > I've checked it via the 'ps -ef' command and found that his job is using
> > > "mpiexec.hydra".
> >
> > Putting a blank between "-e" and "f" will give a nice process tree.
> >
> >
> > > And 'qrsh' is using the '-inherit' option. Here are the details:
> > >
> > > p012chm 21424 21398 0 13:20 ? 00:00:00 bash
> > > /opt/sge/default/spool/lion07/job_scripts/46651
> > > p012chm 21431 21424 0 13:20 ? 00:00:00 /bin/bash
> > > /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21442 21431 0 13:20 ? 00:00:00 mpiexec.hydra
> > > -machinefile /tmp/sge_machinefile_21431 -np 12
> >
> > What creates this "sge_machinefile_21431"? Often it's put into $TMPDIR,
> > i.e. the temporary directory of the job, as you can always use the same
> > name and it will be removed after the job for sure.
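For illustration only (this is not taken from the user's job script): such a
machine file is usually generated inside the job from SGE's $PE_HOSTFILE. A
minimal sketch, assuming Intel MPI's "host:slots" machine file format and a
placeholder program name:

    # Sketch only - the awk conversion and the program name are assumptions.
    # Each $PE_HOSTFILE line looks like: "<host> <slots> <queue> <processor-range>"
    MACHINEFILE="$TMPDIR/machines"
    awk '{print $1":"$2}' "$PE_HOSTFILE" > "$MACHINEFILE"   # e.g. "lion20:12"
    mpirun -machinefile "$MACHINEFILE" -np "$NSLOTS" ./my_mpi_program

Anything written to $TMPDIR vanishes automatically when the job ends, so no
stale files are left behind in /tmp.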
> >
> >
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21443 21442 0 13:20 ? 00:00:00
> > > /opt/sge/bin/lx24-amd64/qrsh -inherit lion07
> >
> > Ok, on the one hand this looks good and should give a proper accounting.
> > But maybe there is something about the hostname resolution, as AFAIK on the
> > local machine "lion07" it should just fork instead of making a local
> > `qrsh -inherit ...`.
> >
> > Does `qstat -f` list the short names only, or are the FQDNs in the output
> > for the queue instances?
> >
> > -- Reuti
> >
> >
> > > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port
> > > lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh
> > > --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > root 21452 21451 0 13:20 ? 00:00:00 sshd: p012chm [priv]
> > > p012chm 21453 21443 0 13:20 ? 00:00:00 /usr/bin/ssh -p 60725
> > > lion07 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
> > > '/opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07'
> > > p012chm 21457 21452 0 13:20 ? 00:00:00 sshd: p012chm@notty
> > > p012chm 21458 21457 0 13:20 ? 00:00:00
> > > /opt/sge/utilbin/lx24-amd64/qrsh_starter
> > > /opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07
> > > p012chm 21548 21458 0 13:20 ? 00:00:00
> > > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port
> > > lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh
> > > --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > p012chm 21549 21548 99 13:20 ? 00:22:04
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21550 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21551 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21552 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21553 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21554 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21555 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21556 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21557 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21558 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21559 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > p012chm 21560 21548 99 13:20 ? 00:22:10
> > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > smpark 21728 21638 0 13:43 pts/0 00:00:00 grep chm
> > >
> > > --Sangmin
> > >
> > > On Thu, Jun 12, 2014 at 8:04 PM, Reuti <[email protected]> wrote:
> > > On 12.06.2014 at 04:23, Sangmin Park wrote:
> > >
> > > > I've checked the version of Intel MPI. He uses Intel MPI version 4.0.3.008.
> > > > Our system uses rsh to access computing nodes. SGE does, too.
> > > >
> > > > Please let me know how to check which one is used, 'mpiexec.hydra' or
> > > > 'mpiexec'.
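As an illustration of such a check (the paths are taken from the ps output
above; nothing here is an official Intel MPI procedure): look at what
"mpiexec" in the Intel MPI bin directory actually is, and at the process tree
of a running job.

    # Is mpiexec its own binary or just a link to mpiexec.hydra?
    ls -l /opt/intel/impi/4.0.3.008/intel64/bin/mpiexec \
          /opt/intel/impi/4.0.3.008/intel64/bin/mpiexec.hydra

    # A Hydra-started job shows "mpiexec.hydra" in the tree; the old
    # MPD-based startup shows mpd daemons instead.
    ps -e f | egrep 'mpiexec|mpirun|mpd'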
> > > Do you have both files somewhere in a "bin" directory inside the Intel
> > > MPI? You could rename "mpiexec" and create a symbolic link "mpiexec"
> > > pointing to "mpiexec.hydra". The old startup will need some daemons
> > > running on the node (which are outside of SGE's control and accounting*),
> > > but "mpiexec.hydra" will start up the child processes as its own children
> > > and they should hence be under SGE's control. And as long as you are
> > > staying on one and the same node, this should already work without
> > > further setup. To avoid a later surprise when you compute across nodes,
> > > the `rsh`/`ssh` calls should nevertheless be caught and redirected to
> > > `qrsh -inherit ...`, as outlined in "$SGE_ROOT/mpi".

(A sketch of such a wrapper is appended at the end of this mail.)

> > >
> > > -- Reuti
> > >
> > > *) It's even possible to force the daemons to be started under SGE, but
> > > it's convoluted and not recommended.
> > >
> > >
> > > > Sangmin
> > > >
> > > >
> > > > On Wed, Jun 11, 2014 at 6:46 PM, Reuti <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On 11.06.2014 at 02:38, Sangmin Park wrote:
> > > >
> > > > > For the best performance, we recommend users use 8 cores on a
> > > > > single node, not distributed across multiple nodes.
> > > > > As I said before, he uses a VASP application compiled with Intel MPI.
> > > > > So he uses Intel MPI now.
> > > >
> > > > Which version of Intel MPI? Even with the latest one it's not tightly
> > > > integrated by default (despite the fact that MPICH3 [on which it is
> > > > based] is tightly integrated by default).
> > > >
> > > > Depending on the version it might be necessary to make some adjustments:
> > > > IIRC mainly use `mpiexec.hydra` instead of `mpiexec` and supply a
> > > > wrapper to catch the `rsh`/`ssh` call (like in the MPI demo in SGE's
> > > > directory).
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > --Sangmin
> > > > >
> > > > >
> > > > > On Tue, Jun 10, 2014 at 5:58 PM, Reuti <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 10.06.2014 at 10:21, Sangmin Park wrote:
> > > > >
> > > > > > This user always runs parallel jobs using the VASP application.
> > > > > > Usually, he uses 8 cores per job. Lots of jobs of this kind have
> > > > > > been submitted by the user.
> > > > >
> > > > > 8 cores on a particular node or 8 slots across the cluster? What MPI
> > > > > implementation does he use?
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > NB: Please keep the list posted.
> > > > >
> > > > >
> > > > > > Sangmin
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 10, 2014 at 3:42 PM, Reuti <[email protected]> wrote:
> > > > > > On 10.06.2014 at 08:00, Sangmin Park wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'm very confused about the output of the qacct command.
> > > > > > > I thought the CPU column is the best way to measure resource
> > > > > > > usage by users, based on this web page:
> > > > > > > https://wiki.duke.edu/display/SCSC/Checking+SGE+Usage
> > > > > > >
> > > > > > > But I have a situation.
> > > > > > > One of the users at my institution (actually one of our heaviest
> > > > > > > users) uses lots of HPC resources. To get the resource usage of
> > > > > > > this user, as required for payment, I ran qacct; the output is
> > > > > > > below, just for May.
> > > > > > >
> > > > > > > OWNER      WALLCLOCK     UTIME     STIME       CPU    MEMORY        IO       IOW
> > > > > > > ========================================================================================================================
> > > > > > > p012chm      2980810    28.485    35.012   100.634     4.277     0.576     0.000
> > > > > > >
> > > > > > > The CPU time is much too small. Because he is a very heavy user
> > > > > > > at our institution, I cannot accept this result. The WALLCLOCK
> > > > > > > time, however, is very large.
> > > > > > >
> > > > > > > How do I get correct information about resource usage by users
> > > > > > > via qacct?
> > > > > >
> > > > > > This may happen in case you have parallel jobs which are not
> > > > > > tightly integrated into SGE. What types of jobs is the user running?
> > > > > >
> > > > > > -- Reuti
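For reference, a per-owner summary like the one above can be limited to a time
window with qacct's -o, -b and -e switches (times in [[CC]YY]MMDDhhmm format);
a sketch for May 2014:

    # Summary for one owner; leaving out the owner name summarizes all owners.
    qacct -o p012chm -b 201405010000 -e 201406010000

Keep in mind that UTIME/STIME/CPU/MEMORY can only cover what ran under
sge_execd's control: with a loosely integrated parallel job most of the work
happens outside of it, so only WALLCLOCK looks plausible; this matches the
numbers above.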
>
> --
> ===========================
> Sangmin Park
> Supercomputing Center
> Ulsan National Institute of Science and Technology(UNIST)
> Ulsan, 689-798, Korea
>
> phone : +82-52-217-4201
> mobile : +82-10-5094-0405
> fax : +82-52-217-4209
> ===========================
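P.S. To make the "SSH TIGHT INTEGRATION" pointer above concrete: the idea is
that SGE starts its own sshd under the shepherd for the qrsh connections, so
that everything launched through SSH stays under SGE's control and accounting.
Only as a rough sketch (the parameter names are from sge_conf(5); the actual
values depend on your installation and on an sshd prepared as described in the
two links):

    qconf -mconf    # then set, for example:
    #
    #   rsh_command   /usr/bin/ssh
    #   rsh_daemon    /usr/sbin/sshd -i
    #
    # (rlogin_* and qlogin_* can be switched over analogously.)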
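And for the wrapper mentioned further down in the thread (catching the
rsh/ssh calls and redirecting them to `qrsh -inherit`, in the spirit of the
script shipped in $SGE_ROOT/mpi): a stripped-down sketch, not a copy of that
script:

    #!/bin/sh
    # rsh-wrapper (sketch): forward "rsh <host> <command>" to SGE instead of
    # opening a real rsh/ssh connection; the option handling of a real
    # rsh/ssh client is not reproduced here.
    host="$1"; shift
    exec qrsh -inherit "$host" "$@"   # assumes $SGE_ROOT/bin/<arch> is in the job's PATH

With Intel MPI's Hydra you would then point the bootstrap at it, e.g.
`mpiexec.hydra -bootstrap rsh -bootstrap-exec /path/to/rsh-wrapper ...`
(the bootstrap switches are the same ones already visible in the pmi_proxy
lines of the ps output above; the wrapper path is a placeholder).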
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
