[gridengine users] SGE and GPU

2014-04-14 Thread Feng Zhang
Hi, Is there any plan to implement GPU resource management in SGE in the near future, like Slurm or Torque? There are some ways to do this using scripts/programs, but I wonder whether SGE itself can recognize and manage GPUs (and Phi). It does not need to be complicated and powerful, just do

Re: [gridengine users] SGE and GPU

2014-04-14 Thread Feng Zhang
working on it. Ian On Mon, Apr 14, 2014 at 10:35 AM, Feng Zhang prod.f...@gmail.com wrote: Hi, Is there any plan to implement GPU resource management in SGE in the near future, like Slurm or Torque? There are some ways to do this using scripts/programs, but I wonder whether SGE itself

Re: [gridengine users] SGE and GPU

2014-04-14 Thread Feng Zhang
, but as I understand it, it is not efficient? Each node has, for example, 4 queues. If a user submits a PE job to one queue, he/she cannot use the GPUs on the other queues? On Mon, Apr 14, 2014 at 2:16 PM, Reuti re...@staff.uni-marburg.de wrote: On 14.04.2014 at 20:06, Feng Zhang wrote

Re: [gridengine users] SGE and GPU

2014-04-14 Thread Feng Zhang
: Again, look into using it as a consumable resource as Gowtham posted above. Ian On Mon, Apr 14, 2014 at 11:57 AM, Feng Zhang prod.f...@gmail.com wrote: Thanks, Reuti, It looks like the socket solution only works for serial jobs, not PE jobs, right? Our cluster has different nodes, some nodes

Re: [gridengine users] SGE and GPU

2014-04-14 Thread Feng Zhang
On Mon, Apr 14, 2014 at 5:36 PM, Reuti re...@staff.uni-marburg.de wrote: On 14.04.2014 at 20:57, Feng Zhang wrote: Thanks, Reuti, It looks like the socket solution only works for serial jobs, not PE jobs, right? You mean using more than one GPU at a time, or using parallel processes

Re: [gridengine users] Script in Parallel Environment

2014-04-29 Thread Feng Zhang
For Red Hat Linux, you may try putting the script into /etc/profile.d, for the bash environment. On Tuesday, April 29, 2014, Yago Fernández Pinilla yago...@gmail.com wrote: Hi all, Is it possible to execute a script in Grid Engine as a startup script on every node with tight integration? The

[gridengine users] sequential jobs on cluster

2014-05-23 Thread Feng Zhang
Hi, I am now running large disk IO jobs(sequential jobs NOT PE jobs) on my cluster. Is there any way I can submit my jobs evenly to all the nodes? The problem I have now is: I have 10 jobs, and I have 10 nodes, and each node has 10 CPU cores. When I submit my jobs, all the 10 jobs are

Re: [gridengine users] sequential jobs on cluster

2014-05-28 Thread Feng Zhang
the queues. For example, I want jobs in queue FILLUP to fill up a node as far as possible, while jobs in queue SERIAL fill nodes in a round-robin way? Thanks, Feng On Fri, May 23, 2014 at 5:47 PM, Reuti re...@staff.uni-marburg.de wrote: Please keep the list posted. On 23.05.2014 at 23:37, Feng Zhang wrote

Re: [gridengine users] sequential jobs on cluster

2014-05-28 Thread Feng Zhang
On Wed, May 28, 2014 at 11:13 AM, Feng Zhang prod.f...@gmail.com wrote: The method on Stephans' blog works fine. Works fine: I mean it can distribute jobs evenly into the cluster, but not in round robin way. load_formula -slots schedule_interval 00:00:10

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Feng Zhang
Guys, Just curious, how does the h_vmem work on processes of MPI jobs(or OPENMP, multi-threading)? I have some parallel jobs, the top command shows VET of 40GB, while the RES is only 100MB. On Mon, Jun 30, 2014 at 3:01 PM, Michael Stauffer mgsta...@gmail.com wrote: Message: 4 Date: Mon, 30 Jun

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Feng Zhang
Sorry a typo. The VET should be VIRT. On Mon, Jun 30, 2014 at 4:47 PM, Feng Zhang prod.f...@gmail.com wrote: Guys, Just curious, how does the h_vmem work on processes of MPI jobs(or OPENMP, multi-threading)? I have some parallel jobs, the top command shows VET of 40GB, while the RES is only

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Feng Zhang
, Jun 30, 2014 at 4:47 PM, Feng Zhang prod.f...@gmail.com wrote: Guys, Just curious, how does the h_vmem work on processes of MPI jobs(or OPENMP, multi-threading)? I have some parallel jobs, the top command shows VET of 40GB, while the RES is only 100MB. On Mon, Jun 30, 2014 at 3:01 PM

Re: [gridengine users] SGE and NFS

2014-11-12 Thread Feng Zhang
Bright sets the spool to be local on each node, while the config and executables are on NFS if you have an HA configuration on your head servers. I think in theory, if the active head fails, you can bring it offline and make the passive head active manually, and your jobs will not be lost. From the error

Re: [gridengine users] Requesting a resource OR another resource

2014-11-19 Thread Feng Zhang
SGE has no information about GPUs. Defining a consumable such as ngpus is one way to handle that, but SGE still does not know which GPU is assigned to which job (or process). What I did is to set up a script to assign available GPU id(s) to a job (or MPI process), like an SGE load sensor, but put it in
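A rough sketch of the "script that assigns available GPU ids" idea, not the script described above (which is not shown in this excerpt). It assumes one lock file per GPU in a node-local directory; the names claim_gpus, release_gpus and the lock directory layout are made up for illustration, and where the script gets hooked into SGE is left open.

```python
import os


def claim_gpus(n_wanted, n_total, lock_dir):
    """Claim n_wanted free GPU ids on this node via per-GPU lock files.

    Hypothetical helper: lock_dir and the gpuN.lock naming are assumptions.
    """
    os.makedirs(lock_dir, exist_ok=True)
    claimed = []
    for gpu_id in range(n_total):
        if len(claimed) == n_wanted:
            break
        path = os.path.join(lock_dir, f"gpu{gpu_id}.lock")
        try:
            # O_EXCL makes creation fail if another job already holds this GPU.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        os.close(fd)
        claimed.append(gpu_id)
    if len(claimed) < n_wanted:
        # Give back the partial claim so other jobs are not starved.
        release_gpus(claimed, lock_dir)
        raise RuntimeError("not enough free GPUs on this node")
    return claimed


def release_gpus(gpu_ids, lock_dir):
    """Remove the lock files for the given GPU ids (call when the job ends)."""
    for gpu_id in gpu_ids:
        os.remove(os.path.join(lock_dir, f"gpu{gpu_id}.lock"))
```

The claimed ids could then be exported to the job, for example as CUDA_VISIBLE_DEVICES=",".join(map(str, ids)), so that the ngpus consumable only does the counting and the script decides which devices are used.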

Re: [gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

2015-01-23 Thread Feng Zhang
Ilya, Can you please run: qstat -j jobid and paste the output here? It may be useful for checking the problem. On Fri, Jan 23, 2015 at 12:08 PM, Ilya M 4ilya.m+g...@gmail.com wrote: Removed the quota limits. To no avail: same problems. Original Message Subject: Re:

Re: [gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

2015-01-23 Thread Feng Zhang
in queue gpu.q@gpu006 because job requests unknown resource (mem_free) ... Ilya. Original Message Subject: Re: [gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource From: Feng Zhang prod.f...@gmail.com To: Ilya

Re: [gridengine users] Huge amount of files generated in local disk

2015-01-28 Thread Feng Zhang
26.01.2015 at 17:15, Feng Zhang prod.f...@gmail.com wrote: I just found a strange behavior of SGE 2011. One user's job generates 1+ million small files on local disk ($TEMPDIR). Hence in the local scratch directory provided by SGE? It looks like it makes the execd very busy and from

Re: [gridengine users] How to set up h_vmem as a consumable resource

2015-02-25 Thread Feng Zhang
On Wed, Feb 25, 2015 at 6:44 AM, Simon Andrews simon.andr...@babraham.ac.uk wrote: From: Mishkin Derakhshan mishkin...@gmail.com Date: Tuesday, 24 February 2015 00:07 To: users@gridengine.org users@gridengine.org Subject: [gridengine users] How to set up h_vmem as a consumable resource

Re: [gridengine users] limit slots to core count no longer works

2015-04-14 Thread Feng Zhang
What is $num_proc? Did you try to set a real number? Like limit hosts {*} to slots=12? On Tue, Apr 14, 2015 at 3:32 PM, John Young j.e.yo...@larc.nasa.gov wrote: Hello, We (fairly) recently upgraded our cluster to Rocks 6.1.1 and we now seem to be having problems with RQS. On our old

Re: [gridengine users] Negative complex values

2015-06-08 Thread Feng Zhang
Hi Simon, As you defined the h_vmem as JOB, according to the manual: A consumable defined by 'y' is a per slot consumables which means the limit is multiplied by the number of slots being used by the job before being applied. In case of 'j' the consumable is a per job
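The arithmetic behind the manual text quoted above, as a small sketch. The function name effective_limit is made up for illustration; only the 'y' (per slot, multiplied by the slot count) and 'j' (per job) cases from the quote are covered.

```python
def effective_limit(requested, slots, consumable):
    """Return the amount debited for a job, per the quoted manual text.

    requested  -- value requested for the consumable (e.g. h_vmem in GB)
    slots      -- number of slots the job uses
    consumable -- 'y' for a per-slot consumable, 'j' for a per-job one
    """
    if consumable == "y":
        # Per-slot: the request is multiplied by the number of slots.
        return requested * slots
    if consumable == "j":
        # Per-job: the request is applied once, regardless of slots.
        return requested
    raise ValueError("expected 'y' or 'j'")
```

For example, an 8-slot job requesting 4 GB comes to effective_limit(4, 8, "y") == 32 with a per-slot consumable, but only 4 with a per-job one.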

Re: [gridengine users] frequent errors from the GRID messages

2015-05-29 Thread Feng Zhang
Is there any way to list all the files of the failed job in /tmp: ls -l /tmp/ and ls -l /tmp/8319760.1.rhel6.q/ On Fri, May 29, 2015 at 1:49 AM, sudha.penme...@wipro.com wrote: Yes Hugh, Users have permissions for the directory drwxrwxrwt. 48 root root 163840 May 29 08:48 /tmp

Re: [gridengine users] Using multiple queues inherits s_rt h_rt

2015-05-29 Thread Feng Zhang
A question: for -q free64,bio, how does GE choose an available queue for a job? Will it sort them in alphabetical order? On Fri, May 29, 2015 at 8:12 AM, William Hay w@ucl.ac.uk wrote: On Thu, 28 May 2015 19:27:07 + Joseph Farran jfar...@uci.edu wrote: Hi all. I am not sure if

Re: [gridengine users] Monitoring slot usage

2015-07-30 Thread Feng Zhang
I have a similar issue too, especially when users run MPI+multithreaded jobs. Some multithreaded programs by default use all of the cores they find on a node. Now I have a script to scan the CPU and RAM usage on all nodes, and it warns me if it finds any overloaded nodes. Not sure SGE has
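The core check of such a scan could look roughly like this. It is only a sketch, not the script mentioned above: the function name, the tolerance value and the sample host data are all assumptions, and collecting the per-node load and core counts is left out.

```python
def overloaded_nodes(nodes, tolerance=1.1):
    """Return the sorted names of hosts whose load exceeds their core count.

    nodes     -- dict mapping host name to (load_average, n_cores)
    tolerance -- assumed slack factor; a load up to n_cores * tolerance is OK
    """
    return sorted(
        host
        for host, (load, cores) in nodes.items()
        if load > cores * tolerance
    )


# Made-up sample data: node02 runs far more threads than it has cores.
sample = {
    "node01": (12.3, 12),
    "node02": (47.9, 12),
    "node03": (0.4, 12),
}
for host in overloaded_nodes(sample):
    print(f"WARNING: {host} looks overloaded")
```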

Re: [gridengine users] Applying RQS limits across a set of hosts

2015-09-23 Thread Feng Zhang
Are the 2 groups of nodes defined as queues? If so, a limit on queues may work for you: limit queues XXX to slots=16 On Wed, Sep 23, 2015 at 10:27 AM, Simon Andrews wrote: > > > On 23/09/2015 15:20, "Jesse Becker" wrote: > >>On Wed, Sep

Re: [gridengine users] Applying RQS limits across a set of hosts

2015-09-23 Thread Feng Zhang
ndr...@babraham.ac.uk> wrote: > > > On 23/09/2015 15:48, "Feng Zhang" <prod.f...@gmail.com> wrote: > >>Are the 2 groups of nodes defined as queues? If so, limit on queues >>may work for you: >> >>limit queues XXX to slots=16 > > >

Re: [gridengine users] Defining slots in grid queue

2016-05-06 Thread Feng Zhang
I think setting "slots" on the node to 2 x number-of-cpu-cores will work. On Fri, May 6, 2016 at 8:56 AM, wrote: > Hi, > > > > I have only one host defined in a queue and want to allot 2 slots per core > instead of one slot per core. > > > > How do we need to

Re: [gridengine users] How to removed jobs stuck in DR state

2016-06-28 Thread Feng Zhang
qdel -f jobid with systemadmin account may work. On Tue, Jun 28, 2016 at 4:22 PM, Sean Smith wrote: > Hi All, > > I have some jobs stuck in our Queue that I don't know how to remove. > Originally I was not worried about these but I discovered later that they > are

Re: [gridengine users] users Digest, Vol 72, Issue 13

2016-12-13 Thread Feng Zhang
Have you checked the status of the queue instances? Sometimes, if a queue instance goes into error status, it cannot run jobs like this. qstat -F can list the status, and qmod -c (queue instance) can clear it. On Mon, Dec 12, 2016 at 12:35 AM, Coleman, Marcus [JRDUS Non-J] <

Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Feng Zhang
It seems SGE master did not get refreshed with new hostgroup. Maybe you can try: 1. restart SGE master or 2. change basic.q, "hostlist" to any node, like "compute-1-0.local", wait till it gets refreshed; then change it back to "@basichosts". On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer

Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Feng Zhang
Are there any running jobs on the queue instance of compute-2-4@basic.q? On Wed, Sep 6, 2017 at 11:33 AM, Michael Stauffer <mgsta...@gmail.com> wrote: > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.f...@gmail.com> wrote: >> >> It seems SGE master did not get refreshe

Re: [gridengine users] Debugging crash when running program through GridEngine

2018-05-04 Thread Feng Zhang
It may be caused by the environment variables. One simple fix may be adding something like: >source myscriptforenv.sh (bash) to your job script. On Fri, May 4, 2018 at 9:45 AM, Simon Andrews wrote: > I’ve got a strange problem on our cluster where some python

Re: [gridengine users] User job fails silently

2018-08-08 Thread Feng Zhang
I also did some tests. If you add extra signal handling to all these scripts (to catch any TERM signal from the OS), it can more or less solve the issue. On Wed, Aug 8, 2018 at 11:04 PM, Feng Zhang wrote: > > I am guessing it may be very similar to what I have met before. My
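A minimal sketch of that kind of signal handling for one layer of such a script chain, written in Python. This is an illustration under assumptions, not the handling that was actually tested: run_forwarding_signals is a made-up name, and it only logs TERM/INT and passes them on to the child it started.

```python
import signal
import subprocess
import sys


def run_forwarding_signals(cmd):
    """Run cmd as a child process and pass SIGTERM/SIGINT on to it.

    Returns the child's exit status (negative if it was killed by a signal).
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Log to stderr so the job's error file records why it stopped.
        print(f"received signal {signum}, forwarding to child", file=sys.stderr)
        child.send_signal(signum)

    # Install the handlers and remember the previous ones.
    watched = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in watched}
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were active before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Each script in the chain would wrap the next one this way, e.g. sys.exit(run_forwarding_signals(["./next_step.sh"])) with a hypothetical next_step.sh, so a termination leaves a trace in the job's stderr instead of the job disappearing silently.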

Re: [gridengine users] User job fails silently

2018-08-08 Thread Feng Zhang
I am guessing it may be very similar to what I have met before. My issue was: one user used a bash script as a batch job script, and in it, it calls another script (python), and this script then calls a third script (and maybe so on...). For this kind of job, if there's anything wrong, it can

Re: [gridengine users] Instalation issue on version 8.1.10-1

2018-10-17 Thread Feng Zhang
Maybe you can check the install script, like inst_sge, to see if there's any typo. Or some script files may have special characters, from like Windows, Linux, etc.? Best, Feng On Wed, Oct 17, 2018 at 12:43 PM Jerome wrote: > Dear all > > I've trying to install a fresh version

Re: [gridengine users] Effects of login_shells settings

2018-10-31 Thread Feng Zhang
Adding an extra command line to the job script: unset module may be helpful. Best, Feng On Wed, Oct 31, 2018 at 2:48 PM Ilya M <4ilya.m+g...@gmail.com> wrote: > > Hello, > > I had an unexpected effect after adding bash to the login_shells list in SGE > 6.2.u5. Some array tasks started

Re: [gridengine users] Processes not exiting

2018-11-13 Thread Feng Zhang
probably it is the Maker which does not have proper handling of signals? Maybe you can try to use a script to run the job, rather than run binary directly, to see if it can work. Also you can add some signal handling commands in your script to check... Best, Feng On Tue, Nov 13, 2018 at 7:07

Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Feng Zhang
You can try to write the script to first scan all the files to get their full path names and then run the Array jobs. > On Jun 13, 2019, at 1:20 PM, VG wrote: > > HI Joshua, > I like the array job option because essentially it will still be 1 job and it > will run them in parallel. > > I
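A sketch of the two steps suggested above: scan the files once to get their full paths, then let each array task pick its own file. The helper names are made up for illustration; SGE_TASK_ID is the environment variable SGE sets for array tasks, and it is 1-based.

```python
import os


def list_input_files(root):
    """Walk root once and return the sorted absolute paths of all files."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def file_for_task(paths, task_id):
    """Return the file handled by array task task_id (1-based)."""
    if not 1 <= task_id <= len(paths):
        raise IndexError(f"task id {task_id} outside 1..{len(paths)}")
    return paths[task_id - 1]
```

The list would typically be written to a file before submission, the job submitted with qsub -t 1-N where N is the number of paths, and each task would call file_for_task(paths, int(os.environ["SGE_TASK_ID"])).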

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Feng Zhang
looks like your job used a lot of ram: mem 7.463TBs io 70.435GB iow 0.000s maxvmem 532.004MB Do you have CGROUP to limit resource of jobs? Best, Feng On Tue, May 14, 2019 at 9:53 AM hiller wrote: > > ~> qconf -srqs > No resource quota set found > > 'dmesg -T'

Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Feng Zhang
Is "threads" added to all.q? Also, you can check "qconf -srqs" to see if there's any limit. On Thu, Jun 11, 2020 at 2:33 PM Chris Dagdigian wrote: > > Hi folks, > > Got a bewildering situation I've never seen before with simple SMP/threaded > PE techniques > > I made a brand new PE called threaded: > > $