Hi,
Is there any plan to implement GPU resource management in SGE in
the near future, like Slurm or Torque have? There are some ways to do this
using scripts/programs, but I wonder whether SGE itself can
recognize and manage GPUs (and Phi). It doesn't need to be complicated or
powerful, just do
working on it.
Ian
On Mon, Apr 14, 2014 at 10:35 AM, Feng Zhang prod.f...@gmail.com wrote:
Hi,
Is there any plan to implement GPU resource management in SGE in
the near future, like Slurm or Torque have? There are some ways to do this
using scripts/programs, but I wonder whether SGE itself
, but to my
understanding it is not efficient: since each node has, for example,
4 queues, if one user submits a PE job to a queue, he/she cannot use
the other GPUs on the other queues?
On Mon, Apr 14, 2014 at 2:16 PM, Reuti re...@staff.uni-marburg.de wrote:
On 14.04.2014 at 20:06, Feng Zhang wrote:
Again, look into using it as a consumable resource as Gowtham posted above.
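For reference, a consumable along those lines can be set up roughly like this (only a sketch: the complex name "gpu", the host name and the GPU count are examples, adjust to your site):
# 1) add a line to the complex list (qconf -mc):
#    name  shortcut  type  relop  requestable  consumable  default  urgency
gpu gpu INT <= YES YES 0 0
# 2) tell each exec host how many GPUs it offers (qconf -me node01):
complex_values gpu=4
# 3) request a GPU at submit time, and SGE does the counting:
qsub -l gpu=1 myjob.sh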
Ian
On Mon, Apr 14, 2014 at 11:57 AM, Feng Zhang prod.f...@gmail.com wrote:
Thanks, Reuti,
The socket solution looks like it only works well for serial jobs, not PE
jobs, right?
Our cluster has different nodes, some nodes
On Mon, Apr 14, 2014 at 5:36 PM, Reuti re...@staff.uni-marburg.de wrote:
On 14.04.2014 at 20:57, Feng Zhang wrote:
Thanks, Reuti,
The socket solution looks like it only works well for serial jobs, not PE
jobs, right?
You mean using more than one GPU at a time, or using parallel processes
For Red Hat Linux, you may try putting the script into /etc/profile.d, for
the bash environment.
On Tuesday, April 29, 2014, Yago Fernández Pinilla yago...@gmail.com
wrote:
Hi all,
Is it possible to execute a script in Grid Engine as a startup script on
every one of the nodes with tight integration?
The
Hi,
I am now running large disk I/O jobs (sequential jobs, NOT PE jobs) on my
cluster. Is there any way I can submit my jobs evenly to all the
nodes? The problem I have now is:
I have 10 jobs, I have 10 nodes, and each node has 10 CPU cores.
When I submit my jobs, all 10 jobs are
the queues. For
example, I want jobs in queue FILLUP to fill up a node as much as possible,
while jobs in queue SERIAL fill nodes in a round-robin way?
Thanks,
Feng
On Fri, May 23, 2014 at 5:47 PM, Reuti re...@staff.uni-marburg.de wrote:
Please keep the list posted.
On 23.05.2014 at 23:37, Feng Zhang wrote:
On Wed, May 28, 2014 at 11:13 AM, Feng Zhang prod.f...@gmail.com wrote:
The method on Stephan's blog works fine.
Works fine: I mean it can distribute jobs evenly across the cluster,
but not in a round-robin way.
load_formula -slots
schedule_interval 00:00:10
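For the archive, those two settings live in the scheduler configuration (qconf -msconf); as far as I understand it, the relevant block looks roughly like this (a sketch, not a verified recipe):
schedule_interval    00:00:10
queue_sort_method    load
load_formula         -slots
# with load_formula set to -slots, hosts with more free slots sort first,
# so new sequential jobs get spread across the nodes rather than stacked
# onto one node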
Guys,
Just curious, how does h_vmem work on the processes of MPI (or OpenMP,
multi-threaded) jobs? I have some parallel jobs; the top command
shows a VET of 40GB, while the RES is only 100MB.
On Mon, Jun 30, 2014 at 3:01 PM, Michael Stauffer mgsta...@gmail.com wrote:
Sorry a typo. The VET should be VIRT.
On Mon, Jun 30, 2014 at 4:47 PM, Feng Zhang prod.f...@gmail.com wrote:
Guys,
Just curious, how does h_vmem work on the processes of MPI (or OpenMP,
multi-threaded) jobs? I have some parallel jobs; the top command
shows a VET of 40GB, while the RES is only
Bright sets the spool to be local on each node, while the config and
executables are on NFS if you have an HA configuration on your head servers.
I think in theory, if the active head fails, you can bring it offline
and make the passive head active manually, and your jobs will not be
lost.
From the error
SGE has no information about GPUs. Defining a consumable such as ngpus is a
way to do that, but SGE still does not know which GPU is assigned to
which job (or process).
What I did was set up a script to assign available GPU id(s) to a
job (or MPI process), like an SGE load sensor, but put it in
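As a rough illustration of the idea only (not the exact script; the GPU count, the lock directory, and the assumption that the job script sources this snippet are all placeholders):
#!/bin/bash
# grab the first free GPU by creating a per-GPU lock directory (mkdir is
# atomic), remember which job owns it, and expose only that device
NGPUS=4
LOCKDIR=/tmp/gpu-locks
mkdir -p "$LOCKDIR"
for id in $(seq 0 $((NGPUS - 1))); do
    if mkdir "$LOCKDIR/gpu$id" 2>/dev/null; then
        echo "$JOB_ID" > "$LOCKDIR/gpu$id/owner"
        export CUDA_VISIBLE_DEVICES=$id
        break
    fi
done
# the end of the job script (or an epilog) removes the lock directory
# again to release the GPU for the next job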
Ilya,
Can you please run:
qstat -j jobid
and paste the output here? It may be useful for checking the problem.
On Fri, Jan 23, 2015 at 12:08 PM, Ilya M 4ilya.m+g...@gmail.com wrote:
Removed the quota limits. To no avail: same problems.
in queue gpu.q@gpu006 because job requests unknown resource (mem_free)
...
Ilya.
Original Message
Subject: Re: [gridengine users] Cannot request resource if it is a load
value of memory type: SGE reports it as unknown resource
From: Feng Zhang prod.f...@gmail.com
To: Ilya
On 26.01.2015 at 17:15, Feng Zhang prod.f...@gmail.com wrote:
I just found a strange behavior of SGE 2011.
One user's job generated over 1 million small files on the local
disk ($TMPDIR).
Hence in the local scratch directory provided by SGE?
It looks like it makes the execd very busy and from
On Wed, Feb 25, 2015 at 6:44 AM, Simon Andrews
simon.andr...@babraham.ac.uk wrote:
From: Mishkin Derakhshan mishkin...@gmail.com
Date: Tuesday, 24 February 2015 00:07
To: users@gridengine.org users@gridengine.org
Subject: [gridengine users] How to set up h_vmem as a consumable
resource
What is $num_proc? Did you try setting a real number, like:
limit hosts {*} to slots=12?
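For comparison, a complete resource quota set with a fixed number would look something like this (a sketch; the rule name and the limit of 12 are made up):
# qconf -mrqs
{
   name         max_slots_per_host
   description  cap the slots used on any single host
   enabled      TRUE
   limit        hosts {*} to slots=12
}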
On Tue, Apr 14, 2015 at 3:32 PM, John Young j.e.yo...@larc.nasa.gov wrote:
Hello,
We (fairly) recently upgraded our cluster to Rocks 6.1.1
and we now seem to be having problems with RQS. On our old
Hi Simon,
As you defined h_vmem as JOB, according to the manual:
A consumable defined by 'y' is a per-slot consumable, which
means the limit is multiplied by the number of slots being
used by the job before being applied. In case of 'j' the
consumable is a per-job
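To spell out the per-slot vs. per-job difference with a made-up request (the PE name and numbers are only an example):
# complex line (qconf -mc); the consumable column decides the behavior:
#   h_vmem  h_vmem  MEMORY  <=  YES  YES  0  0    <- 'YES': per slot
#   h_vmem  h_vmem  MEMORY  <=  YES  JOB  0  0    <- 'JOB': per job
#
# with the per-slot form:
#   qsub -pe smp 4 -l h_vmem=2G job.sh   -> 4 x 2G = 8G debited on the host
# with the per-job form:
#   qsub -pe smp 4 -l h_vmem=2G job.sh   -> 2G debited, charged to the
#                                           master queue only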
Is there any way to list all the files of the failed job in /tmp:
ls -l /tmp/
and
ls -l /tmp/8319760.1.rhel6.q/
On Fri, May 29, 2015 at 1:49 AM, sudha.penme...@wipro.com wrote:
Yes Hugh, Users have permissions for the directory
drwxrwxrwt. 48 root root 163840 May 29 08:48 /tmp
A question: for -q free64,bio, what does GE do to choose an available
queue for a job? Will it sort them in alphabetical order?
On Fri, May 29, 2015 at 8:12 AM, William Hay w@ucl.ac.uk wrote:
On Thu, 28 May 2015 19:27:07 +
Joseph Farran jfar...@uci.edu wrote:
Hi all.
I am not sure if
I have a similar issue too, especially when users run MPI + multithreaded
jobs. Some multithreaded programs by default use all of the cores they
find on a node.
Now I have a script that scans the CPU and RAM usage on all nodes, and
it warns me if it finds any overloaded nodes.
Not sure SGE has
Are the 2 groups of nodes defined as queues? If so, a limit on queues
may work for you:
limit queues XXX to slots=16
On Wed, Sep 23, 2015 at 10:27 AM, Simon Andrews
wrote:
I think that setting "slots" on the node to be 2 x the number of CPU cores will work.
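For example (the host and queue names are placeholders, and please double-check the qconf syntax on your version): on a 12-core host you would give the queue 24 slots for that host:
# interactively: qconf -mq all.q, then set e.g.
#   slots  1,[node01=24]
# or in one shot:
qconf -aattr queue slots "[node01=24]" all.q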
On Fri, May 6, 2016 at 8:56 AM, wrote:
> Hi,
>
> I have only one host defined in a queue and want to allot 2 slots per core
> instead of one slot per core.
>
> How do we need to
qdel -f jobid
with the system admin account may work.
On Tue, Jun 28, 2016 at 4:22 PM, Sean Smith wrote:
> Hi All,
>
> I have some jobs stuck in our Queue that I don't know how to remove.
> Originally I was not worried about these but I discovered later that they
> are
Have you checked the status of the queue instances? Sometimes if a queue
instance goes into an error state, it cannot run jobs like this.
qstat -F
can list the status, and qmod -c (queue instance) can clear it.
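For example (the queue instance name is made up):
qstat -f -explain E        # list queue instances and explain any error state
qmod -c all.q@node01       # clear the error state of that instance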
On Mon, Dec 12, 2016 at 12:35 AM, Coleman, Marcus [JRDUS Non-J] <
It seems the SGE master did not get refreshed with the new hostgroup. Maybe you can try:
1. restart the SGE master
or
2. change the basic.q "hostlist" to any node, like "compute-1-0.local",
wait till it gets refreshed, then change it back to "@basichosts" (example commands below).
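For option 2, the non-interactive commands would be roughly this (please double-check the syntax on your version):
qconf -mattr queue hostlist "compute-1-0.local" basic.q
# wait until the qmaster has picked it up, then restore the host group:
qconf -mattr queue hostlist "@basichosts" basic.q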
On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer
Are there any running jobs on the queue instance compute-2-4@basic.q?
On Wed, Sep 6, 2017 at 11:33 AM, Michael Stauffer <mgsta...@gmail.com> wrote:
> On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.f...@gmail.com> wrote:
>>
>> It seems SGE master did not get refreshe
It may be caused by the environment variables.
One simple way may be adding something like:
>source myscriptforenv.sh (bash)
in your job script.
On Fri, May 4, 2018 at 9:45 AM, Simon Andrews
wrote:
> I’ve got a strange problem on our cluster where some python
I also did some tests. If you provide extra signal handling in
all these scripts (to catch any TERM signal from the OS), it can kind of solve
the issue.
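A minimal sketch of what I mean by that (the nested command is a placeholder):
#!/bin/bash
# run the real work as a child and handle SIGTERM ourselves, so the nested
# scripts get a chance to clean up instead of dying silently
on_term() {
    echo "caught SIGTERM, cleaning up" >&2
    [ -n "$child" ] && kill -TERM "$child" 2>/dev/null
    wait "$child"
    exit 143                      # 128 + 15 (SIGTERM)
}
trap on_term TERM
python inner_script.py &          # placeholder for the nested script
child=$!
wait "$child"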
On Wed, Aug 8, 2018 at 11:04 PM, Feng Zhang wrote:
>
> I am guessing it may be very similar to what I have met before. My
I am guessing it may be very similar to what I have encountered before. My issue
was: one user used a bash script as the batch job script, and in it, it calls
another script (Python), and this script then calls a third script (and maybe
so on...). For this kind of job, if there's anything wrong, it can
Maybe you can check the install script, like inst_sge, to see if there's
any typo.
Or some script files may have special characters, e.g. from Windows vs.
Linux line endings?
Best,
Feng
On Wed, Oct 17, 2018 at 12:43 PM Jerome wrote:
> Dear all
>
> I've been trying to install a fresh version
Adding an extra command line in the job script:
unset module
may be helpful.
Best,
Feng
On Wed, Oct 31, 2018 at 2:48 PM Ilya M <4ilya.m+g...@gmail.com> wrote:
>
> Hello,
>
> I had an unexpected effect after adding bash to the login_shells list in SGE
> 6.2.u5. Some array tasks started
Probably it is Maker which does not have proper handling of signals?
Maybe you can try to use a script to run the job, rather than running the
binary directly, to see if it works. Also you can add some signal
handling commands in your script to check...
Best,
Feng
On Tue, Nov 13, 2018 at 7:07
You can try writing the script to first scan all the files to get their full
path names and then run the array job.
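Something along these lines (a sketch; the path, the task range and the "process" command are placeholders). First build the list once, e.g. find /path/to/data -type f > filelist.txt, then submit a single array job whose tasks each pick their own line:
#!/bin/bash
#$ -t 1-100                       # match the number of lines in filelist.txt
# every task reads the line matching its own task id and works on that file
FILE=$(sed -n "${SGE_TASK_ID}p" filelist.txt)
./process "$FILE"                 # placeholder for the real command
Submitted with a plain qsub, it still shows up as one job, just with many tasks.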
> On Jun 13, 2019, at 1:20 PM, VG wrote:
>
> HI Joshua,
> I like the array job option because essentially it will still be 1 job and it
> will run them in parallel.
>
> I
Looks like your job used a lot of RAM:
mem 7.463TBs
io 70.435GB
iow 0.000s
maxvmem 532.004MB
Do you have cgroups set up to limit the resources of jobs?
Best,
Feng
On Tue, May 14, 2019 at 9:53 AM hiller wrote:
>
> ~> qconf -srqs
> No resource quota set found
>
> 'dmesg -T'
Is "threads" added into all.q?
Also can check "qconf -srqs" is there's any limit
On Thu, Jun 11, 2020 at 2:33 PM Chris Dagdigian wrote:
>
> Hi folks,
>
> Got a bewildering situation I've never seen before with simple SMP/threaded
> PE techniques
>
> I made a brand new PE called threaded:
>
> $