Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti
ine all the time. Only after my suggestion to add h_vmem on an exechost level to avoid oversubscription all the jobs crashed then, due to no memory being available (as h_vmem = 0 was used this way as an automatically set limit). Essentially: the default value in a complex definition is ign

Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti
Hi, Any consumables in place like memory or other resource requests? Any output of `qalter -w v …` or "-w p"? -- Reuti > Am 11.06.2020 um 20:32 schrieb Chris Dagdigian : > > Hi folks, > > Got a bewildering situation I've never seen before with simple SMP/threade

Re: [gridengine users] How to export an X11 back to the client?

2020-05-12 Thread Reuti
SH itself it is possible with the "match" option in "sshd_config" to allow only certain users from certain nodes. Nevertheless: maybe adding "-v" to the `ssh` command will output additional info, also the messages of `sshd` might be in some log file. -- Reuti > Feedb

Re: [gridengine users] About cpu time.

2020-05-07 Thread Reuti
Hi, It might be, that the application is ignoring the set OMP_NUM_THREADS (or assumes a max value if unset) and using all cores in a machine. How many cores are installed? -- Reuti Am 07.05.2020 um 01:04 schrieb Jerome IBt: > Dear all > > I'm facing a strange problem with some

Re: [gridengine users] How to export an X11 back to the client?

2020-05-02 Thread Reuti
> Am 02.05.2020 um 00:15 schrieb Mun Johl : > > Hi Reuti, > > Thank you for your reply. > Please see my comments below. > >> Hi, >> >> Am 01.05.2020 um 20:44 schrieb Mun Johl: >> >>> Hi, >>> >>> I am using SGE

Re: [gridengine users] How to export an X11 back to the client?

2020-05-01 Thread Reuti
he login via SSH. Hence I'm not sure what you are looking for to be set. Maybe you want to define in SGE to always use SSH -X? https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html -- Reuti > > Please advise. > > Thank you and regards, > > -- > Mun > ___

Re: [gridengine users] Job in error states

2020-03-07 Thread Reuti
Hi, is it always failing on one and the same node? Or are several nodes affected? One guess could be that the file system is full. -- Reuti > Am 05.03.2020 um 18:46 schrieb Jerome : > > Dear all > > I'm facing a strange error in SGE. One job is declared as in err

Re: [gridengine users] slots equals cores

2020-01-31 Thread Reuti
> Am 31.01.2020 um 18:23 schrieb Jerome IBt : > > Le 31/01/2020 à 10:19, Reuti a écrit : >> Hi Jérôme, >> >> Personally I would prefer to keep the output of `qquota` short and use it >> only for users's limits. I.e. defining the slot limit on an exechost

Re: [gridengine users] slots equals cores

2020-01-31 Thread Reuti
experience is, that sometimes RQS are screwed up especially if used in combination with some load values (although $num_proc is of course fixed in your case). -- Reuti > Am 31.01.2020 um 17:00 schrieb Jerome : > > Dear all > > I'm facing a new problem on my cluster with SG

Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-30 Thread Reuti
Hi, I never used SGE OGS/GE 2011.11p1, and for other derivatives it seems to work as intended. Is there any output in the messages file of the executing host where it mentions trying to kill the process due to an exhausted wallclock time? -- Reuti > Am 28.01.2020 um 03:50 schrieb Derrick

Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-24 Thread Reuti
longer than 48 hours. Are you directing these commands to SSH? -- Reuti > I am wondering if this is a known issue? > > I am running open source version of SGE OGS/GE 2011.11p1 > > Cheers, > Derrick > ___ > users mailing list &g

Re: [gridengine users] qsub -V doesn't set $PATH

2020-01-22 Thread Reuti
n the value of PATH. Another option could be an "adjustment" of the PATH variable by a JSV. -- Reuti > >> and epilog scripts run with the submission environment but possibly in the >> context of a different user (i.e. a user could point a root-running prolog >> s

Re: [gridengine users] finding out what jobs are using which PE

2020-01-22 Thread Reuti
> Am 22.01.2020 um 15:14 schrieb WALLIS Michael : > > > From: Reuti > >>> (for the record, if the number of used_slots is higher than the number >>> of slots, no jobs using that PE will run. Don't know how that's even >>> possible.) > >>

Re: [gridengine users] finding out what jobs are using which PE

2020-01-22 Thread Reuti
using that PE will run. Don't know how that's even possible.) You mean the setting of "slots" in the definition of a particular PE? -- Reuti > Cheers, > > Mike > > -- > Mike Wallis x503305 > University of Edinburgh, Research Services, > Argyle House, 3 Lady L

Re: [gridengine users] CPU and Mem usage for interactive jobs

2019-12-09 Thread Reuti
There are patches around to attach the additional group id to the ssh daemon: https://arc.liv.ac.uk/SGE/htmlman/htmlman8/pam_sge-qrsh-setup.html rlogin is used for an interactive login by `qrsh`, rsh for `qrsh` with a command. -- Reuti > Am 09.12.2019 um 18:39 schrieb Korzennik, Sylv

Re: [gridengine users] CPU and Mem usage for interactive jobs

2019-12-09 Thread Reuti
we need something different > in the GE configuration to enable this? Are you using the "builtin" method for the startup or SSH, i.e. for the settings in rsh_daemon resp. rsh_client? -- Reuti > Cheers, > Sylvain > -- > __

Re: [gridengine users] qsh not working

2019-11-19 Thread Reuti
mmunication of the daemons/clients for `qrsh` … in SGE has no support for X11 forwarding. Hence the approach by `qrsh xterm` should give a suitable result when set up to use "/usr/bin/ssh -X -Y". -- Reuti > Your job 2657108 ("INTERACTIVE") has been submitted > wai

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-28 Thread Reuti
ramdisk@node19 common@node27 common@node23 common@node23 common@node28 -- Reuti > > Regards, > > -- > Mun > > > From: Mun Johl > Sent: Friday, October 25, 2019 5:42 PM > To: dpo...@gmail.com > Cc: Skylar Thompson ;

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Reuti
rnal name changes, while the internal ones stay the same? -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] jobs stuck in transitioning state

2019-09-27 Thread Reuti
XECD_PORT, ssh, scp, something else)? It uses its own protocol. No SSH inside the cluster is necessary. > What are the system-level requirements for successfully sending the > submit scripts (for example: same UID for sge across the cluster, same > UID<->username for the user submitting the job across the cluster, etc)? Yes. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] preventing certain jobs from being suspended (subordinated)

2019-09-05 Thread Reuti
out this again - does the subordinate queue > setting accept 'queue@@hostgroup' syntax like everything else? Don't > remember if I ever tried that. Yes, one can limit it to be available on certain machines only: subordinate_list NONE,[@intel2667v4=short] -- Reuti > Tina > &

Re: [gridengine users] preventing certain jobs from being suspended (subordinated)

2019-09-04 Thread Reuti
ond correctly to SIGSTOP, but the GPU portion keeps > running). > > Is there any way, with our current number of queues, to exempt jobs > using a GPU resource complex (-l gpu) from being suspended by short jobs? Not that I'm aware of. Almost 10 years ago I had a sim

Re: [gridengine users] limit CPU/slot resource to the number of reserved slots

2019-08-27 Thread Reuti
le: the number of slots will be corrected in the copy of the input file to the number of granted slots. Let me know if you would like to get them. -- Reuti > I was told the this should be possible in slurm (which we don't have, > and to which w

Re: [gridengine users] Sorting qhost and choosing qstat columns

2019-08-01 Thread Reuti
minating the "jclass" column, which doesn't > contain any information, but I can only find ways to add columns, not take > them away. Is there a way to make this column go away? Besides `cut -b`: what type of output are you looking for? There is a `qstatus` AWK scri

Re: [gridengine users] Automatically creating home directories before job execution

2019-07-30 Thread Reuti
Hi Ilya, Am 31.07.2019 um 00:55 schrieb Ilya M: > Hi Reuti, > > So /home is not mounted via NFS as it's usually done? > Correct. > > > How exactly is your setup? I mean: you want to create some kind of pseudo > home directory on the nodes (hence "-b y"

Re: [gridengine users] Automatically creating home directories before job execution

2019-07-30 Thread Reuti
getting > meaningful debug output. How exactly is your setup? I mean: you want to create some kind of pseudo home directory on the nodes (hence "-b y" can't be used with user binaries) and the staged job script (by SGE) will then execute the job

Re: [gridengine users] Adding a requirement to limit to certain execd hosts

2019-07-26 Thread Reuti
attach it to the exechosts, although attaching them to a queue might be shorter and a central location for all definitions. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Having issues getting sun grid engine running on new frontend

2019-07-25 Thread Reuti
etree" for > reading: No such file or directory Did you transfer the old configuration or does this pop up in a fresh installed system? Unfortunately the procedure might be changed by the ROCKS distribution compared to the

Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-16 Thread Reuti
mon got. The changes of the "nofile" setting should be visible in the shell when you log in too. -- Reuti > Currently, my only workaround is to rebuild the Compute Node (reinstall OS > etc) so that it corrects this issue. > > >> Can you check the limits that are set i

Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-03 Thread Reuti
ettings but the rest are fine. Several ulimits can be set in the queue configuration, and can thus differ for each queue or exechost. -- Reuti > I am wondering if this is SGE related? Any idea is welcomed. > > Cheers, > Derrick > ___ >

Re: [gridengine users] h_vmem / m_mem_free

2019-06-27 Thread Reuti
that your job and the sum of all of its processes passed this limit and kill the job. SGE can do this, by using the additional group ID which is attached to all processes by a particular job, while the kernel might only watch a certain process. Using now an assigned cgroup, even the kernel ca

Re: [gridengine users] parallel vs single job

2019-06-24 Thread Reuti
t wait until all running jobs without this constraint were drained. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] qmon finished jobs

2019-06-18 Thread Reuti
"Finished Jobs". But it won't retrieve jobs which were already drained from the listing. Also after a restart of the `qmaster`, the list will initially be empty. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the kill sent by SGE happens after a task has already returned with an error. SGE would in this case use the kill signal to be sure to kill all child processes. Hence the question would be: what was the initial command in the job script, and what output/error did it generate? -- Reuti

Re: [gridengine users] I need a decoder ring for the qacct output

2019-04-25 Thread Reuti
> Am 25.04.2019 um 17:41 schrieb Mun Johl : > > Hi Skyler, Reuti, > > Thank you for your reply. > Please see my comments below. > > On Thu, Apr 25, 2019 at 08:03 AM PDT, Reuti wrote: >> Hi, >> >>> Am 25.04.2019 um 16:53 schrieb Mun Joh

Re: [gridengine users] I need a decoder ring for the qacct output

2019-04-25 Thread Reuti
rched the man pages and web for definitions of the output of > qacct, but I have not been able to find a complete reference (just bits > and pieces here and there). > > Can anyone point me to a complete reference so that I can better > understand the output of qacct? There is a man page about

Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-09 Thread Reuti
Am 09.04.2019 um 21:08 schrieb Mun Johl: > Hi Reuti, > > One clarification question below ... > > On Tue, Apr 09, 2019 at 09:05 AM PDT, Reuti wrote: >>> Am 09.04.2019 um 17:43 schrieb Mun Johl : >>> >>> Hi Reuti, >>> >>>

Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-09 Thread Reuti
> Am 09.04.2019 um 17:43 schrieb Mun Johl : > > Hi Reuti, > > Thank you for your reply! > Please see my comments below. > > On Mon, Apr 08, 2019 at 10:27 PM PDT, Reuti wrote: >> Hi, >> >>> Am 09.04.2019 um 05:37 schrieb Mun Johl : >>>

Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-08 Thread Reuti
he contractor only need an account on serverA in > order to utilize SGE? Or would he need an account on the grid master as > well? Are you not using a central user administration by NIS or LDAP? AFAICS he needs an entry only on the execution host (and on the submission host of course). -- Reut

Re: [gridengine users] Limiting users' access to nodes

2019-04-08 Thread Reuti
NONE,[@special_hosts=allowed_users] and the white listed users have to be in the allowed_users ACL and the @special_hosts are the machines where ordinary users are banned from. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users
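Assembled as a queue attribute, the restriction sketched above might read as follows (a sketch only; the ACL name allowed_users and hostgroup @special_hosts are taken from the snippet, and the attribute is assumed to be user_lists):

```
user_lists   NONE,[@special_hosts=allowed_users]
```

Per queue_conf(5) semantics, the default NONE leaves access unrestricted, while on the hosts in @special_hosts only members of the allowed_users ACL may run jobs.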

Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-12 Thread Reuti
> Am 12.03.2019 um 15:55 schrieb David Trimboli : > > > On 3/5/2019 12:34 PM, David Trimboli wrote: >> >> On 3/5/2019 12:18 PM, Reuti wrote: >>>> Am 05.03.2019 um 18:06 schrieb David Trimboli >>>> : >>>> >>>> I

[gridengine users] A Virtual GridEngine Cluster in a cluster

2019-03-08 Thread Reuti
ave this problem on a local scratch directory for $TMPDIR though. === BTW: did I mention it: no need to be root anywhere. -- Reuti multi-spawn.sh Description: Binary data __SGE_PLANCHET__.tgz Description: Binary data cluster.tgz Description: Binary data __

Re: [gridengine users] Priority?

2019-03-06 Thread Reuti
oesn't seem to change the order in > which they run. You mean the value you set with "-p"? To which value did you change this for certain jobs? $ qstat -pri might give a hint about the overall values which are assigned to a job in column "npprior". -- Reuti > > Can any

Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-05 Thread Reuti
.q} wouldn't hurt as it means "for each entry in the list", and the only entry is all.q. But to lower the impact I would leave this out. -- Reuti > } > > I get the feeling that will limit the number of slots that all users can > collectively use simultaneously to 10. I w

Re: [gridengine users] Fair share policy

2019-02-27 Thread Reuti
Hi, > Am 27.02.2019 um 22:07 schrieb Kandalaft, Iyad (AAFC/AAC) > : > > HI Reuti > > I'm implementing only a share-tree. Then you can set: policy_hierarchy S The past usage is stored in the user object, hence auto_user_delete_time should be zer

Re: [gridengine users] Fair share policy

2019-02-27 Thread Reuti
Hi, there is a man page "man sge_priority". Which policy do you intend to use: share-tree (honors past usage) or functional (current use), or both? -- Reuti > Am 25.02.2019 um 15:03 schrieb Kandalaft, Iyad (AAFC/AAC) > : > > Hi all, > > I recently implement

Re: [gridengine users] Accessing qacct accounting file from login/compute nodes

2019-02-19 Thread Reuti
ing info via qacct. I am wondering > what is the common way to achieve this without giving access to the qmaster > node? You mean, $SGE_ROOT is not shared in your cluster? -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] starting a new gridengine accounting file

2019-01-29 Thread Reuti
f <(zcat accounting.0.gz) -- Reuti > -- > JY > -- > "All ideas and opinions expressed in this communication are > those of the author alone and do not necessarily reflect the > ideas and opinions of anyone else." >

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Reuti
how to read it. Any helpful insight > much appreciated Did you try to stop and start the qmaster? -- Reuti > qping -i 5 -info hpc-s 6444 qmaster 1 > 01/26/2019 01:12:18: > SIRM version: 0.1 > SIRM message id: 1 > start time: 01/26/2

Re: [gridengine users] Installing man pages

2019-01-25 Thread Reuti
as no access (as the permission bits look fine). Essentially the exporting NFS machine will deny the access according to certain bits of the permission bits. Does the line from above contain a plus sign on the exporting machine like: -rwxr-xr-x+ 1 root root 1941408 Feb 28 2016 /opt/sge/bin/lx-amd64/qh

Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti
> Am 24.01.2019 um 20:29 schrieb David Trimboli : > > On 1/24/2019 2:05 PM, Reuti wrote: >> Do the permissions for the directories include the x flag and not only r? >> >> drwxr-xr-x 2 root root 4.0K Jan 13 2010 man1 >> drwxr-xr-x 2 root root 4.0K Jan 13 20

Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti
> Am 24.01.2019 um 19:28 schrieb David Trimboli : > > On 1/24/2019 1:14 PM, Reuti wrote: >>> Am 24.01.2019 um 19:10 schrieb David Trimboli : >>> >>> On 1/24/2019 12:44 PM, Reuti wrote: >>>> Hi, >>>> >>>>> Am 24.0

Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti
> Am 24.01.2019 um 19:10 schrieb David Trimboli : > > On 1/24/2019 12:44 PM, Reuti wrote: >> Hi, >> >>> Am 24.01.2019 um 18:28 schrieb David Trimboli : >>> >>> This is just a silly question. Using Son of Grid Engine 8.1.9, I installed >>

Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti
? How do I get it to do > so? What is the output of the command: manpath on both machines? -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-22 Thread Reuti
d memory listed by `ipcs` and it starts to swap? -- Reuti > > Regards, > > Derek > -Original Message- > From: Reuti > Sent: January 18, 2019 11:26 AM > To: Derek Stephenson > Cc: users@gridengine.org > Subject: Re: [gridengine users] Dilemma with e

Re: [gridengine users] Starting out

2019-01-21 Thread Reuti
> Am 18.01.2019 um 18:06 schrieb David Trimboli : > > On 1/18/2019 11:49 AM, Reuti wrote: >>> Am 18.01.2019 um 17:41 schrieb David Trimboli : >>> >>> On 1/18/2019 11:22 AM, Reuti wrote: >>>> Hi, >>>> >>>>> Am 1

Re: [gridengine users] Starting out

2019-01-18 Thread Reuti
> Am 18.01.2019 um 17:41 schrieb David Trimboli : > > On 1/18/2019 11:22 AM, Reuti wrote: >> Hi, >> >>> Am 18.01.2019 um 17:09 schrieb David Trimboli : >>> >>> Hi, all. I've got a twenty-four-node cluster running versions of CentOS 5 >>

Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-18 Thread Reuti
> Am 18.01.2019 um 16:26 schrieb Derek Stephenson > : > > Hi Reuti, > > I don't believe anyone has adjusted the scheduler from defaults but I see: > schedule_interval 00:00:04 > flush_submit_sec 1 > flush_finish_sec

Re: [gridengine users] Starting out

2019-01-18 Thread Reuti
ript inside SGE isn't prepared for your actual kernel, i.e. a case for 4.* kernels is missing. What does: $ $SGE_ROOT/util/arch return? -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-18 Thread Reuti
cket fd 7 For interactive jobs: any firewall in place, blocking the communication between the submission host and the exechost – maybe switched on at a later point in time? SGE will use a random port for the communication. After the reboot it worked instantly again? -- Reuti > Now I've seen a serie

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-11 Thread Reuti
> Am 11.01.2019 um 00:30 schrieb Derrick Lin : > > Hi Reuti > > Thanks for the input. But how does this help on troubleshooting the prolog > script? You asked for the meaning of the "-i" option, and I tried to outline its behavior. -- Reuti > I will also tr

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-10 Thread Reuti
> Am 09.01.2019 um 23:39 schrieb Derrick Lin : > > Hi Reuti and Iyad, > > Here is my prolog script, it just does one thing, setting quota on the XFS > volume for each job: > > The prolog_exec_xx_xx.log file was generated, so I assumed the first exec > comman

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-10 Thread Reuti
Hi, Am 09.01.2019 um 23:35 schrieb Derrick Lin: > Hi Reuti, > > I have to say I am still not familiar with the "-i" in qsub after reading the > man page, what does it do? It will be fed as stdin to the jobscript. Hence: $ qsub -i myfile foo.sh is like: $ foo.sh
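The `-i` behavior described in the reply can be simulated without SGE by plain stdin redirection. A minimal sketch (the names `foo.sh` and `myfile` are taken from the reply; their contents here are made up for the demo):

```shell
#!/bin/bash
# `qsub -i myfile foo.sh` feeds myfile as stdin to the job script,
# i.e. the job effectively runs as `foo.sh < myfile`. Simulated locally:
cat > foo.sh <<'EOF'
#!/bin/sh
read line
echo "job saw: $line"
EOF
chmod +x foo.sh
echo "hello from stdin" > myfile
./foo.sh < myfile    # prints: job saw: hello from stdin
```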

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Reuti
ed. Is there any statement in the prolog, which could wait for stdin – and in a batch job there is just no stdin, hence it continues? Could be tested with "-i" to a batch job. -- Reuti > qsub job are working fine. > > Any idea will be appreciated > > Cheers, > Derrick

Re: [gridengine users] batch array jobs are executed on interactive queue

2019-01-08 Thread Reuti
n interactive.q; maybe you can remove the PE smp there, unless you want to use it interactively too. -- Reuti > $ qconf -sp smp > pe_name smp > slots 999 > user_lists NONE > xuser_listsNONE > start_proc_argsNONE > stop_proc_

Re: [gridengine users] Fwd: Request: JOB_ID, QUEUE, etc. variables in a QLOGIN session

2018-12-12 Thread Reuti
ed the session: rsh, ssh or built-in. IIRC the last `if [ -n "$MYJOBID" ];` section had only the purpose to display a message, which was set with "-ac" during submission and might not be necessary here. -- Reuti MYPARENT=`ps -p $$ -o ppid --no-header` #MYPARENT=`ps -p $MYPARENT -o
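The process-tree walk from the quoted script can be reproduced on any Linux box; which command shows up as the parent depends on how the session was started (rsh, ssh, or SGE's builtin method):

```shell
#!/bin/bash
# Find the parent of the current shell, as the quoted QLOGIN helper does,
# to determine which daemon spawned the interactive session.
MYPARENT=$(ps -p $$ -o ppid= | tr -d ' ')
ps -p "$MYPARENT" -o comm=    # e.g. sshd, in.rlogind, or sge_shepherd
```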

Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-08 Thread Reuti
Am 07.12.2018 um 21:33 schrieb Derrick Lin: > Reuti, > > My further tests confirm that $TMP is set inside PROLOG, $TMPDIR is not. > > Both $TMPDIR and $TMP are set in job's environment. > > So technically my problem is solved by switching to $TMP. > &g

Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-07 Thread Reuti
ot receive TMPDIR which should be created by > the scheduler. Is $TMP set? -- Reuti > > Other variables such as JOB_ID, PE_HOSTFILE are available though. > > We have been using the same script on the CentOS6 cluster with OGS/GE > 2011.11p1 without an issue

Re: [gridengine users] $TMPDIR With MPI Jobs

2018-12-06 Thread Reuti
I found my entry about this: https://arc.liv.ac.uk/trac/SGE/ticket/570 -- Reuti > Am 06.12.2018 um 19:03 schrieb Reuti : > > Hi, > >> Am 06.12.2018 um 18:36 schrieb Dan Whitehouse : >> >> Hi, >> I've been running some MPI jobs and I expected that whe

Re: [gridengine users] $TMPDIR With MPI Jobs

2018-12-06 Thread Reuti
were supposed to run in a dedicated cleaner.q only with no limits regarding slots (hence they started as soon as they were eligible to start), but got a job hold on the actual job which submitted them to wait until it finished. -- Reuti > > -- > Dan Whitehouse > Research System

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti
slots limited: $ qconf -se global especially the line "complex_values". And next: any RQS? $ qconf -srqs -- Reuti > On Thu, Dec 6, 2018 at 12:55, Reuti () > wrote: > > Am 06.12.2018 um 15:19 schrieb Dimar Jaime González Soto > > : > >

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti
ooks fine. So we have other settings to investigate: $ qconf -sconf #global: execd_spool_dir /var/spool/sge ... max_aj_tasks 75000 Is max_aj_tasks limited in your setup? -- Reuti > > On Thu, Dec 6, 2018 at 11:13, Reuti () > wrote: >

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti
11:04:02 >1 17-60:1 Aha, so they are running already on remote nodes – fine. As the setting in the queue configuration is per host, this should work and provide more processes per node instead of four. Is there a setting for the exechosts: qconf -se ubuntu-node2

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti
or a running job. There should be a line for each executing task while the waiting ones are abbreviated in one line. -- Reuti > > I would try running qalter -w p against the job id to see what it says. > > William > > > >> >>> Am 05.12.2018 um 19:10 s

Re: [gridengine users] problem with concurrent jobs

2018-12-05 Thread Reuti
or the communication? The PE would deliver a hostlist to the application, which then can be used to start processes on other nodes too. Some MPI libraries even discover BTW: as you wrote "16 logical threads": often it's advisable for HPC to disable Hyperthr

Re: [gridengine users] Issue with permissions on new servers added to cluster

2018-12-03 Thread Reuti
with 744 permissions. Were the sge_execd on these new machines started by root or sgeadmin? $ ps -e f -o user,ruser,command | grep sge sgeadmin root /usr/sge/bin/lx24-em64t/sge_execd root root \_ /bin/sh /usr/sge/cluster/tmpspace.sh -- Reuti > > > The user directories,

Re: [gridengine users] Processes not exiting

2018-11-14 Thread Reuti
ob_is_first_task FALSE > urgency_slots min > accounting_summary FALSE > > Then we have run our application "Maker" like this, > qsub -cwd -N -b y -V -pe mpi /opt/mpich-install/bin/mpiexec > maker Which version of MPICH are you using? Maybe it's not tightly integrat

Re: [gridengine users] Email warning for s_rt ?

2018-10-23 Thread Reuti
Hi, > Am 23.10.2018 um 20:31 schrieb Dj Merrill : > > Hi Reuti, > Thank you for your response. I didn't describe our environment very > well, and I apologize. We only have one queue. We've had a few > instances of people forgetting they ran a job that doesn't app

Re: [gridengine users] Email warning for s_rt ?

2018-10-20 Thread Reuti
There is an introduction to use the checkpoint interface here: https://arc.liv.ac.uk/SGE/howto/checkpointing.html -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Change the format of sge delivered mail

2018-10-17 Thread Reuti
users can check the result of the computation even without login into the cluster. e) this mail-wrapper script needs again to be defined with: qconf -mconf mailer /usr/sge/cluster/mailer.sh -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Bring up execd nodes without explicit configuration

2018-09-15 Thread Reuti
. -- Reuti Sent from my iPhone > Am 15.09.2018 um 19:15 schrieb Simon Matthews : > > Is there any way to bring up an execd node, without explicitly > configuring it at the qmaster? Perhaps it could come up and be added > to a default queue? > > If it is possible to do

Re: [gridengine users] cpu usage calculation

2018-08-31 Thread Reuti
ombine processes (like for MPI) and threads (like for Open MP). In your case, it looks to me as if you assume that the necessary cores are available, independent of the actual usage of each node? -- Reuti PS: I assume with CPUS you refer to CORES. > I have written up an article at: >

Re: [gridengine users] Gridengine: error: commlib error: got select error (connection refused)

2018-08-27 Thread Reuti
etup file? > > Running on Ubuntu. Installed by `sudo apt-get install gridengine-master > gridengine-client` and accepted all defaults. I have no clue about the Ubuntu issue. But usually you have to run a setup beforehand twice - one for the master, one for the client. Do you have

Re: [gridengine users] User job fails silently

2018-08-08 Thread Reuti
se), the jobscript is first transferred by SGE's protocol to the node, where the execd writes the jobscript in the shared space, which is on the headnode again. If you peek into the given file, you will hence find the original jobscript of the user. Does the jobscript try to modify itself, and the

Re: [gridengine users] User job fails silently

2018-08-08 Thread Reuti
s and they ran and completed without a problem. Could it be a race condition with the shared file system? -- Reuti > I am wondering what may have caused this situation in general? > > Cheers, > Derrick > ___ > users mailing list > users

Re: [gridengine users] Start jobs on exec host in sequential order

2018-08-01 Thread Reuti
> Am 01.08.2018 um 03:06 schrieb Derrick Lin : > > HI Reuti, > > The prolog script is set to run by root indeed. The xfs quota requires root > privilege. > > I also tried the 2nd approach but it seems that the addgrpid file has not > been created when the prolog

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-30 Thread Reuti
> Am 30.07.2018 um 02:31 schrieb Derrick Lin : > > Hi Reuti, > > The approach sounds great. > > But the prolog script seems to be run by root, so this is what I got: > > XFS_PROJID:uid=0(root) gid=0(root) groups=0(root),396(sfcb) This is quite unusual. Do

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-27 Thread Reuti
> Am 28.07.2018 um 03:00 schrieb Derrick Lin : > > Thanks Reuti, > > I know little about group ID created by SGE, and also pretty much confused > with the Linux group ID. Yes, SGE assigns a conventional group ID to each job to track the CPU and memory consumpt

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-27 Thread Reuti
e. This can be found in the `id` command's output or in location of the spool directory for the execd_spool_dir in ${HOSTNAME}/active_jobs/${JOB_ID}.${TASK_ID}/addgrpid -- Reuti > That's why I am trying to implement the xfs_projid to be independent from SGE. > > > > On Thu, J

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-26 Thread Reuti
did for generating project ID: > > XFS_PROJID_CF="/tmp/xfs_projid_counter" > > echo $JOB_ID >> $XFS_PROJID_CF > xfs_projid=$(wc -l < $XFS_PROJID_CF) The xfs_projid is then the number of lines in the file? Why not using $JOB_ID directly? Is there a limit in max. project ID an
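The counter logic quoted above can be exercised stand-alone (the variable names come from the snippet, while the file path and job IDs below are illustrative); Reuti's follow-up question hints that $JOB_ID itself could serve as the project ID, making the counter file unnecessary:

```shell
#!/bin/bash
# Sketch of the quoted prolog fragment: each job appends its JOB_ID to a
# shared counter file and uses the resulting line count as the XFS
# project ID for that job.
XFS_PROJID_CF=$(mktemp)                   # stand-in for /tmp/xfs_projid_counter
for JOB_ID in 1001 1002 1003; do          # stand-ins for real SGE job IDs
    echo "$JOB_ID" >> "$XFS_PROJID_CF"
    xfs_projid=$(wc -l < "$XFS_PROJID_CF")   # line count == jobs seen so far
done
echo "$xfs_projid"    # 3 after three jobs
```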

Re: [gridengine users] running failed array jobs in UGE

2018-06-12 Thread Reuti
,56,98,134 failed to finish. > > What can I do to only run the failed job now? Can I use -t option in anyway > or do I have to submit it one by one? You have to submit it one by one, possibly in a `for` loop, but you can use -t to specify the index to be used, at least as a single number.
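A resubmission loop along the suggested lines could look like this (the index list is partial, as in the question, and `myjob.sh` is an illustrative script name; `echo` is placed in front of `qsub` so the sketch runs without a cluster):

```shell
#!/bin/bash
# Resubmit only the failed array-task indices, one `qsub -t` per index,
# since -t accepts a single task number.
FAILED="56 98 134"                  # known failed indices from the question
for i in $FAILED; do
    echo qsub -t "$i" myjob.sh      # drop the leading `echo` on a real cluster
done
```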

Re: [gridengine users] Automatic job rescheduling. Only one rescheduling is happening

2018-06-11 Thread Reuti
> Am 11.06.2018 um 18:43 schrieb Ilya M <4ilya.m+g...@gmail.com>: > > Hello, > > Thank you for the suggestion, Reuti. Not sure if my users' pipelines can deal > with multiple job ids, perhaps they will be willing to modify their code. Also other commands in SGE

Re: [gridengine users] Automatic job rescheduling. Only one rescheduling is happening

2018-06-11 Thread Reuti
to get all the runs listed though. -- Reuti > This is my test script: > > #!/bin/bash > > #$ -S /bin/bash > #$ -l s_rt=0:0:5,h_rt=0:0:10 > #$ -j y > > set -x > set -e > set -o pipefail > set -u > > trap "exit 99" SIGUSR1 > >

Re: [gridengine users] Scheduling maintenance and using advance reservation

2018-06-06 Thread Reuti
taining all GPU nodes: qconf -mattr queue calendar wartung common@@myGPUgroup -- Reuti > Ilya. > > > On Wed, Jun 6, 2018 at 2:41 AM, Mark Dixon wrote: > On Tue, 5 Jun 2018, Ilya M wrote: > ... > Is there a way to submit AR when there are projects attached to queues? I am &g

Re: [gridengine users] SGE accounting file getting too big...

2018-05-18 Thread Reuti
l /var/spool/sge The default in the script itself I set to: UNCONFIGURED=no ACTION_ON=2 ACTIONSIZE=1024 KEEPOLD=3 Note: to read old accounting files in `qacct` on-the-fly you can use: $ qacct -o reuti -f <(zcat /usr/sge/default/common/accounting.1.gz) -- Reuti > > Thank
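The process-substitution trick shown above works with any consumer that expects a file name. A self-contained sketch, with `grep` standing in for `qacct` and made-up file contents:

```shell
#!/bin/bash
# Read a rotated, gzip-compressed accounting file on the fly via process
# substitution, as in `qacct -f <(zcat accounting.1.gz)`.
printf 'owner=reuti jobnumber=1\nowner=reuti jobnumber=2\n' > accounting
gzip -f accounting                            # produces accounting.gz
grep -c 'owner=reuti' <(zcat accounting.gz)   # prints 2
```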

Re: [gridengine users] Debugging crash when running program through GridEngine

2018-05-04 Thread Reuti
B instead of being left as unlimited. You can compare the limits for the interactive access and a job submission by using: echo hard limits ulimit -aH echo soft limits ulimit -aS -- Reuti > > On Fri, May 04, 2018 at 01:45:24PM +, Simon Andrews wrote: >> I've got a strange pro
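The comparison suggested above can be scripted; running the same script once interactively and once via `qsub`, then diffing the two outputs, pinpoints which limits differ (a minimal sketch):

```shell
#!/bin/bash
# Print hard and soft resource limits. Run this interactively and submit
# it as an SGE job script, then compare the outputs to find limits that
# are lowered inside jobs.
echo "hard limits"
ulimit -aH
echo "soft limits"
ulimit -aS
```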

Re: [gridengine users] Is it possible to have load sensor for cluster-wide parameters?

2018-05-02 Thread Reuti
e recent discussion about the problem of using load values and RQS which could prevent scheduling. But the value could be used for any alarm or suspend threshold. -- Reuti ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Some strange interaction of PE and RQS

2018-04-22 Thread Reuti
Hi, Am 20.04.2018 um 22:55 schrieb Ilya M: > Hi Reuti, > > There are dozens on hosts in @gpu. In my test submissions, however, I am > using only one host that I specify with '-l hostname='. I disabled all other > queues on this host to make sure nothing else but my test j
