Hi,

qstat -j <jobid> didn't show the full error message; this is the full
error message:

11/22/2012 12:26:11|  main|camilla|E|shepherd of job 76.226 exited with
exit status = 27
11/22/2012 12:26:11|  main|camilla|E|can't open usage file
"active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11|  main|camilla|E|11/22/2012 12:26:10 [0:11412]:
execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76,
"/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file
or directory
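That execvlp failure means sge_execd tried to start its spooled copy of the
job script and the file wasn't there (e.g. the spool directory was cleaned
out or isn't writable). A quick way to check is to look for the script under
the execd spool. A minimal sketch (the helper name is mine; the spool path
is taken from the log above):

```shell
# Report whether an execd spool directory still holds the spooled job script.
# Usage: check_job_script SPOOL_DIR JOB_ID
check_job_script() {
    if [ -e "$1/job_scripts/$2" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# e.g. check_job_script /var/spool/gridengine/execd/camilla 76
```

If the script is missing while tasks are still queued, I'd check the spool
directory permissions and then clear the error state with qmod -cj <jobid>.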



2012/11/22 jan roels <[email protected]>

> Hi,
>
> Do you guys know what this error could be:
>
> error reason    2:          11/22/2012 11:12:25 [0:31220]:
> execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> error reason    3:          11/22/2012 11:12:25 [0:31221]:
> execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
>
> this goes on as long as it's running... and my job state went to:
>
>      69 0.50000 SA         root         Eqw   11/22/2012 09:12:05     1
> 1-500:1
>      69 0.00000 SA         root         qw    11/22/2012 09:12:05     1
> 501-4200:1
>
> This is the script I was running:
>
> #!/bin/bash
> #$-cwd
> #$-N SA
> #$-t 1-4200:1
>
> /var/software/packages/Mathematica/7.0/Executables/math -run
> "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
>
> Hope somebody can fix the problem.
>
> Kind Regards
>
>
> 2012/11/14 Reuti <[email protected]>
>
>> On 14.11.2012 at 10:08, jan roels wrote:
>>
>> > I got it working again; there was already a process of execd running
>> that needed to be killed, and then I restarted the services.
>> >
>> > I'm trying to run a script now:
>> >
>> >
>> > #!/bin/bash
>> > #$-cwd
>> > #$-N SA
>> > #$-S /bin/sh
>> > #$-t 1-4200:
>>
>> Don't run scripts as root. If something goes wrong it might trash your
>> machine(s).
>>
>>
>> > /var/software/packages/Mathematica/7.0/Executables/math -run
>> "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
>> >
>> > but it gives the following output:
>> >
>> > stdin: is not a tty
>>
>> It's just a warning - unless something complains I would suggest ignoring
>> it.
>>
>>
>> > and this is the output of my qstat -f:
>> >
>> > queuename                      qtype resv/used/tot. load_avg arch
>>    states
>> >
>> ---------------------------------------------------------------------------------
>> > [email protected]        BIP   0/1/1          0.70     lx26-amd64
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 1
>> >
>> ---------------------------------------------------------------------------------
>> > main.q@node0                   BIP   0/24/24        27.71    lx26-amd64
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 2
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 3
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 4
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 5
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 6
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 7
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 8
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 9
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 10
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 11
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 12
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 13
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 14
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 15
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 16
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 17
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 18
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 19
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 20
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 21
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 22
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 23
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 24
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1
>> 25
>> >
>> >
>> ############################################################################
>> >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING
>> JOBS
>> >
>> ############################################################################
>> >      35 0.50000 SA         root         qw    11/14/2012 09:57:38     1
>> 26-4200:1
>> >
>> >
>> > root@camilla:/nfs/share/sge#  qstat -explain c -j 35
>> > ==============================================================
>> > job_number:                 35
>> > exec_file:                  job_scripts/35
>> > submission_time:            Wed Nov 14 09:57:38 2012
>> > owner:                      root
>> > uid:                        0
>> > group:                      root
>> > gid:                        0
>> > sge_o_home:                 /root
>> > sge_o_log_name:             root
>> > sge_o_path:
>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>> > sge_o_shell:                /bin/bash
>> > sge_o_workdir:              /nfs/share/sge
>> > sge_o_host:                 camilla
>> > account:                    sge
>> > cwd:                        /nfs/share/sge
>> > mail_list:                  root@camilla
>> > notify:                     FALSE
>> > job_name:                   SA
>> > jobshare:                   0
>> > shell_list:                 NONE:/bin/sh
>> > env_list:
>> > script_file:                HistDisCaCO31.sh
>> > job-array tasks:            1-4200:1
>> > usage    1:                 cpu=00:05:20, mem=105.16135 GBs,
>> io=0.01537, vmem=1.110G, maxvmem=1.110G
>> > usage    2:                 cpu=00:04:17, mem=179.44371 GBs,
>> io=0.01395, vmem=3.643G, maxvmem=3.643G
>> > usage    3:                 cpu=00:04:37, mem=191.69532 GBs,
>> io=0.01394, vmem=3.657G, maxvmem=3.657G
>> > usage    4:                 cpu=00:04:34, mem=188.12645 GBs,
>> io=0.01394, vmem=3.655G, maxvmem=3.655G
>> > usage    5:                 cpu=00:04:16, mem=180.18292 GBs,
>> io=0.01394, vmem=3.636G, maxvmem=3.636G
>> > usage    6:                 cpu=00:04:22, mem=183.47616 GBs,
>> io=0.01394, vmem=3.644G, maxvmem=3.644G
>> > usage    7:                 cpu=00:04:15, mem=179.89624 GBs,
>> io=0.01400, vmem=3.640G, maxvmem=3.640G
>> > usage    8:                 cpu=00:04:55, mem=207.28643 GBs,
>> io=0.01394, vmem=3.669G, maxvmem=3.669G
>> > usage    9:                 cpu=00:04:27, mem=184.86707 GBs,
>> io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   10:                 cpu=00:04:14, mem=179.09446 GBs,
>> io=0.01394, vmem=3.635G, maxvmem=3.635G
>> > usage   11:                 cpu=00:04:47, mem=195.80372 GBs,
>> io=0.01400, vmem=3.668G, maxvmem=3.668G
>> > usage   12:                 cpu=00:04:49, mem=203.43895 GBs,
>> io=0.01394, vmem=3.665G, maxvmem=3.665G
>> > usage   13:                 cpu=00:04:45, mem=196.67175 GBs,
>> io=0.01394, vmem=3.663G, maxvmem=3.663G
>> > usage   14:                 cpu=00:04:24, mem=185.68047 GBs,
>> io=0.01400, vmem=3.648G, maxvmem=3.648G
>> > usage   15:                 cpu=00:04:40, mem=195.96253 GBs,
>> io=0.01394, vmem=3.656G, maxvmem=3.656G
>> > usage   16:                 cpu=00:04:11, mem=179.84016 GBs,
>> io=0.01394, vmem=3.633G, maxvmem=3.633G
>> > usage   17:                 cpu=00:04:43, mem=196.21689 GBs,
>> io=0.01394, vmem=3.662G, maxvmem=3.662G
>> > usage   18:                 cpu=00:04:37, mem=197.39875 GBs,
>> io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   19:                 cpu=00:04:35, mem=191.55982 GBs,
>> io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   20:                 cpu=00:04:26, mem=191.62928 GBs,
>> io=0.01394, vmem=3.643G, maxvmem=3.643G
>> > usage   21:                 cpu=00:04:42, mem=197.87398 GBs,
>> io=0.01394, vmem=3.660G, maxvmem=3.660G
>> > usage   22:                 cpu=00:04:36, mem=193.43107 GBs,
>> io=0.01394, vmem=3.652G, maxvmem=3.652G
>> > usage   23:                 cpu=00:04:32, mem=193.12103 GBs,
>> io=0.01394, vmem=3.652G, maxvmem=3.652G
>> > usage   24:                 cpu=00:04:25, mem=186.56485 GBs,
>> io=0.01400, vmem=3.644G, maxvmem=3.644G
>> > usage   25:                 cpu=00:04:51, mem=201.81706 GBs,
>> io=0.01400, vmem=3.669G, maxvmem=3.669G
>> > scheduling info:            queue instance "main.q@camilla" dropped
>> because it is full
>> >                             queue instance "main.q@node0" dropped
>> because it is full
>> >                             All queues dropped because of overload or
>> full
>> >                             not all array task may be started due to
>> 'max_aj_instances'
>>
>> The machine is just full.
>>
>> -- Reuti
>>
>>
>> > You guys know how this can be solved?
>> >
>> >
>> >
>> > 2012/11/13 Reuti <[email protected]>
>> > On 13.11.2012 at 13:42, jan roels wrote:
>> >
>> > > Hi,
>> > >
>> > > I followed the following tutorial:
>> > >
>> > >
>> http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html
>> on how to install SGE. It all went fine on my master node but on my exec
>> node I have some troubles.
>> > >
>> > > First it gave the following error:
>> > >
>> > > 11/13/2012 13:44:43|  main|node0|E|communication error for
>> "node0/execd/1" running on port 6445: "can't bind socket"
>> >
>> > Is there already something running on this port - any older version of
>> the execd?
>> >
>> >
>> > > 11/13/2012 13:44:44|  main|node0|E|commlib error: can't bind socket
>> (no additional information available)
>> > > 11/13/2012 13:45:12|  main|node0|C|abort qmaster registration due to
>> communication errors
>> > > 11/13/2012 13:45:14|  main|node0|W|daemonize error: child exited
>> before sending daemonize state
>> > >
>> > > but then I killed the process and restarted gridengine-execd, but
>> then I get the following:
>> > >
>> > > /etc/init.d/gridengine-exec restart
>> > > * Restarting Sun Grid Engine Execution Daemon sge_execd
>>                                      error: can't resolve host name
>> > > error: can't get configuration from qmaster -- backgrounding
>> > >
>> > > What can I do to fix this?
>> >
>> > Any firewall on the machines? Ports 6444 and 6445 need to be excluded.
>> >
>> > -- Reuti
>> >
>> > > _______________________________________________
>> > > users mailing list
>> > > [email protected]
>> > > https://gridengine.org/mailman/listinfo/users
>> >
>> >
>>
>>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
