I think that I have the cause:

saved core dump of pid 25405 (/home/build2/install/BluePearl-linux-x86_64-CentOS4.8-6.0.12300.bps_nightlyBuilds_6.0.12300-20120802-230002-PDT/bin/Linux/x86_64/BluePearlCLI) to /var/spool/abrt/ccpp-2012-08-03-01:19:24-25405.new/coredump (3379773440 bytes)

The spool directory is local and it has about 3B of free space.

Simon

On Fri, Aug 3, 2012 at 11:12 AM, Rayson Ho <[email protected]> wrote:
> Check if the execd spool directory local or NFS shared?? The usage
> file is written into the execd spool, not the job's working dir.
>
> Rayson
>
>
> On Fri, Aug 3, 2012 at 2:07 PM, Simon Matthews
> <[email protected]> wrote:
> > Can anyone tell me what is going on here?
> >
> > The machine has plenty of disk space. The directory from which the job was
> > set to run has plenty of disk space.
> >
> > Also, is there any way to prevent such events from making the queue go into
> > the ERROR state?
> >
> > Simon
> >
> >
> > Job 8057282 caused action: Queue "[email protected]" set to ERROR
> > User = build
> > Queue = [email protected]
> > Start Time = <unknown>
> > End Time = <unknown>
> > failed before job:08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on device
> > Shepherd trace:
> > 08/03/2012 01:02:01 [600:25020]: shepherd called with uid = 0, euid = 600
> > 08/03/2012 01:02:01 [600:25020]: starting up 6.2u4
> > 08/03/2012 01:02:01 [600:25020]: setpgid(25020, 25020) returned 0
> > 08/03/2012 01:02:01 [600:25020]: no prolog script to start
> > 08/03/2012 01:02:01 [600:25020]: parent: forked "job" with pid 25023
> > 08/03/2012 01:02:01 [600:25023]: child: starting son(job, /home/gridengine/blue/spool/h2-c6-64-1/job_scripts/8057282, 0);
> > 08/03/2012 01:02:01 [600:25020]: parent: job-pid: 25023
> > 08/03/2012 01:02:01 [600:25023]: pid=25023 pgrp=25023 sid=25023 old pgrp=25020 getlogin()=root
> > 08/03/2012 01:02:01 [600:25023]: reading passwd information for user 'build'
> > 08/03/2012 01:02:01 [600:25023]: setosjobid: uid = 0, euid = 600
> > 08/03/2012 01:02:01 [600:25023]: setting limits
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_CPU setting: (soft 7200 hard 7200) resulting: (soft 7200 hard 7200)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_FSIZE setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_DATA setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_STACK setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_CORE setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_RSS setting: (soft 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: setting environment
> > 08/03/2012 01:02:01 [600:25023]: Initializing error file
> > 08/03/2012 01:02:01 [600:25023]: switching to intermediate/target user
> > 08/03/2012 01:02:01 [2002:25023]: closing all filedescriptors
> > 08/03/2012 01:02:01 [2002:25023]: further messages are in "error" and "trace"
> > 08/03/2012 01:02:01 [2002:25023]: now running with uid=2002, euid=2002
> > 08/03/2012 01:02:01 [2002:25023]: execvp(/bin/bash, "bash" "-s")
> > 08/03/2012 01:20:51 [600:25020]: wait3 returned 25023 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> > 08/03/2012 01:20:51 [600:25020]: job exited with exit status 0
> > 08/03/2012 01:20:51 [600:25020]: reaped "job" with pid 25023
> > 08/03/2012 01:20:51 [600:25020]: job exited not due to signal
> > 08/03/2012 01:20:51 [600:25020]: job exited with status 0
> > 08/03/2012 01:20:51 [600:25020]: now sending signal KILL to pid -25023
> > 08/03/2012 01:20:51 [600:25020]: writing usage file to "usage"
> > 08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on device
> >
> > Shepherd error:
> > 08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on device
> >
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> >
>
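
For anyone who hits the same symptom, here is a rough sketch of the check discussed in this thread: whether the execd spool and the directory receiving the core dump sit on the same filesystem, and whether a file of the reported size would still fit there. The helper names (same_filesystem, fits) are made up for illustration, and the execd spool path is only inferred from the job_scripts path in the shepherd trace, so treat both paths as assumptions and adjust them for your installation.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both paths are on the same device (equal st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def fits(path, size_bytes):
    """Return True if size_bytes is smaller than the free space on path's filesystem."""
    return size_bytes < shutil.disk_usage(path).free


if __name__ == "__main__":
    # Assumed locations, inferred from the messages in this thread.
    execd_spool = "/home/gridengine/blue/spool/h2-c6-64-1"
    abrt_spool = "/var/spool/abrt"
    core_size = 3379773440  # size of the core dump reported above, in bytes

    existing = [p for p in (execd_spool, abrt_spool) if os.path.exists(p)]
    for p in existing:
        free = shutil.disk_usage(p).free
        print(f"{p}: {free} bytes free, core dump would fit: {fits(p, core_size)}")
    if len(existing) == 2:
        print("same filesystem:", same_filesystem(execd_spool, abrt_spool))
```

If the two directories turn out to share a filesystem, a large core dump landing in one can leave no room for the shepherd to write its usage file in the other, which would match the "No space left on device" error in the trace.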
