I think that I have the cause:

saved core dump of pid 25405
(/home/build2/install/BluePearl-linux-x86_64-CentOS4.8-6.0.12300.bps_nightlyBuilds_6.0.12300-20120802-230002-PDT/bin/Linux/x86_64/BluePearlCLI)
to /var/spool/abrt/ccpp-2012-08-03-01:19:24-25405.new/coredump (3379773440 bytes)


The spool directory is local and it has about 3 GB of free space.
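
For anyone who wants to check this on their own hosts, here is a rough sketch in Python. It is only an illustration, not part of Grid Engine or abrt. It compares the free space on a filesystem with the core dump size from the message above. The helper names (to_gib, has_room) are made up for this example, and the two paths are the ones that appear in this thread, so adjust them for your site.

```python
import os
import shutil

# Core dump size reported by abrt in the message above.
CORE_DUMP_BYTES = 3379773440


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


def has_room(path, needed_bytes):
    """Return True if the filesystem holding `path` has at least
    `needed_bytes` free for unprivileged users."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    print("core dump size: %.2f GiB" % to_gib(CORE_DUMP_BYTES))
    # Paths taken from this thread; they are assumptions for any other site.
    for path in ("/var/spool/abrt",
                 "/home/gridengine/blue/spool/h2-c6-64-1"):
        if os.path.isdir(path):
            free = shutil.disk_usage(path).free
            print("%s: %.2f GiB free, room for the dump: %s"
                  % (path, to_gib(free), has_room(path, CORE_DUMP_BYTES)))
        else:
            print("%s: not present on this host" % path)
```

If both paths report the same free space, they are probably on the same filesystem, in which case a dump of this size could leave no room for the shepherd to write its usage file.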


Simon

On Fri, Aug 3, 2012 at 11:12 AM, Rayson Ho <[email protected]> wrote:

> Check whether the execd spool directory is local or NFS-shared. The usage
> file is written into the execd spool, not the job's working dir.
>
> Rayson
>
>
>
> On Fri, Aug 3, 2012 at 2:07 PM, Simon Matthews
> <[email protected]> wrote:
> > Can anyone tell me what is going on here?
> >
> > The machine has plenty of disk space. The directory from which the job was
> > set to run has plenty of disk space.
> >
> > Also, is there any way to prevent such events from making the queue go into
> > the ERROR state?
> >
> > Simon
> >
> >
> > Job 8057282 caused action: Queue "[email protected]" set to ERROR
> >  User        = build
> >  Queue       = [email protected]
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> > failed before job:08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on device
> > Shepherd trace:
> > 08/03/2012 01:02:01 [600:25020]: shepherd called with uid = 0, euid = 600
> > 08/03/2012 01:02:01 [600:25020]: starting up 6.2u4
> > 08/03/2012 01:02:01 [600:25020]: setpgid(25020, 25020) returned 0
> > 08/03/2012 01:02:01 [600:25020]: no prolog script to start
> > 08/03/2012 01:02:01 [600:25020]: parent: forked "job" with pid 25023
> > 08/03/2012 01:02:01 [600:25023]: child: starting son(job,
> > /home/gridengine/blue/spool/h2-c6-64-1/job_scripts/8057282, 0);
> > 08/03/2012 01:02:01 [600:25020]: parent: job-pid: 25023
> > 08/03/2012 01:02:01 [600:25023]: pid=25023 pgrp=25023 sid=25023 old
> > pgrp=25020 getlogin()=root
> > 08/03/2012 01:02:01 [600:25023]: reading passwd information for user
> 'build'
> > 08/03/2012 01:02:01 [600:25023]: setosjobid: uid = 0, euid = 600
> > 08/03/2012 01:02:01 [600:25023]: setting limits
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_CPU setting: (soft 7200 hard
> 7200)
> > resulting: (soft 7200 hard 7200)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_FSIZE setting: (soft 0^HINFINITY
> > hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_DATA setting: (soft 0^HINFINITY
> hard
> > 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_STACK setting: (soft 0^HINFINITY
> > hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_CORE setting: (soft 0^HINFINITY
> hard
> > 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> > 0^HINFINITY hard 0^HINFINITY) resulting: (soft 0^HINFINITY hard
> 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: RLIMIT_RSS setting: (soft 0^HINFINITY
> hard
> > 0^HINFINITY) resulting: (soft 0^HINFINITY hard 0^HINFINITY)
> > 08/03/2012 01:02:01 [600:25023]: setting environment
> > 08/03/2012 01:02:01 [600:25023]: Initializing error file
> > 08/03/2012 01:02:01 [600:25023]: switching to intermediate/target user
> > 08/03/2012 01:02:01 [2002:25023]: closing all filedescriptors
> > 08/03/2012 01:02:01 [2002:25023]: further messages are in "error" and
> > "trace"
> > 08/03/2012 01:02:01 [2002:25023]: now running with uid=2002, euid=2002
> > 08/03/2012 01:02:01 [2002:25023]: execvp(/bin/bash, "bash" "-s")
> > 08/03/2012 01:20:51 [600:25020]: wait3 returned 25023 (status: 0;
> > WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> > 08/03/2012 01:20:51 [600:25020]: job exited with exit status 0
> > 08/03/2012 01:20:51 [600:25020]: reaped "job" with pid 25023
> > 08/03/2012 01:20:51 [600:25020]: job exited not due to signal
> > 08/03/2012 01:20:51 [600:25020]: job exited with status 0
> > 08/03/2012 01:20:51 [600:25020]: now sending signal KILL to pid -25023
> > 08/03/2012 01:20:51 [600:25020]: writing usage file to "usage"
> > 08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on
> > device
> >
> > Shepherd error:
> > 08/03/2012 01:20:51 [600:25020]: can't close file usage: No space left on
> > device
> >
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> >
>