Thanks!

On Mon, May 12, 2014 at 11:19 AM, Reuti <[email protected]> wrote:

> Am 12.05.2014 um 20:12 schrieb Karun K:
>
> > The qsub man page says "If job scripts are available on the execution
> nodes, e.g. via NFS, binary submission can be the better choice".
>
> This depends. With a non-binary submission, the script is copied at
> submission time and that copy is executed later. I.e., you can even delete
> the job script afterwards, as our submission system does. Submitting a job
> script as binary can have the effect that you change the script and
> suddenly jobs submitted in the past fail due to an error in the script -
> because the current on-disk version is the one being executed.
>
> -- Reuti
>
>
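The submit-time-copy semantics Reuti describes can be sketched without SGE at all; in this hedged sketch, plain files and `sh` stand in for qsub and the spool area, and all file names are hypothetical:

```shell
#!/bin/sh
# Simulate the two submission modes with plain files (no SGE needed).
cd "$(mktemp -d)"
echo 'echo v1' > job.sh

# "Script" submission: the scheduler copies the script at submission time,
# so the spooled copy is what later runs, regardless of later edits.
cp job.sh spooled_copy.sh        # analogous to: qsub job.sh

# "Binary" submission: only the path is remembered; the file is read at
# run time, so later edits (or deletion) affect already-queued jobs.
echo 'echo v2' > job.sh          # user edits the script after submitting

sh spooled_copy.sh               # prints: v1  (submit-time version)
sh job.sh                        # prints: v2  (current on-disk version)
```

This is why deleting or editing a script is safe after a plain script submission but not after a `-b y` submission.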
> > On Mon, May 12, 2014 at 11:01 AM, Karun K <[email protected]> wrote:
> > It's writing output to the job submission directory (default behavior),
> which works for us.
> > Regarding using -V for shell scripts, I need to consult my engineers
> about it. Other than exporting the current environment variables, is there
> any other difference?
> >
> > Thanks!
> >
> >
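The effect asked about above - `-V` exporting the submitter's environment into the job - can be sketched with plain `sh` in place of qsub (a hedged illustration; `MYVAR` is a hypothetical variable, and `env -i` approximates the clean environment a job gets without `-V`):

```shell
#!/bin/sh
# Sketch of -V semantics: with -V the job inherits the submitter's
# environment; without it, the job starts without those variables.
export MYVAR=from_submitter

# Analogous to qsub -V: the current environment is passed through.
with_v=$(sh -c 'echo "${MYVAR:-unset}"')

# Analogous to plain qsub: the job does not see the submitter's variables.
without_v=$(env -i sh -c 'echo "${MYVAR:-unset}"')

echo "with -V:    $with_v"       # from_submitter
echo "without -V: $without_v"    # unset
```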
> >
> > On Sat, May 10, 2014 at 5:05 AM, Reuti <[email protected]>
> wrote:
> > Am 10.05.2014 um 01:18 schrieb Karun K:
> >
> > > Here is the job script,
> > >
> > > qsub -N myusername$N -l h_vmem=5.0G ../job1.sh my1$N
> > >
> > > from .sge_request
> > > # default SGE options
> > > -j y -cwd -b y
> > > # -j y -cwd
> > > # -cwd
> >
> > I don't see a -o option here - how does ../job1.sh decide where the
> output should go? And why are you submitting a script as binary?
> >
> > -- Reuti
> >
> >
> > > On Fri, May 9, 2014 at 3:35 PM, Reuti <[email protected]>
> wrote:
> > > Am 10.05.2014 um 00:18 schrieb Karun K:
> > >
> > > > Reuti,
> > > >
> > > > Some of them are array jobs; it looks like we have been using
> $task_id for array jobs.
> > > > The issue we are seeing is with non-array jobs.
> > > >
> > > > Here is a snippet from one of the corrupted job output log files;
> the numbers in between the text lines are actually output from a different
> job.
> > >
> > > How exactly and where are you specifying this output path: command
> line or inside the job script?
> > >
> > > What does the job script look like?
> > >
> > > -- Reuti
> > >
> > >
> > > > Processing Haplotype 7204 of 15166 ...
> > > >     Outputting Individual 450996750985279->450996750985279 ...
> > > >   Processing Haplotype 7205 of 15166 ...
> > > >   Processing Haplotype 7206 of 15166 ...
> > > >     Outputting Individual 632999004155376->632999004155376 ...
> > > >   Processing Haplotype 7207 of 15955    0.532   0.994   0.538 0.998
>     0.999   0.988   0.561   0.560   0.995   0.607   0.978   0.949   0.577
> 0.998   0.926   0.998
> > > >         0.927   0.938   0.532   0.997   0.999   0.994   0.965 0.533
>     0.994   0.938   0.738   0.945   0.995   0.534   0.529   0.998   0.999
> 0.968   0.534   0.994
> > > >         0.531   0.997   0.539   0.529   0.945   0.529   0.999 0.996
>     0.926   0.535   0.546   0.946   0.999   0.999   0.945   0.996   0.998
> 0.979   0.978   0.532
> > > >         0.925   0.987   0.994   0.945   0.984   0.998   0.969 0.999
>     0.983   0.543   0.718   0.918   0.555   0.501   0.998   0.541   0.998
> 0.999   0.997   0.553
> > > >         0.946   0.987   0.995   0.999   0.979   0.999   0.999 0.881
>     0.543   0.541   0.538   0.900   0.979   0.999   0.998   0.999   0.999
> 0.999   0.999   0.999
> > > >         0.990   0.989   0.986   0.931   0.997   0.997   0.999 0.999
>     0.530   0.997   0.925   0.994   0.986   0.795   0.999   0.999   0.978
> 0.993   0.721   0.978
> > > >         0.538   0.998   0.999   0.984   0.999   0.997   0.997 0.979
>     0.553   0.795   0.999   0.979   0.998   0.995   0.999   0.988   0.946
> 0.543   0.558   0.995
> > > >         0.983   0.992   0.926   0.567   0.979   0.923   0.919 0.949
>     0.652   0.940   0.995   0.999   0.999   0.647   0.996   0.678   0.933
> 0.870   0.997   0.690
> > > > 0.995   0.992   0.981   0.932   0.995   0.993   0.999 0.998
> 0.861   0.861   0.979   0.995   0.999   0.999   0.584   0.861   0.978
> 0.870   0.872   0.932
> > > >         0.999   0.790   0.995   0.999   0.932   0.999   0.863 0. of
> 15166 ...
> > > >   Processing Haplotype 8564 of 15166 ...
> > > >     Outputting Individual 770954964699120->770954964699120 ...
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, May 9, 2014 at 2:46 PM, Reuti <[email protected]>
> wrote:
> > > > Am 09.05.2014 um 23:29 schrieb Karun K:
> > > >
> > > > > Thanks Reuti.
> > > > >
> > > > > But how come other log files are fine and we only see this
> behavior on a few output logs, seemingly at random?
> > > >
> > > > And all are array jobs?
> > > >
> > > > In case they simply run one after the other, they will overwrite the
> old logfile.
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > Shouldn't it be consistent with all the other output logs too?
> > > > >
> > > > >
> > > > > On Fri, May 9, 2014 at 2:17 PM, Reuti <[email protected]>
> wrote:
> > > > > Am 09.05.2014 um 23:04 schrieb Karun K:
> > > > >
> > > > > > Yes, these are array jobs with output path set to -cwd during
> job submission.
> > > > >
> > > > > Well, then you also have to use the $TASK_ID in the -o option to
> distinguish between different tasks.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
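Reuti's advice above amounts to giving every array task its own output file, e.g. `qsub -t 1-3 -o 'job.out.$TASK_ID' script.sh` (a hypothetical invocation). A plain-shell sketch of the resulting layout, with a loop standing in for the array tasks and hypothetical file names:

```shell
#!/bin/sh
# Sketch: one output file per array task, as -o 'job.out.$TASK_ID' would
# produce; separate files mean task outputs can never interleave.
cd "$(mktemp -d)"
for task in 1 2 3; do
    # Each "task" writes to its own file, keyed by its task id.
    echo "output of task $task" > "job.out.$task"
done
ls job.out.*     # lists job.out.1, job.out.2, job.out.3
```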
> > > > > > On Fri, May 9, 2014 at 12:20 PM, Reuti <
> [email protected]> wrote:
> > > > > > Am 09.05.2014 um 20:18 schrieb Karun K:
> > > > > >
> > > > > > > Reuti,
> > > > > > >
> > > > > > > These are the job output logs, not
> /var/spool/sge/qmaster/messages. They are in the user job directories,
> named jobname.o$jobid.
> > > > > >
> > > > > > How exactly and where are you specifying this output path:
> command line or inside the job script?
> > > > > >
> > > > > > Are these array jobs?
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > >
> > > > > > > On Fri, May 9, 2014 at 11:02 AM, Reuti <
> [email protected]> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Am 09.05.2014 um 19:43 schrieb Karun K:
> > > > > > >
> > > > > > > > We are using OGS/GE 2011.11p1
> > > > > > > >
> > > > > > > > We encountered log file corruption: in a few GE log files
> there is output of some other jobs written into them. The filesystem is
> working fine - there is no corruption in the data files, just in some GE
> log files, randomly.
> > > > > > >
> > > > > > > Which file do you refer to in detail - the
> /var/spool/sge/qmaster/messages and the like? Although it's best to have
> them local on each node, even having them in an NFS location still means
> that only one process - the sge_execd/sge_qmaster - will write to each of
> them.
> > > > > > >
> > > > > > > -- Reuti
> > > > > > >
> > > > > > > >
> > > > > > > > Has anyone else seen this issue?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > _______________________________________________
> > > > > > > > users mailing list
> > > > > > > > [email protected]
> > > > > > > > https://gridengine.org/mailman/listinfo/users
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >
>
>