On Thu, 3 Mar 2011, Dave Love wrote: ...
Agreed. I keep meaning to put together a script to extract the available info from qacct, the GE log files, and possibly syslog, post mortem for a job (assuming shared classic spooling). Does anyone else fancy having a go?
Do you mean something like the attached script? It ticks 2 out of your 3 boxes. It's hideously out of date, but it worked pretty well for SGE 6.0 on clusters where the client spool directories were in a central location.
I've been meaning to dust it off and bring it up to date for 6.2 and for setups where the client spool is local to each compute node (I arrange for the messages file to end up in the central location anyway), but I need to sort out a few other problems first (with the execd logging, etc.).
For example, here is a task array job on a 6.0 cluster: some subtasks ran OK, some were killed because they ran out of memory, and others were killed because they ran out of time.
$ ./qacct-summary -w 414599
WARNING: this program is in beta and may change at any time.
414599.1       ok
414599.2-295   h_vmem
414599.296     h_rt
414599.297-460 h_vmem
414599.461     h_rt
414599.462-473 h_vmem
414599.474     h_rt
414599.475-526 h_vmem
414599.527     h_rt
414599.528-563 h_vmem
414599.564     h_rt
414599.565-566 h_vmem
414599.567     h_rt
414599.568-588 h_vmem
414599.589-590 h_rt
414599.591-604 h_vmem
414599.605     h_rt
414599.606-614 h_vmem
414599.615     h_rt
414599.616-647 h_vmem
414599.648     h_rt
414599.649-666 h_vmem
414599.667     h_rt
414599.668-692 h_vmem
414599.693     h_rt
414599.694-699 h_vmem
414599.700-701 h_rt
414599.702-716 h_vmem
414599.717     h_rt
414599.718     h_vmem
414599.719     h_rt
414599.720-732 h_vmem
414599.733     h_rt
414599.734     h_vmem
414599.735     h_rt
Agreed. The standard 137 exit code is not really quite enough. For the administrator, it could be *very* useful to know how many jobs are being killed as a result of exceeding resource requests,

(In case people don't know) You can arrange for the admin to get more useful mail about at least some failed jobs, but I can't remember off-hand what configures it. That's at least partially broken in 6.2u5, and I typically just get mailed the hostfile, though I have a fix which hasn't been properly tested <https://arc.liv.ac.uk/trac/SGE/ticket/1307>. It can also swamp you if an array job fails because its working directory disappeared, for instance.
I think it's enabled if you set the administrator_mail option in "qconf -mconf"?
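A note on the 137 exit code mentioned above: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). That is why the code on its own cannot distinguish an h_vmem kill from an h_rt kill or a qdel. The following is only a minimal sketch of that decoding; describe_exit is a hypothetical helper and is not part of the attached script.

```perl
#!/usr/bin/perl
# Sketch: interpret an exit status using the "128 + signal number" convention.
use strict;
use warnings;
use Config;

# Signal names for this platform, indexed by signal number.
my @sig_names = split(' ', $Config{sig_name});

# describe_exit is a hypothetical helper (assumption, not from qacct-summary).
sub describe_exit {
    my ($status) = @_;
    return "ok" if $status == 0;
    if ($status > 128) {
        # Status above 128: terminated by signal (status - 128)
        my $signo = $status - 128;
        my $name  = $sig_names[$signo];
        return defined $name ? "signal $signo (SIG$name)" : "signal $signo";
    }
    # Otherwise the job script itself returned a non-zero exit code
    return "exit-$status";
}

print "$_: ", describe_exit($_), "\n" for (0, 1, 137);
```

The signal tells you how the job died, not why, which is the gap the attached script tries to fill by reading the qmaster and execd messages files.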
TTFN

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
#!/usr/bin/perl # Copyright (C) 2007 University of Leeds # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Collect specific data generated from qacct. Most useful for finding what # task array jobs failed, or the maximum memory usage made by a program. # # 09Oct07 Mark Dixon <[email protected]> # # $Id: qacct-summary 1341 2011-03-04 08:30:17Z issmcd $ use strict; use Getopt::Std; use Env('SGE_ROOT'); use File::Basename; use IO::File; my($jobID,$maxTaskid,$minTaskid,@data,%opts,$padLen,$parallel); my($usage) = "Usage: ". basename($0) ." 
[-wefGMh] [jobid...]\n"; my $qconf = "qconf"; my $acctData = $SGE_ROOT ."/default/common/accounting"; # Check for command-line arguments getopts('wefGMh', \%opts); # Print help, if necessary if ($opts{h}) { print $usage; print "\n"; print " -w : print the reason why a job ended\n"; print " ok : no problems were detected\n"; print " exit-<n> : job script exited with exit code <n>\n"; print " h_rt : task exceeded requested h_rt\n"; print " h_vmem : task exceeded requested h_vmem\n"; print " killed : killed for unknown reason (perhaps by qdel)\n"; print " unknown : unanticipated failure case - please email support\n"; print " -e : print exit status of job (or SGE status in case of certain errors)\n"; print " -f : print failure record for job\n"; print " -G : print maximum memory usage, in G (rounded-up)\n"; print " -M : print maximum memory usage, in M (rounded-up)\n"; print " -h : print this message\n"; print "\n"; exit 1; } # Print usage method, if necessary if ($#ARGV == -1) { print STDERR $usage; exit 1; } my $defmaxvmem; if ($opts{w}) { # Get default job memory limit $defmaxvmem = SASGEgetDefMemLimit(); } # Collect qacct data foreach $jobID (@ARGV) { my $fh = SASGEacctOpen(); my @d = @{SASGEacctRead($fh,quotemeta($jobID))}; foreach my $r (@d) { # Extract requested h_vmem from resource list (not always needed) my $h_vmem; if ($opts{w}) { $h_vmem = SASGEtoBytes(SASGEacctResource($r,"h_vmem")); $h_vmem = $defmaxvmem; } if ($opts{G} || $opts{M}) { $parallel++ if ($r->{granted_pe} ne "NONE"); } # Initialise job end reason my $why = "unknown"; # Determine job end reason $why = "ok" if ($r->{failed} == 0 && $r->{exit_status} == 0); $why = "exit-". $r->{exit_status} if ($r->{failed} == 0 && $r->{exit_status} != 0); # A bit iffy # Fix task_number: "0" denotes a non-taskarray job but reported as "1" elsewhere in SGE. 
$r->{task_number} = 1 if ($r->{task_number} == 0); # Determine range of task_numbers $minTaskid = $r->{task_number} if ($minTaskid eq undef || $r->{task_number} < $minTaskid); $maxTaskid = $r->{task_number} if ($maxTaskid eq undef || $r->{task_number} > $minTaskid); # Store data for later read-out $data[$r->{task_number}] = { failed => $r->{failed}, exit_status => $r->{exit_status}, maxvmem => $r->{maxvmem}, hostname => $r->{hostname}, maxgb => (int($r->{maxvmem}/(1024*1024*1024))+1), maxmb => (int($r->{maxvmem}/(1024*1024))+1), why => $why, h_vmem => $h_vmem }; } # Perform additional processing for "why?" flag if ($opts{w}) { # Get default job memory limit my $defmaxvmem = SASGEgetDefMemLimit(); # Search qmaster messages for "killed" messages open(MESG,"cat $SGE_ROOT/default/spool/qmaster/messages |"); while(<MESG>) { next if (! /\|job\s+([0-9]+)/); next if ($1 ne $jobID); if (/job\s+[0-9]+\.([0-9]+)\s+failed on host\s+\S+\s+assumedly after job because: job\s+[0-9.]+\s+died through signal KILL/) { my $task_number = $1; $data[$task_number]{why} = "killed" if (ref($data[$task_number]) eq "HASH"); } } close(MESG); # Search sgeexecd messages for "killed" reasons open(MESG,"cat $SGE_ROOT/default/spool/*/messages |"); while(<MESG>) { next if (! /\|job\s+([0-9]+)(\s+|\.)/); next if ($1 ne $jobID); if (/job\s+([0-9]+)\.([0-9]+)\s+exceeded hard wallclock time/) { my $task_number = $2; $data[$task_number]{why} = "h_rt" if (ref($data[$task_number]) eq "HASH"); } # Would normally search here for exceeding "h_vmem", but message does not include # the task_number. 
} close(MESG); # Check if memory was exceeded foreach my $d (@data) { next if (ref($d) ne "HASH"); if ($d->{why} eq "killed" && $d->{maxvmem} >= $d->{h_vmem}) { $d->{why} = "h_vmem"; } } } print STDERR "WARNING: this program is in beta and may change at any time.\n"; if ($parallel) { print STDERR "WARNING: memory records for parallel jobs are not correct.\n"; } # Get formatting information $padLen = calcMaxStringLen($jobID,$minTaskid,$maxTaskid); # Sort and print data my @groups; if ($opts{w}) { @groups = sortGrps("why",$minTaskid,$maxTaskid,@data); } elsif ($opts{e}) { @groups = sortGrps("exit_status",$minTaskid,$maxTaskid,@data); } elsif ($opts{f}) { @groups = sortGrps("failed",$minTaskid,$maxTaskid,@data); } elsif ($opts{G}) { @groups = sortGrps("maxgb",$minTaskid,$maxTaskid,@data); } elsif ($opts{M}) { @groups = sortGrps("maxmb",$minTaskid,$maxTaskid,@data); } printGrps($jobID .".", @groups); } exit 0; sub sortGrps { my($key,$start,$end,@hashdata) = @_; my($i,@grps,%hash); my($grp) = undef; for ($i = $start; $i <= $end; $i++) { if (ref($hashdata[$i]) ne "HASH") { # End current group because job does not exist $grp = undef; next; } # Temp value used for brevity %hash = %{$hashdata[$i]}; # Add value to existing group, or start a new one if ($grp) { if ($hash{$key} eq $grp->{value}) { # Add this to the group $grp->{end}++; } else { # Start a new group (value changed) $grp = { start => $i, end => $i, value => $hash{$key} }; push(@grps,$grp); } } else { # Start a new group (no current group) $grp = { start => $i, end => $i, value => $hash{$key} }; push(@grps,$grp); } } return @grps; } sub printGrps { my($prefix, @grps) = @_; my($grp); foreach $grp (@grps) { if ($grp->{start} eq $grp->{end}) { print padString($prefix . $grp->{start}) ." ". $grp->{value} ."\n"; } else { print padString($prefix . $grp->{start} ."-". $grp->{end}) ." ". 
$grp->{value} ."\n"; } } } sub calcMaxStringLen { my($prefix,$start,$end) = @_; my($max); # Just in case $max = $end; $max = $start if ($start > $max); return (length($prefix) + 2 + 2* length($max)); } sub padString { return sprintf("%-${padLen}s", $_[0]) } # Obtain the default job memory limit # (returns undef if there isn't one) sub SASGEgetDefMemLimit { my($memlimit) = undef; my($fh) = new IO::File; $fh->open("$qconf -sc |") || die("Problem running qconf"); while(<$fh>) { if (/^h_vmem\s+h_vmem\s+MEMORY\s+<=\s+YES\s+YES\s+(\S+)/) { $memlimit = SASGEtoBytes($1); } } $fh->close || die("Problem while running qconf"); return $memlimit; } # Open the SGE accounting file sub SASGEacctOpen { my($fh) = IO::File->new; $fh->open("< $acctData"); return $fh; } # Read records from SGE accounting file # $fh : Filehandle to accounting file # $findJobs : regex for jobs to search for (returns all if blank) sub SASGEacctRead { my($fh,$findJobs) = @_; my(@d,$r,$c); $c = 0; while(<$fh>) { $r = $_; chomp($r); # Only load specific job id(s), if necessary next if ($findJobs && ($r !~ /^([^:]+:){5}${findJobs}:/)); if ($r =~ /^([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):([^:]+):(.*):([^:]+):([^:]+):([^:]+)$/) { $d[$c]{qname} = $1; $d[$c]{hostname} = $2; $d[$c]{group} = $3; $d[$c]{owner} = $4; $d[$c]{job_name} = $5; $d[$c]{job_number} = $6; $d[$c]{account} = $7; $d[$c]{priority} = $8; $d[$c]{submission_time} = $9; $d[$c]{start_time} = $10; $d[$c]{end_time} = $11; $d[$c]{failed} = $12; $d[$c]{exit_status} = $13; $d[$c]{ru_wallclock} = $14; $d[$c]{ru_utime} = $15; $d[$c]{ru_stime} = $16; $d[$c]{ru_maxrss} = $17; $d[$c]{ru_ixrss} = $18; $d[$c]{ru_ismrss} = $19; $d[$c]{ru_idrss} = $20; $d[$c]{ru_isrss} = $21; $d[$c]{ru_minflt} = $22; 
$d[$c]{ru_majflt} = $23; $d[$c]{ru_nswap} = $24; $d[$c]{ru_inblock} = $25; $d[$c]{ru_oublock} = $26; $d[$c]{ru_msgsnd} = $27; $d[$c]{ru_msgrcv} = $28; $d[$c]{ru_nsignals} = $29; $d[$c]{ru_nvcsw} = $30; $d[$c]{ru_nivcsw} = $31; $d[$c]{project} = $32; $d[$c]{department} = $33; $d[$c]{granted_pe} = $34; $d[$c]{slots} = $35; $d[$c]{task_number} = $36; $d[$c]{cpu} = $37; $d[$c]{mem} = $38; $d[$c]{io} = $39; $d[$c]{category} = $40; $d[$c]{iow} = $41; $d[$c]{pe_taskid} = $42; $d[$c]{maxvmem} = $43; $c++; } } return \@d; } # Return the SGE memory value in bytes sub SASGEtoBytes { my($value) = @_; if ($value =~ /([0-9.]+)K/) { return ($1 * 1024); } elsif ($value =~ /([0-9.]+)k/) { return ($1 * 1000); } elsif ($value =~ /([0-9.]+)M$/) { return ($1 * 1024*1024); } elsif ($value =~ /([0-9.]+)m$/) { return ($1 * 1000*1000); } elsif ($value =~ /([0-9.]+)G$/) { return ($1 * 1024*1024*1024); } elsif ($value =~ /([0-9.]+)g$/) { return ($1 * 1000*1000*1000); } } # Returns the requested resource value from the accounting record sub SASGEacctResource { my($r,$key) = @_; if ($r->{category} =~ /\s+-l\s+(\S+)/) { my $list = $1; if ($key) { foreach my $res (split(/,/,$1)) { if ($res =~ /([^=]*)=(.*)/) { if ($1 eq $key) { return $2; } } } } else { my(%d); foreach my $res (split(/,/,$1)) { if ($res =~ /([^=]*)=(.*)/) { $d{$1} = $2; } } return \%d; } } return undef; }
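
One gap in the script's memory conversion is that a value with no suffix falls through without being converted. The sketch below shows one way that case could be handled. It assumes the suffix meanings the script already uses (K/M/G as powers of 1024, k/m/g as powers of 1000) and further assumes that a value without a suffix is a plain byte count. mem_to_bytes is a hypothetical name and is not part of the script above.

```perl
#!/usr/bin/perl
# Sketch: convert an SGE-style memory value to bytes, including bare numbers.
use strict;
use warnings;

# Multipliers: lower-case suffixes are decimal, upper-case are binary.
my %mult = (
    k => 1000,     K => 1024,
    m => 1000**2,  M => 1024**2,
    g => 1000**3,  G => 1024**3,
);

# mem_to_bytes is a hypothetical helper (assumption, not from qacct-summary).
# Returns undef for anything it cannot parse.
sub mem_to_bytes {
    my ($value) = @_;
    return undef unless defined $value;
    return undef unless $value =~ /^([0-9]+(?:\.[0-9]+)?)([kKmMgG]?)$/;
    my ($num, $suffix) = ($1, $2);
    # Assumption: no suffix means the value is already in bytes
    return $suffix eq '' ? $num + 0 : $num * $mult{$suffix};
}

for my $v ("2G", "500m", "1024", "bogus") {
    my $bytes = mem_to_bytes($v);
    print "$v -> ", (defined $bytes ? $bytes : "undef"), "\n";
}
```

Anchoring the pattern at both ends means a malformed value comes back as undef instead of being partially matched, so the caller can fall back to the default limit.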
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
