Thanks I appreciate it. As always with this sort of operation I will tread with all due caution.
-Paul Edmon- On 06/06/2013 10:02 AM, Loris Bennett wrote: > Hi Paul, > > Paul Edmon <[email protected]> writes: > >> If you could send me the script that would be appreciated. Thanks. >> >> -Paul Edmon- > So first off, I'm providing this code as used successfully by me, but, > to quote the GPL: > > This program is distributed in the hope that it will be useful, > but WITHOUT ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > > So if it deletes all your data, sets fire to your computer, or makes > your house fall down, it's not my fault. > > You are going to have to make the following changes: > > 1. The relevant database tables all have the name of the cluster as part > of the table name. So, since our cluster is called 'soroban', the > job information is stored in the table 'soroban_job_table'. You > probably need to do > > use slurm_acct_db; > show tables; > > in MySQL to see what the correct names are. > > 2. The main program consists of a loop over all the jobs selected by the > SELECT statement right at the beginning of the program. Then there > is a loop over all the corresponding steps. You then have to decide > which cases you want to deal with and how to deal with them. This is > all contained in the horrible mess of 'if's and 'else's. > > 3. There is an argument to the program, maxjobid, used in the main > SELECT. This is because I had to distinguish between jobs which > really were running and those which were not, but had no end_time. > So I only considered ID lower than the lowest I knew really was > running. Only jobs with lower IDs are considered for modification. > > The program generates a file to make the changes, 'update.sql', and a > file to undo the changes, 'update_undo.sql. You can apply them with the > following > > mysql -u root -p slurm_acct_db < /path/to/update.sql > > Note that if you run the program, apply the changes, then run the > program again, you will be prompted whether you want to overwrite the > existing files. However, since you are going to be trying this out on a > test system using a dump of the production database, it is not going to > matter, right? > > If you have any problems, let me know. > > Cheers, > > Loris > > #!/usr/bin/perl > > use strict; > use warnings; > > use Carp; > use Getopt::Long; > use Pod::Usage; > > use DBI; > use DBD::mysql; > use Term::ReadKey; > use List::Util qw /max min/; > > ############################################################################### > # Configuration > ############################################################################### > > my $platform = 'mysql'; > my $database = 'slurm_acct_db'; > my $host = 'localhost'; > my $port = '3306'; > my $user = 'root'; > my $dsn = "dbi:$platform:$database:$host:$port"; > my %states_by_code = > (0 => 'PENDING', 1 => 'RUNNING', > 2 => 'SUSPENDED',3 => 'COMPLETED', > 4 => 'CANCELLED',5 => 'FAILED', > 6 => 'TIMEOUT', 7 => 'NODE_FAIL', > 8 => 'PREEMPTED'); > my %states_by_string = reverse %states_by_code; > > my $output_dir = './'; > my $update_file = $output_dir . 'update.sql'; > my $undo_file = $output_dir . 'update_undo.sql'; > > my @jobs_without_end_times; > my @jobs_without_end_times_fixed; > my @sql_new_values; > my @sql_old_values; > > ############################################################################### > # Options and Arguments > ############################################################################### > > my $help; > #my $max_jobid = 123400; > my $max_jobid = 148213; > my $std_time = 1; # Thu Jan 1 01:00:01 1970 > > GetOptions('help|h' => \$help, > 'maxjobid|m=i' => \$max_jobid, > ); > if ($help) { > pod2usage(-verbose => 2); > exit; > } > > ############################################################################### > # Main Program > ############################################################################### > > print "Enter " . $user. "'s $platform password: "; > ReadMode 'noecho'; > my $password = ReadLine 0; > chomp $password; > ReadMode 'normal'; > print "\n"; > > my $dbh = DBI->connect($dsn, $user, $password, > { > 'RaiseError' => 1}); > my $select_jobs_stmt = > 'select job_db_inx,id_job,' . > 'time_start,time_end,state' . > ' from soroban_job_table' . > ' where id_job<' . $max_jobid . > ' and time_end=0'; > my $job_href = > $dbh->selectall_hashref($select_jobs_stmt,'id_job'); > > # Loop over jobs > # -------------- > my @job_ids = sort {$a <=> $b} keys(%{$job_href}); > foreach my $job_id (@job_ids) { > printf > '%12s%12s%12s%12s%12s','job_db_inx','Job/StepID','Start','End','State'; > print "\n------------ ---------- ---------- ---------- ----------\n"; > my @nonzero_step_start_times; > my @nonzero_step_end_times; > my $job = $job_href->{$job_id}; > printf '%12d%12d%12d%12d%12s',$job->{job_db_inx},$job_id, > $job->{time_start},$job->{time_end},$states_by_code{$job->{'state'}}; > print "\n"; > > my $select_steps_stmt = > 'select job_db_inx,id_step,time_start,time_end,state' . > ' from soroban_step_table' . > ' where job_db_inx=' . $job->{job_db_inx}; > my $step_href = > $dbh->selectall_hashref($select_steps_stmt,'job_db_inx'); > > # Loop over job steps > # ------------------- > while (my ($step_db_inx, $step) = each %{$step_href}) { > my $step_id = $step->{id_step}; > printf '%12d%12d%12d%12d%12s',$step_db_inx,$step_id, > $step->{time_start},$step->{time_end}, > $states_by_code{$step->{state}}; > print "\n"; > > # Collect nonzero step start/end times > if ( $step->{time_start} > 0 ) { > push(@nonzero_step_start_times,$step->{time_start}); > } > if ( $step->{time_end} > 0 ) { > push(@nonzero_step_end_times,$step->{time_end}); > } > > # Solutions to inconsistencies > # ---------------------------- > # 1. Jobs/step with RUNNING/PENDING should be set to FAILED > # 2. Jobs with a valid step_end_time use that > # 3. Jobs w/o a valid step_end_time use job_start_time > > # Check step start/end times > if ($step->{time_start} == 0) { > print "WARNING: Step start time = $step->{time_start}!\n" . > "NO SOLUTION DECIDED YET!\n"; > exit; > } > if ($step->{time_end} == 0) { > print "Change step end time to step start time:" . > " $step->{time_end} -> $step->{time_start}\n"; > update_step(\@sql_old_values,$job->{job_db_inx},$step_id, > 'time_end',$step->{time_end}); > update_step(\@sql_new_values,$job->{job_db_inx},$step_id, > 'time_end',$step->{time_start}); > > } > if ($step->{time_start} == 0 and $step->{time_end} == 0) { > print "Both step start/end times = 0!\n" . > "NO SOLUTION DECIDED YET!\n"; > exit; > } > > # Fix step state > # -------------- > if ( $step->{state} == $states_by_string{PENDING} or > $step->{state} == $states_by_string{RUNNING}) { > if ( $job->{state} == $states_by_string{PENDING} or > $job->{state} == $states_by_string{RUNNING}) { > print "Change step state: $states_by_code{$step->{state}}" . > " -> FAILED\n"; > update_step(\@sql_old_values,$job->{job_db_inx},$step_id, > 'state',$step->{state}); > update_step(\@sql_new_values,$job->{job_db_inx},$step_id, > 'state',$states_by_string{FAILED}); > } else { > print "Change step state to job state:" . > " $states_by_code{$step->{state}} -> > $states_by_code{$job->{state}}\n"; > update_step(\@sql_old_values,$job->{job_db_inx},$step_id, > 'state',$step->{state}); > update_step(\@sql_new_values,$job->{job_db_inx},$step_id, > 'state',$job->{state}); > } > } > } > # ---------------------- > # End of loop over steps > > # Fix job state > # ------------- > if ( $job->{state} == $states_by_string{PENDING} or > $job->{state} == $states_by_string{RUNNING}) { > print "Change job state from $states_by_code{$job->{state}}" . > " to FAILED\n"; > update_job(\@sql_old_values,$job->{job_db_inx},'state',$job->{state}); > update_job(\@sql_new_values,$job->{job_db_inx},'state', > $states_by_string{FAILED}); > } > > # Fix job start/end_time > # ---------------- > if ($job->{time_start} > 0 ) { # if start_time is non-0 > if (scalar(@nonzero_step_end_times)<1) { # & no non-0 step end_times > if (scalar(@nonzero_step_start_times)<1) { # & no non-0 step > start_times > # Set job_end_time to job_start_time > print "Change job end time to job start time:" . > " 0 -> $job->{time_start}\n"; > push(@jobs_without_end_times,$job_id); > update_job(\@sql_old_values,$job->{job_db_inx},'time_end',0); > update_job(\@sql_new_values,$job->{job_db_inx},'time_end', > $job->{time_start}); > } else { > # Set job_end_time to max step_start_time > my $time = max(@nonzero_step_start_times); > print "Change job end time to max step start time:" . > " 0 -> $time\n"; > update_job(\@sql_old_values,$job->{job_db_inx},'time_end',0); > update_job(\@sql_new_values,$job->{job_db_inx},'time_end',$time); > } > } else { > # otherwise use latest step end time for job > my $time = max(@nonzero_step_end_times); > print "Change job end time to last step end time:" . > " 0 -> $time\n"; > update_job(\@sql_old_values,$job->{job_db_inx},'time_end',0); > update_job(\@sql_new_values,$job->{job_db_inx},'time_end',$time); > } > } else { # job start time == 0 > if (scalar(@nonzero_step_start_times)<1) { > # Use now > print "Change job start/end time both to $std_time:" . > " 0 -> " . localtime($std_time) . "\n"; > update_job(\@sql_old_values,$job->{job_db_inx},'time_start',0); > update_job(\@sql_new_values,$job->{job_db_inx}, > 'time_start',$std_time); > update_job(\@sql_old_values, > $job->{job_db_inx}, > 'time_end',0); > update_job(\@sql_new_values, > $job->{job_db_inx}, > 'time_end',$std_time); > } else { > # otherwise use earliest/latest step start time for job > my $min_start_time = min(@nonzero_step_start_times); > print "Change job start time to first step start time:" . > "$job->{time_start} -> $min_start_time\n"; > update_job(\@sql_old_values, > $job->{job_db_inx}, > 'time_start',0); > update_job(\@sql_new_values, > $job->{job_db_inx}, > 'time_start',$min_start_time); > my $max_start_time = max(@nonzero_step_start_times); > print "Change job end time to last step start time:" . > " $job->{time_end} -> $max_start_time\n"; > update_job(\@sql_old_values, > $job->{job_db_inx}, > 'time_end',0); > update_job(\@sql_new_values, > $job->{job_db_inx}, > 'time_end',$max_start_time); > } > } > print "------------------------------------------------------------\n"; > } > # --------------------- > # End of loop over jobs > > print <<END > > Summary > ======= > > END > ; > print "Total number of jobs w/o end time: " . > scalar(keys(%{$job_href})) . "\n"; > print "Job w/o job and w/o step end time: " . > scalar(@jobs_without_end_times) . "\n"; > #foreach (@jobs_without_end_times) { > # print "\t$_\n"; > #} > > write_sql(\@sql_new_values,$update_file); > write_sql(\@sql_old_values,$undo_file); > > $dbh->disconnect(); > exit; > > ############################################################################### > # Subroutines > ############################################################################### > > sub update_job { > my ($aref,$job_db_inx,$col,$val) = @_; > my $stmt = > 'update soroban_job_table' . > ' set ' . $col . '=' . $val . > ' where job_db_inx=' . $job_db_inx; > push(@{$aref},$stmt); > } > > sub update_step { > my ($aref,$job_db_inx,$id_step,$col,$val) = @_; > my $stmt = > 'update soroban_step_table' . > ' set ' . $col . '=' . $val . > ' where job_db_inx=' . $job_db_inx . > ' and id_step=' . $id_step; > push(@{$aref},$stmt); > } > > sub write_sql { > my ($sql_statements_aref,$file_name) = @_; > if (-r $file_name) { > print "$file_name exists. Overwrite? [y/n]: "; > my $ans = <STDIN>; > chomp $ans; > return if $ans ne 'y'; > } > open(my $output,'>',$file_name) > or croak "Can't open $file_name: $!"; > foreach my $sql (@$sql_statements_aref) { > print $output "$sql;\n"; > } > close($output); > } > > __END__ > > ############################################################################## > # POD > ############################################################################## > > =head1 NAME > > fix_sacct_db.pl - fixes jobs/steps w/o end time > > =head1 SYNOPSIS > > fix_sacct_db -m > > =head1 DESCRIPTION > > The program fixes errors in the sacct database, which result in the command > 'sacct' showing jobs as running which have in fact come to an end. All the > jobs > with zero end time are selected and a suitable finite end time determined. > > =head1 OPTIONS > > -h --help this help > -m --maxid max. Job ID considered > > =cut > > # Local Variables: > # cperl-indent-level: 2 > # End: >
