Figured it out.  The problem was that I have BackupPC set to run 8
nightlies at once (which usually takes 12 or more hours), but it was
ending up in a state where only one was running at a time.

This may be the longest, most detailed debugging writeup I've ever
done in 15 years of being a computer professional; I hope y'all
appreciate it.  :)  I had to do this to hold all the relevant state
in my head.

It turns out that the issue occurs when the 24-hour-ly nightlies job
is already running, and you do

   sudo -u backuppc BackupPC_serverMesg BackupPC_nightly run

which I've been doing a lot.

Deciding to queue new nightly jobs goes like this:

  while ( $CmdJob eq "" && @CmdQueue > 0 && $RunNightlyWhenIdle != 1
    || @CmdQueue > 0 && $RunNightlyWhenIdle == 2
    && $bpc->isAdminJob($CmdQueue[0]->{host}) ) {

We'll be coming back to this a lot.  "isAdminJob" matches nightly
jobs only AFAICT.

CmdQueue State: Empty
CmdJob: Empty
RunNightlyWhenIdle: 0
While State: False, since @CmdQueue <= 0
Running Job State: Empty
Event:

  Normal nightly run occurs.  RunNightlyWhenIdle is set to 1, which
  triggers all the nightly jobs getting added to the queue, and
  RunNightlyWhenIdle getting set to 0

CmdQueue State: 8 nightly jobs
CmdJob: Empty
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
While State: True, via Branch 2
Running Job State: Empty
Event:

  Nightly jobs get kicked off, all 8 of them.


CmdQueue State: Empty
CmdJob: non-empty; "admin7" or similar
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: False
While State: False, since @CmdQueue <= 0
Running Job State: 8 nightly jobs
Event:

  A backup finishes, and queues up a BackupPC_link job.  This
  happens several times, since the nightly jobs take 8+ hours, even
  split into 8 parts, on my machine (4+TiB of backups per backup
  machine).

CmdQueue State: Several link jobs
CmdJob: non-empty; "admin7" or similar
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: False; link jobs don't match
While State: False
Running Job State: 8 nightly jobs
Event:

  User runs "sudo -u backuppc BackupPC_serverMesg BackupPC_nightly
  run".  This causes RunNightlyWhenIdle to be set to 1, but before
  that hits the while, the jobs are actually queued, *USING
  unshift*, which puts them at the front of the queue.  This is
  where things start to go horribly wrong.

CmdQueue State: 8 nightly jobs, *THEN* Several link jobs
CmdJob: non-empty; "admin7" or similar
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *TRUE*
While State: True, branch 2
Running Job State: 8 nightly jobs
Event:

  *Pop* a job from the queue.  This means that even though the
  *test* is for the job from the *front* of the queue, the job that
  actually gets handled is the job at the *end* of the queue.

  So, the last job on the queue, a link job, gets run.  THIS SHOULD
  NEVER HAPPEN, as I understand it, because nightly jobs (the first
  set) are still running.  The link job sets CmdJob, but that
  doesn't matter because we're going through the *second* branch of
  the while, which doesn't care about CmdJob.  So, it happily
  launches another link job:
  
CmdQueue State: 8 nightly jobs, then N-1 link jobs
CmdJob: hostname non-empty from the last link job
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
While State: True, branch 2
Running Job State: 8 nightly jobs, 1 link job
Event:

  Runs the next link job.  And all the others.  We end up with *all*
  link jobs running at once.  THIS SHOULD NEVER HAPPEN; CmdQueue is
  supposed to be one at a time.

  But wait, it gets better!

  When each link job starts, it sets $CmdJob to its own host name;
  this means that at the end of the run through the queue, it's set
  to the last link job that ran, like this:

CmdQueue State: 8 nightly jobs
CmdJob: hostname from the last link job
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
While State: True, branch 2
Running Job State: 8 nightly jobs, N link jobs
Event:

  So, from here, it tries to run the last (remember, pop) nightly
  job, but it can't, because any given nightly segment can only run
  one at a time, because they are named to prevent duplicates
  (leading to the "Botch on admin job for admin7 : already in use!!"
  log messages: that means the 8th nightly job is running, so you
  can't start it again).

  Having failed to run the nightly job, it unshifts it onto the
  front of the CmdQueue.

  It runs through all the queued nightly jobs in this way.

  Eventually, a link job finishes.

CmdQueue State: 8 nightly jobs
CmdJob: hostname from the last link job
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
While State: True, branch 2
Running Job State: 8 nightly jobs, N-1 link jobs
Event:

  When each link job finshes, this test runs:

    if ( $CmdJob eq $host || $bpc->isAdminJob($host) ) {

  This will only match the last link job.  If it doesn't match, the
  host is tested for whether it needs linking, and if so, it
  requeues the (already completed) link, *at the front of the
  queue*.  This happens for all but the last link job.  This means
  the thing at the front of the queue isn't a nightly job, so branch
  2 of the while fails, which would restore some kind of order.

  The kicker is that the last job, the one that passes that if
  above, *clears CmdJob*, which means the first branch *passes*.

CmdQueue State: N-1 link jobs, 8 nightly jobs
CmdJob: *EMPTY*
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *FALSE*
While State: True, *BRANCH ONE*
Running Job State: 8 nightly jobs
Event:

  It pops *the last job*, a nightly, which fails, gets requeud *at
  the front*, without setting CmdJob, and now we're back to Branch 2
  passing.  This happens with all the nightly jobs, and then it's
  ready to re-run all the (already completed once) link jobs:

CmdQueue State: 8 nightly jobs, N-1 link jobs
CmdJob: Empty
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *TRUE*
While State: True, *BRANCH TWO*
Running Job State: 8 nightly jobs
Event:
  
  Runs a link job.

CmdQueue State: 8 nightly jobs, N-2 link jobs
CmdJob: Empty
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *TRUE*
While State: True, *BRANCH TWO*
Running Job State: 8 nightly jobs, 1 link job
Event:

  This keeps happening until all the link jobs have been completed
  (meaning that the last link job to finish has been run a total of
  N times):
  
CmdQueue State: 8 nightly jobs
CmdJob: Empty
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
Running Job State: 8 nightly jobs
While State: True, branch 2



Here's a log snippet of all this craziness:

  2010-11-24 07:29:45 Running BackupPC_link host00002 (pid=26391)
  2010-11-24 07:29:45 Running BackupPC_link host00005 (pid=26392)
  2010-11-24 07:29:45 Running BackupPC_link hostother1 (pid=26393)
  2010-11-24 07:29:45 Running BackupPC_link host00006 (pid=26394)
  2010-11-24 07:29:45 Running BackupPC_link host00011 (pid=26395)
  2010-11-24 07:29:45 Running BackupPC_link host00000 (pid=26396)
  2010-11-24 07:29:45 Running BackupPC_link host00003 (pid=26397)
  2010-11-24 07:29:45 Running BackupPC_link host00004 (pid=26398)
  2010-11-24 07:29:45 Botch on admin job for  admin7 : already in use!!
  2010-11-24 07:29:45 Finished host00004 (BackupPC_link host00004)
  2010-11-24 07:29:45 Botch on admin job for  admin : already in use!!
  2010-11-24 07:29:47 Botch on admin job for  admin1 : already in use!!
  2010-11-24 07:29:47 Botch on admin job for  admin2 : already in use!!
  2010-11-24 07:29:47 Botch on admin job for  admin3 : already in use!!
  2010-11-24 07:29:47 Botch on admin job for  admin4 : already in use!!
  2010-11-24 07:29:47 Botch on admin job for  admin5 : already in use!!
  2010-11-24 07:29:57 Botch on admin job for  admin6 : already in use!!
  2010-11-24 07:29:57 Botch on admin job for  admin7 : already in use!!
  2010-11-24 07:29:57 Running BackupPC_link host00002 (pid=26411)
  2010-11-24 07:29:57 Running BackupPC_link host00005 (pid=26412)
  2010-11-24 07:29:57 Running BackupPC_link hostother1 (pid=26413)
  2010-11-24 07:29:57 Running BackupPC_link host00006 (pid=26414)
  2010-11-24 07:29:57 Running BackupPC_link host00011 (pid=26415)
  2010-11-24 07:29:57 Running BackupPC_link host00003 (pid=26416)
  2010-11-24 07:29:57 Running BackupPC_link host00000 (pid=26417)
  2010-11-24 07:29:57 Botch on admin job for  admin : already in use!!
  2010-11-24 07:29:57 Finished host00000 (BackupPC_link host00000)
  2010-11-24 07:29:57 Botch on admin job for  admin1 : already in use!!
  2010-11-24 07:29:57 Botch on admin job for  admin2 : already in use!!
  2010-11-24 07:29:59 Botch on admin job for  admin3 : already in use!!
  2010-11-24 07:29:59 Botch on admin job for  admin4 : already in use!!
  2010-11-24 07:29:59 Botch on admin job for  admin5 : already in use!!
  2010-11-24 07:29:59 Botch on admin job for  admin6 : already in use!!
  2010-11-24 07:29:59 Botch on admin job for  admin7 : already in use!!
  2010-11-24 07:30:18 Botch on admin job for  admin : already in use!!
  2010-11-24 07:30:18 Running BackupPC_link host00002 (pid=26453)
  2010-11-24 07:30:18 Running BackupPC_link host00005 (pid=26454)
  2010-11-24 07:30:18 Running BackupPC_link hostother1 (pid=26455)
  2010-11-24 07:30:18 Running BackupPC_link host00006 (pid=26456)
  2010-11-24 07:30:18 Running BackupPC_link host00011 (pid=26457)
  2010-11-24 07:30:18 Botch on admin job for  admin1 : already in use!!
  2010-11-24 07:30:18 Running BackupPC_link host00003 (pid=26459)
  2010-11-24 07:30:18 Botch on admin job for  admin2 : already in use!!
  2010-11-24 07:30:18 Botch on admin job for  admin3 : already in use!!
  2010-11-24 07:30:18 Finished host00003 (BackupPC_link host00003)

None of this explains why we end up with only one nightly job
running at a time, but it's background on how very badly messed up
things can get if you do "BackupPC_nightly run".

Here's a repeat of the while, to save you some scrolling:

  while ( $CmdJob eq "" && @CmdQueue > 0 && $RunNightlyWhenIdle != 1
    || @CmdQueue > 0 && $RunNightlyWhenIdle == 2
    && $bpc->isAdminJob($CmdQueue[0]->{host}) ) {

Ok, so, the current state is this:

CmdQueue State: 8 nightly jobs
CmdJob: Empty
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
Running Job State: 8 nightly jobs
While State: True, branch 2

  So it keep running through the queue and spewing "Botch" messages
  to the logs.  Then one of the nightlies *actually finishes*.  This
  means one of the *other* nightlies can actually run, which sets
  CmdJob, which is our prelude to disaster.

CmdQueue State: *SEVEN* nightly jobs
CmdJob: *NON-EMPTY*; "adminN", for some N
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: True
Running Job State: 8 nightly jobs
While State: True, branch 2

  And it goes back to cycling through the nightlies and "Botch"ing.

  But eventually an actual backup finishes, and a link job gets
  queued *to the front*.  The while is now false:

CmdQueue State: 1 link job, 7 nightly jobs
CmdJob: "adminN"
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *FALSE*
Running Job State: 8 nightly jobs
While State: *FALSE*, since isAdminJob is false and $CmdJob is not ""

  At this point, the CmdQueue will not get run until something
  changes, which is the correct behaviour when nightlies are
  running.

  Note that the link job that cause the while to go false does not,
  itself, actually get run.

  The nightlies continue to complete.  Finishing a nightly DOES NOT
  chang CmdJob, so this state continues until every nightly job
  currently running has completed.  Note that what's currently
  running is 7 of the original nightly jobs, and one from the second
  round.

  When there are 0 nightly jobs running, BackupPC considers the run
  of nightlies to be complete.  At this point, it sets
  RunNightlyWhenIdle to 0 (i.e. "run normally") and sets CmdJob to
  "", so we have this:

CmdQueue State: 7 nightly jobs
CmdJob: ""
RunNightlyWhenIdle: 0
isAdminJob Matches First Job: *TRUE*
Running Job State: *EMPTY*
While State: *TRUE*, *BRANCH ONE* (CmdJob empty, and suddenly
  RunNightlyWhenIdle isn't 2 anymore)
  
  Since the while is true, it launches the last job on the queue, a
  nightly job, which leads to this state:

CmdQueue State: 6 nightly jobs
CmdJob: "adminN"
RunNightlyWhenIdle: 0
isAdminJob Matches First Job: True
Running Job State: 1 nightly job
While State: *FALSE*; RunNightlyWhenIdle isn't 2, so branch 2 can't
  be true, and CmdJob isn't "", so branch 1 can't be true
  
  No more CmdQueue checking, until the current nightly job ends.  At
  that point, there are 0 nightly jobs running, which makes BackupPC
  thing everything is done, so it reports like everything is done,
  and sets CmdJob to "", which puts us back to one state above.
  Repeat until all nightly jobs are *actually* done, which leaves us
  in a proper state.

The really bad part: *NO LINK JOBS RUN IN THIS STATE*.

On my large hosts, each nightly job, run by itself, takes *ten or
more hours*.

This means that once we've entered this state, we've got *at least*
70 hours with no link jobs.  A host will not run a new backup until
the link from the last backup finishes.  This means that it's very
easy for hosts to become 2 or 3 days out of date.

The daily nightlies run tries to fix it, but it doesn't actually
work.  When the daily nightly run kicks off, we end up with
nightlies running, nightlies in the queue, and RunNightlyWhenIdle
set to 2, which is the state we basically started in way above; it's
only a matter of time before it gets stuck again.  In my experience,
in that case, it actually gets stuck again within a matter of
seconds: 

  2010-11-25 01:00:00 Running BackupPC_nightly 224 255 (pid=25383)
  2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 2 ; new 
BackupPCNightlyLock counter: 2
  2010-11-25 01:00:00 Running BackupPC_nightly -m 0 31 (pid=25384)
  2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 3 ; new 
BackupPCNightlyLock counter: 3
  2010-11-25 01:00:00 Running BackupPC_nightly 32 63 (pid=25385)
  2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 4 ; new 
BackupPCNightlyLock counter: 4
  2010-11-25 01:00:00 Running BackupPC_nightly 64 95 (pid=25386)
  2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 5 ; new 
BackupPCNightlyLock counter: 5
  2010-11-25 01:00:00 Running BackupPC_nightly 96 127 (pid=25387)
  2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 6 ; new 
BackupPCNightlyLock counter: 6
  2010-11-25 01:00:00 Running BackupPC_link mc00-s00106 (pid=25388)
  2010-11-25 01:00:00 Running BackupPC_link mc00-s00009-prod-shared-redbubble 
(pid=25389)
  2010-11-25 01:00:00 Running BackupPC_link 
mavenlink--tm25-e00053--tm25-s00106---db-backup (pid=25390)
  2010-11-25 01:00:00 Running BackupPC_link 
provolvesolutions--tm25-e00080--tm25-s00157---data (pid=25391)
  2010-11-25 01:00:00 Running BackupPC_link tm25-s00090__nfs (pid=25392)
  2010-11-25 01:00:00 Running BackupPC_link 
provolvesolutions--tm25-e00079--tm25-s00155---db-backup (pid=25393)
  2010-11-25 01:00:00 Running BackupPC_link 
mavenlink--tm25-e00054--tm25-s00107---data (pid=25394)
  2010-11-25 01:00:00 Running BackupPC_link 
howcast--tm25-e00076--tm25-s00149---db-backup (pid=25395)
  2010-11-25 01:00:00 Running BackupPC_link 
brighthouse--tm25-e00039--tm25-s00083---db-backup (pid=25396)
  2010-11-25 01:00:00 Running BackupPC_link 
howcast--tm25-e00071--tm25-s00134---db-backup (pid=25397)
  2010-11-25 01:00:00 Botch on admin job for  admin : already in use!!
  2010-11-25 01:00:00 Next wakeup is 2010-11-25 02:00:00
  2010-11-25 01:00:00 Botch on admin job for  admin1 : already in use!!
  2010-11-25 01:00:01 Botch on admin job for  admin2 : already in use!!
  2010-11-25 01:00:02 Finished howcast--tm25-e00071--tm25-s00134---db-backup 
(BackupPC_link howcast--tm25-e00071--tm25-s00134---db-backup)
  2010-11-25 01:00:02 Botch on admin job for  admin3 : already in use!!
  2010-11-25 01:00:07 Running BackupPC_nightly 128 159 (pid=25414)
  2010-11-25 01:00:07 New BackupPCNightlyJobs counter: 7 ; new 
BackupPCNightlyLock counter: 7

So *some of* the nightlies ran, and set RunNightlyWhenIdle to 2,
which caused all the links to kick off for reasons explained above,
leaving the queue with just the nightly jobs.

The non-last links finish, leaving a link job at the front of the
queue.

Then a link finishes, leaving CmdJob empty.

Then another nightly job runs, setting CmdJob, so we end up with:

CmdQueue State: some number of nightly jobs
CmdJob: "adminN"
RunNightlyWhenIdle: 2
isAdminJob Matches First Job: *FALSE*
Running Job State: some number of nightly job
While State: *FALSE*

And round and round she goes.  Total progress in 24 hours from
BackupPC's point of view: *one link job*.

It can keep doing this *forever*, as each 24-hour-ly run of
nightlies restarts the problem, unless all the sequential nightlies
can finish within 24 hours, which is *most definitely* not the case
here.

      SOLUTION(S).  TOTALLY UNTESTED.

1.  Don't run a non-nightly job from the CmdQueue when there are
nightly job running, *EVER*.  (my copy is locally modified with
debugging stuff to try to figure out these issues, so if this diff
doesn't patch, that's why, sorry).

--- /var/tmp/BackupPC   2010-11-25 01:52:30.000000000 +0000
+++ /usr/local/bin/BackupPC     2010-11-25 02:00:03.000000000 +0000
@@ -522,6 +522,15 @@
         $req = pop(@CmdQueue);

         $host = $req->{host};
+
+       if( $BackupPCNightlyJobs > 0 && ! $bpc->isAdminJob($host) )
+       {
+            print(LOG $bpc->timeStamp, "Tried to run ".$req->{cmd}." on $host 
when there are nightlies running.  That's bad.\n");
+
+            unshift(@CmdQueue, $req);
+            return;
+       }
+
         if ( defined($Jobs{$host}) ) {
             print(LOG $bpc->timeStamp,
                        "Botch on admin job for $host: already in use!!\n");


2.  That means that if nightlies are running, and you do a
"BackupPC_nightly run", it'll queue them, 8 at a time total, behind
the previous ones, which means the whole run will take about twice
as long as it would have otherwise.  I don't know about you, but
that's certainly not what *I* expect when I launch a nightly run; I
expect all the old ones to die in favour of new ones:

@@ -1362,7 +1371,19 @@
             } elsif ( $cmd =~ /^backup all$/ ) {
                 QueueAllPCs();
             } elsif ( $cmd =~ /^BackupPC_nightly run$/ ) {
+                print(LOG $bpc->timeStamp,
+                 "Running nightlies at user request.\n" );
+               foreach my $host (keys %Jobs) {
+                 if( $bpc->isAdminJob( $host ) ) {
+                   my $pid = $Jobs{$host}{pid};
+                   kill($bpc->sigName2num("INT"), $pid);
+                    print(LOG $bpc->timeStamp,
+                     "Killing nightly job $host with PID $pid to make way for 
manual run.\n");
+                 }
+               }
                 $RunNightlyWhenIdle = 1;
+               $BackupPCNightlyJobs = 0;
+               $BackupPCNightlyLock = 0;
             } elsif ( $cmd =~ /^backup (\S+)\s+(\S+)\s+(\S+)\s+(\S+)/ ) {
                 my $hostIP     = $1;
                 $host          = $2;

-Robin

-- 
http://singinst.org/ :  Our last, best hope for a fantastic future.
Lojban (http://www.lojban.org/): The language in which "this parrot
is dead" is "ti poi spitaki cu morsi", but "this sentence is false"
is "na nei".   My personal page: http://www.digitalkingdom.org/rlp/

------------------------------------------------------------------------------
Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
Tap into the largest installed PC base & get more eyes on your game by
optimizing for Intel(R) Graphics Technology. Get started today with the
Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
http://p.sf.net/sfu/intelisp-dev2dev
_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

Reply via email to