Re: [BackupPC-users] Failing incrementals and fulls resolved by moving aside prior full?

2010-09-21 Thread John Rouillard
On Fri, Sep 17, 2010 at 02:02:10PM -0500, Les Mikesell wrote:
> On 9/17/2010 1:10 PM, John Rouillard wrote:
>> I mention this since there seem to have been a few other mentions of
>> hangs over the years and this may get somebody past the problem.
>>
>> Also I am hoping that somebody can figure out what is happening
>> here. It seems that some state in the prior (reference) backup is
>> causing the rsync protocol to stall.
>>
>> So anybody with a bright idea of what I can try looking at?
>
> I think most of the stalls were either buggy Cygwin/Windows versions or
> some stateful firewall/NAT device in the path that times out and
> breaks the connection between the hosts during the long idle periods you
> might have in a backup run with mostly identical files. If neither of
> these is possible, maybe you have filesystem corruption of some kind.

All good ideas. They also remind me that I forgot to supply some info
this time around.

This is CentOS 5.5 to CentOS 5.5, with kernel 2.6.18-194.3.1.el5 on the
server and 2.6.16-xenU on the client. However, this has happened in the
past with the same 2.6.18 kernel on both (real) boxes. This also
eliminates the whole Windows morass.

The backup is occurring over a VPN without any firewalls/NAT. We also
have ServerAlive messages enabled every 30 seconds for the ssh session
(because the routes to some of the hosts we back up do have stateful
firewalls in place). I can see ssh traffic with tcpdump while the
rsync is stalled, which tells me that the network/ssh layer is fine and
the problem is in the rsync protocol exchange itself.
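When the transfer hangs, one cheap check is the one the original report made with lsof: which descriptors the client rsync still has open (fds 0, 1 and 2 but no fd 3 means it is not reading any file). A minimal Python sketch of the same check, assuming a Linux client with /proc mounted and enough privilege (usually root) to read another process's `/proc/<pid>/fd`; the function name `open_fds` is hypothetical:

```python
import os


def open_fds(pid: int) -> dict[int, str]:
    """Return {fd: target} for a process, read from /proc/<pid>/fd (Linux only).

    Hypothetical helper that mirrors the lsof check: a stalled client rsync
    with fds 0, 1 and 2 but no fd 3 is not currently reading a file.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the open file, pipe or socket.
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except FileNotFoundError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result
```

Usage: call `open_fds(pid)` with the PID of the stalled client rsync (for example from `pgrep rsync`) and test `3 in open_fds(pid)`. Reading another user's process raises `PermissionError`, which this sketch deliberately does not hide.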

Disk corruption isn't impossible. However, the filesystem is a 4.5TB
ext3 on top of two software (md) RAID 6 arrays of 7 disks each, which
are striped (RAID 0) together. Forcing fscks in the past hasn't turned
up any issues (but does take backups offline for a long bit 8-(). The
arrays are scrubbed weekly and disk self-tests (using smartctl) are
run monthly.
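Weekly scrubs only help if someone looks at the result. On Linux md, a "check" scrub (started by writing `check` to `/sys/block/mdN/md/sync_action`) records the number of sectors found inconsistent in `/sys/block/mdN/md/mismatch_cnt`. A minimal sketch that collects those counters, assuming the standard sysfs layout; arrays without a `mismatch_cnt` file (such as a RAID 0 stripe) are skipped, and the function name is hypothetical:

```python
from pathlib import Path


def md_mismatch_counts(sys_block: str = "/sys/block") -> dict[str, int]:
    """Return {md device: mismatch_cnt} for every md array under sys_block.

    Hypothetical helper. mismatch_cnt is filled in by a "check" scrub; a
    non-zero value on a RAID 6 array after a scrub is worth investigating.
    """
    counts = {}
    for dev in sorted(Path(sys_block).glob("md*")):
        cnt_file = dev / "md" / "mismatch_cnt"
        # Levels without redundancy (e.g. RAID 0) have no mismatch_cnt file.
        if cnt_file.is_file():
            counts[dev.name] = int(cnt_file.read_text().strip())
    return counts
```

The `sys_block` parameter exists only so the function can be pointed at a copy of the tree; on a live system the default is what you want.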

When I have increased the logging level in the past to try to diagnose
this, no obvious errors popped up. The backup proceeded normally until
it just kind of stopped, and then there was the SIGALRM notice.
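The SIGALRM notice is what an inactivity timer looks like from the outside: the reader waits for input, nothing arrives within the limit, and the transfer is aborted (in BackupPC the limit is set by `$Conf{ClientTimeout}`). The sketch below is a generic Python illustration of that pattern using `select`, not BackupPC's own Perl code, and the function name is hypothetical. The timer restarts after every chunk, so a slow but steady transfer survives while a peer that has stopped sending is detected, which is consistent with the symptom here: the ssh layer stays alive but the rsync stream goes quiet.

```python
import os
import select


def read_with_idle_timeout(fd: int, idle_timeout: float, chunk: int = 65536) -> bytes:
    """Read fd until EOF; raise TimeoutError if no data arrives for idle_timeout seconds.

    Hypothetical helper: the idle timer restarts after every chunk received.
    """
    data = bytearray()
    while True:
        # Block until fd is readable or the idle period expires.
        ready, _, _ = select.select([fd], [], [], idle_timeout)
        if not ready:
            raise TimeoutError(f"no data for {idle_timeout}s after {len(data)} bytes")
        buf = os.read(fd, chunk)
        if not buf:  # EOF: the writer closed its end
            return bytes(data)
        data.extend(buf)
```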

-- 
-- rouilj

John Rouillard   System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

--
Start uncovering the many advantages of virtual appliances
and start using them to simplify application deployment and
accelerate your shift to cloud computing.
http://p.sf.net/sfu/novell-sfdev2dev
___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/


Re: [BackupPC-users] Failing incrementals and fulls resolved by moving aside prior full?

2010-09-17 Thread Les Mikesell
On 9/17/2010 1:10 PM, John Rouillard wrote:
> Hi all:
>
> I have been having an issue with one of my systems for the past 4
> days. It backs up / and /var/bak, and gets the directory tree for
> /var/log, but it never transfers any files. It looks like the rsync
> just stalls in a repeating select (according to strace) on file
> descriptor 1 (the input from the BackupPC server), as if it's waiting
> for a command.
>
> I tried starting a new full backup, which has gotten me around this
> problem before, but it stalled at the same point after a couple
> of tries. Also, when it stalls, the client rsync has no open file
> descriptor 3 (the file being backed up) but does have fds 0, 1 and 2
> open (according to lsof).
>
> The incremental and full backups were using backup #90 as their
> reference. This morning I moved backup 90 out of the way (to 90.aside)
> and started a full backup. At this point it is running through the
> files in the /var/log/ tree and has been going strong for
> a couple of hours.
>
> I mention this since there seem to have been a few other mentions of
> hangs over the years and this may get somebody past the problem.
>
> Also I am hoping that somebody can figure out what is happening
> here. It seems that some state in the prior (reference) backup is
> causing the rsync protocol to stall.
>
> So anybody with a bright idea of what I can try looking at?

I think most of the stalls were either buggy Cygwin/Windows versions or
some stateful firewall/NAT device in the path that times out and
breaks the connection between the hosts during the long idle periods you
might have in a backup run with mostly identical files. If neither of
these is possible, maybe you have filesystem corruption of some kind.

-- 
   Les Mikesell
lesmikes...@gmail.com


