Re: [BackupPC-users] Failing incrementals and fulls resolved by moving aside prior full?
On Fri, Sep 17, 2010 at 02:02:10PM -0500, Les Mikesell wrote:
> On 9/17/2010 1:10 PM, John Rouillard wrote:
>> I mention this since there seem to have been a few other mentions of hangs over the years and this may get somebody past the problem. Also I am hoping that somebody can figure out what is happening here. It seems that some state in the prior (reference) backup is causing the rsync protocol to stall. So anybody with a bright idea of what I can try looking at?
>
> I think most of the stalls were either buggy cygwin/windows versions or some stateful firewall/nat network device in the path that times out and breaks the connection between devices in the long idle times you might have in a backup run with mostly-identical files. If neither of these is possible, maybe you have filesystem corruption of some kind.

All good ideas. They also remind me that I forgot to supply some info this time around.

This is centos 5.5 to centos 5.5 with kernel 2.6.18-194.3.1.el5 on the server and 2.6.16-xenU on the client. However this has happened in the past with the same 2.6.18 kernel on both (real) boxes. Also this eliminates the whole windows morass.

The backup is occurring over a vpn w/o any firewalls/nat. Also we have ServerAlive messages enabled every 30 seconds for the ssh session (because the routes to some of the hosts we back up do have stateful firewalls in place). I can see ssh traffic using tcpdump when the rsync is stalled, which tells me that the network/ssh layer is fine and the rsync protocol is wacky.

Disk corruption isn't impossible. However the filesystem is a 4.5TB ext3 on top of 2 software (md) raid 6 arrays with 7 disks that are striped (raid 0) together. Forcing an fsck in the past hasn't turned up any issues (but does take backups offline for a long bit 8-(). The arrays are scrubbed weekly and disk selftests (using smartctl) are done monthly.

When I have increased the logging level in the past to try to diagnose this, no obvious errors popped up.
It proceeded normally until it just kind of stopped. Then there was the SIGALRM notice.

-- 
-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111

___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
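For readers who want to reproduce the keepalive setup mentioned above, here is a minimal sketch of building an ssh command line that sends a ServerAlive probe every 30 seconds. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options; the function name, the host name, and the count of 3 are assumptions for illustration and are not taken from the poster's configuration.

```python
from typing import List


def ssh_keepalive_argv(host: str, remote_cmd: List[str],
                       interval_s: int = 30, count_max: int = 3) -> List[str]:
    """Build an ssh argv that sends a keepalive probe every `interval_s` seconds.

    Hypothetical helper: only the 30 second interval comes from the thread.
    """
    if interval_s <= 0 or count_max <= 0:
        raise ValueError("interval_s and count_max must be positive")
    return [
        "ssh",
        # Probe the server through the encrypted channel when the link is idle.
        "-o", f"ServerAliveInterval={interval_s}",
        # Give up after this many unanswered probes (assumed value).
        "-o", f"ServerAliveCountMax={count_max}",
        host,
        *remote_cmd,
    ]


if __name__ == "__main__":
    # "client.example.com" is a placeholder host name.
    print(" ".join(ssh_keepalive_argv("client.example.com", ["rsync", "--version"])))
```

Because these probes travel inside the ssh session, seeing them in tcpdump during a stall is consistent with the reasoning above: the transport is alive, so the hang is more likely in the rsync exchange itself.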
Re: [BackupPC-users] Failing incrementals and fulls resolved by moving aside prior full?
On 9/17/2010 1:10 PM, John Rouillard wrote:
> Hi all:
>
> I have been having an issue with one of my systems for the past 4 days. It backs up /, /var/bak and gets the directory tree for /var/log, but it never transfers any files. It looks like the rsync just stalls with a repeating select (according to strace) on file descriptor 1 (the input from the backuppc server) like it's waiting for a command.
>
> I tried starting a new full backup, which has gotten me around this problem before, but it was stalling at the same point after a couple of trials. Also when it stalls, the client rsync has no open file descriptor 3 (the file being backed up) but does have fd 0, 1 and 2 open (according to lsof).
>
> The incremental and full backups were using backup #90 as their reference. This morning I moved backup 90 out of the way (to 90.aside) and started a full backup. At this point it is running through the files in the /var/log/ tree and has been going strong for a couple of hours.
>
> I mention this since there seem to have been a few other mentions of hangs over the years and this may get somebody past the problem. Also I am hoping that somebody can figure out what is happening here. It seems that some state in the prior (reference) backup is causing the rsync protocol to stall. So anybody with a bright idea of what I can try looking at?

I think most of the stalls were either buggy cygwin/windows versions or some stateful firewall/nat network device in the path that times out and breaks the connection between devices in the long idle times you might have in a backup run with mostly-identical files. If neither of these is possible, maybe you have filesystem corruption of some kind.

-- 
Les Mikesell
lesmikes...@gmail.com
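The lsof check described in the original report (fd 0, 1 and 2 open, no fd 3) can be approximated on Linux by reading `/proc/<pid>/fd`. The sketch below is one way to do that, not the poster's procedure: the function names are hypothetical, it assumes a Linux `/proc` filesystem, and it needs permission to inspect the target process (same user or root).

```python
import os
from typing import Dict


def open_fds(pid: int) -> Dict[int, str]:
    """Map each open file descriptor of `pid` to the path or object it points at."""
    fd_dir = f"/proc/{pid}/fd"
    fds: Dict[int, str] = {}
    for name in os.listdir(fd_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return fds


def has_data_file_open(fds: Dict[int, str]) -> bool:
    """True if any descriptor beyond stdin/stdout/stderr (0, 1, 2) is open."""
    return any(fd > 2 for fd in fds)


if __name__ == "__main__":
    # Inspect this process as a demo; pass the client rsync's PID in real use.
    for fd, target in sorted(open_fds(os.getpid()).items()):
        print(fd, "->", target)
```

If `has_data_file_open` stays false for the client rsync across repeated checks, that matches the stalled state reported here: the process holds only its three standard descriptors and is not reading any file from disk.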