Hi,

A while back[1], I was having an issue with one machine failing on full
backups, erroring with "Network send error to SD" in the log.  This
particular job would bomb at around 3 hours into the backup.  I suspected
the network at the first sign of a problem, but was unable to pinpoint the
specific cause of the failures.  Since the original problem, I've upgraded
all my machines from 2.4.3 to 5.0.0.

Recently, this machine was moved to a different colo, so I scheduled a
full backup after enabling logging with '-d 150' on the FD, the same debug
level I was using previously, in hopes I would not need to refer to the
log.

Unfortunately I did, and I'm not quite sure I understand the error.  Here
is what I see in the debug output:

------------------------------------------------------------------------
servername-fd: heartbeat.c:91-0 Got BNET_SIG -4 from SD
servername-fd: heartbeat.c:96-0 wait_intr=1 stop=1
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: heartbeat.c:142-14662 Send kill to heartbeat id
servername-fd: backup.c:211-14662 end blast_data ok=0
servername-fd: job.c:1626-14662 Error in blast_data.
servername-fd: job.c:276-14662 Quit command loop. Canceled=1
servername-fd: job.c:303-14662 End FD msg: Jmsg \
        Job=servername.2010-04-19_09.42.48_34 type=3 \
        level=1271721175 servername-fd \
        JobId 14662: Fatal error: backup.c:1019 \
        Network send error to SD.  ERR=Broken pipe
servername-fd: job.c:382-14662 Calling term_find_files
servername-fd: job.c:385-14662 Done with term_find_files
servername-fd: jcr.c:183-14662 write_last_jobs seek to 188
servername-fd: job.c:387-0 Done with free_jcr
------------------------------------------------------------------------

Though I found information on bnet_sig()[2], I am not clear on why the FD
would receive '-4' from the SD, and more importantly, what that signal is
telling the FD.
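
For reference, here is how I currently understand the framing described
in [2]: each packet is preceded by a 4-byte length in network byte order,
and a negative value in that field is a signal rather than a data length.
The sketch below only illustrates that reading.  The function name
classify_header is my own and is not anything from the Bacula source, and
I am deliberately not mapping '-4' to a signal name since I have not
confirmed it.

```python
import struct


def classify_header(header: bytes):
    """Classify a 4-byte packet header as a data length or a signal.

    Assumption (my reading of [2]): the header is a signed 32-bit
    integer in network byte order, and negative values are signals.
    """
    if len(header) != 4:
        raise ValueError("header must be exactly 4 bytes")
    # "!i" = network (big-endian) byte order, signed 32-bit integer
    (length,) = struct.unpack("!i", header)
    if length < 0:
        return ("signal", length)
    return ("data", length)


# The two header values that appear in the debug output above
print(classify_header(struct.pack("!i", 65536)))
print(classify_header(struct.pack("!i", -4)))
```

If that reading is right, the 'len=65536' lines are ordinary data packets
and the '-4' is a control message from the SD, but I still don't know what
that message means.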

I found one post from December [3] with the same error, though it suggests
the NIC had fallen asleep due to power saving features, which is not the
case here.  Also, this backup had already transferred over 35 GB before it
failed.

Any insight would be appreciated.

[1] - http://adsm.org/lists/html/Bacula-users/2010-02/msg00532.html
[2] - http://oss.org.cn/man/network/bacula/bacula_dev/TCP_IP_Network_Protocol.html#SECTION000188000000000000000
[3] - http://www.mail-archive.com/bacula-users@lists.sourceforge.net/msg38708.html

-- 
Glen Barber
