Hi,

A while back[1], I was having an issue with one machine failing on full backups, erroring with "Network send error to SD" in the log. This particular job would bomb at around 3 hours into the backup. I suspected the network at the first sign of a problem, but was unable to pinpoint the specific cause of the failures. Since the original problem, I've upgraded all my machines from 2.4.3 to 5.0.0.
Recently, this machine was moved to a different colo, so I scheduled a full backup after enabling logging with '-d 150' on the FD, the same debug level I was using previously, in hopes I would not need to refer to the log. Unfortunately I did, and I'm not quite sure I understand the error. Here is what I see in the debug output:

------------------------------------------------------------------------
servername-fd: heartbeat.c:91-0 Got BNET_SIG -4 from SD
servername-fd: heartbeat.c:96-0 wait_intr=1 stop=1
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: backup.c:1023-14662 Send data to SD len=65536
servername-fd: heartbeat.c:142-14662 Send kill to heartbeat id
servername-fd: backup.c:211-14662 end blast_data ok=0
servername-fd: job.c:1626-14662 Error in blast_data.
servername-fd: job.c:276-14662 Quit command loop. Canceled=1
servername-fd: job.c:303-14662 End FD msg: Jmsg \
    Job=servername.2010-04-19_09.42.48_34 type=3 \
    level=1271721175 servername-fd \
    JobId 14662: Fatal error: backup.c:1019 \
    Network send error to SD. ERR=Broken pipe
servername-fd: job.c:382-14662 Calling term_find_files
servername-fd: job.c:385-14662 Done with term_find_files
servername-fd: jcr.c:183-14662 write_last_jobs seek to 188
servername-fd: job.c:387-0 Done with free_jcr
------------------------------------------------------------------------

Though I found information on bnet_sig()[2], I am not clear on how it would receive '-4', and more importantly, what that signal is telling the FD.

I found one post from December[3] with the same error, though it suggests the NIC had fallen asleep due to power saving features, which is not the case here. More importantly, this backup chewed through over 35GB before it got to this point.

Any insight would be appreciated.
[1] - http://adsm.org/lists/html/Bacula-users/2010-02/msg00532.html
[2] - http://oss.org.cn/man/network/bacula/bacula_dev/TCP_IP_Network_Protocol.html#SECTION000188000000000000000
[3] - http://www.mail-archive.com/bacula-users@lists.sourceforge.net/msg38708.html

-- 
Glen Barber