Re: [Bacula-users] connections timing out

2011-08-16 Thread Geert Stappers
On 2011-08-11 at 22:15, Boudewijn Ector wrote:
 Nothing has changed, and the problem can still be reproduced:
 [...]
 So the director seems to be able to connect to the file daemon, am I
 correct?


See if adding a line like
   Heartbeat Interval = 60  # seconds (0 = off)
to the configuration of both bacula-dir and bacula-fd helps.
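
A minimal sketch of where such a line could go, assuming the usual resource layout of a Bacula 5.x install. The resource names are the ones that appear elsewhere in this thread, and the elided directives are placeholders for whatever is already in the files:

```
# bacula-dir.conf (director host): inside the existing Director resource
Director {
  Name = leiden-dir
  # ... existing directives unchanged ...
  Heartbeat Interval = 60    # seconds; 0 disables heartbeats
}

# bacula-fd.conf (each client): inside the existing FileDaemon resource
FileDaemon {
  Name = www.boudewijnector.nl-fd
  # ... existing directives unchanged ...
  Heartbeat Interval = 60    # seconds; 0 disables heartbeats
}
```

Each daemon has to be restarted before the change takes effect. The storage daemon accepts the same directive in its Storage resource, which may be worth trying if the connection that drops turns out to involve the SD.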


Cheers
Geert Stappers
who in the past also had timeout errors on long backups

___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] connections timing out

2011-08-11 Thread Boudewijn Ector
On 07/27/2011 10:31 AM, Pietro Bertera wrote:
 2011/7/26 Boudewijn Ector <boudew...@boudewijnector.nl>:
 Can someone please point me out where I should start to investigate this
 problem?

 From the internet, I can reach the director and the SD @ the 'leiden'
 system.
 I can reach the FD's at all servers which are to be backed up.
 Does the command status client=xxx in bconsole return everything correctly?

 Regards,

 Pietro

Hi Pietro,


Sorry for the late reply; I've been on holiday.
Nothing has changed, and the problem can still be reproduced:

*status client=www
Connecting to Client www at www.boudewijnector.nl:9102

www.boudewijnector.nl-fd Version: 5.0.2 (28 April 2010)  x86_64-pc-linux-gnu debian squeeze/sid
Daemon started 11-Aug-11 18:22, 1 Job run since started.
  Heap: heap=1,597,440 smbytes=176,189 max_bytes=267,404 bufs=145 max_bufs=279
  Sizeof: boffset_t=8 size_t=8 debug=0 trace=0

Running Jobs:
JobId 293 Job wwwjob.2011-08-11_21.45.14_07 is running.
 Full Backup Job started: 11-Aug-11 21:45
 Files=3,607 Bytes=11,011,930 Bytes/sec=1,101,193 Errors=1
 Files Examined=3,608
 Processing file: /root/home/boudewijn/IMG_9895.JPG
 SDReadSeqNo=5 fd=5
Director connected at: 11-Aug-11 21:45


Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished         Name
======================================================================
   292  Full     93,876    6.468 G  Error    11-Aug-11 20:23  wwwjob

*



So the director seems to be able to connect to the file daemon, am I 
correct?


Cheers,

Boudewijn Ector



[Bacula-users] connections timing out

2011-07-26 Thread Boudewijn Ector
Hi guys,


For my company I've been trying to get bacula up and running properly.
My current situation:

Host 'leiden':

Located at my home, with multiple large (8TB) RAID arrays attached.
It therefore runs bacula-sd and bacula-dir.
100 Mbit download bandwidth.
Running debian testing, bacula version 5.0.3.


Multiple hosts to be backed up, on a 100/100 Mbit connection.
Running debian stable, bacula 5.0.3,
with bacula-fd in its default config.


The complete bacula-dir.conf is located at: http://pastebin.com/8JvCdmL9
Please note that I have replaced all passwords with an X.

Relevant parts are:


Director {                            # define myself
   Name = leiden-dir
   QueryFile = /etc/bacula/scripts/query.sql
   WorkingDirectory = /var/lib/bacula
   PidDirectory = /var/run/bacula
   Maximum Concurrent Jobs = 10
   Password = X # Console password
   Messages = Daemon
   DirAddresses = {
     ip = { addr = 192.168.1.44; port = 9101 }
     ip = { addr = 127.0.0.1; port = 9101 }
   }
}

JobDefs {
   Name = sql-weekly
   Type = Backup
   Level = Incremental
   Client = sql
   FileSet = "Full Set"
   Schedule = WeeklyCycle
   Storage = leiden-filestorage
   Messages = Standard
   Pool = LeidenPool
   Priority = 10
}


JobDefs {
   Name = mail-weekly
   Type = Backup
   Level = Incremental
   Client = mail
   FileSet = "Full Set"
   Schedule = WeeklyCycle
   Storage = leiden-filestorage
   Messages = Standard
   Pool = LeidenPool
   Priority = 10
}


Job {
   Name = sqljob
   JobDefs = sql-weekly
   Write Bootstrap = /var/lib/bacula/sql.bsr
}
Job {
   Name = mailjob
   JobDefs = mail-weekly
   Write Bootstrap = /var/lib/bacula/mail.bsr
}
# Client (File Services) to backup
Client {
   Name = sql
   Address = sql.boudewijnector.nl
   FDPort = 9102
   Catalog = MyCatalog
   Password = X                    # password for FileDaemon
   File Retention = 30 days        # 30 days
   Job Retention = 6 months        # six months
   AutoPrune = yes                 # Prune expired Jobs/Files
}

Client {
   Name = mail
   Address = mail.boudewijnector.nl
   FDPort = 9102
   Catalog = MyCatalog
   Password = X                    # password for FileDaemon
   File Retention = 30 days        # 30 days
   Job Retention = 6 months        # six months
   AutoPrune = yes                 # Prune expired Jobs/Files
}



The current problem is that I get errors on some hosts, such as:


17-Jul 02:52 leiden-dir JobId 94: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
17-Jul 02:52 leiden-dir JobId 94: Fatal error: No Job status returned from FD.
17-Jul 02:52 leiden-dir JobId 94: Error: Bacula leiden-dir 5.0.3 (04Aug10): 17-Jul-2011 02:52:30
   Build OS:               i486-pc-linux-gnu debian wheezy/sid
   JobId:                  94
   Job:                    BLAjob.2011-07-17_00.52.14_10
   Backup Level:           Full (upgraded from Incremental)
   Client:                 client4 5.0.2 (28Apr10) x86_64-pc-linux-gnu,debian,squeeze/sid
   FileSet:                Home Set 2011-07-16 23:49:43
   Pool:                   LeidenPool (From Job resource)
   Catalog:                MyCatalog (From Client resource)
   Storage:                leiden-filestorage (From Job resource)
   Scheduled time:         17-Jul-2011 00:52:13
   Start time:             17-Jul-2011 00:52:16
   End time:               17-Jul-2011 02:52:30
   Elapsed time:           2 hours 14 secs
   Priority:               10
   FD Files Written:       0
   SD Files Written:       137,033
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       3,586,674,915 (3.586 GB)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):         LeidenVol0005
   Volume Session Id:      20
   Volume Session Time:    1310599400
   Last Volume Bytes:      12,025,925,394 (12.02 GB)
   Non-fatal FD errors:    0
   SD Errors:              0
   FD termination status:  Error
   SD termination status:  OK
   Termination:            *** Backup Error ***


When I rerun the job, it also fails after 2 hours. I tried to fix it
this way:


In the Job resource on bacula-dir, I added Max Run Time = 144000 because
it seemed like bacula shut down the connection after 2 hours.
I also changed the keep-alive time on the machine running bacula-dir:

sysctl -w net.ipv4.tcp_keepalive_time=60

When I did so, it failed completely:

   Elapsed time:           15 hours 22 mins 58 secs
   Priority:               10
   FD Files Written:       0
   SD Files Written:       0
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       0 (0 B)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):
   Volume Session Id:      33
   Volume Session Time:    1310599400

That's really bad, my router did not detect any traffic at all except 
for some