Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-22 Thread Venkatesh Srinivas
    The memory use can be bounded with some additional work on the software,
    if someone wants to have a go at it.  Basically the way you limit memory
    use is by dynamically limiting the CRC range that you observe in a pass.
    As you reach a self-imposed memory limit you reduce the CRC range and
    throw away out-of-range records.  Once the pass is done you start a new
    pass with the remaining range.  Rinse, repeat until the whole thing is
    done.

    That would make it possible to run de-dup with bounded memory.  However,
    the extra I/O's required to verify duplicate data cannot be avoided.

Currently the dedup code (in sbin/hammer/cmd_dedup.c) kicks off in
hammer_cmd_dedup(); scan_pfs() calls process_btree_elm() for every
data record in the B-Tree. There is an RB tree constructed of data
records, keyed on their CRCs. process_btree_elm() has an easy job --
for every new record, it checks for a matching CRC in the tree; if it
finds one, it attempts a dedup ioctl [the kernel performs a full block
comparison, don't worry].
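
The flow described above can be sketched as a toy, user-space C program. This is only an illustration under stated assumptions: a flat array stands in for the CRC-keyed RB tree, memcmp() stands in for the kernel's full block comparison, and the names dedup_scan, try_dedup, crc_index, BLOCK_SIZE and MAX_ENTRIES are hypothetical and do not come from cmd_dedup.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8    /* toy block size, chosen only for the example */
#define MAX_ENTRIES 64  /* toy index capacity (hypothetical limit) */

struct record {
    uint32_t crc;                    /* checksum stored with the record */
    unsigned char data[BLOCK_SIZE];  /* the data block itself */
};

/* Stand-in for the CRC-keyed RB tree: first-seen record per CRC. */
struct crc_index {
    const struct record *entry[MAX_ENTRIES];
    size_t count;
};

static const struct record *
index_lookup(const struct crc_index *idx, uint32_t crc)
{
    for (size_t i = 0; i < idx->count; i++)
        if (idx->entry[i]->crc == crc)
            return idx->entry[i];
    return NULL;
}

/* Stand-in for the dedup ioctl: succeeds only if the blocks are identical. */
static int
try_dedup(const struct record *a, const struct record *b)
{
    return memcmp(a->data, b->data, BLOCK_SIZE) == 0;
}

/* One scan over all records; returns how many were deduplicated. */
size_t
dedup_scan(const struct record *recs, size_t n)
{
    struct crc_index idx = { .count = 0 };
    size_t deduped = 0;

    for (size_t i = 0; i < n; i++) {
        const struct record *de = index_lookup(&idx, recs[i].crc);

        if (de != NULL) {
            /* Matching CRC: a CRC collision fails the full comparison. */
            if (try_dedup(de, &recs[i]))
                deduped++;
        } else if (idx.count < MAX_ENTRIES) {
            /* New CRC: remember the first record that carries it. */
            idx.entry[idx.count++] = &recs[i];
        }
    }
    return deduped;
}
```

The index grows by one entry per distinct CRC, which is where the memory use of a single pass comes from.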

There is a really straightforward way to dramatically reduce memory
use by dedup at a time cost -- run a fixed number of passes, each pass
only storing records in the RB tree where (CRC % numpass ==
current_pass). After each pass, clear out the CRC RB tree. This will
not run dedup in bounded space, but it is really straightforward to do
and can result in dramatic memory use reductions (16 passes should
reduce memory use by a factor of 16, for example).

I've done a crude patch to try it, only changing ~20 lines of code in
cmd_dedup.c. Something like this might work (very rough atm):

in hammer_cmd_dedup():
-   scan_pfs(av[0], process_btree_elm);
+
+   for (i = 0; i < npasses; i++) {
+           scan_pfs(av[0], process_btree_elm);
    ...
            assert(RB_EMPTY(&dedup_tree));
    ...
+           passnum++;
+
+   } /* for npasses */
+

in process_btree_elm():
    de = RB_LOOKUP(dedup_entry_rb_tree, &dedup_tree, scan_leaf->data_crc);
    if (de == NULL) {
+           /* only accept new CRCs that belong to the current pass */
+           if (scan_leaf->data_crc % npasses != passnum)
+                   goto end;
+
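
To see the effect of the pass filter outside the hammer source, here is a small self-contained simulation. It is a sketch, not the patch: records are reduced to bare CRCs, a flat array replaces the RB tree, and dedup_fixed_passes, pass_stats and MAX_ENTRIES are hypothetical names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 1024  /* toy capacity for the per-pass CRC set */

struct pass_stats {
    size_t deduped;     /* records that matched an earlier CRC */
    size_t peak_index;  /* largest number of CRCs held in any one pass */
};

static int
set_contains(const uint32_t *set, size_t n, uint32_t crc)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == crc)
            return 1;
    return 0;
}

struct pass_stats
dedup_fixed_passes(const uint32_t *crcs, size_t n, uint32_t npasses)
{
    struct pass_stats st = { 0, 0 };
    uint32_t set[MAX_ENTRIES];

    if (npasses == 0)
        return st;  /* avoid a modulo by zero */

    for (uint32_t passnum = 0; passnum < npasses; passnum++) {
        size_t count = 0;  /* the CRC set is cleared for every pass */

        for (size_t i = 0; i < n; i++) {
            if (set_contains(set, count, crcs[i])) {
                st.deduped++;  /* the real code would try the ioctl here */
                continue;
            }
            /* New CRC: only store it if it belongs to this pass. */
            if (crcs[i] % npasses != passnum)
                continue;
            if (count < MAX_ENTRIES)  /* toy limit: extra CRCs are dropped */
                set[count++] = crcs[i];
        }
        if (count > st.peak_index)
            st.peak_index = count;
    }
    return st;
}
```

With evenly distributed CRCs the peak set size shrinks roughly in proportion to npasses, while the number of scans over the records grows by the same factor.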


To run in bounded space is also possible, but would require a variable
number of passes. Imagine having a fixed number of CRC RB records you
are willing to create each pass, MAXCRC. In each pass, you should keep
accepting blocks with new CRCs into the RB tree until you've accepted
MAXCRC ones. Then you record the highest accepted CRC for that pass
and continue walking the disk, deduping blocks with matching CRCs but
not accepting new ones to the tree. On the next pass, you will accept
records with a CRC higher than the highest one you accepted on the
last pass, up to MAXCRC. Between each pass, you clear the CRC RB tree.

Example:

Let's say I can have two CRCs in my tree (I have an old computer) and
my FS has records with CRCs: [A B C A A B B D C C D D E].
On pass one, I'd store A and B in my RB tree as I see records. I'd
record B as the highest CRC I dedup-ed on pass one; then I'd finish my
disk walk and dedup all As and Bs. On pass two, I'd see a C and then
later a D. I'd dedup C and D blocks on this pass and record D. So
on and on till I've dedup-ed the highest CRC on disk.
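
A bounded-space pass loop along these lines can be sketched as follows. One assumption differs from the description above: instead of stopping at the first MAXCRC CRCs seen, the sketch keeps the MAXCRC lowest in-range CRCs and evicts the highest one when the limit is exceeded (the "reduce the CRC range" idea from the quoted text at the top), so that a low CRC first seen late in the walk is not skipped. All names (dedup_bounded, bounded_stats, TREE_CAP) are hypothetical, records are reduced to bare CRCs, and a sorted array stands in for the RB tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TREE_CAP 64  /* hard upper limit for this toy; maxcrc must not exceed it */

struct bounded_stats {
    unsigned passes;  /* number of scans over the records */
    size_t attempts;  /* dedup attempts; may repeat for CRCs evicted mid-pass */
};

/* Insert crc into the ascending array set[0..*n); returns 0 if already present. */
static int
sorted_insert(uint32_t *set, size_t *n, uint32_t crc)
{
    size_t pos = 0;

    while (pos < *n && set[pos] < crc)
        pos++;
    if (pos < *n && set[pos] == crc)
        return 0;
    for (size_t i = *n; i > pos; i--)
        set[i] = set[i - 1];
    set[pos] = crc;
    (*n)++;
    return 1;
}

struct bounded_stats
dedup_bounded(const uint32_t *crcs, size_t n, size_t maxcrc)
{
    struct bounded_stats st = { 0, 0 };
    uint32_t set[TREE_CAP + 1];  /* one spare slot for insert-then-evict */
    uint64_t lo = 0;             /* lowest CRC accepted in this pass */
    int more = 1;

    assert(maxcrc >= 1 && maxcrc <= TREE_CAP);

    while (more) {
        uint64_t hi = UINT32_MAX;  /* highest CRC still in range */
        size_t count = 0;          /* the set is cleared between passes */

        more = 0;
        st.passes++;
        for (size_t i = 0; i < n; i++) {
            if (crcs[i] < lo)
                continue;          /* handled by an earlier pass */
            if (crcs[i] > hi) {
                more = 1;          /* left for a later pass */
                continue;
            }
            if (!sorted_insert(set, &count, crcs[i])) {
                st.attempts++;     /* matching CRC: would try the dedup ioctl */
                continue;
            }
            if (count > maxcrc) {
                count--;           /* over the limit: drop the highest CRC */
                hi = set[count - 1];
                more = 1;
            }
        }
        lo = hi + 1;               /* next pass starts above this pass's range */
    }
    return st;
}
```

Running it on the example with A..E mapped to 1..5 and maxcrc set to 2 takes three passes, matching the walk-through. The number of passes depends on how many distinct CRCs exist, which is why this variant cannot promise a fixed pass count.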

This would be a pretty neat way to do dedup! But it would involve more
work than the fixed-numpasses approach. Either would be pretty good
projects for someone who wanted to get started
<strike>breaking</strike> working on DragonFly. There is very little
that can go wrong with dedup strategies -- the kernel validates all
data records before dedup-ing. In fact, a correct (if stupid) approach
that'd involve nearly no memory commitment would be to run dedup
ioctls for every data record with every other data record...

-- vs



Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-21 Thread Sascha Wildner
On Thu, 21 Jul 2011 06:23:16 +0200, Siju George sgeorge...@gmail.com  
wrote:



> On Thu, Jul 21, 2011 at 7:18 AM, Thomas Keusch
> fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:
>
> > nice statistics. I can not provide stats of my own, as I don't run
> > Dragonfly yet, so I'm more of a hypothetical user right now. But one
> > thing that's of interest to me is how long did the de-dupe process take?
>
> I ran them one by one. at my own pace but the biggest two
> simultaneously did not take more than 2 hrs.
> So I guess 2-3 hrs would be a nice approximation :-)


My experiences were different on a file system containing a lot of data  
(2TB).


I didn't try dedup itself but a dedup-simulate already ran for more than  
two days (consuming a lot of memory in the process) before I finally  
cancelled it.


So yes, dedup seems to run fine but in my experience doesn't yet scale  
very well to larger amounts of data.


Sascha

--
http://yoyodyne.ath.cx


Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-21 Thread Steve O'Hara-Smith
On Thu, 21 Jul 2011 09:56:38 +0200
Sascha Wildner s...@online.de wrote:

> On Thu, 21 Jul 2011 06:23:16 +0200, Siju George sgeorge...@gmail.com
> wrote:
>
> > On Thu, Jul 21, 2011 at 7:18 AM, Thomas Keusch
> > fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:
> >
> > > nice statistics. I can not provide stats of my own, as I don't run
> > > Dragonfly yet, so I'm more of a hypothetical user right now. But one
> > > thing that's of interest to me is how long did the de-dupe process
> > > take?
> >
> > I ran them one by one. at my own pace but the biggest two
> > simultaneously did not take more than 2 hrs.
> > So I guess 2-3 hrs would be a nice approximation :-)
>
> My experiences were different on a file system containing a lot of data
> (2TB).
>
> I didn't try dedup itself but a dedup-simulate already ran for more than
> two days (consuming a lot of memory in the process) before I finally
> cancelled it.

Most odd - I just tried a dedup-simulate on a 2TB filesystem with
about 840GB used, it finished in about 30 seconds and reported a ratio of
1.01 (dedup has been running automatically every night on this FS).

-- 
Steve O'Hara-Smith          |   Directable Mirror Arrays
C:>WIN                      | A better way to focus the sun
The computer obeys and wins.|    licences available see
You lose and Bill collects. |    http://www.sohara.org/


Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-20 Thread Thomas Keusch
On 2011-07-19, Siju George sgeorge...@gmail.com wrote:

Hi Siju,

> Short Summary before dedup of first Hard Disk
>
> Filesystem                Size   Used  Avail Capacity  Mounted on
> Backup1                   454G   451G   2.8G    99%    /Backup1
>
> Short Summary after dedup of first Hard Disk
>
> Filesystem                Size   Used  Avail Capacity  Mounted on
> /Backup1/pfs/@@-1:1       454G   313G   141G    69%    /Backup1/Data
[...]

nice statistics. I can not provide stats of my own, as I don't run
Dragonfly yet, so I'm more of a hypothetical user right now. But one
thing that's of interest to me is how long did the de-dupe process take?

Regards
Thomas


Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-20 Thread Siju George
On Thu, Jul 21, 2011 at 7:18 AM, Thomas Keusch
fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:

> nice statistics. I can not provide stats of my own, as I don't run
> Dragonfly yet, so I'm more of a hypothetical user right now. But one
> thing that's of interest to me is how long did the de-dupe process take?


I ran them one by one at my own pace, but the biggest two, run
simultaneously, did not take more than 2 hrs.
So I guess 2-3 hrs would be a nice approximation :-)

thanks

--Siju


Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-19 Thread Siju George
Hi,

Finally I got free after a long busy season to work on my DragonFlyBSD
backup servers.
One of the backup servers holds around 10 years of company archives.

Short summary before dedup of the first hard disk

Filesystem                Size   Used  Avail Capacity  Mounted on
Backup1                   454G   451G   2.8G    99%    /Backup1

Short summary after dedup of the first hard disk

Filesystem                Size   Used  Avail Capacity  Mounted on
/Backup1/pfs/@@-1:1       454G   313G   141G    69%    /Backup1/Data

Reclaimed 138 GB, i.e. 30% of disk space, without deleting anything or
considerably affecting the performance of the server.

Full Story:

The first backup server ran Debian Sarge, then Debian Etch, and then
OpenBSD with RAIDFRAME mirrors, because it was the only Unix/Linux that
would even detect the 120 GB hard disks we had back then.
Later I turned to DragonFlyBSD because of HAMMER (no fsck, no RAID parity
checks, and easy FS snapshots).
So this Dragonfly backup server holds around 10 years of backups of

1) Web files of projects (html, php, images etc.)

2) SQL dumps, both zipped and unzipped. Hammer snapshots gave me the
luxury to do

http://www.dragonflybsd.org/docs/real_time_backup_server_for_microsoft_windows__44___linux__44___bsd_and_mac_os_x_clients/

But now we have SQL dumps of individual databases taken every hour and
made available to the developers using snapshots in the same manner
:-)

3) MS Word and Excel doc files - company documents and user backups

4) PSD files and such from designers, which take a large amount of space.

5) Git and SVN repository backups

6) Virtual machine images (mostly qcow2)

7) Configuration files of several servers and other details, backed up
daily, hourly, or sometimes every 15 minutes and maintained with
coarse-grained snapshots without pruning.

8) Several software packages and CD ISO images

9) Video/audio files such as mp3, avi, flv, mpg and so on.


The OS version currently is

DragonFly v2.11.0.247.gda17d9-DEVELOPMENT

Processor is

AMD Athlon(tm) 64 Processor 3400+ (2193.63-MHz 686-class CPU)

Memory is

real memory  = 2113336320 (2015 MB)
avail memory = 2029342720 (1935 MB)

with four 500GB SATA disks mirroring PFSes from each other and also from
another Dragonfly backup server on a different floor using
'mirror-stream', started at boot using cron with an entry similar to

@reboot /sbin/hammer mirror-stream /Backup1/Data /Backup2/Data 


I have never reinstalled the OS but kept following the development
version from July 2009 so that is two years of rolling release which
is a great advantage in itself :-)

The first Disk is mounted as /Backup1 and seems to be a good Candidate
for dedup because it is almost full.

==
Filesystem                Size   Used  Avail Capacity  Mounted on
Backup1                   454G   451G   2.8G    99%    /Backup1
/Backup1/pfs/@@-1:1       454G   451G   2.8G    99%    /Backup1/Data
/Backup1/pfs/@@-1:9       454G   451G   2.8G    99%    /Backup1/pkgsrc
/Backup1/pfs/@@-1:2       454G   451G   2.8G    99%    /Backup1/VersionControl
/Backup1/pfs/@@-1:3       454G   451G   2.8G    99%    /Backup1/test
/Backup1/pfs/@@-1:5       454G   451G   2.8G    99%    /Backup1/www-5mbak/www-hot
/Backup1/pfs/@@-1:6       454G   451G   2.8G    99%    /Backup1/mysql-1hbak/mysql-hot
/Backup1/pfs/@@-1:7       454G   451G   2.8G    99%    /Backup1/project-docs-bak/project-docs
===

Full Details below.

=

       Label               Backup1
       No. Volumes         1
       FSID                e182...
       HAMMER Version      4
Big block information
       Total           58140
       Used            57713 (99.27%)
       Reserved           69 (0.12%)
       Free              358 (0.62%)
Space information
       No. Inodes   11350364
       Total size       454G (487713669120 bytes)
       Used             451G (99.27%)
       Reserved         552M (0.12%)
       Free             2.8G (0.62%)
PFS information
       PFS ID  Mode    Snaps  Mounted on
            0  MASTER      0  /Backup1
            1  MASTER      0  /Backup1/Data
            2  MASTER      0  /Backup1/VersionControl
            3  MASTER      0  /Backup1/test
            5  MASTER      0  /Backup1/www-5mbak/www-hot
            6  MASTER      0  /Backup1/mysql-1hbak/mysql-hot
            7  MASTER      0  /Backup1/project-docs-bak/project-docs
            9  MASTER      0  /Backup1/pkgsrc
==


De Duping Steps Taken:
--


1) Version Upgrading from 4 to 6.

=
dfly-bkpsrv# hammer version-upgrade /Backup1 5
hammer version-upgrade: succeeded
dfly-bkpsrv# hammer version-upgrade /Backup1 6
hammer version-upgrade: 

Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-19 Thread Siju George
There were some copy-paste mistakes in the first one. Here is the updated one.

Hi,

Finally I got free after a long busy season to work on my DragonFlyBSD
backup servers.
One of the backup servers holds around 10 years of company archives.

Short summary before dedup of the first hard disk

Filesystem                Size   Used  Avail Capacity  Mounted on
Backup1                   454G   451G   2.8G    99%    /Backup1

Short summary after dedup of the first hard disk

Filesystem                Size   Used  Avail Capacity  Mounted on
Backup1                   454G   313G   141G    69%    /Backup1

Reclaimed 138 GB, i.e. 30% of disk space, without deleting anything or
considerably affecting the performance of the server.
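
As a quick check of the 30% figure, the arithmetic can be written out from the two df summaries above. This is only a sketch; percent_of and print_reclaimed are hypothetical helper names, and the inputs are the rounded GB values df printed.

```c
#include <stdio.h>

/* Returns part as a percentage of whole. */
double
percent_of(double part, double whole)
{
    return 100.0 * part / whole;
}

/* Prints the reclaimed space using the df figures quoted above (in GB). */
void
print_reclaimed(void)
{
    const double size = 454.0;         /* total size */
    const double used_before = 451.0;  /* used before dedup */
    const double used_after = 313.0;   /* used after dedup */
    const double reclaimed = used_before - used_after;

    printf("reclaimed: %.0f GB\n", reclaimed);
    printf("of total size: %.1f%%\n", percent_of(reclaimed, size));
    printf("of space used before: %.1f%%\n",
        percent_of(reclaimed, used_before));
}
```

Both ratios come out a little above 30% (about 30.4% of the total size and about 30.6% of the space used before dedup), which agrees with the capacity column dropping from 99% to 69%.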

Full Story:

The first backup server ran Debian Sarge, then Debian Etch, and then
OpenBSD with RAIDFRAME mirrors, because it was the only Unix/Linux that
would even detect the 120 GB hard disks we had back then.
Later I turned to DragonFlyBSD because of HAMMER (no fsck, no RAID parity
checks, and easy FS snapshots).
So this Dragonfly backup server holds around 10 years of backups of

1) Web files of projects (html, php, images etc.)

2) SQL dumps, both zipped and unzipped. Hammer snapshots gave me the
luxury to do

http://www.dragonflybsd.org/docs/real_time_backup_server_for_microsoft_windows__44___linux__44___bsd_and_mac_os_x_clients/

But now we have SQL dumps of individual databases taken every hour and
made available to the developers using snapshots in the same manner
:-)

3) MS Word and Excel doc files - company documents and user backups

4) PSD files and such from designers, which take a large amount of space.

5) Git and SVN repository backups

6) Virtual machine images (mostly qcow2)

7) Configuration files of several servers and other details, backed up
daily, hourly, or sometimes every 15 minutes and maintained with
coarse-grained snapshots without pruning.

8) Several software packages and CD ISO images

9) Video/audio files such as mp3, avi, flv, mpg and so on.


The OS version currently is

DragonFly v2.11.0.247.gda17d9-DEVELOPMENT

Processor is

AMD Athlon(tm) 64 Processor 3400+ (2193.63-MHz 686-class CPU)

Memory is

real memory  = 2113336320 (2015 MB)
avail memory = 2029342720 (1935 MB)

with four 500GB SATA disks mirroring PFSes from each other and also from
another Dragonfly backup server on a different floor using
'mirror-stream', started at boot using cron with an entry similar to

@reboot /sbin/hammer mirror-stream /Backup1/Data /Backup2/Data 


I have never reinstalled the OS but kept following the development
version from July 2009 so that is two years of rolling release which
is a great advantage in itself :-)

The first Disk is mounted as /Backup1 and seems to be a good Candidate
for dedup because it is almost full.

==
Filesystem                Size   Used  Avail Capacity  Mounted on
Backup1                   454G   451G   2.8G    99%    /Backup1
/Backup1/pfs/@@-1:1       454G   451G   2.8G    99%    /Backup1/Data
/Backup1/pfs/@@-1:9       454G   451G   2.8G    99%    /Backup1/pkgsrc
/Backup1/pfs/@@-1:2       454G   451G   2.8G    99%    /Backup1/VersionControl
/Backup1/pfs/@@-1:3       454G   451G   2.8G    99%    /Backup1/test
/Backup1/pfs/@@-1:5       454G   451G   2.8G    99%    /Backup1/www-5mbak/www-hot
/Backup1/pfs/@@-1:6       454G   451G   2.8G    99%    /Backup1/mysql-1hbak/mysql-hot
/Backup1/pfs/@@-1:7       454G   451G   2.8G    99%    /Backup1/project-docs-bak/project-docs
===

Full Details below.

=

       Label               Backup1
       No. Volumes         1
       FSID                e182...
       HAMMER Version      4
Big block information
       Total           58140
       Used            57713 (99.27%)
       Reserved           69 (0.12%)
       Free              358 (0.62%)
Space information
       No. Inodes   11350364
       Total size       454G (487713669120 bytes)
       Used             451G (99.27%)
       Reserved         552M (0.12%)
       Free             2.8G (0.62%)
PFS information
       PFS ID  Mode    Snaps  Mounted on
            0  MASTER      0  /Backup1
            1  MASTER      0  /Backup1/Data
            2  MASTER      0  /Backup1/VersionControl
            3  MASTER      0  /Backup1/test
            5  MASTER      0  /Backup1/www-5mbak/www-hot
            6  MASTER      0  /Backup1/mysql-1hbak/mysql-hot
            7  MASTER      0  /Backup1/project-docs-bak/project-docs
            9  MASTER      0  /Backup1/pkgsrc
==


De Duping Steps Taken:
--


1) Version Upgrading from 4 to 6.

=
dfly-bkpsrv# hammer version-upgrade /Backup1 5
hammer version-upgrade: succeeded
dfly-bkpsrv# hammer 

Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive

2011-07-19 Thread Dean Hamstead
I would be interested to see how this compares to other deduplication
implementations.

D



On 19/07/2011, at 9:06 PM, Siju George sgeorge...@gmail.com wrote:

[...]