Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
The memory use can be bounded with some additional work on the software, if someone wants to have a go at it. Basically, the way you limit memory use is by dynamically limiting the CRC range that you observe in a pass. As you reach a self-imposed memory limit, you reduce the CRC range and throw away out-of-range records. Once the pass is done, you start a new pass with the remaining range. Rinse, repeat until the whole thing is done. That would make it possible to run dedup with bounded memory. However, the extra I/Os required to verify duplicate data cannot be avoided.

Currently the dedup code (in sbin/hammer/cmd_dedup.c) kicks off in hammer_cmd_dedup(); scan_pfs() calls process_btree_elm() for every data record in the B-Tree. An RB tree of data records is constructed, keyed on their CRCs. process_btree_elm() has an easy job -- for every new record, it checks for a matching CRC in the tree; if it finds one, it attempts a dedup ioctl (the kernel performs a full block comparison, don't worry).

There is a really straightforward way to dramatically reduce dedup's memory use, at a time cost -- run a fixed number of passes, with each pass only storing records in the RB tree where (CRC % numpass == current_pass). After each pass, clear out the CRC RB tree. This does not run dedup in bounded space, but it is really straightforward to do and can result in dramatic memory use reductions (16 passes should reduce memory use by a factor of 16, for example).

I've done a crude patch to try it, only changing ~20 lines of code in cmd_dedup.c. Something like this might work (very rough atm):

in hammer_cmd_dedup():

-	scan_pfs(av[0], process_btree_elm);
+	for (i = 0; i < npasses; i++) {
+		scan_pfs(av[0], process_btree_elm);
		...
		assert(RB_EMPTY(dedup_tree));
		...
+		passnum++;
+	}	/* for npasses */

in process_btree_elm():

	de = RB_LOOKUP(dedup_entry_rb_tree, dedup_tree,
		       scan_leaf->data_crc);
	if (de == NULL) {
+		if (scan_leaf->data_crc % npasses != passnum)
+			goto end;

To run in bounded space is also possible, but would require a variable number of passes. Imagine having a fixed number of CRC RB records you are willing to create each pass, MAXCRC. In each pass, you keep accepting blocks with new CRCs into the RB tree until you've accepted MAXCRC of them. Then you record the highest accepted CRC for that pass and continue walking the disk, deduping blocks with matching CRCs but not accepting new ones into the tree. On the next pass, you accept records with a CRC higher than the highest one you accepted on the last pass, again up to MAXCRC. Between passes, you clear the CRC RB tree.

Example: let's say I can have two CRCs in my tree (I have an old computer) and my FS has records with CRCs [A B C A A B B D C C D D E]. On pass one, I'd store A and B in my RB tree as I see records. I'd record B as the highest CRC I deduped on pass one; then I'd finish my disk walk and dedup all the As and Bs. On pass two, I'd see a C and then later a D. I'd dedup C and D blocks on this pass and record D. So on and on till I've deduped the highest CRC on disk. This would be a pretty neat way to do dedup! But it would involve more work than the fixed-numpasses approach.

Either would be a pretty good project for someone who wanted to get started ~~breaking~~ working on DragonFly. There is very little that can go wrong with dedup strategies -- the kernel validates all data records before deduping. In fact, a correct (if stupid) approach that would involve nearly no memory commitment would be to run dedup ioctls for every data record against every other data record...

-- vs
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
On Thu, 21 Jul 2011 06:23:16 +0200, Siju George sgeorge...@gmail.com wrote:

> On Thu, Jul 21, 2011 at 7:18 AM, Thomas Keusch fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:
>> Nice statistics. I can not provide stats of my own, as I don't run DragonFly yet, so I'm more of a hypothetical user right now. But one thing that's of interest to me is: how long did the de-dupe process take?
>
> I ran them one by one, at my own pace, but the biggest two run simultaneously did not take more than 2 hrs. So I guess 2-3 hrs would be a nice approximation :-)

My experience was different, on a file system containing a lot of data (2TB). I didn't try dedup itself, but a dedup-simulate had already run for more than two days (consuming a lot of memory in the process) before I finally cancelled it. So yes, dedup seems to run fine, but in my experience it doesn't yet scale very well to larger amounts of data.

Sascha

--
http://yoyodyne.ath.cx
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
On Thu, 21 Jul 2011 09:56:38 +0200 Sascha Wildner s...@online.de wrote:

> On Thu, 21 Jul 2011 06:23:16 +0200, Siju George sgeorge...@gmail.com wrote:
>> I ran them one by one, at my own pace, but the biggest two run simultaneously did not take more than 2 hrs. So I guess 2-3 hrs would be a nice approximation :-)
>
> My experience was different, on a file system containing a lot of data (2TB). I didn't try dedup itself, but a dedup-simulate had already run for more than two days (consuming a lot of memory in the process) before I finally cancelled it.

Most odd - I just tried a dedup-simulate on a 2TB filesystem with about 840GB used; it finished in about 30 seconds and reported a ratio of 1.01 (dedup has been running automatically every night on this FS).

--
Steve O'Hara-Smith           | Directable Mirror Arrays
C:\>WIN                      | A better way to focus the sun
The computer obeys and wins. | licences available see
You lose and Bill collects.  | http://www.sohara.org/
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
On 2011-07-19, Siju George sgeorge...@gmail.com wrote:

Hi Siju,

> Short summary before dedup of the first hard disk:
>
> Filesystem            Size   Used  Avail  Capacity  Mounted on
> Backup1               454G   451G   2.8G     99%    /Backup1
>
> Short summary after dedup of the first hard disk:
>
> Filesystem            Size   Used  Avail  Capacity  Mounted on
> /Backup1/pfs/@@-1:1   454G   313G   141G     69%    /Backup1/Data

[...]

Nice statistics. I can not provide stats of my own, as I don't run DragonFly yet, so I'm more of a hypothetical user right now. But one thing that's of interest to me is: how long did the de-dupe process take?

Regards
Thomas
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
On Thu, Jul 21, 2011 at 7:18 AM, Thomas Keusch fwd+usenet-spam201...@bsd-solutions-duesseldorf.de wrote:

> Nice statistics. I can not provide stats of my own, as I don't run DragonFly yet, so I'm more of a hypothetical user right now. But one thing that's of interest to me is: how long did the de-dupe process take?

I ran them one by one, at my own pace, but the biggest two run simultaneously did not take more than 2 hrs. So I guess 2-3 hrs would be a nice approximation :-)

thanks
--Siju
Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
Hi,

Finally I got free after a long busy season to work on my DragonFlyBSD backup servers. One of the backup servers has around 10 years of company archives.

Short summary before dedup of the first hard disk:

Filesystem            Size   Used  Avail  Capacity  Mounted on
Backup1               454G   451G   2.8G     99%    /Backup1

Short summary after dedup of the first hard disk:

Filesystem            Size   Used  Avail  Capacity  Mounted on
/Backup1/pfs/@@-1:1   454G   313G   141G     69%    /Backup1/Data

Reclaimed 138 GB, i.e. 30% of disk space, without deleting anything or considerably affecting the performance of the server.

Full Story:

The first backup server was Debian Sarge, then Debian Etch, and then OpenBSD with RAIDframe mirrors, because it was the only Unix/Linux that would even detect the 120 GB hard disks we had back then. Later I turned to DragonFlyBSD due to HAMMER (no fsck, no RAID parity checks, and easy FS snapshots).

So this DragonFly backup server has around 10 years of backups of:

1) Web files of projects (html, php, images etc.)
2) SQL dumps, both zipped and unzipped. HAMMER snapshots gave me the luxury to do http://www.dragonflybsd.org/docs/real_time_backup_server_for_microsoft_windows__44___linux__44___bsd_and_mac_os_x_clients/ -- but now we have SQL dumps of individual databases taken every hour and made available to the developers using snapshots in the same manner :-)
3) MS Word and Excel doc files - company documents and user backups
4) PSD files and such from designers, which take up a large space
5) Git and SVN repository backups
6) Virtual machine images (mostly qcow2)
7) Configuration files of several servers and other details, backed up daily/hourly or sometimes every 15 minutes and maintained with coarse-grained snapshots without pruning
8) Several softwares and CD ISO images
9) Video/audio files such as mp3, avi, flv, mpg and so on
The OS version currently is DragonFly v2.11.0.247.gda17d9-DEVELOPMENT.

Processor is AMD Athlon(tm) 64 Processor 3400+ (2193.63-MHz 686-class CPU).

Memory is
real memory  = 2113336320 (2015 MB)
avail memory = 2029342720 (1935 MB)

with four 500GB SATA disks mirroring PFSs from each other and also from another DragonFly backup server on a different floor using 'mirror-stream', started at boot using cron with an entry similar to

@reboot /sbin/hammer mirror-stream /Backup1/Data /Backup2/Data

I have never reinstalled the OS but kept following the development version from July 2009, so that is two years of rolling release, which is a great advantage in itself :-)

The first disk is mounted as /Backup1 and seems to be a good candidate for dedup because it is almost full.

==
Filesystem            Size   Used  Avail  Capacity  Mounted on
Backup1               454G   451G   2.8G     99%    /Backup1
/Backup1/pfs/@@-1:1   454G   451G   2.8G     99%    /Backup1/Data
/Backup1/pfs/@@-1:9   454G   451G   2.8G     99%    /Backup1/pkgsrc
/Backup1/pfs/@@-1:2   454G   451G   2.8G     99%    /Backup1/VersionControl
/Backup1/pfs/@@-1:3   454G   451G   2.8G     99%    /Backup1/test
/Backup1/pfs/@@-1:5   454G   451G   2.8G     99%    /Backup1/www-5mbak/www-hot
/Backup1/pfs/@@-1:6   454G   451G   2.8G     99%    /Backup1/mysql-1hbak/mysql-hot
/Backup1/pfs/@@-1:7   454G   451G   2.8G     99%    /Backup1/project-docs-bak/project-docs
===

Full details below.

=
Label            Backup1
No. Volumes      1
FSID             e182...
HAMMER Version   4

Big block information
        Total     58140
        Used      57713 (99.27%)
        Reserved     69 (0.12%)
        Free        358 (0.62%)

Space information
        No. Inodes  11350364
        Total size  454G (487713669120 bytes)
        Used        451G (99.27%)
        Reserved    552M (0.12%)
        Free        2.8G (0.62%)

PFS information
        PFS ID  Mode    Snaps  Mounted on
        0       MASTER  0      /Backup1
        1       MASTER  0      /Backup1/Data
        2       MASTER  0      /Backup1/VersionControl
        3       MASTER  0      /Backup1/test
        5       MASTER  0      /Backup1/www-5mbak/www-hot
        6       MASTER  0      /Backup1/mysql-1hbak/mysql-hot
        7       MASTER  0      /Backup1/project-docs-bak/project-docs
        9       MASTER  0      /Backup1/pkgsrc
==

De-duping steps taken:
--
1) Version upgrading from 4 to 6.
=
dfly-bkpsrv# hammer version-upgrade /Backup1 5
hammer version-upgrade: succeeded
dfly-bkpsrv# hammer version-upgrade /Backup1 6
hammer version-upgrade:
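The transcript above is cut off mid-output. For readers wanting to reproduce the procedure, the overall sequence was presumably along these lines -- a sketch rather than the author's exact session, with the per-PFS paths taken from the mount table above:

```shell
# Dedup requires a sufficiently new HAMMER on-disk version,
# so upgrade one step at a time (as in the transcript above)
hammer version-upgrade /Backup1 5
hammer version-upgrade /Backup1 6

# Dry run first: estimates the achievable dedup ratio
# without modifying any data
hammer dedup-simulate /Backup1/Data

# Then dedup for real, one PFS at a time
hammer dedup /Backup1/Data
hammer dedup /Backup1/VersionControl
```

`dedup-simulate` is the cheap way to decide whether a full `dedup` run is worth the I/O; both directives are documented in hammer(8).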
Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
Some copy-paste mistakes in the first one. Here is the updated one.

Hi,

Finally I got free after a long busy season to work on my DragonFlyBSD backup servers. One of the backup servers has around 10 years of company archives.

Short summary before dedup of the first hard disk:

Filesystem            Size   Used  Avail  Capacity  Mounted on
Backup1               454G   451G   2.8G     99%    /Backup1

Short summary after dedup of the first hard disk:

Filesystem            Size   Used  Avail  Capacity  Mounted on
Backup1               454G   313G   141G     69%    /Backup1

Reclaimed 138 GB, i.e. 30% of disk space, without deleting anything or considerably affecting the performance of the server.

Full Story:

The first backup server was Debian Sarge, then Debian Etch, and then OpenBSD with RAIDframe mirrors, because it was the only Unix/Linux that would even detect the 120 GB hard disks we had back then. Later I turned to DragonFlyBSD due to HAMMER (no fsck, no RAID parity checks, and easy FS snapshots).

So this DragonFly backup server has around 10 years of backups of:

1) Web files of projects (html, php, images etc.)
2) SQL dumps, both zipped and unzipped. HAMMER snapshots gave me the luxury to do http://www.dragonflybsd.org/docs/real_time_backup_server_for_microsoft_windows__44___linux__44___bsd_and_mac_os_x_clients/ -- but now we have SQL dumps of individual databases taken every hour and made available to the developers using snapshots in the same manner :-)
3) MS Word and Excel doc files - company documents and user backups
4) PSD files and such from designers, which take up a large space
5) Git and SVN repository backups
6) Virtual machine images (mostly qcow2)
7) Configuration files of several servers and other details, backed up daily/hourly or sometimes every 15 minutes and maintained with coarse-grained snapshots without pruning
8) Several softwares and CD ISO images
9) Video/audio files such as mp3, avi, flv, mpg and so on
The OS version currently is DragonFly v2.11.0.247.gda17d9-DEVELOPMENT.

Processor is AMD Athlon(tm) 64 Processor 3400+ (2193.63-MHz 686-class CPU).

Memory is
real memory  = 2113336320 (2015 MB)
avail memory = 2029342720 (1935 MB)

with four 500GB SATA disks mirroring PFSs from each other and also from another DragonFly backup server on a different floor using 'mirror-stream', started at boot using cron with an entry similar to

@reboot /sbin/hammer mirror-stream /Backup1/Data /Backup2/Data

I have never reinstalled the OS but kept following the development version from July 2009, so that is two years of rolling release, which is a great advantage in itself :-)

The first disk is mounted as /Backup1 and seems to be a good candidate for dedup because it is almost full.

==
Filesystem            Size   Used  Avail  Capacity  Mounted on
Backup1               454G   451G   2.8G     99%    /Backup1
/Backup1/pfs/@@-1:1   454G   451G   2.8G     99%    /Backup1/Data
/Backup1/pfs/@@-1:9   454G   451G   2.8G     99%    /Backup1/pkgsrc
/Backup1/pfs/@@-1:2   454G   451G   2.8G     99%    /Backup1/VersionControl
/Backup1/pfs/@@-1:3   454G   451G   2.8G     99%    /Backup1/test
/Backup1/pfs/@@-1:5   454G   451G   2.8G     99%    /Backup1/www-5mbak/www-hot
/Backup1/pfs/@@-1:6   454G   451G   2.8G     99%    /Backup1/mysql-1hbak/mysql-hot
/Backup1/pfs/@@-1:7   454G   451G   2.8G     99%    /Backup1/project-docs-bak/project-docs
===

Full details below.

=
Label            Backup1
No. Volumes      1
FSID             e182...
HAMMER Version   4

Big block information
        Total     58140
        Used      57713 (99.27%)
        Reserved     69 (0.12%)
        Free        358 (0.62%)

Space information
        No. Inodes  11350364
        Total size  454G (487713669120 bytes)
        Used        451G (99.27%)
        Reserved    552M (0.12%)
        Free        2.8G (0.62%)

PFS information
        PFS ID  Mode    Snaps  Mounted on
        0       MASTER  0      /Backup1
        1       MASTER  0      /Backup1/Data
        2       MASTER  0      /Backup1/VersionControl
        3       MASTER  0      /Backup1/test
        5       MASTER  0      /Backup1/www-5mbak/www-hot
        6       MASTER  0      /Backup1/mysql-1hbak/mysql-hot
        7       MASTER  0      /Backup1/project-docs-bak/project-docs
        9       MASTER  0      /Backup1/pkgsrc
==

De-duping steps taken:
--
1) Version upgrading from 4 to 6.
=
dfly-bkpsrv# hammer version-upgrade /Backup1 5
hammer version-upgrade: succeeded
dfly-bkpsrv# hammer
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
I would be interested to see how this compares to other deduplication implementations.

D

On 19/07/2011, at 9:06 PM, Siju George sgeorge...@gmail.com wrote:

> Some copy-paste mistakes in the first one. Here is the updated one.
>
> Hi,
>
> Finally I got free after a long busy season to work on my DragonFlyBSD backup servers. One of the backup servers has around 10 years of company archives.
>
> Short summary before dedup of the first hard disk:
>
> Filesystem            Size   Used  Avail  Capacity  Mounted on
> Backup1               454G   451G   2.8G     99%    /Backup1
>
> Short summary after dedup of the first hard disk:
>
> Filesystem            Size   Used  Avail  Capacity  Mounted on
> Backup1               454G   313G   141G     69%    /Backup1
>
> Reclaimed 138 GB, i.e. 30% of disk space, without deleting anything or considerably affecting the performance of the server.
>
> [...]