Test results for [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-14 Thread Konstantinos Skarlatos

Hello,

Here are the results from my testing of the latest btrfs dedup patches.


TL;DR:
I rsynced 10 separate copies of a 3.8GB folder containing 138 RAW photographs 
(23-36MiB each) onto a btrfs volume with dedup enabled.
On the first run the copy was very slow, and the sync after it took 
over 10 minutes to complete.
For the following copies the sync was much faster, but it still took up to 
one minute to complete.
The copy itself stayed quite slow until the fifth run, when it went from 
8MB/sec to 22-40MB/sec.
Each copy after the first consumed about 60-65MiB of metadata, or 
120-130MiB of free space since metadata is DUP.


Obvious question:
Can dedup recognize that two files are identical and deduplicate them at the 
file level, saving much more space in the process?


In any case I am very thankful for the work being done here, and I am 
willing to help in any way I can.





AMD Phenom(tm) II X4 955 Processor
MemTotal:  8 GB
Hard Disk: Seagate Barracuda 7200.12 [160 GB]
kernel: 3.14.0-1-git

$ mkfs.btrfs /dev/loop0 -f &&
    mount /storage/btrfs_dedup &&
    mount | grep dedup &&
    btrfs dedup enable /storage/btrfs_dedup &&
    btrfs dedup on /storage/btrfs_dedup &&
    for i in {01..10}; do
        time rsync -a /storage/btrfs/costas/Photo_library/2014/ \
            /storage/btrfs_dedup/copy$i/ --stats &&
        time btrfs fi sync /storage/btrfs_dedup/ &&
        df /storage/btrfs_dedup/ &&
        btrfs fi df /storage/btrfs_dedup
    done &&
    time umount /storage/btrfs_dedup


/root/btrfs.img on /storage/btrfs_dedup type btrfs 
(rw,noatime,nodiratime,space_cache)


sent 4,017,134,246 bytes  received 2,689 bytes  8,274,226.44 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.85s user 
45.04s system 13% cpu 8:05.48 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.36s system 0% cpu 
10:43.27 total

/dev/loop1 46080  4119 40173  10% /storage/btrfs_dedup
Data, single: total=4.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=143.45MiB

sent 4,017,134,246 bytes  received 2,689 bytes  8,956,827.06 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.29s user 
42.32s system 14% cpu 7:28.74 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
4.173 total

/dev/loop1 46080  4250 40173  10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=208.72MiB

sent 4,017,134,246 bytes  received 2,689 bytes  9,691,524.57 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  20.95s user 
31.69s system 12% cpu 6:54.90 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.00s system 0% cpu 
3.254 total

/dev/loop1 46080  4371 40172  10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=269.39MiB

sent 4,017,134,246 bytes  received 2,689 bytes  9,037,428.43 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  20.54s user 
36.70s system 12% cpu 7:23.93 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
5.578 total

/dev/loop1 46080  4497 40172  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=331.98MiB

sent 4,017,134,246 bytes  received 2,689 bytes  29,004,598.81 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  22.30s user 
13.01s system 25% cpu 2:18.15 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
23.447 total

/dev/loop1 46080  4617 40172  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=391.91MiB

sent 4,017,134,246 bytes  received 2,689 bytes  39,971,511.79 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.60s user 
11.85s system 33% cpu 1:39.74 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
32.178 total

/dev/loop1 46080  4747 40171  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=456.48MiB

sent 4,017,134,246 bytes  received 2,689 bytes  32,009,059.24 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  25.68s user 
13.94s system 31% cpu 2:04.42 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
29.313 total

/dev/loop1 46080  4870 40171  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=518.09MiB

sent 4,017,134,246 bytes  received 2,689 bytes  30,782,658.51 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.84s user 
12.63s system 26% cpu 2:10.20 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.00s system 0% cpu 
41.074 total

/dev/loop1 46080  4990 40171  12% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=578.16MiB

sent 4,017,134,246 bytes  received 2,689 bytes  22,379,592.95 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --st
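
For reference, the per-copy metadata cost can be computed from the
"Metadata, DUP ... used=" figures above; a small sketch (the awk pipeline is
only an illustration, not part of the test script; the numbers are the ones
reported above, in MiB):

  # Per-copy metadata growth from the used= values above (copies 1-8).
  echo "143.45 208.72 269.39 331.98 391.91 456.48 518.09 578.16" |
  awk '{ for (i = 2; i <= NF; i++)
             printf "copy %d: +%.2f MiB metadata (~%.0f MiB raw with DUP)\n",
                    i, $i - $(i-1), 2 * ($i - $(i-1)) }'
  # Each delta comes out around 60-65 MiB, i.e. 120-130 MiB of raw space,
  # matching the summary at the top of this mail.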

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Michael

Hi Liu,

Thanks for your work.

Each test copies a 2GB file from sdf (btrfs) to sde (btrfs with dedup, 4k 
block size).

Before every test I recreate the filesystem.
On the second write everything looks good.

Test 1
Nodesize = leafsize = 4k
Write overhead ~ x1.5

Test 2
Nodesize = leafsize = 16k
Write overhead ~ x19



Test 1:

fs created label (null) on /dev/sde
nodesize 4096 leafsize 4096 sectorsize 4096 size 931.32GiB
Btrfs v3.14-dirty
./btrfs dedup enable /mnt/backupkvm/ ; ./btrfs dedup on -b4k 
/mnt/backupkvm/


[root@hranilka2 ~]# cat /sys/block/sde/stat
  157149 3096  1285992   128900  6472058 12962342 718660664 
15112940620  1576195 1527264468

[root@hranilka2 ~]# cat /sys/block/sde/stat
  157149 3096  1285992   128900  6526802 13268740 724690168 
15121175000  1587233 1528090320

write sectors: 724690168 - 718660664 = 6029504 sectors
write MiB: 6029504 * 512 / 1024 / 1024 ~ 2944 MiB

[root@hranilka2 ~]# cat /sys/block/sdf/stat
  338633  165 346540680  704377174 1496 35
0   610795  7043095

[root@hranilka2 ~]# cat /sys/block/sdf/stat
  342737  165 350743176  712754674 1496 35
0   618652  7126860

read sectors: 350743176 - 346540680 = 4202496 sectors
read MiB: 4202496 * 512 / 1024 / 1024 ~ 2052 MiB
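
For reference, a small sketch of how these deltas can be taken automatically
(field 7 of /sys/block/<dev>/stat is sectors written, in 512-byte units; the
helper below is only an illustration, not part of the original test):

  # Snapshot sectors written before/after a run and report the delta in MiB.
  written() { awk '{ print $7 }' "/sys/block/$1/stat"; }
  before=$(written sde)
  # ... run the copy here ...
  after=$(written sde)
  echo "wrote $(( (after - before) * 512 / 1024 / 1024 )) MiB"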



Test 2:

fs created label (null) on /dev/sde
nodesize 16384 leafsize 16384 sectorsize 4096 size 931.32GiB
Btrfs v3.14-dirty
./btrfs dedup enable /mnt/backupkvm/ ; ./btrfs dedup on -b4k 
/mnt/backupkvm/



[root@hranilka2 ~]# cat /sys/block/sde/stat
  157440 3313  1290392   129277  6526912 13272046 724722864 
15121179200  1587500 1528091116

[root@hranilka2 ~]# cat /sys/block/sde/stat
  157440 3313  1290392   129277  6964494 13836479 803994376 
15149358800  1730081 1530910351

write sectors: 803994376 - 724722864 = 79271512 sectors
write MiB: 79271512 * 512 / 1024 / 1024 ~ 38706 MiB

[root@hranilka2 ~]# cat /sys/block/sdf/stat
  342737  165 350743176  712754674 1496 35
0   618652  7126860

[root@hranilka2 ~]# cat /sys/block/sdf/stat
  346841  165 354945672  726623174 1496 35
0   629824  7265539

read sectors: 354945672 - 350743176 = 4202496 sectors
read MiB: 4202496 * 512 / 1024 / 1024 ~ 2052 MiB

--
Michael Serikov


Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Liu Bo
On Fri, Apr 11, 2014 at 11:28:48AM +0200, Martin Steigerwald wrote:
> Hi Liu,
> 
> Am Donnerstag, 10. April 2014, 23:55:21 schrieb Liu Bo:
> > Hi,
> > 
> > Just FYI, these patches are also available on the following site,
> > 
> > kernel:
> > https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux
> > 
> > progs:
> > https://github.com/liubogithub/btrfs-progs.git dedup
> 
> I bet it's good to only test it with test data so far, or would you consider it 
> safe enough to test on production data already?

It's definitely at an experimental stage (it still has known bugs); that's why I
tagged it as 'RFC'.

> 
> Fortunately, since I added that additional mSATA SSD I have some spare storage 
> to put a test setup on.

Thanks for testing it!

Thanks,
-liubo

> 
> Thanks,
> Martin
> 
> > thanks,
> > -liubo
> > 
> > On Thu, Apr 10, 2014 at 11:48:30AM +0800, Liu Bo wrote:
> > > Hello,
> > > 
> > > This the 10th attempt for in-band data dedupe, based on Linux _3.14_
> > > kernel.
> > > 
> > > Data deduplication is a specialized data compression technique for
> > > eliminating duplicate copies of repeating data.[1]
> > > 
> > > This patch set is also related to "Content based storage" in project
> > > ideas[2], it introduces inband data deduplication for btrfs and
> > > dedup/dedupe is for short.
> > > 
> > > * PATCH 1 is a speed-up improvement, which is about dedup and quota.
> > > 
> > > * PATCH 2-5 is the preparation work for dedup implementation.
> > > 
> > > * PATCH 6 shows how we implement dedup feature.
> > > 
> > > * PATCH 7 fixes a backref walking bug with dedup.
> > > 
> > > * PATCH 8 fixes a free space bug of dedup extents on error handling.
> > > 
> > > * PATCH 9 adds the ioctl to control dedup feature.
> > > 
> > > * PATCH 10 targets delayed refs' scalability problem of deleting refs,
> > > which is> 
> > >   uncovered by the dedup feature.
> > > 
> > > * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> > > 
> > >   transaction abortion and crash.
> > > 
> > > * btrfs-progs patch(PATCH 17) offers all details about how to control the
> > > 
> > >   dedup feature on progs side.
> > > 
> > > I've tested this with xfstests by adding a inline dedup 'enable & on' in
> > > xfstests' mount and scratch_mount.
> > > 
> > > 
> > > ***NOTE***
> > > Known bugs:
> > > * Mounting with options "flushoncommit" and enabling dedupe feature will
> > > end up> 
> > >   with _deadlock_.
> > > 
> > > TODO:
> > > * a bit-to-bit comparison callback.
> > > 
> > > All comments are welcome!
> > > 
> > > 
> > > [1]: http://en.wikipedia.org/wiki/Data_deduplication
> > > [2]:
> > > https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_stora
> > > ge
> > > 
> > > v10:
> > > - fix a typo in the subject line.
> > > - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
> > > 
> > >   'Inappropriate ioctl for device'.
> > > 
> > > v9:
> > > - fix a deadlock and a crash reported by users.
> > > - fix the metadata ENOSPC problem with dedup again.
> > > 
> > > v8:
> > > - fix the race crash of dedup ref again.
> > > - fix the metadata ENOSPC problem with dedup.
> > > 
> > > v7:
> > > - rebase onto the lastest btrfs
> > > - break a big patch into smaller ones to make reviewers happy.
> > > - kill mount options of dedup and use ioctl method instead.
> > > - fix two crash due to the special dedup ref
> > > 
> > > For former patch sets:
> > > v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> > > v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> > > v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> > > v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> > > v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
> > > 
> > > Liu Bo (16):
> > >   Btrfs: disable qgroups accounting when quota_enable is 0
> > >   Btrfs: introduce dedup tree and relatives
> > >   Btrfs: introduce dedup tree operations
> > >   Btrfs: introduce 

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Martin Steigerwald
Hi Liu,

Am Donnerstag, 10. April 2014, 23:55:21 schrieb Liu Bo:
> Hi,
> 
> Just FYI, these patches are also available on the following site,
> 
> kernel:
> https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux
> 
> progs:
> https://github.com/liubogithub/btrfs-progs.git dedup

I bet it's good to only test it with test data so far, or would you consider it 
safe enough to test on production data already?

Fortunately, since I added that additional mSATA SSD I have some spare storage 
to put a test setup on.

Thanks,
Martin

> thanks,
> -liubo
> 
> On Thu, Apr 10, 2014 at 11:48:30AM +0800, Liu Bo wrote:
> > Hello,
> > 
> > This the 10th attempt for in-band data dedupe, based on Linux _3.14_
> > kernel.
> > 
> > Data deduplication is a specialized data compression technique for
> > eliminating duplicate copies of repeating data.[1]
> > 
> > This patch set is also related to "Content based storage" in project
> > ideas[2], it introduces inband data deduplication for btrfs and
> > dedup/dedupe is for short.
> > 
> > * PATCH 1 is a speed-up improvement, which is about dedup and quota.
> > 
> > * PATCH 2-5 is the preparation work for dedup implementation.
> > 
> > * PATCH 6 shows how we implement dedup feature.
> > 
> > * PATCH 7 fixes a backref walking bug with dedup.
> > 
> > * PATCH 8 fixes a free space bug of dedup extents on error handling.
> > 
> > * PATCH 9 adds the ioctl to control dedup feature.
> > 
> > * PATCH 10 targets delayed refs' scalability problem of deleting refs,
> > which is> 
> >   uncovered by the dedup feature.
> > 
> > * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> > 
> >   transaction abortion and crash.
> > 
> > * btrfs-progs patch(PATCH 17) offers all details about how to control the
> > 
> >   dedup feature on progs side.
> > 
> > I've tested this with xfstests by adding a inline dedup 'enable & on' in
> > xfstests' mount and scratch_mount.
> > 
> > 
> > ***NOTE***
> > Known bugs:
> > * Mounting with options "flushoncommit" and enabling dedupe feature will
> > end up> 
> >   with _deadlock_.
> > 
> > TODO:
> > * a bit-to-bit comparison callback.
> > 
> > All comments are welcome!
> > 
> > 
> > [1]: http://en.wikipedia.org/wiki/Data_deduplication
> > [2]:
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_stora
> > ge
> > 
> > v10:
> > - fix a typo in the subject line.
> > - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
> > 
> >   'Inappropriate ioctl for device'.
> > 
> > v9:
> > - fix a deadlock and a crash reported by users.
> > - fix the metadata ENOSPC problem with dedup again.
> > 
> > v8:
> > - fix the race crash of dedup ref again.
> > - fix the metadata ENOSPC problem with dedup.
> > 
> > v7:
> > - rebase onto the lastest btrfs
> > - break a big patch into smaller ones to make reviewers happy.
> > - kill mount options of dedup and use ioctl method instead.
> > - fix two crash due to the special dedup ref
> > 
> > For former patch sets:
> > v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> > v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> > v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> > v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> > v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
> > 
> > Liu Bo (16):
> >   Btrfs: disable qgroups accounting when quota_enable is 0
> >   Btrfs: introduce dedup tree and relatives
> >   Btrfs: introduce dedup tree operations
> >   Btrfs: introduce dedup state
> >   Btrfs: make ordered extent aware of dedup
> >   Btrfs: online(inband) data dedup
> >   Btrfs: skip dedup reference during backref walking
> >   Btrfs: don't return space for dedup extent
> >   Btrfs: add ioctl of dedup control
> >   Btrfs: improve the delayed refs process in rm case
> >   Btrfs: fix a crash of dedup ref
> >   Btrfs: fix deadlock of dedup work
> >   Btrfs: fix transactin abortion in __btrfs_free_extent
> >   Btrfs: fix wrong pinned bytes in __btrfs_free_extent
> >   Btrfs: use total_bytes instead of bytes_used for global_rsv
> >   Btrfs: fix dedup enospc problem
> >  
> >  fs/btrfs/backref.c   |   9 +
> >  fs/btrfs/ctree.c |   2 +-
>

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Liu Bo
Hi,

Just FYI, these patches are also available on the following site,

kernel:
https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux

progs:
https://github.com/liubogithub/btrfs-progs.git dedup

thanks,
-liubo

On Thu, Apr 10, 2014 at 11:48:30AM +0800, Liu Bo wrote:
> Hello,
> 
> This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.
> 
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
> 
> This patch set is also related to "Content based storage" in project ideas[2],
> it introduces inband data deduplication for btrfs and dedup/dedupe is for 
> short.
> 
> * PATCH 1 is a speed-up improvement, which is about dedup and quota.
> 
> * PATCH 2-5 is the preparation work for dedup implementation.
> 
> * PATCH 6 shows how we implement dedup feature.
> 
> * PATCH 7 fixes a backref walking bug with dedup.
> 
> * PATCH 8 fixes a free space bug of dedup extents on error handling.
> 
> * PATCH 9 adds the ioctl to control dedup feature.
> 
> * PATCH 10 targets delayed refs' scalability problem of deleting refs, which 
> is 
>   uncovered by the dedup feature.
> 
> * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
>   transaction abortion and crash.
> 
> * btrfs-progs patch(PATCH 17) offers all details about how to control the
>   dedup feature on progs side.
> 
> I've tested this with xfstests by adding a inline dedup 'enable & on' in 
> xfstests'
> mount and scratch_mount.
> 
> 
> ***NOTE***
> Known bugs:
> * Mounting with options "flushoncommit" and enabling dedupe feature will end 
> up
>   with _deadlock_.
> 
> 
> TODO:
> * a bit-to-bit comparison callback.
> 
> All comments are welcome!
> 
> 
> [1]: http://en.wikipedia.org/wiki/Data_deduplication
> [2]: 
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
> 
> v10:
> - fix a typo in the subject line.
> - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
>   'Inappropriate ioctl for device'.
> 
> v9:
> - fix a deadlock and a crash reported by users.
> - fix the metadata ENOSPC problem with dedup again.
> 
> v8:
> - fix the race crash of dedup ref again.
> - fix the metadata ENOSPC problem with dedup.
> 
> v7:
> - rebase onto the lastest btrfs
> - break a big patch into smaller ones to make reviewers happy.
> - kill mount options of dedup and use ioctl method instead.
> - fix two crash due to the special dedup ref
> 
> For former patch sets:
> v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
> 
> Liu Bo (16):
>   Btrfs: disable qgroups accounting when quota_enable is 0
>   Btrfs: introduce dedup tree and relatives
>   Btrfs: introduce dedup tree operations
>   Btrfs: introduce dedup state
>   Btrfs: make ordered extent aware of dedup
>   Btrfs: online(inband) data dedup
>   Btrfs: skip dedup reference during backref walking
>   Btrfs: don't return space for dedup extent
>   Btrfs: add ioctl of dedup control
>   Btrfs: improve the delayed refs process in rm case
>   Btrfs: fix a crash of dedup ref
>   Btrfs: fix deadlock of dedup work
>   Btrfs: fix transactin abortion in __btrfs_free_extent
>   Btrfs: fix wrong pinned bytes in __btrfs_free_extent
>   Btrfs: use total_bytes instead of bytes_used for global_rsv
>   Btrfs: fix dedup enospc problem
> 
>  fs/btrfs/backref.c   |   9 +
>  fs/btrfs/ctree.c |   2 +-
>  fs/btrfs/ctree.h |  86 ++
>  fs/btrfs/delayed-ref.c   |  26 +-
>  fs/btrfs/delayed-ref.h   |   3 +
>  fs/btrfs/disk-io.c   |  37 +++
>  fs/btrfs/extent-tree.c   | 235 +---
>  fs/btrfs/extent_io.c |  22 +-
>  fs/btrfs/extent_io.h |  16 ++
>  fs/btrfs/file-item.c | 244 +
>  fs/btrfs/inode.c | 635 
> ++-
>  fs/btrfs/ioctl.c | 167 
>  fs/btrfs/ordered-data.c  |  44 ++-
>  fs/btrfs/ordered-data.h  |  13 +-
>  fs/btrfs/qgroup.c|   3 +
>  fs/btrfs/relocation.c|   3 +
>  fs/btrfs/transaction.c   |  41 +++
>  fs/btrfs/transaction.h   |   1 +
>  include/trace/events/btrfs.h |   3 +-
>  include/uapi/linux/btrfs.h   |  12 +
>  20 files changed, 1471 insertions(+), 131 deletions(-)
> 
> -- 
> 1.8.2.1


Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Liu Bo
On Thu, Apr 10, 2014 at 12:08:17PM +0300, Konstantinos Skarlatos wrote:
> On 10/4/2014 6:48 πμ, Liu Bo wrote:
> >Hello,
> >
> >This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.
> >
> >Data deduplication is a specialized data compression technique for 
> >eliminating
> >duplicate copies of repeating data.[1]
> >
> >This patch set is also related to "Content based storage" in project 
> >ideas[2],
> >it introduces inband data deduplication for btrfs and dedup/dedupe is for 
> >short.
> >
> >* PATCH 1 is a speed-up improvement, which is about dedup and quota.
> >
> >* PATCH 2-5 is the preparation work for dedup implementation.
> >
> >* PATCH 6 shows how we implement dedup feature.
> >
> >* PATCH 7 fixes a backref walking bug with dedup.
> >
> >* PATCH 8 fixes a free space bug of dedup extents on error handling.
> >
> >* PATCH 9 adds the ioctl to control dedup feature.
> >
> >* PATCH 10 targets delayed refs' scalability problem of deleting refs, which 
> >is
> >   uncovered by the dedup feature.
> >
> >* PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> >   transaction abortion and crash.
> >
> >* btrfs-progs patch(PATCH 17) offers all details about how to control the
> >   dedup feature on progs side.
> >
> >I've tested this with xfstests by adding a inline dedup 'enable & on' in 
> >xfstests'
> >mount and scratch_mount.
> >
> >
> >***NOTE***
> >Known bugs:
> >* Mounting with options "flushoncommit" and enabling dedupe feature will end 
> >up
> >   with _deadlock_.
> >
> >
> >TODO:
> >* a bit-to-bit comparison callback.
> >
> >All comments are welcome!
> Hi Liu,
> Thanks for doing this work.
> I tested your previous patches a few months ago, and will now test
> the new ones. One question about memory requirements: are they in
> the same league as ZFS dedup (i.e. needing tens of GB of RAM for multi-TB
> filesystems), or are they more reasonable?
> Thanks

Hi Konstantinos,

It depends on Linux's native memory management, which can reclaim that memory
when the system is short on memory, but it still leads to high memory pressure
according to my experiments.
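
If you want to watch that pressure during a test run, a simple sketch (not
part of the patch set, just ordinary /proc/meminfo sampling):

  # Sample memory usage once per second while a dedup copy is running.
  while sleep 1; do
      grep -E '^(MemFree|Cached|Slab):' /proc/meminfo
      echo ---
  done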

Thanks for testing it!

-liubo


Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Konstantinos Skarlatos

On 10/4/2014 6:48 πμ, Liu Bo wrote:

Hello,

This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is for short.

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCH 2-5 is the preparation work for dedup implementation.

* PATCH 6 shows how we implement dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control dedup feature.

* PATCH 10 targets delayed refs' scalability problem of deleting refs, which is
   uncovered by the dedup feature.

* PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
   transaction abortion and crash.

* btrfs-progs patch(PATCH 17) offers all details about how to control the
   dedup feature on progs side.

I've tested this with xfstests by adding a inline dedup 'enable & on' in 
xfstests'
mount and scratch_mount.


***NOTE***
Known bugs:
* Mounting with options "flushoncommit" and enabling dedupe feature will end up
   with _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

Hi Liu,
Thanks for doing this work.
I tested your previous patches a few months ago, and will now test the 
new ones. One question about memory requirements: are they in the same 
league as ZFS dedup (i.e. needing tens of GB of RAM for multi-TB 
filesystems), or are they more reasonable?

Thanks



[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v10:
- fix a typo in the subject line.
- update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
   'Inappropriate ioctl for device'.

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the lastest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crash due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
   Btrfs: disable qgroups accounting when quota_enable is 0
   Btrfs: introduce dedup tree and relatives
   Btrfs: introduce dedup tree operations
   Btrfs: introduce dedup state
   Btrfs: make ordered extent aware of dedup
   Btrfs: online(inband) data dedup
   Btrfs: skip dedup reference during backref walking
   Btrfs: don't return space for dedup extent
   Btrfs: add ioctl of dedup control
   Btrfs: improve the delayed refs process in rm case
   Btrfs: fix a crash of dedup ref
   Btrfs: fix deadlock of dedup work
   Btrfs: fix transactin abortion in __btrfs_free_extent
   Btrfs: fix wrong pinned bytes in __btrfs_free_extent
   Btrfs: use total_bytes instead of bytes_used for global_rsv
   Btrfs: fix dedup enospc problem

  fs/btrfs/backref.c   |   9 +
  fs/btrfs/ctree.c |   2 +-
  fs/btrfs/ctree.h |  86 ++
  fs/btrfs/delayed-ref.c   |  26 +-
  fs/btrfs/delayed-ref.h   |   3 +
  fs/btrfs/disk-io.c   |  37 +++
  fs/btrfs/extent-tree.c   | 235 +---
  fs/btrfs/extent_io.c |  22 +-
  fs/btrfs/extent_io.h |  16 ++
  fs/btrfs/file-item.c | 244 +
  fs/btrfs/inode.c | 635 ++-
  fs/btrfs/ioctl.c | 167 
  fs/btrfs/ordered-data.c  |  44 ++-
  fs/btrfs/ordered-data.h  |  13 +-
  fs/btrfs/qgroup.c|   3 +
  fs/btrfs/relocation.c|   3 +
  fs/btrfs/transaction.c   |  41 +++
  fs/btrfs/transaction.h   |   1 +
  include/trace/events/btrfs.h |   3 +-
  include/uapi/linux/btrfs.h   |  12 +
  20 files changed, 1471 insertions(+), 131 deletions(-)





[RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello,

This is the 10th attempt at in-band data dedupe, based on the Linux _3.14_ kernel.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in the project ideas[2];
it introduces inband data deduplication for btrfs, dedup/dedupe for short.

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCHES 2-5 are the preparation work for the dedup implementation.

* PATCH 6 shows how we implement the dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control the dedup feature.

* PATCH 10 targets the delayed refs' scalability problem of deleting refs,
  which is uncovered by the dedup feature.

* PATCHES 11-16 fix dedupe bugs, including a race, a deadlock, an abnormal
  transaction abort and a crash.

* A btrfs-progs patch (PATCH 17) offers all details about how to control the
  dedup feature on the progs side.

I've tested this with xfstests by adding an inline dedup 'enable & on' to
xfstests' mount and scratch_mount.
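
To make the 'enable & on' control flow concrete, here is a minimal sketch
using the progs-side commands (as they are also used in the test reports in
this thread; /dev/sdX and /mnt/scratch are placeholders):

  mkfs.btrfs -f /dev/sdX && mount /dev/sdX /mnt/scratch
  btrfs dedup enable /mnt/scratch        # enable the dedup feature on this fs
  btrfs dedup on -b 128k /mnt/scratch    # switch dedup on with a 128k blocksize
  # ... write data ...
  btrfs dedup off /mnt/scratch           # switch dedup off again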


***NOTE***
Known bugs:
* Mounting with options "flushoncommit" and enabling dedupe feature will end up
  with _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v10:
- fix a typo in the subject line.
- update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
  'Inappropriate ioctl for device'.

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill the mount options of dedup and use the ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
  Btrfs: disable qgroups accounting when quota_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: add ioctl of dedup control
  Btrfs: improve the delayed refs process in rm case
  Btrfs: fix a crash of dedup ref
  Btrfs: fix deadlock of dedup work
  Btrfs: fix transactin abortion in __btrfs_free_extent
  Btrfs: fix wrong pinned bytes in __btrfs_free_extent
  Btrfs: use total_bytes instead of bytes_used for global_rsv
  Btrfs: fix dedup enospc problem

 fs/btrfs/backref.c   |   9 +
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |  86 ++
 fs/btrfs/delayed-ref.c   |  26 +-
 fs/btrfs/delayed-ref.h   |   3 +
 fs/btrfs/disk-io.c   |  37 +++
 fs/btrfs/extent-tree.c   | 235 +---
 fs/btrfs/extent_io.c |  22 +-
 fs/btrfs/extent_io.h |  16 ++
 fs/btrfs/file-item.c | 244 +
 fs/btrfs/inode.c | 635 ++-
 fs/btrfs/ioctl.c | 167 
 fs/btrfs/ordered-data.c  |  44 ++-
 fs/btrfs/ordered-data.h  |  13 +-
 fs/btrfs/qgroup.c|   3 +
 fs/btrfs/relocation.c|   3 +
 fs/btrfs/transaction.c   |  41 +++
 fs/btrfs/transaction.h   |   1 +
 include/trace/events/btrfs.h |   3 +-
 include/uapi/linux/btrfs.h   |  12 +
 20 files changed, 1471 insertions(+), 131 deletions(-)

-- 
1.8.2.1


[RFC PATCH v9 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello,

This is the 9th attempt at in-band data dedupe.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is for short.

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCHES 2-5 are the preparation work for the dedup implementation.

* PATCH 6 shows how we implement dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control dedup feature.

* PATCH 10 targets delayed refs' scalability problem of deleting refs, which is 
  uncovered by the dedup feature.

* PATCHES 11-16 fix dedupe bugs, including a race, a deadlock, an abnormal
  transaction abort and a crash.

* A btrfs-progs patch (PATCH 17) offers all details about how to control the
  dedup feature on the progs side.

I've tested this with xfstests by adding an inline dedup 'enable & on' to
xfstests' mount and scratch_mount.


***NOTE***
Known bugs:
* Mounting with options "flushoncommit" and enabling dedupe feature will end up
  with _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill the mount options of dedup and use the ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: add ioctl of dedup control
  Btrfs: improve the delayed refs process in rm case
  Btrfs: fix a crash of dedup ref
  Btrfs: fix deadlock of dedup work
  Btrfs: fix transactin abortion in __btrfs_free_extent
  Btrfs: fix wrong pinned bytes in __btrfs_free_extent
  Btrfs: use total_bytes instead of bytes_used for global_rsv
  Btrfs: fix dedup enospc problem

 fs/btrfs/backref.c   |   9 +
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |  86 ++
 fs/btrfs/delayed-ref.c   |  26 +-
 fs/btrfs/delayed-ref.h   |   3 +
 fs/btrfs/disk-io.c   |  37 +++
 fs/btrfs/extent-tree.c   | 235 +---
 fs/btrfs/extent_io.c |  22 +-
 fs/btrfs/extent_io.h |  16 ++
 fs/btrfs/file-item.c | 244 +
 fs/btrfs/inode.c | 635 ++-
 fs/btrfs/ioctl.c | 167 
 fs/btrfs/ordered-data.c  |  44 ++-
 fs/btrfs/ordered-data.h  |  13 +-
 fs/btrfs/qgroup.c|   3 +
 fs/btrfs/relocation.c|   3 +
 fs/btrfs/transaction.c   |  41 +++
 fs/btrfs/transaction.h   |   1 +
 include/trace/events/btrfs.h |   3 +-
 include/uapi/linux/btrfs.h   |  11 +
 20 files changed, 1470 insertions(+), 131 deletions(-)

-- 
1.8.2.1


Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-26 Thread Liu Bo
Hi Jannis,

On Wed, Feb 26, 2014 at 08:20:01PM +, Jannis Achstetter wrote:
> Jannis Achstetter  web.de> writes: 
> > I tried yout btrfs deduplication patches today (on top of 3.13.2-gentoo) and
> > it seems that the deduplication works great (when copying the same or
> > similar data to the file system, the used size reported by df -h grows less
> > than the data that is copied to it on the second time).
> > However, there are disturbing messages in the kernel log:
> 
> Me again :)
> 
> Today (PC rebooted), there are other messages in the log with traces like:
> [  253.971188] BUG: scheduling while atomic: btrfs-transacti/5985/0x0003
> [  253.971194] Modules linked in: snd_aloop vboxnetflt(O) vboxnetadp(O)
> vboxdrv(O) microcode
> [  253.971208] CPU: 2 PID: 5985 Comm: btrfs-transacti Tainted: GW  O
> 3.13.2-gentoo #4
> [  253.971212] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS
> V1.12 10/06/2011
> [  253.971215]  88022cf3b8a8 81920b14 880237c92b40
> 8191c304
> [  253.971221]  81924bd5 88022df7d750 88022cf3bfd8
> 00012b40
> [  253.971227]  00012b40 88022df7d750 8800c2ff7aa8
> 0020
> [  253.971233] Call Trace:
> [  253.971243]  [] ? dump_stack+0x49/0x6a
> [  253.971251]  [] ? __schedule_bug+0x3e/0x4b
> [  253.971258]  [] ? __schedule+0x7f5/0x8e0
> [  253.971266]  [] ? submit_bio+0x60/0x130
> [  253.971273]  [] ? btrfs_map_bio+0x2d3/0x540
> [  253.971280]  [] ? ktime_get_ts+0x3d/0xd0
> [  253.971287]  [] ? delayacct_end+0x84/0xa0
> [  253.971293]  [] ? filemap_fdatawait+0x20/0x20
> [  253.971299]  [] ? io_schedule+0x83/0xd0
> [  253.971305]  [] ? sleep_on_page+0x5/0x10
> [  253.971312]  [] ? __wait_on_bit+0x54/0x80
> [  253.971319]  [] ? wait_on_page_bit+0x7f/0x90
> [  253.971321]  [] ? autoremove_wake_function+0x30/0x30
> [  253.971323]  [] ? read_extent_buffer_pages+0x2a2/0x2d0
> [  253.971325]  [] ? free_root_pointers+0x60/0x60
> [  253.971327]  [] ?
> btree_read_extent_buffer_pages.constprop.53+0xa9/0x110
> [  253.971330]  [] ? read_tree_block+0x4a/0x80
> [  253.971332]  [] ? 
> read_block_for_search.isra.32+0x177/0x3a0
> [  253.971334]  [] ? unlock_up+0x13a/0x160
> [  253.971336]  [] ? btrfs_search_slot+0x400/0x970
> [  253.971338]  [] ? btrfs_free_dedup_extent+0x7a/0x1c0
> [  253.971340]  [] ? 
> extent_data_ref_offset.isra.30+0x79/0x110
> [  253.971342]  [] ? __btrfs_free_extent+0xa1c/0xc70
> [  253.971344]  [] ? run_clustered_refs+0x47c/0x1110
> [  253.971347]  [] ? find_ref_head+0x5d/0x90
> [  253.971348]  [] ? btrfs_run_delayed_refs+0xc8/0x510
> [  253.971351]  [] ? btrfs_commit_transaction+0x55/0x990
> [  253.971353]  [] ? start_transaction+0x8a/0x560
> [  253.971355]  [] ? transaction_kthread+0x19d/0x230
> [  253.971357]  [] ? btrfs_cleanup_transaction+0x540/0x540
> [  253.971360]  [] ? kthread+0xc1/0xe0
> [  253.971362]  [] ? kthread_create_on_node+0x190/0x190
> [  253.971364]  [] ? ret_from_fork+0x7c/0xb0
> [  253.971366]  [] ? kthread_create_on_node+0x190/0x190
> 
> (More of them at http://bpaste.net/show/183075/ )

Yeah, I've also found this and fixed it locally :)

> 
> One more question: Do I have to run "btrfs dedup on -b 128k /mnt/steamdir"
> after every mount or is that info stored across mounts?

For now, yes; the default value is 8K, i.e. we reset it on every mount.
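
So until the blocksize is persisted, re-apply it after every mount; a small
sketch using the setup from your earlier mail (adjust device and mount point
as needed):

  mount /dev/sdc2 /mnt/steamdir
  btrfs dedup on -b 128k /mnt/steamdir   # blocksize falls back to 8K otherwise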

Thanks for the report!

-liubo


Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-26 Thread Jannis Achstetter
Jannis Achstetter  web.de> writes: 
> I tried yout btrfs deduplication patches today (on top of 3.13.2-gentoo) and
> it seems that the deduplication works great (when copying the same or
> similar data to the file system, the used size reported by df -h grows less
> than the data that is copied to it on the second time).
> However, there are disturbing messages in the kernel log:

Me again :)

Today (PC rebooted), there are other messages in the log with traces like:
[  253.971188] BUG: scheduling while atomic: btrfs-transacti/5985/0x0003
[  253.971194] Modules linked in: snd_aloop vboxnetflt(O) vboxnetadp(O)
vboxdrv(O) microcode
[  253.971208] CPU: 2 PID: 5985 Comm: btrfs-transacti Tainted: GW  O
3.13.2-gentoo #4
[  253.971212] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS
V1.12 10/06/2011
[  253.971215]  88022cf3b8a8 81920b14 880237c92b40
8191c304
[  253.971221]  81924bd5 88022df7d750 88022cf3bfd8
00012b40
[  253.971227]  00012b40 88022df7d750 8800c2ff7aa8
0020
[  253.971233] Call Trace:
[  253.971243]  [] ? dump_stack+0x49/0x6a
[  253.971251]  [] ? __schedule_bug+0x3e/0x4b
[  253.971258]  [] ? __schedule+0x7f5/0x8e0
[  253.971266]  [] ? submit_bio+0x60/0x130
[  253.971273]  [] ? btrfs_map_bio+0x2d3/0x540
[  253.971280]  [] ? ktime_get_ts+0x3d/0xd0
[  253.971287]  [] ? delayacct_end+0x84/0xa0
[  253.971293]  [] ? filemap_fdatawait+0x20/0x20
[  253.971299]  [] ? io_schedule+0x83/0xd0
[  253.971305]  [] ? sleep_on_page+0x5/0x10
[  253.971312]  [] ? __wait_on_bit+0x54/0x80
[  253.971319]  [] ? wait_on_page_bit+0x7f/0x90
[  253.971321]  [] ? autoremove_wake_function+0x30/0x30
[  253.971323]  [] ? read_extent_buffer_pages+0x2a2/0x2d0
[  253.971325]  [] ? free_root_pointers+0x60/0x60
[  253.971327]  [] ?
btree_read_extent_buffer_pages.constprop.53+0xa9/0x110
[  253.971330]  [] ? read_tree_block+0x4a/0x80
[  253.971332]  [] ? read_block_for_search.isra.32+0x177/0x3a0
[  253.971334]  [] ? unlock_up+0x13a/0x160
[  253.971336]  [] ? btrfs_search_slot+0x400/0x970
[  253.971338]  [] ? btrfs_free_dedup_extent+0x7a/0x1c0
[  253.971340]  [] ? extent_data_ref_offset.isra.30+0x79/0x110
[  253.971342]  [] ? __btrfs_free_extent+0xa1c/0xc70
[  253.971344]  [] ? run_clustered_refs+0x47c/0x1110
[  253.971347]  [] ? find_ref_head+0x5d/0x90
[  253.971348]  [] ? btrfs_run_delayed_refs+0xc8/0x510
[  253.971351]  [] ? btrfs_commit_transaction+0x55/0x990
[  253.971353]  [] ? start_transaction+0x8a/0x560
[  253.971355]  [] ? transaction_kthread+0x19d/0x230
[  253.971357]  [] ? btrfs_cleanup_transaction+0x540/0x540
[  253.971360]  [] ? kthread+0xc1/0xe0
[  253.971362]  [] ? kthread_create_on_node+0x190/0x190
[  253.971364]  [] ? ret_from_fork+0x7c/0xb0
[  253.971366]  [] ? kthread_create_on_node+0x190/0x190

(More of them at http://bpaste.net/show/183075/ )

One more question: Do I have to run "btrfs dedup on -b 128k /mnt/steamdir"
after every mount or is that info stored across mounts?

Best regards,
 Jannis



Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-25 Thread Jannis Achstetter
Jannis Achstetter  web.de> writes:

> 
> Hello Liu, hello list,
> 
> Liu Bo  oracle.com> writes:
> > Here is the New Year patch bomb 
> 

Some more info I forgot: I set the dedup block size to 128k, but I forgot to do
so the first time:

> btrfs dedup enable /mnt/steamdir
> btrfs dedup on /mnt/steamdir
> btrfs dedup off /mnt/steamdir
> btrfs dedup on -b 128k /mnt/steamdir

The data written to the file system does not seem to get corrupted. I copied
a >20GB file to the fs twice with different names, and the sha1sums are
identical both times and match the source file.
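
For reference, the check amounts to something like this (the paths are just
placeholders for my setup):

  cp /path/to/big.img /mnt/steamdir/copy1.img
  cp /path/to/big.img /mnt/steamdir/copy2.img
  sha1sum /path/to/big.img /mnt/steamdir/copy1.img /mnt/steamdir/copy2.img
  # all three checksums match, so dedup did not corrupt the data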



Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-25 Thread Jannis Achstetter
Hello Liu, hello list,

Liu Bo  oracle.com> writes:
> Here is the New Year patch bomb 

I tried your btrfs deduplication patches today (on top of 3.13.2-gentoo) and
it seems that the deduplication works great (when copying the same or
similar data to the file system, the used size reported by df -h grows by less
than the amount of data copied to it the second time).
However, there are disturbing messages in the kernel log:
[46620.519741] btrfs-delalloc-: page allocation failure: order:5, mode:0x104050
[46620.519743] CPU: 1 PID: 25246 Comm: btrfs-delalloc- Tainted: G  
O 3.13.2-gentoo #4
[46620.519744] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS
V1.12 10/06/2011
[46620.519745]  880098881b38 81920b14 00104050
81145927
[46620.519747]   880237ffc800 
0005
[46620.519749]  0001 8191ce20 880237ffc808
0002
[46620.519750] Call Trace:
[46620.519752]  [] ? dump_stack+0x49/0x6a
[46620.519754]  [] ? warn_alloc_failed+0xe7/0x140
[46620.519757]  [] ? __alloc_pages_direct_compact+0xa2/0x1b0
[46620.519759]  [] ? __alloc_pages_nodemask+0x7f4/0x970
[46620.519761]  [] ? alloc_pages_current+0xaf/0x180
[46620.519763]  [] ? __get_free_pages+0x5/0x40
[46620.519765]  [] ? kmalloc_order_trace+0x3b/0xe0
[46620.519767]  [] ? run_delalloc_dedup+0x685/0xba0
[46620.519769]  [] ? run_locked_range.constprop.66+0xce/0xd0
[46620.519771]  [] ? detach_if_pending+0x100/0x100
[46620.519773]  [] ? async_cow_start+0x75/0xa0
[46620.519775]  [] ? worker_loop+0x149/0x550
[46620.519777]  [] ? btrfs_queue_worker+0x2f0/0x2f0
[46620.519779]  [] ? kthread+0xc1/0xe0
[46620.519781]  [] ? kthread_create_on_node+0x190/0x190
[46620.519783]  [] ? ret_from_fork+0x7c/0xb0
[46620.519785]  [] ? kthread_create_on_node+0x190/0x190

(many of them, as to be seen here: http://bpaste.net/show/182698/ )

I use btrfs with lzo compression:
/dev/sdc2 on /mnt/steamdir type btrfs (rw,compress=lzo)

Kernel config is here:
http://bpaste.net/show/182706/

If you need any more info, just let me know!

Thanks for your work and all the best,
 Jannis Achstetter



Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-01-02 Thread Konstantinos Skarlatos
Sorry for the spam, I just mixed up the order of your patches. They now 
apply cleanly to 3.13 git.
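
For anyone else hitting this: applying the numbered series in one go avoids
the mix-up; a sketch, assuming the mails were saved as 0001-*.patch through
0014-*.patch:

  cd linux-3.13
  git am /path/to/dedup-v8/00*.patch   # the shell glob keeps them in order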

Thanks

On 2/1/2014 4:32 μμ, Konstantinos Skarlatos wrote:
Hello, I am trying to test your patches and they do not apply to 
latest 3.12 source or 3.13 git. Am I doing something wrong?


---logs for 3.12---

Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1085 (offset 96 lines).
Hunk #2 succeeded at 1096 (offset 96 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file 
fs/btrfs/extent-tree.c.rej

patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 765 with fuzz 2 (offset -151 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file 
fs/btrfs/file-item.c.rej

patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej

---logs for 3.13---
Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1078 (offset 89 lines).
Hunk #2 succeeded at 1089 (offset 89 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file 
fs/btrfs/extent-tree.c.rej

patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 768 with fuzz 2 (offset -148 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file 
fs/btrfs/file-item.c.rej

patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej


On 30/12/2013 10:12 πμ, Liu Bo wrote:

Hello,

Here is the New Year patch bomb :-)

Data deduplication is a specialized data compression technique for 
eliminating duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project 
ideas[2], it introduces inband data deduplication for btrfs and 
dedup/dedupe is for short.

PATCH 1 is a hang fix with deduplication on, but it's also useful 
without dedup in practice use.

PATCH 2 and 3 are targetting delayed refs' scalability problems, 
which are uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCH 5-8 is the preparation work for dedup implementation.

PATCH 9 shows how we implement dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 adds the ioctl to control dedup feature.

PATCH 13 fixes the metadata ENOSPC problem with dedup which has been 
there WAY TOO LONG.

PATCH 14 fixes a race bug on dedup writes.

And there is also a btrfs-progs patch(PATCH 15) which offers all 
details about how to control the dedup feature.

I've tested this with xfstests by adding a inline dedup 'enable & on' 
in xfstests' mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: 
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage


v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the lastest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crash due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (14):
   Btrfs: skip merge part for delayed data refs
   Btrfs: improve the delayed refs process in rm case
   Btrfs: introduce a head ref rbtree
   Btrfs: disable qgroups accounting when quata_enable is 0
   Btrfs: introduce dedup tree and relatives
   Btrfs: introduce dedup tree operations
   Btrfs: introduce dedup state
   Btrfs: make ordered extent aware of dedup
   Btrfs: online(inband) data dedup
   Btrfs: skip dedup reference during backref walking
   Btrfs: don't return space for dedup extent
   Btrfs: add ioctl of dedup control
   Btrfs: fix dedupe 'ENOSPC' problem
   Btrfs: fix a crash of dedup ref

  fs/btrfs/backref.c   |   9 +
  fs/btrfs/ctree.c |   2 +-
  fs/btrf

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-01-02 Thread Konstantinos Skarlatos
Hello, I am trying to test your patches and they do not apply to the latest 
3.12 source or 3.13 git. Am I doing something wrong?


---logs for 3.12---

Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1085 (offset 96 lines).
Hunk #2 succeeded at 1096 (offset 96 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/extent-tree.c.rej
patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 765 with fuzz 2 (offset -151 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file fs/btrfs/file-item.c.rej
patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej

---logs for 3.13---
Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1078 (offset 89 lines).
Hunk #2 succeeded at 1089 (offset 89 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/extent-tree.c.rej
patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 768 with fuzz 2 (offset -148 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file fs/btrfs/file-item.c.rej
patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej


On 30/12/2013 10:12 πμ, Liu Bo wrote:

Hello,

Here is the New Year patch bomb :-)

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is for short.

PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practice use.

PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCH 5-8 is the preparation work for dedup implementation.

PATCH 9 shows how we implement dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 adds the ioctl to control dedup feature.

PATCH 13 fixes the metadata ENOSPC problem with dedup which has been there
WAY TOO LONG.

PATCH 14 fixes a race bug on dedup writes.

And there is also a btrfs-progs patch(PATCH 15) which offers all details about
how to control the dedup feature.

I've tested this with xfstests by adding a inline dedup 'enable & on' in 
xfstests'
mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the lastest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crash due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (14):
   Btrfs: skip merge part for delayed data refs
   Btrfs: improve the delayed refs process in rm case
   Btrfs: introduce a head ref rbtree
   Btrfs: disable qgroups accounting when quata_enable is 0
   Btrfs: introduce dedup tree and relatives
   Btrfs: introduce dedup tree operations
   Btrfs: introduce dedup state
   Btrfs: make ordered extent aware of dedup
   Btrfs: online(inband) data dedup
   Btrfs: skip dedup reference during backref walking
   Btrfs: don't return space for dedup extent
   Btrfs: add ioctl of dedup control
   Btrfs: fix dedupe 'ENOSPC' problem
   Btrfs: fix a crash of dedup ref

  fs/btrfs/backref.c   |   9 +
  fs/btrfs/ctree.c |   2 +-
  fs/btrfs/ctree.h |  86 ++
  fs/btrfs/delayed-ref.c   | 161 +++
  fs/btrfs/delayed-ref.h   |   8 +
  fs/btrfs/disk-io.c   |  40 +++
  fs/btrfs/extent-tree.c 

[RFC PATCH v8 00/14] Online(inband) data deduplication

2013-12-30 Thread Liu Bo
Hello,

Here is the New Year patch bomb :-)

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is for short.

PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practical use.

PATCHES 2 and 3 target delayed refs' scalability problems, which are
uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCHES 5-8 are the preparation work for the dedup implementation.

PATCH 9 shows how we implement dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 adds the ioctl to control dedup feature.

PATCH 13 fixes the metadata ENOSPC problem with dedup which has been there
WAY TOO LONG.

PATCH 14 fixes a race bug on dedup writes.

And there is also a btrfs-progs patch (PATCH 15) which offers all details about
how to control the dedup feature.

I've tested this with xfstests by adding an inline dedup 'enable & on' to
xfstests' mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (14):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: add ioctl of dedup control
  Btrfs: fix dedupe 'ENOSPC' problem
  Btrfs: fix a crash of dedup ref

 fs/btrfs/backref.c   |   9 +
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |  86 ++
 fs/btrfs/delayed-ref.c   | 161 +++
 fs/btrfs/delayed-ref.h   |   8 +
 fs/btrfs/disk-io.c   |  40 +++
 fs/btrfs/extent-tree.c   | 208 --
 fs/btrfs/extent_io.c |  22 +-
 fs/btrfs/extent_io.h |  16 ++
 fs/btrfs/file-item.c | 244 +
 fs/btrfs/inode.c | 635 ++-
 fs/btrfs/ioctl.c | 167 
 fs/btrfs/ordered-data.c  |  38 ++-
 fs/btrfs/ordered-data.h  |  13 +-
 fs/btrfs/qgroup.c|   3 +
 fs/btrfs/relocation.c|   3 +
 fs/btrfs/transaction.c   |   4 +-
 include/trace/events/btrfs.h |   3 +-
 include/uapi/linux/btrfs.h   |  11 +
 19 files changed, 1501 insertions(+), 172 deletions(-)

-- 
1.8.2.1



Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Liu Bo
(Cced: David)

On Wed, Oct 23, 2013 at 10:26:17AM +0800, Liu Bo wrote:
> On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote:
> > Hi,
> > 
> > On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote:
> > > Data deduplication is a specialized data compression technique for 
> > > eliminating
> > > duplicate copies of repeating data.[1]
> > > 
> > > This patch set is also related to "Content based storage" in project 
> > > ideas[2].
> > > 
> > > PATCH 1 is a hang fix with deduplication on, but it's also useful without
> > > dedup in practice use.
> > > 
> > > PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
> > > uncovered by the dedup feature.
> > > 
> > > PATCH 4 is a speed-up improvement, which is about dedup and quota.
> > > 
> > > PATCH 5-8 is the preparation for dedup implementation.
> > > 
> > > PATCH 9 shows how we implement dedup feature.
> > > 
> > > PATCH 10 fixes a backref walking bug with dedup.
> > > 
> > > PATCH 11 fixes a free space bug of dedup extents on error handling.
> > > 
> > > PATCH 12 fixes a race bug on dedup writes.
> > > 
> > > PATCH 13 adds the ioctl to control dedup feature.
> > > 
> > > And there is also a btrfs-progs patch(PATCH 14) which involves all 
> > > details of
> > > how to control dedup feature.
> > > 
> > > I've tested this with xfstests by adding a inline dedup 'enable & on' in 
> > > xfstests'
> > > mount and scratch_mount.
> > > 
> > > TODO:
> > > * a bit-to-bit comparison callback.
> > > 
> > > All comments are welcome!
> > > 
> > 
> > Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6
> > and it worked correctly, although I haven't used it on production
> > servers given the bit-to-bit comparison callback isn't implemented yet.
> 
> Many thanks for testing this!
> 
> It's not yet proper for production server use until we solve the
> metadata reservation problems(I'm working on it right now).
> 
> > 
> > I have a few comments on the ioctl to control the dedup feature.
> > Basically it is used to enable the deduplication, to switch it on or off
> > and to select the blocksize. Couldn't it be implemented as a mount
> > option instead like for the other btrfs features? The dedup tree would
> > be created the first time the mount option is activated, and the on/off
> > would be controlled by the presence of the dedup mount flag. The
> > blocksize could be specified with the value appended to the dedup
> > option, for example dedup=8192.
> 
> In the previous version patch set, actually I chose to use mount options
> to provide a flexible control of dedup, but as thread[1] shows, David
> thinked that mount options is too heavy to use as it cannot be removed
> once it's merged.
> 
> [1]: http://www.spinics.net/lists/linux-btrfs/msg27294.html
> 
> -liubo
-liubo


Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Liu Bo
On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote:
> Hi,
> 
> On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote:
> > Data deduplication is a specialized data compression technique for 
> > eliminating
> > duplicate copies of repeating data.[1]
> > 
> > This patch set is also related to "Content based storage" in project 
> > ideas[2].
> > 
> > PATCH 1 is a hang fix with deduplication on, but it's also useful without
> > dedup in practice use.
> > 
> > PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
> > uncovered by the dedup feature.
> > 
> > PATCH 4 is a speed-up improvement, which is about dedup and quota.
> > 
> > PATCH 5-8 is the preparation for dedup implementation.
> > 
> > PATCH 9 shows how we implement dedup feature.
> > 
> > PATCH 10 fixes a backref walking bug with dedup.
> > 
> > PATCH 11 fixes a free space bug of dedup extents on error handling.
> > 
> > PATCH 12 fixes a race bug on dedup writes.
> > 
> > PATCH 13 adds the ioctl to control dedup feature.
> > 
> > And there is also a btrfs-progs patch(PATCH 14) which involves all details 
> > of
> > how to control dedup feature.
> > 
> > I've tested this with xfstests by adding a inline dedup 'enable & on' in 
> > xfstests'
> > mount and scratch_mount.
> > 
> > TODO:
> > * a bit-to-bit comparison callback.
> > 
> > All comments are welcome!
> > 
> 
> Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6
> and it worked correctly, although I haven't used it on production
> servers given the bit-to-bit comparison callback isn't implemented yet.

Many thanks for testing this!

It's not yet suitable for production use until we solve the
metadata reservation problems (I'm working on that right now).

> 
> I have a few comments on the ioctl to control the dedup feature.
> Basically it is used to enable the deduplication, to switch it on or off
> and to select the blocksize. Couldn't it be implemented as a mount
> option instead like for the other btrfs features? The dedup tree would
> be created the first time the mount option is activated, and the on/off
> would be controlled by the presence of the dedup mount flag. The
> blocksize could be specified with the value appended to the dedup
> option, for example dedup=8192.

In the previous version of the patch set I actually chose mount options to
provide flexible control of dedup, but as thread[1] shows, David
thought that a mount option is too heavy to use, as it cannot be removed
once it's merged.

[1]: http://www.spinics.net/lists/linux-btrfs/msg27294.html

-liubo


Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Aurelien Jarno
Hi,

On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote:
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
> 
> This patch set is also related to "Content based storage" in project ideas[2].
> 
> PATCH 1 is a hang fix with deduplication on, but it's also useful without
> dedup in practice use.
> 
> PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
> uncovered by the dedup feature.
> 
> PATCH 4 is a speed-up improvement, which is about dedup and quota.
> 
> PATCH 5-8 is the preparation for dedup implementation.
> 
> PATCH 9 shows how we implement dedup feature.
> 
> PATCH 10 fixes a backref walking bug with dedup.
> 
> PATCH 11 fixes a free space bug of dedup extents on error handling.
> 
> PATCH 12 fixes a race bug on dedup writes.
> 
> PATCH 13 adds the ioctl to control dedup feature.
> 
> And there is also a btrfs-progs patch(PATCH 14) which involves all details of
> how to control dedup feature.
> 
> I've tested this with xfstests by adding a inline dedup 'enable & on' in 
> xfstests'
> mount and scratch_mount.
> 
> TODO:
> * a bit-to-bit comparison callback.
> 
> All comments are welcome!
> 

Thanks for this new patchset. I have tested it on top of kernel 3.12-rc6
and it worked correctly, although I haven't used it on production
servers given that the bit-to-bit comparison callback isn't implemented yet.

I have a few comments on the ioctl to control the dedup feature.
Basically it is used to enable the deduplication, to switch it on or off
and to select the blocksize. Couldn't it be implemented as a mount
option instead like for the other btrfs features? The dedup tree would
be created the first time the mount option is activated, and the on/off
would be controlled by the presence of the dedup mount flag. The
blocksize could be specified with the value appended to the dedup
option, for example dedup=8192.
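
For concreteness, a rough sketch of what the suggested mount-option interface might look like in super.c, using the existing match_token() machinery (the token names below are hypothetical; the current series uses an ioctl instead):

    enum { Opt_dedup, Opt_dedup_bs };

    static const match_table_t dedup_tokens = {
            { Opt_dedup,    "dedup" },
            { Opt_dedup_bs, "dedup_bs=%s" },
            { -1,           NULL },
    };

    /* inside btrfs_parse_options(), per parsed token: */
    case Opt_dedup_bs:
            num = match_strdup(&args[0]);
            if (num) {
                    info->dedup_bs = memparse(num, NULL);
                    kfree(num);
            }
            break;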

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net


[RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-13 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practical use.

PATCHES 2 and 3 target delayed refs' scalability problems, which are
uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCHES 5-8 are the preparation for the dedup implementation.

PATCH 9 shows how we implement the dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 fixes a race bug on dedup writes.

PATCH 13 adds the ioctl to control dedup feature.

And there is also a btrfs-progs patch (PATCH 14) which covers all the details of
how to control the dedup feature.

I've tested this with xfstests by adding an inline dedup 'enable & on' in
xfstests' mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959



Liu Bo (13):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: fix a crash of dedup ref
  Btrfs: add ioctl of dedup control

 fs/btrfs/backref.c   |9 +
 fs/btrfs/ctree.c |2 +-
 fs/btrfs/ctree.h |   85 ++
 fs/btrfs/delayed-ref.c   |  159 +++
 fs/btrfs/delayed-ref.h   |8 +
 fs/btrfs/disk-io.c   |   45 +++
 fs/btrfs/extent-tree.c   |  190 +++--
 fs/btrfs/extent_io.c |   22 ++-
 fs/btrfs/extent_io.h |   15 +
 fs/btrfs/file-item.c |  230 +++
 fs/btrfs/inode.c |  641 +-
 fs/btrfs/ioctl.c |  167 +++
 fs/btrfs/ordered-data.c  |   38 ++-
 fs/btrfs/ordered-data.h  |   13 +-
 fs/btrfs/qgroup.c|3 +
 fs/btrfs/relocation.c|3 +
 fs/btrfs/transaction.c   |4 +-
 include/trace/events/btrfs.h |3 +-
 include/uapi/linux/btrfs.h   |   11 +
 19 files changed, 1478 insertions(+), 170 deletions(-)

-- 
1.7.7



Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication

2013-09-08 Thread Liu Bo
Hi Dave,

On Mon, Sep 02, 2013 at 06:19:42PM +0200, David Sterba wrote:
> I wanted to only comment on the ioctl and interface to userspace bits,
> but found more things to comment in the kernel code.

Sorry for the late reply (I'm out on vacation these days).

> 
> On Thu, Aug 08, 2013 at 04:35:45PM +0800, Liu Bo wrote:
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
> >  /* for storing balance parameters in the root tree */
> >  #define BTRFS_BALANCE_OBJECTID -4ULL
> >  
> > +/* dedup tree(experimental) */
> > +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
> 
> 9 now conflicts with the uuid tree, when you update the patch, please
> put the dedup tree definition right next after the uuid tree.
> 
> I've noticed that the lockdep annotation is also missing from disk-io.c)
> and the symbolic tree name in include/trace/events/btrfs.h as well.

Good point; I just saw that the uuid tree has been merged into -next.

> 
> > +enum btrfs_dedup_type {
> > +   BTRFS_DEDUP_SHA256 = 0,
> > +   BTRFS_DEDUP_LAST = 1,
> > +};
> > +
> > +static int btrfs_dedup_lens[] = { 4, 0 };
> > +static int btrfs_dedup_sizes[] = { 32, 0 };/* 256bit, 32bytes */
> 
> This is a bit confusing, what's the difference between 'len' and 'size'
> here.

Hmm, I admit it's ugly: since we want to store the last 64 bits of the
256-bit hash value as key.offset, an array of u64[4] is used and the
last 64 bits end up in array[3]. Any better ideas will be appreciated.
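
For illustration only, a minimal sketch of the indexing described above, assuming the 256-bit digest is copied into a u64[4] array whose last little-endian word becomes key.offset (the variable names are hypothetical, not from the patch):

    u8 digest[32];            /* 256-bit sha256, e.g. from crypto_shash_digest() */
    __le64 words[4];
    u64 stop64;

    memcpy(words, digest, sizeof(words));
    stop64 = le64_to_cpu(words[3]);   /* the "stop" 64 bits used as key.offset */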

> 
> > +struct btrfs_dedup_item {
> > +   __le64 len; /* disk length of dedup range */
> > +   u8 type;
> > +   u8 compression;
> > +   u8 encryption;
> > +   __le16 other_encoding; /* spare for later use */
> > +
> > +   /* hash/fingerprints go here */
> > +} __attribute__ ((__packed__));
> > +
> > +struct btrfs_dedup_hash {
> > +   /* hash algorithm */
> > +   int type;
> > +   u64 bytenr;
> > +   u64 num_bytes;
> > +   int compression;
> 
> Layout of this structure is a bit suboptimal, there is a 4 byte hole
> after type and compression, and you can use u8 types for both, but as
> this would leave another 6b hole after, using ints is ok, so the final
> layout should preferrably look like
> 
> {
>   u64 bytenr;
>   u64 num_bytes;
>   int type;
>   int compression;
> 
>   u64 hash[];
> };

Looks good, will update it.

> 
> > +
> > +   /* last field is a variable length array of dedup hash */
> > +   u64 hash[];
> > +};
> > +
> > @@ -1967,6 +2027,7 @@ struct btrfs_ioctl_defrag_range_args {
> >  #define BTRFS_MOUNT_CHECK_INTEGRITY(1 << 20)
> >  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
> >  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
> > +#define BTRFS_MOUNT_DEDUP  (1 << 24)
> >  
> >  #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt)
> >  #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2184,6 +2194,8 @@ int open_ctree(struct super_block *sb,
> > atomic64_set(&fs_info->tree_mod_seq, 0);
> > fs_info->sb = sb;
> > fs_info->max_inline = 8192 * 1024;
> > +   fs_info->dedup_bs = 8192;
> > +   fs_info->dedup_type = BTRFS_DEDUP_SHA256;
> > fs_info->metadata_ratio = 0;
> > fs_info->defrag_inodes = RB_ROOT;
> > fs_info->free_chunk_space = 0;
> > @@ -2282,6 +2294,7 @@ int open_ctree(struct super_block *sb,
> > mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
> > mutex_init(&fs_info->dev_replace.lock_management_lock);
> > mutex_init(&fs_info->dev_replace.lock);
> > +   mutex_init(&fs_info->dedup_ioctl_mutex);
> >  
> > spin_lock_init(&fs_info->qgroup_lock);
> > mutex_init(&fs_info->qgroup_ioctl_lock);
> > @@ -2457,6 +2470,14 @@ int open_ctree(struct super_block *sb,
> > goto fail_alloc;
> > }
> >  
> > +   fs_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
> 
> Isn't this supposed to use fs_info->dedup_type somehow? This is harcoded
> to sha256 and it's the only one available now, but when the defines and
> variables are there, please do use them

Okay.

> 
> > +   if (IS_ERR(fs_info->dedup_driver)) {
> > +   pr_info("Cannot load sha256 driver\n");
> > +   err = PTR_ERR(fs_info->dedup_driver);
> > +   fs_info->dedup_driver = NULL;
> > +   goto fail_alloc;
> > +   }
> > +
> > btrfs_init_workers(&fs_info->generic_worker,
> >"genwork", 1, NULL);
> >  
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -1102,8 +1102,16 @@ static noinline int lookup_extent_data_ref(struct 
> > btrfs_trans_handle *trans,
> > key.offset = parent;
> > } else {
> > key.type = BTRFS_EXTENT_DATA_REF_KEY;
> > -   key.offset = hash_extent_data_ref(root_objectid,
> > - owner, offset);
> > +
> > +   /*
> > +* we've not got the right offset and owne

Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication

2013-09-02 Thread David Sterba
I wanted to only comment on the ioctl and interface to userspace bits,
but found more things to comment in the kernel code.

On Thu, Aug 08, 2013 at 04:35:45PM +0800, Liu Bo wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
>  /* for storing balance parameters in the root tree */
>  #define BTRFS_BALANCE_OBJECTID -4ULL
>  
> +/* dedup tree(experimental) */
> +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL

9 now conflicts with the uuid tree; when you update the patch, please
put the dedup tree definition right after the uuid tree.

I've noticed that the lockdep annotation is also missing from disk-io.c,
and the symbolic tree name in include/trace/events/btrfs.h as well.

> +enum btrfs_dedup_type {
> + BTRFS_DEDUP_SHA256 = 0,
> + BTRFS_DEDUP_LAST = 1,
> +};
> +
> +static int btrfs_dedup_lens[] = { 4, 0 };
> +static int btrfs_dedup_sizes[] = { 32, 0 };  /* 256bit, 32bytes */

This is a bit confusing; what's the difference between 'len' and 'size'
here?

> +struct btrfs_dedup_item {
> + __le64 len; /* disk length of dedup range */
> + u8 type;
> + u8 compression;
> + u8 encryption;
> + __le16 other_encoding; /* spare for later use */
> +
> + /* hash/fingerprints go here */
> +} __attribute__ ((__packed__));
> +
> +struct btrfs_dedup_hash {
> + /* hash algorithm */
> + int type;
> + u64 bytenr;
> + u64 num_bytes;
> + int compression;

The layout of this structure is a bit suboptimal: there is a 4-byte hole
after type and after compression. You could use u8 types for both, but as
that would leave another 6-byte hole, using ints is OK, so the final
layout should preferably look like:

{
u64 bytenr;
u64 num_bytes;
int type;
int compression;

u64 hash[];
};

> +
> + /* last field is a variable length array of dedup hash */
> + u64 hash[];
> +};
> +
> @@ -1967,6 +2027,7 @@ struct btrfs_ioctl_defrag_range_args {
>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
> +#define BTRFS_MOUNT_DEDUP(1 << 24)
>  
>  #define btrfs_clear_opt(o, opt)  ((o) &= ~BTRFS_MOUNT_##opt)
>  #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt)
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2184,6 +2194,8 @@ int open_ctree(struct super_block *sb,
>   atomic64_set(&fs_info->tree_mod_seq, 0);
>   fs_info->sb = sb;
>   fs_info->max_inline = 8192 * 1024;
> + fs_info->dedup_bs = 8192;
> + fs_info->dedup_type = BTRFS_DEDUP_SHA256;
>   fs_info->metadata_ratio = 0;
>   fs_info->defrag_inodes = RB_ROOT;
>   fs_info->free_chunk_space = 0;
> @@ -2282,6 +2294,7 @@ int open_ctree(struct super_block *sb,
>   mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
>   mutex_init(&fs_info->dev_replace.lock_management_lock);
>   mutex_init(&fs_info->dev_replace.lock);
> + mutex_init(&fs_info->dedup_ioctl_mutex);
>  
>   spin_lock_init(&fs_info->qgroup_lock);
>   mutex_init(&fs_info->qgroup_ioctl_lock);
> @@ -2457,6 +2470,14 @@ int open_ctree(struct super_block *sb,
>   goto fail_alloc;
>   }
>  
> + fs_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);

Isn't this supposed to use fs_info->dedup_type somehow? This is hardcoded
to sha256, and it's the only one available now, but when the defines and
variables are there, please do use them.
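
A minimal sketch of what that could look like, mapping the configured type to the shash driver name instead of the literal string (the lookup table is hypothetical, not part of the patch):

    static const char * const btrfs_dedup_hash_names[] = {
            [BTRFS_DEDUP_SHA256] = "sha256",
    };

    fs_info->dedup_driver =
            crypto_alloc_shash(btrfs_dedup_hash_names[fs_info->dedup_type], 0, 0);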

> + if (IS_ERR(fs_info->dedup_driver)) {
> + pr_info("Cannot load sha256 driver\n");
> + err = PTR_ERR(fs_info->dedup_driver);
> + fs_info->dedup_driver = NULL;
> + goto fail_alloc;
> + }
> +
>   btrfs_init_workers(&fs_info->generic_worker,
>  "genwork", 1, NULL);
>  
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1102,8 +1102,16 @@ static noinline int lookup_extent_data_ref(struct 
> btrfs_trans_handle *trans,
>   key.offset = parent;
>   } else {
>   key.type = BTRFS_EXTENT_DATA_REF_KEY;
> - key.offset = hash_extent_data_ref(root_objectid,
> -   owner, offset);
> +
> + /*
> +  * we've not got the right offset and owner, so search by -1
> +  * here.
> +  */
> + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
> + key.offset = -1;

offset is u64

> + else
> + key.offset = hash_extent_data_ref(root_objectid,
> +   owner, offset);
>   }
>  again:
>   recow = 0;
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode,
> + struct page *locked_page, u64 start, u64 e

[RFC PATCH v6 0/5] Online data deduplication

2013-08-08 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practical use.

PATCHES 2 and 3 target delayed refs' scalability problems, which are
uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCH 5 contains the real work: all the details of the dedup implementation.

Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
feature.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v5->v6:
- remove BUG_ON()s and use proper error handling.
- make dedup hash endian safe on disk.
- refactor the dedup tree item.
- fix a bug of deleting file extents with dedup disabled.
- some cleanups
- add manpage for dedup subcommand.

v4->v5:
- go back to one dedup key with a special backref for dedup tree because
  the disk format understands backref well.
- fix a fsync hang with dedup enabled.
- rebase onto the latest btrfs.


Liu Bo (5):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: online data deduplication

 fs/btrfs/backref.c |9 +
 fs/btrfs/ctree.c   |2 +-
 fs/btrfs/ctree.h   |   82 ++
 fs/btrfs/delayed-ref.c |  159 +++
 fs/btrfs/delayed-ref.h |8 +
 fs/btrfs/disk-io.c |   31 ++
 fs/btrfs/extent-tree.c |  190 +++--
 fs/btrfs/extent_io.c   |   29 ++-
 fs/btrfs/extent_io.h   |   16 +
 fs/btrfs/file-item.c   |  211 ++
 fs/btrfs/inode.c   |  673 +++-
 fs/btrfs/ioctl.c   |   93 ++
 fs/btrfs/ordered-data.c|   38 ++-
 fs/btrfs/ordered-data.h|   13 +-
 fs/btrfs/qgroup.c  |3 +
 fs/btrfs/relocation.c  |3 +
 fs/btrfs/super.c   |   27 ++-
 fs/btrfs/transaction.c |4 +-
 include/uapi/linux/btrfs.h |5 +
 19 files changed, 1420 insertions(+), 176 deletions(-)

-- 
1.7.7



Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-08-01 Thread Zach Brown
> So do you mean that our whole hash value will be (key.objectid + bytes)
> because key.objectid is a part of hash value?

I think so, if I understood your question.  The idea is to not store the
bytes of the hash that make up the objectid more than once so the tree
items are smaller.

For example:

+   read_extent_buffer(leaf, hash_in_item, ((unsigned long)hash_item),
+  BTRFS_DEDUP_HASH_SIZE);

That'd be (_SIZE - sizeof(u64)) if the bytes of the hash that made up
the object id weren't also stored in the item payload.

- z


Re: [RFC PATCH v5 0/5] Online data deduplication

2013-08-01 Thread Liu Bo
On Wed, Jul 31, 2013 at 05:20:27PM -0400, Josef Bacik wrote:
> On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote:
> > Data deduplication is a specialized data compression technique for 
> > eliminating
> > duplicate copies of repeating data.[1]
> > 
> > This patch set is also related to "Content based storage" in project 
> > ideas[2].
> > 
> > PATCH 1 is a hang fix with deduplication on, but it's also useful without
> > dedup in practice use.
> > 
> > PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
> > uncovered by the dedup feature.
> > 
> > PATCH 4 is a speed-up improvement, which is about dedup and quota.
> > 
> > PATCH 5 is full of real things, all details about implementation of dedup.
> > 
> > Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
> > feature.
> > 
> > TODO:
> > * a bit-to-bit comparison callback.
> 
> Didn't pass my BUG_ON() search test, try again.

I'm cleaning them up :)

-liubo


Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-08-01 Thread Liu Bo
On Wed, Jul 31, 2013 at 03:50:50PM -0700, Zach Brown wrote:
> 
> > +#define BTRFS_DEDUP_HASH_SIZE 32   /* 256bit = 32 * 8bit */
> > +#define BTRFS_DEDUP_HASH_LEN 4
> > +
> > +struct btrfs_dedup_hash_item {
> > +   /* FIXME: put a hash type field here */
> > +
> > +   __le64 hash[BTRFS_DEDUP_HASH_LEN];
> > +} __attribute__ ((__packed__));
> 
> The handling of hashes in this patch is a mess.
> 
> The inconsistent use of _SIZE, _LEN, and literal 4 and the u64 *s being
> passed around is asking for mistakes to be made in the future.  And I
> don't think it's endian safe.

Yeah, you're right, I missed the endian handling for the hash.

> 
> I think I'd have a struct that represents the native representation of
> the tree item.  Something like:
> 
> struct btrfs_dedup_hash {
>   u64 key_value;
>   u8 algo;
>   u8 len;
>   u8 bytes[0];
> }
> 
> You can then have helpers that generate that from either the cryptolib
> transformation of dedup regions or to and from the tree items.  The
> bytes (and the tree item payload) wouldn't need to have the hash bytes
> that are stored up in the key. 

I agree on merging the two structs, btrfs_dedup_item and btrfs_dedup_hash_item,
into one.

So do you mean that our whole hash value will be (key.objectid + bytes),
because key.objectid is a part of the hash value?

Thanks for the comments!

-liubo


Fwd: Online data deduplication

2013-07-31 Thread Hemanth Kumar
I tried to apply the deduplication patches onto a 3.11-rc2 kernel but I am
getting the following errors. Can you please take a look and let me
know what's wrong?

linux-3805:/home/hemanth/linux-3.11-rc2 # patch -p1 <
/home/hemanth/RFC-V4-1-2-Btrfs-skip-merge-part-for-delayed-data-refs.patch
patching file fs/btrfs/delayed-ref.c
linux-3805:/home/hemanth/linux-3.11-rc2 # patch -p1 <
/home/hemanth/RFC-V4-2-2-Btrfs-online-data-deduplication.patch
patching file fs/btrfs/ctree.h
Hunk #2 succeeded at 95 with fuzz 2.
Hunk #7 succeeded at 1328 (offset 12 lines).
Hunk #8 succeeded at 1669 (offset 15 lines).
Hunk #9 succeeded at 1971 (offset 40 lines).
Hunk #10 succeeded at 2008 (offset 40 lines).
Hunk #11 succeeded at 2944 (offset 40 lines).
Hunk #12 succeeded at 3421 with fuzz 1 (offset 59 lines).
Hunk #13 succeeded at 3586 (offset 59 lines).
patching file fs/btrfs/disk-io.c
Hunk #1 succeeded at 1581 with fuzz 2 (offset 39 lines).
Hunk #2 FAILED at 1997.
Hunk #3 FAILED at 2010.
Hunk #4 succeeded at 2101 (offset 26 lines).
Hunk #5 FAILED at 2088.
Hunk #6 succeeded at 2189 with fuzz 1 (offset 20 lines).
Hunk #7 succeeded at 2288 (offset 19 lines).
Hunk #8 succeeded at 2464 (offset 20 lines).
Hunk #9 FAILED at 2669.
4 out of 9 hunks FAILED -- saving rejects to file fs/btrfs/disk-io.c.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 succeeded at 2219 (offset -3 lines).
Hunk #2 succeeded at 4653 with fuzz 2 (offset 88 lines).
Hunk #3 succeeded at 5806 (offset 189 lines).
patching file fs/btrfs/extent_io.c
Hunk #1 succeeded at 1259 (offset 43 lines).
Hunk #2 succeeded at 1718 (offset 43 lines).
Hunk #3 succeeded at 2429 (offset 21 lines).
Hunk #4 succeeded at 2634 with fuzz 2 (offset 47 lines).
Hunk #5 succeeded at 2674 (offset 47 lines).
Hunk #6 succeeded at 3736 (offset 46 lines).
patching file fs/btrfs/extent_io.h
Hunk #1 FAILED at 19.
Hunk #2 succeeded at 52 (offset 1 line).
Hunk #3 succeeded at 226 (offset 1 line).
Hunk #4 succeeded at 353 (offset 3 lines).
1 out of 4 hunks FAILED -- saving rejects to file fs/btrfs/extent_io.h.rej
patching file fs/btrfs/file-item.c
Hunk #1 succeeded at 858 (offset -34 lines).
patching file fs/btrfs/inode.c
Hunk #1 succeeded at 105 (offset 3 lines).
Hunk #2 succeeded at 296 (offset 3 lines).
Hunk #3 succeeded at 314 (offset 3 lines).
Hunk #4 succeeded at 369 (offset 3 lines).
Hunk #5 succeeded at 438 (offset 3 lines).
Hunk #6 succeeded at 575 (offset 3 lines).
Hunk #7 succeeded at 603 (offset 3 lines).
Hunk #8 succeeded at 671 (offset 3 lines).
Hunk #9 succeeded at 772 (offset 7 lines).
Hunk #10 succeeded at 797 (offset 7 lines).
Hunk #11 succeeded at 818 (offset 7 lines).
Hunk #12 succeeded at 1214 (offset 7 lines).
Hunk #13 succeeded at 1316 (offset 7 lines).
Hunk #14 succeeded at 1406 (offset 7 lines).
Hunk #15 succeeded at 1799 (offset 7 lines).
Hunk #16 succeeded at 1875 (offset 7 lines).
Hunk #17 succeeded at 1903 (offset 7 lines).
Hunk #18 succeeded at 1925 (offset 7 lines).
Hunk #19 succeeded at 2360 (offset 32 lines).
Hunk #20 succeeded at 2423 (offset 32 lines).
Hunk #21 succeeded at 3182 (offset 27 lines).
patching file fs/btrfs/ioctl.c
Hunk #1 succeeded at 4097 (offset 23 lines).
Hunk #2 succeeded at 4298 (offset 25 lines).
patching file fs/btrfs/ordered-data.c
Hunk #1 succeeded at 183 with fuzz 2 (offset 1 line).
Hunk #2 succeeded at 204 (offset 2 lines).
Hunk #3 succeeded at 264 (offset 10 lines).
Hunk #4 succeeded at 281 (offset 10 lines).
Hunk #5 succeeded at 526 (offset 10 lines).
patching file fs/btrfs/ordered-data.h
Hunk #1 succeeded at 102 (offset -12 lines).
Hunk #2 succeeded at 131 (offset -12 lines).
Hunk #3 succeeded at 167 (offset -15 lines).
patching file fs/btrfs/super.c
Hunk #1 succeeded at 320 (offset 2 lines).
Hunk #2 succeeded at 362 (offset 2 lines).
Hunk #3 succeeded at 629 (offset 2 lines).
Hunk #4 succeeded at 964 (offset -1 lines).
patching file include/uapi/linux/btrfs.h
Hunk #2 FAILED at 542.
1 out of 2 hunks FAILED -- saving rejects to file include/uapi/linux/btrfs.h.rej


On Tue, Jul 30, 2013 at 7:37 AM, Liu Bo  wrote:
> On Mon, Jul 29, 2013 at 09:05:42PM +0530, Hemanth Kumar wrote:
>> Hello,
>> I am willing to perform QA on online data deduplication. From where
>> can i download the patches?
>
> Hi Hemanth Kumar H C,
>
> I really appreciate this :)
>
> Right now I'm planning v5 version patch set, which will come out probably in
> this week(Only one bug is left on my hands)
>
> v5 version will have a big change on disk format, performance, btw.
>
> For v4 you can find it in the following url:
> https://patchwork.kernel.org/patch/2565601/
> https://patchwork.kernel.org/patch/2565591/
>
> thanks,
> -liubo



--
Thanks,
Hemanth Kumar H C


Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-07-31 Thread Zach Brown

> +#define BTRFS_DEDUP_HASH_SIZE 32 /* 256bit = 32 * 8bit */
> +#define BTRFS_DEDUP_HASH_LEN 4
> +
> +struct btrfs_dedup_hash_item {
> + /* FIXME: put a hash type field here */
> +
> + __le64 hash[BTRFS_DEDUP_HASH_LEN];
> +} __attribute__ ((__packed__));

The handling of hashes in this patch is a mess.

The inconsistent use of _SIZE, _LEN, and literal 4 and the u64 *s being
passed around is asking for mistakes to be made in the future.  And I
don't think it's endian safe.

I think I'd have a struct that represents the native representation of
the tree item.  Something like:

struct btrfs_dedup_hash {
u64 key_value;
u8 algo;
u8 len;
u8 bytes[0];
}

You can then have helpers that generate that from either the cryptolib
transformation of dedup regions or to and from the tree items.  The
bytes (and the tree item payload) wouldn't need to have the hash bytes
that are stored up in the key. 
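
A hedged sketch of that helper idea (names and layout are illustrative, not actual code): the tree item payload carries only the digest bytes that are not already encoded in the key, and the full hash is rebuilt from key_value plus the payload.

    static void dedup_hash_from_item(struct btrfs_dedup_hash *hash, u64 key_value,
                                     const u8 *payload, u8 algo, u8 digest_len)
    {
            __le64 tail = cpu_to_le64(key_value);

            hash->key_value = key_value;
            hash->algo = algo;
            hash->len = digest_len;
            /* payload holds digest_len - 8 bytes; the rest lives in the key */
            memcpy(hash->bytes, payload, digest_len - sizeof(u64));
            memcpy(hash->bytes + digest_len - sizeof(u64), &tail, sizeof(tail));
    }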

- z


Re: [RFC PATCH v5 0/5] Online data deduplication

2013-07-31 Thread Josef Bacik
On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote:
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
> 
> This patch set is also related to "Content based storage" in project ideas[2].
> 
> PATCH 1 is a hang fix with deduplication on, but it's also useful without
> dedup in practice use.
> 
> PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
> uncovered by the dedup feature.
> 
> PATCH 4 is a speed-up improvement, which is about dedup and quota.
> 
> PATCH 5 is full of real things, all details about implementation of dedup.
> 
> Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
> feature.
> 
> TODO:
> * a bit-to-bit comparison callback.

Didn't pass my BUG_ON() search test, try again.

Josef


[RFC PATCH v5 0/5] Online data deduplication

2013-07-31 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practical use.

PATCHES 2 and 3 target delayed refs' scalability problems, which are
uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCH 5 contains the real work: all the details of the dedup implementation.

Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
feature.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v4->v5:
- go back to one dedup key with a special backref for dedup tree because
  the disk format understands backref well.
- fix a fsync hang with dedup enabled.
- rebase onto the latest btrfs.


Liu Bo (5):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: online data deduplication

 fs/btrfs/backref.c |9 +
 fs/btrfs/ctree.h   |   59 
 fs/btrfs/delayed-ref.c |  141 +++
 fs/btrfs/delayed-ref.h |8 +
 fs/btrfs/disk-io.c |   30 ++
 fs/btrfs/extent-tree.c |  196 --
 fs/btrfs/extent_io.c   |   29 ++-
 fs/btrfs/extent_io.h   |   16 ++
 fs/btrfs/file-item.c   |  217 +++
 fs/btrfs/inode.c   |  637 ++--
 fs/btrfs/ioctl.c   |   93 +++
 fs/btrfs/ordered-data.c|   36 ++-
 fs/btrfs/ordered-data.h|   11 +-
 fs/btrfs/qgroup.c  |6 +
 fs/btrfs/relocation.c  |3 +
 fs/btrfs/super.c   |   27 ++-
 fs/btrfs/transaction.c |4 +-
 include/uapi/linux/btrfs.h |5 +
 18 files changed, 1356 insertions(+), 171 deletions(-)

-- 
1.7.7



Re: Online data deduplication

2013-07-29 Thread Liu Bo
On Mon, Jul 29, 2013 at 09:05:42PM +0530, Hemanth Kumar wrote:
> Hello,
> I am willing to perform QA on online data deduplication. From where
> can i download the patches?

Hi Hemanth Kumar H C,

I really appreciate this :)

Right now I'm planning the v5 patch set, which will probably come out
this week (only one bug is left on my hands).

The v5 version will bring big changes to the disk format and performance, btw.

For v4, you can find it at the following URLs:
https://patchwork.kernel.org/patch/2565601/
https://patchwork.kernel.org/patch/2565591/

thanks,
-liubo


Online data deduplication

2013-07-29 Thread Hemanth Kumar
Hello,
I am willing to perform QA on online data deduplication. Where
can I download the patches?

-- 
Thanks,
Hemanth Kumar H C


[RFC PATCH V4 0/2] Online data deduplication

2013-05-14 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix with deduplication on, but it's also useful without
deduplication in practical use.

For more implementation details, please refer to PATCH 2.

Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
feature.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v4:
  * add an INCOMPAT flag so that old kernels won't mount a dedup btrfs.
  * elaborate error handling.
  * address a compress bug.
  * address an issue of dedup flag on extent state tree.
  * add new dedup ioctl interface.
v3:
  * add COMPRESS support
  * add a real ioctl to enable dedup feature
  * change the maximum allowed dedup blocksize to 128k because of compression
range limit
v2:
  * To avoid enlarging the file extent item's size, add another index key used
for freeing dedup extent.
  * Freeing dedup extent is now like how we delete checksum.
  * Add support for an alternative deduplication blocksize larger than PAGESIZE.
  * Add a mount option to set deduplication blocksize.
  * Add support for those writes that are smaller than deduplication blocksize.

---
* HOW To turn deduplication on:

There are 2 steps you need to do before using it,
1) mount /dev/disk /mnt_of_your_btrfs -o dedup
   (or mount /dev/disk /mnt_of_your_btrfs -o dedup_bs=128K)
2) btrfs dedup register /mnt_of_your_btrfs
---
* HOW To turn deduplication off:

Just mount your btrfs without "-o dedup" or "-o dedup_bs=xxxK"
---
* HOW To disable deduplication completely:

There are 2 steps you need to do,
1) mount your btrfs WITHOUT "-o dedup" or "-o dedup_bs=xxxK"

2) btrfs dedup unregister /mnt_fs_your_btrfs
(NOTE: 'unregister' won't work unless you do step 1 FIRST.)
---

Liu Bo (2):
  Btrfs: skip merge part for delayed data refs
  Btrfs: online data deduplication

 fs/btrfs/ctree.h   |   63 +
 fs/btrfs/delayed-ref.c |7 +
 fs/btrfs/disk-io.c |   34 +++-
 fs/btrfs/extent-tree.c |9 +
 fs/btrfs/extent_io.c   |   29 ++-
 fs/btrfs/extent_io.h   |   16 ++
 fs/btrfs/file-item.c   |  274 +++
 fs/btrfs/inode.c   |  630 ++--
 fs/btrfs/ioctl.c   |   93 +++
 fs/btrfs/ordered-data.c|   34 ++-
 fs/btrfs/ordered-data.h|   11 +-
 fs/btrfs/super.c   |   27 ++-
 include/uapi/linux/btrfs.h |5 +
 13 files changed, 1141 insertions(+), 91 deletions(-)

-- 
1.7.7



Re: [RFC PATCH v3 0/2] Online data deduplication

2013-05-03 Thread Liu Bo
> You didn't use an INCOPMAT option for this so you need to deal with a user
> mounting the file system with an older kernel or even forgetting to use mount 
> -o
> dedup.  Otherwise your dedup tree will become out of date and you could 
> corrupt
> peoples data.  So if you aren't going to use an INCOMPAT flag you need to at
> least use a COMPAT flag so we know the option has been used at all and then 
> you
> need to have a mechanism to know if you need to invalidate the hash tree.
> 
> Users are also going to make the mistake of thinking dedup will make their
> workload awesome, and when it doesn't they need a way to turn it off.  If you 
> do
> an INCOMPAT option then you need to have a way to delete the hash tree and 
> unset
> the INCOMPAT flag.  If you do the COMPAT route then you get this for free 
> since
> the user just needs to stop using -o dedup, but you'll probably also want to
> provide a mechanism to delete the tree to free up space.  Thanks,
> 
> Josef

I made a few mistakes on this; yeah, I should also provide a way to disable
dedup, and I'm going to use INCOMPAT.

But forgetting to use mount -o dedup will not make the dedup tree go out of date,
because the dedup tree is loaded whenever it exists, regardless of whether 'mount -o
dedup' is used.

Thanks for the nice reminder, Josef :)

thanks,
liubo


Re: [PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Gabriel de Perthuis
>  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>   struct btrfs_ioctl_dev_replace_args)
> +#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)

This number has already been used by the offline dedup patches.



Re: [RFC PATCH v3 0/2] Online data deduplication

2013-05-01 Thread Josef Bacik
On Wed, May 01, 2013 at 10:27:36AM -0600, Liu Bo wrote:
> NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data!
> 
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
> 
> This patch set is also related to "Content based storage" in project ideas[2].
> 
> PATCH 1 is a hang fix when deduplication is on, but it's also useful with no
> deduplication in practice use.
> 
> For more implementation details, please refer to PATCH 2.
> 
> TODO:
> * a bit-to-bit comparison callback.
> 
> All comments are welcome!
> 
> [1]: http://en.wikipedia.org/wiki/Data_deduplication
> [2]: 
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
> 
> 
> v3:
>   * add COMPRESS support
>   * add a real ioctl to enable dedup feature
>   * change the maximum allowed dedup blocksize to 128k because of compressed
> range limit
> v2:
>   * To avoid enlarging the file extent item's size, add another index key used
> for freeing dedup extent.
>   * Freeing dedup extent is now like how we delete checksum.
>   * Add support for alternative deduplicatin blocksize larger than PAGESIZE.
>   * Add a mount option to set deduplication blocksize.
>   * Add support for those writes that are smaller than deduplication 
> blocksize.
> 
> =
> HOW To turn deduplication on:
> 
> There are 2 steps you need to do before using it,
> 1) mount /dev/disk /mnt_of_your_btrfs -o dedup
>(or mount /dev/disk /mnt_of_your_btrfs -o dedup_bs=128K)
> 2) btrfs filesystem dedup-register /mnt_of_your_btrfs
> =
> 

You didn't use an INCOMPAT option for this, so you need to deal with a user
mounting the file system with an older kernel or even forgetting to use mount -o
dedup.  Otherwise your dedup tree will become out of date and you could corrupt
people's data.  So if you aren't going to use an INCOMPAT flag you need to at
least use a COMPAT flag so we know the option has been used at all, and then you
need to have a mechanism to know if you need to invalidate the hash tree.
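
For illustration, a hedged sketch of the INCOMPAT route: set a feature bit in the superblock when the dedup tree is created, so that kernels which don't know the bit refuse the mount instead of silently letting the hash tree go stale (BTRFS_FEATURE_INCOMPAT_DEDUP is hypothetical here, not a flag from the patch set):

    u64 features = btrfs_super_incompat_flags(fs_info->super_copy);

    if (!(features & BTRFS_FEATURE_INCOMPAT_DEDUP)) {
            features |= BTRFS_FEATURE_INCOMPAT_DEDUP;
            btrfs_set_super_incompat_flags(fs_info->super_copy, features);
    }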

Users are also going to make the mistake of thinking dedup will make their
workload awesome, and when it doesn't they need a way to turn it off.  If you do
an INCOMPAT option then you need to have a way to delete the hash tree and unset
the INCOMPAT flag.  If you do the COMPAT route then you get this for free since
the user just needs to stop using -o dedup, but you'll probably also want to
provide a mechanism to delete the tree to free up space.  Thanks,

Josef


[PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)

This introduces the online data deduplication feature for btrfs.

(1) WHY do we need deduplication?
To improve our storage efficiency.

(2) WHAT is deduplication?
There are two key choices in practical deduplication implementations:
*  When the data is deduplicated
   (inband vs background)
*  The granularity of the deduplication.
   (block level vs file level)

For btrfs, we choose
*  inband(synchronous)
*  block level

We choose them for the same reasons as zfs.
a)  To get an immediate benefit.
b)  To remove redundant parts within a file.

So we have an inband, block level data deduplication here.

(3) HOW does deduplication work?
This makes full use of file extent back reference, the same way as
IOCTL_CLONE, which lets us easily store multiple copies of a set of
data as a single copy along with an index of references to the copy.

Here we have
a)  a new dedicated tree(DEDUP tree) and
b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
(stop 64bits of hash, type, disk offset),
*  stop 64bits of hash
   We take the stop 64bits of the sha256 as the index.
   Also we store the whole 256bits of sha256 in its item, which is
   very helpful on avoiding collision.
*  disk offset
   It helps to find where the data is stored.
c)  a new key(BTRFS_DEDUP_INDEX_KEY), which is
(disk offset, type, stop 64bits of hash),
It's mainly used when we're going to free a range of space occupied by
a file extent.

So the whole deduplication process works as,
1) write something,
2) calculate the hash of this "something" based on the deduplication unit,
3) try to find the match of hash value by searching DEDUP keys in
   a dedicated tree, DEDUP tree.
4) if found, skip real IO and link to the existing copy
   if not, do real IO and insert a DEDUP key to the DEDUP tree.
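
For reference, a minimal sketch of the two key layouts described above, using the regular btrfs_key fields (the local variable names are illustrative only):

    struct btrfs_key hash_key, index_key;

    /* lookup/insert by hash: (stop 64 bits of sha256, ITEM key, disk bytenr) */
    hash_key.objectid = stop_64bits_of_sha256;
    hash_key.type     = BTRFS_DEDUP_ITEM_KEY;
    hash_key.offset   = disk_bytenr;

    /* free by range: (disk bytenr, INDEX key, stop 64 bits of sha256) */
    index_key.objectid = disk_bytenr;
    index_key.type     = BTRFS_DEDUP_INDEX_KEY;
    index_key.offset   = stop_64bits_of_sha256;

On a write, step 3 above is a search for hash_key in the DEDUP tree; a hit means the new file extent can simply point at the existing copy via a backref, while a miss means the data is written out and both keys are inserted.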

Right now, we can
a) enable deduplication with mount option "-o dedup",
b) control the deduplication unit with mount options '-o dedup_bs=xxx'.

The default deduplication unit is 8192, and the maximum allowed dedup
blocksize is 128k because of the compressed range limit.

Signed-off-by: Liu Bo 
---
v3:
* add compress support
* add a real ioctl to enable dedup feature
* change the maximum allowed dedup blocksize to 128k because of compressed range
  limit

v2:
* Make the changelog clearer.
* To avoid enlarging the file extent item's size, add another index key used for
  freeing dedup extent.
* Freeing dedup extent is now like how we delete checksum.
* Add support for an alternative deduplication blocksize larger than PAGESIZE.
* Add a mount option to set deduplication blocksize.
* Add support for those writes that are smaller than deduplication blocksize.
* Cleanups

 fs/btrfs/ctree.h   |   54 
 fs/btrfs/disk-io.c |   34 +++-
 fs/btrfs/extent-tree.c |7 +
 fs/btrfs/extent_io.c   |   27 ++-
 fs/btrfs/extent_io.h   |   15 ++
 fs/btrfs/file-item.c   |  242 ++
 fs/btrfs/inode.c   |  583 ++--
 fs/btrfs/ioctl.c   |   38 +++
 fs/btrfs/ordered-data.c|   30 ++-
 fs/btrfs/ordered-data.h|   11 +-
 fs/btrfs/super.c   |   27 ++-
 include/uapi/linux/btrfs.h |1 +
 12 files changed, 983 insertions(+), 86 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0d82922..c28538a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
 /* holds quota configuration and tracking */
 #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
 
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
+
 /* orhpan objectid for tracking unlinked/truncated files */
 #define BTRFS_ORPHAN_OBJECTID -5ULL
 
@@ -121,6 +125,9 @@ struct btrfs_ordered_sum;
  */
 #define BTRFS_FREE_INO_OBJECTID -12ULL
 
+/* dedup keys(with only disk bytenr as index) all have this objectid */
+#define BTRFS_DEDUP_OBJECTID -13ULL
+
 /* dummy objectid represents multiple objectids */
 #define BTRFS_MULTIPLE_OBJECTIDS -255ULL
 
@@ -890,6 +897,22 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/* dedup */
+struct btrfs_dedup_item {
+   __le64 len; /* disk length of dedup range */
+
+   u8 compression;
+   u8 encryption;
+   __le16 other_encoding; /* spare for later use */
+} __attribute__ ((__packed__));
+
+#define BTRFS_DEDUP_HASH_SIZE 32
+#define BTRFS_DEDUP_HASH_LEN 4
+
+struct btrfs_dedup_hash_item {
+   __le64 hash[BTRFS_DEDUP_HASH_LEN];
+} __attribute__ ((__packed__));
+
 struct btrfs_dev_stats_item {
   

[RFC PATCH v3 0/2] Online data deduplication

2013-05-01 Thread Liu Bo
NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data!

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

PATCH 1 is a hang fix when deduplication is on, but it's also useful without
deduplication in practical use.

For more implementation details, please refer to PATCH 2.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage


v3:
  * add COMPRESS support
  * add a real ioctl to enable dedup feature
  * change the maximum allowed dedup blocksize to 128k because of compressed
range limit
v2:
  * To avoid enlarging the file extent item's size, add another index key used
for freeing dedup extent.
  * Freeing dedup extent is now like how we delete checksum.
  * Add support for an alternative deduplication blocksize larger than PAGESIZE.
  * Add a mount option to set deduplication blocksize.
  * Add support for those writes that are smaller than deduplication blocksize.

=
HOW To turn deduplication on:

There are 2 steps you need to do before using it,
1) mount /dev/disk /mnt_of_your_btrfs -o dedup
   (or mount /dev/disk /mnt_of_your_btrfs -o dedup_bs=128K)
2) btrfs filesystem dedup-register /mnt_of_your_btrfs
=

Liu Bo (2):
  Btrfs: skip merge part for delayed data refs
  Btrfs: online data deduplication

 fs/btrfs/ctree.h   |   54 
 fs/btrfs/delayed-ref.c |7 +
 fs/btrfs/disk-io.c |   34 +++-
 fs/btrfs/extent-tree.c |7 +
 fs/btrfs/extent_io.c   |   27 ++-
 fs/btrfs/extent_io.h   |   15 ++
 fs/btrfs/file-item.c   |  242 ++
 fs/btrfs/inode.c   |  583 ++--
 fs/btrfs/ioctl.c   |   38 +++
 fs/btrfs/ordered-data.c|   30 ++-
 fs/btrfs/ordered-data.h|   11 +-
 fs/btrfs/super.c   |   27 ++-
 include/uapi/linux/btrfs.h |1 +
 13 files changed, 990 insertions(+), 86 deletions(-)

-- 
1.7.7



[PATCH v2 1/2] Btrfs: online data deduplication

2013-04-14 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)

This introduces the online data deduplication feature for btrfs.

(1) WHY do we need deduplication?
To improve our storage efficiency.

(2) WHAT is deduplication?
There are two key choices in practical deduplication implementations:
*  When the data is deduplicated
   (inband vs background)
*  The granularity of the deduplication.
   (block level vs file level)

For btrfs, we choose
*  inband(synchronous)
*  block level

We choose them for the same reasons as zfs.
a)  To get an immediate benefit.
b)  To remove redundant parts within a file.

So we have an inband, block level data deduplication here.

(3) HOW does deduplication work?
This makes full use of file extent back reference, the same way as
IOCTL_CLONE, which lets us easily store multiple copies of a set of
data as a single copy along with an index of references to the copy.

Here we have
a)  a new dedicated tree(DEDUP tree) and
b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
(stop 64bits of hash, type, disk offset),
*  stop 64bits of hash
   We take the stop 64bits of the sha256 as the index.
   Also we store the whole 256bits of sha256 in its item, which is
   very helpful on avoiding collision.
*  disk offset
   It helps to find where the data is stored.
c)  a new key(BTRFS_DEDUP_INDEX_KEY), which is
(disk offset, type, stop 64bits of hash),
It's mainly used when we're going to free a range of space occupied by
a file extent.

So the whole deduplication process works as,
1) write something,
2) calculate the hash of this "something" based on the deduplication unit,
3) try to find the match of hash value by searching DEDUP keys in
   a dedicated tree, DEDUP tree.
4) if found, skip real IO and link to the existing copy
   if not, do real IO and insert a DEDUP key to the DEDUP tree.

Right now, we can
a) enable deduplication with mount option "-o dedup",
b) control the deduplication unit with mount options '-o 
dedup_blocksize=xxx'.

The default deduplication unit is 8192.
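
For illustration, here is a minimal user-space sketch of the write path in
steps 1)-4) above. It is not taken from the patch: the in-memory table and the
FNV-style stand-in hash are assumptions standing in for the dedicated DEDUP
btree and the SHA-256 computed through the kernel crypto API.

/* Toy model of the inband dedup decision: hash the block, look it up,
 * and either link to the existing copy or "write" and insert an entry. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_BS   8192		/* default dedup unit from the changelog */
#define TABLE_SIZE 1024

struct dedup_entry {
	uint64_t hash_index;	/* stand-in for the "stop 64 bits" index */
	uint64_t disk_bytenr;	/* where the existing copy lives */
	int used;
};

static struct dedup_entry table[TABLE_SIZE];
static uint64_t next_bytenr = 4096;

/* Stand-in hash (FNV-1a); NOT collision safe, illustration only. */
static uint64_t toy_hash(const unsigned char *data, size_t len)
{
	uint64_t h = 1469598103934665603ULL;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Returns the bytenr the block ends up at, deduplicated or freshly written. */
static uint64_t dedup_write(const unsigned char *block)
{
	uint64_t h = toy_hash(block, DEDUP_BS);
	struct dedup_entry *e = &table[h % TABLE_SIZE];

	if (e->used && e->hash_index == h)
		return e->disk_bytenr;	/* 4a: skip real IO, link to the copy */

	/* 4b: do "real IO" and insert an entry (index collisions are
	 * simply overwritten here; a real implementation must not). */
	e->hash_index = h;
	e->disk_bytenr = next_bytenr;
	e->used = 1;
	next_bytenr += DEDUP_BS;
	return e->disk_bytenr;
}

int main(void)
{
	unsigned char a[DEDUP_BS];
	memset(a, 0, sizeof(a));
	printf("first write  -> bytenr %llu\n",
	       (unsigned long long)dedup_write(a));
	printf("second write -> bytenr %llu (deduplicated)\n",
	       (unsigned long long)dedup_write(a));
	return 0;
}

With the real SHA-256, an index hit is additionally verified against the full
digest stored in the dedup item, as discussed elsewhere in the thread.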

Signed-off-by: Liu Bo 
---
v2:
* Make the changelog clearer.
* To avoid enlarging the file extent item's size, add another index key used for
  freeing dedup extent.
* Freeing dedup extent is now like how we delete checksum.
* Add support for an alternative deduplication blocksize larger than PAGESIZE
  (see the sketch after this list).
* Add a mount option to set deduplication blocksize.
* Add support for those writes that are smaller than deduplication blocksize.
* Cleanups
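
For illustration, a rough user-space sketch of how the blocksize-related points
above could work: each write range is cut into dedup_bs-sized pieces that are
hashed separately, and a tail smaller than one full unit falls back to a plain
write. This is an assumption about the behaviour, not code from the patch;
process_write and the printed labels are invented for this sketch.

#include <stdint.h>
#include <stdio.h>

/* Split a write range [start, start+len) into dedup_bs-sized chunks. */
static void process_write(uint64_t start, uint64_t len, uint64_t dedup_bs)
{
	uint64_t off = 0;

	while (off < len) {
		uint64_t chunk = (len - off < dedup_bs) ? len - off : dedup_bs;

		if (chunk == dedup_bs)
			printf("hash+lookup  [%llu, %llu)\n",
			       (unsigned long long)(start + off),
			       (unsigned long long)(start + off + chunk));
		else
			printf("plain write  [%llu, %llu) (smaller than dedup_bs)\n",
			       (unsigned long long)(start + off),
			       (unsigned long long)(start + off + chunk));
		off += chunk;
	}
}

int main(void)
{
	/* a 300 KiB write with dedup_bs = 128 KiB: two dedup'able chunks
	 * plus a 44 KiB tail that is written out normally */
	process_write(0, 300 * 1024, 128 * 1024);
	return 0;
}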

 fs/btrfs/ctree.h|   45 ++
 fs/btrfs/disk-io.c  |   34 +-
 fs/btrfs/extent-tree.c  |7 +
 fs/btrfs/extent_io.c|8 +-
 fs/btrfs/extent_io.h|   11 ++
 fs/btrfs/extent_map.h   |1 +
 fs/btrfs/file-item.c|  231 ++
 fs/btrfs/inode.c|  363 +++
 fs/btrfs/ioctl.c|   34 +-
 fs/btrfs/ordered-data.c |   29 -
 fs/btrfs/ordered-data.h |9 ++
 fs/btrfs/super.c|   25 +++-
 12 files changed, 761 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0d82922..c13bcbb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
 /* holds quota configuration and tracking */
 #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
 
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
+
 /* orhpan objectid for tracking unlinked/truncated files */
 #define BTRFS_ORPHAN_OBJECTID -5ULL
 
@@ -121,6 +125,9 @@ struct btrfs_ordered_sum;
  */
 #define BTRFS_FREE_INO_OBJECTID -12ULL
 
+/* dedup keys(with only disk bytenr as index) all have this objectid */
+#define BTRFS_DEDUP_OBJECTID -13ULL
+
 /* dummy objectid represents multiple objectids */
 #define BTRFS_MULTIPLE_OBJECTIDS -255ULL
 
@@ -890,6 +897,18 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/* dedup */
+struct btrfs_dedup_item {
+   __le64 len;
+} __attribute__ ((__packed__));
+
+#define BTRFS_DEDUP_HASH_SIZE 32
+#define BTRFS_DEDUP_HASH_LEN 4
+
+struct btrfs_dedup_hash_item {
+   __le64 hash[BTRFS_DEDUP_HASH_LEN];
+} __attribute__ ((__packed__));
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -1287,6 +1306,7 @@ struct btrfs_fs_info {
struct btrfs_root *dev_root;
struct btrfs_root *fs_root;
struct btrfs_root *csum_root;
+   struct btrfs_root *dedup_root;
struct btrfs_root *quota_root;
 
	/* the log root tree is a directory of all the other log roots */

[PATCH v2 0/2] Online data deduplication

2013-04-14 Thread Liu Bo
This is the second attempt at online data deduplication.

NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data!

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

For more implementation details, please refer to PATCH 1.

PATCH 2 is a hang fix when deduplication is on.

TODO:
* a bit-to-bit comparison callback.
* an IOCTL for enabling deduplication.

All comments are welcome!

v2:
* To avoid enlarging the file extent item's size, add another index key used for
  freeing dedup extent.
* Freeing dedup extent is now like how we delete checksum.
* Add support for an alternative deduplication blocksize larger than PAGESIZE.
* Add a mount option to set deduplication blocksize.
* Add support for those writes that are smaller than deduplication blocksize.

=
HOW To turn deduplication on:

There are 2 steps you need to do before using it,
1) mount /dev/disk /mnt_of_your_btrfs -o dedup
2) btrfs filesystem sync /mnt_of_your_btrfs
(For simplicity, I hack 'btrfs fi sync' to enable deduplication...)

Here is an example:
1) mkfs.btrfs /dev/sdb1
2) mount /dev/sdb1 /mnt/btrfs -o dedup
   (or mount /dev/sdb1 /mnt/btrfs -o dedup_blocksize=4k)
3) btrfs filesystem sync /mnt/btrfs
4) btrfs fi df /mnt/btrfs
   Data: total=8.00MB, used=256.00KB

5) dd if=/dev/zero of=/mnt/btrfs/foo bs=4K count=1; sync
6) dd if=/dev/zero of=/mnt/btrfs/foo bs=1M count=10; sync
7) btrfs fi df /mnt/btrfs
   Data: total=1.01GB, used=260.00KB

So 4K+10M has been written, but used=256.00KB -> used=260.00KB,
only 4KB is used!
=

Liu Bo (2):
  Btrfs: online data deduplication
  Btrfs: skip merge part for delayed data refs

 fs/btrfs/ctree.h|   45 ++
 fs/btrfs/delayed-ref.c  |7 +
 fs/btrfs/disk-io.c  |   34 +-
 fs/btrfs/extent-tree.c  |7 +
 fs/btrfs/extent_io.c|8 +-
 fs/btrfs/extent_io.h|   11 ++
 fs/btrfs/extent_map.h   |1 +
 fs/btrfs/file-item.c|  231 ++
 fs/btrfs/inode.c|  364 +++
 fs/btrfs/ioctl.c|   34 +-
 fs/btrfs/ordered-data.c |   29 -
 fs/btrfs/ordered-data.h |9 ++
 fs/btrfs/super.c|   25 +++-
 13 files changed, 769 insertions(+), 36 deletions(-)

-- 
1.7.7



Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread David Sterba
On Tue, Apr 09, 2013 at 09:52:42AM +0800, Miao Xie wrote:
> On Mon, 8 Apr 2013 15:47:27 +0200, David Sterba wrote:
> > This also depends on file data type and access patterns, fixing the dedup
> > basic chunk size to one block does not IMHO fit most usecases.
> 
> Maybe we can make btrfs(including dedup) support the bigalloc just like ext4.

dedup does not strictly need this; I agree it could make a few things
easier to implement. But I think it removes some flexibility once the
bigalloc cluster size is set: what if I want to change the dedup chunk
size because the structure of my data changes?

david


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread David Sterba
On Mon, Apr 08, 2013 at 10:08:54PM +0800, Liu Bo wrote:
> > Is it safe to use just 64 bits? I'd like to see better reasoning why
> > this is ok. The limitation of btrfs_key to store only 1-2 64bit items is
> > clear and must be handled, but it's IMO a critical design point.
> 
> Actually I use the whole 256bit hash value(stored in its item) to avoid hash
> collision, not just the stop 64bit.

Ok. It's good to point out such things in the changelog/design docs; they
may not be immediately clear from the code.

> > Last but not least, there was another dedup proposal (author CCed)
> > 
> > http://thread.gmane.org/gmane.comp.file-systems.btrfs/21722
> 
> I waited for 2 months and wanted to see the actual code from the proposal, 
> but I
> failed, so I decided to do it myself.

Quoting from the mail:

"My goal is to study Btrfs design, the offline deduplication patch [1]
and to come up with a design for the online dedup, this semester. I will
be implementing the feature next semester (spring, that is)."

So the implementation starts about now ...

> Anyway I'd like to see this feature in btrfs no matter who write it down.

We'll end up with 2 implementations that likely make different tradeoffs,
and once one is merged the other can be thrown away, wasting time and
effort.

Not my problem, but I don't understand why there's another implementation
when the dedup project has already been informally claimed (via the mailing list).


david


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread Liu Bo
On Mon, Apr 08, 2013 at 09:48:16PM -0400, Josef Bacik wrote:
> On Mon, Apr 8, 2013 at 9:34 PM, Liu Bo  wrote:
> > On Mon, Apr 08, 2013 at 04:37:20PM -0400, Josef Bacik wrote:
> >> On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote:
> >> > On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
> >> > > On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
> > [...]
> >> > > +   __le64 dedup_hash;
> >> > > > +
> >> > > >  } __attribute__ ((__packed__));
> >> > > >
> >> > >
> >> > > Please don't do this, do like what we do with the crc tree, just 
> >> > > lookup based on
> >> > > the disk bytenr when we free the extent and drop refs that way.  Don't 
> >> > > further
> >> > > bloat the file extent item, we want it smaller not larger.  Thanks,
> >> > >
> >> > > Josef
> >> >
> >> > So the real trouble is that I take this hash value as the first field of
> >> > btrfs_key, and binary searching without the precise first field is not 
> >> > easy.
> >> >
> >> > Otherwise we may have to add another key type which replaces hash value 
> >> > with
> >> > disk bytenr, ie. (disk bytenr, ANOTHER_KEY_TYPE, hash value), then we'll 
> >> > need to
> >> > search the tree twice to free this one or drop refs.
> >> >
> >> > Either case is tradeoff, but as this is an initial version, we can try 
> >> > all of
> >> > these knobs and choose the better one :)
> >> >
> >>
> >> Why would you need to search twice?  You do something like
> >>
> >> key.objectid = bytenr;
> >> key.type = whatever your type is
> >> key.offset = first 64bits of your hash
> >>
> >> and then you search based on bytenr and you get your hash.  All extents 
> >> that
> >> share hashes will have the same bytenr so you can just search with
> >>
> >> key.objectid = bytenr
> >> key.type = whatever
> >> key.offset = (u64)-1
> >>
> >> and then walk backwards, or do 0 and walk forwards, whatever your 
> >> preference is.
> >> Also make it so the item stores the entire sha.  With Sha256 you are 
> >> wasting
> >> most of the hash result by just storing the first 64bits.  So just use the 
> >> first
> >> 64bits as the index and then store the full hash in the item so you can do 
> >> a
> >> proper compare, and then you can have different hash functions later with
> >> different length hash values.  Thanks,
> >
> > This is on freeing extent, what about finding dedup when we don't yet have a
> > disk bytenr but only a hash value from sha256 on file data?
> >
> > That's why (hash, type, disk bytenr) is necessary.
> >
> > Searching twice means that if we do what you've suggested, we'd not only 
> > update
> > or free (disk bytenr, type, hash), but also (hash, type, disk bytenr).
> >
> > Or am I missing something?
> >
> 
> No sorry I was putting my son to sleep thinking about this and I
> realized the problem you were having.  So I say do like we do with
> DIR_ITEMs, you have one type indexed by bytenr and one indexed by the
> stop 64bits of the hash value.  You just have an empty item for the
> index by bytenr and for the index by the hash value you have the
> offset of the key be the bytenr and then the item stores the length of
> the full hash value and the hash value.  You don't need ref counting
> since that's taken care of by the extent tree, you just free the
> corresponding dedup items when you free the extent.  It would be nice
> to do everything with just one key but I don't think you are going to
> be able to do it, unless you can figure out some way to stash the hash
> in the extent ref or something.  But growing the file extent item or
> the extent ref isn't an option since it affects anybody who doesn't
> use dedup, so I think the 2 key option is probably the best bet.
> Thanks,
> 
> Josef

Yeah, this makes sense.

All right, I've finished switching it to the two-key style (from Josef) and
getting rid of the extra ref (from Josef and Miao).

But I still need to build a framework to allow a dedup blocksize larger than
a page size (from Dave).

thanks,
liubo


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread Liu Bo
On Wed, Apr 10, 2013 at 02:05:32PM +0200, Marek Otahal wrote:
> Hello, 
> this is awesome news! thank you for working on dedup. 

Hi,

Your previous thread floating on the LIST did give me some light, thanks :)

> 
> I have some questions about the dedup approach in regards to other 
> layers/features. 
> 
> 1/ How will the snapshots be handled? 
> 
> Whether data would be dedup-ed between snapshots (potentially big saved-space 
> ratio), or would snapshots be considered isolated? Best, if this could be set 
> by the user. My concern is about being error-prone, where with deduping 
> snapshots, actually only 1 copy of the data would exist and a corruption 
> would damage it as well as all snapshots. Or is this not a problem and we say 
> "safety" is handled by RAID? 

I don't think that snapshots should be responsible for data security; even without
dedup, a snapshot's files can still share the same disk space/extents with the
source.

So I won't treat dedup on snapshots as a special case.

> 
> 2/ Order of dedup/compression? 
> 
> What would be done first, compress a file and then compare blocks for 
> duplicates, or the other way around? 
> 
> Dedup 1st would save some compression work:
> file's block 00 -> hash -> isDup? (if no)-> compress (10x0) -> write
> but proble is written data size is unknown (it's not the 1 block at start)
> 
> Other way, compress first, would waste compression cpu-operations on 
> duplicate blocks, but would yield reduced dedup-related metadata usage, as 1 
> million of zeros would be compressed to a single block and that one only is 
> compared/written. Usefullness here depends on the compression ratio of the 
> file. 
> 
> I'm not sure which approach here would be better? 
> 

In my opinion, dedup is a special kind of compression; it's not worth doing
dedup on compressed data, as I want to keep this feature simple.

I prefer to dedup first rather than compress (but I could be wrong).

thanks,
liubo

> 
> 
> 
> Thank you for your time and explanation. 
> Best wishes, Mark
>
> On Sunday 07 April 2013 21:12:48 Liu Bo wrote:
> > (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
> > 
> > This introduce the online data deduplication feature for btrfs.
> > 
> > (1) WHY do we need deduplication?
> > To improve our storage effiency.
> > 
> > (2) WHAT is deduplication?
> > Two key ways for practical deduplication implementations,
> > *  When the data is deduplicated
> >(inband vs background)
> > *  The granularity of the deduplication.
> >(block level vs file level)
> > 
> > For btrfs, we choose
> > *  inband(synchronous)
> > *  block level
> > 
> > We choose them because of the same reason as how zfs does.
> > a)  To get an immediate benefit.
> > b)  To remove redundant parts within a file.
> > 
> > So we have an inband, block level data deduplication here.
> > 
> > (3) HOW does deduplication works?
> > This makes full use of file extent back reference, the same way as
> > IOCTL_CLONE, which lets us easily store multiple copies of a set of
> > data as a single copy along with an index of references to the copy.
> > 
> > Here we have
> > a)  a new dedicated tree(DEDUP tree) and
> > b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> > (stop 64bits of hash, type, disk offset),
> > *  stop 64bits of hash
> >It comes from sha256, which is very helpful on avoiding 
> > collision.
> >And we take the stop 64bits as the index.
> > *  disk offset
> >It helps to find where the data is stored.
> > 
> > So the whole deduplication process works as,
> > 1) write something,
> > 2) calculate the hash of this "something",
> > 3) try to find the match of hash value by searching DEDUP keys in
> >a dedicated tree, DEDUP tree.
> > 4) if found, skip real IO and link to the existing copy
> >if not, do real IO and insert a DEDUP key to the DEDUP tree.
> > 
> > For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
> > going to increase this unit dynamically in the future.
> > 
> > Signed-off-by: Liu Bo 
> 
> -- 
> 
> Marek Otahal :o)


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread Marek Otahal
Hello, 
this is awesome news! thank you for working on dedup. 

I have some questions about the dedup approach in regards to other 
layers/features. 

1/ How will the snapshots be handled? 

Would data be dedup-ed between snapshots (potentially a big space saving),
or would snapshots be considered isolated? Best if this could be set
by the user. My concern is error-proneness: with deduped
snapshots, only one copy of the data would actually exist, and a corruption would
damage it as well as all snapshots. Or is this not a problem, and we say
"safety" is handled by RAID?

2/ Order of dedup/compression? 

What would be done first, compress a file and then compare blocks for 
duplicates, or the other way around? 

Dedup first would save some compression work:
file's block 00 -> hash -> isDup? (if no) -> compress (10x0) -> write
but the problem is that the written data size is unknown (it's not the one
block at the start).

The other way, compress first, would waste compression CPU cycles on duplicate
blocks, but would reduce dedup-related metadata usage, as a million zeros would
be compressed to a single block and only that one would be compared/written.
Usefulness here depends on the compression ratio of the file.

I'm not sure which approach here would be better? 




Thank you for your time and explanation. 
Best wishes, Mark
   
On Sunday 07 April 2013 21:12:48 Liu Bo wrote:
> (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
> 
> This introduce the online data deduplication feature for btrfs.
> 
> (1) WHY do we need deduplication?
> To improve our storage effiency.
> 
> (2) WHAT is deduplication?
> Two key ways for practical deduplication implementations,
> *  When the data is deduplicated
>(inband vs background)
> *  The granularity of the deduplication.
>(block level vs file level)
> 
> For btrfs, we choose
> *  inband(synchronous)
> *  block level
> 
> We choose them because of the same reason as how zfs does.
> a)  To get an immediate benefit.
> b)  To remove redundant parts within a file.
> 
> So we have an inband, block level data deduplication here.
> 
> (3) HOW does deduplication works?
> This makes full use of file extent back reference, the same way as
> IOCTL_CLONE, which lets us easily store multiple copies of a set of
> data as a single copy along with an index of references to the copy.
> 
> Here we have
> a)  a new dedicated tree(DEDUP tree) and
> b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> (stop 64bits of hash, type, disk offset),
> *  stop 64bits of hash
>It comes from sha256, which is very helpful on avoiding collision.
>And we take the stop 64bits as the index.
> *  disk offset
>It helps to find where the data is stored.
> 
> So the whole deduplication process works as,
> 1) write something,
> 2) calculate the hash of this "something",
> 3) try to find the match of hash value by searching DEDUP keys in
>a dedicated tree, DEDUP tree.
> 4) if found, skip real IO and link to the existing copy
>if not, do real IO and insert a DEDUP key to the DEDUP tree.
> 
> For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
> going to increase this unit dynamically in the future.
> 
> Signed-off-by: Liu Bo 

-- 

Marek Otahal :o)


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Miao Xie
On Mon, 8 Apr 2013 15:47:27 +0200, David Sterba wrote:
> On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote:
>> (2) WHAT is deduplication?
>> Two key ways for practical deduplication implementations,
>> *  When the data is deduplicated
>>(inband vs background)
>> *  The granularity of the deduplication.
>>(block level vs file level)
>>
>> For btrfs, we choose
>> *  inband(synchronous)
>> *  block level
> 
> Block level may be too fine grained leading to excessive fragmentation
> and increased metadata usage given that there's a much higher chance to
> find duplicate (4k) blocks here and there.
> 
> There's always a tradeoff, the practical values that are considered for
> granularity range from 8k to 64, see eg. this paper for graphs and analyses
> 
> http://static.usenix.org/event/fast11/tech/full_papers/Meyer.pdf .
> 
> This also depends on file data type and access patterns, fixing the dedup
> basic chunk size to one block does not IMHO fit most usecases.

Maybe we can make btrfs(including dedup) support the bigalloc just like ext4.

Thanks
Miao

> 
>> (3) HOW does deduplication works?
> ...
>> Here we have
>> a)  a new dedicated tree(DEDUP tree) and
>> b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
>> (stop 64bits of hash, type, disk offset),
>> *  stop 64bits of hash
>>It comes from sha256, which is very helpful on avoiding collision.
>>And we take the stop 64bits as the index.
> 
> Is it safe to use just 64 bits? I'd like to see better reasoning why
> this is ok. The limitation of btrfs_key to store only 1-2 64bit items is
> clear and must be handled, but it's IMO a critical design point.
> 
>> *  disk offset
>>It helps to find where the data is stored.
> 
> Does the disk offset also help to resolving block hash collisions?
> 
>> So the whole deduplication process works as,
>> 1) write something,
>> 2) calculate the hash of this "something",
>> 3) try to find the match of hash value by searching DEDUP keys in
>>a dedicated tree, DEDUP tree.
>> 4) if found, skip real IO and link to the existing copy
>>if not, do real IO and insert a DEDUP key to the DEDUP tree.
> 
> ... how are the hash collisions handled? Using part of a secure has
> cannot be considered equally strong (given that there is not other
> safety checks like comparing the whole blocks).
> 
> Last but not least, there was another dedup proposal (author CCed)
> 
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/21722
> 
> 
> david


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Mon, Apr 8, 2013 at 9:34 PM, Liu Bo  wrote:
> On Mon, Apr 08, 2013 at 04:37:20PM -0400, Josef Bacik wrote:
>> On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote:
>> > On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
>> > > On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
> [...]
>> > > +   __le64 dedup_hash;
>> > > > +
>> > > >  } __attribute__ ((__packed__));
>> > > >
>> > >
>> > > Please don't do this, do like what we do with the crc tree, just lookup 
>> > > based on
>> > > the disk bytenr when we free the extent and drop refs that way.  Don't 
>> > > further
>> > > bloat the file extent item, we want it smaller not larger.  Thanks,
>> > >
>> > > Josef
>> >
>> > So the real trouble is that I take this hash value as the first field of
>> > btrfs_key, and binary searching without the precise first field is not 
>> > easy.
>> >
>> > Otherwise we may have to add another key type which replaces hash value 
>> > with
>> > disk bytenr, ie. (disk bytenr, ANOTHER_KEY_TYPE, hash value), then we'll 
>> > need to
>> > search the tree twice to free this one or drop refs.
>> >
>> > Either case is tradeoff, but as this is an initial version, we can try all 
>> > of
>> > these knobs and choose the better one :)
>> >
>>
>> Why would you need to search twice?  You do something like
>>
>> key.objectid = bytenr;
>> key.type = whatever your type is
>> key.offset = first 64bits of your hash
>>
>> and then you search based on bytenr and you get your hash.  All extents that
>> share hashes will have the same bytenr so you can just search with
>>
>> key.objectid = bytenr
>> key.type = whatever
>> key.offset = (u64)-1
>>
>> and then walk backwards, or do 0 and walk forwards, whatever your preference 
>> is.
>> Also make it so the item stores the entire sha.  With Sha256 you are wasting
>> most of the hash result by just storing the first 64bits.  So just use the 
>> first
>> 64bits as the index and then store the full hash in the item so you can do a
>> proper compare, and then you can have different hash functions later with
>> different length hash values.  Thanks,
>
> This is on freeing extent, what about finding dedup when we don't yet have a
> disk bytenr but only a hash value from sha256 on file data?
>
> That's why (hash, type, disk bytenr) is necessary.
>
> Searching twice means that if we do what you've suggested, we'd not only 
> update
> or free (disk bytenr, type, hash), but also (hash, type, disk bytenr).
>
> Or am I missing something?
>

No sorry I was putting my son to sleep thinking about this and I
realized the problem you were having.  So I say do like we do with
DIR_ITEMs, you have one type indexed by bytenr and one indexed by the
stop 64bits of the hash value.  You just have an empty item for the
index by bytenr and for the index by the hash value you have the
offset of the key be the bytenr and then the item stores the length of
the full hash value and the hash value.  You don't need ref counting
since that's taken care of by the extent tree, you just free the
corresponding dedup items when you free the extent.  It would be nice
to do everything with just one key but I don't think you are going to
be able to do it, unless you can figure out some way to stash the hash
in the extent ref or something.  But growing the file extent item or
the extent ref isn't an option since it affects anybody who doesn't
use dedup, so I think the 2 key option is probably the best bet.
Thanks,

Josef
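
For illustration, a rough user-space sketch of the two-key layout described
above (all struct and field names here are invented for this sketch; the later
v2 patch implements the idea as BTRFS_DEDUP_ITEM_KEY and BTRFS_DEDUP_INDEX_KEY
in the DEDUP tree):

#include <stdint.h>
#include <stdio.h>

/* Direction 1: find an existing copy by content.
 * Key layout: (objectid = last 64 bits of hash, type, offset = bytenr). */
struct dedup_hash_key {
	uint64_t hash_prefix;
	uint8_t  type;
	uint64_t disk_bytenr;
};

/* The item behind that key stores the full digest so an index hit can be
 * verified before the block is treated as a duplicate. */
struct dedup_hash_item {
	uint16_t hash_len;	/* 32 for SHA-256 */
	uint8_t  hash[32];
};

/* Direction 2: find the dedup entries for an extent being freed.
 * Key layout: (objectid = bytenr, type, offset = hash prefix), empty item. */
struct dedup_bytenr_key {
	uint64_t disk_bytenr;
	uint8_t  type;
	uint64_t hash_prefix;
};

int main(void)
{
	/* One extent gets both keys: freeing it walks direction 2 to learn
	 * the hash prefix, then deletes the matching direction-1 entry. */
	struct dedup_hash_key   by_hash   = { 0xdeadbeefULL, 1, 1048576 };
	struct dedup_bytenr_key by_bytenr = { 1048576, 2, 0xdeadbeefULL };

	printf("by hash:   prefix=%llx -> bytenr=%llu\n",
	       (unsigned long long)by_hash.hash_prefix,
	       (unsigned long long)by_hash.disk_bytenr);
	printf("by bytenr: bytenr=%llu -> prefix=%llx\n",
	       (unsigned long long)by_bytenr.disk_bytenr,
	       (unsigned long long)by_bytenr.hash_prefix);
	return 0;
}

No separate reference count is needed: the extent tree already tracks
references, and the dedup entries are simply deleted when the extent is freed.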


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Miao Xie
On Mon, 8 Apr 2013 22:16:26 +0800, Liu Bo wrote:
> On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
>> On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
>>> (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
>>>
>>> This introduce the online data deduplication feature for btrfs.
>>>
>>> (1) WHY do we need deduplication?
>>> To improve our storage effiency.
>>>
>>> (2) WHAT is deduplication?
>>> Two key ways for practical deduplication implementations,
>>> *  When the data is deduplicated
>>>(inband vs background)
>>> *  The granularity of the deduplication.
>>>(block level vs file level)
>>>
>>> For btrfs, we choose
>>> *  inband(synchronous)
>>> *  block level
>>>
>>> We choose them because of the same reason as how zfs does.
>>> a)  To get an immediate benefit.
>>> b)  To remove redundant parts within a file.
>>>
>>> So we have an inband, block level data deduplication here.
>>>
>>> (3) HOW does deduplication works?
>>> This makes full use of file extent back reference, the same way as
>>> IOCTL_CLONE, which lets us easily store multiple copies of a set of
>>> data as a single copy along with an index of references to the copy.
>>>
>>> Here we have
>>> a)  a new dedicated tree(DEDUP tree) and
>>> b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
>>> (stop 64bits of hash, type, disk offset),
>>> *  stop 64bits of hash
>>>It comes from sha256, which is very helpful on avoiding 
>>> collision.
>>>And we take the stop 64bits as the index.
>>> *  disk offset
>>>It helps to find where the data is stored.
>>>
>>> So the whole deduplication process works as,
>>> 1) write something,
>>> 2) calculate the hash of this "something",
>>> 3) try to find the match of hash value by searching DEDUP keys in
>>>a dedicated tree, DEDUP tree.
>>> 4) if found, skip real IO and link to the existing copy
>>>if not, do real IO and insert a DEDUP key to the DEDUP tree.
>>>
>>> For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
>>> going to increase this unit dynamically in the future.
>>>
>>> Signed-off-by: Liu Bo 
>>> ---
>>>  fs/btrfs/ctree.h|   53 
>>>  fs/btrfs/disk-io.c  |   33 +-
>>>  fs/btrfs/extent-tree.c  |   22 +++-
>>>  fs/btrfs/extent_io.c|8 +-
>>>  fs/btrfs/extent_io.h|   11 ++
>>>  fs/btrfs/file-item.c|  186 ++
>>>  fs/btrfs/file.c |6 +-
>>>  fs/btrfs/inode.c|  330 
>>> +++
>>>  fs/btrfs/ioctl.c|   34 +-
>>>  fs/btrfs/ordered-data.c |   25 +++-
>>>  fs/btrfs/ordered-data.h |9 ++
>>>  fs/btrfs/print-tree.c   |6 +-
>>>  fs/btrfs/super.c|7 +-
>>>  13 files changed, 685 insertions(+), 45 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0d82922..59339bc 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -32,6 +32,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include "extent_io.h"
>>>  #include "extent_map.h"
>>>  #include "async-thread.h"
>>> @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
>>>  /* holds quota configuration and tracking */
>>>  #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
>>>
>>> +/* dedup tree(experimental) */
>>> +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
>>> +
>>>  /* orhpan objectid for tracking unlinked/truncated files */
>>>  #define BTRFS_ORPHAN_OBJECTID -5ULL
>>>
>>> @@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
>>>  */
>>> __le64 num_bytes;
>>>
>>> +   /*
>>> +* the stop 64bits of sha256 hash value, this helps us find the
>>> +* corresponding item in dedup tree.
>>> +*/
>>> +   __le64 dedup_hash;
>>> +
>>>  } __attribute__ ((__packed__));
>>>
>>
>> Please don't do this, do like what we do with the

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 04:37:20PM -0400, Josef Bacik wrote:
> On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote:
> > On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
> > > On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
[...]
> > > +   __le64 dedup_hash;
> > > > +
> > > >  } __attribute__ ((__packed__));
> > > > 
> > > 
> > > Please don't do this, do like what we do with the crc tree, just lookup 
> > > based on
> > > the disk bytenr when we free the extent and drop refs that way.  Don't 
> > > further
> > > bloat the file extent item, we want it smaller not larger.  Thanks,
> > > 
> > > Josef
> > 
> > So the real trouble is that I take this hash value as the first field of
> > btrfs_key, and binary searching without the precise first field is not easy.
> > 
> > Otherwise we may have to add another key type which replaces hash value with
> > disk bytenr, ie. (disk bytenr, ANOTHER_KEY_TYPE, hash value), then we'll 
> > need to
> > search the tree twice to free this one or drop refs.
> > 
> > Either case is tradeoff, but as this is an initial version, we can try all 
> > of
> > these knobs and choose the better one :)
> > 
> 
> Why would you need to search twice?  You do something like
> 
> key.objectid = bytenr;
> key.type = whatever your type is
> key.offset = first 64bits of your hash
> 
> and then you search based on bytenr and you get your hash.  All extents that
> share hashes will have the same bytenr so you can just search with
> 
> key.objectid = bytenr
> key.type = whatever
> key.offset = (u64)-1
> 
> and then walk backwards, or do 0 and walk forwards, whatever your preference 
> is.
> Also make it so the item stores the entire sha.  With Sha256 you are wasting
> most of the hash result by just storing the first 64bits.  So just use the 
> first
> 64bits as the index and then store the full hash in the item so you can do a
> proper compare, and then you can have different hash functions later with
> different length hash values.  Thanks,

This is on freeing extent, what about finding dedup when we don't yet have a
disk bytenr but only a hash value from sha256 on file data?

That's why (hash, type, disk bytenr) is necessary.

Searching twice means that if we do what you've suggested, we'd not only update
or free (disk bytenr, type, hash), but also (hash, type, disk bytenr).

Or am I missing something?

thanks,
liubo


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote:
> On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
> > On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
> > > (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
> > > 
> > > This introduce the online data deduplication feature for btrfs.
> > > 
> > > (1) WHY do we need deduplication?
> > > To improve our storage effiency.
> > > 
> > > (2) WHAT is deduplication?
> > > Two key ways for practical deduplication implementations,
> > > *  When the data is deduplicated
> > >(inband vs background)
> > > *  The granularity of the deduplication.
> > >(block level vs file level)
> > > 
> > > For btrfs, we choose
> > > *  inband(synchronous)
> > > *  block level
> > > 
> > > We choose them because of the same reason as how zfs does.
> > > a)  To get an immediate benefit.
> > > b)  To remove redundant parts within a file.
> > > 
> > > So we have an inband, block level data deduplication here.
> > > 
> > > (3) HOW does deduplication works?
> > > This makes full use of file extent back reference, the same way as
> > > IOCTL_CLONE, which lets us easily store multiple copies of a set of
> > > data as a single copy along with an index of references to the copy.
> > > 
> > > Here we have
> > > a)  a new dedicated tree(DEDUP tree) and
> > > b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> > > (stop 64bits of hash, type, disk offset),
> > > *  stop 64bits of hash
> > >It comes from sha256, which is very helpful on avoiding 
> > > collision.
> > >And we take the stop 64bits as the index.
> > > *  disk offset
> > >It helps to find where the data is stored.
> > > 
> > > So the whole deduplication process works as,
> > > 1) write something,
> > > 2) calculate the hash of this "something",
> > > 3) try to find the match of hash value by searching DEDUP keys in
> > >a dedicated tree, DEDUP tree.
> > > 4) if found, skip real IO and link to the existing copy
> > >if not, do real IO and insert a DEDUP key to the DEDUP tree.
> > > 
> > > For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
> > > going to increase this unit dynamically in the future.
> > > 
> > > Signed-off-by: Liu Bo 
> > > ---
> > >  fs/btrfs/ctree.h|   53 
> > >  fs/btrfs/disk-io.c  |   33 +-
> > >  fs/btrfs/extent-tree.c  |   22 +++-
> > >  fs/btrfs/extent_io.c|8 +-
> > >  fs/btrfs/extent_io.h|   11 ++
> > >  fs/btrfs/file-item.c|  186 ++
> > >  fs/btrfs/file.c |6 +-
> > >  fs/btrfs/inode.c|  330 
> > > +++
> > >  fs/btrfs/ioctl.c|   34 +-
> > >  fs/btrfs/ordered-data.c |   25 +++-
> > >  fs/btrfs/ordered-data.h |9 ++
> > >  fs/btrfs/print-tree.c   |6 +-
> > >  fs/btrfs/super.c|7 +-
> > >  13 files changed, 685 insertions(+), 45 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 0d82922..59339bc 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -32,6 +32,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include "extent_io.h"
> > >  #include "extent_map.h"
> > >  #include "async-thread.h"
> > > @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
> > >  /* holds quota configuration and tracking */
> > >  #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
> > > 
> > > +/* dedup tree(experimental) */
> > > +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
> > > +
> > >  /* orhpan objectid for tracking unlinked/truncated files */
> > >  #define BTRFS_ORPHAN_OBJECTID -5ULL
> > > 
> > > @@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
> > >  */
> > > __le64 num_bytes;
> > > 
> > > +   /*
> > > +* the stop 64bits of sha256 hash value, this helps us find the
> > > +* corresponding item in dedup tree.

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
> On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
> > (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
> > 
> > This introduce the online data deduplication feature for btrfs.
> > 
> > (1) WHY do we need deduplication?
> > To improve our storage effiency.
> > 
> > (2) WHAT is deduplication?
> > Two key ways for practical deduplication implementations,
> > *  When the data is deduplicated
> >(inband vs background)
> > *  The granularity of the deduplication.
> >(block level vs file level)
> > 
> > For btrfs, we choose
> > *  inband(synchronous)
> > *  block level
> > 
> > We choose them because of the same reason as how zfs does.
> > a)  To get an immediate benefit.
> > b)  To remove redundant parts within a file.
> > 
> > So we have an inband, block level data deduplication here.
> > 
> > (3) HOW does deduplication works?
> > This makes full use of file extent back reference, the same way as
> > IOCTL_CLONE, which lets us easily store multiple copies of a set of
> > data as a single copy along with an index of references to the copy.
> > 
> > Here we have
> > a)  a new dedicated tree(DEDUP tree) and
> > b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> > (stop 64bits of hash, type, disk offset),
> > *  stop 64bits of hash
> >It comes from sha256, which is very helpful on avoiding 
> > collision.
> >And we take the stop 64bits as the index.
> > *  disk offset
> >It helps to find where the data is stored.
> > 
> > So the whole deduplication process works as,
> > 1) write something,
> > 2) calculate the hash of this "something",
> > 3) try to find the match of hash value by searching DEDUP keys in
> >a dedicated tree, DEDUP tree.
> > 4) if found, skip real IO and link to the existing copy
> >if not, do real IO and insert a DEDUP key to the DEDUP tree.
> > 
> > For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
> > going to increase this unit dynamically in the future.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/ctree.h|   53 
> >  fs/btrfs/disk-io.c  |   33 +-
> >  fs/btrfs/extent-tree.c  |   22 +++-
> >  fs/btrfs/extent_io.c|8 +-
> >  fs/btrfs/extent_io.h|   11 ++
> >  fs/btrfs/file-item.c|  186 ++
> >  fs/btrfs/file.c |6 +-
> >  fs/btrfs/inode.c|  330 
> > +++
> >  fs/btrfs/ioctl.c|   34 +-
> >  fs/btrfs/ordered-data.c |   25 +++-
> >  fs/btrfs/ordered-data.h |9 ++
> >  fs/btrfs/print-tree.c   |6 +-
> >  fs/btrfs/super.c|7 +-
> >  13 files changed, 685 insertions(+), 45 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 0d82922..59339bc 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "extent_io.h"
> >  #include "extent_map.h"
> >  #include "async-thread.h"
> > @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
> >  /* holds quota configuration and tracking */
> >  #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
> > 
> > +/* dedup tree(experimental) */
> > +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
> > +
> >  /* orhpan objectid for tracking unlinked/truncated files */
> >  #define BTRFS_ORPHAN_OBJECTID -5ULL
> > 
> > @@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
> >  */
> > __le64 num_bytes;
> > 
> > +   /*
> > +* the stop 64bits of sha256 hash value, this helps us find the
> > +* corresponding item in dedup tree.
> > +*/
> > +   __le64 dedup_hash;
> > +
> >  } __attribute__ ((__packed__));
> > 
> 
> Please don't do this, do like what we do with the crc tree, just lookup based 
> on
> the disk bytenr when we free the extent and drop refs that way.  Don't further
> bloat the file extent item, we want it smaller not larger.  Thanks,
> 
> Josef

So the real trouble is that I take this hash value as the first field of
btrfs_key, and binary searching without the precise first field is not easy.

Otherwise we may have to add another key type which replaces hash value with
disk bytenr, ie. (disk bytenr, ANOTHER_KEY_TYPE, hash value), then we'll need to
search the tree twice to free this one or drop refs.

Either case is tradeoff, but as this is an initial version, we can try all of
these knobs and choose the better one :)

thanks,
liubo


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 03:47:27PM +0200, David Sterba wrote:
> On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote:
> > (2) WHAT is deduplication?
> > Two key ways for practical deduplication implementations,
> > *  When the data is deduplicated
> >(inband vs background)
> > *  The granularity of the deduplication.
> >(block level vs file level)
> > 
> > For btrfs, we choose
> > *  inband(synchronous)
> > *  block level
> 
> Block level may be too fine grained leading to excessive fragmentation
> and increased metadata usage given that there's a much higher chance to
> find duplicate (4k) blocks here and there.
> 
> There's always a tradeoff, the practical values that are considered for
> granularity range from 8k to 64, see eg. this paper for graphs and analyses
> 
> http://static.usenix.org/event/fast11/tech/full_papers/Meyer.pdf .
> 
> This also depends on file data type and access patterns, fixing the dedup
> basic chunk size to one block does not IMHO fit most usecases.

Agreed, that'd definitely be a TODO.

I'd like to make this easier to control so that everyone can do their own
tradeoff depending on local cases.

> 
> > (3) HOW does deduplication works?
> ...
> > Here we have
> > a)  a new dedicated tree(DEDUP tree) and
> > b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> > (stop 64bits of hash, type, disk offset),
> > *  stop 64bits of hash
> >It comes from sha256, which is very helpful on avoiding 
> > collision.
> >And we take the stop 64bits as the index.
> 
> Is it safe to use just 64 bits? I'd like to see better reasoning why
> this is ok. The limitation of btrfs_key to store only 1-2 64bit items is
> clear and must be handled, but it's IMO a critical design point.

Actually I use the whole 256-bit hash value (stored in its item) to avoid hash
collisions, not just the stop 64 bits.

Sorry for the bad commit log, seems I lost my mind somewhere at the time...
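
For illustration, a small sketch of that check (not the patch's code; the
struct and dedup_match are invented for this example): the last 64 bits only
narrow the search, and a candidate is accepted only when the full 32-byte
digest stored in its item matches the newly computed one.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_HASH_SIZE 32	/* SHA-256 digest length in bytes */

struct candidate {
	uint64_t hash_prefix;			/* index in the tree */
	uint8_t  full_hash[DEDUP_HASH_SIZE];	/* stored in the item */
	uint64_t disk_bytenr;
};

/* Returns 1 if the candidate really holds the same data hash, 0 otherwise. */
static int dedup_match(const struct candidate *c,
		       const uint8_t new_hash[DEDUP_HASH_SIZE])
{
	uint64_t prefix;

	/* the last 64 bits of the digest, matching the key's index field */
	memcpy(&prefix, new_hash + DEDUP_HASH_SIZE - sizeof(prefix),
	       sizeof(prefix));
	if (c->hash_prefix != prefix)
		return 0;			/* not even an index hit */
	return memcmp(c->full_hash, new_hash, DEDUP_HASH_SIZE) == 0;
}

int main(void)
{
	struct candidate c = { .hash_prefix = 0, .disk_bytenr = 12345 };
	uint8_t h[DEDUP_HASH_SIZE] = { 0 };

	/* both digests are all zeroes here, so prefix and full hash agree */
	printf("match: %d\n", dedup_match(&c, h));
	return 0;
}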


> 
> > *  disk offset
> >It helps to find where the data is stored.
> 
> Does the disk offset also help to resolving block hash collisions?

Ditto.

> 
> > So the whole deduplication process works as,
> > 1) write something,
> > 2) calculate the hash of this "something",
> > 3) try to find the match of hash value by searching DEDUP keys in
> >a dedicated tree, DEDUP tree.
> > 4) if found, skip real IO and link to the existing copy
> >if not, do real IO and insert a DEDUP key to the DEDUP tree.
> 
> ... how are the hash collisions handled? Using part of a secure has
> cannot be considered equally strong (given that there is not other
> safety checks like comparing the whole blocks).
> 
> Last but not least, there was another dedup proposal (author CCed)
> 
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/21722
> 
> 
> david

I waited for 2 months and wanted to see the actual code from the proposal, but I
failed, so I decided to do it myself.

Anyway I'd like to see this feature in btrfs no matter who writes it.

thanks,
liubo


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread David Sterba
On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote:
> (2) WHAT is deduplication?
> Two key ways for practical deduplication implementations,
> *  When the data is deduplicated
>(inband vs background)
> *  The granularity of the deduplication.
>(block level vs file level)
> 
> For btrfs, we choose
> *  inband(synchronous)
> *  block level

Block level may be too fine grained leading to excessive fragmentation
and increased metadata usage given that there's a much higher chance to
find duplicate (4k) blocks here and there.

There's always a tradeoff, the practical values that are considered for
granularity range from 8k to 64k; see e.g. this paper for graphs and analyses

http://static.usenix.org/event/fast11/tech/full_papers/Meyer.pdf .

This also depends on file data type and access patterns, fixing the dedup
basic chunk size to one block does not IMHO fit most usecases.

> (3) HOW does deduplication works?
...
> Here we have
> a)  a new dedicated tree(DEDUP tree) and
> b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> (stop 64bits of hash, type, disk offset),
> *  stop 64bits of hash
>It comes from sha256, which is very helpful on avoiding collision.
>And we take the stop 64bits as the index.

Is it safe to use just 64 bits? I'd like to see better reasoning why
this is ok. The limitation of btrfs_key to store only 1-2 64bit items is
clear and must be handled, but it's IMO a critical design point.

> *  disk offset
>It helps to find where the data is stored.

Does the disk offset also help to resolving block hash collisions?

> So the whole deduplication process works as,
> 1) write something,
> 2) calculate the hash of this "something",
> 3) try to find the match of hash value by searching DEDUP keys in
>a dedicated tree, DEDUP tree.
> 4) if found, skip real IO and link to the existing copy
>if not, do real IO and insert a DEDUP key to the DEDUP tree.

... how are the hash collisions handled? Using part of a secure hash
cannot be considered equally strong (given that there are no other
safety checks like comparing the whole blocks).

Last but not least, there was another dedup proposal (author CCed)

http://thread.gmane.org/gmane.comp.file-systems.btrfs/21722


david


Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
> (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
> 
> This introduce the online data deduplication feature for btrfs.
> 
> (1) WHY do we need deduplication?
> To improve our storage effiency.
> 
> (2) WHAT is deduplication?
> Two key ways for practical deduplication implementations,
> *  When the data is deduplicated
>(inband vs background)
> *  The granularity of the deduplication.
>(block level vs file level)
> 
> For btrfs, we choose
> *  inband(synchronous)
> *  block level
> 
> We choose them because of the same reason as how zfs does.
> a)  To get an immediate benefit.
> b)  To remove redundant parts within a file.
> 
> So we have an inband, block level data deduplication here.
> 
> (3) HOW does deduplication works?
> This makes full use of file extent back reference, the same way as
> IOCTL_CLONE, which lets us easily store multiple copies of a set of
> data as a single copy along with an index of references to the copy.
> 
> Here we have
> a)  a new dedicated tree(DEDUP tree) and
> b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
> (stop 64bits of hash, type, disk offset),
> *  stop 64bits of hash
>It comes from sha256, which is very helpful on avoiding collision.
>And we take the stop 64bits as the index.
> *  disk offset
>It helps to find where the data is stored.
> 
> So the whole deduplication process works as,
> 1) write something,
> 2) calculate the hash of this "something",
> 3) try to find the match of hash value by searching DEDUP keys in
>a dedicated tree, DEDUP tree.
> 4) if found, skip real IO and link to the existing copy
>if not, do real IO and insert a DEDUP key to the DEDUP tree.
> 
> For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
> going to increase this unit dynamically in the future.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ctree.h|   53 
>  fs/btrfs/disk-io.c  |   33 +-
>  fs/btrfs/extent-tree.c  |   22 +++-
>  fs/btrfs/extent_io.c|8 +-
>  fs/btrfs/extent_io.h|   11 ++
>  fs/btrfs/file-item.c|  186 ++
>  fs/btrfs/file.c |6 +-
>  fs/btrfs/inode.c|  330 
> +++
>  fs/btrfs/ioctl.c|   34 +-
>  fs/btrfs/ordered-data.c |   25 +++-
>  fs/btrfs/ordered-data.h |9 ++
>  fs/btrfs/print-tree.c   |6 +-
>  fs/btrfs/super.c|7 +-
>  13 files changed, 685 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0d82922..59339bc 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "extent_io.h"
>  #include "extent_map.h"
>  #include "async-thread.h"
> @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
>  /* holds quota configuration and tracking */
>  #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
> 
> +/* dedup tree(experimental) */
> +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
> +
>  /* orhpan objectid for tracking unlinked/truncated files */
>  #define BTRFS_ORPHAN_OBJECTID -5ULL
> 
> @@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
>  */
> __le64 num_bytes;
> 
> +   /*
> +* the stop 64bits of sha256 hash value, this helps us find the
> +* corresponding item in dedup tree.
> +*/
> +   __le64 dedup_hash;
> +
>  } __attribute__ ((__packed__));
> 

Please don't do this, do like what we do with the crc tree, just lookup based on
the disk bytenr when we free the extent and drop refs that way.  Don't further
bloat the file extent item, we want it smaller not larger.  Thanks,

Josef


[PATCH 1/2] Btrfs: online data deduplication

2013-04-07 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)

This introduces the online data deduplication feature for btrfs.

(1) WHY do we need deduplication?
To improve our storage efficiency.

(2) WHAT is deduplication?
There are two key choices in practical deduplication implementations,
*  When the data is deduplicated
   (inband vs background)
*  The granularity of the deduplication.
   (block level vs file level)

For btrfs, we choose
*  inband(synchronous)
*  block level

We choose them for the same reasons as ZFS does:
a)  To get an immediate benefit.
b)  To remove redundant parts within a file.

So we have an inband, block level data deduplication here.

(3) HOW does deduplication work?
This makes full use of file extent back reference, the same way as
IOCTL_CLONE, which lets us easily store multiple copies of a set of
data as a single copy along with an index of references to the copy.

Here we have
a)  a new dedicated tree(DEDUP tree) and
b)  a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
(stop 64bits of hash, type, disk offset),
*  stop 64bits of hash
   It comes from sha256, which is very helpful in avoiding collisions.
   And we take the stop 64bits as the index.
*  disk offset
   It helps to find where the data is stored.

So the whole deduplication process works as,
1) write something,
2) calculate the hash of this "something",
3) try to find the match of hash value by searching DEDUP keys in
   a dedicated tree, DEDUP tree.
4) if found, skip real IO and link to the existing copy
   if not, do real IO and insert a DEDUP key to the DEDUP tree.

For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
going to increase this unit dynamically in the future.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h|   53 
 fs/btrfs/disk-io.c  |   33 +-
 fs/btrfs/extent-tree.c  |   22 +++-
 fs/btrfs/extent_io.c|8 +-
 fs/btrfs/extent_io.h|   11 ++
 fs/btrfs/file-item.c|  186 ++
 fs/btrfs/file.c |6 +-
 fs/btrfs/inode.c|  330 +++
 fs/btrfs/ioctl.c|   34 +-
 fs/btrfs/ordered-data.c |   25 +++-
 fs/btrfs/ordered-data.h |9 ++
 fs/btrfs/print-tree.c   |6 +-
 fs/btrfs/super.c|7 +-
 13 files changed, 685 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0d82922..59339bc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
 /* holds quota configuration and tracking */
 #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
 
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
+
 /* orhpan objectid for tracking unlinked/truncated files */
 #define BTRFS_ORPHAN_OBJECTID -5ULL
 
@@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
 */
__le64 num_bytes;
 
+   /*
+* the stop 64bits of sha256 hash value, this helps us find the
+* corresponding item in dedup tree.
+*/
+   __le64 dedup_hash;
+
 } __attribute__ ((__packed__));
 
 struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/* dedup */
+struct btrfs_dedup_item {
+   __le32 refs;
+   __le64 len;
+} __attribute__ ((__packed__));
+
+#define BTRFS_DEDUP_HASH_SIZE 32
+#define BTRFS_DEDUP_HASH_LEN 4
+
+struct btrfs_dedup_hash_item {
+   __le64 hash[BTRFS_DEDUP_HASH_LEN];
+} __attribute__ ((__packed__));
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -1287,6 +1310,7 @@ struct btrfs_fs_info {
struct btrfs_root *dev_root;
struct btrfs_root *fs_root;
struct btrfs_root *csum_root;
+   struct btrfs_root *dedup_root;
struct btrfs_root *quota_root;
 
/* the log root tree is a directory of all the other log roots */
@@ -1605,6 +1629,9 @@ struct btrfs_fs_info {
struct btrfs_dev_replace dev_replace;
 
atomic_t mutually_exclusive_operation_running;
+
+   /* reference to deduplication algorithm drvier via cryptoapi */
+   struct crypto_shash *dedup_driver;
 };
 
 /*
@@ -1866,6 +1893,8 @@ struct btrfs_ioctl_defrag_range_args {
  */
 #define BTRFS_DEV_REPLACE_KEY  250
 
+#define BTRFS_DEDUP_ITEM_KEY   252
+
 /*
  * string items are for debugging.  They just store a short string of
  * data in the FS
@@ -1900,6 +1929,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
+#define BTRF

[PATCH 0/2 RFC] Online data deduplication

2013-04-07 Thread Liu Bo
This is the first attempt at online data deduplication.

NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data!

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in project ideas[2].

For more implementation details, please refer to PATCH 1.

PATCH 2 is a hang fix when deduplication is on.

==
HOW To turn deduplication on:

There are 2 steps you need to do before using it,
1) mount with option "-o dedup"
2) then run 'btrfs filesystem sync /mnt_of_your_btrfs'
(Because I hack 'btrfs fi sync' to enable deduplication...)

Here is an example:
1) mkfs.btrfs /dev/sdb1
2) mount /dev/sdb1 /mnt/btrfs -o dedup
3) btrfs filesystem sync /mnt/btrfs
4) btrfs fi df /mnt/btrfs
   Data: total=8.00MB, used=256.00KB
   System, DUP: total=8.00MB, used=4.00KB
   System: total=4.00MB, used=0.00
   Metadata, DUP: total=1.00GB, used=28.00KB
   Metadata: total=8.00MB, used=0.00

5) dd if=/dev/zero of=/mnt/btrfs/foo bs=4K count=1; sync
6) dd if=/dev/zero of=/mnt/btrfs/foo bs=1M count=10; sync
   Data: total=1.01GB, used=260.00KB
   System, DUP: total=8.00MB, used=4.00KB
   System: total=4.00MB, used=0.00
   Metadata, DUP: total=1.00GB, used=432.00KB
   Metadata: total=8.00MB, used=0.00

So 4K+10M has been written, but used=256.00KB -> used=260.00KB,
only 4KB is used!

=
TODO:
1) a bit-to-bit comparison callback.
2) support for alternative blocksize larger than PAGESIZE

I have only tested it with simple cases like the above, not yet with xfstests,
which is what I'm going to do next.

Any comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

Liu Bo (2):
  Btrfs: online data deduplication
  Btrfs: skip merge part for delayed data refs

 fs/btrfs/ctree.h|   53 
 fs/btrfs/delayed-ref.c  |7 +
 fs/btrfs/disk-io.c  |   33 +-
 fs/btrfs/extent-tree.c  |   22 +++-
 fs/btrfs/extent_io.c|8 +-
 fs/btrfs/extent_io.h|   11 ++
 fs/btrfs/file-item.c|  184 ++
 fs/btrfs/file.c |6 +-
 fs/btrfs/inode.c|  327 +++
 fs/btrfs/ioctl.c|   34 +-
 fs/btrfs/ordered-data.c |   25 +++-
 fs/btrfs/ordered-data.h |9 ++
 fs/btrfs/print-tree.c   |6 +-
 fs/btrfs/super.c|7 +-
 14 files changed, 687 insertions(+), 45 deletions(-)

-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs vs data deduplication

2011-09-18 Thread Chris Samuel
On Mon, 19 Sep 2011, 06:15:51 EST, Hubert Kario  wrote:

> You shouldn't depend on single drive, metadata
> raid is there to protect against single bad
> blocks, not disk crash.

I guess the issue here is you no longer even
have that protection with this sort of dedup.

cheers,
Chris
-- 
Chris Samuel - http://www.csamuel.org/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs vs data deduplication

2011-09-18 Thread Maciej Marcin Piechotka
On Sat, 2011-07-09 at 08:19 +0200, Paweł Brodacki wrote:
> Hello,
> 
> I've stumbled upon this article:
> http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/
> 
> Reportedly Sandforce SF1200 SSD controller does internally block-level
> data de-duplication. This effectively removes the additional
> protection given by writing multiple metadata copies. This technique
> may be used, or can be used in the future by manufacturers of other
> drives too.
> 
> I would like to ask, if the metadata copies written to a btrfs system
> with enabled metadata mirroring are identical, or is there something
> that makes them unique on-disk, therefore preventing their
> de-duplication. I tried googling for the answer, but didn't net
> anything that would answer my question.
> 
> If the metadata copies are identical, I'd like to ask if it would be
> possible to change this without major disruption? I know that changes
> to on-disk format aren't a thing made lightly, but I'd be grateful for
> any comments.
> 
> The increase of the risk of file system corruption introduced by data
> de-duplication on Sandforce controllers was down-played in the
> vendor's reply included in the article, but still, what's the point of
> duplicating metadata on file system level, if storage below can remove
> that redundancy?
> 
> Regards,
> Paweł

Hello,

Sorry to add my $0.03. It is possible to work around it by using
encryption. If a mode other than ECB is used, elements that are identical
in unencrypted form are stored as different data on the HDD.

The drawbacks:

 - Encryption overhead (you may want to use a non-secure mode as you're
not interested in security)
 - There is an avalanche effect (the whole [encryption] block gets corrupted
even if one bit of the block is corrupted).

Regards




Re: btrfs vs data deduplication

2011-09-18 Thread Hubert Kario
On Saturday 09 of July 2011 08:19:30 Paweł Brodacki wrote:
> Hello,
>
> I've stumbled upon this article:
> http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/
>
> Reportedly Sandforce SF1200 SSD controller does internally block-level
> data de-duplication. This effectively removes the additional
> protection given by writing multiple metadata copies. This technique
> may be used, or can be used in the future by manufacturers of other
> drives too.

Only a problem in a single-disk installation.

> I would like to ask, if the metadata copies written to a btrfs system
> with enabled metadata mirroring are identical, or is there something
> that makes them unique on-disk, therefore preventing their
> de-duplication. I tried googling for the answer, but didn't net
> anything that would answer my question.

There is a difference between root inode copies, but I don't think there's any
difference between metadata tree copies. I'm quite certain they are bit-for-bit
identical.

> If the metadata copies are identical, I'd like to ask if it would be
> possible to change this without major disruption? I know that changes
> to on-disk format aren't a thing made lightly, but I'd be grateful for
> any comments.

That would be a big change for little to no benefit.

> The increase of the risk of file system corruption introduced by data
> de-duplication on Sandforce controllers was down-played in the
> vendor's reply included in the article, but still, what's the point of
> duplicating metadata on file system level, if storage below can remove
> that redundancy?

You shouldn't depend on a single drive; metadata RAID is there to protect
against single bad blocks, not a disk crash.

If you want redundancy, use multiple disks, either HDD or SSD. And have
readable backups.

Regards,
Hubert





btrfs vs data deduplication

2011-07-08 Thread Paweł Brodacki
Hello,

I've stumbled upon this article:
http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/

Reportedly Sandforce SF1200 SSD controller does internally block-level
data de-duplication. This effectively removes the additional
protection given by writing multiple metadata copies. This technique
may be used, or can be used in the future by manufacturers of other
drives too.

I would like to ask, if the metadata copies written to a btrfs system
with enabled metadata mirroring are identical, or is there something
that makes them unique on-disk, therefore preventing their
de-duplication. I tried googling for the answer, but didn't net
anything that would answer my question.

If the metadata copies are identical, I'd like to ask if it would be
possible to change this without major disruption? I know that changes
to on-disk format aren't a thing made lightly, but I'd be grateful for
any comments.

The increase of the risk of file system corruption introduced by data
de-duplication on Sandforce controllers was down-played in the
vendor's reply included in the article, but still, what's the point of
duplicating metadata on file system level, if storage below can remove
that redundancy?

Regards,
Paweł
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski

Chris Mason wrote:

>> I wonder how well would deduplication work with defragmentation? One
>> excludes the other to some extent.
>
> Very much so ;)  Ideally we end up doing dedup in large extents, but it
> will definitely increase the overall fragmentation of the FS.


Defragmentation could lead to interesting problems if it's not aware of
deduplication.



I can imagine "freeing up space" (i.e., as seen by userspace) where 
duplicated blocks are found, but keeping track of duplicated blocks 
internally (and not allowing to overwrite any block which is duplicated).


Only when free space is really needed, "deduplicate" blocks which have
more copies, and allow them to be overwritten with data.


Perhaps complicated though.


--
Tomasz Chmielewski
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Chris Mason
On Fri, Jun 05, 2009 at 02:20:48PM +0200, Tomasz Chmielewski wrote:
> Chris Mason wrote:
>> On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
>>> Hello Chris,
>>>
> >  My question is now, how often can a block in btrfs be referenced?
 The exact answer depends on if we are referencing it from a single
 file or from multiple files.  But either way it is roughly 2^32.
>>> could you please explain to me what underlying datastructure is used to
>>> monitor if the block is still referenced or already free? Is a counter
>>> used, bitmap (but that can't be if is 2^32) or some sort of list? I
>>> assume that a counter is used. If this is the case, I assume when a
>>> snapshot for example is deleted the reference counter of every block
>>> that was referenced in the snapshot will be decremented by one. Is this
>>> correct or am I missing something here?
>>
>> It is a counter and a back reference.  With Yan Zheng's new format work,
>> the limit is now 2^64.
>>
>> When a snapshot is deleted, the btree is walked to efficiently drop the
>> references on the blocks it referenced.
>>
>> From a dedup point of view, we'll want the dedup file to hold a
>> reference on the file extents.  The kernel ioctl side of things will
>> take care of that part.
>
> I wonder how well would deduplication work with defragmentation? One  
> excludes the other to some extent.

Very much so ;)  Ideally we end up doing dedup in large extents, but it
will definitely increase the overall fragmentation of the FS.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski

Chris Mason wrote:

> On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
>> Hello Chris,
>>
>>>> My question is now, how often can a block in btrfs be referenced?
>>>
>>> The exact answer depends on if we are referencing it from a single
>>> file or from multiple files.  But either way it is roughly 2^32.
>>
>> could you please explain to me what underlying datastructure is used to
>> monitor if the block is still referenced or already free? Is a counter
>> used, bitmap (but that can't be if is 2^32) or some sort of list? I
>> assume that a counter is used. If this is the case, I assume when a
>> snapshot for example is deleted the reference counter of every block
>> that was referenced in the snapshot will be decremented by one. Is this
>> correct or am I missing something here?
>
> It is a counter and a back reference.  With Yan Zheng's new format work,
> the limit is now 2^64.
>
> When a snapshot is deleted, the btree is walked to efficiently drop the
> references on the blocks it referenced.
>
> From a dedup point of view, we'll want the dedup file to hold a
> reference on the file extents.  The kernel ioctl side of things will
> take care of that part.

I wonder how well would deduplication work with defragmentation? One
excludes the other to some extent.



--
Tomasz Chmielewski
http://wpkg.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Chris Mason
On Thu, Jun 04, 2009 at 02:03:50PM +0200, Thomas Glanzmann wrote:
> Chris,
> 
> > It is a counter and a back reference.  With Yan Zheng's new format
> > work, the limit is not 2^64.
> 
> That means that there is one back reference for every use of the block?
> Where is this back reference stored? (I'm asking because if one back
> reference for every copy is stored, it can obviously not be allocated
> statically).

These are all stored in the extent allocation tree.  There isn't exactly
a 1:1 mapping but it is effectively that.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Chris,

> It is a counter and a back reference.  With Yan Zheng's new format
> work, the limit is now 2^64.

That means that there is one back reference for every use of the block?
Where is this back reference stored? (I'm asking because if one back
reference for every copy is stored, it can obviously not be allocated
statically).

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Chris Mason
On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > >  My question is now, how often can a block in btrfs be referenced?
> 
> > The exact answer depends on if we are referencing it from a single
> > file or from multiple files.  But either way it is roughly 2^32.
> 
> could you please explain to me what underlying datastructure is used to
> monitor if the block is still referenced or already free? Is a counter
> used, bitmap (but that can't be if is 2^32) or some sort of list? I
> assume that a counter is used. If this is the case, I assume when a
> snapshot for example is deleted the reference counter of every block
> that was referenced in the snapshot will be decremented by one. Is this
> correct or am I missing something here?

It is a counter and a back reference.  With Yan Zheng's new format work,
the limit is now 2^64.

When a snapshot is deleted, the btree is walked to efficiently drop the
references on the blocks it referenced.

>From a dedup point of view, we'll want the dedup file to hold a
reference on the file extents.  The kernel ioctl side of things will
take care of that part.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Hello Chris,

> >  My question is now, how often can a block in btrfs be referenced?

> The exact answer depends on if we are referencing it from a single
> file or from multiple files.  But either way it is roughly 2^32.

could you please explain to me what underlying datastructure is used to
monitor if the block is still referenced or already free? Is a counter
used, bitmap (but that can't be if is 2^32) or some sort of list? I
assume that a counter is used. If this is the case, I assume when a
snapshot for example is deleted the reference counter of every block
that was referenced in the snapshot will be decremented by one. Is this
correct or am I missing something here?

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-24 Thread Thomas Glanzmann
Hello Heinz,

> Hi, during the last half year I thought a little bit about doing dedup
> for my backup program: not only with fixed blocks (which is
> implemented), but with moving blocks (with all offsets in a file: 1
> byte, 2 byte, ...). That means, I have to have *lots* of comparisons
> (size of file - blocksize).  Even if it's not the same, it must be very
> fast and that's the same problem like the one discussed here.

because I just stumbled across that, I wanted to let you know about an
interesting approach that NetAPP is using for its Virtual Tape Library:

http://www.netapp.com/us/communities/tech-ontap/vtl-dedupe.html

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-06 Thread Sander
Heinz-Josef Claes wrote (ao):
> Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason:
> > On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
> > > Hello,
> > >
> > > > I wouldn't rely on crc32: it is not a strong hash,
> > > > Such deduplication can lead to various problems,
> > > > including security ones.
> > >
> > > sure thing, did you think of replacing crc32 with sha1 or md5, is this
> > > even possible (is there enough space reserved so that the change can be
> > > done without changing the filesystem layout) at the moment with btrfs?
> >
> > It is possible, there's room in the metadata for about about 4k of
> > checksum for each 4k of data.  The initial btrfs code used sha256, but
> > the real limiting factor is the CPU time used.
> >
> > -chris
> >
> It's not only cpu time, it's also memory. You need 32 byte for each 4k block. 
> It needs to be in RAM for performance reason.

Less so with SSD I would assume.

-- 
Humilis IT Services and Solutions
http://www.humilis.net
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Thomas Glanzmann
Hello Jan,

* Jan-Frode Myklebust  [090504 20:20]:
> "thin or shallow clones" sounds more like sparse images. I believe
> "linked clones" is the word for running multiple virtual machines off
> a single gold image. Ref, the "VMware View Composer" section of:

not exactly. VMware has one golden image and then accounts for the blocks
that differ from the golden image by tracking the differences in a
snapshot file on a block-level basis. So yes, it is a ,,sparse file''
but implemented in userland.

> "All desktops that are linked to a master image can be patched or updated
>  simply by updating the master image, without affecting users’ settings,
>  data or applications."

True for their desktops, because each desktop consists of a virtual
machine which is created when a user connects and destroyed when he
disconnects. Bottom line: the VM for the thin client only exists while
the user is connected. That of course makes it easy to update the master
image. But for VMs that are not desktops that get created/destroyed
at least once a day, this doesn't scale.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Heinz-Josef Claes
On Tue, 5 May 2009 07:29:45 +1000
Dmitri Nikulin  wrote:

> On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes  wrote:
> > Hi, during the last half year I thought a little bit about doing dedup for
> > my backup program: not only with fixed blocks (which is implemented), but
> > with moving blocks (with all offsets in a file: 1 byte, 2 byte, ...). That
> > means, I have to have *lots* of comparisons (size of file - blocksize).
> > Even if it's not the same, it must be very fast and that's the same problem
> > like the one discussed here.
> >
> > My solution (not yet implemented) is as follows (hopefully I remember well):
> >
> > I calculate a checksum of 24 bit. (there can be another size)
> >
> > This means, I can have 2^24 different checksums.
> >
> > Therefore, I hold a bit vector of 0,5 GB in memory (I hope I remember well,
> > I'm just in a hotel and have no calculator): one bit for each possibility.
> > This vector is initialized with zeros.
> >
> > For each calculated checksum of a block, I set the according bit in the bit
> > vector.
> >
> > It's very fast, to check if a block with a special checksum exists in the
> > filesystem (backup for me) by checking the appropriate bit in the bit
> > vector.
> >
> > If it doesn't exist, it's a new block
> >
> > If it exists, there needs to be a separate 'real' check if it's really the
> > same block (which is slow, but that's happening <<1% of the time).
> 
> Which means you have to refer to each block in some unique way from
> the bit vector, making it a block pointer vector instead. That's only
> 64 times more expensive for a 64 bit offset...
> 
It was not the idea to have a pointer vector, only a bit vector. A pointer
vector would be too big to hold in RAM. Therefore, I need access to the disk
after using the more exact md5sum (which I wanted to use). The bit vector is only
needed to give a very quick decision for most of the cases (speedup).
But I have no idea if it fits to this use case. I'm not a filesystem developer 
;-)

> Since the overwhelming majority of combinations will never appear in
> practice, you are much better served with a self-sizing data structure
> like a hash map or even a binary tree, or a hash map with each bucket
> being a binary tree, etc... You can use any sized hash and it won't
> affect the number of nodes you have to store. You can trade off to CPU
> or RAM easily as required, just by selecting an appropriate data
> structure. A bit vector and especially a pointer vector have extremely
> bad "any" case RAM requirements because even if you're deduping a mere
> 10 blocks you're still allocating and initialising 2^24 offsets. The
> least you could do is adaptively switch to a more efficient data
> structure if you see the number of blocks is low enough.
> 
> -- 
> Dmitri Nikulin
> 
> Centre for Synchrotron Science
> Monash University
> Victoria 3800, Australia
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Heinz-Josef Claes 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Dmitri Nikulin
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes  wrote:
> Hi, during the last half year I thought a little bit about doing dedup for
> my backup program: not only with fixed blocks (which is implemented), but
> with moving blocks (with all offsets in a file: 1 byte, 2 byte, ...). That
> means, I have to have *lots* of comparisons (size of file - blocksize).
> Even if it's not the same, it must be very fast and that's the same problem
> like the one discussed here.
>
> My solution (not yet implemented) is as follows (hopefully I remember well):
>
> I calculate a checksum of 24 bit. (there can be another size)
>
> This means, I can have 2^24 different checksums.
>
> Therefore, I hold a bit vector of 0,5 GB in memory (I hope I remember well,
> I'm just in a hotel and have no calculator): one bit for each possibility.
> This vector is initialized with zeros.
>
> For each calculated checksum of a block, I set the according bit in the bit
> vector.
>
> It's very fast, to check if a block with a special checksum exists in the
> filesystem (backup for me) by checking the appropriate bit in the bit
> vector.
>
> If it doesn't exist, it's a new block
>
> If it exists, there needs to be a separate 'real' check if it's really the
> same block (which is slow, but that's happening <<1% of the time).

Which means you have to refer to each block in some unique way from
the bit vector, making it a block pointer vector instead. That's only
64 times more expensive for a 64 bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served with a self-sizing data structure
like a hash map or even a binary tree, or a hash map with each bucket
being a binary tree, etc... You can use any sized hash and it won't
affect the number of nodes you have to store. You can trade off to CPU
or RAM easily as required, just by selecting an appropriate data
structure. A bit vector and especially a pointer vector have extremely
bad "any" case RAM requirements because even if you're deduping a mere
10 blocks you're still allocating and initialising 2^24 offsets. The
least you could do is adaptively switch to a more efficient data
structure if you see the number of blocks is low enough.
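
[Editor's sketch, not part of the thread: a minimal C illustration of the
24-bit bit-vector screening idea being discussed. All names are invented for
illustration; note that 2^24 bits is 2 MiB of RAM, and a set bit is only a
hint that still has to be confirmed by a stronger digest or a byte-wise
compare, exactly as both posters point out.]

#include <stdint.h>
#include <stddef.h>

#define SCREEN_BITS   24
#define SCREEN_SLOTS  (1u << SCREEN_BITS)      /* 16,777,216 possible checksums */

static uint8_t screen[SCREEN_SLOTS / 8];       /* 2 MiB bit vector, zero-initialized */

/* any cheap 24-bit checksum will do; this one is only for illustration */
static uint32_t checksum24(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t h = 0;

    while (len--)
        h = h * 31 + *p++;
    return h & (SCREEN_SLOTS - 1);
}

/* returns 1 if a block with this checksum may already exist (a full compare
 * is still required), 0 if it is definitely new; marks it as seen either way */
static int screen_test_and_set(uint32_t csum)
{
    uint32_t byte = csum >> 3;
    uint8_t  mask = (uint8_t)(1u << (csum & 7));
    int      seen = (screen[byte] & mask) != 0;

    screen[byte] |= mask;
    return seen;
}

A hash map keyed by the checksum, as suggested in the reply above, would
instead only grow with the number of blocks actually seen, at the cost of
storing the keys themselves.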

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Andrey Kuzmin
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust  wrote:
>> Looking at the website content, it also revealed that VMware will have a
>> similar feature for their workhorse ,,esx server'' in the upcoming
>> release, however my point still stands. Ship out a service pack for
>> Windows and you get 1.5 GByte of modified data that is not deduped.
>
> "All desktops that are linked to a master image can be patched or updated
>  simply by updating the master image, without affecting users’ settings,
>  data or applications."

Writable clone support in the file-system coupled with hierarchical
settings/data/apps  layout, and you have what's described above.

Regards,
Andrey

>
>
>  -jf
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Heinz-Josef Claes

Thomas Glanzmann schrieb:

> Ric,
>
>> I would not categorize it as offline, but just not as inband (i.e., you can
>> run a low priority background process to handle dedup).
>>
>> Offline windows are extremely rare in production sites these days and
>> it could take a very long time to do dedup at the block level over a
>> large file system :-)
>
> let me rephrase, by offline I meant asynchronous during off hours.

Hi, during the last half year I thought a little bit about doing dedup
for my backup program: not only with fixed blocks (which is
implemented), but with moving blocks (with all offsets in a file: 1
byte, 2 byte, ...). That means, I have to have *lots* of comparisons
(size of file - blocksize). Even if it's not the same, it must be very fast
and that's the same problem like the one discussed here.

My solution (not yet implemented) is as follows (hopefully I remember well):

I calculate a checksum of 24 bit. (there can be another size)

This means, I can have 2^24 different checksums.

Therefore, I hold a bit vector of 0,5 GB in memory (I hope I remember
well, I'm just in a hotel and have no calculator): one bit for each
possibility. This vector is initialized with zeros.

For each calculated checksum of a block, I set the according bit in the
bit vector.

It's very fast to check if a block with a special checksum exists in
the filesystem (backup for me) by checking the appropriate bit in the
bit vector.

If it doesn't exist, it's a new block.

If it exists, there needs to be a separate 'real' check if it's really
the same block (which is slow, but that's happening <<1% of the time).

I hope it is possible to understand my thoughts. I'm in a hotel and I
possibly cannot track the emails in this list in the next hours or days.

Regards, HJC

>> 1/3 is not sufficient for dedup in my opinion - you can get that with
>> normal compression at the block level.
>
> 1/3 is what gives me real time data of a production environment in a
> mixed VM setup without compression.
>
> Thomas


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Jan-Frode Myklebust
On 2009-05-04, Thomas Glanzmann  wrote:
>
>> As far as I understand, VMware already ships this "gold image" feature
>> (as they call it) for Windows environments and claims it to be very
>> efficient.
>
> they call it ,,thin or shallow clones''

"thin or shallow clones" sounds more like sparse images. I believe "linked 
clones"
is the word for running multiple virtual machines off a single gold image. Ref,
the "VMware View Composer" section of:

http://www.vmware.com/products/view/whatsincluded.html


> Looking at the website content, it also revealed that VMware will have a
> similar feature for their workhorse ,,esx server'' in the upcoming
> release, however my point still stands. Ship out a service pack for
> Windows and you get 1.5 GByte of modified data that is not deduped.

"All desktops that are linked to a master image can be patched or updated
 simply by updating the master image, without affecting users’ settings,
 data or applications."


  -jf

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Ric,

> I would not categorize it as offline, but just not as inband (i.e., you can 
> run a low priority background process to handle dedup).

> Offline windows are extremely rare in production sites these days and
> it could take a very long time to do dedup at the block level over a
> large file system :-)

let me rephrase, by offline I meant asynchronous during off hours.

> 1/3 is not sufficient for dedup in my opinion - you can get that with 
> normal compression at the block level.

1/3 is what gives me real time data of a production environment in a
mixed VM setup without compression.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Andrey,

> As far as I understand, VMware already ships this "gold image" feature
> (as they call it) for Windows environments and claims it to be very
> efficient.

they call it ,,thin or shallow clones'' and ship it with desktop
virtualization (one VM per thin-client user) and for VMware Lab Manager.
Looking at the website content, it also revealed that VMware will have a
similar feature for their workhorse ,,esx server'' in the upcoming
release, however my point still stands. Ship out a service pack for
Windows and you get 1.5 GByte of modified data that is not deduped, because
that is a once-in-a-VM's-life opportunity.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Andrey Kuzmin
On Mon, May 4, 2009 at 8:03 PM, Ric Wheeler  wrote:
> Thomas Glanzmann wrote:
>
> I think that Chris already mentioned that you can (for virt OS images) also
> imagine using copy on write snapshots (of something that is mostly read-only
> like the OS system partitions).  Make 50 copy-on-write snapshots of "/"
> (excluding /home) and you have an effective compression of 98% until someone
> rudely starts writing at least :-)
>

As far as I understand, VMware already ships this "gold image" feature
(as they call it) for Windows environments and claims it to be very
efficient.

Regards,
Andrey

> ric
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

Thomas Glanzmann wrote:

> Hello Ric,
>
>> (1) Block level or file level dedup?
>
> what is the difference between the two?
>
>> (2) Inband dedup (during a write) or background dedup?
>
> I think inband dedup is way too intensive on resources (memory) and also
> would kill every performance benchmark. So I think the offline dedup is
> the right way to go.

I would not categorize it as offline, but just not as inband (i.e., you can run
a low priority background process to handle dedup).  Offline windows are
extremely rare in production sites these days and it could take a very long time
to do dedup at the block level over a large file system :-)

>> (3) How reliably can you protect the pool of blocks? How reliably can
>> you protect the database that maps hashes to blocks?
>
> You have to lock down the i/o requests for the blocks in question and
> compare them byte by byte anyway, just to make sure.

Yes, one advantage we had in Centera was that the objects were read-only and
whole files, so this was not an issue for us.

>> (4) Can you give users who are somewhat jaded confidence in your
>> solution (this is where stats come in very handy!)
>
> For virtual machines you can reduce the used data by 1/3. Of course it
> can blow up in your face when you don't watch your physical resources
> closely.
>
> Thomas

1/3 is not sufficient for dedup in my opinion - you can get that with normal
compression at the block level.

Put another way, if the baseline is bzip2 levels of compression for the block
device, when is the complexity of dedup going to pay off?

I think that Chris already mentioned that you can (for virt OS images) also
imagine using copy on write snapshots (of something that is mostly read-only
like the OS system partitions).  Make 50 copy-on-write snapshots of "/"
(excluding /home) and you have an effective compression of 98% until someone
rudely starts writing at least :-)

ric


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Ric,

> (1) Block level or file level dedup?

what is the difference between the two?

> (2) Inband dedup (during a write) or background dedup?

I think inband dedup is way too intensive on resources (memory) and also
would kill every performance benchmark. So I think the offline dedup is
the right way to go.

> (3) How reliably can you protect the pool of blocks? How reliably can
> you protect the database that maps hashes to blocks?

You have to lock down the i/o requests for the blocks in question and
compare them byte by byte anyway, just to make sure.

> (4) Can you give users who are somewhat jaded confidence in your
> solution (this is where stats come in very handy!)

For virtual machines you can reduce the used data by 1/3. Of course it
can blow up in your face when you don't watch your physical resources
closely.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:

> Ric Wheeler schrieb:
>> One thing in the above scheme that would be really interesting for
>> all possible hash functions is maintaining good stats on hash
>> collisions, effectiveness of the hash, etc. There has been a lot of
>> press about MD5 hash collisions for example - it would be really neat
>> to be able to track real world data on those,
>
> See here ("The hashing function"):
>
> http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues
>
> It's not "real world data", but it gives some overview which applies
> here.

I have lots of first hand "real world" data from 5 years at EMC in the
Centera group where we used various types of hashes to do single
instancing at the file level, but those are unfortunately not public :-)
Note that other parts of EMC do dedup at the block level (Avamar most
notably).

The key to doing dedup at scale is answering a bunch of questions:

(1) Block level or file level dedup?
(2) Inband dedup (during a write) or background dedup?
(3) How reliably can you protect the pool of blocks? How reliably
can you protect the database that maps hashes to blocks?
(4) Can you give users who are somewhat jaded confidence in your
solution (this is where stats come in very handy!)
(5) In the end, dedup is basically a data compression trick - how
effective is it (including the costs of the metadata, etc) compared to
less complex schemes? What does it cost in CPU, DRAM and impact on the
foreground workload?

Dedup is a very active area in the storage world, but you need to be
extremely robust in the face of failure since a single block failure
could (worst case!) lose all stored data in the pool :-)

Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Tomasz Chmielewski

Ric Wheeler schrieb:

> One thing in the above scheme that would be really interesting for all
> possible hash functions is maintaining good stats on hash collisions,
> effectiveness of the hash, etc. There has been a lot of press about MD5
> hash collisions for example - it would be really neat to be able to
> track real world data on those,

See here ("The hashing function"):

http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

It's not "real world data", but it gives some overview which applies here.


--
Tomasz Chmielewski
http://wpkg.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

On 04/28/2009 01:41 PM, Michael Tharp wrote:

> Thomas Glanzmann wrote:
>> no, I just used the md5 checksum. And even if I have a hash collision,
>> which is highly unlikely, it still gives a good house number.
>
> I'd start with a crc32 and/or MD5 to find candidate blocks, then do a
> bytewise comparison before actually merging them. Even the risk of an
> accidental collision is too high, and considering there are plenty of
> birthday-style MD5 attacks it would not be extraordinarily difficult
> to construct a block that collides with e.g. a system library.
>
> Keep in mind that although digests do a fairly good job of making
> unique identifiers for larger chunks of data, they can only hold so
> many unique combinations. Considering you're comparing blocks of a few
> kibibytes in size it's best to just do a foolproof comparison. There's
> nothing wrong with using a checksum/digest as a screening mechanism
> though.
>
> -- m. tharp

One thing in the above scheme that would be really interesting for all
possible hash functions is maintaining good stats on hash collisions,
effectiveness of the hash, etc. There has been a lot of press about MD5
hash collisions for example - it would be really neat to be able to
track real world data on those,


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 17:26 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > Your database should know, and the ioctl could check to see if the
> > source and destination already point to the same thing before doing
> > anything expensive.
> 
> I see.
> 
> > > So, if I only have file, offset, len and not the block number, is there
> > a way from userland to tell if two blocks already point to the same
> > > block?
> 
> > You can use the fiemap ioctl.
> 
> I see.
> 
> One more question: in my dedup test VMs, I had blocks that were often
> referenced; these are the top blocks. The last one alone saves 3 GByte of
> data. My question is now, how often can a block in btrfs be referenced?
> 

The exact answer depends on if we are referencing it from a single file
or from multiple files.  But either way it is roughly 2^32.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris,

> Your database should know, and the ioctl could check to see if the
> source and destination already point to the same thing before doing
> anything expensive.

I see.

> > So, if I only have file, offset, len and not the block number, is there
> > a way from userland to tell if two blocks already point to the same
> > block?

> You can use the fiemap ioctl.

I see.

One more question: in my dedup test VMs, I had blocks that were often
referenced; these are the top blocks. The last one alone saves 3 GByte of
data. My question is now, how often can a block in btrfs be referenced?

top 4 block counts for 8 Kbyte:

32055
59514
63606
415252

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 15:58 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > But, in your ioctls you want to deal with [file, offset, len], not
> > directly with block numbers.  COW means that blocks can move around
> > without you knowing, and some of the btrfs internals will COW files in
> > order to relocate storage.
> 
> > So, what you want is a dedup file (or files) where your DB knows a given
> > offset in the file has a given csum.
> 
> how do I track if a certain block has already been deduplicated?
> 

Your database should know, and the ioctl could check to see if the
source and destination already point to the same thing before doing
anything expensive.

> So, if I only have file, offset, len and not the block number, is there
> a way from userland to tell if two blocks already point to the same
> block?

You can use the fiemap ioctl.
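
[Editor's sketch, not part of the thread: one way to act on this advice from
userspace. The helper below asks FIEMAP for the first extent backing a
[file, offset, len] range; if two ranges report the same fe_physical they
already share an extent. The function name and error handling are invented
for illustration and should be checked against a real kernel.]

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* returns 0 and fills *phys with the physical byte address backing the
 * start of the range, or -1 if the range is unmapped or the call fails */
static int first_physical(const char *path, __u64 offset, __u64 len, __u64 *phys)
{
    struct {
        struct fiemap        fm;    /* request header                  */
        struct fiemap_extent ext;   /* room for one returned extent    */
    } req;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    memset(&req, 0, sizeof(req));
    req.fm.fm_start = offset;
    req.fm.fm_length = len;
    req.fm.fm_flags = FIEMAP_FLAG_SYNC;   /* flush pending writes first     */
    req.fm.fm_extent_count = 1;           /* only the first extent is needed */

    if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) < 0 || req.fm.fm_mapped_extents == 0) {
        close(fd);
        return -1;
    }
    *phys = req.fm.fm_extents[0].fe_physical;
    close(fd);
    return 0;
}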

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris,

> But, in your ioctls you want to deal with [file, offset, len], not
> directly with block numbers.  COW means that blocks can move around
> without you knowing, and some of the btrfs internals will COW files in
> order to relocate storage.

> So, what you want is a dedup file (or files) where your DB knows a given
> offset in the file has a given csum.

how do I track if a certain block has already been deduplicated?

So, if I only have file, offset, len and not the block number, is there
a way from userland to tell if two blocks already point to the same
block?

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 14:03 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > You can start with the code documentation section on
> > http://btrfs.wiki.kernel.org
> 
> I read through this and at the moment one questions come in my mind:
> 
> http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png
> 
> Looking at this picture, when I'm going to implement the dedup code, do I also
> have to take care to spread the blocks over the different devices or is
> there already infrastructure in place that automates that process?

The layering inside of btrfs means that you don't need to worry about
chunks or multiple devices.

But, in your ioctls you want to deal with [file, offset, len], not
directly with block numbers.  COW means that blocks can move around
without you knowing, and some of the btrfs internals will COW files in
order to relocate storage.

So, what you want is a dedup file (or files) where your DB knows a given
offset in the file has a given csum.
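
[Editor's sketch, not part of the thread: a possible shape for the userspace
database record implied above, keyed by checksum and pointing back at a
[file, offset, len] range. All names and sizes are invented for illustration;
nothing here is a btrfs interface.]

#include <stdint.h>

#define DEDUP_CSUM_BYTES 32                /* e.g. a sha256 of the block */

struct dedup_record {
    uint8_t  csum[DEDUP_CSUM_BYTES];       /* lookup key                 */
    uint64_t ino;                          /* file that owns the data    */
    uint64_t offset;                       /* byte offset in that file   */
    uint64_t len;                          /* length of the range        */
};

A lookup by csum then yields the [file, offset, len] of an existing copy,
which is exactly what a dedup ioctl working on file ranges would need as
its source.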

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Michael Tharp

Thomas Glanzmann wrote:

> Looking at this picture, when I'm going to implement the dedup code, do I also
> have to take care to spread the blocks over the different devices or is
> there already infrastructure in place that automates that process?


If you somehow had blocks duplicated exactly across two devices such 
that the deduplicator discarded all the blocks from one disk and kept 
all the blocks from the other disk, then there would be a problem. The 
best way to solve this would be to always keep the block on the 
least-full device and discard the rest, which shouldn't be too difficult 
(especially since this would be happening in kernelspace), but the 
cheapest solution is to randomize the choice which would be sufficient 
to prevent any further imbalance from developing.


-- m. tharp
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris,

> You can start with the code documentation section on
> http://btrfs.wiki.kernel.org

I read through this and at the moment one question comes to my mind:

http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png

Looking at this picture, when I'm going to implement the dedup code, do I also
have to take care to spread the blocks over the different devices or is
there already infrastructure in place that automates that process?

> I'll write up my ideas on how userspace controlled dedup should work.

I'm looking forward to this and will use it as an implementation road
map.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Bron Gondwana
On Tue, Apr 28, 2009 at 04:58:15PM -0400, Chris Mason wrote:
> > Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> > filesystem:
> > 
> > 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> > be saved to disk from time to time and be restored on startup).
> > 
> 
> Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
> the whole FS.

32Mbyte of memory is teeny too.  You can pick up a server with 32Gb for
about $5k these days.  Using 10% of that gives you room for 100Tb of
storage, which is getting to be a reasonable size.  More than that and
you would be thinking of bigger machines anyway to have enough memory
for a decent cache.

(32Gb seems to be the nice price point, because 4Gb dimms are cheap as
chips, and 8Gb are way expensive.  We're buying 32Gb machines at the
moment)

Bron ( ok, so I don't work with giant enterprise systems.  I can see
   that there are cases where this might suck, but memory seems
   to be growing as fast as disk in the low end )
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Wed, 2009-04-29 at 00:14 +0200, Thomas Glanzmann wrote:
> Hello Chris,
> 
> > They are, but only the crc32c are stored today.
> 
> maybe crc32c is good enough to identify duplicated blocks, I mean we
> only need a hint, the dedup ioctl does the double checking. I will write
> tomorrow a perl script and compare the results to the one that uses md5
> and report back.

Its a start at least.

> 
> > Yes, that's the idea.  An ioctl to walk the tree and report on
> > changes, but this doesn't have to be done with version 1 of the dedup
> > code, you can just scan the file based on mtime/ctime.
> 
> Good point.
> 
> > > > But, the ioctl to actually do the dedup needs to be able to verify a
> > > > given block has the contents you expect it to.  The only place you can
> > > > lock down the pages in the file and prevent new changes is inside the
> > > > kernel.
> 
> > > I totally agree to that. How much time would it consume to implement
> > > such a systemcall?
> 
> > It is probably a 3 week to one month effort.
> 
> I'm taking the challenge. Is there a document that I can read that
> introduces me to the structures used in btrfs or can someone walk me
> through on the phone to get a quick start?
> 

Great to hear.  It's an ambitious project, but I'll definitely help
explain things.

You can start with the code documentation section on
http://btrfs.wiki.kernel.org

I'll write up my ideas on how userspace controlled dedup should work.

> I also would like to retrieve the checksums and identify the potential
> blocks and after that work is done (even in a very preliminary state) in
> a way that someone can work with it, I would like to move on to the
> dedup ioctl.

Sounds fair, I'll forward the original patch.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris,

> They are, but only the crc32c are stored today.

maybe crc32c is good enough to identify duplicated blocks, I mean we
only need a hint, the dedup ioctl does the double checking. I will write
tomorrow a perl script and compare the results to the one that uses md5
and report back.
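
[Editor's sketch, not part of the thread: the "cheap checksum as a hint"
flow being discussed here, in C. It uses zlib's crc32() purely for
illustration (btrfs itself uses crc32c, a different polynomial); the function
name and block size are invented. The point is that equal checksums only
nominate a candidate pair, and a byte-wise compare decides.]

#include <stdint.h>
#include <string.h>
#include <zlib.h>                       /* crc32(); link with -lz */

#define BLOCK_SIZE 4096

/* returns 1 only if the two blocks are really identical */
static int blocks_match(const unsigned char *a, const unsigned char *b)
{
    uint32_t ca = crc32(0L, a, BLOCK_SIZE);
    uint32_t cb = crc32(0L, b, BLOCK_SIZE);

    if (ca != cb)
        return 0;                       /* different checksums: not duplicates */

    /* an equal checksum is only a hint; verify before merging anything */
    return memcmp(a, b, BLOCK_SIZE) == 0;
}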

> Yes, that's the idea.  An ioctl to walk the tree and report on
> changes, but this doesn't have to be done with version 1 of the dedup
> code, you can just scan the file based on mtime/ctime.

Good point.

> > > But, the ioctl to actually do the dedup needs to be able to verify a
> > > given block has the contents you expect it to.  The only place you can
> > > lock down the pages in the file and prevent new changes is inside the
> > > kernel.

> > I totally agree to that. How much time would it consume to implement
> > such a systemcall?

> It is probably a 3 week to one month effort.

I'm taking the challenge. Is there a document that I can read that
introduces me to the structures used in btrfs or can someone walk me
through on the phone to get a quick start?

I also would like to retrieve the checksums and identify the potential
blocks and after that work is done (even in a very preliminary state) in
a way that someone can work with it, I would like to move on to the
dedup ioctl.

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 23:12 +0200, Thomas Glanzmann wrote:
> Hello,
> 
> > > - Implement a system call that reports all checksums and unique
> > >   block identifiers for all stored blocks.
> 
> > This would require storing the larger checksums in the filesystem.  It
> > is much better done in the dedup program.
> 
> I think I misunderstood something here. I thought the checksums per
> block would already be stored somewhere in btrfs?

They are, but only the crc32c are stored today.

> 
> > > - Implement another system call that reports all checksums and
> > >   unique identifiers for all stored blocks since the last
> > >   report. This can be easily implemented:
> 
> > This is racey because there's no way to prevent new changes.
> 
> I got the point.
> 
> > >   Use a block bitmap for every block on the filesystem use one
> > >   bit. If the block is modified set the bit to one, when a
> > >   bitmap is retrieved simply zero it out:
> 
> > > Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> > > filesystem:
> 
> > > 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> > > be saved to disk from time to time and be restored on startup).
> 
> > Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
> > across the whole FS.  Btrfs has metadata that can quickly and easily
> > tell you which files and which blocks in which files have changed
> > since a given transaction id.  This is how you want to find new
> > things.
> 
> You're right, the bitmap would not scale. So what is missing is a
> system call to report the changes to userland? (Is this feature used to
> generate off-site backups as well?)

Yes, that's the idea.  An ioctl to walk the tree and report on changes,
but this doesn't have to be done with version 1 of the dedup code, you
can just scan the file based on mtime/ctime.
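
A rough sketch of that mtime/ctime fallback, assuming a hypothetical
timestamp file to remember the last pass (illustration only, not btrfs
code; the state file name and tree walk are made up for the example):

#!/usr/bin/env python3
# Sketch of the "scan by mtime/ctime" fallback: walk a tree and report
# files changed since the last dedup pass.
import os
import sys
import time

STATE_FILE = ".last_dedup_scan"  # hypothetical timestamp store

def last_scan_time(root):
    try:
        with open(os.path.join(root, STATE_FILE)) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return 0.0  # never scanned: report everything

def changed_files(root):
    since = last_scan_time(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Either data (mtime) or metadata (ctime) changed since last scan.
            if max(st.st_mtime, st.st_ctime) > since:
                yield path

def record_scan(root):
    with open(os.path.join(root, STATE_FILE), "w") as f:
        f.write(str(time.time()))

if __name__ == "__main__":
    root = sys.argv[1]
    for path in changed_files(root):
        print(path)
    record_scan(root)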

> 
> > But, the ioctl to actually do the dedup needs to be able to verify a
> > given block has the contents you expect it to.  The only place you can
> > lock down the pages in the file and prevent new changes is inside the
> > kernel.
> 
> I totally agree with that. How much time would it take to implement
> such a system call?

It is probably a 3 week to one month effort.

-chris





Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Dmitri Nikulin
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason  wrote:
> So you need an extra index either way.  It makes sense to keep the
> crc32c csums for fast verification of the data read from disk and only
> use the expensive csums for dedup.

What about self-healing? With only a CRC32 to distinguish a good block
from a bad one, statistically you're likely to get an incorrectly
healed block once in every few billion blocks. That may not happen on
your machine, but it will on somebody's, since the probability is far
too high for it not to happen to someone. Even a 64-bit checksum
would drop the probability considerably, but I'd really only start at 128
bits. NetApp uses 64 bits per 4k of data; ZFS uses 256 bits per block, and
those checksums trace back to the root like a highly dynamic Merkle tree.
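
For a rough sense of scale, a tiny illustrative calculation (not anyone's
measured numbers): a w-bit checksum lets a corrupted block slip through
undetected with probability about 2^-w, so the expected number of corrupted
blocks per false match grows as 2^w:

#!/usr/bin/env python3
# Back-of-the-envelope numbers behind the checksum-width argument:
# if corruption scrambles a block, a w-bit checksum still matches the bad
# data by chance with probability ~2**-w.  Purely illustrative.
def expected_blocks_before_false_match(bits):
    # Expected number of corrupted blocks before one slips past the checksum.
    return 2 ** bits

for bits in (32, 64, 128, 256):
    n = expected_blocks_before_false_match(bits)
    print(f"{bits:3d}-bit checksum: ~{n:.3e} corrupted blocks per undetected error")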

In the CRC case, the only safe redundancy is one that keeps 3+ copies of
the block and compares the raw data itself, at which point you may as
well have just been using self-healing RAID1 without checksums.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello,

> > - Implement a system call that reports all checksums and unique
> >   block identifiers for all stored blocks.

> This would require storing the larger checksums in the filesystem.  It
> is much better done in the dedup program.

I think I misunderstood something here. I thought the checksums per
block would already be stored somewhere in btrfs?

> > - Implement another system call that reports all checksums and
> >   unique identifiers for all stored blocks since the last
> >   report. This can be easily implemented:

> This is racey because there's no way to prevent new changes.

I got the point.

> >   Use a block bitmap: for every block on the filesystem use one
> >   bit. If the block is modified, set the bit to one; when the
> >   bitmap is retrieved, simply zero it out:

> > Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> > filesystem:

> > 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> > be saved to disk from time to time and be restored on startup).

> Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
> across the whole FS.  Btrfs has metadata that can quickly and easily
> tell you which files and which blocks in which files have changed
> since a given transaction id.  This is how you want to find new
> things.

You're right, the bitmap would not scale. So what is missing is a
system call to report the changes to userland? (Is this feature used to
generate off-site backups as well?)

> But, the ioctl to actually do the dedup needs to be able to verify a
> given block has the contents you expect it to.  The only place you can
> lock down the pages in the file and prevent new changes is inside the
> kernel.

I totally agree with that. How much time would it take to implement
such a system call?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote:
> Hello Heinz,
> 
> > I wrote a backup tool which uses dedup, so I know a little bit about
> > the problem and the performance impact if the checksums are not in
> > memory (optionally in that tool).
> > http://savannah.gnu.org/projects/storebackup
> 
> > Dedup really helps a lot - I think more than I could have imagined before I
> > was engaged in this kind of backup. You would not believe how many
> > identical files there are in a filesystem, to give a simple example.
> 
> I saw it with my own eyes (see my previous e-mail).
> 
> > EMC has very big boxes for this with lots of RAM in it.  I think the
> > first problem that has to be solved is the memory problem.  Perhaps
> > something asynchronous that finds identical blocks and stores the
> > checksums on disk?
> 
> I think we already have a very nice solution in order to solve that
> issue:
> 
> - Implement a system call that reports all checksums and unique
>   block identifiers for all stored blocks.
> 

This would require storing the larger checksums in the filesystem.  It
is much better done in the dedup program.

> - Implement another system call that reports all checksums and
>   unique identifiers for all stored blocks since the last
>   report. This can be easily implemented:

This is racey because there's no way to prevent new changes.

> 
>   Use a block bitmap: for every block on the filesystem use one
>   bit. If the block is modified, set the bit to one; when the
>   bitmap is retrieved, simply zero it out:

> Assuming a 4 kbyte block size that would mean for a 1 Tbyte
> filesystem:
> 
> 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
> be saved to disk from time to time and be restored on startup).
> 

Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
the whole FS.  Btrfs has metadata that can quickly and easily tell you
which files and which blocks in which files have changed since a given
transaction id.  This is how you want to find new things.

But, the ioctl to actually do the dedup needs to be able to verify a
given block has the contents you expect it to.  The only place you can
lock down the pages in the file and prevent new changes is inside the
kernel.

-chris




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz,

> I wrote a backup tool which uses dedup, so I know a little bit about
> the problem and the performance impact if the checksums are not in
> memory (optionally in that tool).
> http://savannah.gnu.org/projects/storebackup

> Dedup really helps a lot - I think more than I could have imagined before I
> was engaged in this kind of backup. You would not believe how many
> identical files there are in a filesystem, to give a simple example.

I saw it with my own eyes (see my previous e-mail).

> EMC has very big boxes for this with lots of RAM in it.  I think the
> first problem that has to be solved is the memory problem.  Perhaps
> something asynchronous that finds identical blocks and stores the
> checksums on disk?

I think we already have a very nice solution in order to solve that
issue:

- Implement a system call that reports all checksums and unique
  block identifiers for all stored blocks.

- Implement another system call that reports all checksums and
  unique identifiers for all stored blocks since the last
  report. This can be easily implemented:

  Use a block bitmap: for every block on the filesystem use one
  bit. If the block is modified, set the bit to one; when the
  bitmap is retrieved, simply zero it out:

Assuming a 4 kbyte block size that would mean for a 1 Tbyte
filesystem:

1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
be saved to disk from time to time and be restored on startup).

- Write a userland program that identifies duplicated blocks
  (for example by counting the occurrences of each checksum, using
  Tokyo Cabinet[1] as persistent storage); a rough sketch of this
  follows after the list.

- Implement a system call that gets hints from userland about
  blocks that might be deduplicated, and dedups them after
  verifying that they do in fact match on a byte-by-byte basis.
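
A rough sketch of the userland duplicate finder described above, under
assumptions: sqlite3 stands in for Tokyo Cabinet, a plain file walk stands
in for the proposed checksum-reporting system call, and the block size and
database name are invented for the example:

#!/usr/bin/env python3
# Sketch of the proposed userland side: count checksum occurrences in a
# persistent store and emit dedup hints for blocks seen more than once.
import hashlib
import os
import sqlite3
import sys

BLOCK_SIZE = 4096  # assumed block size

def open_store(path="dedup-hints.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS blocks (
                      digest TEXT, file TEXT, offset INTEGER)""")
    db.execute("CREATE INDEX IF NOT EXISTS by_digest ON blocks(digest)")
    return db

def index_tree(db, root):
    # Each indexing pass appends rows; a real tool would track what it
    # has already seen instead of re-inserting.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                offset = 0
                while True:
                    data = f.read(BLOCK_SIZE)
                    if not data:
                        break
                    digest = hashlib.sha256(data).hexdigest()
                    db.execute("INSERT INTO blocks VALUES (?, ?, ?)",
                               (digest, path, offset))
                    offset += len(data)
    db.commit()

def dedup_hints(db):
    # Every digest seen more than once is a candidate for the kernel to
    # verify byte-for-byte and deduplicate.
    return db.execute("""SELECT digest, COUNT(*) FROM blocks
                         GROUP BY digest HAVING COUNT(*) > 1""").fetchall()

if __name__ == "__main__":
    store = open_store()
    index_tree(store, sys.argv[1])
    for digest, count in dedup_hints(store):
        print(f"{count:4d}x {digest[:16]}")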

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
On Tuesday, 28 April 2009 at 22:16:19, Thomas Glanzmann wrote:
> Hello Heinz,
>
> > It's not only CPU time, it's also memory. You need 32 bytes for each 4k
> > block.  It needs to be in RAM for performance reasons.
>
> exactly and that is not going to scale.
>
> Thomas


Hi Thomas,

I wrote a backup tool which uses dedup, so I know a little bit about the 
problem and the performance impact if the checksums are not in memory 
(optionally in that tool).
http://savannah.gnu.org/projects/storebackup

Dedup really helps a lot - I think more than I could have imagined before I was
engaged in this kind of backup. You would not believe how many identical files
there are in a filesystem, to give a simple example.

EMC has very big boxes for this with lots of RAM in it.
I think the first problem that has to be solved is the memory problem.
Perhaps something asynchronous that finds identical blocks and stores the
checksums on disk?

Heinz

