Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-22 Thread Gabriel de Perthuis

On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few tweaks
to suggest to the argument checking.


Awesome, thanks for looking this over!


At first the 1M limitation on the length was a bit inconvenient, but making
repeated calls in userspace is okay and allows for better error recovery
(for example, repeating the calls when only a portion of the ranges is
identical).  The destination file appears to be fragmented into 1M extents,
but these are contiguous so it's not really a problem.


Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


A cursor-style API could work here: make the offset and length 
parameters in/out, exit early in case of error or after the read quota 
has been used up.


The caller can retry as long as the length is nonzero (and at least one 
block), and the syscall will return frequently enough that it won't 
block an unmount operation or concurrent access to the ranges.



Requiring the offset or the length to align is spurious however; it doesn't
translate to any alignment in the extent tree (especially with
compression).  Requiring a minimum length of a few blocks or dropping the
alignment condition entirely would make more sense.


I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


I've had a look, and btrfs fiemap only sets FIEMAP_EXTENT_NOT_ALIGNED 
for inline extents, so the alignment requirement makes sense.  The 
caller should do the alignment and decide if it wants to extend a bit 
and accept a not-same status or shrink a bit, so just keep it as is and 
maybe add explanatory comments.



I notice there is a restriction on cross-subvolume deduplication. Hopefully
it can be lifted like it was done in 3.6 for the clone ioctl.


Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


Good to have this code shared (compression support is another win). 
bedup will need feature parity to switch from one ioctl to the other.



Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an
experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in the
bedup readme and some unnecessary subtleties in the code.


Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


See ranges_same and dedup_fileset.  The ImmutableFDs stuff can be 
removed and the fact that dedup can be partially successful over a range 
will ripple through.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be 

Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-21 Thread Mark Fasheh
On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:
 Hi,

 The following series of patches implements in btrfs an ioctl to do
 offline deduplication of file extents.

 I am a fan of this patch, the API is just right.  I just have a few tweaks 
 to suggest to the argument checking.

Awesome, thanks for looking this over!


 At first the 1M limitation on the length was a bit inconvenient, but making 
 repeated calls in userspace is okay and allows for better error recovery 
 (for example, repeating the calls when only a portion of the ranges is 
 identical).  The destination file appears to be fragmented into 1M extents, 
 but these are contiguous so it's not really a problem.

Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


 Requiring the offset or the length to align is spurious however; it doesn't 
 translate to any alignment in the extent tree (especially with 
 compression).  Requiring a minimum length of a few blocks or dropping the 
 alignment condition entirely would make more sense.

I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


 I notice there is a restriction on cross-subvolume deduplication. Hopefully 
 it can be lifted like it was done in 3.6 for the clone ioctl.

Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


 Deduplicating frozen subvolumes works fine, which is great for backups.

 Basic integration with bedup, my offline deduplication tool, is in an 
 experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

 Thanks to this, I look forward to shedding most of the caveats given in the 
 bedup readme and some unnecessary subtleties in the code.

Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


 I've made significant updates and changes from the original. In
 particular the structure passed is more fleshed out, this series has a
 high degree of code sharing between itself and the clone code, and the
 locking has been updated.

 The ioctl accepts a struct:

 struct btrfs_ioctl_same_args {
  __u64 logical_offset;   /* in - start of extent in source */
  __u64 length;   /* in - length of extent */
  __u16 total_files;  /* in - total elements in info array */

 Nit: total_files sounds like it would count the source file.
 dest_count would be better.

 By the way, extent-same might be better named range-same, since there is no 
 need for the input to fall on extent boundaries.

I actually don't like a lot of the naming. Much of it is historic from
Josef's early patch and the rest of the names I came up with earlier in
development while I was still feeling out the problem. I'm defnitely gonna
fix up total_files though - you're absolutely right that it's needlessly
confusing.


Thanks again for looking at this Gabriel. If you don't have any objections
I'll add you as CC to this series when I send new versions to the list from
now on.
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-20 Thread Gabriel de Perthuis

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few 
tweaks to suggest to the argument checking.


At first the 1M limitation on the length was a bit inconvenient, but 
making repeated calls in userspace is okay and allows for better error 
recovery (for example, repeating the calls when only a portion of the 
ranges is identical).  The destination file appears to be fragmented 
into 1M extents, but these are contiguous so it's not really a problem.


Requiring the offset or the length to align is spurious however; it 
doesn't translate to any alignment in the extent tree (especially with 
compression).  Requiring a minimum length of a few blocks or dropping 
the alignment condition entirely would make more sense.


I notice there is a restriction on cross-subvolume deduplication. 
Hopefully it can be lifted like it was done in 3.6 for the clone ioctl.


Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an 
experimental branch:


  https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in 
the bedup readme and some unnecessary subtleties in the code.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is 
no need for the input to fall on extent boundaries.



__u16 files_deduped;/* out - number of files that got deduped */
__u32 reserved;
struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has it's own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
   return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS 1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
__s64 fd;   /* in - destination file */
__u64 logical_offset;   /* in - start of extent in destination */
__u64 bytes_deduped;/* out - total # of bytes we were able
 * to dedupe from this file */
/* status of this dedupe operation:
 * 0 if dedup succeeds
 *  0 for error
 * == BTRFS_SAME_DATA_DIFFERS if data differs
 */
__s32 status;   /* out - see above description */
__u32 reserved;
};


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-16 Thread Mark Fasheh
Hi,
 
The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.

To be clear, offline in this sense means that the file system is
mounted and running, but the dedupe is not done during file writes,
but after the fact when some userspace software initiates a dedupe.

The primary patch is loosely based off of one sent by Josef Bacik back
in January, 2011.

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508

I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.


The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */
__u16 files_deduped;/* out - number of files that got deduped */
__u32 reserved;
struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has it's own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
  return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS 1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
__s64 fd;   /* in - destination file */
__u64 logical_offset;   /* in - start of extent in destination */
__u64 bytes_deduped;/* out - total # of bytes we were able
 * to dedupe from this file */
/* status of this dedupe operation:
 * 0 if dedup succeeds
 *  0 for error
 * == BTRFS_SAME_DATA_DIFFERS if data differs
 */
__s32 status;   /* out - see above description */
__u32 reserved;
};


The kernel patches are based off Linux v3.8. The ioctl has been tested
in the most basic sense, and should not be trusted to keep your data
safe. There are bugs.

A git tree for the kernel changes can be found at:

https://github.com/markfasheh/btrfs-extent-same


I have a userspace project, duperemove available at:

https://github.com/markfasheh/duperemove

Hopefully this can serve as an example of one possible usage of the ioctl.

duperemove takes a list of files as argument and will search them for
duplicated extents. My most recent changes have been to integrate it
with btrfs_extent_same so that the '-D' switch will have it fire off
dedupe requests once processing of data is complete. Integration with
extent_same has *not* been tested yet so don't expect that to work
flawlessly.

Within the duperemove repo is a file, btrfs-extent-same.c that acts as
a test wrapper around the ioctl. It can be compiled completely
seperately from the rest of the project via make
btrfs-extent-same. This makes direct testing of the ioctl more
convenient.

Code review is very much appreciated. Thanks,
 --Mark
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-16 Thread Marek Otahal
Hi Mark, 

could you compare (appart from online/offline) your implementation to LiuBo's 
work?, appeared on ML a while ago: 
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html

It would be interesting if the two approaches could share some code, and also 
confirmation that using one technique does not 
disregard using the other in future. 

Best wishes, 
Mark

On Tuesday 16 April 2013 15:15:31 Mark Fasheh wrote:
 Hi,
  
 The following series of patches implements in btrfs an ioctl to do
 offline deduplication of file extents.
 
 To be clear, offline in this sense means that the file system is
 mounted and running, but the dedupe is not done during file writes,
 but after the fact when some userspace software initiates a dedupe.
 
 The primary patch is loosely based off of one sent by Josef Bacik back
 in January, 2011.
 
 http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508
 
 I've made significant updates and changes from the original. In
 particular the structure passed is more fleshed out, this series has a
 high degree of code sharing between itself and the clone code, and the
 locking has been updated.
 
...
 
 Code review is very much appreciated. Thanks,
  --Mark
-- 

Marek Otahal :o)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-16 Thread Mark Fasheh
On Wed, Apr 17, 2013 at 12:50:04AM +0200, Marek Otahal wrote: could you
 compare (appart from online/offline) your implementation to LiuBo's work?,
 appeared on ML a while ago:
 http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html

Well that's the primary difference. Liu Bo's patch requires a format change
also since it's done online. My patch requires no format change. So they're
complimentary approaches in my opinion.

There's also the possibility that some other file systems could pick up the
ioctl. Ocfs2 in particular should be able to.


 It would be interesting if the two approaches could share some code, and
 also confirmation that using one technique does not disregard using the
 other in future.

Both features can exist together and probably should, I can see great uses
for both cases.

I haven't looked at the patches but with respect to code sharing I'll take a
look. My patches don't actually add any custom code for the actual let's
de-dupe this extent as I re-use the code from btrfs_ioctl_clone().
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-16 Thread Liu Bo
On Tue, Apr 16, 2013 at 04:17:15PM -0700, Mark Fasheh wrote:
 On Wed, Apr 17, 2013 at 12:50:04AM +0200, Marek Otahal wrote: could you
  compare (appart from online/offline) your implementation to LiuBo's work?,
  appeared on ML a while ago:
  http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html
 
 Well that's the primary difference. Liu Bo's patch requires a format change
 also since it's done online. My patch requires no format change. So they're
 complimentary approaches in my opinion.
 
 There's also the possibility that some other file systems could pick up the
 ioctl. Ocfs2 in particular should be able to.
 
 
  It would be interesting if the two approaches could share some code, and
  also confirmation that using one technique does not disregard using the
  other in future.
 
 Both features can exist together and probably should, I can see great uses
 for both cases.
 
 I haven't looked at the patches but with respect to code sharing I'll take a
 look. My patches don't actually add any custom code for the actual let's
 de-dupe this extent as I re-use the code from btrfs_ioctl_clone().

In online dedup, I just make some changes in write path, as we regard dedup as a
special kind of compression, doing dedup as compression way is the goal.

The difference is where hash database is -- offline dedup puts it in userspace
while online dedup in kernel.

Although there might be no code that can be shared here, I agree that both
online and offline one are useful.

Good job.

thanks,
liubo
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html