Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem

2016-08-05 Thread Gabriel C

On 04.08.2016 18:53, Lutz Vieweg wrote:
> 
> I was today hit by what I think is probably the same bug:
> A btrfs on a close-to-4TB sized block device, only half filled
> to almost exactly 2 TB, suddenly says "no space left on device"
> upon any attempt to write to it. The filesystem was NOT automatically
> switched to read-only by the kernel, I should mention.
> 
> Re-mounting (which is a pain as this filesystem is used for
> $HOMEs of a multitude of active users who I have to kick from
> the server for doing things like re-mounting) removed the symptom
> for now, but from what I can read in the linux-btrfs mailing list
> archives, it is pretty likely the symptom will re-appear.
> 
> Here are some more details:
> 
> Software versions:
>> linux-4.6.1 (vanilla from kernel.org)
...
> 
> dmesg output from the time the "no space left on device"-symptom
> appeared:
> 
>> [5171203.601620] WARNING: CPU: 4 PID: 23208 at fs/btrfs/inode.c:9261 
>> btrfs_destroy_inode+0x263/0x2a0 [btrfs]


> ...
>> [5171230.306037] WARNING: CPU: 18 PID: 12656 at fs/btrfs/extent-tree.c:4233 
>> btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]


Sounds like the bug I hit as well..

To fix this you'll need:


crazy@zwerg:~/Work/linux-git$ git show 8b8b08cbf
commit 8b8b08cbfb9021af4b54b4175fc4c51d655aac8c
Author: Chris Mason 
Date:   Tue Jul 19 05:52:36 2016 -0700

Btrfs: fix delalloc accounting after copy_from_user faults

Commit 56244ef151c3cd11 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.

Signed-off-by: Chris Mason 
cc: sta...@vger.kernel.org # v4.6+

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f3f61d1..bcfb4a2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1629,13 +1629,11 @@ again:
 * managed to copy.
 */
if (num_sectors > dirty_sectors) {
-   /*
-* we round down because we don't want to count
-* any partial blocks actually sent through the
-* IO machines
-*/
-   release_bytes = round_down(release_bytes - copied,
- root->sectorsize);
+
+   /* release everything except the sectors we dirtied */
+   release_bytes -= dirty_sectors <<
+   root->fs_info->sb->s_blocksize_bits;
+
if (copied > 0) {
spin_lock(_I(inode)->lock);
BTRFS_I(inode)->outstanding_extents++;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: systemd KillUserProcesses=yes and btrfs scrub

2016-07-31 Thread Gabriel C


On 30.07.2016 22:02, Chris Murphy wrote:
> Short version: When systemd-logind login.conf KillUserProcesses=yes,
> and the user does "sudo btrfs scrub start" in e.g. GNOME Terminal, and
> then logs out of the shell, the user space operation is killed, and
> btrfs scrub status reports that the scrub was aborted. [1]
> 

How is this a bug?

It's exactly what 'KillUserProcesses=yes' is expected to do..
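For anyone who does want a scrub to survive logout under this setting, the usual escape hatches are to start it outside the session scope (a command sketch; the unit name is made up):

```shell
# KillUserProcesses= lives in /etc/systemd/logind.conf; when set to yes,
# everything in the user's session scope dies on logout -- including a
# scrub started from a terminal.

# Option 1: run the scrub as a transient system unit, outside the session:
sudo systemd-run --unit=btrfs-scrub-root btrfs scrub start -B /

# Option 2: exempt this user's processes from session cleanup:
loginctl enable-linger "$USER"
```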



Re: A lot warnings in dmesg while running thunderbird

2016-07-21 Thread Gabriel C


On 21.07.2016 14:56, Chris Mason wrote:
> On 07/20/2016 01:50 PM, Gabriel C wrote:
>>
>> After 24h of running the program and thunderbird all is still fine here.
>>
>> I'll let it run one more day.. but it looks very good.
>>
> 
> Thanks for your time in helping to track this down.  It'll go into the 
> next merge window and be cc'd to stable.
> 

You are welcome :)

The test program ran without problems for 52h.. I think your fix is fine :)

Also feel free to add Tested-by: Gabriel Craciunescu <nix.or....@gmail.com> to
your commit.

Regards,

Gabriel


Re: A lot warnings in dmesg while running thunderbird

2016-07-20 Thread Gabriel C


On 20.07.2016 15:50, Chris Mason wrote:
> 
> 
> On 07/19/2016 08:11 PM, Gabriel C wrote:
>>
>>
>> On 19.07.2016 13:05, Chris Mason wrote:
>>> On Mon, Jul 11, 2016 at 11:28:01AM +0530, Chandan Rajendra wrote:
>>>> Hi Chris,
>>>>
>>>> I am able to reproduce the issue with the 'short-write' program. But before
>>>> the call trace associated with btrfs_destroy_inode(), I see the following 
>>>> call
>>>> trace ...
>>>>
>>>> [ cut here ]
>>>> WARNING: CPU: 2 PID: 2311 at 
>>>> /home/chandan/repos/linux/fs/btrfs/extent-tree.c:4303 
>>>> btrfs_free_reserved_data_space_noquota+0xe8/0x100
>>>
>>> [ ... ]
>>>
>>> Ok, the problem is in how we're dealing with the offset into the sector when
>>> we fail. The dirty_sectors variable already has this accounted in it, so
>>> this patch fixes it for me.  I ran overnight, but I'll let it go for a few
>>> days just to make sure:
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index fac9b839..5842423 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -1629,13 +1629,11 @@ again:
>>>  * managed to copy.
>>>  */
>>> if (num_sectors > dirty_sectors) {
>>> -   /*
>>> -* we round down because we don't want to count
>>> -* any partial blocks actually sent through the
>>> -* IO machines
>>> -*/
>>> -   release_bytes = round_down(release_bytes - copied,
>>> - root->sectorsize);
>>> +
>>> +   /* release everything except the sectors we dirtied */
>>> +   release_bytes -= dirty_sectors <<
>>> +   root->fs_info->sb->s_blocksize_bits;
>>> +
>>> if (copied > 0) {
>>> spin_lock(_I(inode)->lock);
>>> BTRFS_I(inode)->outstanding_extents++;
>>>
>>
>> Since I guess you are testing this on latest git code I started to test on 
>> latest stable.
> 
> Any v4.7-rc or v4.6 stable where the patch applies ;)
> 
>>
>> Until now all seems fine.. your test program is still running without
>> triggering the bug.
>>
>> Also thunderbird is running without triggering the bug.
>>
>> I'll let it run overnight and report back.
> 
> Great, thanks!

After 24h of running the program and thunderbird all is still fine here.

I'll let it run one more day.. but it looks very good.


Regards,

Gabriel


Re: A lot warnings in dmesg while running thunderbird

2016-07-19 Thread Gabriel C


On 19.07.2016 13:05, Chris Mason wrote:
> On Mon, Jul 11, 2016 at 11:28:01AM +0530, Chandan Rajendra wrote:
>> Hi Chris,
>>
>> I am able to reproduce the issue with the 'short-write' program. But before
>> the call trace associated with btrfs_destroy_inode(), I see the following 
>> call
>> trace ...
>>
>> [ cut here ]
>> WARNING: CPU: 2 PID: 2311 at 
>> /home/chandan/repos/linux/fs/btrfs/extent-tree.c:4303 
>> btrfs_free_reserved_data_space_noquota+0xe8/0x100
> 
> [ ... ]
> 
> Ok, the problem is in how we're dealing with the offset into the sector when
> we fail. The dirty_sectors variable already has this accounted in it, so
> this patch fixes it for me.  I ran overnight, but I'll let it go for a few
> days just to make sure:
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fac9b839..5842423 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1629,13 +1629,11 @@ again:
>* managed to copy.
>*/
>   if (num_sectors > dirty_sectors) {
> - /*
> -  * we round down because we don't want to count
> -  * any partial blocks actually sent through the
> -  * IO machines
> -  */
> - release_bytes = round_down(release_bytes - copied,
> -   root->sectorsize);
> +
> + /* release everything except the sectors we dirtied */
> + release_bytes -= dirty_sectors <<
> + root->fs_info->sb->s_blocksize_bits;
> +
>   if (copied > 0) {
>   spin_lock(_I(inode)->lock);
>   BTRFS_I(inode)->outstanding_extents++;
> 

Since I guess you are testing this on latest git code, I started to test on
latest stable.

Until now all seems fine.. your test program is still running without
triggering the bug.

Also thunderbird is running without triggering the bug.

I'll let it run overnight and report back.



Re: A lot warnings in dmesg while running thunderbird

2016-07-08 Thread Gabriel C

On 08.07.2016 14:41, Chris Mason wrote:
>
> On 07/08/2016 05:57 AM, Gabriel C wrote:
>> 2016-07-07 21:21 GMT+02:00 Chris Mason <c...@fb.com>:
>>>
>>> On 07/07/2016 06:24 AM, Gabriel C wrote:
>>>>
>>>> Hi,
>>>>
>>>> while running thunderbird on linux 4.6.3 and 4.7.0-rc6 ( didn't test
>>>> other versions )
>>>> I trigger the following :
>>>
>>> I definitely thought we had this fixed in v4.7-rc.  Can you easily
>>> fsck this filesystem?  Something strange is going on.
>>
>> Yes, btrfs check and btrfs check --check-data-csum are fine, no
>> errors found.
>>
>> If you want me to test any patches let me know.
>
> Can you please try a v4.5 stable kernel?  I'm curious if this really
> is the same regression that I tried to fix in v4.7

I'm on linux 4.5.7 now and everything is fine. I'm writing this email
from thunderbird.. which was not possible on 4.6.3 or 4.7-rc.

Let me know if you want me to test other kernels or whatever else may help
fix this problem.


Regards,

Gabriel C



Re: A lot warnings in dmesg while running thunderbird

2016-07-08 Thread Gabriel C
2016-07-08 14:41 GMT+02:00 Chris Mason <c...@fb.com>:
>
>
> On 07/08/2016 05:57 AM, Gabriel C wrote:
>>
>> 2016-07-07 21:21 GMT+02:00 Chris Mason <c...@fb.com>:
>>>
>>>
>>>
>>> On 07/07/2016 06:24 AM, Gabriel C wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> while running thunderbird on linux 4.6.3 and 4.7.0-rc6 ( didn't test
>>>> other versions )
>>>> I trigger the following :
>>>
>>>
>>>
>>> I definitely thought we had this fixed in v4.7-rc.  Can you easily fsck
>>> this filesystem?  Something strange is going on.
>>
>>
>> Yes, btrfs check and btrfs check --check-data-csum are fine, no errors
>> found.
>>
>> If you want me to test any patches let me know.
>>
>
> Can you please try a v4.5 stable kernel?  I'm curious if this really is the
> same regression that I tried to fix in v4.7
>

Sure, I'll test on 4.5.7 and let you know.


Re: A lot warnings in dmesg while running thunderbird

2016-07-08 Thread Gabriel C
2016-07-07 21:21 GMT+02:00 Chris Mason <c...@fb.com>:
>
>
> On 07/07/2016 06:24 AM, Gabriel C wrote:
>>
>> Hi,
>>
>> while running thunderbird on linux 4.6.3 and 4.7.0-rc6 ( didn't test
>> other versions )
>> I trigger the following :
>
>
> I definitely thought we had this fixed in v4.7-rc.  Can you easily fsck this 
> filesystem?  Something strange is going on.

Yes, btrfs check and btrfs check --check-data-csum are fine, no errors found.

If you want me to test any patches let me know.


Regards,

Gabriel C


A lot warnings in dmesg while running thunderbird

2016-07-07 Thread Gabriel C
]  [] ?
block_group_cache_tree_search+0xb1/0xd0 [btrfs]
[ 6509.253610]  [] ? run_delalloc_nocow+0xa60/0xba0 [btrfs]
[ 6509.253627]  [] ? run_delalloc_range+0x390/0x3b0 [btrfs]
[ 6509.253630]  [] ? flush_tlb_page+0x35/0x90
[ 6509.253647]  [] ?
writepage_delalloc.isra.20+0xfb/0x170 [btrfs]
[ 6509.253664]  [] ? __extent_writepage+0xb3/0x300 [btrfs]
[ 6509.253668]  [] ? __set_page_dirty_nobuffers+0xea/0x140
[ 6509.253685]  [] ?
extent_write_cache_pages.isra.16.constprop.31+0x23c/0x350 [btrfs]
[ 6509.253702]  [] ? extent_writepages+0x48/0x60 [btrfs]
[ 6509.253718]  [] ? btrfs_direct_IO+0x360/0x360 [btrfs]
[ 6509.253723]  [] ? __filemap_fdatawrite_range+0xa2/0xe0
[ 6509.253739]  [] ? btrfs_fdatawrite_range+0x16/0x40 [btrfs]
[ 6509.253755]  [] ? start_ordered_ops+0x10/0x20 [btrfs]
[ 6509.253771]  [] ? btrfs_sync_file+0x41/0x360 [btrfs]
[ 6509.253775]  [] ? do_fsync+0x33/0x60
[ 6509.253778]  [] ? SyS_fsync+0x7/0x10
[ 6509.253782]  [] ? entry_SYSCALL_64_fastpath+0x1a/0xa4

...

See http://paste.opensuse.org/view/simple/86078072 and
http://paste.opensuse.org/view/simple/87276071

This is from running thunderbird for just a few seconds; when I let it run
for a while I have to reboot the system.

$ uname -a
Linux zwerg 4.7.0-rc6 #1 SMP PREEMPT Tue Jul 5 07:48:39 CEST 2016
x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.6.1

sda is HW RAID0

...

Jun 23 14:27:48 localhost kernel: scsi host0: Avago SAS based MegaRAID driver
Jun 23 14:27:48 localhost kernel: scsi 0:0:6:0: Direct-Access ATA
WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
Jun 23 14:27:48 localhost kernel: scsi 0:0:7:0: Direct-Access ATA
WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
Jun 23 14:27:48 localhost kernel: scsi 0:0:10:0: Direct-Access ATA
 ST500NM0011  FTM6 PQ: 0 ANSI: 5
Jun 23 14:27:48 localhost kernel: scsi 0:2:0:0: Direct-Access LSI
MegaRAID SAS RMB 1.40 PQ: 0 ANSI: 5

...

mount | grep sda
/dev/sda1 on / type btrfs
(rw,noatime,compress=lzo,space_cache,autodefrag,subvolid=5,subvol=/)

( tested with and without compression; with just the defaults the
warnings are still the same )

btrfs fi show
Label: none  uuid: 67b2e285-e331-42ad-8478-d78b17ea6970
   Total devices 1 FS bytes used 31.47GiB
   devid    1 size 1.36TiB used 37.06GiB path /dev/sda1


btrfs fi df /
Data, single: total=32.00GiB, used=30.43GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=2.50GiB, used=1.04GiB
GlobalReserve, single: total=368.00MiB, used=0.00B


Regards,

Gabriel C


Re: [survey] sysfs layout for btrfs

2015-08-18 Thread Gabriel de Perthuis
On Sat, 15 Aug 2015 07:40:40 +0800, Anand Jain wrote:

> Hello,
> 
> as of now btrfs sysfs does not include the attributes for the volume
> manager part in its sysfs layout, so it's being developed and there are
> two types of layout here below, so I have a quick survey to know which
> will be preferred. Contenders are:
>
> 1. FS and VM (volume manager) attributes [1] merged sysfs layout
> 
>    /sys/fs/btrfs/fsid -- holds FS attr, VM attr will be added here.
>    /sys/fs/btrfs/fsid/devices/uuid [2] -- btrfs_devices attr here

My vote is for the first one.
Lengthening the UI/API with /pools/ seems unnecessary, and it's
better to get attributes exposed earlier.

> 2. FS and VM attributes separated sysfs layout.
> 
>    /sys/fs/btrfs/fsid -- as is, will continue to hold fs attributes.
>    /sys/fs/btrfs/pools/fsid/ -- will hold VM attributes
>    /sys/fs/btrfs/pools/fsid/devices/sdx -- btrfs_devices attr here



Re: btrfs filesystem show _exact_ freaking size?

2014-11-18 Thread Gabriel de Perthuis
Le 18/11/2014 11:39, Robert White a écrit :
> Howdy,
> 
> How does one get the exact size (in blocks preferably, but bytes okay)
> of the filesystem inside a partition? I know how to get the partition
> size, but that's not useful when shrinking a partition...

dev_item.total_bytes in btrfs-show-super's output is what you're after.



Re: Manual deduplication would be useful

2013-07-23 Thread Gabriel de Perthuis
> Hello,
> 
> For over a year now, I've been experimenting with stacked filesystems
> as a way to save on resources.  A basic OS layer is shared among
> Containers, each of which stacks a layer with modifications on top of
> it.  This approach means that Containers share buffer cache and
> loaded executables.  Concrete technology choices aside, the result is
> rock-solid and the efficiency improvements are incredible, as
> documented here:
> 
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
> 
> One problem with this setup is updating software.  In lieu of
> stacking-support in package managers, it is necessary to do this on a
> per-Container basis, meaning that each installs their own versions,
> including overwrites of the basic OS layer.  Deduplication could
> remedy this, but the generic mechanism is known from ZFS to be fairly
> inefficient.
> 
> Interestingly however, this particular use case demonstrates that a
> much simpler deduplication mechanism than normally considered could
> be useful.  It would suffice if the filesystem could check on manual
> hints, or stack-specifying hints, to see if overlaid files share the
> same file contents; when they do, deduplication could commence.  This
> saves searching through the entire filesystem for every file or block
> written.  It might also mean that the actual stacking is not needed,
> but instead a basic OS could be cloned to form a new basic install,
> and kept around for this hint processing.
> 
> I'm not sure if this should ideally be implemented inside the
> stacking approach (where it would be
> stacking-implementation-specific) or in the filesystem (for which it
> might be too far off the main purpose) but I thought it wouldn't hurt
> to start a discussion on it, given that (1) filesystems nowadays
> service multiple instances, (2) filesystems like Btrfs are based on
> COW, and (3) deduplication is a goal but the generic mechanism could
> use some efficiency improvements.
> 
> I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it).  The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:
  bedup dedup-files

The implementation in master uses a clone ioctl.  Here is Mark
Fasheh's latest patch series to implement a dedup ioctl[2]; it
also comes with a command to work on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove


Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
> Now... since the snapshot's FS tree is a direct duplicate of the
> original FS tree (actually, it's the same tree, but they look like
> different things to the outside world), they share everything --
> including things like inode numbers. This is OK within a subvolume,
> because we have the semantics that subvolumes have their own distinct
> inode-number spaces. If we could snapshot arbitrary subsections of the
> FS, we'd end up having to fix up inode numbers to ensure that they
> were unique -- which can't really be an atomic operation (unless you
> want to have the FS locked while the kernel updates the inodes of the
> billion files you just snapshotted).

I don't think so; I just checked some snapshots and the inos are the same.
Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

> The other thing to talk about here is that while the FS tree is a
> tree structure, it's not a direct one-to-one map to the directory tree
> structure. In fact, it looks more like a list of inodes, in inode
> order, with some extra info for easily tracking through the list. The
> B-tree structure of the FS tree is just a fast indexing method. So
> snapshotting a directory entry within the FS tree would require
> (somehow) making an atomic copy, or CoW copy, of only the parts of the
> FS tree that fall under the directory in question -- so you'd end up
> trying to take a sequence of records in the FS tree, of arbitrary size
> (proportional roughly to the number of entries in the directory) and
> copying them to somewhere else in the same tree in such a way that you
> can automatically dereference the copies when you modify them. So,
> ultimately, it boils down to being able to do CoW operations at the
> byte level, which is going to introduce huge quantities of extra
> metadata, and it all starts looking really awkward to implement (plus
> having to deal with the long time taken to copy the directory entries
> for the thing you're snapshotting).

Btrfs already does CoW of arbitrarily-large files (extent lists);
doing the same for directories doesn't seem impossible.


Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
Le mar. 23 juil. 2013 21:30:13 CEST, Hugo Mills a écrit :
> On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
>>> Now... since the snapshot's FS tree is a direct duplicate of the
>>> original FS tree (actually, it's the same tree, but they look like
>>> different things to the outside world), they share everything --
>>> including things like inode numbers. This is OK within a subvolume,
>>> because we have the semantics that subvolumes have their own distinct
>>> inode-number spaces. If we could snapshot arbitrary subsections of the
>>> FS, we'd end up having to fix up inode numbers to ensure that they
>>> were unique -- which can't really be an atomic operation (unless you
>>> want to have the FS locked while the kernel updates the inodes of the
>>> billion files you just snapshotted).
>>
>> I don't think so; I just checked some snapshots and the inos are the same.
>> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).
>
> That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.

Sorry for misreading you.
Directory snapshots can work by giving a new device number to the snapshot.
There is no need to update inode numbers in that case.


Re: Lots of harddrive chatter on after booting with btrfs on root (slow boot)

2013-07-20 Thread Gabriel de Perthuis
On Sat, 20 Jul 2013 17:15:50 +0200, Jason Russell wrote:
> I've also noted that this excessive hdd chatter does not occur
> immediately after a fresh format with Arch on btrfs root.
> 
> I've made some deductions/assumptions:
> This only seems to occur with btrfs roots.
> This only happens after some number of reboots OR after the partition
> fills up a little bit.
> I'm pretty sure I've ruled out everything except for the filesystem.

In my experience (as of 3.8 or so), Btrfs performance degrades on a
filled-up filesystem, even a comparatively new one.  Various
background workers start to eat io according to atop.

> I have just done two clean installs to more thoroughly compare ext4
> and btrfs roots. So far no excessive hdd chatter from btrfs.
> 
> I have also seen what I have described on two other computers
> (different hardware entirely) where there is lots of hdd chatter from
> btrfs root, and nothing from ext4.
> 
> Here are two threads:
> https://bbs.archlinux.org/viewtopic.php?pid=1117932
> https://bbs.archlinux.org/viewtopic.php?pid=1301684



Re: [PATCH 4/4] btrfs: offline dedupe

2013-07-16 Thread Gabriel de Perthuis
On Mon, 15 Jul 2013 13:55:51 -0700, Zach Brown wrote:
> I'd get rid of all this code by only copying each input argument on to
> the stack as it's needed and by getting rid of the writable output
> struct fields.  (more on this later)
>
> As I said, I'd get rid of the output fields.  Like the other vectored io
> syscalls, the return value can indicate the number of initial
> consecutive bytes that worked.  When no progress was made then it can
> return errors.  Userspace is left to sort out the resulting state and
> figure out the extents to retry in exactly the same way that it found
> the initial extents to attempt to dedupe in the first place.
> 
> (And imagine strace trying to print the inputs and outputs.  Poor, poor,
> strace! :))

The dedup branch that uses this syscall[1] doesn't compare files
before submitting them anymore (the kernel will do it, and ranges
may not fit in cache once I get rid of an unnecessary loop).

I don't have strong opinions on the return style, but it would be
good to have the syscall always make progress by finding at least
one good range before bailing out, and signaling which files were
involved.  With those constraints, the current struct seems like the
cleanest way to pass the data.  The early return you suggest is a
good idea if Mark agrees, but the return condition should be
something like: if one range with bytes_deduped != 0 doesn't get
bytes_deduped incremented by iteration_len in this iteration, bail
out.
That's sufficient to guarantee progress and to know which ranges
were involved.
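The retry rule described above (record the step, then bail out once a range that already made progress stops advancing) can be sketched as a userspace loop; `dedupe_once` is a made-up stand-in for the proposed syscall, not a real interface:

```python
def retry_dedupe(ranges, dedupe_once):
    """Drive a vectored dedupe until every range is done or progress stalls.

    ranges: dict mapping range id -> total bytes to dedupe.
    dedupe_once: stand-in for the syscall; returns bytes deduped per
    range id for one iteration.
    Returns a dict of bytes actually deduped per range.
    """
    done = {r: 0 for r in ranges}
    while any(done[r] < ranges[r] for r in ranges):
        step = dedupe_once(ranges, done)
        # A range that made progress earlier but got nothing this
        # iteration signals a stall: record the step, then bail out
        # and let the caller re-examine state.
        stalled = any(0 < done[r] < ranges[r] and step.get(r, 0) == 0
                      for r in ranges)
        for r, n in step.items():
            done[r] += n
        if stalled or not step:
            break
    return done


# Toy driver: dedupes up to 4096 bytes per range per call.
def fake_dedupe(ranges, done):
    return {r: min(4096, ranges[r] - done[r])
            for r in ranges if done[r] < ranges[r]}

print(retry_dedupe({"a": 8192, "b": 4096}, fake_dedupe))
```

The loop guarantees at least one iteration of progress is kept before bailing, which is enough for the caller to know which ranges were involved.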

> I hope this helps!
> 
> - z

Thank you and everyone involved for the progress on this.

[1] https://github.com/g2p/bedup/tree/wip/dedup-syscall




[XFSTESTS PATCH] btrfs: Test deduplication

2013-06-26 Thread Gabriel de Perthuis
---
The matching kernel patch is here:
https://github.com/g2p/linux/tree/v3.10%2Bextent-same (rebased on 3.10, fixing 
a small conflict)
Requires the btrfs-extent-same command:

- http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26579
- https://github.com/markfasheh/duperemove


 tests/btrfs/313 | 93 +
 tests/btrfs/313.out | 25 ++
 tests/btrfs/group   |  1 +
 3 files changed, 119 insertions(+)
 create mode 100755 tests/btrfs/313
 create mode 100644 tests/btrfs/313.out

diff --git a/tests/btrfs/313 b/tests/btrfs/313
new file mode 100755
index 000..04e4ccb
--- /dev/null
+++ b/tests/btrfs/313
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FS QA Test No. 313
+#
+# Test the deduplication syscall
+#
+#---
+# Copyright (c) 2013 Red Hat, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+ESAME=`set_prog_path btrfs-extent-same`
+
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_command $ESAME
+_require_command $XFS_IO_PROG
+_require_scratch
+
_scratch_mkfs >/dev/null 2>&1
_scratch_mount >> $seqres.full 2>&1
+
+fiemap() {
xfs_io -r -c "fiemap" $1 | tail -n+2
+}
+
+dedup() {
! diff -q <(fiemap $1) <(fiemap $2)
$ESAME $(stat -c %s $1) $1 0 $2 0
diff -u <(fiemap $1) <(fiemap $2)
+}
+
echo "Silence is golden"
+set -e
+
+v1=$SCRATCH_MNT/v1
+v2=$SCRATCH_MNT/v2
+v3=$SCRATCH_MNT/v3
+
+$BTRFS_UTIL_PROG subvolume create $v1
+$BTRFS_UTIL_PROG subvolume create $v2
+
+dd bs=1M status=none if=/dev/urandom of=$v1/file1 count=1
+dd bs=1M status=none if=/dev/urandom of=$v1/file2 count=1
+dd bs=1M status=none if=$v1/file1 of=$v2/file3
+dd bs=1M status=none if=$v1/file1 of=$v2/file4
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $v2 $v3
+
+# identical, multiple volumes
+dedup $v1/file1 $v2/file3
+
+# not identical, same volume
+! $ESAME $((2**20)) $v1/file1 0 $v1/file2 0
+
+# identical, second file on a frozen volume
+dedup $v1/file1 $v3/file4
+
+_scratch_unmount
+_check_scratch_fs
+status=0
+exit
diff --git a/tests/btrfs/313.out b/tests/btrfs/313.out
new file mode 100644
index 000..eabe6be
--- /dev/null
+++ b/tests/btrfs/313.out
@@ -0,0 +1,25 @@
+QA output created by 313
+Silence is golden
+Create subvolume 'sdir/v1'
+Create subvolume 'sdir/v2'
+Create a readonly snapshot of 'sdir/v2' in 'sdir/v3'
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v2/file3
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v1/file2
+1 files asked to be deduped
+i: 0, status: 1, bytes_deduped: 0
+0 total bytes deduped in this operation
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v3/file4
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bc6c256..4c868c8 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -7,5 +7,6 @@
 264 auto
 265 auto
 276 auto rw metadata
 284 auto
 307 auto quick
+313 auto
-- 
1.8.3.1.588.gb04834f



Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
On Thu, 20 Jun 2013 10:16:22 +0100, Hugo Mills wrote:
> On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
>> Hi,
>> 
>> I've observed a rather strange behaviour while trying to mount two
>> identical copies of the same image to different mount points.
>> Each modification to one image is also performed in the second one.
>>
>> touch m2/hello
>> ls -la m1  // will now also include a file called hello
>> 
>> Is this behaviour intentional and known or should I create a bug-report?
> 
> It's known, and not desired behaviour. The problem is that you've
> ended up with two filesystems with the same UUID, and the FS code gets
> rather confused about that. The same problem exists with LVM snapshots
> (or other block-device-layer copies).
> 
> The solution is a combination of a tool to scan an image and change
> the UUID (offline), and of some code in the kernel that detects when
> it's being told about a duplicate image (rather than an additional
> device in the same FS). Neither of these has been written yet, I'm
> afraid.

To clarify, the loop devices are properly distinct, but the first
device ends up mounted twice.

I've had a look at the vfs code, and it doesn't seem to be uuid-aware,
which makes sense because the uuid is a property of the superblock and
the fs structure doesn't expose it.  It's a Btrfs problem.

Instead of redirecting to a different block device, Btrfs could and
should refuse to mount an already-mounted superblock when the block
device doesn't match, somewhere in or below btrfs_mount.  Registering
extra, distinct superblocks for an already mounted raid is a different
matter, but that isn't done through the mount syscall anyway.

>> I've deleted quite a bunch of files on my production system because of
>> this...
> 
> Oops. I'm sorry to hear that. :(
> 
> Hugo.




Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
 Instead of redirecting to a different block device, Btrfs could and
 should refuse to mount an already-mounted superblock when the block
 device doesn't match, somewhere in or below btrfs_mount.  Registering
 extra, distinct superblocks for an already mounted raid is a different
 matter, but that isn't done through the mount syscall anyway.
 
The problem here is that you could quite legitimately mount
 /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with
 UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are
 both part of the same filesystem. So you can't simply prevent mounting
 based on the device that the mount's being done with.

Okay.  The check should rely on a list of known block devices
for a given filesystem uuid.
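
[The proposed check can be modeled in a few lines. This is an illustrative
user-space sketch, not btrfs kernel code: `MountRegistry`,
`DuplicateUuidError` and the method names are invented for the example. It
keeps a registry mapping filesystem UUID to the set of block devices known to
belong to that filesystem, and refuses a mount when the UUID is already
mounted but the device is not in its known set, i.e. a duplicate image.]

```python
# Illustrative model of the proposed kernel check (not actual btrfs code):
# track which block devices are known members of each filesystem UUID, and
# refuse to mount the same UUID from an unrelated device.

class DuplicateUuidError(Exception):
    pass

class MountRegistry:
    def __init__(self):
        self._mounted = {}  # uuid -> set of device paths

    def register_device(self, uuid, device):
        # "btrfs device scan" equivalent: remember devices per filesystem.
        self._mounted.setdefault(uuid, set()).add(device)

    def check_mount(self, uuid, device):
        known = self._mounted.get(uuid)
        if known is not None and device not in known:
            # Same UUID on a device that is not a member: a copy, refuse it.
            raise DuplicateUuidError(
                "%s already mounted from %s" % (uuid, sorted(known)))
        self.register_device(uuid, device)

reg = MountRegistry()
reg.check_mount("AA1234", "/dev/sda")       # first mount, fine
reg.check_mount("AA1234", "/dev/sda")       # remount of the same device, fine
try:
    reg.check_mount("AA1234", "/dev/loop0")  # duplicate image: refused
    refused = False
except DuplicateUuidError:
    refused = True
print(refused)
```

[A multi-device filesystem still works under this model, provided device scan
registers all members before the mount attempt.]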




Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel
 Thank you for your reply. I appreciate it. Unfortunately this issue is a deal 
 killer for us. The ability to take very fast snapshots and replicate them to 
 another site is key for us. We just can't use Btrfs with this setup. That's 
 too bad. Good luck and thank you.

The issue we were discussing is: how to fail early when there are 
duplicate UUIDs.
Duplicate UUIDs will never be supported.
If *your* problem has to do with fast snapshots and fast replication, 
that's supported, see btrfs send/receive.



Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
Le 11/06/2013 22:31, Mark Fasheh a écrit :
 Perhaps this isn't a limitation per se but extent-same requires read/write
 access to the files we want to dedupe.  During my last series I had a
 conversation with Gabriel de Perthuis about access checking where we tried
 to maintain the ability for a user to run extent-same against a readonly
 snapshot. In addition, I reasoned that since the underlying data won't
 change (at least to the user) that we ought only require the files to be
 open for read.
 
 What I found however is that neither of these is a great idea ;)
 
 - We want to require that the inode be open for writing so that an
   unprivileged user can't do things like run dedupe on a performance
   sensitive file that they might only have read access to.  In addition I
   could see it as kind of a surprise (non-standard behavior) to an
   administrator that users could alter the layout of files they are only
   allowed to read.
 
 - Readonly snapshots won't let you open for write anyway (unsurprisingly,
   open() returns -EROFS).  So that kind of kills the idea of them being able
   to open those files for write which we want to dedupe.
 
 That said, I still think being able to run this against a set of readonly
 snapshots makes sense especially if those snapshots are taken for backup
 purposes. I'm just not sure how we can sanely enable it.

The check could be: if (fmode_write || cap_sys_admin).

This isn't incompatible with mnt_want_write, that check is at the
level of the superblocks and vfsmount and not the subvolume fsid.

 Code review is very much appreciated. Thanks,
  --Mark
 
 
 ChangeLog
 
 - check that we have appropriate access to each file before deduping. For
   the source, we only check that it is opened for read. Target files have to
   be open for write.
 
 - don't dedupe on readonly submounts (this is to maintain 
 
 - check that we don't dedupe files with different checksumming states
  (compare BTRFS_INODE_NODATASUM flags)
 
 - get and maintain write access to the mount during the extent same
   operation (mount_want_write())
 
 - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
   pass them through for re-use on every call to btrfs_extent_same(). (thanks
   to David Sterba dste...@suse.cz for reporting this)
 
 - As the read buffers could possibly be up to 1MB (depending on user
   request), we now conditionally vmalloc them.
 
 - removed redundant check for same inode. btrfs_extent_same() catches it now
   and bubbles the error up.
 
 - remove some unnecessary printks
 
 Changes from RFC to v1:
 
 - don't error on large length value in btrfs extent-same, instead we just
   dedupe the maximum allowed.  That way userspace doesn't have to worry
   about an arbitrary length limit.
 
 - btrfs_extent_same will now loop over the dedupe range at 1MB increments (for
   a total of 16MB per request)
 
 - cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
   David Sterba dste...@suse.cz for reporting this)
 
 - included two fixes from Gabriel de Perthuis g2p.c...@gmail.com:
- allow dedupe across subvolumes
- don't lock compressed pages twice when deduplicating
 
 - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
   This should also give us a bit more reserved bytes.
 
 - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
   David Sterba dste...@suse.cz for reporting this)
 
 - Some more reserved bytes are now included as a result of some of my
   cleanups. Quite possibly we could add a couple more.
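
[The range-walking behavior described in the changelog above — clamp an
oversized request to the maximum instead of erroring, then process the range
in 1MB increments — can be sketched in user-space terms. The constants mirror
BTRFS_MAX_DEDUPE_LEN and BTRFS_ONE_DEDUPE_LEN from the patch; the
`dedupe_chunks` helper itself is our naming, not part of the kernel API.]

```python
# Sketch of the dedupe range walk: a request longer than 16MB is silently
# clamped rather than rejected, and the clamped range is handled in 1MB steps.

MAX_DEDUPE_LEN = 16 * 1024 * 1024   # mirrors BTRFS_MAX_DEDUPE_LEN
ONE_DEDUPE_LEN = 1 * 1024 * 1024    # mirrors BTRFS_ONE_DEDUPE_LEN

def dedupe_chunks(off, length):
    # Clamp instead of returning -EINVAL, so userspace need not care
    # about the arbitrary length limit.
    length = min(length, MAX_DEDUPE_LEN)
    chunks = []
    while length > 0:
        step = min(length, ONE_DEDUPE_LEN)
        chunks.append((off, step))
        off += step
        length -= step
    return chunks

# A 20MB request is clamped and yields 16 one-megabyte chunks:
print(len(dedupe_chunks(0, 20 * 1024 * 1024)))
```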
 



Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
Le 11/06/2013 23:04, Mark Fasheh a écrit :
 On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis wrote:
 What I found however is that neither of these is a great idea ;)

 - We want to require that the inode be open for writing so that an
   unprivileged user can't do things like run dedupe on a performance
   sensitive file that they might only have read access to.  In addition I
   could see it as kind of a surprise (non-standard behavior) to an
   administrator that users could alter the layout of files they are only
   allowed to read.

 - Readonly snapshots won't let you open for write anyway (unsurprisingly,
   open() returns -EROFS).  So that kind of kills the idea of them being able
   to open those files for write which we want to dedupe.

 That said, I still think being able to run this against a set of readonly
 snapshots makes sense especially if those snapshots are taken for backup
 purposes. I'm just not sure how we can sanely enable it.

 The check could be: if (fmode_write || cap_sys_admin).

 This isn't incompatible with mnt_want_write, that check is at the
 level of the superblocks and vfsmount and not the subvolume fsid.
 
 Oh ok that's certainly better. I think we still have a problem though - how
 does a process get write access to a file from a ro-snapshot? If I open a
 file (as root) on a ro-snapshot on my test machine here I'll get -EROFS.

Your first series did work in that case.
The process does get a read-only fd, but that's no obstacle for the ioctl.

 I'm a bit confused - how does mnt_want_write factor in here? I think that's
 for a totally seperate kind of accounting, right?

It doesn't, it's just that I had spent a few minutes checking anyway.




Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
 +#define BTRFS_MAX_DEDUPE_LEN   (16 * 1024 * 1024)
 +#define BTRFS_ONE_DEDUPE_LEN   (1 * 1024 * 1024)
 +
 +static long btrfs_ioctl_file_extent_same(struct file *file,
 +void __user *argp)
 +{
 +   struct btrfs_ioctl_same_args *args;
 +   struct btrfs_ioctl_same_args tmp;
 +   struct btrfs_ioctl_same_extent_info *info;
  +   struct inode *src = file->f_dentry->d_inode;
 +   struct file *dst_file = NULL;
 +   struct inode *dst;
 +   u64 off;
 +   u64 len;
 +   int args_size;
 +   int i;
 +   int ret;
  +   u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;

 The ioctl is available to non-root, so an extra care should be taken to
 potential overflows etc. I haven't spotted anything so far.
 
 
 Sure. Actually, you got me thinking about some sanity checks... I need to
 add at least this check:
 
   if (btrfs_root_readonly(root))
   return -EROFS;
 
 which isn't in there as of now.

It's not needed and I'd rather do without, read-only snapshots and 
deduplication go together well for backups.
Data and metadata are guaranteed to be immutable, extent storage isn't.  This 
is also the case with raid.


 Also I don't really check the open mode (read, write, etc) on files passed
 in. We do this in the clone ioctl and it makes sense there since data (to
 the user) can change. With this ioctl though data won't ever change (even if
 the underlying extent does). So I left the checks out. A part of me is
 thinking we might want to be conservative to start with though and just add
 those type of checks in. Basically, I figure the source should be open for
 read at least and target files need write access.

I don't know of any privileged files that one would be able to open(2), but if 
this is available to unprivileged users the files all need to be open for 
reading so that it can't be used to guess at their contents.
As long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't 
hurt my use case.



Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
Le sam. 25 mai 2013 00:38:27 CEST, Mark Fasheh a écrit :
 On Fri, May 24, 2013 at 09:50:14PM +0200, Gabriel de Perthuis wrote:
 Sure. Actually, you got me thinking about some sanity checks... I need to
 add at least this check:

 if (btrfs_root_readonly(root))
 return -EROFS;

 which isn't in there as of now.

 It's not needed and I'd rather do without, read-only snapshots and 
 deduplication go together well for backups.
 Data and metadata are guaranteed to be immutable, extent storage isn't.  
 This is also the case with raid.

 You're absolutely right - I miswrote the check I meant.
 Specifically, I was thinking about when the entire fs is readonly due to
 either some error or the user mounted with -oro. So something more like:

   if (root->fs_info->sb->s_flags & MS_RDONLY)
   return -EROFS;

 I think that should be reasonable and wouldn't affect most use cases,
 right?

That's all right.

 Also I don't really check the open mode (read, write, etc) on files passed
 in. We do this in the clone ioctl and it makes sense there since data (to
 the user) can change. With this ioctl though data won't ever change (even if
 the underlying extent does). So I left the checks out. A part of me is
 thinking we might want to be conservative to start with though and just add
 those type of checks in. Basically, I figure the source should be open for
 read at least and target files need write access.

 I don't know of any privileged files that one would be able to open(2),
 but if this is available to unprivileged users the files all need to be
 open for reading so that it can't be used to guess at their contents. As
 long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't
 hurt my use case.

 Oh ok so this seems to make sense. How does this logic sound:

 We're not going to worry about write access since it would be entirely
 reasonable for the user to want to do this on a readonly submount
 (specifically for the purpose of deduplicating backups).

 Read access needs to be provided however so we know that the user has access
 to the file data.

 So basically, if a user can open any files for read, they can check their
 contents and dedupe them.

 Letting users dedupe files in say, /etc seems kind of weird to me but I'm
 struggling to come up with a good explanation of why that should mean we
 limit this ioctl to root.
   --Mark

I agree with that model.  Most of the code is shared with clone (and the
copy_range RFC) which are unprivileged, so it doesn't increase the potential
surface for bugs much.


Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
On Fri, 17 May 2013 18:54:38 +0800, Anand Jain wrote:
 The idea was to introduce /dev/mapper to find for btrfs disk, 
 However I found first we need to congregate the disk scan
 procedure at a function so it would help to consistently tune
 it across the btrfs-progs. As of now both fi show and
 dev scan use the disks scan they do it on their own.
 
 So here it would congregate btrfs-disk scans at the function
 scan_devs_for_btrfs, adds /dev/mapper to be used to scan
 for btrfs, and updates its calling functions and few bug fixes.

Just scan /dev/block/*.  That contains all block devices.




Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
 Just scan /dev/block/*.  That contains all block devices.

Oh, this is about finding nicer names.  Never mind.




Re: subvol copying

2013-05-15 Thread Gabriel de Perthuis
 A user of a workstation has a home directory /home/john as a subvolume.  I 
 wrote a cron job to make read-only snapshots of it under /home/john/backup 
 which was fortunate as they just ran a script that did something like
 rm -rf ~.
 
 Apart from copying dozens of gigs of data back, is there a good way of 
 recovering it all?  Whatever you suggest isn't going to work for this time 
 (the copy is almost done) but will be useful for next time.
 
 Should I have put the backups under /backup instead so that I could just 
 delete the corrupted subvol and make a read-write snapshot of the last good 
 one?

You can move subvolumes at any time, as if they were regular directories.

For example: move the backups to an external location, move what's left
of the home to another location out of the way, and make a snapshot to
restore.




Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-09 Thread Gabriel de Perthuis
 We want this for btrfs_extent_same. Basically readpage and friends do their
 own extent locking but for the purposes of dedupe, we want to have both
 files locked down across a set of readpage operations (so that we can
 compare data). Introduce this variant and a flag which can be set for
 extent_read_full_page() to indicate that we are already locked.
 
 This one can get stuck in TASK_UNINTERRUPTIBLE:
 
 [32129.522257] SysRq : Show Blocked State
 [32129.524337]   taskPC stack   pid father
 [32129.526515] python  D 88021f394280 0 16281  1 
 0x0004
 [32129.528656]  88020e079a48 0082 88013d3cdd40 
 88020e079fd8
 [32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 
 88013d3cdd40
 [32129.533044]   1fff 88015286f440 
 0008
 [32129.535285] Call Trace:
 [32129.537522]  [816dcca9] schedule+0x29/0x70
 [32129.539829]  [a02b4908] wait_extent_bit+0xf8/0x150 [btrfs]
 [32129.542130]  [8107ea00] ? finish_wait+0x80/0x80
 [32129.544463]  [a02b4f84] lock_extent_bits+0x44/0xa0 [btrfs]
 [32129.546824]  [a02b4ff3] lock_extent+0x13/0x20 [btrfs]
 [32129.549198]  [a02dc0cf] add_ra_bio_pages.isra.8+0x17f/0x2d0 
 [btrfs]
 [32129.551602]  [a02dccfc] btrfs_submit_compressed_read+0x25c/0x4c0 
 [btrfs]
 [32129.554028]  [a029d131] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
 [32129.556457]  [a02b2d07] submit_one_bio+0x67/0xa0 [btrfs]
 [32129.558899]  [a02b7ecd] extent_read_full_page_nolock+0x4d/0x60 
 [btrfs]
 [32129.561290]  [a02c8052] fill_data+0xb2/0x230 [btrfs]
 [32129.563623]  [a02cd57e] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
 [32129.565924]  [816ddbae] ? _raw_spin_lock+0xe/0x20
 [32129.568207]  [8119b907] ? inode_get_bytes+0x47/0x60
 [32129.570472]  [811a8297] do_vfs_ioctl+0x97/0x560
 [32129.572700]  [8119bb5a] ? sys_newfstat+0x2a/0x40
 [32129.574882]  [811a87f1] sys_ioctl+0x91/0xb0
 [32129.577008]  [816e64dd] system_call_fastpath+0x1a/0x1f

For anyone trying those patches, there's a fix here:
https://github.com/g2p/linux/tree/v3.9%2Bbtrfs-extent-same



Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-07 Thread Gabriel de Perthuis
 We want this for btrfs_extent_same. Basically readpage and friends do their
 own extent locking but for the purposes of dedupe, we want to have both
 files locked down across a set of readpage operations (so that we can
 compare data). Introduce this variant and a flag which can be set for
 extent_read_full_page() to indicate that we are already locked.

This one can get stuck in TASK_UNINTERRUPTIBLE:

[32129.522257] SysRq : Show Blocked State
[32129.524337]   taskPC stack   pid father
[32129.526515] python  D 88021f394280 0 16281  1 0x0004
[32129.528656]  88020e079a48 0082 88013d3cdd40 
88020e079fd8
[32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 
88013d3cdd40
[32129.533044]   1fff 88015286f440 
0008
[32129.535285] Call Trace:
[32129.537522]  [816dcca9] schedule+0x29/0x70
[32129.539829]  [a02b4908] wait_extent_bit+0xf8/0x150 [btrfs]
[32129.542130]  [8107ea00] ? finish_wait+0x80/0x80
[32129.544463]  [a02b4f84] lock_extent_bits+0x44/0xa0 [btrfs]
[32129.546824]  [a02b4ff3] lock_extent+0x13/0x20 [btrfs]
[32129.549198]  [a02dc0cf] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
[32129.551602]  [a02dccfc] btrfs_submit_compressed_read+0x25c/0x4c0 
[btrfs]
[32129.554028]  [a029d131] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
[32129.556457]  [a02b2d07] submit_one_bio+0x67/0xa0 [btrfs]
[32129.558899]  [a02b7ecd] extent_read_full_page_nolock+0x4d/0x60 
[btrfs]
[32129.561290]  [a02c8052] fill_data+0xb2/0x230 [btrfs]
[32129.563623]  [a02cd57e] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
[32129.565924]  [816ddbae] ? _raw_spin_lock+0xe/0x20
[32129.568207]  [8119b907] ? inode_get_bytes+0x47/0x60
[32129.570472]  [811a8297] do_vfs_ioctl+0x97/0x560
[32129.572700]  [8119bb5a] ? sys_newfstat+0x2a/0x40
[32129.574882]  [811a87f1] sys_ioctl+0x91/0xb0
[32129.577008]  [816e64dd] system_call_fastpath+0x1a/0x1f

Side note, I wish btrfs used TASK_KILLABLE[1] instead.

[1]: https://lwn.net/Articles/288056/



Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
 How will it compare to bcache? I'm currently thinking about buying an SSD 
 but bcache requires some efforts in migrating the storage to use. And after 
 all those hassles I am even not sure if it would work easily with a dracut 
 generated initramfs.

   On a side note: dm-cache, which is already in-kernel, does not need to
 reformat the backing storage.

On the other hand dm-cache is somewhat complex to assemble, and letting
the system automount the unsynchronised backing device is a recipe for
data loss.

It will need lvm integration to become really convenient to use.

Anyway, here's a shameless plug for a tool that converts to bcache
in-place:  https://github.com/g2p/blocks#bcache-conversion



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
 Do you plan to support deduplication on a finer grained basis than file 
 level? As an example, in the end it could be interesting to deduplicate 1M 
 blocks of huge files. Backups of VM images come to my mind as a good 
 candidate. While my current backup script[1] takes care of this by using 
 rsync --inplace it won't consider files moved between two backup cycles. 
 This is the main purpose I'm using bedup for on my backup drive.
 
 Maybe you could define another cutoff value to consider huge files for 
 block-level deduplication?

I'm considering deduplicating aligned blocks of large files sharing the
same size (VMs with the same baseline.  Those would ideally come
pre-cowed, but rsync or scp could have broken that).

It sounds simple, and was sort-of prompted by the new syscall taking
short ranges, but it is tricky figuring out a sane heuristic (when to
hash, when to bail, when to submit without comparing, what should be the
source in the last case), and it's not something I have an immediate
need for.  It is also possible to use 9p (with standard cow and/or
small-file dedup) and trade a bit of configuration for much more
space-efficient VMs.

Finer-grained tracking of which ranges have changed, and maybe some
caching of range hashes, would be a good first step before doing any
crazy large-file heuristics.  The hash caching would actually benefit
all use cases.

 Regards,
 Kai
 
 [1]: https://gist.github.com/kakra/5520370




Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
On Tue, 07 May 2013 23:58:08 +0200, Kai Krakow wrote:
 Gabriel de Perthuis g2p.c...@gmail.com schrieb:
    On a side note: dm-cache, which is already in-kernel, does not need to
  reformat the backing storage.
 
 On the other hand dm-cache is somewhat complex to assemble, and letting
 the system automount the unsynchronised backing device is a recipe for
 data loss.
 
 Yes, that was my first impression, too, after reading of how it works. How 
 safe is bcache on that matter?

The bcache superblock is there just to prevent the naked backing device
from becoming available.  So it's safe in that respect.  LVM has
something similar with hidden volumes.

 Anyway, here's a shameless plug for a tool that converts to bcache
 in-place:  https://github.com/g2p/blocks#bcache-conversion
 
 Did I say: I love your shameless plugs? ;-)
 
 I've read the docs for this tool with interest. Still I do not feel very 
 comfortable with converting my storage for some unknown outcome. Sure, I can 
 take backups (and by any means: I will). But it takes time: backup, try, 
 restore, try again, maybe restore... I don't want to find out that it was 
 all useless because it's just not ready to boot a multi-device btrfs through 
 dracut. So you see, the point is: Will that work? I didn't see any docs 
 answering my questions.

Try it with a throwaway filesystem inside a VM.  The bcache list will
appreciate the feedback on Dracut, even if you don't make the switch
for real.

 Of course, if it would work I'd happily contribute documentation to your 
 project.

That would be very welcome.




Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
 Gabriel de Perthuis g2p.c...@gmail.com schrieb:
 It sounds simple, and was sort-of prompted by the new syscall taking
 short ranges, but it is tricky figuring out a sane heuristic (when to
 hash, when to bail, when to submit without comparing, what should be the
 source in the last case), and it's not something I have an immediate
 need for.  It is also possible to use 9p (with standard cow and/or
 small-file dedup) and trade a bit of configuration for much more
 space-efficient VMs.
 
 Finer-grained tracking of which ranges have changed, and maybe some
 caching of range hashes, would be a good first step before doing any
 crazy large-file heuristics.  The hash caching would actually benefit
 all use cases.
 
 Looking back to good old peer-2-peer days (I think we all got in touch with 
 that in one way or another), one title pops back into my mind: 
 tiger-tree-hash...
 
 I'm not really into it, but would it be possible to use tiger-tree-hashes to 
 find identical blocks? Even across different-sized files...

Possible, but bedup is all about doing as little io as it can get away
with, doing streaming reads only when it has sampled that the files are
likely duplicates and not spending a ton of disk space for indexing.

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.
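
[The "as little io as possible" approach described above can be sketched
roughly: files are only fully hashed after cheap sampling at a few fixed
offsets fails to prove them different. The probe offsets, helper names and
parameters here are our own illustrative choices, not bedup's actual
implementation.]

```python
# Cheap-first duplicate detection sketch: probe a few fixed offsets before
# committing to a full streaming read.  A mismatch at any probe point proves
# the files are not identical; only surviving candidates get fully hashed.

import hashlib

def sample(path, size, probes=3, probe_len=4096):
    offsets = [size * i // probes for i in range(probes)]
    chunks = []
    with open(path, 'rb') as f:
        for off in offsets:
            f.seek(off)
            chunks.append(f.read(probe_len))
    return chunks

def likely_duplicates(path_a, path_b, size):
    # Files of equal size whose probes all match are worth a full compare.
    return sample(path_a, size) == sample(path_b, size)

def full_hash(path):
    # Streaming hash, done only for files that passed the sampling step.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()
```

[Only files that share a size group and pass `likely_duplicates` are streamed
in full, which keeps the common case to a handful of small reads.]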




[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occurring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 95d46cc..b3f0276 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1797,23 +1797,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
 	for (i = slot; i < nritems; i++) {
 		item_off = btrfs_item_ptr_offset(leaf, i);
 		item_len = btrfs_item_size_nr(leaf, i);
 
-		if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+		btrfs_item_key_to_cpu(leaf, key, i);
+		if (!key_in_sk(key, sk))
+			continue;
+
+		if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
 			item_len = 0;
 
 		if (sizeof(sh) + item_len + *sk_offset >
 		    BTRFS_SEARCH_ARGS_BUFSIZE) {
 			ret = 1;
 			goto overflow;
 		}
 
-		btrfs_item_key_to_cpu(leaf, key, i);
-		if (!key_in_sk(key, sk))
-			continue;
-
 		sh.objectid = key->objectid;
 		sh.offset = key->offset;
 		sh.type = key->type;
 		sh.len = item_len;
 		sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6
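
[The bug this patch fixes can be modeled in user space. In the sketch below,
items are (key_matches, item_len) pairs, SH_SIZE stands in for
sizeof(struct btrfs_ioctl_search_header), and the `search` helper and its
`fixed` flag are our own names. The old code checked for buffer overflow
before checking whether the key was in the search range, so one oversized
non-matching inline item aborted the whole search; the fixed ordering skips
out-of-range keys first.]

```python
# User-space model of the copy_to_sk() change: same loop, two orderings.

BUFSIZE = 100   # stands in for BTRFS_SEARCH_ARGS_BUFSIZE
SH_SIZE = 10    # stands in for sizeof(struct btrfs_ioctl_search_header)

def search(items, fixed):
    found, offset = [], 0
    for matches, item_len in items:
        if fixed:
            # Fixed ordering: skip keys outside the search range first.
            if not matches:
                continue
            if SH_SIZE + item_len > BUFSIZE:
                item_len = 0
        else:
            if item_len > BUFSIZE:
                item_len = 0
        if SH_SIZE + item_len + offset > BUFSIZE:
            return found  # ret = 1: overflow, the search stops here
        if not fixed:
            # Old ordering: range check happens after the overflow check.
            if not matches:
                continue
        found.append(item_len)
        offset += SH_SIZE + item_len
    return found

# A 95-byte non-matching inline item sits before a matching 20-byte one:
items = [(False, 95), (True, 20)]
print(search(items, fixed=False))  # old code: [] -- search aborted early
print(search(items, fixed=True))   # fixed code: [20]
```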



[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occurring before any search result is
found would trigger an overflow and stop the search entirely.

Cc: sta...@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
(resent, with the correct header to have stable copied)

 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c02310..f49b62f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1794,23 +1794,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
 	for (i = slot; i < nritems; i++) {
 		item_off = btrfs_item_ptr_offset(leaf, i);
 		item_len = btrfs_item_size_nr(leaf, i);
 
-		if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+		btrfs_item_key_to_cpu(leaf, key, i);
+		if (!key_in_sk(key, sk))
+			continue;
+
+		if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
 			item_len = 0;
 
 		if (sizeof(sh) + item_len + *sk_offset >
 		    BTRFS_SEARCH_ARGS_BUFSIZE) {
 			ret = 1;
 			goto overflow;
 		}
 
-		btrfs_item_key_to_cpu(leaf, key, i);
-		if (!key_in_sk(key, sk))
-			continue;
-
 		sh.objectid = key->objectid;
 		sh.offset = key->offset;
 		sh.type = key->type;
 		sh.len = item_len;
 		sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-05 Thread Gabriel de Perthuis
On Sun, 05 May 2013 12:07:17 +0200, Kai Krakow wrote:
 Hey list,
 
 I wonder if it is possible to deduplicate read-only snapshots.
 
 Background:
 
 I'm using an bash/rsync script[1] to backup my whole system on a nightly 
 basis to an attached USB3 drive into a scratch area, then take a snapshot of 
 this area. I'd like to have these snapshots immutable, so they should be 
 read-only.
 
 Since rsync won't discover moved files but instead place a new copy of that 
 in the backup, I'm running the wonderful bedup application[2] to deduplicate 
 my backup drive from time to time and it almost always gains back a good 
 pile of gigabytes. The rest of storage space issues is taken care of by 
 using rsync's inplace option (although this won't cover the case of files 
 moved and changed between backup runs) and using compress-force=gzip.

 I've read about ongoing work to integrate offline (and even online) 
 deduplication into the kernel so that this process can be made atomic (and 
 even block-based instead of file-based). This would - to my understandings - 
 result in the immutable attribute no longer needed. So, given the fact above 
 and for the case read-only snapshots cannot be used for this application 
 currently, will these patches address the problem and read-only snapshots 
 could be deduplicated? Or are read-only snapshots meant to be what the name 
 suggests: Immutable, even for deduplication?

There's no deep reason read-only snapshots should keep their storage
immutable, they can be affected by raid rebalancing for example.

The current bedup restriction comes from the clone call; Mark Fasheh's
dedup ioctl[3] appears to be fine with snapshots.  The bedup integration
(in a branch) is a work in progress at the moment.  I need to fix a scan
bug, tweak parameters for the latest kernel dedup patch, remove a lot of
logic that is now unnecessary, and figure out the compatibility story.

 Regards,
 Kai
 
 [1]: https://gist.github.com/kakra/5520370
 [2]: https://github.com/g2p/bedup

[3]: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25062




Re: Best Practice - Partition, or not?

2013-05-01 Thread Gabriel de Perthuis
 Hello
 
 If I want to manage a complete disk with btrfs, what's the Best
 Practice? Would it be best to create the btrfs filesystem on
 /dev/sdb, or would it be better to create just one partition from
 start to end and then do mkfs.btrfs /dev/sdb1?

Partitions (GPT) are always more flexible and future-proof.  If you ever
need to shrink the btrfs filesystem and give the space to another
partition, or do a conversion to lvm/bcache/luks (shameless plug:
https://github.com/g2p/blocks ), it'd be stupid to be locked into your
current setup for want of a few megabytes of space before your
filesystem.

 Would the same recommendation hold true if we're talking about huge
 disks, like 4TB or so?

More so, since it can be infeasible to move this much data.




Re: [PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Gabriel de Perthuis
  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
   struct btrfs_ioctl_dev_replace_args)
 +#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)

This number has already been used by the offline dedup patches.



Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-22 Thread Gabriel de Perthuis

On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few tweaks
to suggest to the argument checking.


Awesome, thanks for looking this over!


At first the 1M limitation on the length was a bit inconvenient, but making
repeated calls in userspace is okay and allows for better error recovery
(for example, repeating the calls when only a portion of the ranges is
identical).  The destination file appears to be fragmented into 1M extents,
but these are contiguous so it's not really a problem.


Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


A cursor-style API could work here: make the offset and length 
parameters in/out, exit early in case of error or after the read quota 
has been used up.


The caller can retry as long as the length is nonzero (and at least one 
block), and the syscall will return frequently enough that it won't 
block an unmount operation or concurrent access to the ranges.



Requiring the offset or the length to align is spurious however; it doesn't
translate to any alignment in the extent tree (especially with
compression).  Requiring a minimum length of a few blocks or dropping the
alignment condition entirely would make more sense.


I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


I've had a look, and btrfs fiemap only sets FIEMAP_EXTENT_NOT_ALIGNED 
for inline extents, so the alignment requirement makes sense.  The 
caller should do the alignment and decide if it wants to extend a bit 
and accept a not-same status or shrink a bit, so just keep it as is and 
maybe add explanatory comments.



I notice there is a restriction on cross-subvolume deduplication. Hopefully
it can be lifted like it was done in 3.6 for the clone ioctl.


Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


Good to have this code shared (compression support is another win). 
bedup will need feature parity to switch from one ioctl to the other.



Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an
experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in the
bedup readme and some unnecessary subtleties in the code.


Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


See ranges_same and dedup_fileset.  The ImmutableFDs stuff can be 
removed and the fact that dedup can be partially successful over a range 
will ripple through.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is
no need for the input to fall on extent boundaries.

Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-20 Thread Gabriel de Perthuis

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few 
tweaks to suggest to the argument checking.


At first the 1M limitation on the length was a bit inconvenient, but 
making repeated calls in userspace is okay and allows for better error 
recovery (for example, repeating the calls when only a portion of the 
ranges is identical).  The destination file appears to be fragmented 
into 1M extents, but these are contiguous so it's not really a problem.


Requiring the offset or the length to align is spurious however; it 
doesn't translate to any alignment in the extent tree (especially with 
compression).  Requiring a minimum length of a few blocks or dropping 
the alignment condition entirely would make more sense.


I notice there is a restriction on cross-subvolume deduplication. 
Hopefully it can be lifted like it was done in 3.6 for the clone ioctl.


Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an 
experimental branch:


  https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in 
the bedup readme and some unnecessary subtleties in the code.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is 
no need for the input to fall on extent boundaries.



__u16 files_deduped;/* out - number of files that got deduped */
__u32 reserved;
struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has its own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
   return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS 1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
__s64 fd;   /* in - destination file */
__u64 logical_offset;   /* in - start of extent in destination */
__u64 bytes_deduped;/* out - total # of bytes we were able
 * to dedupe from this file */
/* status of this dedupe operation:
 * 0 if dedup succeeds
 * < 0 for error
 * == BTRFS_SAME_DATA_DIFFERS if data differs
 */
__s32 status;   /* out - see above description */
__u32 reserved;
};




Re: Problem with building instructions for btrfs-tools in https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories

2013-03-20 Thread Gabriel de Perthuis
 There is a missing dependency: liblzo2-dev

 I suggest amending the wiki to add liblzo2-dev to the 
 apt-get line for Ubuntu/Debian.

Added. Other distros may need some additions too.

Anyone can edit the wiki, as the spambots will attest; a ConfirmEdit 
captcha at signup would be nice.




Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
Hello,
I have a filesystem that has become unusable because of a balance I can't 
stop. It is very close to full, and the balance is preventing me from 
growing it.

It was started like this:
sudo btrfs filesystem balance start -v -musage=60 -dusage=60 /srv/backups

It has been stuck at 0% across reboots and kernel upgrades (currently on 
3.8.1), and cancelling it had no effect:

Balance on '/srv/backups' is running
0 out of about 5 chunks balanced (95 considered), 100% left

According to atop it is writing but not reading anything.
Unmounting never terminates, and neither does remounting read-only; the only 
way to temporarily kill it is to reboot. SIGKILL has no effect either. Is there 
*any* way I can get rid of it?



Re: Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
On Sat, 02 Mar 2013 17:12:37 +0600, Roman Mamedov wrote:
 Mount with the skip_balance option
 https://btrfs.wiki.kernel.org/index.php/Mount_options then you can issue
 btrfs fi balance cancel and it will succeed.

Excellent, thank you.
I had just thought of doing the same thing with ro and it worked.



Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot

2013-01-31 Thread Gabriel
Hi,

 After mounting the system with noatime the problem disappeared, like in 
 magic.

Incidentally, the current version of bedup uses a private mountpoint with 
noatime whenever you don't give it the path to a mounted volume.  You can 
use it with no arguments or designate a filesystem by its uuid or /dev 
path.

 All the writes must have come from the delayed metadata copy process. 
 Once all the metadata copy-update was done, file system speed was back 
 to normal, but once the new day broke, all the copying business 
 needed to be done again... This describes 100% of the odd behavior.
 
 In particular apparently the problem had nothing to do with my complex 
 block device setup, nor with bedup, nor with unison.
 
 Thank you again, Andrew!
 
 P.S. Maybe it is not for me to decide, but this small message about 
 performance (not even labeled as a warning) in 
 https://btrfs.wiki.kernel.org/index.php/Mount_options IMHO should have 
 been made more conspicuous, maybe put where the snapshot 
 mechanism is described or in the FAQ. I'll try to fix it.



Corruption at start of files

2012-11-05 Thread Gabriel
Here is what I see in my kern.log (see below).

For me this first happened when the filesystem was close to full (less
than 1GB left), but someone on the irc channel mentioned a similar
problem on suspend to ram.

The files that have checksum failures end up with their first 4k filled
with 0x01 bytes. They were seeing a lot of writes; things like firefox
session data and cookie data, plus files that disappeared before I
could call inode-resolve on them.

I was running 3.6.3 when this happened; I've upgraded to -rcs since
but I haven't tried to reproduce the bug deliberately. I didn't see
relevant changes in the changelog.

Oct 31 17:06:31 moulinex kernel: [93539.008465] BTRFS warning (device dm-16): 
Aborting unused transaction.
Oct 31 17:06:31 moulinex kernel: [93539.011257] BTRFS warning (device dm-16): 
Aborting unused transaction.
Oct 31 17:06:31 moulinex kernel: [93539.017137] BTRFS warning (device dm-16): 
Aborting unused transaction.
Oct 31 17:06:46 moulinex kernel: [93554.728793] use_block_rsv: 16 callbacks 
suppressed
Oct 31 17:06:46 moulinex kernel: [93554.728795] btrfs: block rsv returned -28
Oct 31 17:06:46 moulinex kernel: [93554.728796] [ cut here 
]
Oct 31 17:06:46 moulinex kernel: [93554.728818] WARNING: at 
/home/apw/COD/linux/fs/btrfs/extent-tree.c:6323 use_block_rsv+0x19f/0x1b0 
[btrfs]()
Oct 31 17:06:46 moulinex kernel: [93554.728819] Hardware name: System Product 
Name
Oct 31 17:06:46 moulinex kernel: [93554.728820] Modules linked in: 
snd_seq_dummy vhost_net macvtap macvlan xt_recent bnep rfcomm bluetooth 
snd_hrtimer nls_utf8 sch_fq_codel ebtable_nat ebtables xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE iptable_nat bridge stp llc ppdev lp parport 
deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common 
camellia_generic camellia_x86_64 serpent_sse2_x86_64 glue_helper lrw 
serpent_generic xts gf128mul blowfish_generic blowfish_x86_64 blowfish_common 
cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo 
binfmt_misc dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek eeepc_wmi 
asus_wmi sparse_keymap coretemp kvm_intel kvm dm_multipath scsi_dh microcode 
arc4 joydev snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rt61pci rt2x00pci 
rt2x00lib snd_seq_midi snd_rawmidi mac80211 snd_seq_midi_event snd_seq 
snd_timer snd_seq_device snd cfg80211 soundcore snd_page_alloc eeprom_93cx6 
serio_raw lpc_ich mei mac_hid k8temp hwmon_vid i2c_nforce2 firewire_sbp2 
firewire_core crc_itu_t psmouse ip6t_REJECT xt_hl 
ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_multiport xt_limit 
xt_tcpudp xt_addrtype xt_state ip6table_filter ip6_tables 
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter 
ip_tables x_tables btrfs zlib_deflate libcrc32c raid10 raid0 multipath linear 
raid456 async_pq async_xor xor async_memcpy async_raid6_recov hid_generic 
raid6_pq async_tx hid_cherry usbhid hid raid1 ghash_clmulni_intel sata_via wmi 
aesni_intel ablk_helper cryptd aes_x86_64 r8169 i915 drm_kms_helper drm 
i2c_algo_bit video [last unloaded: ipmi_msghandler]
Oct 31 17:06:46 moulinex kernel: [93554.728873] Pid: 2230, comm: 
btrfs-endio-wri Tainted: GW3.6.3-030603-generic #201210211349
Oct 31 17:06:46 moulinex kernel: [93554.728874] Call Trace:
Oct 31 17:06:46 moulinex kernel: [93554.728880]  [81056f6f] 
warn_slowpath_common+0x7f/0xc0
Oct 31 17:06:46 moulinex kernel: [93554.728882]  [81056fca] 
warn_slowpath_null+0x1a/0x20
Oct 31 17:06:46 moulinex kernel: [93554.728889]  [a01feedf] 
use_block_rsv+0x19f/0x1b0 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728897]  [a020260d] 
btrfs_alloc_free_block+0x3d/0x220 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728904]  [a01ef38d] ? 
balance_level+0xcd/0x890 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728906]  [81332e10] ? 
rb_insert_color+0x110/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728916]  [a022f16c] ? 
read_extent_buffer+0xbc/0x120 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728918]  [81178ebd] ? 
kmem_cache_alloc_trace+0x12d/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728925]  [a01ee3b2] 
__btrfs_cow_block+0x122/0x4f0 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728927]  [81136892] ? 
set_page_dirty+0x62/0x70
Oct 31 17:06:46 moulinex kernel: [93554.728930]  [8169f37e] ? 
_raw_spin_lock+0xe/0x20
Oct 31 17:06:46 moulinex kernel: [93554.728936]  [a01ee87c] 
btrfs_cow_block+0xfc/0x220 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728943]  [a01f29f8] 
btrfs_search_slot+0x368/0x740 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728951]  [a0206e84] 
btrfs_lookup_csum+0x74/0x190 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728953]  [81179cfc] ? 
kmem_cache_alloc+0x11c/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728960]  

Re: [PATCH][BTRFS-PROGS] Enhance btrfs fi df

2012-11-02 Thread Gabriel
On Fri, 02 Nov 2012 13:02:32 +0100, Goffredo Baroncelli wrote:
 On 2012-11-02 12:18, Martin Steigerwald wrote:
 Metadata, DUP is displayed as 3,50GB on the device level and as 1,75GB
 in total. I understand the logic behind this, but this could be a bit
 confusing.
 
 But it makes sense: Showing real allocation on device level makes
 sense,
 cause thats what really allocated on disk. Total makes some sense,
 cause thats what is being used from the tree by BTRFS.
 
 Yes, me too. At first I was confused when you noticed this 
 discrepancy. So I have to admit that it is not so obvious to understand. 
 However we didn't find any way to make it clearer... 
 
 It still looks confusing at first… 
 We could use Chunk(s) capacity instead of total/size? I would like an 
 opinion from an English speaker's point of view.

This is easy to fix, here's a mockup:

Metadata,DUP: Size: 1.75GB ×2, Used: 627.84MB ×2
   /dev/dm-0    3.50GB

   Data   Metadata MetadataSystem System  
   Single Single   DUP Single DUP Unallocated
   
/dev/dm-16 1.31TB   8.00MB  56.00GB4.00MB  16.00MB   0.00
   ==  === == === ===
Total  1.31TB   8.00MB  28.00GB ×2 4.00MB   8.00MB ×20.00
Used   1.31TB 0.00   5.65GB ×2   0.00 152.00KB ×2

Also, I don't know if you could use libblkid, but it finds more 
descriptive names than dm-NN (thanks to some smart sorting logic).




Re: [PATCH][BTRFS-PROGS] Enhance btrfs fi df

2012-11-02 Thread Gabriel
On Fri, 02 Nov 2012 20:31:56 +0100, Goffredo Baroncelli wrote:

 On 11/02/2012 08:05 PM, Gabriel wrote:
 On Fri, 02 Nov 2012 13:02:32 +0100, Goffredo Baroncelli wrote:
 On 2012-11-02 12:18, Martin Steigerwald wrote:
 Metadata, DUP is displayed as 3,50GB on the device level and as
 1,75GB in total. I understand the logic behind this, but this could
 be a bit confusing.

 But it makes sense: Showing real allocation on device level makes
 sense,
 cause thats what really allocated on disk. Total makes some sense,
 cause thats what is being used from the tree by BTRFS.

 Yes, me too. At the first I was confused when you noticed this
 discrepancy. So I have to admit that it is not so obvious to
 understand.
 However we didn't find any way to make it more clear...

 It still looks confusing at first…
 We could use Chunk(s) capacity instead of total/size ? I would like
 an opinion from a english people point of view..
 
 This is easy to fix, here's a mockup:
 
 Metadata,DUP: Size: 1.75GB ×2, Used: 627.84MB ×2
   /dev/dm-0    3.50GB
 
Data   Metadata MetadataSystem System Single Single  
DUP Single DUP Unallocated

 /dev/dm-16 1.31TB   8.00MB  56.00GB4.00MB  16.00MB   0.00
==  === == === ===
 Total  1.31TB   8.00MB  28.00GB ×2 4.00MB   8.00MB ×20.00
 Used   1.31TB 0.00   5.65GB ×2   0.00 152.00KB ×2
 
 Nice idea. Even though I like the opposite:
 
 
Data   Metadata MetadataSystem System Single Single   DUP
Single DUP Unallocated
 
 /dev/dm-16 1.31TB   8.00MB  28.00GB x2 4.00MB   8.00MB x20.00
==  === == === ===
 Total  1.31TB   8.00MB  28.00GB4.00MB   8.00MB   0.00
 Used   1.31TB 0.00   5.65GB  0.00 152.00KB
 
 
 However, how will your solution look when RAID5/RAID6 arrives? Hmm, 
 maybe the solution is simpler: the x2 factor is applied only to the DUP 
 profile. The other profiles span different disks.

That problem solved itself :)

 As another option, we can add a field/line which reports the RAID
 factor:
 
 Metadata,DUP: Size: 1.75GB, Used: 627.84MB, Raid factor: 2x
   /dev/dm-0    3.50GB
 
 
 Data   Metadata Metadata   System System Single Single   DUP
Single DUPUnallocated
 
 /dev/dm-16  1.31TB   8.00MB  56.00GB 4.00MB  16.00MB0.00
 ==   ==  ===
 Raid factor  --   x2  -   x2   -
 Total   1.31TB   8.00MB  28.00GB 4.00MB   8.00MB   0.00
 Used    1.31TB     0.00   5.65GB   0.00 152.00KB

All fine options. Though if you remove the ×2 on the totals line,
you should compute it instead (it looks like a tally, both sides
of the == line should be equal).

Now that I've started bikeshedding, here is something that I would
find pretty much ideal:

DataMetadata   System Unallocated   
   

VolGroup/Btrfs
  Reserved   1.31TB 8.00MB + 2×28.00MB 16.00MB + 2×4.00MB   -
  Used   1.31TB  2× 5.65GB 2×152.00KB   -
=== == == ===
Total
  Reserved   1.31TB56.00GB24.00MB   -
  Used   1.31TB11.30GB   304.00KB   -
  Free  12.34GB44.70GB23.70MB   -



 Also, I don't know if you could use libblkid, but it finds more
 descriptive names than dm-NN (thanks to some smart sorting logic).
 
 I don't think it would be impossible to use libblkid; however, 
 it would be difficult to find space for longer device names

I suggest cutting out the /dev and putting a line break after the
name. The extra info makes it more human-friendly, and the line
break may complicate machine parsing but the non-tabular format is
better at that anyway.



Re: [PATCH][BTRFS-PROGS] Enhance btrfs fi df

2012-11-02 Thread Gabriel
On Fri, 02 Nov 2012 22:06:04 +, Hugo Mills wrote:

 On Fri, Nov 02, 2012 at 07:05:37PM +, Gabriel wrote:
 On Fri, 02 Nov 2012 13:02:32 +0100, Goffredo Baroncelli wrote:
  On 2012-11-02 12:18, Martin Steigerwald wrote:
  Metadata, DUP is displayed as 3,50GB on the device level and as 1,75GB
  in total. I understand the logic behind this, but this could be a bit
  confusing.
  
  But it makes sense: Showing real allocation on device level makes
  sense,
  cause thats what really allocated on disk. Total makes some sense,
  cause thats what is being used from the tree by BTRFS.
  
  Yes, me too. At the first I was confused when you noticed this
  discrepancy. So I have to admit that it is not so obvious to understand.
  However we didn't find any way to make it more clear...
  
  It still looks confusing at first…
  We could use Chunk(s) capacity instead of total/size ? I would like an
  opinion from a english people point of view..
 
 This is easy to fix, here's a mockup:
 
 Metadata,DUP: Size: 1.75GB ×2, Used: 627.84MB ×2
   /dev/dm-0    3.50GB
 
I've not considered the full semantics of all this yet -- I'll try
 to do that tomorrow. However, I note that the ×2 here could become
 non-integer with the RAID-5/6 code (which is due Real Soon Now). In
 the first RAID-5/6 code drop, it won't even be simple to calculate
 where there are different-sized devices in the filesystem. Putting an
 exact figure on that number is potentially going to be awkward. I
 think we're going to need kernel help for working out what that number
 should be, in the general case.

DUP can be nested below a device because it represents same-device
redundancy (purpose: survive smudges but not device failure).

On the other hand raid levels should occupy the same space on all
linked devices (a necessary consequence of the guarantee that RAID5
can survive the loss of any device and RAID6 any two devices).

The two probably won't need to be represented at the same time
except during a reshape, because I imagine DUP gets converted to
RAID (1 or 5) as soon as the second device is added.

A 1→2 reshape would look a bit like this (doing only the data column
and skipping totals):

InitialDevice
  Reserved   1.21TB
  Used   1.21TB
RAID1(InitialDevice, SecondDevice)
  Reserved   1.31TB + 100GB
  Used 2× 100GB

RAID5, RAID6: same with fractions, n+1⁄n and n+2⁄n.

Again, I'm raising minor points based on future capabilities, but I
 feel it's worth considering them at this stage, even if the correct
 answer is yes, we'll do this now, and deal with any other problems
 later.
 
Hugo.
 
Data   Metadata MetadataSystem System  
Single Single   DUP Single DUP Unallocated

 /dev/dm-16 1.31TB   8.00MB  56.00GB4.00MB  16.00MB   0.00
==  === == === ===
 Total  1.31TB   8.00MB  28.00GB ×2 4.00MB   8.00MB ×20.00
 Used   1.31TB 0.00   5.65GB ×2   0.00 152.00KB ×2
 
 Also, I don't know if you could use libblkid, but it finds more 
 descriptive names than dm-NN (thanks to some smart sorting logic).
 





Re: [PATCH][BTRFS-PROGS] Enhance btrfs fi df

2012-11-02 Thread Gabriel
On Fri, 02 Nov 2012 21:46:35 +, Michael Kjörling wrote:
 On 2 Nov 2012 20:40 +, from g2p.c...@gmail.com (Gabriel):
 Now that I've started bikeshedding, here is something that I would
 find pretty much ideal:
 
 DataMetadata   System Unallocated
   
 
 VolGroup/Btrfs
   Reserved   1.31TB 8.00MB + 2×28.00GB 16.00MB + 2×4.00MB   -
   Used   1.31TB  2× 5.65GB 2×152.00KB   
 === == == ===
 Total
   Reserved   1.31TB56.00GB24.00MB   -
   Used   1.31TB11.30GB   304.00KB   
   Free  12.34GB44.70GB23.70MB   -
 
 If we can take such liberties, then why bother with the 2× at all?

It does save a line.

 Also, I think the B can go, since it's implied by talking about
 storage capacities. A lot of tools do this already; look at GNU df -h
 and ls -lh for just two examples. That gives you a few extra columns
 which can be used to make the table column spacing a little bigger even
 in an 80-column terminal.

Good idea.

 I'm _guessing_ that you meant for metadata reserved to be 2 × 28 GB and
 not 2 × 28 MB, because otherwise the numbers really don't add up.

Feh, that's just a typo from when I swapped the 8.00M to the left.

   DataMetadataSystemUnallocated
 
 VolGroup/Btrfs
   Reserved  1.31T  8.00M + 28.00G  16.00M +   4.00M-
ResRedun -  28.00G 4.00M-
   Used  1.31T   5.65G   152.00K-
UseRedun -   5.65G   152.00K-
   ===  ==    ===
 Total
   Reserved  1.31T  56.01G24.00M-
   Used  1.31T  11.30G   304.00K-
   Free 12.34G  44.71G23.70M-
 
 This way, the numbers should add up nicely. (Redun for redundancy or
 something like that.) 8M + 28G + 28G = 56.01G, 5.65G + 5.65G = 11.30G,
 56.01G - 11.30G = 44.71G. I'm not sure you couldn't even work 8.00M +
 28.00G into a single 28.01G entry at Reserved/Metadata, with
 ResRedun/Metadata 28.00G. That would require some care when the units
 are different enough that the difference doesn't show up in the numbers,
 though, since then there is nothing to indicate that parts of the
 metadata is not stored in a redundant fashion.

I tried to work out DUP vs RAID redundancy in my message to Hugo.

 If some redundancy scheme (RAID 5?) uses an oddball factor, that can
 still easily be expressed in a view like the above simply by displaying
 the user data and redundancy data separately, in exactly the same way.
 
 And personally, I feel that a summary view like this, for Data, if an
 exact number cannot be calculated, should display the _minimum amount of
 available free space_, with free space being _usable by user files_.
 If I start copying a 12.0GB file onto the file system exemplified above,
 I most assuredly _don't_ want to get a report of device full after 10
 GB! (You mating female dog, you told me I had 12.3 GB free, wrote 10 GB
 and now you're saying there's NO free space?! To hell with this, I'm
 switching to Windows!) That also saves this tool from having to take
 into account possible compression ratios for when file system level
 compression is enabled, savings from possible deduplication of data, etc
 etc. Of course it also means that the amount of free space may shrink by
 less than the size of the added data, but hey, that's a nice bonus if
 your disk grows bigger as you add more data to it. :-)

I think we can guarantee minimum amounts of free space, as long as
data/metadata/system are segregated properly?
OK, reshapes complicate this. For those we could take the worst
case between now and the completed reshape.
Or maybe add a second tally:

devices
===
total
 reserved
 used
 free
===
anticipated (reshaped 8% eta 3:12)
 reserved
 used
 free

 I suggest cutting out the /dev and putting a line break after the
 name. The extra info makes it more human-friendly, and the line
 break may complicate machine parsing but the non-tabular format is
 better at that anyway.
 
 That might work well for anything under /dev, but what about things that
 aren't?

Absolute path for those, assuming it ever happens.

 And I stand by my earlier position that the tabular data
 shouldn't be machine-parsed anyway. As you say, the non-tabular format
 is better for that.



Re: find-new possibility of showing modified and deleted files/directories

2012-11-01 Thread Gabriel
On Thu, 01 Nov 2012 06:06:57 +0100, Arne Jansen wrote:
 On 11/01/2012 02:28 AM, Shane Spencer wrote:
 That's Plan B.  I'll be making a btrfs stream decoder and doing in
 place edits.  I need to move stuff around to other filesystem types
 otherwise I'd just store the stream or apply the stream to a remote
 snapshot.

 That's the whole point of the btrfs-send design: It's very easy to
 receive on different filesystems. A generic receiver is in preparation.
 And to make it even more generic: A sender using the same stream format
 is also in preparation for zfs.

Consider the rsync bundle format as well.
That should provide interoperability with any filesystem.



Re: find-new possibility of showing modified and deleted files/directories

2012-11-01 Thread Gabriel
On Thu, 01 Nov 2012 12:29:36 +0100, Arne Jansen wrote:
 On 01.11.2012 12:00, Gabriel wrote:
 On Thu, 01 Nov 2012 06:06:57 +0100, Arne Jansen wrote:
 On 11/01/2012 02:28 AM, Shane Spencer wrote:
 That's Plan B.  I'll be making a btrfs stream decoder and doing in
 place edits.  I need to move stuff around to other filesystem types
 otherwise I'd just store the stream or apply the stream to a remote
 snapshot.
 
 That's the whole point of the btrfs-send design: It's very easy to
 receive on different filesystems. A generic receiver is in
 preparation.
 And to make it even more generic: A sender using the same stream
 format is also in preparation for zfs.
 
 Consider the rsync bundle format as well.
 That should provide interoperability with any filesystem.
 
 Rsync is an interactive protocol. The idea with send/receive is that the
 stream can be generated without any interactions with receiver. You can
 store the stream somewhere, or replay it to many destinations.

Same with rsync's batch mode. Here is more about it:

http://manpages.ubuntu.com/manpages/precise/man1/rsync.1.html#contenttoc21
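For reference, batch mode splits one rsync run into a record step (no receiver contacted) and any number of replay steps. A sketch of the command lines involved; the paths, hosts, and batch file name are made up:

```python
def batch_commands(src, dest_path, batch_file, destinations):
    """Build rsync batch-mode invocations: record one change set into
    a batch file without touching any receiver, then replay that file
    to many destinations.  Uses rsync's real --only-write-batch and
    --read-batch options; all paths here are illustrative."""
    record = ["rsync", "-a", "--only-write-batch=" + batch_file,
              src, dest_path]
    replays = [["rsync", "-a", "--read-batch=" + batch_file, dest]
               for dest in destinations]
    return record, replays
```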




Re: [RFC] Systemcall for offline deduplication

2012-10-26 Thread Gabriel
On Thu, 25 Oct 2012 23:26:14 -0700, Darrick J. Wong wrote:
 Now, here's my proposal for fixing that:
 A BTRFS_IOC_SAME_RANGE ioctl would be ideal. Takes two file
 descriptors, two offsets, one length, does some locking, checks that
 the ranges are identical (returns EINVAL if not), and defers to an
 implementation that works like clone_range with the metadata update and
 the writable volume restriction moved out.
 
 I didn't go with something block-based or extent-based because with
 compression and fragmentation, extents would very easily fail to be
 aligned.
 
 Thoughts on this interface?
 Anyone interested in getting this implemented, or at least providing
 some guidance and patch review?
 
 This sounds quite a bit like what Josef had proposed with the
 FILE_EXTENT_SAME ioctl a couple of years ago[1].  At the time, he was
 only interested in writing a userland dedupe program for various
 reasons, and afaict it hasn't gone anywhere.  If you're going to do the
 comparing from userspace, I'd imagine you ought to have a better method
 to pin an extent than chattr +i...

The immutable hack is a bit lame, but it will have to stay until we get a 
good kernel API.

 I guess you could create a temporary file, F_E_S the parts of the files
 you're trying to compare into the temp file, link together whichever
 parts you want to, and punch_hole the entire temp file before moving on.
  I think it's the case that if the candidate files get rewritten during
 the dedupe operation, the new data will be written elsewhere; the punch
 hole operation will release the disk space if its refcount becomes zero.

The FILE_EXTENT_SAME proposal is not the one I'd prefer.
The parameters (fds, offsets, one length) are fine.
It's not as extent-based as the name implies (no extents in the 
parameters), except that it still needs a single extent on the left side, 
which won't work for fragmented files. That alone could be worked around 
by creating a new tempfile to use on the source side, but that has 
downsides: it will unshare extents and might actually increase disk use, 
and it won't work on read-only snapshots.
It is better to just pass fragmented offsets to the kernel than to bolt 
on workarounds that reduce visibility for the implementation.

The restrictions for compressed or encrypted files and cross-subvolume 
dedup are also inconvenient. That makes me more interested in an 
implementation based on clone_range, which has neither limitation.
That's the proposal above.
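As a userspace sketch of the semantics proposed above — the kernel would verify the two ranges are byte-identical before sharing anything — the identity check might look like this comparison loop (the ioctl itself is hypothetical; this only models the check, not the extent sharing):

```python
import os

def ranges_identical(fd1, off1, fd2, off2, length, chunk=1 << 16):
    """Compare two file ranges chunk by chunk, the way a
    BTRFS_IOC_SAME_RANGE implementation would validate its input
    (returning EINVAL on mismatch) before deferring to a clone_range
    style extent share.  Pure userspace sketch."""
    while length > 0:
        n = min(chunk, length)
        a = os.pread(fd1, n, off1)
        b = os.pread(fd2, n, off2)
        # A short read means the range runs past EOF: not identical.
        if len(a) < n or a != b:
            return False
        off1 += n
        off2 += n
        length -= n
    return True
```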

  The offline dedupe scheme seems like a good way to reclaim disk space
 if you don't mind having fewer copies of data.

I'm happy with the gains, although they are entirely dependent on having 
a lot of redundant data in the first place. The messier the better.

 As for online dedupe (which seems useful for reducing writes), would it
 be useful if one could, given a write request, compare each of the dirty
 pages in that request against whatever else the fs has loaded in the
 page cache, and try to dedupe against that?  We could probably speed up
 the search by storing hashes of whatever we have in the page cache and
 using that to find candidates for the memcmp() test.  This of course is
 not a comprehensive solution, but (a)
 we combine it with offline dedupe later and (b) we don't make a disk
 write out data that we've recently read or written.  Obviously you'd
 want to be able to opt-in to this sort of thing with an inode flag or
 something.

That's another kettle of fish, and will require an entirely different 
approach. ZFS has some experience doing that: while their implementation 
may reduce writes, it comes at the cost of storing hashes of every block 
in RAM.
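For illustration, the hash-indexed write path being discussed could be modeled like this (a toy in-memory sketch, not btrfs or ZFS code; block size and hash choice are arbitrary):

```python
import hashlib

class BlockStore:
    """Toy write path with inline dedupe: blocks are indexed by hash,
    and a memcmp-style comparison guards against hash collisions before
    an incoming write is dropped in favour of an existing block."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []    # "physical" blocks actually stored
        self.by_hash = {}   # digest -> physical block number (in RAM)

    def write_block(self, data):
        assert len(data) <= self.block_size
        digest = hashlib.sha256(data).digest()
        blockno = self.by_hash.get(digest)
        if blockno is not None and self.blocks[blockno] == data:
            return blockno  # deduplicated: no new write issued
        self.blocks.append(data)
        self.by_hash[digest] = len(self.blocks) - 1
        return len(self.blocks) - 1
```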

 [1]
 http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07779.html
 
 [1] https://github.com/g2p/bedup#readme



Re: [RFC] Systemcall for offline deduplication

2012-10-26 Thread Gabriel
 As for online dedupe (which seems useful for reducing writes), would it
 be useful if one could, given a write request, compare each of the
 dirty pages in that request against whatever else the fs has loaded in
 the page cache, and try to dedupe against that?  We could probably
 speed up the search by storing hashes of whatever we have in the page
 cache and using that to find candidates for the memcmp() test.  This of
 course is not a comprehensive solution, but (a)
 we combine it with offline dedupe later and (b) we don't make a disk
 write out data that we've recently read or written.  Obviously you'd
 want to be able to opt-in to this sort of thing with an inode flag or
 something.
 
 That's another kettle of fish, and will require an entirely different
 approach. ZFS has some experience doing that. While their implementation
 may reduce writes it is at the cost of storing hashes of every block in
 RAM.

Your proposal is quite different from the ZFS one, though, and might 
actually be useful to a wider audience, so forget I said anything about 
it.




[PATCH] Fix a sign bug causing invalid memory access in the ino_paths ioctl.

2012-10-10 Thread Gabriel de Perthuis
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:

  ls -i
  btrfs inspect inode-resolve -v $ino /mnt/btrfs

I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.

---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 868cf5b..29d05c6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1131,7 +1131,7 @@ char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
int slot;
u64 next_inum;
int ret;
-   s64 bytes_left = size - 1;
+   s64 bytes_left = ((s64)size) - 1;
struct extent_buffer *eb = eb_in;
struct btrfs_key found_key;
int leave_spinning = path->leave_spinning;
-- 
1.7.12.117.gdc24c27
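The one-character fix is easier to see in miniature. Assuming `size` is an unsigned 32-bit quantity here (which is what makes the explicit cast necessary), the subtraction wraps before the widening assignment, so `bytes_left` ends up hugely positive instead of -1 and the length checks downstream pass when they shouldn't:

```python
U32_MASK = 0xFFFFFFFF

def bytes_left_buggy(size):
    # Models `s64 bytes_left = size - 1;` with an unsigned 32-bit size:
    # the subtraction wraps in u32 arithmetic, then the (already huge,
    # positive) result is widened to s64.
    return (size - 1) & U32_MASK

def bytes_left_fixed(size):
    # Models `s64 bytes_left = ((s64)size) - 1;`: widen first,
    # then subtract, so size == 0 correctly yields -1.
    return size - 1
```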



[PATCH] warn when skipping snapshots created with older

2012-09-05 Thread Gabriel de Perthuis
Thanks, I fixed the objectid test.
Apply with --scissors.

-- >8 --
Subject: [PATCH] btrfs send: warn when skipping snapshots created with older
 kernels.

This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs, to prevent spurious warnings.

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 send-utils.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..03ca72a 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,13 +224,18 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
	if ((sh->objectid != 5 &&
	    sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
-	    sh->objectid == BTRFS_FREE_INO_OBJECTID)
+	    sh->objectid > BTRFS_LAST_FREE_OBJECTID)
		goto skip;
 
	if (sh->type == BTRFS_ROOT_ITEM_KEY) {
		/* older kernels don't have uuids+times */
		if (sh->len < sizeof(root_item)) {
			root_item_valid = 0;
+			fprintf(stderr,
+				"Ignoring subvolume id %llu, "
+				"btrfs send needs snapshots "
+				"created with kernel 3.6+\n",
+				sh->objectid);
goto skip;
}
root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27



[PATCH] Warn when skipping snapshots created with older kernels.

2012-09-04 Thread Gabriel de Perthuis
This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs.

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 send-utils.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..386aeb3 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,6 +224,7 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
	if ((sh->objectid != 5 &&
	    sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
+	    sh->objectid >= BTRFS_LAST_FREE_OBJECTID ||
	    sh->objectid == BTRFS_FREE_INO_OBJECTID)
goto skip;
 
@@ -231,6 +232,11 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
	/* older kernels don't have uuids+times */
	if (sh->len < sizeof(root_item)) {
		root_item_valid = 0;
+		fprintf(stderr,
+			"Ignoring subvolume id %llu, "
+			"btrfs send needs snapshots "
+			"created with kernel 3.6+\n",
+			sh->objectid);
goto skip;
}
root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27



[PATCH] bcp: fix off-by-one errors in path handling

2010-02-13 Thread Eduard - Gabriel Munteanu
This fixes a bug which causes the first character of each filename in
the destination to be omitted.

Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro
---
 bcp |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/bcp b/bcp
index 5729e91..c6b4bef 100755
--- a/bcp
+++ b/bcp
@@ -137,7 +137,7 @@ for srci in xrange(0, src_args):
 statinfo = os.lstat(srcname)
 
 if srcname.startswith(src):
-part = srcname[len(src) + 1:]
+part = srcname[len(src):]
 
 if stat.S_ISLNK(statinfo.st_mode):
 copylink(srcname, dst, part, statinfo, None)
@@ -153,7 +153,7 @@ for srci in xrange(0, src_args):
 for f in filenames:
 srcname = os.path.join(dirpath, f)
 if srcname.startswith(src):
-part = srcname[len(src) + 1:]
+part = srcname[len(src):]
 
 statinfo = os.lstat(srcname)
 copyfile(srcname, dst, part, statinfo, None)
-- 
1.6.4.4
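The off-by-one is easy to see in isolation. Assuming `src` carries a trailing path separator at this point in the script (which the accepted fix implies), the extra `+ 1` eats the first character of the relative part:

```python
def rel_part_buggy(src, srcname):
    # Original bcp code: the +1 was presumably meant to skip a path
    # separator, but src already ends with one, so this skips the
    # first character of the filename instead.
    return srcname[len(src) + 1:]

def rel_part_fixed(src, srcname):
    # Patched code: src already includes the separator, so slicing at
    # len(src) yields the full relative part.
    return srcname[len(src):]
```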
