Re: Incremental backup for a raid1

2014-03-13 Thread Duncan
Michael Schuerig posted on Thu, 13 Mar 2014 20:12:44 +0100 as excerpted:

> My backup use case is different from what has been recently
> discussed in another thread. I'm trying to guard against hardware
> failure and other causes of destruction.
> 
> I have a btrfs raid1 filesystem spread over two disks. I want to backup
> this filesystem regularly and efficiently to an external disk (same
> model as the ones in the raid) in such a way that
> 
> * when one disk in the raid fails, I can substitute the backup and
> rebalancing from the surviving disk to the substitute only applies the
> missing changes.
> 
> * when the entire raid fails, I can re-build a new one from the backup.
> 
> The filesystem is mounted at its root and has several nested subvolumes
> and snapshots (in a .snapshots subdir on each subvol).
> 
> Is it possible to do what I'm looking for?

AFAICS, as mentioned down the other subthread, the closest thing to this 
would be N-way mirroring, a coming feature on the roadmap for 
introduction after raid5/6 mode[1] gets completed.  The current raid1 
mode is 2-way-mirroring only, regardless of the number of devices.

N-way-mirroring is actually my most hotly anticipated feature for a 
different reason[2], but for you it would work like this:

1) Set up the 3-way (or 4-way if preferred) mirroring and balance to 
ensure copies of all data land on all devices.

2) Optionally scrub to ensure the integrity of all copies.

3) Disconnect the backup device(s).  (Don't btrfs device delete, this 
would remove the copy.  Just disconnect.)  

4) Store the backups.

5) Periodically get them out and reconnect.

6) Rebalance to update.  (Since the devices remain members of the mirror, 
simply outdated, the balance should only update, not rewrite the entire 
thing.)

7) Optionally scrub to verify.

8) Repeat steps 3-7 as necessary.

If you went 4-way, so two backups, and alternated the one you plugged in, 
it'd also protect against a mishap that takes out all connected devices 
during steps 5-7 while a backup is attached, since you'd still have that 
other backup available.
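
Once it exists, the workflow might boil down to something like the 
following sketch.  The 3-copy profile name is purely my assumption 
(nothing of the sort is implemented yet), and the device/mount-point 
names are made up:

# assumed 3-copy profile; does not exist in any current kernel
btrfs device add /dev/sdc /mnt
btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt   # step 1
btrfs scrub start /mnt                                         # step 2 (optional)
# ...physically disconnect /dev/sdc and store it; reconnect later...
btrfs balance start /mnt                                       # step 6: catch the stale copy up
btrfs scrub start /mnt                                         # step 7 (optional)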

Unfortunately, completing raid5/6 support is still an ongoing project, 
and as a result, fully functional and /reasonably/ tested N-way-mirroring 
remains the same 6-months-minimum away that it has been for over a year 
now.  But I sure am anticipating that day!

---
[1] Currently, the raid5/6 support is incomplete: the parity is 
calculated and writes are done, but some restore scenarios aren't yet 
properly supported and raid5/6-mode scrub isn't complete either, so the 
current code is considered testing-only, not for deployment where the 
raid5/6 feature would actually be relied on.  That has remained the 
raid5/6 status for several kernels now, as the focus has been on bugfixing 
other areas, including snapshot-aware defrag, which is currently 
deactivated due to horrible scaling issues (current defrag COWs the 
operational mount only, duplicating previously shared blocks), and send/
receive.

[2] In addition to protection against the loss of N-1 devices, I really 
love btrfs' data integrity features and the ability to recover from 
another copy if one is found to be corrupted, which is why I'm running 
raid1 mode here.  But currently there are only the two copies, and if both 
get corrupted...  My sweet spot would be three copies, allowing corruption 
of two and recovery from the third, which is why I personally am so hotly 
anticipating N-way-mirroring.  Unfortunately, it's looking a bit like the 
proverbial carrot on a stick in front of the donkey these days.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: discard synchronous on most SSDs?

2014-03-13 Thread Marc MERLIN
On Thu, Mar 13, 2014 at 09:39:02PM -0600, Chris Murphy wrote:
> 
> On Mar 13, 2014, at 8:11 PM, Marc MERLIN  wrote:
> 
> > On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote:
> >> discard is, except on the very latest hardware, a synchronous command
> >> (it's a limitation of the SATA standard), and therefore results in
> >> very very poor performance.
> > 
> > Interesting. How do I know if a given SSD will hang on discard?
> > Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :)
> 
> smartctl -a or -x will tell you what SATA revision is in place. The queued 
> trim support is in SATA Rev 3.1. I'm not certain if this requires only the 
> drive to support that revision level, or both controller and drive.

I'm not sure I'm seeing this, which field is that?

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 840 EVO 1TB
Serial Number:    S1D9NEAD934600N
LU WWN Device Id: 5 002538 85009a8ff
Firmware Version: EXT0BB0Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4c
Local Time is:    Thu Mar 13 22:15:14 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (15000) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 250) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    2219
 12 Power_Cycle_Count       -O--CK   099   099   000    -    659
177 Wear_Leveling_Count     PO--C-   099   099   000    -    3
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   054   041   000    -    46
195 Hardware_ECC_Recovered  -O-RC-   200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
235 Unknown_Attribute       -O--C-   099   099   000    -    35
241 Total_LBAs_Written      -O--CK   099   099   000    -    12186944165
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)

2014-03-13 Thread Duncan
Marc MERLIN posted on Thu, 13 Mar 2014 18:48:13 -0700 as excerpted:

> Are others seeing some btrfs operations on filesystem/diskA
> hang/deadlock other btrfs operations on filesystem/diskB ?

Well, if the filesystem in filesystem/diskA and filesystem/diskB is the 
same (multi-device) filesystem, as the above definitely implies...  Tho 
based on the context I don't believe that's what you actually meant.

Meanwhile, send/receive is the focus of intense bug-finding/fixing ATM.  
The basic concept is there, but to this point it has definitely been more 
development/testing-reliability (as befits btrfs' overall state, with the 
eat-your-babies kconfig option warning only recently toned down to what 
I'd call semi-stable) than enterprise-reliability.  Hopefully by the time 
they're done with all this bug-stomping it'll be rather closer to the 
latter.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: discard synchronous on most SSDs?

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 8:11 PM, Marc MERLIN  wrote:

> On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote:
>> discard is, except on the very latest hardware, a synchronous command
>> (it's a limitation of the SATA standard), and therefore results in
>> very very poor performance.
> 
> Interesting. How do I know if a given SSD will hang on discard?
> Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :)

smartctl -a or -x will tell you what SATA revision is in place. The queued trim 
support is in SATA Rev 3.1. I'm not certain if this requires only the drive to 
support that revision level, or both controller and drive.
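
A hedged example of where to look (the exact fields vary with smartctl 
version and whether the drive is in smartctl's database; the hdparm line is 
an independent cross-check, not part of the smartctl output):

smartctl -x /dev/sda | grep -i 'SATA Version'
# e.g. "SATA Version is:  SATA 3.1, 6.0 Gb/s" -- 3.1 is where queued trim appears

hdparm -I /dev/sda | grep -i trim
# look for "Data Set Management TRIM supported"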

Chris Murphy


Re: Incremental backup for a raid1

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 7:14 PM, Lists  wrote:
> 
> I'm assuming that BTRFS send/receive works similarly to ZFS's similarly named 
> feature.

Similar, yes, but not all options are the same between them. E.g. zfs send -R 
replicates all descendant file systems. I don't think zfs requires volumes, 
filesystems, or snapshots to be read-only, whereas btrfs send only works on 
read-only snapshot subvolumes. There has been some suggestion of adding recursive 
snapshot creation and recursive send for btrfs.
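
A minimal hedged sketch of the btrfs side, with made-up subvolume and mount 
point names:

# initial full send of a read-only snapshot to a backup filesystem
btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/2014-03-13
btrfs send /mnt/data/.snapshots/2014-03-13 | btrfs receive /mnt/backup

# later, an incremental send relative to the previous snapshot
btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/2014-03-14
btrfs send -p /mnt/data/.snapshots/2014-03-13 \
    /mnt/data/.snapshots/2014-03-14 | btrfs receive /mnt/backup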

> So I just don't get the "backup" problem. Place btrfs' equivalent of a pool 
> on the external drive, and use send/receive of the filesystem or snapshot(s). 
> Does BTRFS work so differently in this regard? If so, I'd like to know what's 
> different.

The topmost thing in zfs is the pool, which on btrfs is the volume. Neither zfs 
send nor btrfs send works at this level to send everything within a pool/volume. 
zfs has the file system and btrfs has the subvolume, which can be snapshotted. 
Either (or both) can be used with send. 

zfs also has the volume (zvol), which is a block device that can be snapshotted; 
there isn't yet a btrfs equivalent.

Btrfs and zfs both have clones, but the distinction is stronger with zfs: zfs 
snapshots can't be deleted unless their clones are deleted. Btrfs send has a -c 
clone-src option that I don't really understand, and there is also --reflink, 
which is a clone at the file level.
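
For what it's worth, a hedged sketch of how those two look on the command 
line (paths are made up; as I read it, -c names an extra snapshot assumed to 
already exist on the receiving side that extents may be cloned from):

btrfs send -p /snaps/base -c /snaps/other /snaps/new | btrfs receive /mnt/backup
cp --reflink=always bigfile bigfile.clone   # file-level clone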

Anyway there are a lot of similarities but also quite a few differences. Basic 
functionality seems pretty much the same.


> 
> My primary interest in BTRFS vs ZFS is two-fold:
> 
> 1) ZFS has a couple of limitations that I find disappointing, that don't 
> appear to be present in BTRFS.
>A) Inability to upgrade a non-redundant ZFS pool/vdev to raidz or increase 
> the raidz (redundancy) level after creation. (Yes, you can plan around this, 
> but I see no good reason to HAVE to)
>B) Inability to remove a vdev once added to a pool.
> 
> 2) Licensing: ZFS on Linux is truly great so far in all my testing, can't 
> throw enough compliments their way, but I would really like to rely on a 
> "first class citizen" as far as the Linux kernel is concerned.


3. On btrfs you can delete a parent subvolume and the children remain. On zfs, 
you can't destroy a zfs filesystem/volume unless its snapshots are deleted, and 
you can't delete snapshots unless their clones are deleted.
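
A hedged illustration of the btrfs side, with made-up paths (the snapshot 
has to live outside the subvolume being deleted):

btrfs subvolume create /mnt/pool/data
btrfs subvolume snapshot /mnt/pool/data /mnt/pool/data-snap
btrfs subvolume delete /mnt/pool/data   # succeeds; /mnt/pool/data-snap keeps working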


Chris Murphy



[PATCH 2/2] btrfs-progs: Fix a memleak in btrfs_scan_lblkid().

2014-03-13 Thread quwen...@cn.fujitsu.com
In btrfs_scan_lblkid(), blkid_get_cache() is called but the cache is not freed.
This patch adds blkid_put_cache() to free it.

Signed-off-by: Qu Wenruo 
---
 utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/utils.c b/utils.c
index 93cf9ac..b809bc5 100644
--- a/utils.c
+++ b/utils.c
@@ -2067,6 +2067,7 @@ int btrfs_scan_lblkid(int update_kernel)
btrfs_register_one_device(path);
}
blkid_dev_iterate_end(iter);
+   blkid_put_cache(cache);
return 0;
 }
 
-- 
1.9.0


[PATCH 1/2] btrfs-progs: Fix a memleak in btrfs_scan_one_device.

2014-03-13 Thread quwen...@cn.fujitsu.com
Valgrind reports a memleak in btrfs_scan_one_device(): btrfs_device
structures are allocated but never reclaimed in btrfs_close_devices().

Although this is not a real bug, since btrfs exits right after
btrfs_close_devices() and the memory is then reclaimed by the system, it's
still better to fix it.
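
For reference, a leak like this typically shows up under something like the
following (the subcommand is just one example of a code path that scans
devices):

valgrind --leak-check=full btrfs filesystem show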

Signed-off-by: Qu Wenruo 
---
 cmds-filesystem.c |  6 ++
 volumes.c | 13 ++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index f02e871..c9e27fc 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -651,6 +651,12 @@ devs_only:
if (search && !found)
ret = 1;
 
+   while (!list_empty(all_uuids)) {
+   fs_devices = list_entry(all_uuids->next,
+   struct btrfs_fs_devices, list);
+   list_del(&fs_devices->list);
+   btrfs_close_devices(fs_devices);
+   }
 out:
printf("%s\n", BTRFS_BUILD_VERSION);
free_seen_fsid();
diff --git a/volumes.c b/volumes.c
index 8c45851..77ffd32 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,11 +160,12 @@ static int device_list_add(const char *path,
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices;
-   struct list_head *cur;
struct btrfs_device *device;
+
 again:
-   list_for_each(cur, &fs_devices->devices) {
-   device = list_entry(cur, struct btrfs_device, dev_list);
+   while (!list_empty(&fs_devices->devices)) {
+   device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
if (device->fd != -1) {
fsync(device->fd);
if (posix_fadvise(device->fd, 0, 0, 
POSIX_FADV_DONTNEED))
@@ -173,6 +174,11 @@ again:
device->fd = -1;
}
device->writeable = 0;
+   list_del(&device->dev_list);
+   /* free the memory */
+   free(device->name);
+   free(device->label);
+   free(device);
}
 
seed_devices = fs_devices->seed;
@@ -182,6 +188,7 @@ again:
goto again;
}
 
+   free(fs_devices);
return 0;
 }
 
-- 
1.9.0


Re: send/receive locking

2014-03-13 Thread Marc MERLIN
On Sat, Mar 08, 2014 at 09:53:50PM +, Hugo Mills wrote:
>Is there anything that can be done about the issues of btrfs send
> blocking? I've been writing a backup script (slowly), and several
> times I've managed to hit a situation where large chunks of the
> machine grind to a complete halt in D state because the backup script
> has jammed up.
 
Ah, we're doing the exact same thing then :)

>Now, I'm aware that you can't send and receive to the same
> filesystem at the same time, and that's a restriction I can live with.
> However, having things that aren't related to the backup process
> suddenly stop working because the backup script is trying to log its
> progress to the same FS it's backing up is... umm... somewhat vexing,
> to say the least.

Mmmh, my backup doesn't log to disk, just to a screen buffer, but I've seen
extensive hangs too, and my 6TB send/receive priming has been taking 6 days
on local disks. I think it stops all the time due to locks.
But as per my other message below, it's very bad when it deadlocks other
filesystems not involved in the backup, like my root filesystem.

See the other thread I appended to before seeing your message:
Subject: Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices 
(near deadlocks)

attached below.
I'll be happy to try new stuff, but I want that 6 day running send/receive
to finish first. It took so long that I don't want to do it again :)

Marc

On Thu, Mar 13, 2014 at 06:48:13PM -0700, Marc MERLIN wrote:
> Can anyone comment on this.
> 
> Are others seeing some btrfs operations on filesystem/diskA hang/deadlock
> other btrfs operations on filesystem/diskB ?
> 
> I just spent time fixing near data corruption in one of my systems due to
> a 7h delay between when the timestamp was written and the actual data was
> written, and traced it down to a btrfs hang that should never have happened
> on that filesystem.
> 
> Surely, it's not a single queue for all filesystems and devices, right?
> 
> If not, does anyone know what bugs I've been hitting then?
> 
> Is the full report below, which I spent quite a while getting together for you :),
> useful in any way to see where the hangs are?
> 
> To be honest, I'm looking at moving some important filesystems back to ext4
> because I can't afford such long hangs on my root filesystem when I have a
> media device that is doing heavy btrfs IO or a send/receive.
> 
> Mmmh, is it maybe just btrfs send/receive that is taking a btrfs-wide lock?
> Or btrfs scrub maybe?
> 
> Thanks,
> Marc
> 
> On Wed, Mar 12, 2014 at 08:18:08AM -0700, Marc MERLIN wrote:
> > I have a file server with 4 cpu cores and 5 btrfs devices:
> > Label: btrfs_boot  uuid: e4c1daa8-9c39-4a59-b0a9-86297d397f3b
> > Total devices 1 FS bytes used 48.92GiB
> > devid1 size 79.93GiB used 73.04GiB path /dev/mapper/cryptroot
> > 
> > Label: varlocalspace  uuid: 9f46dbe2-1344-44c3-b0fb-af2888c34f18
> > Total devices 1 FS bytes used 1.10TiB
> > devid1 size 1.63TiB used 1.50TiB path /dev/mapper/cryptraid0
> > 
> > Label: btrfs_pool1  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
> > Total devices 1 FS bytes used 7.16TiB
> > devid1 size 14.55TiB used 7.50TiB path /dev/mapper/dshelf1
> > 
> > Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
> > Total devices 1 FS bytes used 3.34TiB
> > devid1 size 7.28TiB used 3.42TiB path /dev/mapper/dshelf2
> > 
> > Label: bigbackup  uuid: 024ba4d0-dacb-438d-9f1b-eeb34083fe49
> > Total devices 5 FS bytes used 6.02TiB
> > devid1 size 1.82TiB used 1.43TiB path /dev/dm-9
> > devid2 size 1.82TiB used 1.43TiB path /dev/dm-6
> > devid3 size 1.82TiB used 1.43TiB path /dev/dm-5
> > devid4 size 1.82TiB used 1.43TiB path /dev/dm-7
> > devid5 size 1.82TiB used 1.43TiB path /dev/dm-8
> > 
> > 
> > I have a very long running btrfs send/receive from btrfs_pool1 to bigbackup
> > (long running meaning that it's been slowly copying over 5 days)
> > 
> > The problem is that this is blocking IO to btrfs_pool2 which is using
> > totally different drives.
> > By blocking IO I mean that IO to pool2 kind of works sometimes, and
> > hangs for very long times at other times.
> > 
> > It looks as if one rsync to btrfs_pool2 or one piece of IO hangs on a 
> > shared lock
> > and once that happens, all IO to btrfs_pool2 stops for a long time.
> > It does recover eventually without reboot, but the wait times are 
> > ridiculous (it 
> > could be 1H or more).
> > 
> > As I write this, I have a killall -9 rsync that waited for over 10mn before
> > these processes would finally die:
> > 23555   07:36 wait_current_trans.isra.15 rsync -av -SH --delete 
> > (...)
> > 23556   07:36 exit   [rsync] 
> > 25387  2-04:41:22 wait_current_trans.isra.15 rsync --password-file  
> > (...)
> > 27481   31:26 wait_current_trans.isr

Re: discard synchronous on most SSDs?

2014-03-13 Thread Marc MERLIN
On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote:
> discard is, except on the very latest hardware, a synchronous command
> (it's a limitation of the SATA standard), and therefore results in
> very very poor performance.

Interesting. How do I know if a given SSD will hang on discard?
Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :)

Thanks
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  




Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)

2014-03-13 Thread Marc MERLIN
Can anyone comment on this.

Are others seeing some btrfs operations on filesystem/diskA hang/deadlock
other btrfs operations on filesystem/diskB ?

I just spent time fixing near data corruption in one of my systems due to
a 7h delay between when the timestamp was written and the actual data was
written, and traced it down to a btrfs hang that should never have happened
on that filesystem.

Surely, it's not a single queue for all filesystems and devices, right?

If not, does anyone know what bugs I've been hitting then?

Is the full report below, which I spent quite a while getting together for you :),
useful in any way to see where the hangs are?

To be honest, I'm looking at moving some important filesystems back to ext4
because I can't afford such long hangs on my root filesystem when I have a
media device that is doing heavy btrfs IO or a send/receive.

Mmmh, is it maybe just btrfs send/receive that is taking a btrfs-wide lock?
Or btrfs scrub maybe?

Thanks,
Marc

On Wed, Mar 12, 2014 at 08:18:08AM -0700, Marc MERLIN wrote:
> I have a file server with 4 cpu cores and 5 btrfs devices:
> Label: btrfs_boot  uuid: e4c1daa8-9c39-4a59-b0a9-86297d397f3b
> Total devices 1 FS bytes used 48.92GiB
> devid1 size 79.93GiB used 73.04GiB path /dev/mapper/cryptroot
> 
> Label: varlocalspace  uuid: 9f46dbe2-1344-44c3-b0fb-af2888c34f18
> Total devices 1 FS bytes used 1.10TiB
> devid1 size 1.63TiB used 1.50TiB path /dev/mapper/cryptraid0
> 
> Label: btrfs_pool1  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
> Total devices 1 FS bytes used 7.16TiB
> devid1 size 14.55TiB used 7.50TiB path /dev/mapper/dshelf1
> 
> Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
> Total devices 1 FS bytes used 3.34TiB
> devid1 size 7.28TiB used 3.42TiB path /dev/mapper/dshelf2
> 
> Label: bigbackup  uuid: 024ba4d0-dacb-438d-9f1b-eeb34083fe49
> Total devices 5 FS bytes used 6.02TiB
> devid1 size 1.82TiB used 1.43TiB path /dev/dm-9
> devid2 size 1.82TiB used 1.43TiB path /dev/dm-6
> devid3 size 1.82TiB used 1.43TiB path /dev/dm-5
> devid4 size 1.82TiB used 1.43TiB path /dev/dm-7
> devid5 size 1.82TiB used 1.43TiB path /dev/dm-8
> 
> 
> I have a very long running btrfs send/receive from btrfs_pool1 to bigbackup
> (long running meaning that it's been slowly copying over 5 days)
> 
> The problem is that this is blocking IO to btrfs_pool2 which is using
> totally different drives.
> By blocking IO I mean that IO to pool2 kind of works sometimes, and
> hangs for very long times at other times.
> 
> It looks as if one rsync to btrfs_pool2 or one piece of IO hangs on a shared 
> lock
> and once that happens, all IO to btrfs_pool2 stops for a long time.
> It does recover eventually without reboot, but the wait times are ridiculous 
> (it 
> could be 1H or more).
> 
> As I write this, I have a killall -9 rsync that waited for over 10mn before
> these processes would finally die:
> 23555   07:36 wait_current_trans.isra.15 rsync -av -SH --delete (...)
> 23556   07:36 exit   [rsync] 
> 25387  2-04:41:22 wait_current_trans.isra.15 rsync --password-file  (...)
> 27481   31:26 wait_current_trans.isra.15 rsync --password-file  (...)
> 2926804:41:34 wait_current_trans.isra.15 rsync --password-file  (...)
> 2934304:41:31 exit   [rsync] 
> 2949204:41:27 wait_current_trans.isra.15 rsync --password-file  (...)
> 
> 1455907:14:49 wait_current_trans.isra.15 cp -i -al current 
> 20140312-feisty
> 
> This is all stuck in btrfs kernel code.
> If someone wants sysrq-w, there it is.
> http://marc.merlins.org/tmp/btrfs_full.txt
> 
> A quick summary:
> SysRq : Show Blocked State
>   taskPC stack   pid father
> btrfs-cleaner   D 8802126b0840 0  3332  2 0x
>  8800c5dc9d00 0046 8800c5dc9fd8 8800c69f6310
>  000141c0 8800c69f6310 88017574c170 880211e671e8
>   880211e67000 8801e5936e20 8800c5dc9d10
> Call Trace:
>  [] schedule+0x73/0x75
>  [] wait_current_trans.isra.15+0x98/0xf4
>  [] ? finish_wait+0x65/0x65
>  [] start_transaction+0x48e/0x4f2
>  [] ? __btrfs_end_transaction+0x2a1/0x2c6
>  [] btrfs_start_transaction+0x1b/0x1d
>  [] btrfs_drop_snapshot+0x443/0x610
>  [] ? _raw_spin_unlock+0x17/0x2a
>  [] ? finish_task_switch+0x51/0xdb
>  [] ? __schedule+0x537/0x5de
>  [] btrfs_clean_one_deleted_snapshot+0x103/0x10f
>  [] cleaner_kthread+0x103/0x136
>  [] ? btrfs_alloc_root+0x26/0x26
>  [] kthread+0xae/0xb6
>  [] ? __kthread_parkme+0x61/0x61
>  [] ret_from_fork+0x7c/0xb0
>  [] ? __kthread_parkme+0x61/0x61
> btrfs-transacti D 88021387eb00 0    2 0x
>  8800c5dcb890 0046 8800c5dcbfd8 88021387e5d0
>  000141c0 88021387e5d0 88021f2141c0 88021387e5d0
>  88

Re: Incremental backup for a raid1

2014-03-13 Thread Lists

See comments at the bottom:

On 03/13/2014 05:29 PM, George Mitchell wrote:

On 03/13/2014 04:03 PM, Michael Schuerig wrote:

On Thursday 13 March 2014 16:04:33 Chris Murphy wrote:

On Mar 13, 2014, at 3:14 PM, Michael Schuerig

 wrote:

On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:

On 2014-Mar-13 14:28, Hugo Mills wrote:

On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:

My backup use case is different from what has been recently
discussed in another thread. I'm trying to guard against hardware
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to
backup this filesystem regularly and efficiently to an external
disk (same model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup
and
rebalancing from the surviving disk to the substitute only
applies
the missing changes.

* when the entire raid fails, I can re-build a new one from the
backup.

The filesystem is mounted at its root and has several nested
subvolumes and snapshots (in a .snapshots subdir on each subvol).

[...]


I'm new; btrfs noob; completely unqualified to write intelligently
on
this topic, nevertheless:
I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
backup device someplace /dev/C

Could you, at the time you wanted to backup the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
3) balance to effect the backup (i.e. rebuilding the RAID1 onto
/dev/C) 4) break/reconnect the original devices: remove /dev/C;
re-add /dev/B to the fs

I've thought of this but don't dare try it without approval from the
experts. At any rate, for being practical, this approach hinges on
an
ability to rebuild the raid1 incrementally. That is, the rebuild
would have to start from what already is present on disk B (or C,
when it is re-added). Starting from an effectively blank disk each
time would be prohibitive.

Even if this would work, I'd much prefer keeping the original raid1
intact and to only temporarily add another mirror: "lazy mirroring",
to give the thing a name.

[...]

In the btrfs device add case, you now have a three disk raid1 which is
a whole different beast. Since this isn't n-way raid1, each disk is
not stand alone. You're only assured data survives a one disk failure
meaning you must have two drives.

Yes, I understand that. Unless someone convinces me that it's a bad
idea, I keep wishing for a feature that allows to intermittently add a
third disk to a two disk raid1 and update that disk so that it could
replace one of the others.


So the btrfs replace scenario might work but it seems like a bad idea.
And overall it's a use case for which send/receive was designed
anyway so why not just use that?

Because it's not "just". Doing it right doesn't seem trivial. For one
thing, there are multiple subvolumes; not at the top-level but nested
inside a root subvolume. Each of them already has snapshots of its own.
If there already is a send/receive script that can handle such a setup
I'll happily have a look at it.

Michael

I think the closest thing there will ever be to this is n-way 
mirroring.  I currently use rsync to a separate drive to maintain a 
backup copy, but it is not integrated into the array like n-way would 
be, and is definitely not a perfect solution.  But a 3 drive 3-way 
would require the 3rd drive to be in the array the whole time or it 
would run into the same problem requiring a complete rebuild rather 
than an incremental when reintroduced, UNLESS such a feature was 
specifically included in the design, and even then, in a 3-way 
configuration, you would end up simplex on at least some data until 
the partial rebuild was completed.  Personally, I will be DELIGHTED 
when n-way appears simply because basic 3-way gets us out of the 
dreaded simplex trap.




I'm coming from ZFS land, am a BTRFS newbie, and I don't understand this 
discussion at all. I'm assuming that BTRFS send/receive works similarly 
to ZFS's similarly named feature. We use snapshots and ZFS send/receive 
to a remote server to do our backups. To do an rsync of our production 
file store takes days because there are so many files, while 
snapshotting and using ZFS send/receive takes tens of minutes at local 
(Gbit) speeds, and a few hours at WAN speeds, nearly all of that time 
being transfer time.


So I just don't get the "backup" problem. Place btrfs' equivalent of a 
pool on the external drive, and use send/receive of the filesystem or 
snapshot(s). Does BTRFS work so differently in this regard? If so, I'd 
like to know what's different.


My primary interest in BTRFS vs ZFS is two-fold:

1) ZFS has a couple of limitat

Re: Incremental backup for a raid1

2014-03-13 Thread George Mitchell

On 03/13/2014 04:03 PM, Michael Schuerig wrote:

On Thursday 13 March 2014 16:04:33 Chris Murphy wrote:

On Mar 13, 2014, at 3:14 PM, Michael Schuerig

 wrote:

On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:

On 2014-Mar-13 14:28, Hugo Mills wrote:

On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:

My backup use case is different from what has been recently
discussed in another thread. I'm trying to guard against hardware
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to
backup this filesystem regularly and efficiently to an external
disk (same model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup
and
rebalancing from the surviving disk to the substitute only
applies
the missing changes.

* when the entire raid fails, I can re-build a new one from the
backup.

The filesystem is mounted at its root and has several nested
subvolumes and snapshots (in a .snapshots subdir on each subvol).

[...]


I'm new; btrfs noob; completely unqualified to write intelligently
on
this topic, nevertheless:
I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
backup device someplace /dev/C

Could you, at the time you wanted to backup the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
3) balance to effect the backup (i.e. rebuilding the RAID1 onto
/dev/C) 4) break/reconnect the original devices: remove /dev/C;
re-add /dev/B to the fs

I've thought of this but don't dare try it without approval from the
experts. At any rate, for being practical, this approach hinges on
an
ability to rebuild the raid1 incrementally. That is, the rebuild
would have to start from what already is present on disk B (or C,
when it is re-added). Starting from an effectively blank disk each
time would be prohibitive.

Even if this would work, I'd much prefer keeping the original raid1
intact and to only temporarily add another mirror: "lazy mirroring",
to give the thing a name.

[...]

In the btrfs device add case, you now have a three disk raid1 which is
a whole different beast. Since this isn't n-way raid1, each disk is
not stand alone. You're only assured data survives a one disk failure
meaning you must have two drives.

Yes, I understand that. Unless someone convinces me that it's a bad
idea, I keep wishing for a feature that allows to intermittently add a
third disk to a two disk raid1 and update that disk so that it could
replace one of the others.


So the btrfs replace scenario might work but it seems like a bad idea.
And overall it's a use case for which send/receive was designed
anyway so why not just use that?

Because it's not "just". Doing it right doesn't seem trivial. For one
thing, there are multiple subvolumes; not at the top-level but nested
inside a root subvolume. Each of them already has snapshots of its own.
If there already is a send/receive script that can handle such a setup
I'll happily have a look at it.

Michael

I think the closest thing there will ever be to this is n-way 
mirroring.  I currently use rsync to a separate drive to maintain a 
backup copy, but it is not integrated into the array like n-way would 
be, and is definitely not a perfect solution.  But a 3 drive 3-way would 
require the 3rd drive to be in the array the whole time or it would run 
into the same problem requiring a complete rebuild rather than an 
incremental when reintroduced, UNLESS such a feature was specifically 
included in the design, and even then, in a 3-way configuration, you 
would end up simplex on at least some data until the partial rebuild was 
completed.  Personally, I will be DELIGHTED when n-way appears simply 
because basic 3-way gets us out of the dreaded simplex trap.



Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts'o
On Thu, Mar 13, 2014 at 04:28:23PM +, Steven Whitehouse wrote:
> 
> I guess the same is true for other file systems which are mounted ro
> too. So maybe a check for MS_RDONLY before doing the sync in those
> cases?

My original patch moved the sync_filesystem into the check for
MS_RDONLY in the core VFS code.  The objection was raised that there
might be some file system out there that might depend on this
behaviour.  I can't imagine why, but I suppose it's at least
theoretically possible.

So the idea is that this particular patch is *guaranteed* not to make
any difference.  That way there can be no question about the patch's
correctness.

I'm going to follow up with a patch for ext4 that does exactly that,
but the idea is to allow each file system maintainer to do that for
their own file system.

I could do that as well for file systems that are "obviously"
read-only, but then I'll find out that there's some weird case where
the file system can be used in a read-write fashion.  (Example: UDF is
normally used for DVD's, but at least in theory it can be used
read/write --- I'm told that Windows supports read-write UDF file
systems on USB sticks, and at least in theory it could be used as a
inter-OS exchange format in situations where VFAT and exFAT might not
be appropriate for various reasons.)

Cheers,

- Ted


Re: Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig
On Thursday 13 March 2014 16:04:33 Chris Murphy wrote:
> On Mar 13, 2014, at 3:14 PM, Michael Schuerig 
 wrote:
> > On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
> >> On 2014-Mar-13 14:28, Hugo Mills wrote:
> >>> On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
>  My backup use case is different from what has been recently
>  discussed in another thread. I'm trying to guard against hardware
>  failure and other causes of destruction.
>  
>  I have a btrfs raid1 filesystem spread over two disks. I want to
>  backup this filesystem regularly and efficiently to an external
>  disk (same model as the ones in the raid) in such a way that
>  
>  * when one disk in the raid fails, I can substitute the backup
>  and
>  rebalancing from the surviving disk to the substitute only
>  applies
>  the missing changes.
>  
>  * when the entire raid fails, I can re-build a new one from the
>  backup.
>  
>  The filesystem is mounted at its root and has several nested
>  subvolumes and snapshots (in a .snapshots subdir on each subvol).
> > 
> > [...]
> > 
> >> I'm new; btrfs noob; completely unqualified to write intelligently
> >> on
> >> this topic, nevertheless:
> >> I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
> >> backup device someplace /dev/C
> >> 
> >> Could you, at the time you wanted to backup the filesystem:
> >> 1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
> >> 2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
> >> 3) balance to effect the backup (i.e. rebuilding the RAID1 onto
> >> /dev/C) 4) break/reconnect the original devices: remove /dev/C;
> >> re-add /dev/B to the fs
> > 
> > I've thought of this but don't dare try it without approval from the
> > experts. At any rate, for being practical, this approach hinges on
> > an
> > ability to rebuild the raid1 incrementally. That is, the rebuild
> > would have to start from what already is present on disk B (or C,
> > when it is re-added). Starting from an effectively blank disk each
> > time would be prohibitive.
> > 
> > Even if this would work, I'd much prefer keeping the original raid1
> > intact and to only temporarily add another mirror: "lazy mirroring",
> > to give the thing a name.

[...]
> In the btrfs device add case, you now have a three disk raid1 which is
> a whole different beast. Since this isn't n-way raid1, each disk is
> not stand alone. You're only assured data survives a one disk failure
> meaning you must have two drives.

Yes, I understand that. Unless someone convinces me that it's a bad 
idea, I keep wishing for a feature that allows intermittently adding a 
third disk to a two-disk raid1 and updating that disk so that it could 
replace one of the others.

> So the btrfs replace scenario might work but it seems like a bad idea.
> And overall it's a use case for which send/receive was designed
> anyway so why not just use that?

Because it's not "just". Doing it right doesn't seem trivial. For one 
thing, there are multiple subvolumes; not at the top-level but nested 
inside a root subvolume. Each of them already has snapshots of its own. 
If there already is a send/receive script that can handle such a setup 
I'll happily have a look at it.
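
The per-subvolume skeleton seems clear enough -- a hedged sketch with 
made-up paths, one nested subvolume at a time:

SUB=/mnt/top/home                     # one nested subvolume (example path)
NEW=$SUB/.snapshots/$(date +%F)
btrfs subvolume snapshot -r "$SUB" "$NEW"
btrfs send -p "$SUB/.snapshots/previous" "$NEW" | \
    btrfs receive /mnt/backup/home/.snapshots

What doesn't seem trivial is keeping the parent snapshots, the read-only 
snapshots, and the per-subvolume .snapshots directories consistent across 
all the nested subvolumes.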

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: ENOSPC errors during raid1 rebalance

2014-03-13 Thread Eugene Crosser
Hello,

I want to report that I have the same problem as Michael Russo, except in my
case there is definitely *a lot* of free space.

I had ext4 on a 1Tb LVM mirror. The filesystem was 96% full, with many multi-Gb
files. I successfully converted it to btrfs and removed the ext2_saved subvolume,
but did *not* defragment or balance.

Then I added two fresh 4Tb disks to the filesystem, and tried to convert it to
raid1. My plan was to then delete the original LVM disk and have all my data
migrated to the new 4Tb disks under btrfs mirror.
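
In commands, the sequence was roughly the following (a sketch from memory, 
using the device names that appear in btrfs fi show below):

btrfs-convert /dev/md5                    # the earlier in-place ext4 -> btrfs conversion
mount /dev/md5 /export
btrfs device add /dev/sdc1 /dev/sdd1 /export
btrfs balance start -dconvert=raid1 -mconvert=raid1 /export
# planned final step, once the balance completes:
btrfs device delete /dev/md5 /export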

But balancing cannot complete with the same symptoms:

[...]
[12746.391828] block group has cluster?: no
[12746.391830] 0 blocks of free space at or bigger than bytes is
[12747.420098] btrfs: 35 enospc errors during balance

root@pccross:~# btrfs fi sh
Label: 'export'  uuid: 02f39e9d-9115-4a79-9015-a3a9decb87cf
Total devices 3 FS bytes used 798.15GB
devid3 size 3.64TB used 855.03GB path /dev/sdd1
devid2 size 3.64TB used 855.00GB path /dev/sdc1
devid1 size 891.51GB used 175.48GB path /dev/md5

Btrfs v0.20-rc1

root@pccross:~# btrfs fi df /export
Data, RAID1: total=849.00GB, used=650.25GB
Data: total=175.48GB, used=145.64GB
System: total=32.00MB, used=136.00KB
Metadata, RAID1: total=6.00GB, used=2.21GB

root@pccross:~# uname -a
Linux pccross 3.13.0-17-generic #37-Ubuntu SMP Mon Mar 10 21:44:01 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux

Attempt "btrfs device delete" fails with the same "no space" diagnostic.

I am running defragmentation on all files bigger than 1Gb now, and will see
what happens. If that does not help, is there any other advice? I can collect
debugging data if needed.

Thanks,

Eugene





Re: [PATCH] Btrfs: remove transaction from send

2014-03-13 Thread Hugo Mills
On Thu, Mar 13, 2014 at 03:42:13PM -0400, Josef Bacik wrote:
> Let's try this again.  We can deadlock the box if we send on a box and try to
> write onto the same fs with the app that is trying to listen to the send pipe.
> This is because the writer could get stuck waiting for a transaction commit
> which is being blocked by the send.  So fix this by making sure looking at the
> commit roots is always going to be consistent.  We do this by keeping track of
> which roots need to have their commit roots swapped during commit, and then
> taking the commit_root_sem and swapping them all at once.  Then make sure we
> take a read lock on the commit_root_sem in cases where we search the commit 
> root
> to make sure we're always looking at a consistent view of the commit roots.
> Previously we had problems with this because we would swap a fs tree commit 
> root
> and then swap the extent tree commit root independently which would cause the
> backref walking code to screw up sometimes.  With this patch we no longer
> deadlock and pass all the weird send/receive corner cases.  Thanks,

   There's something still going on here. I managed to get about twice
as far through my test as I had before, but I again got an "unexpected
EOF in stream", with btrfs send returning 1. As before, I have this in
syslog:

Mar 13 22:09:12 s_src@amelia kernel: BTRFS error (device sda2): did not find 
backref in send_root. inode=1786631, offset=825257984, disk_byte=36504023040 
found extent=36504023040\x0a

   So, on the evidence of one data point (I'll have another one when I
wake up tomorrow morning), this has made the problem harder to trigger
but it's still possible.

   Hugo.

> Reported-by: Hugo Mills 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/backref.c | 33 +++
>  fs/btrfs/ctree.c   | 88 
> --
>  fs/btrfs/ctree.h   |  3 +-
>  fs/btrfs/disk-io.c |  3 +-
>  fs/btrfs/extent-tree.c | 20 ++--
>  fs/btrfs/inode-map.c   | 14 
>  fs/btrfs/send.c| 57 ++--
>  fs/btrfs/transaction.c | 45 --
>  fs/btrfs/transaction.h |  1 +
>  9 files changed, 77 insertions(+), 187 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 860f4f2..0be0e94 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -329,7 +329,10 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
> *fs_info,
>   goto out;
>   }
>  
> - root_level = btrfs_old_root_level(root, time_seq);
> + if (path->search_commit_root)
> + root_level = btrfs_header_level(root->commit_root);
> + else
> + root_level = btrfs_old_root_level(root, time_seq);
>  
>   if (root_level + 1 == level) {
>   srcu_read_unlock(&fs_info->subvol_srcu, index);
> @@ -1092,9 +1095,9 @@ static int btrfs_find_all_leafs(struct 
> btrfs_trans_handle *trans,
>   *
>   * returns 0 on success, < 0 on error.
>   */
> -int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> - struct btrfs_fs_info *fs_info, u64 bytenr,
> - u64 time_seq, struct ulist **roots)
> +static int __btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> +   struct btrfs_fs_info *fs_info, u64 bytenr,
> +   u64 time_seq, struct ulist **roots)
>  {
>   struct ulist *tmp;
>   struct ulist_node *node = NULL;
> @@ -1130,6 +1133,20 @@ int btrfs_find_all_roots(struct btrfs_trans_handle 
> *trans,
>   return 0;
>  }
>  
> +int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
> +  struct btrfs_fs_info *fs_info, u64 bytenr,
> +  u64 time_seq, struct ulist **roots)
> +{
> + int ret;
> +
> + if (!trans)
> + down_read(&fs_info->commit_root_sem);
> + ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
> + if (!trans)
> + up_read(&fs_info->commit_root_sem);
> + return ret;
> +}
> +
>  /*
>   * this makes the path point to (inum INODE_ITEM ioff)
>   */
> @@ -1509,6 +1526,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
>   if (IS_ERR(trans))
>   return PTR_ERR(trans);
>   btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
> + } else {
> + down_read(&fs_info->commit_root_sem);
>   }
>  
>   ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
> @@ -1519,8 +1538,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
>  
>   ULIST_ITER_INIT(&ref_uiter);
>   while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
> - ret = btrfs_find_all_roots(trans, fs_info, ref_node->val,
> -tree_mod_seq_elem.seq, &roots);
> + ret = __btrfs_find_all_roots(trans, fs_info, ref_node->val,
> +  

Re: Incremental backup for a raid1

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 3:14 PM, Michael Schuerig  wrote:

> On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
>> On 2014-Mar-13 14:28, Hugo Mills wrote:
>>> On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
 My backup use case is different from what has been recently
 discussed in another thread. I'm trying to guard against hardware
 failure and other causes of destruction.
 
 I have a btrfs raid1 filesystem spread over two disks. I want to
 backup this filesystem regularly and efficiently to an external
 disk (same model as the ones in the raid) in such a way that
 
 * when one disk in the raid fails, I can substitute the backup and
 rebalancing from the surviving disk to the substitute only applies
 the missing changes.
 
 * when the entire raid fails, I can re-build a new one from the
 backup.
 
 The filesystem is mounted at its root and has several nested
 subvolumes and snapshots (in a .snapshots subdir on each subvol).
> [...]
> 
>> I'm new; btrfs noob; completely unqualified to write intelligently on
>> this topic, nevertheless:
>> I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
>> backup device someplace /dev/C
>> 
>> Could you, at the time you wanted to backup the filesystem:
>> 1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
>> 2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
>> 3) balance to effect the backup (i.e. rebuilding the RAID1 onto
>> /dev/C) 4) break/reconnect the original devices: remove /dev/C;
>> re-add /dev/B to the fs
> 
> I've thought of this but don't dare try it without approval from the 
> experts. At any rate, for being practical, this approach hinges on an 
> ability to rebuild the raid1 incrementally. That is, the rebuild would 
> have to start from what already is present on disk B (or C, when it is 
> re-added). Starting from an effectively blank disk each time would be 
> prohibitive.
> 
> Even if this would work, I'd much prefer keeping the original raid1 
> intact and to only temporarily add another mirror: "lazy mirroring", to 
> give the thing a name.

At best this seems fragile, but I don't think it works and is an edge case from 
the start. This is what send/receive is for.

In the btrfs replace scenario, the missing device is removed from the volume. 
It's like a divorce. Missing device 2 is replaced by a different physical 
device also called device 2. If you then removed 2b and re-added (formerly 
replaced) device 2a, what happens? I don't know, I'm pretty sure the volume 
knows this is not device 2b as it should be, and won't accept formerly replaced 
device 2a. But it's an edge case to do this because you've said "device 
replace". So lexicon wise, I wouldn't even want this to work, we'd need a 
different command even if not different logic.

In the btrfs device add case, you now have a three disk raid1 which is a whole 
different beast. Since this isn't n-way raid1, each disk is not stand alone. 
You're only assured data survives a one disk failure meaning you must have two 
drives. You've just increased your risk by doing this, not reduced it. It 
further proposes running an (ostensibly) production workflow with an always 
degraded volume, mounted with -o degraded, on an on-going basis. So it's three 
strikes.  It's not n-way, you have no uptime if you lose one of two disks 
onsite, you'd have to go get the offsite/onshelf disk to keep working. Plus 
that offsite disk isn't stand alone, so why even have it offsite? This is a 
fail.

So the btrfs replace scenario might work but it seems like a bad idea. And 
overall it's a use case for which send/receive was designed anyway so why not 
just use that?

Chris Murphy



Re: Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig
On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
> On 2014-Mar-13 14:28, Hugo Mills wrote:
> > On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
> >> My backup use case is different from what has been recently
> >> discussed in another thread. I'm trying to guard against hardware
> >> failure and other causes of destruction.
> >> 
> >> I have a btrfs raid1 filesystem spread over two disks. I want to
> >> backup this filesystem regularly and efficiently to an external
> >> disk (same model as the ones in the raid) in such a way that
> >> 
> >> * when one disk in the raid fails, I can substitute the backup and
> >> rebalancing from the surviving disk to the substitute only applies
> >> the missing changes.
> >> 
> >> * when the entire raid fails, I can re-build a new one from the
> >> backup.
> >> 
> >> The filesystem is mounted at its root and has several nested
> >> subvolumes and snapshots (in a .snapshots subdir on each subvol).
[...]

> I'm new; btrfs noob; completely unqualified to write intelligently on
> this topic, nevertheless:
> I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
> backup device someplace /dev/C
> 
> Could you, at the time you wanted to backup the filesystem:
> 1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
> 2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
> 3) balance to effect the backup (i.e. rebuilding the RAID1 onto
> /dev/C) 4) break/reconnect the original devices: remove /dev/C;
> re-add /dev/B to the fs

I've thought of this but don't dare try it without approval from the 
experts. At any rate, for being practical, this approach hinges on an 
ability to rebuild the raid1 incrementally. That is, the rebuild would 
have to start from what already is present on disk B (or C, when it is 
re-added). Starting from an effectively blank disk each time would be 
prohibitive.

Even if this would work, I'd much prefer keeping the original raid1 
intact and to only temporarily add another mirror: "lazy mirroring", to 
give the thing a name.

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: Incremental backup for a raid1

2014-03-13 Thread Brendan Hide

On 2014/03/13 09:48 PM, Andrew Skretvedt wrote:

On 2014-Mar-13 14:28, Hugo Mills wrote:

On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:

I have a btrfs raid1 filesystem spread over two disks. I want to backup
this filesystem regularly and efficiently to an external disk (same
model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup and
rebalancing from the surviving disk to the substitute only applies the
missing changes.

For point 1, not really. It's a different filesystem
[snip]
Hugo.

I'm new

We all start somewhere. ;)

Could you, at the time you wanted to backup the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
It's this step that won't work "as is" and, from an outsider's 
perspective, it is not obvious why:
As Hugo mentioned, "It's a different filesystem". The two disks don't 
have any "co-ordinating" record of data and don't have any record 
indicating that the other disk even exists. The files they store might 
even be identical - but there's a lot of missing information that would 
be necessary to tell them they can work together. All this will do is 
reformat /dev/C and then it will be rewritten again by the balance 
operation in step 3) below.

3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C)
4) break/reconnect the original devices: remove /dev/C; re-add /dev/B 
to the fs
Again, as with 2), /dev/A is now synchronised with (for all intents and 
purposes) a new disk. If you want to re-add /dev/B, you're going to lose 
any data on /dev/B (view this in the sense that, if you wiped the disk, 
the end-result would be the same) and then you would be re-balancing new 
data onto it from scratch.


Before removing /dev/B:
Disk A: abdeg__cf__
Disk B: abc_df_ge__ <- note that data is *not* necessarily stored in the 
exact same position on both disks

Disk C: gbfc_d__a_e

All data is available on all disks. Disk C has no record indicating that 
disks A and B exist.
Disk A and B have a record indicating that the other disk is part of the 
same FS. These two disks have no record indicating disk C exists.


1. Remove /dev/B:
Disk A: abdeg__cf__
Disk C: gbfc_d__a_e

2. Add /dev/C to /dev/A as RAID1:
Disk A: abdeg__cf__
Disk C: ___________ <- system reformats /dev/C and treats the old data 
as garbage


3. Balance /dev/{A,C}:
Disk A: abdeg__cf__
Disk C: abcdefg

Both disks now have a full record of where the data is supposed to be 
and have a record indicating that the other disk is part of the FS. 
Notice that, though Disk C has the exact same files as it did before 
step 1, the on-disk filesystem looks very different.


4. Follow steps 1, 2, and 3 above - but with different disks - similar 
end-result.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Testing BTRFS

2014-03-13 Thread Avi Miller
Hi,

On 14 Mar 2014, at 5:10 am, Lists  wrote:

> Is there any issue with BTRFS and 32 bit O/S like with ZFS?

We provide some btrfs support with the 32-bit UEK Release 2 on OL6, but we 
strongly recommend using only UEK Release 3, which is 64-bit only.

--
Oracle 
Avi Miller | Product Management Director | +61 (3) 8616 3496
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia



Re: Incremental backup for a raid1

2014-03-13 Thread Andrew Skretvedt

On 2014-Mar-13 14:28, Hugo Mills wrote:

On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:


My backup use case is different from what has been recently
discussed in another thread. I'm trying to guard against hardware
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to backup
this filesystem regularly and efficiently to an external disk (same
model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup and
rebalancing from the surviving disk to the substitute only applies the
missing changes.

* when the entire raid fails, I can re-build a new one from the backup.

The filesystem is mounted at its root and has several nested subvolumes
and snapshots (in a .snapshots subdir on each subvol).

Is it possible to do what I'm looking for?


For point 2, yes. (Add new disk, balance -oconvert from single to
raid1).

For point 1, not really. It's a different filesystem, so it'll have
a different UUID. You *might* be able to get away with rsync of one of
the block devices in the array to the backup block device, but you'd
have to unmount the FS (or halt all writes to it) for the period of
the rsync to ensure a consistent image, and the rsync would have to
read all the data in the device being synced to work out what to send.
Probably not what you want.

Hugo.

I'm new; btrfs noob; completely unqualified to write intelligently on 
this topic, nevertheless:
I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a 
backup device someplace /dev/C


Could you, at the time you wanted to backup the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C)
4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to 
the fs


I think this could be done online. Any one device [ABC] surviving is 
sufficient to rebuild a RAID1 of the filesystem, or be accessed alone in 
degraded fashion for disaster recovery purposes.
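
As a rough illustration of that degraded access (a sketch only; the mount 
point and device name are placeholders), a single surviving raid1 member 
can usually be mounted like this:

  mount -o degraded /dev/A /mnt
  # or read-only, if you only need to pull data off it:
  mount -o degraded,ro /dev/A /mnt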


I think that would address point 1. Is my thinking horrible on this? 
(again, noob to btrfs)



[PATCH] Btrfs: remove transaction from send

2014-03-13 Thread Josef Bacik
Let's try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reported-by: Hugo Mills 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/backref.c | 33 +++
 fs/btrfs/ctree.c   | 88 --
 fs/btrfs/ctree.h   |  3 +-
 fs/btrfs/disk-io.c |  3 +-
 fs/btrfs/extent-tree.c | 20 ++--
 fs/btrfs/inode-map.c   | 14 
 fs/btrfs/send.c| 57 ++--
 fs/btrfs/transaction.c | 45 --
 fs/btrfs/transaction.h |  1 +
 9 files changed, 77 insertions(+), 187 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 860f4f2..0be0e94 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -329,7 +329,10 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
*fs_info,
goto out;
}
 
-   root_level = btrfs_old_root_level(root, time_seq);
+   if (path->search_commit_root)
+   root_level = btrfs_header_level(root->commit_root);
+   else
+   root_level = btrfs_old_root_level(root, time_seq);
 
if (root_level + 1 == level) {
srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -1092,9 +1095,9 @@ static int btrfs_find_all_leafs(struct btrfs_trans_handle 
*trans,
  *
  * returns 0 on success, < 0 on error.
  */
-int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info, u64 bytenr,
-   u64 time_seq, struct ulist **roots)
+static int __btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info, u64 bytenr,
+ u64 time_seq, struct ulist **roots)
 {
struct ulist *tmp;
struct ulist_node *node = NULL;
@@ -1130,6 +1133,20 @@ int btrfs_find_all_roots(struct btrfs_trans_handle 
*trans,
return 0;
 }
 
+int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr,
+u64 time_seq, struct ulist **roots)
+{
+   int ret;
+
+   if (!trans)
+   down_read(&fs_info->commit_root_sem);
+   ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
+   if (!trans)
+   up_read(&fs_info->commit_root_sem);
+   return ret;
+}
+
 /*
  * this makes the path point to (inum INODE_ITEM ioff)
  */
@@ -1509,6 +1526,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
if (IS_ERR(trans))
return PTR_ERR(trans);
btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
+   } else {
+   down_read(&fs_info->commit_root_sem);
}
 
ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
@@ -1519,8 +1538,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 
ULIST_ITER_INIT(&ref_uiter);
while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
-   ret = btrfs_find_all_roots(trans, fs_info, ref_node->val,
-  tree_mod_seq_elem.seq, &roots);
+   ret = __btrfs_find_all_roots(trans, fs_info, ref_node->val,
+tree_mod_seq_elem.seq, &roots);
if (ret)
break;
ULIST_ITER_INIT(&root_uiter);
@@ -1542,6 +1561,8 @@ out:
if (!search_commit_root) {
btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
btrfs_end_transaction(trans, fs_info->extent_root);
+   } else {
+   up_read(&fs_info->commit_root_sem);
}
 
return ret;
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..9d89c16 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5360,7 +5360,6 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 {
int ret;
int cmp;
-   struct btrfs_trans_handle *trans = NULL;
struct btrfs_path *left_p

Re: Incremental backup for a raid1

2014-03-13 Thread Hugo Mills
On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
> 
> My backup use case is different from what has been recently 
> discussed in another thread. I'm trying to guard against hardware 
> failure and other causes of destruction.
> 
> I have a btrfs raid1 filesystem spread over two disks. I want to backup 
> this filesystem regularly and efficiently to an external disk (same 
> model as the ones in the raid) in such a way that
> 
> * when one disk in the raid fails, I can substitute the backup and 
> rebalancing from the surviving disk to the substitute only applies the 
> missing changes.
> 
> * when the entire raid fails, I can re-build a new one from the backup.
> 
> The filesystem is mounted at its root and has several nested subvolumes 
> and snapshots (in a .snapshots subdir on each subvol).
> 
> Is it possible to do what I'm looking for?

   For point 2, yes. (Add new disk, balance -oconvert from single to
raid1).
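
   As a rough sketch (the mount point and device names are placeholders,
not a recommendation), that point-2 rebuild would look something like:

  mount /dev/sdX /mnt                # the surviving backup disk
  btrfs device add /dev/sdY /mnt     # the new, empty disk
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt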

   For point 1, not really. It's a different filesystem, so it'll have
a different UUID. You *might* be able to get away with rsync of one of
the block devices in the array to the backup block device, but you'd
have to unmount the FS (or halt all writes to it) for the period of
the rsync to ensure a consistent image, and the rsync would have to
read all the data in the device being synced to work out what to send.
Probably not what you want.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Do not meddle in the affairs of system administrators,  for ---   
  they are subtle,  and quick to anger.  




Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig

My backup use case is different from what has been recently 
discussed in another thread. I'm trying to guard against hardware 
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to backup 
this filesystem regularly and efficiently to an external disk (same 
model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup and 
rebalancing from the surviving disk to the substitute only applies the 
missing changes.

* when the entire raid fails, I can re-build a new one from the backup.

The filesystem is mounted at its root and has several nested subvolumes 
and snapshots (in a .snapshots subdir on each subvol).

Is it possible to do what I'm looking for?

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: Testing BTRFS

2014-03-13 Thread Lists

On 03/10/2014 06:02 PM, Avi Miller wrote:

Oracle Linux 6 with the Unbreakable Enterprise Kernel Release 2 or Release 3 
has production-ready btrfs support. You can even convert your existing CentOS6 
boxes across to Oracle Linux 6 in-place without reinstalling:

http://linux.oracle.com/switch/centos/

Oracle also now provides all errata, including security and bug fixes for free 
athttp://public-yum.oracle.com  and our kernel source code can be found 
athttps://oss.oracle.com/git/


Is there any issue with BTRFS and 32 bit O/S like with ZFS?

-Ben


Re: Understanding btrfs and backups

2014-03-13 Thread Chris Murphy

On Mar 7, 2014, at 7:03 AM, Eric Mesa  wrote:
> 
> Duncan - thanks for this comprehensive explanation. For a huge portion of
> your reply...I was all wondering why you and others were saying snapshots
> aren't backups. They certainly SEEMED like backups. But now I see that the
> problem is one of precise terminology vs colloquialisms. In other words,
> snapshots are not backups in and of themselves. They are like Mac's Time
> Machine. BUT if you take these snapshots and then put them on another media
> - whether that's local or not - THEN you have backups. Am I right, or am I
> still missing something subtle?

Hmm, yes, because snapshots on a mirrored drive are on other media but that's 
still not considered a backup. I think what makes a backup is a separate device 
and a separate file system. That's because the top vectors for data loss are: 
user induced, device failure, and file system corruption. These are 
substantially mitigated by having the backup files on both a separate file 
system and a separate device.

Also, Time Machine qualifies as a backup because it copies files to a separate 
device with a separate file system. (There is a feature in recent OS X versions 
that store hourly incremental backups on the local drive when the usual target 
device isn't available - these are arguably not backups but rather snapshots 
that are pending backups. Once the target device is available, the snapshots 
are copied over to it.)

If you have data you feel is really important, my suggestion is that you have a 
completely different backup/restore method than what you're talking about. It 
needs to be bullet proof, well tested. And consider all the Btrfs send/receive 
work you're doing as testing/work-in-progress. There are still cases on the 
list where people have had problems with send/receive, and both the send and 
receive code have a lot of churn, so I don't know that anyone can definitively 
tell you that a btrfs send/receive only based backup is going to reliably 
restore in one month let alone three years. Should it? Yes of course. Will it?
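
For reference, the bare-bones shape of such a send/receive based backup 
(paths are placeholders, and this is a sketch rather than a vetted 
procedure) is:

  # initial full copy
  btrfs subvolume snapshot -r /data /data/.snapshots/day1
  btrfs send /data/.snapshots/day1 | btrfs receive /backup

  # later, incremental against the previous snapshot
  btrfs subvolume snapshot -r /data /data/.snapshots/day2
  btrfs send -p /data/.snapshots/day1 /data/.snapshots/day2 | btrfs receive /backup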


Chris Murphy



Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Steven Whitehouse
Hi,

On Thu, 2014-03-13 at 17:23 +0100, Jan Kara wrote:
> On Thu 13-03-14 10:20:56, Ted Tso wrote:
> > Previously, the no-op "mount -o mount /dev/xxx" operation when the
>   ^^remount
> 
> > file system is already mounted read-write causes an implied,
> > unconditional syncfs().  This seems pretty stupid, and it's certainly not
> > documented or guaranteed to do this, nor is it particularly useful,
> > except in the case where the file system was mounted rw and is getting
> > remounted read-only.
> > 
> > However, it's possible that there might be some file systems that are
> > actually depending on this behavior.  In most file systems, it's
> > probably fine to only call sync_filesystem() when transitioning from
> > read-write to read-only, and there are some file systems where this is
> > not needed at all (for example, for a pseudo-filesystem or something
> > like romfs).
>   Hum, I'd avoid this exercise at least for filesystems where
> sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
> also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
> efs, freevxfs, romfs, squashfs. I think you can find a couple more which
> clearly don't care about sync_filesystem() if you look a bit closer.
> 
>
>   Honza

I guess the same is true for other file systems which are mounted ro
too. So maybe a check for MS_RDONLY before doing the sync in those
cases?

Steve.




Re: [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Jan Kara
On Thu 13-03-14 10:20:56, Ted Tso wrote:
> Previously, the no-op "mount -o mount /dev/xxx" operation when the
  ^^remount

> file system is already mounted read-write causes an implied,
> unconditional syncfs().  This seems pretty stupid, and it's certainly not
> documented or guaranteed to do this, nor is it particularly useful,
> except in the case where the file system was mounted rw and is getting
> remounted read-only.
> 
> However, it's possible that there might be some file systems that are
> actually depending on this behavior.  In most file systems, it's
> probably fine to only call sync_filesystem() when transitioning from
> read-write to read-only, and there are some file systems where this is
> not needed at all (for example, for a pseudo-filesystem or something
> like romfs).
  Hum, I'd avoid this exercise at least for filesystems where
sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
efs, freevxfs, romfs, squashfs. I think you can find a couple more which
clearly don't care about sync_filesystem() if you look a bit closer.

Honza
> 
> Signed-off-by: "Theodore Ts'o" 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Christoph Hellwig 
> Cc: Artem Bityutskiy 
> Cc: Adrian Hunter 
> Cc: Evgeniy Dushistov 
> Cc: Jan Kara 
> Cc: OGAWA Hirofumi 
> Cc: Anders Larsen 
> Cc: Phillip Lougher 
> Cc: Kees Cook 
> Cc: Mikulas Patocka 
> Cc: Petr Vandrovec 
> Cc: x...@oss.sgi.com
> Cc: linux-btrfs@vger.kernel.org
> Cc: linux-c...@vger.kernel.org
> Cc: samba-techni...@lists.samba.org
> Cc: codal...@coda.cs.cmu.edu
> Cc: linux-e...@vger.kernel.org
> Cc: linux-f2fs-de...@lists.sourceforge.net
> Cc: fuse-de...@lists.sourceforge.net
> Cc: cluster-de...@redhat.com
> Cc: linux-...@lists.infradead.org
> Cc: jfs-discuss...@lists.sourceforge.net
> Cc: linux-...@vger.kernel.org
> Cc: linux-ni...@vger.kernel.org
> Cc: linux-ntfs-...@lists.sourceforge.net
> Cc: ocfs2-de...@oss.oracle.com
> Cc: reiserfs-de...@vger.kernel.org
> ---
>  fs/adfs/super.c  | 1 +
>  fs/affs/super.c  | 1 +
>  fs/befs/linuxvfs.c   | 1 +
>  fs/btrfs/super.c | 1 +
>  fs/cifs/cifsfs.c | 1 +
>  fs/coda/inode.c  | 1 +
>  fs/cramfs/inode.c| 1 +
>  fs/debugfs/inode.c   | 1 +
>  fs/devpts/inode.c| 1 +
>  fs/efs/super.c   | 1 +
>  fs/ext2/super.c  | 1 +
>  fs/ext3/super.c  | 2 ++
>  fs/ext4/super.c  | 2 ++
>  fs/f2fs/super.c  | 2 ++
>  fs/fat/inode.c   | 2 ++
>  fs/freevxfs/vxfs_super.c | 1 +
>  fs/fuse/inode.c  | 1 +
>  fs/gfs2/super.c  | 2 ++
>  fs/hfs/super.c   | 1 +
>  fs/hfsplus/super.c   | 1 +
>  fs/hpfs/super.c  | 2 ++
>  fs/isofs/inode.c | 1 +
>  fs/jffs2/super.c | 1 +
>  fs/jfs/super.c   | 1 +
>  fs/minix/inode.c | 1 +
>  fs/ncpfs/inode.c | 1 +
>  fs/nfs/super.c   | 2 ++
>  fs/nilfs2/super.c| 1 +
>  fs/ntfs/super.c  | 2 ++
>  fs/ocfs2/super.c | 2 ++
>  fs/openpromfs/inode.c| 1 +
>  fs/proc/root.c   | 2 ++
>  fs/pstore/inode.c| 1 +
>  fs/qnx4/inode.c  | 1 +
>  fs/qnx6/inode.c  | 1 +
>  fs/reiserfs/super.c  | 1 +
>  fs/romfs/super.c | 1 +
>  fs/squashfs/super.c  | 1 +
>  fs/super.c   | 2 --
>  fs/sysv/inode.c  | 1 +
>  fs/ubifs/super.c | 1 +
>  fs/udf/super.c   | 1 +
>  fs/ufs/super.c   | 1 +
>  fs/xfs/xfs_super.c   | 1 +
>  44 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/adfs/super.c b/fs/adfs/super.c
> index 7b3003c..952aeb0 100644
> --- a/fs/adfs/super.c
> +++ b/fs/adfs/super.c
> @@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char 
> *options)
>  
>  static int adfs_remount(struct super_block *sb, int *flags, char *data)
>  {
> + sync_filesystem(sb);
>   *flags |= MS_NODIRATIME;
>   return parse_options(sb, data);
>  }
> diff --git a/fs/affs/super.c b/fs/affs/super.c
> index d098731..3074530 100644
> --- a/fs/affs/super.c
> +++ b/fs/affs/super.c
> @@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char 
> *data)
>  
>   pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
>  
> + sync_filesystem(sb);
>   *flags |= MS_NODIRATIME;
>  
>   memcpy(volume, sbi->s_volume, 32);
> diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
> index 845d2d6..56d70c8 100644
> --- a/fs/befs/linuxvfs.c
> +++ b/fs/befs/linuxvfs.c
> @@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int 
> silent)
>  static int
>  befs_remount(struct super_block *sb, int *flags, char *data)
>  {
> + sync_filesystem(sb);
>   if (!(*flags & MS_RDONLY))
>   return -EINVAL;
>   return 0;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9

[PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts'o
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly not
documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" 
Cc: linux-fsde...@vger.kernel.org
Cc: Christoph Hellwig 
Cc: Artem Bityutskiy 
Cc: Adrian Hunter 
Cc: Evgeniy Dushistov 
Cc: Jan Kara 
Cc: OGAWA Hirofumi 
Cc: Anders Larsen 
Cc: Phillip Lougher 
Cc: Kees Cook 
Cc: Mikulas Patocka 
Cc: Petr Vandrovec 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-c...@vger.kernel.org
Cc: samba-techni...@lists.samba.org
Cc: codal...@coda.cs.cmu.edu
Cc: linux-e...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: fuse-de...@lists.sourceforge.net
Cc: cluster-de...@redhat.com
Cc: linux-...@lists.infradead.org
Cc: jfs-discuss...@lists.sourceforge.net
Cc: linux-...@vger.kernel.org
Cc: linux-ni...@vger.kernel.org
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: ocfs2-de...@oss.oracle.com
Cc: reiserfs-de...@vger.kernel.org
---
 fs/adfs/super.c  | 1 +
 fs/affs/super.c  | 1 +
 fs/befs/linuxvfs.c   | 1 +
 fs/btrfs/super.c | 1 +
 fs/cifs/cifsfs.c | 1 +
 fs/coda/inode.c  | 1 +
 fs/cramfs/inode.c| 1 +
 fs/debugfs/inode.c   | 1 +
 fs/devpts/inode.c| 1 +
 fs/efs/super.c   | 1 +
 fs/ext2/super.c  | 1 +
 fs/ext3/super.c  | 2 ++
 fs/ext4/super.c  | 2 ++
 fs/f2fs/super.c  | 2 ++
 fs/fat/inode.c   | 2 ++
 fs/freevxfs/vxfs_super.c | 1 +
 fs/fuse/inode.c  | 1 +
 fs/gfs2/super.c  | 2 ++
 fs/hfs/super.c   | 1 +
 fs/hfsplus/super.c   | 1 +
 fs/hpfs/super.c  | 2 ++
 fs/isofs/inode.c | 1 +
 fs/jffs2/super.c | 1 +
 fs/jfs/super.c   | 1 +
 fs/minix/inode.c | 1 +
 fs/ncpfs/inode.c | 1 +
 fs/nfs/super.c   | 2 ++
 fs/nilfs2/super.c| 1 +
 fs/ntfs/super.c  | 2 ++
 fs/ocfs2/super.c | 2 ++
 fs/openpromfs/inode.c| 1 +
 fs/proc/root.c   | 2 ++
 fs/pstore/inode.c| 1 +
 fs/qnx4/inode.c  | 1 +
 fs/qnx6/inode.c  | 1 +
 fs/reiserfs/super.c  | 1 +
 fs/romfs/super.c | 1 +
 fs/squashfs/super.c  | 1 +
 fs/super.c   | 2 --
 fs/sysv/inode.c  | 1 +
 fs/ubifs/super.c | 1 +
 fs/udf/super.c   | 1 +
 fs/ufs/super.c   | 1 +
 fs/xfs/xfs_super.c   | 1 +
 44 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 7b3003c..952aeb0 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char 
*options)
 
 static int adfs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
return parse_options(sb, data);
 }
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d098731..3074530 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data)
 
pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
 
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
 
memcpy(volume, sbi->s_volume, 32);
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 845d2d6..56d70c8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int 
silent)
 static int
 befs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
if (!(*flags & MS_RDONLY))
return -EINVAL;
return 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..00cd0c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1381,6 +1381,7 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
 
+   sync_filesystem(sb);
btrfs_remount_prepare(fs_info);
 
ret = btrfs_parse_options(root, data);
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..4942c94 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -541,6 +541,7 @@ static int cifs_show_stats(struct seq_file *s, struct 
dentry *root)
 
 static int cifs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
return 0;
 }
diff 

Re: [PATCH] Btrfs: fix joining same transaction handle more than twice

2014-03-13 Thread Josef Bacik
On 03/13/2014 01:19 AM, Wang Shilong wrote:
> We hit something like the following function call flows:
> 
> |->run_delalloc_range()
>  |->btrfs_join_transaction()
>|->cow_file_range()
>  |->btrfs_join_transaction()
>|->find_free_extent()
>  |->btrfs_join_transaction()
> 
> Trace information can be seen as:
> 
> [ 7411.127040] [ cut here ]
> [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 
> start_transaction+0x561/0x580 [btrfs]()
> [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G   O 
> 3.13.0+ #4
> [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
> [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
> [ 7411.127092] Call Trace:
> [ 7411.127097]  [] dump_stack+0x45/0x56
> [ 7411.127101]  [] warn_slowpath_common+0x7d/0xa0
> [ 7411.127102]  [] warn_slowpath_null+0x1a/0x20
> [ 7411.127109]  [] start_transaction+0x561/0x580 [btrfs]
> [ 7411.127115]  [] btrfs_join_transaction+0x17/0x20 [btrfs]
> [ 7411.127120]  [] find_free_extent+0xa21/0xb50 [btrfs]
> [ 7411.127126]  [] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
> [ 7411.127131]  [] btrfs_alloc_free_block+0xee/0x440 [btrfs]
> [ 7411.127137]  [] ? btree_set_page_dirty+0xe/0x10 [btrfs]
> [ 7411.127142]  [] __btrfs_cow_block+0x121/0x530 [btrfs]
> [ 7411.127146]  [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
> [ 7411.127151]  [] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
> [ 7411.127157]  [] btrfs_lookup_file_extent+0x37/0x40 
> [btrfs]
> [ 7411.127163]  [] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
> [ 7411.127169]  [] ? start_transaction+0x93/0x580 [btrfs]
> [ 7411.127171]  [] ? kmem_cache_alloc+0x132/0x140
> [ 7411.127176]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
> [ 7411.127182]  [] cow_file_range_inline+0x181/0x2e0 [btrfs]
> [ 7411.127187]  [] cow_file_range+0x2ed/0x440 [btrfs]
> [ 7411.127194]  [] ? free_extent_buffer+0x4f/0xb0 [btrfs]
> [ 7411.127200]  [] run_delalloc_nocow+0x38f/0xa60 [btrfs]
> [ 7411.127207]  [] ? test_range_bit+0x30/0x180 [btrfs]
> [ 7411.127212]  [] run_delalloc_range+0x2e8/0x350 [btrfs]
> [ 7411.127219]  [] ? find_lock_delalloc_range+0x1a9/0x1e0 
> [btrfs]
> [ 7411.127222]  [] ? blk_queue_bio+0x2c1/0x330
> [ 7411.127228]  [] __extent_writepage+0x2f4/0x760 [btrfs]
> 
> Here we fix it by avoiding joining transaction again if we have held
> a transaction handle when allocating chunk in find_free_extent().
> 
>

So I just put that warning there to see if we were ever embedding 3
joins at a time, not because it was an actual problem. I'd say just kill
the warning.  Thanks,

Josef



Re: [PATCH v2 3/3] xfstests/btrfs: add stress test for btrfs quota operations

2014-03-13 Thread Josef Bacik
On 03/12/2014 11:12 PM, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:48:43PM -0400, Josef Bacik wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> On 03/09/2014 11:44 PM, Wang Shilong wrote:
>>> So this is a stress test for btrfs quota operations. it can also 
>>> detect the following commit fixed problem:
>>>
>>> 4082bd3d73(Btrfs: fix oops when writting dirty qgroups to disk)
>>>
>>> Signed-off-by: Wang Shilong  --- 
>>> v1->v2: switch into new helper _run_btrfs_util_prog() --- 
>>> tests/btrfs/043 | 76
>>> + 
>>> tests/btrfs/043.out |  2 ++ tests/btrfs/group   |  1 + 3 files
>>> changed, 79 insertions(+) create mode 100755 tests/btrfs/043 create
>>> mode 100644 tests/btrfs/043.out
>>>
>>> diff --git a/tests/btrfs/043 b/tests/btrfs/043 new file mode
>>> 100755 index 000..d6c4bf3 --- /dev/null +++ b/tests/btrfs/043 
>>> @@ -0,0 +1,76 @@ +#! /bin/bash +# FS QA Test No. 043 +# +#
>>> stresstest for btrfs quota operations. we run fsstress and quota +#
>>> operations concurrently. +# 
>>> +#---
>>>
>>>
>> +# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
>>> +# +# This program is free software; you can redistribute it
>>> and/or +# modify it under the terms of the GNU General Public
>>> License as +# published by the Free Software Foundation. +# +# This
>>> program is distributed in the hope that it would be useful, +# but
>>> WITHOUT ANY WARRANTY; without even the implied warranty of +#
>>> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the +#
>>> GNU General Public License for more details. +# +# You should have
>>> received a copy of the GNU General Public License +# along with
>>> this program; if not, write the Free Software Foundation, +# Inc.,
>>> 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA 
>>> +#---
>>>
>>>
>> +#
>>> + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output
>>> created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1   # failure is
>>> the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + 
>>> +_cleanup() +{ +cd / +  rm -f $tmp.* +} + +# get standard
>>> environment, filters and checks +. ./common/rc +. ./common/filter 
>>> + +_supported_fs btrfs +_supported_os Linux +_require_scratch + +rm
>>> -f $seqres.full + +_quota_enabled_background() +{ + i=1 +   while [
>>> $i -le 5 ] +do +_run_btrfs_util_prog quota enable 
>>> $SCRATCH_MNT +
>>> _run_btrfs_util_prog quota disable $SCRATCH_MNT +   i=$(($i+1)) +
>>> sleep 1 +   done +} + +MKFS_SIZE=$((1024 * 1024 * 1024)) +run_check
>>> _scratch_mkfs_sized $MKFS_SIZE +run_check _scratch_mount + 
>>> +_quota_enabled_background & +run_check $FSSTRESS_PROG -d
>>> $SCRATCH_MNT -w -p 5 -n 1000 \ +$FSSTRESS_AVOID + +run_check
>>> _scratch_unmount +_check_scratch_fs +
>>
>> You should probably be doing something to make sure the background
>> quota stuff exits properly before your script exits, my fio box can
>> run the fsstress in way less than 5 seconds.  Thanks,
> 
> josef - you might want to have a look at what your mailer is doing
> to quoted email and fix it... ;)

Eesh I think it's enigmail, I'll turn it off.  Thanks,

Josef
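
For what it's worth, one way to do that (a sketch against the quoted test, 
not the final version) is to record the background job's PID and wait for 
it before unmounting:

  _quota_enabled_background &
  quota_pid=$!
  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 $FSSTRESS_AVOID
  wait $quota_pid             # make sure the enable/disable loop has finished
  run_check _scratch_unmount
  _check_scratch_fs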



[PATCH RESEND] xfstests: add test for btrfs send issuing premature rmdir operations

2014-03-13 Thread Filipe David Borba Manana
Regression test for btrfs incremental send issue where a rmdir instruction
is sent against an orphan directory inode which is not empty yet, causing
btrfs receive to fail when it attempts to remove the directory.

This issue is fixed by the following linux kernel btrfs patch:

Btrfs: fix send attempting to rmdir non-empty directories

Signed-off-by: Filipe David Borba Manana 
Reviewed-by: Josef Bacik 
---

Resending since Dave Chinner asked to do it for any patches he might have
missed in his last merge.

 tests/btrfs/043 |  149 +++
 tests/btrfs/043.out |1 +
 tests/btrfs/group   |1 +
 3 files changed, 151 insertions(+)
 create mode 100644 tests/btrfs/043
 create mode 100644 tests/btrfs/043.out

diff --git a/tests/btrfs/043 b/tests/btrfs/043
new file mode 100644
index 000..b1fef96
--- /dev/null
+++ b/tests/btrfs/043
@@ -0,0 +1,149 @@
+#! /bin/bash
+# FS QA Test No. btrfs/043
+#
+# Regression test for btrfs incremental send issue where a rmdir instruction
+# is sent against an orphan directory inode which is not empty yet, causing
+# btrfs receive to fail when it attempts to remove the directory.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send attempting to rmdir non-empty directories
+#
+#---
+# Copyright (c) 2014 Filipe Manana.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+mkdir -p $SCRATCH_MNT/a/b
+mkdir $SCRATCH_MNT/0
+mkdir $SCRATCH_MNT/1
+mkdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/0 $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/1 $SCRATCH_MNT/a/b/c
+echo 'ola mundo' > $SCRATCH_MNT/a/b/c/foo.txt
+mkdir $SCRATCH_MNT/a/b/c/x
+mkdir $SCRATCH_MNT/a/b/c/x2
+mkdir $SCRATCH_MNT/a/b/y
+mkdir $SCRATCH_MNT/a/b/z
+mkdir -p $SCRATCH_MNT/a/b/d1/d2/d3
+mkdir $SCRATCH_MNT/a/b/d4
+
+# Filesystem looks like:
+#
+# .(ino 256)
+# |-- a/   (ino 257)
+# |-- b/   (ino 258)
+# |-- c/   (ino 261)
+# |   |-- foo.txt  (ino 262)
+# |   |-- 0/   (ino 259)
+# |   |-- 1/   (ino 260)
+# |   |-- x/   (ino 263)
+# |   |-- x2/  (ino 264)
+# |
+# |-- y/   (ino 265)
+# |-- z/   (ino 266)
+# |-- d1/  (ino 267)
+# |   |-- d2/  (ino 268)
+# |   |-- d3/  (ino 269)
+# |
+# |-- d4/  (ino 270)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+rm -f $SCRATCH_MNT/a/b/c/foo.txt
+mv $SCRATCH_MNT/a/b/y $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/z $SCRATCH_MNT/a
+mv $SCRATCH_MNT/a/b/c/x $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/c/0 $SCRATCH_MNT/a/b/YY/00
+mv $SCRATCH_MNT/a/b/c/x2 $SCRATCH_MNT/a/z/X_2
+mv $SCRATCH_MNT/a/b/c/1 $SCRATCH_MNT/a/z/X_2
+rmdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/a/b/d4 $SCRATCH_MNT/a/d44
+mv $SCRATCH_MNT/a/b/d1/d2 $SCRATCH_MNT/a/d44
+rmdir $SCRATCH_MNT/a/b/d1
+
+# Filesystem now looks like:
+#
+# .(ino 256)
+# |-- a/   (ino 257)
+# |-- b/   (ino 258)
+# |   |-- YY/  (ino 265)
+# ||-- x/  (ino 263)
+# ||-- 00/ (ino 259)
+# |
+# |-- z/   (ino 266)
+# |   |-- X_2/ (ino 264)
+# ||-- 1/  (ino 260)
+# |
+# |-- d44/ (ino 270)
+#  |-- d2/ (ino 268)
+#  |-- d3/ (ino 269)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
+   $SCRATCH_MNT/mysnap2
+
+_run_btrfs_util_

[PATCH RESEND] xfstests: add regression test for btrfs incremental send

2014-03-13 Thread Filipe David Borba Manana
Regression test for a btrfs incremental send issue where invalid paths for
utimes, chown and chmod operations were sent to the send stream, causing
btrfs receive to fail.

If a directory had a move/rename operation delayed, and none of its parent
directories, except for the immediate one, had delayed move/rename operations,
after processing the directory's references, the incremental send code would
issue invalid paths for utimes, chown and chmod operations.

This issue is fixed by the following linux kernel btrfs patch:

Btrfs: fix send issuing outdated paths for utimes, chown and chmod

Signed-off-by: Filipe David Borba Manana 
Reviewed-by: Josef Bacik 
---

Resending since Dave Chinner asked to do it for any patches he might have
missed in his last merge.

Originally submitted with the title:
"xfstests: add test btrfs/042 for btrfs incremental send"

 tests/btrfs/044 |  129 +++
 tests/btrfs/044.out |1 +
 tests/btrfs/group   |1 +
 3 files changed, 131 insertions(+)
 create mode 100644 tests/btrfs/044
 create mode 100644 tests/btrfs/044.out

diff --git a/tests/btrfs/044 b/tests/btrfs/044
new file mode 100644
index 000..dae189e
--- /dev/null
+++ b/tests/btrfs/044
@@ -0,0 +1,129 @@
+#! /bin/bash
+# FS QA Test No. btrfs/044
+#
+# Regression test for a btrfs incremental send issue where under certain
+# scenarios invalid paths for utimes, chown and chmod operations were sent
+# to the send stream, causing btrfs receive to fail.
+#
+# If a directory had a move/rename operation delayed, and none of its parent
+# directories, except for the immediate one, had delayed move/rename 
operations,
+# after processing the directory's references, the incremental send code would
+# issue invalid paths for utimes, chown and chmod operations.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send issuing outdated paths for utimes, chown and chmod
+#
+#---
+# Copyright (c) 2014 Filipe Manana.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+umask 0
+mkdir -p $SCRATCH_MNT/a/b/c/d/e
+mkdir $SCRATCH_MNT/a/b/c/f
+echo 'ola ' > $SCRATCH_MNT/a/b/c/d/e/file.txt
+chmod 0777 $SCRATCH_MNT/a/b/c/d/e
+
+# Filesystem looks like:
+#
+# .   (ino 256)
+# |-- a/  (ino 257)
+# |-- b/  (ino 258)
+# |-- c/  (ino 259)
+# |-- d/  (ino 260)
+# |   |-- e/  (ino 261)
+# |   |-- file.txt(ino 262)
+# |
+# |-- f/  (ino 263)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+echo 'mundo' >> $SCRATCH_MNT/a/b/c/d/e/file.txt
+mv $SCRATCH_MNT/a/b/c/d/e/file.txt $SCRATCH_MNT/a/b/c/d/e/file2.txt
+mv $SCRATCH_MNT/a/b/c/f $SCRATCH_MNT/a/b/f2
+mv $SCRATCH_MNT/a/b/c/d/e $SCRATCH_MNT/a/b/f2/e2
+mv $SCRATCH_MNT/a/b/c $SCRATCH_MNT/a/b/c2
+mv $SCRATCH_MNT/a/b/c2/d $SCRATCH_MNT/a/b/c2/d2
+chmod 0700 $SCRATCH_MNT/a/b/f2/e2
+
+# Filesystem now looks like:
+#
+# .  (ino 256)
+# |-- a/ (ino 257)
+# |-- b/ (ino 258)
+# |-- c2/(ino 259)
+# |   |-- d2/(ino 260)
+# |
+# |-- f2/(ino 263)
+# |-- e2 (ino 261)
# |-- file2.txt  (ino 262)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
+   $SCRATCH_MNT/mysnap2
+

Re: [PATCH 2/2] Btrfs-progs: mkfs: make sure we can deal with hard links with -r option

2014-03-13 Thread Wang Shilong

Hi Dave,

On 03/13/2014 12:21 AM, David Sterba wrote:

On Tue, Mar 11, 2014 at 06:29:09PM +0800, Wang Shilong wrote:

@@ -840,6 +833,10 @@ static int traverse_directory(struct btrfs_trans_handle 
*trans,
  cur_file->d_name, cur_inum,
  parent_inum, dir_index_cnt,
  &cur_inode);
+   if (ret == -EEXIST) {
+   BUG_ON(st.st_nlink <= 1);

As the mkfs operation is restartable, can we handle the error?
This should be a logic error which means an inode has hard links (but 
links <= 1). :-)


Adding error handling may be better, I will update it.

Thanks,
Wang


Otherwise, good fix, thanks.


+   continue;
+   }
if (ret) {
fprintf(stderr, "add_inode_items failed\n");
goto fail;
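
For context, the hard-link case this deals with can be exercised with 
something like the following (a sketch; the directory and device names 
are placeholders):

  mkdir -p /tmp/tree
  echo data > /tmp/tree/file
  ln /tmp/tree/file /tmp/tree/hardlink    # same inode, st_nlink is now 2
  mkfs.btrfs -r /tmp/tree /dev/sdX        # -r populates the new fs from /tmp/tree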






Re: Ordering of directory operations maintained across system crashes in Btrfs?

2014-03-13 Thread Goswin von Brederlow
On Mon, Mar 03, 2014 at 11:56:49AM -0600, thanumalayan mad wrote:
> Chris,
> 
> Great, thanks. Any guesses whether other filesystems (disk-based) do
> things similar to the last two examples you pointed out? Saying "we
> think 3 normal filesystems reorder stuff" seems to motivate
> application developers to fix bugs ...
> 
> Also, just for more information, the sequence we observed was,
> 
> Thread A:
> 
> unlink(foo)
> rename(somefile X, somefile Y)
> fsync(somefile Z)
> 
> The source and destination of the renamed file are unrelated to the
> fsync. But the rename happens in the fsync()'s transaction, while
> unlink() is delayed. I guess this has something to do with backrefs
> too.
> 
> Thanks,
> Thanu
> 
> On Mon, Mar 3, 2014 at 11:43 AM, Chris Mason  wrote:
> > On 02/25/2014 09:01 PM, thanumalayan mad wrote:
> >>
> >> Hi all,
> >>
> >> Slightly complicated question.
> >>
> >> Assume I do two directory operations in a Btrfs partition (such as an
> >> unlink() and a rename()), one after the other, and a crash happens
> >> after the rename(). Can Btrfs (the current version) send the second
> >> operation to the disk first, so that after the crash, I observe the
> >> effects of rename() but not the effects of the unlink()?
> >>
> >> I think I am observing Btrfs re-ordering an unlink() and a rename(),
> >> and I just want to confirm that my observation is true. Also, if Btrfs
> >> does send directory operations to disk out of order, is there some
> >> limitation on this? Like, is this restricted to only unlink() and
> >> rename()?
> >>
> >> I am looking at some (buggy) applications that use Btrfs, and this
> >> behavior seems to affect them.
> >
> >
> > There isn't a single answer for this one.
> >
> > You might have
> >
> > Thread A:
> >
> > unlink(foo);
> > rename(somefile, somefile2);
> > 
> >
> > This should always have the unlink happen before or in the same transaction
> > as the rename.
> >
> > Thread A:
> >
> > unlink(dirA/foo);
> > rename(dirB/somefile, dirB/somefile2);
> >
> > Here you're at the mercy of what is happening in dirB.  If someone fsyncs
> > that directory, it may hit the disk before the unlink.
> >
> > Thread A:
> >
> > unlink(foo);
> > rename(somefile, somefile2);
> > fsync(somefile);
> >
> > This one is even fuzzier.  Backrefs allow us to do some file fsyncs without
> > touching the directory, making it possible the unlink will hit disk after
> > the fsync.
> >
> > -chris

As I understand it, POSIX only guarantees that the in-core data is
updated by the syscalls in order. On crash anything can happen. If the
application needs something to be committed to disk then it needs to
fsync(). Specifically it needs to fsync() the changed files AND
directories.

From man fsync:

   Calling  fsync()  does  not  necessarily  ensure  that the entry in the
   directory containing the file has  also  reached  disk.   For  that  an
   explicit fsync() on a file descriptor for the directory is also needed.

So the fsync(somefile) above doesn't necessarily force the rename to
disk.


My experience with fuse tells me that at least fuse handles operations
in parallel and only blocks a later operation if it is affected by an
earlier operation. An unlink in one directory can (and will) run in
parallel to a rename in another directory. Then, depending on how
threads get scheduled, the rename can complete before the unlink.

My conclusion is that you need to fsync() the directory to ensure the
metadata update has made it to the disk if you require that. Otherwise
you have to be able to cope with (meta)data loss on crash.


Note: https://code.google.com/p/leveldb/issues/detail?id=189 talks a
lot about journaling and that any journaling filesystem should
preserve the order. I think that is rather pointless for two reasons:

1) The journal gets replayed after a crash so in whatever order the
two journal entries are written doesn't matter. They both make it to
disk. You can't see one without the other. This is assuming you
fsync()ed the dirs to force the metadata change into the journal in
the first place.

2) btrfs afaik doesn't have any journal since COW already guarantees
atomic updates and crash protection.


Overall I also think the fear of fsync() is overrated for this issue.
This would only happen on program start or whenever you open a
database. Not something that happens every second.

MfG
Goswin


Re: Understanding btrfs and backups

2014-03-13 Thread Chris Samuel
On Sun, 9 Mar 2014 03:30:44 PM Duncan wrote:

> While I realize that was in reference to the "up in flames" comment and 
> presumably if there's a need to worry about that, offsite backup /is/ of 
> some value, for some people, offsite backup really isn't that valuable.

Actually I missed that comment altogether, it was really just an illustration 
of why people should think about it - and then come to a decision about 
whether or not it makes sense for them.

In your case maybe not, but for me (and my wife) it certainly does.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


