Re: question regarding caching

2014-01-03 Thread Sander
Austin S Hemmelgarn wrote (ao):
 The data is probably still cached in the block layer, so after
 unmounting, you could try 'echo 1 > /proc/sys/vm/drop_caches' before
 mounting again, but make sure to run sync right before doing that,
 otherwise you might lose data.

Lose data? Where did you get this from?

Sander


Re: [PATCH v2 01/11] btrfs: Add barrier option to support -o remount,barrier

2014-01-03 Thread Mike Fleetwood
On 3 January 2014 06:10, Qu Wenruo quwen...@cn.fujitsu.com wrote:
 Btrfs can be remounted without barrier, but there is no barrier option
 so nobody can remount btrfs back with barrier on. Only umount and
 mount again can re-enable barrier. (Quite awkward.)

 Also, the mount options documentation is changed slightly for the
 further pairing-options changes.

 Reported-by: Daniel Blueman dan...@quora.org
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 Cc: David Sterba dste...@suse.cz
 ---
 changelog:
 v1: Add barrier option
 v2: Change the document style to fit pairing options better
 ---
  Documentation/filesystems/btrfs.txt | 13 +++--
  fs/btrfs/super.c|  8 +++-
  2 files changed, 14 insertions(+), 7 deletions(-)

 diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt
 index 5dd282d..2d2e016 100644
 --- a/Documentation/filesystems/btrfs.txt
 +++ b/Documentation/filesystems/btrfs.txt
 @@ -38,7 +38,7 @@ Mount Options
  =

  When mounting a btrfs filesystem, the following option are accepted.
 -Unless otherwise specified, all options default to off.
 +Options with (*) are default options and will not show in the mount options.

alloc_start=bytes
 Debugging option to force all block allocations above a certain
 @@ -138,12 +138,13 @@ Unless otherwise specified, all options default to off.
 Disable support for Posix Access Control Lists (ACLs).  See the
 acl(5) manual page for more information about ACLs.

 +  barrier(*)
nobarrier
 -Disables the use of block layer write barriers.  Write barriers ensure
 -   that certain IOs make it through the device cache and are on persistent
 -   storage.  If used on a device with a volatile (non-battery-backed)
 -   write-back cache, this option will lead to filesystem corruption on a
 -   system crash or power loss.
 +Disable/enable the use of block layer write barriers.  Write barriers

Please use
  Enable/Disable ...
to match the order of the options, barrier(*) then nobarrier, immediately above.

 +   ensure that certain IOs make it through the device cache and are on
 +   persistent storage. If used on a device with a volatile

And:
  ...  If disabled on a device with a volatile
to make more sense when both enable and disable options are listed.

 +   (non-battery-backed) write-back cache, this option will lead to
 +   filesystem corruption on a system crash or power loss.

nodatacow
 Disable data copy-on-write for newly created files.  Implies nodatasum,
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index e9c13fb..fe9d8a6 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -323,7 +323,7 @@ enum {
 Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 Opt_check_integrity, Opt_check_integrity_including_extent_data,
 Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 -   Opt_commit_interval,
 +   Opt_commit_interval, Opt_barrier,
 Opt_err,
  };

 @@ -335,6 +335,7 @@ static match_table_t tokens = {
 {Opt_nodatasum, "nodatasum"},
 {Opt_nodatacow, "nodatacow"},
 {Opt_nobarrier, "nobarrier"},
 +   {Opt_barrier, "barrier"},
 {Opt_max_inline, "max_inline=%s"},
 {Opt_alloc_start, "alloc_start=%s"},
 {Opt_thread_pool, "thread_pool=%d"},
 @@ -494,6 +495,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 btrfs_clear_opt(info->mount_opt, SSD);
 btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
 break;
 +   case Opt_barrier:
 +   if (btrfs_test_opt(root, NOBARRIER))
 +   btrfs_info(root->fs_info, "turning on barriers");
 +   btrfs_clear_opt(info->mount_opt, NOBARRIER);
 +   break;
 case Opt_nobarrier:
 btrfs_info(root->fs_info, "turning off barriers");
 btrfs_set_opt(info->mount_opt, NOBARRIER);
 --
 1.8.5.2


Thanks,
Mike
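
For context, once this patch is applied the umount/mount cycle described in
the changelog can be avoided; a rough sketch (the mount point is only an
example):

    # turn barriers off at runtime
    mount -o remount,nobarrier /mnt/btrfs
    # and, with the new option, turn them back on without unmounting
    mount -o remount,barrier /mnt/btrfs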


Re: btrfs-transaction blocked for more than 120 seconds

2014-01-03 Thread Duncan
Kai Krakow posted on Fri, 03 Jan 2014 02:24:01 +0100 as excerpted:

 Duncan 1i5t5.dun...@cox.net wrote:
 
 But because a full balance rewrites everything anyway, it'll
 effectively defrag too.
 
 Is that really true? I thought it just rewrites each distinct extent and
 shuffles chunks around... This would mean it does not merge extents
 together.

While I'm not a coder and they're free to correct me if I'm wrong...

With a full balance, all chunks are rewritten, merging data (or metadata) 
into fewer chunks if possible, eliminating the then-unused chunks and 
returning the space they took to the unallocated pool.  (There are now 
options allowing one to balance only data, only metadata, or for that 
matter only system chunks, and to do other filtering, say to rebalance 
only chunks less than 10% used or only those not yet converted to a new 
raid level, if desired, but we're talking about a full balance here.)

Given that everything is being rewritten anyway, a process that can take 
hours or even days on multi-terabyte spinning rust filesystems, /not/ 
doing a file defrag as part of the process would be stupid.

So doing a separate defrag and balance isn't necessary.  And while we're 
at it, doing a separate scrub and balance isn't necessary, for the same 
reason.  (If one copy of the data is invalid and there's another, it'll 
be used for the rewrite and redup if necessary during the balance and the 
invalid copy will simply be erased.  If there's no valid copy, then there 
will be balance errors and I believe the chunks containing the bad data 
are simply not rewritten at all, tho the valid data from them might be 
rewritten, leaving only the bad data (I'm not sure which, on that), thus 
allowing the admin to try other tools to clean up or recover from the 
damage as necessary.)

That's one reason why the balance operation can take so much longer than 
a straight sequential read/write of the data might indicate, because it's 
doing all that extra work behind the scenes as well.

Tho I'm not sure that it defrags across chunks, particularly if a file's 
fragments reach across enough chunks that they'd not have been processed 
by the time a written chunk is full and the balance progresses to the 
next one.  However, given that data chunks are 1 GiB in size, that should 
still cut down a multi-thousand-extent file to perhaps a few dozen 
extents, one each per rewritten chunk.
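
For reference, the filtered variants mentioned above look roughly like this 
(the mount point and thresholds are only examples):

    # rewrite only data chunks that are less than 10% used
    btrfs balance start -dusage=10 /mnt
    # convert to a new profile, skipping chunks already converted (soft)
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
    # full balance, no filters
    btrfs balance start /mnt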

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



[PATCH] Btrfs: only fua the first superblock when writing supers

2014-01-03 Thread Wang Shilong
We only intend to fua the first superblock on every device, according to
the comments; fix it.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9417b73..b016657 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3142,7 +3142,10 @@ static int write_dev_supers(struct btrfs_device *device,
 * we fua the first super.  The others we allow
 * to go down lazy.
 */
-   ret = btrfsic_submit_bh(WRITE_FUA, bh);
+   if (i == 0)
+   ret = btrfsic_submit_bh(WRITE_FUA, bh);
+   else
+   ret = btrfsic_submit_bh(WRITE_SYNC, bh);
if (ret)
errors++;
}
-- 
1.8.3.1



Re: question regarding caching

2014-01-03 Thread Austin S Hemmelgarn
On 2014-01-03 03:39, Sander wrote:
 Austin S Hemmelgarn wrote (ao):
 The data is probably still cached in the block layer, so after 
 unmounting, you could try 'echo 1 > /proc/sys/vm/drop_caches'
 before mounting again, but make sure to run sync right before
 doing that, otherwise you might lose data.
 
 Lose data? Where did you get this from?
 
 Sander
 
Sorry, I misread the documentation; I thought it said destructive where it
really said non-destructive.
It's still a good idea to run sync before trying to clear the caches,
though, because dirty objects aren't freeable.
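
For the record, the sequence being discussed is (device and mount point are
only examples):

    umount /mnt/btrfs
    sync                                 # write out any remaining dirty data
    echo 1 > /proc/sys/vm/drop_caches    # drop the clean page cache (non-destructive)
    mount /dev/sdX /mnt/btrfs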


1be41b78: +18% increased btrfs write throughput

2014-01-03 Thread fengguang . wu
Hi Josef,

FYI. We are doing 0day performance tests and happen to notice that
btrfs write throughput increased considerably during v3.10-11 time
frame:

           v3.10             v3.11             v3.12             v3.13-rc6
---------------  ----------------  ----------------  ----------------
  50619 ~ 1%  +17.0%  59209 ~ 2%  +18.8%  60159 ~ 2%  +20.5%  61007 ~ 0%   lkp-ws02/micro/dd-write/11HDD-JBOD-cfq-btrfs-1dd
  50619       +17.0%  59209       +18.8%  60159       +20.5%  61007        TOTAL iostat.sdd.wkB/s

and it's contributed by 

commit 1be41b78bc688fc634bf30965d2be692c99fd11d
Author: Josef Bacik jba...@fusionio.com
AuthorDate: Wed Jun 12 13:56:06 2013 -0400
Commit: Josef Bacik jba...@fusionio.com
CommitDate: Mon Jul 1 08:52:28 2013 -0400

Btrfs: fix transaction throttling for delayed refs

Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram.  This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs.  What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run.  To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve.  With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster.  Thanks,

Reported-by: David Sterba dste...@suse.cz
Signed-off-by: Josef Bacik jba...@fusionio.com

 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/extent-tree.c | 61 ++
 fs/btrfs/transaction.c | 24 +---
 3 files changed, 69 insertions(+), 18 deletions(-)

 time.elapsed_time

   740 ++--*+
   *.. .*.*..**..*.*..*.|
   720 ++ *.*..*.*. *..*.*..*.  .*..*..*..*.  .*.   |
   |  *.  *..*  *.   *..*
   700 ++   |
   ||
   680 ++   |
   ||
   660 ++   |
   ||
   640 ++   |
   ||
   620 ++ O  O  |
   OO  O O  O O  O O  OO  O O  O O  O O  O  O O  O O  O O  O|
   600 ++---+


time.voluntary_context_switches

   10 ++--*-**--+
9 ++   *..*..*.*. *.   *..  |
  *. ..  *..*  .*.. .*..*.*.*..*.*..*.*..* .*
8 ++* **  +  .* |
7 ++   *.   |
  | |
6 ++|
5 ++|
4 ++|
  | |
3 ++|
2 ++|
  | |
1 ++|
0 O+O--O-O--O-O--O-O--O-O-O--O-O--O-O--O-O--O-O-O--O-O--O-O--O-O+


time.file_system_inputs

   80 ++**--+
  |*..*..*.*..*   *.   *..  |
   70 *+ ..  *..*  .*.. .*..*.*.*..*.*..*.*..* .*
  | * **  +  .* |
   60 ++

Re: 1be41b78: +18% increased btrfs write throughput

2014-01-03 Thread Chris Mason
On Fri, 2014-01-03 at 23:54 +0800, fengguang...@intel.com wrote:
 Hi Josef,
 
 FYI. We are doing 0day performance tests and happen to notice that
 btrfs write throughput increased considerably during v3.10-11 time
 frame:
 
            v3.10             v3.11             v3.12             v3.13-rc6
 ---------------  ----------------  ----------------  ----------------
   50619 ~ 1%  +17.0%  59209 ~ 2%  +18.8%  60159 ~ 2%  +20.5%  61007 ~ 0%   lkp-ws02/micro/dd-write/11HDD-JBOD-cfq-btrfs-1dd
   50619       +17.0%  59209       +18.8%  60159       +20.5%  61007        TOTAL iostat.sdd.wkB/s
 
 and it's contributed by 
 
 commit 1be41b78bc688fc634bf30965d2be692c99fd11d
 Author: Josef Bacik jba...@fusionio.com
 AuthorDate: Wed Jun 12 13:56:06 2013 -0400
 Commit: Josef Bacik jba...@fusionio.com
 CommitDate: Mon Jul 1 08:52:28 2013 -0400
 
 Btrfs: fix transaction throttling for delayed refs

Bonus points for increasing the performance on purpose.  Thanks for
running these Wu.

-chris



Status of raid5/6 in 2014?

2014-01-03 Thread Dave
Back in Feb 2013 there was quite a bit of press about the preliminary
raid5/6 implementation in Btrfs.  At the time it wasn't useful for
anything other than testing and it's my understanding that this is
still the case.

I've seen a few git commits and some chatter on this list but it would
appear the developers are largely silent.  Parity based raid would be
a powerful addition to the Btrfs feature stack and it's the feature I
most anxiously await.  Are there any milestones planned for 2014?

Keep up the good work...
-- 
-=[dave]=-

Entropy isn't what it used to be.


Re: [PATCH] Btrfs: only fua the first superblock when writing supers

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 06:22:57PM +0800, Wang Shilong wrote:
 We only intend to fua the first superblock on every device, according to
 the comments; fix it.

Good catch, this could gain some speedup as there are up to 2 fewer
flushes.

There's one thing that's different from current behaviour:
without this patch, all the superblocks are written with FUA; now only
the first one is, so my question is: what if the first fails and the others
succeed but do not get flushed immediately?

This is more of a theoretical scenario, and if the 1st superblock write
fails more serious problems can be expected. But let's say the write
error on the 1st is transient; do you or others think that it's reasonable
to try to write all the remaining sb's with FUA?


david


Re: [PATCH v2 03/11] btrfs: Add nocheck_int mount option.

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 02:10:26PM +0800, Qu Wenruo wrote:
 Add nocheck_int mount option to disable integrity check with
 remount option.
 
 + nocheck_int disables all the debug options above.

I think this option is not needed; the integrity checker is a
development functionality, used by people who know what they're
doing. Besides, this would need to clean up all the data structures that
the checker uses (see e.g. btrfsic_unmount, which is called only if the
mount option is used). I see little benefit compared to the amount of
work needed to make sure that disabling the checker functionality in the
middle works properly.

david


Re: [PATCH] Btrfs: only fua the first superblock when writing supers

2014-01-03 Thread Chris Mason
On Fri, 2014-01-03 at 18:03 +0100, David Sterba wrote:
 On Fri, Jan 03, 2014 at 06:22:57PM +0800, Wang Shilong wrote:
  We only intend to fua the first superblock on every device, according to
  the comments; fix it.
 
 Good catch, this could gain some speedup as there are up to 2 fewer
 flushes.
 
 There's one thing that's different from current behaviour:
 without this patch, all the superblocks are written with FUA; now only
 the first one is, so my question is: what if the first fails and the others
 succeed but do not get flushed immediately?
 
 This is more of a theoretical scenario, and if the 1st superblock write
 fails more serious problems can be expected. But let's say the write
 error on the 1st is transient; do you or others think that it's reasonable
 to try to write all the remaining sb's with FUA?

Not a bad idea: if we get a failure on the first SB, fua the others?  I
think it does make sense to do the others non-fua, just because they
only get used in emergencies anyway.

-chris



Re: [PATCH v2 00/11] btrfs: Add missing pairing mount options.

2014-01-03 Thread Eric Sandeen
On 1/3/14, 12:10 AM, Qu Wenruo wrote:
 Some options should be paired to support triggering different functions
 when remounting.
 
 This patchset adds these missing pairing mount options.

I think this really would benefit from a regression test which
ensures that every remount transition works properly...

Thanks,
-Eric
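
Something along these lines, perhaps (a rough sketch only; the device, mount
point and option list are assumptions, not an actual xfstests case):

    dev=/dev/sdX; mnt=/mnt/test
    mkfs.btrfs -f $dev
    mount $dev $mnt
    for opt in barrier autodefrag discard enospc_debug flushoncommit \
               acl datacow datasum treelog; do
        mount -o remount,no$opt $mnt || echo "remount,no$opt failed"
        mount -o remount,$opt $mnt   || echo "remount,$opt failed"
    done
    umount $mnt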

 changelog:
 v1: Initial commit with only barrier option
 v2: Add other missing pairing options
 
 Qu Wenruo (11):
   btrfs: Add barrier option to support -o remount,barrier
   btrfs: Add noautodefrag mount option.
   btrfs: Add nocheck_int mount option.
   btrfs: Add nodiscard mount option.
   btrfs: Add noenospc_debug mount option.
   btrfs: Add noflushoncommit mount option.
   btrfs: Add noinode_cache mount option.
   btrfs: Add acl mount option.
   btrfs: Add datacow mount option.
   btrfs: Add datasum mount option.
   btrfs: Add treelog mount option.
 
  Documentation/filesystems/btrfs.txt | 56 ++--
  fs/btrfs/super.c| 74 
 -
  2 files changed, 110 insertions(+), 20 deletions(-)
 



Re: [PATCH v2 07/11] btrfs: Add noinode_cache mount option.

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 02:10:30PM +0800, Qu Wenruo wrote:
 Add noinode_cache mount option to disable inode map cache with
 remount option.

This looks almost safe: there's a sync_filesystem called before the
filesystem's remount handler; the transaction gets committed and flushes
all the data related to inode_cache.

The caching thread keeps running, which is not a serious problem as
it'll finish at umount time, only consuming resources.

There's a window between sync_filesystem and the successful remount when the
INODE_MAP_CACHE bit is still set and the cache could be used to get a free ino;
then INODE_MAP_CACHE is cleared, but the ino cache remains and is not
synced back to disk (normally done from transaction commit via
btrfs_unpin_free_ino). I haven't checked whether something else prevents
that from happening.

I'd leave this patch out for now, it probably needs more code updates
than just unsetting the bit.

david


Re: [PATCH v2 00/11] btrfs: Add missing pairing mount options.

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 02:10:23PM +0800, Qu Wenruo wrote:
 Some options should be paired to support triggering different functions
 when remounting.
 This patchset adds these missing pairing mount options.

Thanks!

   btrfs: Add nocheck_int mount option.
   btrfs: Add noinode_cache mount option.

Commented separately, imho not to be merged in current state.

   btrfs: Add barrier option to support -o remount,barrier
   btrfs: Add noautodefrag mount option.
   btrfs: Add nodiscard mount option.
   btrfs: Add noenospc_debug mount option.
   btrfs: Add noflushoncommit mount option.
   btrfs: Add acl mount option.
   btrfs: Add datacow mount option.
   btrfs: Add datasum mount option.
   btrfs: Add treelog mount option.

All ok.

Reviewed-by: David Sterba dste...@suse.cz


Re: Question about ext4 conversion and leaf size

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 12:29:51AM +, Holger Hoffstätte wrote:
 Conversion from ext4 works really well and is an important step for
 adoption. After recently converting a large-ish device I noticed
 dodgy performance, even after defragment & rebalance; noticeably
 different from the quite good performance of a newly-created btrfs
 with 16k leaf size, as is the default since recently.
 
 So I went spelunking and found that the btrfs-convert logic indeed
 uses the ext4 block size as leaf size (from #2220):
 https://git.kernel.org/cgit/linux/kernel/git/mason/btrfs-progs.git/tree/btrfs-convert.c#n2245
 
 This is typically 4096 bytes and explains the observed performance.
 
 So while I'm basically familiar with btrfs's design, I know nothing
 about the details of the conversion (I'm amazed that it works so well,
 including rollback!) but can/should this not be updated to the new default
 of 16k, or is there a strong necessary correlation between the ext4 block
 size and the newly created btrfs?

The sectorsize has to be same for ext4 and btrfs, which is 4k
(PAGE_SIZE) nowadays. The btrfs metadata block is not limited by that.
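
(For comparison, a freshly created filesystem can be given the larger
metadata blocks explicitly; the device name is only an example:

    mkfs.btrfs -l 16384 -n 16384 /dev/sdX

which is what recent mkfs.btrfs does by default.)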

I've tried to implement the dumb & simple support for larger metadata
block some time ago

http://repo.or.cz/w/btrfs-progs-unstable/devel.git/commitdiff/337ac35f5a6ebeaee375329084b89ea4a868b4be?hp=704a08cb8ae8735f8538e637a1be822e76e69d3c

but the conversion did not work properly, and I haven't debugged that
further.


david


Re: [PATCH v2 5/6] Btrfs: use flags instead of the bool variants in delayed node

2014-01-03 Thread David Sterba
On Fri, Jan 03, 2014 at 05:27:51PM +0800, Miao Xie wrote:
 On Thu, 2 Jan 2014 18:49:55 +0100, David Sterba wrote:
  On Thu, Dec 26, 2013 at 01:07:05PM +0800, Miao Xie wrote:
   +#define BTRFS_DELAYED_NODE_IN_LIST 0
   +#define BTRFS_DELAYED_NODE_INODE_DIRTY 1
  +
   struct btrfs_delayed_node {
 u64 inode_id;
 u64 bytes_reserved;
  @@ -65,8 +68,7 @@ struct btrfs_delayed_node {
 struct btrfs_inode_item inode_item;
 atomic_t refs;
 u64 index_cnt;
  -  bool in_list;
  -  bool inode_dirty;
  +  unsigned long flags;
 int count;
   };
  
  What's the reason to do that? Replacing 2 bools with a bitfield
  does not seem justified, not from saving memory, nor from a performance
  gain side.  Also some of the bit operations imply the lock instruction
  prefix so this affects the surrounding items as well.
  
  I don't think this is needed, unless you have further plans with the
  flags item.
 
 Yes, I introduced a flag in the next patch.

That's still 3 bool flags that are quite independent and consume less
than the unsigned long anyway. Also the bool flags are something that
the compiler understands and can use during optimizations, unlike the
obfuscated bit access.

I don't mind using bitfields, but it imo starts to make sense to use
them when there are more than a few, like BTRFS_INODE_* or
EXTENT_BUFFER_*. The point of my objections is to establish good coding
patterns to follow.

david


Re: [block:for-3.14/core] kernel BUG at fs/bio.c:1748

2014-01-03 Thread Muthu Kumar
Looks like Kent missed the btrfs endio in the original commit. How
about this patch:

-

In btrfs_end_bio, call bio_endio_nodec on the restored bio so the
bi_remaining is accounted for correctly.

Reported-by: fengguang...@intel.com
Cc: Kent Overstreet k...@daterainc.com
CC: Jens Axboe ax...@kernel.dk
Signed-off-by: Muthukumar Ratty mut...@gmail.com


 fs/btrfs/volumes.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f2130de..edfed52 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5316,7 +5316,11 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
kfree(bbio);

-   bio_endio(bio, err);
+/*
+ * Call endio_nodec on the restored bio so the bi_remaining is
+ * accounted for correctly
+ */
+   bio_endio_nodec(bio, err);
} else if (!is_orig_bio) {
bio_put(bio);
}

On Wed, Jan 1, 2014 at 9:31 PM,  fengguang...@intel.com wrote:
 Greetings,

 We hit the below bug when doing write tests to btrfs.
 Other filesystems (ext4, xfs) works fine. 2 full dmesgs are attached.

 196d38bccfcfa32faed8c561868336fdfa0fe8e4 is the first bad commit
 commit 196d38bccfcfa32faed8c561868336fdfa0fe8e4
 Author: Kent Overstreet k...@daterainc.com
 AuthorDate: Sat Nov 23 18:34:15 2013 -0800
 Commit: Kent Overstreet k...@daterainc.com
 CommitDate: Sat Nov 23 22:33:56 2013 -0800

 block: Generic bio chaining

 This adds a generic mechanism for chaining bio completions. This is
 going to be used for a bio_split() replacement, and it turns out to be
 very useful in a fair amount of driver code - a fair number of drivers
 were implementing this in their own roundabout ways, often painfully.

  Note that this means it's no longer possible to call bio_endio() more than
  once on the same bio! This can cause problems for drivers that save/restore
 bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
 - in all but the simplest cases they'd be better off just cloning the
 bio, and immutable biovecs is making bio cloning cheaper. But for now,
 we add a bio_endio_nodec() for these cases.

 Signed-off-by: Kent Overstreet k...@daterainc.com
 Cc: Jens Axboe ax...@kernel.dk

  drivers/md/bcache/io.c   |  2 +-
  drivers/md/dm-cache-target.c |  6 
  drivers/md/dm-snap.c |  1 +
  drivers/md/dm-thin.c |  8 +++--
  drivers/md/dm-verity.c   |  2 +-
  fs/bio-integrity.c   |  2 +-
  fs/bio.c | 76 
 
  include/linux/bio.h  |  2 ++
  include/linux/blk_types.h|  2 ++
  9 files changed, 90 insertions(+), 11 deletions(-)

 [   35.466413] random: nonblocking pool is initialized
 [  196.918039] [ cut here ]
 [  196.919770] kernel BUG at fs/bio.c:1748!
 [  196.921505] invalid opcode:  [#1] SMP
 [  196.921788] Modules linked in: microcode processor
 [  196.921788] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
 3.13.0-rc6-01897-g2b48961 #1
 [  196.921788] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [  196.921788] task: 8804094acad0 ti: 8804094e8000 task.ti: 
 8804094e8000
 [  196.921788] RIP: 0010:[811ef01e]  [811ef01e] 
 bio_endio+0x1e/0x6a
 [  196.921788] RSP: 0018:88041fc83da8  EFLAGS: 00010046
 [  196.921788] RAX:  RBX: fffb RCX: 
 0001802a0002
 [  196.921788] RDX: 0001802a0003 RSI:  RDI: 
 8800299ff9e8
 [  196.921788] RBP: 88041fc83dc0 R08: ea00096cc980 R09: 
 8804097f5100
 [  196.921788] R10: ea000aeb8280 R11: 8143841e R12: 
 88025b326780
 [  196.921788] R13:  R14:  R15: 
 3000
 [  196.921788] FS:  () GS:88041fc8() 
 knlGS:
 [  196.921788] CS:  0010 DS:  ES:  CR0: 8005003b
 [  196.921788] CR2: 7f16e7a1948f CR3: 7f85e000 CR4: 
 06e0
 [  196.921788] Stack:
 [  196.921788]  8800299ff9e8 8800299ff9e8 88025b326780 
 88041fc83de8
 [  196.921788]  81438429 fffb 8803d36e6c00 
 
 [  196.921788]  88041fc83e10 811ef063 8802bae0a1e8 
 8802bae0a1e8
 [  196.921788] Call Trace:
 [  196.921788]  IRQ
 [  196.921788]  [81438429] btrfs_end_bio+0x116/0x11d
 [  196.921788]  [811ef063] bio_endio+0x63/0x6a
 [  196.921788]  [814cb712] blk_mq_complete_request+0x89/0xfe
 [  196.921788]  [814cb79d] __blk_mq_end_io+0x16/0x18
 [  196.921788]  [814cb7bf] blk_mq_end_io+0x20/0xb1
 [  196.921788]  [815a1ba9] virtblk_done+0xa4/0xf6
 [  196.921788]  [8155c463] vring_interrupt+0x7c/0x8a
 [  196.921788]  [81107427] handle_irq_event_percpu+0x4a/0x1bc
 [  

Re: btrfs-transaction blocked for more than 120 seconds

2014-01-03 Thread Marc MERLIN
First, a big thank you for taking the time to post this very informative
message.

On Wed, Jan 01, 2014 at 12:37:42PM +, Duncan wrote:
 Apparently the way some distribution installation scripts work results in 
 even a brand new installation being highly fragmented. =:^(  If in 
 addition they don't add autodefrag to the mount options used when 
 mounting the filesystem for the original installation, the problem is 
 made even worse, since the autodefrag mount option is designed to help 
 catch some of this sort of issue, and schedule the affected files for 
 auto-defrag by a separate thread.
 
Assuming you can stomach a bit of occasional performance loss due to
autodefrag, is there a reason not to always have this on btrfs
filesystems in newer kernels? (let's say 3.12+)?

Is there even a reason for this not to become a default mount option in
newer kernels?

 The NOCOW file attribute.
 
 Simple command form:
 
 chattr +C /path/to/file/or/directory
 
Thank you for that tip, I had been unaware of it 'till now.
This will make my virtualbox image directory much happier :)

 Meanwhile, if there's a point at which the file exists in its more or 
 less permanent form and won't be written into any longer (a torrented 
 file is fully downloaded, or a VM image is backed up), sequentially 
 copying it elsewhere (possibly using cp --reflink=never if on the same 
 filesystem, to avoid a reflink copy pointing at the same fragmented 
 extents!), then deleting the original fragmented version, should 
 effectively defragment the file too.  And since it's not being written 
 into any more at that point, it should stay defragmented.
 
 Or just btrfs filesystem defrag the individual file..

I know I can do the cp --reflink=never, but that will generate 100GB of
new files and force me to drop all my hourly/daily/weekly snapshots, so 
file defrag is definitely a better option.
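
For reference, defragmenting a single file would be something like this (the
path is just an example):

    filefrag /var/lib/virtualbox/myvm.vdi           # check the extent count
    btrfs filesystem defragment /var/lib/virtualbox/myvm.vdi
    filefrag /var/lib/virtualbox/myvm.vdi           # compare afterwards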

 Finally, there's some more work going into autodefrag now, to hopefully 
 increase its performance, and make it work more efficiently on a bit 
 larger files as well.  The goal is to eliminate the problems with 
 systemd's journal, among other things, now that it's known to be a common 
 problem, given systemd's widespread use and the fact that both systemd 
 and btrfs aim to be the accepted general Linux default within a few years.
 
Is there a good guideline on which kinds of btrfs filesystems autodefrag
is likely not a good idea, even if the current code does not have
optimal performance?
I suppose fragmented files that are deleted soon after being written are
a loss, but otherwise it's mostly a win. Am I missing something?
 
Unfortunately, on an 83GB vdi (virtualbox) file, with 3.12.5, it did a
lot of writing and chewed up my 4 CPUs.
Then, it started to be hard to move my mouse cursor and my procmeter
graph was barely updating seconds.
Next, nothing updated on my X server anymore, not even seconds in time
widgets.
But, I could still sometimes move my mouse cursor, and I could sometimes
see the HD light flicker a bit before going dead again. In other words,
the system wasn't fully deadlocked, but btrfs sure got into a state
where it was unable to finish the job, and took the kernel down with
it (64bit, 8GB of RAM).
I waited 2H and it never came out of it, I had to power down the system
in the end.
Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD,
not a slow HD.

I think I had enough free space:
Label: 'btrfs_pool1'  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 732.14GB
devid 1 size 865.01GB used 865.01GB path /dev/dm-0

Is it possible expected behaviour of defrag to lock up on big files?
Should I have had more spare free space for it to work?
Other?

On the plus side, the file I was trying to defragment, and which hung my
system, was not corrupted by the process.

Any idea what I should try from here?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: coredump in btrfsck

2014-01-03 Thread Marc MERLIN
On Thu, Jan 02, 2014 at 10:37:28AM -0700, Chris Murphy wrote:
 
 On Jan 1, 2014, at 3:35 PM, Oliver Mangold o.mang...@gmail.com wrote:
 
  On 01.01.2014 22:58, Chris Murphy wrote:
  On Jan 1, 2014, at 2:27 PM, Oliver Mangold o.mang...@gmail.com wrote:
  
  I fear, I broke my FS by running btrfsck. I tried 'btrfsck --repair' and 
  it fixed several problems but finally crashed with some debug message 
  from 'extent-tree.c', so I also tried 'btrfsck --repair 
  --init-extent-tree'.
  It is sort of a (near) last resort, you know this right? What did you try 
  before btrfsck? Did you set dmesg -n7, then mount -o recovery and if so 
  what was recorded in dmesg?
  Ehm, actually, no.
 
 https://btrfs.wiki.kernel.org/index.php/FAQ#When_will_Btrfs_have_a_fsck_like_tool.3F
 
 This is a bit dated, but the general idea is to not use repair except on 
 advice of a developer, and also there are still some risks. Just a week or so 
 ago, one said it was a little dangerous still. So yeah, -o recovery should be 
 the first choice.
 
I was thinking about this:
Considering that everyone out there has been conditioned/used to running
fsck on any filesystem if there is a problem, and considering btrfs has
been different and likely will be for the foreseeable future, I'd like to
suggest the following:

In order to accommodate more users trying btrfs, the documentation for
btrfsck really needs to be changed. Neither the tool help nor the man
page say anything about 'this is not the fsck you're looking for', nor
point to the wiki above.

See:
gandalfthegreat:~# btrfsck 
usage: btrfs check [options] device

Check an unmounted btrfs filesystem.
(...)
and
man btrfsck

Would it be possible for whoever maintains btrfs-tools to change both
the man page and the help included in the tool to clearly state that
running the fsck tool is unlikely to be the right course of action
and talk about btrfs-zero-log as well as mount -o recovery?

Cheers,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: Is anyone using btrfs send/receive for backups instead of rsync?

2014-01-03 Thread Marc MERLIN
On Mon, Dec 30, 2013 at 09:57:40AM -0800, Marc MERLIN wrote:
 On Mon, Dec 30, 2013 at 10:48:10AM -0700, Chris Murphy wrote:
  
  On Dec 30, 2013, at 10:10 AM, Marc MERLIN m...@merlins.org wrote:
   
   If one day, it could at least work on a subvolume level (only sync a
   subvolume), then it would be more useful to me. Maybe later…
  
  Maybe I'm missing something, but btrfs send/receive only work on a 
  subvolume level.
 
 Never mind, I seem to be the one being dense. I mis-read that you needed
 to create the filesystem with btrfs receive.
 Indeed, it's on a subvolume level, so it's actually fine since it does
 allow over-provisioning after all.

Mmmh, but I just realized that on my laptop, I do boot the btrfs copy
(currently done with rsync) from time to time (i.e. emergency boot from
the HD the SSD was copied to).
If I do that, it'll change the filesystem that was created with btrfs
receive and break it, preventing further updates, correct?

If so, can I get around that by making a boot snapshot after each copy
and mount that snapshot for emergency boot instead of the main volume?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: Is anyone using btrfs send/receive for backups instead of rsync?

2014-01-03 Thread Chris Mason
On Fri, 2014-01-03 at 12:15 -0800, Marc MERLIN wrote:
 On Mon, Dec 30, 2013 at 09:57:40AM -0800, Marc MERLIN wrote:
  On Mon, Dec 30, 2013 at 10:48:10AM -0700, Chris Murphy wrote:
   
   On Dec 30, 2013, at 10:10 AM, Marc MERLIN m...@merlins.org wrote:

If one day, it could at least work on a subvolume level (only sync a
subvolume), then it would be more useful to me. Maybe later…
   
   Maybe I'm missing something, but btrfs send/receive only work on a 
   subvolume level.
  
  Never mind, I seem to be the one being dense. I mis-read that you needed
  to create the filesystem with btrfs receive.
  Indeed, it's on a subvolume level, so it's actually fine since it does
   allow over-provisioning after all.
 
 Mmmh, but I just realized that on my laptop, I do boot the btrfs copy
 (currently done with rsync) from time to time (i.e. emergency boot from
 the HD the SSD was copied to).
 If I do that, it'll change the filesystem that was created with btrfs
 receive and break it, preventing further updates, correct?
 
 If so, can I get around that by making a boot snapshot after each copy
 and mount that snapshot for emergency boot instead of the main volume?

Yes that will work.

-chris
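
A minimal sketch of that workflow (subvolume names and mount points are
assumptions):

    # on the SSD: send the read-only snapshot to the backup HD
    btrfs send /mnt/ssd/root_snap | btrfs receive /mnt/hd/backups
    # keep the received snapshot pristine; boot a writable snapshot of it
    btrfs subvolume snapshot /mnt/hd/backups/root_snap /mnt/hd/boot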



Re: btrfs-transaction blocked for more than 120 seconds

2014-01-03 Thread Duncan
Marc MERLIN posted on Fri, 03 Jan 2014 09:25:06 -0800 as excerpted:

 First, a big thank you for taking the time to post this very informative
 message.
 
 On Wed, Jan 01, 2014 at 12:37:42PM +, Duncan wrote:
 Apparently the way some distribution installation scripts work results
 in even a brand new installation being highly fragmented. =:^(  If in
 addition they don't add autodefrag to the mount options used when
 mounting the filesystem for the original installation, the problem is
 made even worse, since the autodefrag mount option is designed to help
 catch some of this sort of issue, and schedule the affected files for
 auto-defrag by a separate thread.
  
 Assuming you can stomach a bit of occasional performance loss due to
 autodefrag, is there a reason not to always have this on btrfs
 filesystems in newer kernels? (let's say 3.12+)?
 
 Is there even a reason for this not to become a default mount option in
 newer kernels?

For big internal write files, autodefrag isn't yet well tuned, because 
it effectively write-magnifies too much, forcing rewrite of the entire 
file for just a small change.  If whatever app is more or less constantly 
writing those small changes, faster than the file can be rewritten...

I don't know where the break-over might be, but certainly, multi-gig 
sized IO-active VMs images or databases aren't something I'd want to use 
it with.  That's where the NOCOW thing will likely work better.

IIRC someone also mentioned problems with autodefrag and an about 3/4 gig 
systemd journal.  My gut feeling (IOW, *NOT* benchmarked!) is that double-
digit MiB files should /normally/ be fine, but somewhere in the lower 
triple digits, write-magnification could well become an issue, depending 
of course on exactly how much active writing the app is doing into the 
file.

As I said there's more work going into tuning autodefrag ATM, but as it 
is, I couldn't really recommend making it a global default... tho maybe a 
distro could enable it by default on a no-VM desktop system (as opposed 
to a server).  Certainly I'd recommend most desktop types enable it.

 The NOCOW file attribute.
 
 Simple command form:
 
 chattr +C /path/to/file/or/directory
  
 Thank you for that tip, I had been unaware of it 'till now.
 This will make my virtualbox image directory much happier :)

I think I said it, but it bears repeating.  Once you set that attribute 
on the dir, you may want to move the files out of the dir (to another 
partition would make sure the data is actually moved) and back in, so 
they're effectively new files in the dir.  Or use something like cat 
oldfile > newfile, so you know it's actually creating the new file, not 
reflinking.  That'll ensure the NOCOW takes effect.
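
A sketch of that (paths are only examples):

    chattr +C /var/lib/virtualbox             # new files created here are NOCOW
    # recreate an existing image so the attribute actually applies to it
    cat /var/lib/virtualbox/vm.vdi > /var/lib/virtualbox/vm.vdi.new
    mv /var/lib/virtualbox/vm.vdi.new /var/lib/virtualbox/vm.vdi
    lsattr /var/lib/virtualbox/vm.vdi         # should now show the C attribute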

 Unfortunately, on an 83GB vdi (virtualbox) file, with 3.12.5, it did a
 lot of writing and chewed up my 4 CPUs. Then, it started to be hard to
 move my mouse cursor and my procmeter graph was barely updating seconds.
 Next, nothing updated on my X server anymore, not even seconds in time
 widgets.
 
 But, I could still sometimes move my mouse cursor, and I could sometimes
 see the HD light flicker a bit before going dead again. In other words,
 the system wasn't fully deadlocked, but btrfs sure got into a state
 where it was unable to finish the job, and took the kernel down with
 it (64bit, 8GB of RAM).
 
 I waited 2H and it never came out of it, I had to power down the system
 in the end.  Note that this was on a top of the line 500MB/s write
 Samsung Evo 840 SSD, not a slow HD.

That was defrag (the command) or autodefrag (the mount option)?  I'd 
guess defrag (the command).

That's fragmentation for you!  What did/does filefrag have to say about 
that file?  Were you the one that posted the 6-digit extents?

For something that bad, it might be faster to copy/move it off-device 
(expect it to take awhile) then move it back.  That way you're only 
trying to read OR write on the device, not both, and the move elsewhere 
should defrag it quite a bit, effectively sequential write, then read and 
write on the move back.

But even that might be prohibitive.  At some point, you may need to 
either simply give up on it (if you're lazy), or get down and dirty with 
the tracing/profiling, working with a dev to figure out where it's 
spending its time and hopefully get btrfs recoded to work a bit faster 
for that sort of thing.

 I think I had enough free space:
 Label: 'btrfs_pool1'  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
   Total devices 1 FS bytes used 732.14GB
   devid 1 size 865.01GB used 865.01GB path /dev/dm-0
 
 Is it possible expected behaviour of defrag to lock up on big files?
 Should I have had more spare free space for it to work?
 Other?

From my understanding it's not the file size, but the number of 
fragments.  I'm guessing you simply overwhelmed the system.  Ideally you 
never let it get that bad in the first place. =:^(

As I suggested above, you might try the old school method of defrag, 

btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the 
btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient).


I discovered to my horror during testing today that neither raid1 nor 
raid10 arrays are fault tolerant of losing an actual disk.


mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde
mkdir /test
mount /dev/vdb /test
echo test > /test/test
btrfs filesystem sync /test
shutdown -hP now

After shutting down the VM, I can remove ANY of the drives from the 
btrfs raid10 array, and be unable to mount the array. In this case, I 
removed the drive that was at /dev/vde, then restarted the VM.


btrfs fi show
Label: none  uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455
Total devices 4 FS bytes used 156.00KB
 devid3 size 1.00GB used 212.75MB path /dev/vdd
 devid3 size 1.00GB used 212.75MB path /dev/vdc
 devid3 size 1.00GB used 232.75MB path /dev/vdb
 *** Some devices missing

OK, we have three of four raid10 devices present. Should be fine. Let's 
mount it:


mount -t btrfs /dev/vdb /test
mount: wrong fs type, bad option, bad superblock on /dev/vdb,
   missing codepage or helper program, or other error
   In some cases useful info is found in syslog - try
   dmesg | tail or so

What's the kernel log got to say about it?

dmesg | tail -n 4
[  536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1 
transid 7 /dev/vdb

[  536.700515] btrfs: disk space caching is enabled
[  536.703491] btrfs: failed to read the system array on vdd
[  536.708337] btrfs: open_ctree failed

Same behavior persists whether I create a raid1 or raid10 array, and 
whether I create it as that raid level using mkfs.btrfs or convert it 
afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn. 
Also persists even if I both scrub AND sync the array before shutting 
the machine down and removing one of the disks.


What's up with this? This is a MASSIVE bug, and I haven't seen anybody 
else talking about it... has nobody tried actually failing out a disk 
yet, or what?



Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Joshua Schüler
On 03.01.2014 23:28, Jim Salter wrote:
 I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the
 btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient).
 
 I discovered to my horror during testing today that neither raid1 nor
 raid10 arrays are fault tolerant of losing an actual disk.
 
 mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde
 mkdir /test
 mount /dev/vdb /test
 echo test > /test/test
 btrfs filesystem sync /test
 shutdown -hP now
 
 After shutting down the VM, I can remove ANY of the drives from the
 btrfs raid10 array, and be unable to mount the array. In this case, I
 removed the drive that was at /dev/vde, then restarted the VM.
 
 btrfs fi show
 Label: none  uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455
 Total devices 4 FS bytes used 156.00KB
  devid3 size 1.00GB used 212.75MB path /dev/vdd
  devid3 size 1.00GB used 212.75MB path /dev/vdc
  devid3 size 1.00GB used 232.75MB path /dev/vdb
  *** Some devices missing
 
 OK, we have three of four raid10 devices present. Should be fine. Let's
 mount it:
 
 mount -t btrfs /dev/vdb /test
 mount: wrong fs type, bad option, bad superblock on /dev/vdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
 
 What's the kernel log got to say about it?
 
 dmesg | tail -n 4
 [  536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1
 transid 7 /dev/vdb
 [  536.700515] btrfs: disk space caching is enabled
 [  536.703491] btrfs: failed to read the system array on vdd
 [  536.708337] btrfs: open_ctree failed
 
 Same behavior persists whether I create a raid1 or raid10 array, and
 whether I create it as that raid level using mkfs.btrfs or convert it
 afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn.
 Also persists even if I both scrub AND sync the array before shutting
 the machine down and removing one of the disks.
 
 What's up with this? This is a MASSIVE bug, and I haven't seen anybody
 else talking about it... has nobody tried actually failing out a disk
 yet, or what?

Hey Jim,

keep calm and read the wiki ;)
https://btrfs.wiki.kernel.org/

You need to mount with -o degraded to tell btrfs a disk is missing.


Joshua




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
I actually read the wiki pretty obsessively before blasting the list - 
could not successfully find anything answering the question, by scanning 
the FAQ or by Googling.


You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.

HOWEVER - this won't allow a root filesystem to mount. How do you deal 
with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root 
filesystem? Few things are scarier than seeing the cannot find init 
message in GRUB and being faced with a BusyBox prompt... which is 
actually how I initially got my scare; I was trying to do a walkthrough 
for setting up a raid1 / for an article in a major online magazine and 
it wouldn't boot at all after removing a device; I backed off and tested 
with a non root filesystem before hitting the list.


I did find the -o degraded argument in the wiki now that you mentioned 
it - but it's not prominent enough if you ask me. =)




On 01/03/2014 05:43 PM, Joshua Schüler wrote:

On 03.01.2014 23:28, Jim Salter wrote:

I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the
btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient).

I discovered to my horror during testing today that neither raid1 nor
raid10 arrays are fault tolerant of losing an actual disk.

mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde
mkdir /test
mount /dev/vdb /test
echo test > /test/test
btrfs filesystem sync /test
shutdown -hP now

After shutting down the VM, I can remove ANY of the drives from the
btrfs raid10 array, and be unable to mount the array. In this case, I
removed the drive that was at /dev/vde, then restarted the VM.

btrfs fi show
Label: none  uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455
 Total devices 4 FS bytes used 156.00KB
  devid3 size 1.00GB used 212.75MB path /dev/vdd
  devid3 size 1.00GB used 212.75MB path /dev/vdc
  devid3 size 1.00GB used 232.75MB path /dev/vdb
  *** Some devices missing

OK, we have three of four raid10 devices present. Should be fine. Let's
mount it:

mount -t btrfs /dev/vdb /test
mount: wrong fs type, bad option, bad superblock on /dev/vdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

What's the kernel log got to say about it?

dmesg | tail -n 4
[  536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1
transid 7 /dev/vdb
[  536.700515] btrfs: disk space caching is enabled
[  536.703491] btrfs: failed to read the system array on vdd
[  536.708337] btrfs: open_ctree failed

Same behavior persists whether I create a raid1 or raid10 array, and
whether I create it as that raid level using mkfs.btrfs or convert it
afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn.
Also persists even if I both scrub AND sync the array before shutting
the machine down and removing one of the disks.

What's up with this? This is a MASSIVE bug, and I haven't seen anybody
else talking about it... has nobody tried actually failing out a disk
yet, or what?

Hey Jim,

keep calm and read the wiki ;)
https://btrfs.wiki.kernel.org/

You need to mount with -o degraded to tell btrfs a disk is missing.


Joshua






Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Joshua Schüler
On 03.01.2014 23:56, Jim Salter wrote:
 I actually read the wiki pretty obsessively before blasting the list -
 could not successfully find anything answering the question, by scanning
 the FAQ or by Googling.
 
 You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
don't forget to
btrfs device delete missing <path>
See
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
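
In this example that would be something like (the replacement device name is
an assumption):

    mount -o degraded /dev/vdb /test
    btrfs device add /dev/vdf /test      # optional: add a replacement first
    btrfs device delete missing /test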
 
 HOWEVER - this won't allow a root filesystem to mount. How do you deal
 with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
 filesystem? Few things are scarier than seeing the cannot find init
 message in GRUB and being faced with a BusyBox prompt... which is
 actually how I initially got my scare; I was trying to do a walkthrough
 for setting up a raid1 / for an article in a major online magazine and
 it wouldn't boot at all after removing a device; I backed off and tested
 with a non root filesystem before hitting the list.
Add -o degraded to the boot-options in GRUB.

If your filesystem is more heavily corrupted, then you either need the
btrfs tools in your initrd or a rescue CD.
 
 I did find the -o degraded argument in the wiki now that you mentioned
 it - but it's not prominent enough if you ask me. =)
 
 

[snip]

Joshua


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Hugo Mills
On Fri, Jan 03, 2014 at 05:56:42PM -0500, Jim Salter wrote:
 I actually read the wiki pretty obsessively before blasting the list
 - could not successfully find anything answering the question, by
 scanning the FAQ or by Googling.
 
 You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
 
 HOWEVER - this won't allow a root filesystem to mount. How do you
 deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your
 root filesystem? Few things are scarier than seeing the cannot find
 init message in GRUB and being faced with a BusyBox prompt...

   Use grub's command-line editing to add rootflags=degraded to it.

   Hugo.

 which
 is actually how I initially got my scare; I was trying to do a
 walkthrough for setting up a raid1 / for an article in a major
 online magazine and it wouldn't boot at all after removing a device;
 I backed off and tested with a non root filesystem before hitting
 the list.
 
 I did find the -o degraded argument in the wiki now that you
 mentioned it - but it's not prominent enough if you ask me. =)
 
 
 
 On 01/03/2014 05:43 PM, Joshua Schüler wrote:
 On 03.01.2014 23:28, Jim Salter wrote:
 I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the
 btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient).
 
 I discovered to my horror during testing today that neither raid1 nor
 raid10 arrays are fault tolerant of losing an actual disk.
 
  mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde
 mkdir /test
 mount /dev/vdb /test
  echo test > /test/test
 btrfs filesystem sync /test
 shutdown -hP now
 
 After shutting down the VM, I can remove ANY of the drives from the
 btrfs raid10 array, and be unable to mount the array. In this case, I
 removed the drive that was at /dev/vde, then restarted the VM.
 
 btrfs fi show
 Label: none  uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455
  Total devices 4 FS bytes used 156.00KB
   devid3 size 1.00GB used 212.75MB path /dev/vdd
   devid3 size 1.00GB used 212.75MB path /dev/vdc
   devid3 size 1.00GB used 232.75MB path /dev/vdb
   *** Some devices missing
 
 OK, we have three of four raid10 devices present. Should be fine. Let's
 mount it:
 
 mount -t btrfs /dev/vdb /test
 mount: wrong fs type, bad option, bad superblock on /dev/vdb,
 missing codepage or helper program, or other error
 In some cases useful info is found in syslog - try
 dmesg | tail or so
 
 What's the kernel log got to say about it?
 
 dmesg | tail -n 4
 [  536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1
 transid 7 /dev/vdb
 [  536.700515] btrfs: disk space caching is enabled
 [  536.703491] btrfs: failed to read the system array on vdd
 [  536.708337] btrfs: open_ctree failed
 
 Same behavior persists whether I create a raid1 or raid10 array, and
 whether I create it as that raid level using mkfs.btrfs or convert it
 afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn.
 Also persists even if I both scrub AND sync the array before shutting
 the machine down and removing one of the disks.
 
 What's up with this? This is a MASSIVE bug, and I haven't seen anybody
 else talking about it... has nobody tried actually failing out a disk
 yet, or what?
 Hey Jim,
 
 keep calm and read the wiki ;)
 https://btrfs.wiki.kernel.org/
 
 You need to mount with -o degraded to tell btrfs a disk is missing.
 
 
 Joshua
 
 
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Eighth Army Push Bottles Up Germans -- WWII newspaper ---  
 headline (possibly apocryphal)  




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda 
black magic to me, and I don't think I'm supposed to be editing it 
directly at all anymore anyway, if I remember correctly...

HOWEVER - this won't allow a root filesystem to mount. How do you deal
with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
filesystem? Few things are scarier than seeing the cannot find init
message in GRUB and being faced with a BusyBox prompt... which is
actually how I initially got my scare; I was trying to do a walkthrough
for setting up a raid1 / for an article in a major online magazine and
it wouldn't boot at all after removing a device; I backed off and tested
with a non root filesystem before hitting the list.

Add -o degraded to the boot-options in GRUB.




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Hugo Mills
On Fri, Jan 03, 2014 at 06:13:25PM -0500, Jim Salter wrote:
 Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still
 kinda black magic to me, and I don't think I'm supposed to be
 editing it directly at all anymore anyway, if I remember
 correctly...

   You don't need to edit grub.cfg -- when you boot, grub has an edit
option, so you can do it at boot time without having to use a rescue
disk.

   Regardless, the thing you need to edit is the line starting with
linux, which will look something like this:

linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a 
ro single rootflags=subvol=fs-root

   If there's a rootflags= option already (as above), add ,degraded
to the end. If there isn't, add rootflags=degraded.
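
   For example, using the line above, the edited result would read:

linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a 
ro single rootflags=subvol=fs-root,degraded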

   Hugo.

 HOWEVER - this won't allow a root filesystem to mount. How do you deal
 with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
 filesystem? Few things are scarier than seeing the cannot find init
 message in GRUB and being faced with a BusyBox prompt... which is
 actually how I initially got my scare; I was trying to do a walkthrough
 for setting up a raid1 / for an article in a major online magazine and
 it wouldn't boot at all after removing a device; I backed off and tested
 with a non root filesystem before hitting the list.
 Add -o degraded to the boot-options in GRUB.
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Eighth Army Push Bottles Up Germans -- WWII newspaper ---  
 headline (possibly apocryphal)  




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 3:56 PM, Jim Salter j...@jrs-s.net wrote:

 I actually read the wiki pretty obsessively before blasting the list - could 
 not successfully find anything answering the question, by scanning the FAQ or 
 by Googling.
 
 You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
 
 HOWEVER - this won't allow a root filesystem to mount. How do you deal with 
 this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root filesystem? 

I'd say it's not ready for unattended/automatic degraded mounting; this is 
intended to be a red-flag show stopper to get the attention of the user. Before 
automatic degraded mounts, which md and LVM raid do now, there probably needs 
to be notification support in desktops, e.g. GNOME will report degraded state 
for at least md arrays (maybe LVM too, not sure). There's also a list of other 
multiple-device work on the to-do list, some of which should maybe be done 
before auto degraded mount, for example the hot-spare work.

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Multiple_Devices


Chris Murphy


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
Yep - had just figured that out and successfully booted with it, and was 
in the process of typing up instructions for the list (and posterity).


One thing that concerns me is that edits made directly to grub.cfg will 
get wiped out with every kernel upgrade when update-grub is run - any 
idea where I'd put this in /etc/grub.d to have a persistent change?


I have to tell you, I'm not real thrilled with this behavior either way 
- it means I can't have the option to automatically mount degraded 
filesystems without the filesystems in question ALWAYS showing as being 
mounted degraded, whether the disks are all present and working fine or 
not. That's kind of blecchy. =\



On 01/03/2014 06:18 PM, Hugo Mills wrote:

On Fri, Jan 03, 2014 at 06:13:25PM -0500, Jim Salter wrote:

Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still
kinda black magic to me, and I don't think I'm supposed to be
editing it directly at all anymore anyway, if I remember
correctly...

You don't need to edit grub.cfg -- when you boot, grub has an edit
option, so you can do it at boot time without having to use a rescue
disk.

Regardless, the thing you need to edit is the line starting with
linux, which will look something like this:

linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a 
ro single rootflags=subvol=fs-root

If there's a rootflags= option already (as above), add ,degraded
to the end. If there isn't, add rootflags=degraded.

Hugo.


HOWEVER - this won't allow a root filesystem to mount. How do you deal
with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
filesystem? Few things are scarier than seeing the cannot find init
message in GRUB and being faced with a BusyBox prompt... which is
actually how I initially got my scare; I was trying to do a walkthrough
for setting up a raid1 / for an article in a major online magazine and
it wouldn't boot at all after removing a device; I backed off and tested
with a non root filesystem before hitting the list.

Add -o degraded to the boot-options in GRUB.




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 4:13 PM, Jim Salter j...@jrs-s.net wrote:

 Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda black 
 magic to me, and I don't think I'm supposed to be editing it directly at all 
 anymore anyway, if I remember correctly…

Don't edit the grub.cfg directly. At the grub menu, just highlight the entry 
you want to boot, hit 'e', and then edit the existing linux/linuxefi line. If 
you already have rootfs on a subvolume, you'll have an existing parameter on 
that line, rootflags=subvol=rootname, which you can change to 
rootflags=subvol=rootname,degraded.

I would not make this option persistent by putting it permanently in the 
grub.cfg; although I don't know the consequences of always mounting with 
degraded even when it's not necessary, it could have some negative effects (?)


Chris Murphy


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 4:25 PM, Jim Salter j...@jrs-s.net wrote:

 
 One thing that concerns me is that edits made directly to grub.cfg will get 
 wiped out with every kernel upgrade when update-grub is run - any idea where 
 I'd put this in /etc/grub.d to have a persistent change?

/etc/default/grub

I don't recommend making it persistent. At this stage of development, a disk 
failure should cause mount failure so you're alerted to the problem.

 I have to tell you, I'm not real thrilled with this behavior either way - it 
 means I can't have the option to automatically mount degraded filesystems 
 without the filesystems in question ALWAYS showing as being mounted degraded, 
 whether the disks are all present and working fine or not. That's kind of 
 blecchy. =\

If you need something that comes up degraded automatically by design, as a 
supported use case, use md (or possibly LVM, which uses different user-space 
tools and monitoring but uses the md kernel driver code and supports raid 
0, 1, 5 and 6 - quite nifty). I haven't tried this yet, but I think that's also 
supported with the thin-provisioning work, which, even if you don't use thin 
provisioning, gets you the significantly more efficient snapshot behavior.

Chris Murphy


Re: btrfsck does not fix

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 12:41 PM, Hendrik Friedel hend...@friedels.name wrote:

 Hello,
 
 I ran btrfsck on my volume with the repair option. When I re-run it, I get 
 the same errors as before.

Did you try mounting with -o recovery first?
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ

What messages in dmesg do you get when you use recovery?
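
For reference, a recovery-mount attempt looks something like this (the device 
and mount point are placeholders, not taken from your report):

mount -o recovery /dev/sdX /mnt
dmesg | tail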


Chris Murphy


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
For anybody else interested, if you want your system to automatically 
boot a degraded btrfs array, here are my crib notes, verified working:


* boot degraded

1. edit /etc/grub.d/10_linux, add degraded to the rootflags

GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} 
${GRUB_CMDLINE_LINUX}



2. add degraded to options in /etc/fstab also

UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 /   btrfs 
defaults,degraded,subvol=@   0   1



3. Update and reinstall GRUB to all boot disks

update-grub
grub-install /dev/vda
grub-install /dev/vdb

Now you have a system which will automatically start a degraded array.
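
As a quick sanity check after booting with one disk pulled, something like the 
following should confirm the array really did come up degraded (output wording 
may vary by kernel and btrfs-progs version):

grep btrfs /proc/mounts    # the mount options should now include degraded
btrfs fi show              # should report *** Some devices missing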


**

Side note: sorry, but I absolutely don't buy the argument that having the 
system refuse to boot until you drive down to its physical location, stand in 
front of it, and hammer panickily at a BusyBox prompt is the best way to find 
out your array is degraded.  I'll set up a Nagios module to check for degraded 
arrays using btrfs fi list instead, thanks...
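
A minimal sketch of such a check (hypothetical script - the name, message text 
and exit codes just follow the usual Nagios plugin convention, and it needs 
root to scan devices):

#!/bin/sh
# check_btrfs_degraded (hypothetical): alert if any btrfs filesystem on this
# host reports missing devices.  Nagios convention: exit 0 = OK, 2 = CRITICAL.
if btrfs filesystem show 2>/dev/null | grep -q 'Some devices missing'; then
    echo 'CRITICAL: btrfs reports missing devices'
    exit 2
fi
echo 'OK: all btrfs devices present'
exit 0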



On 01/03/2014 06:06 PM, Freddie Cash wrote:
Why is manual intervention even needed?  Why isn't the filesystem 
smart enough to mount in a degraded mode automatically?


--
Freddie Cash
fjwc...@gmail.com




Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter
Minor correction: you need to close the double-quotes at the end of the 
GRUB_CMDLINE_LINUX line:


GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} 
${GRUB_CMDLINE_LINUX}"



On 01/03/2014 06:42 PM, Jim Salter wrote:
For anybody else interested, if you want your system to automatically 
boot a degraded btrfs array, here are my crib notes, verified working:


* boot degraded

1. edit /etc/grub.d/10_linux, add degraded to the rootflags

GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} 
${GRUB_CMDLINE_LINUX}



2. add degraded to options in /etc/fstab also

UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 /   btrfs 
defaults,degraded,subvol=@   0   1



3. Update and reinstall GRUB to all boot disks

update-grub
grub-install /dev/vda
grub-install /dev/vdb

Now you have a system which will automatically start a degraded array.


**

Side note: sorry, but I absolutely don't buy the argument that having the 
system refuse to boot until you drive down to its physical location, stand 
in front of it, and hammer panickily at a BusyBox prompt is the best way to 
find out your array is degraded.  I'll set up a Nagios module to check for 
degraded arrays using btrfs fi list instead, thanks...



On 01/03/2014 06:06 PM, Freddie Cash wrote:
Why is manual intervention even needed? Why isn't the filesystem 
smart enough to mount in a degraded mode automatically?


--
Freddie Cash
fjwc...@gmail.com




Re: coredump in btrfsck

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 5:33 AM, Marc MERLIN m...@merlins.org wrote:
 
 Would it be possible for whoever maintains btrfs-tools to change both
 the man page and the help included in the tool to clearly state that
 running the fsck tool is unlikely to be the right course of action
 and talk about btrfs-zero-log as well as mount -o recovery?

The problem FAQ doesn't even mention btrfsck so I think people are just getting 
around that page or making assumptions.
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ

Should btrfs check (btrfsck without --repair) work similarly to xfs_repair when 
the file system is not cleanly unmounted? If an XFS volume is not cleanly 
unmounted, running xfs_repair will instruct the user to first mount the volume 
so that the journal is replayed, then umount the volume, then run xfs_repair.

A possible variant of this for btrfs check: inform the user that the first step 
in repairing a problem Btrfs volume is to mount with -o recovery, and point at 
the Btrfs FAQ URL for additional problem-solving recommendations.

?

Chris Murphy


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Chris Murphy

On Jan 3, 2014, at 4:42 PM, Jim Salter j...@jrs-s.net wrote:

 For anybody else interested, if you want your system to automatically boot a 
 degraded btrfs array, here are my crib notes, verified working:
 
 * boot degraded
 
 1. edit /etc/grub.d/10_linux, add degraded to the rootflags
 
GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} 
 ${GRUB_CMDLINE_LINUX}

This is the wrong way to solve this. /etc/grub.d/10_linux is subject to being 
replaced on updates. It is not recommended it be edited, same as for grub.cfg. 
The correct way is as I already stated, which is to edit the 
GRUB_CMDLINE_LINUX= line in /etc/default/grub.
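
(For completeness, and with the caveat that I don't recommend making degraded 
persistent at all: on a root filesystem that is not on a subvolume, that would 
look roughly like the snippet below, followed by regenerating the config. If 
the root is on a subvolume, the distribution's 10_linux script already emits 
its own rootflags=subvol=... parameter, so the two would need to be merged into 
a single rootflags=subvol=...,degraded.)

# in /etc/default/grub (illustrative):
GRUB_CMDLINE_LINUX="rootflags=degraded"

# then regenerate grub.cfg:
update-grub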


 2. add degraded to options in /etc/fstab also
 
 UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 /   btrfs 
 defaults,degraded,subvol=@   0   1


I think it's bad advice to recommend always persistently mounting a good volume 
with this option. There's a reason why degraded is not the default mount 
option, and why there isn't yet automatic degraded mount functionality. That 
fstab contains other errors.

The correct way to automate this before Btrfs developers get around to it is to 
create a systemd unit that checks for the mount failure, determines that 
there's a missing device, and generates a modified sysroot.mount job that 
includes degraded.


 Side note: sorry, but I absolutely don't buy the argument that the system 
 won't boot without you driving down to its physical location, standing in 
 front of it, and hammering panickily at a BusyBox prompt is the best way to 
 find out your array is degraded.

You're simply dissatisfied with the state of Btrfs development and are 
suggesting bad hacks as a work around. That's my argument. Again, if your use 
case requires automatic degraded mounts, use a technology that's mature and 
well tested for that use case. Don't expect a lot of sympathy if these bad 
hacks cause you problems later.


  I'll set up a Nagios module to check for degraded arrays using btrfs fi list 
 instead, thanks…

That's a good idea, except that it's show rather than list.



Chris Murphy


Re: Status of raid5/6 in 2014?

2014-01-03 Thread Hans-Kristian Bakke
I personally consider proper RAID6 support, with graceful, non-intrusive
handling of failing drives and a proper warning mechanism, the most
important missing feature of btrfs, and I know this view is shared by
many others with software-RAID-based storage systems, currently
limited by the existing choices on Linux.
But having been a (naughty) user of btrfs for the last few months, I fully
understand that there are important bugs, performance problems and issues
in the existing state of btrfs that need more immediate attention, as
they affect the currently installed base.

I will however stress that the sooner the functionality gets
implemented, the sooner users like myself can begin using it and
reporting issues, and hence the sooner btrfs becomes ready for
enterprise usage and general deployment.

Regards,
Hans-Kristian Bakke


On 3 January 2014 17:45, Dave d...@thekilempire.com wrote:
 Back in Feb 2013 there was quite a bit of press about the preliminary
 raid5/6 implementation in Btrfs.  At the time it wasn't useful for
 anything other than testing and it's my understanding that this is
 still the case.

 I've seen a few git commits and some chatter on this list but it would
 appear the developers are largely silent.  Parity-based raid would be
 a powerful addition to the Btrfs feature stack and it's the feature I
 most anxiously await.  Are there any milestones planned for 2014?

 Keep up the good work...
 --
 -=[dave]=-

 Entropy isn't what it used to be.


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Jim Salter


On 01/03/2014 07:27 PM, Chris Murphy wrote:
This is the wrong way to solve this. /etc/grub.d/10_linux is subject 
to being replaced on updates. It is not recommended it be edited, same 
as for grub.cfg. The correct way is as I already stated, which is to 
edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub. 
Fair enough - though since I already have to monkey-patch 00_header, I 
kind of already have an eye on grub.d so it doesn't seem as onerous as 
it otherwise would. There is definitely a lot of work that needs to be 
done on the boot sequence for btrfs IMO.
I think it's bad advice to recommend always persistently mounting a 
good volume with this option. There's a reason why degraded is not the 
default mount option, and why there isn't yet automatic degraded mount 
functionality. That fstab contains other errors.
What other errors does it contain? Aside from adding the degraded 
option, that's a bone-stock fstab entry from an Ubuntu Server installation.
The correct way to automate this before Btrfs developers get around to 
it is to create a systemd unit that checks for the mount failure, 
determines that there's a missing device, and generates a modified 
sysroot.mount job that includes degraded. 
Systemd is not the boot system in use for my distribution, and using it 
would require me to build a custom kernel, among other things. We're 
going to have to agree to disagree that that's an appropriate 
workaround, I think.
You're simply dissatisfied with the state of Btrfs development and are 
suggesting bad hacks as a work around. That's my argument. Again, if 
your use case requires automatic degraded mounts, use a technology 
that's mature and well tested for that use case. Don't expect a lot of 
sympathy if these bad hacks cause you problems later. 
You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they 
don't provide the features that I need or am accustomed to (true 
snapshots, copy on write, self-correcting redundant arrays, and on down 
the line). If you're going to shoo me off, the correct way to do it is 
to wave me in the direction of ZFS, in which case I can tell you I've 
been a happy user of ZFS for 5+ years now on hundreds of systems. ZFS 
and btrfs are literally the *only* options available that do what I want 
to do, and have been doing for years now. (At least aside from 
six-figure-and-up proprietary systems, which I have neither the budget 
nor the inclination for.)


I'm testing btrfs heavily in throwaway virtual environments and in a few 
small, heavily-monitored test production instances because ZFS on 
Linux has its own set of problems, both technical and licensing, and I 
think it's clear btrfs is going to take the lead in the very near future 
- in many ways, it does already.

  I'll set up a Nagios module to check for degraded arrays using btrfs fi list 
instead, thanks…

That's a good idea, except that it's show rather than list.
Yup, that's what I meant all right. I frequently still get the syntax 
backwards between btrfs fi show and btrfs subv list.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Dave
On Fri, Jan 3, 2014 at 9:59 PM, Jim Salter j...@jrs-s.net wrote:
 You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they
 don't provide the features that I need or are accustomed to (true snapshots,
 copy on write, self-correcting redundant arrays, and on down the line). If
 you're going to shoo me off, the correct way to do it is to wave me in the
 direction of ZFS, in which case I can tell you I've been a happy user of ZFS
 for 5+ years now on hundreds of systems. ZFS and btrfs are literally the
 *only* options available that do what I want to do, and have been doing for
 years now. (At least aside from six-figure-and-up proprietary systems, which
 I have neither the budget nor the inclination for.)

Jim, there's nothing stopping you from creating a Btrfs filesystem on
top of an mdraid array.  I'm currently running three WD Red 3TB drives
in a raid5 configuration under a Btrfs filesystem.  This configuration
works pretty well and fills the feature gap you're describing.
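
For anyone who wants to try that layout, it's the usual mdadm-then-mkfs 
sequence; device names below are placeholders:

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0
mount /dev/md0 /mnt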

I will say, though, that the whole tone of your email chain leaves a
bad taste in my mouth; kind of like a poorly adjusted relative who
shows up once a year for Thanksgiving and makes everyone feel
uncomfortable.  I find myself annoyed by the constant disclaimers I
read on this list, about the experimental status of Btrfs, but it's
apparent that this hasn't sunk in for everyone.  Your poor budget
doesn't a production filesystem make.

I and many others on this list who have been using Btrfs will tell
you with no hesitation that, due to the immaturity of the code, Btrfs
should be making NO assumptions in the event of a failure, and
everything should come to a screeching halt.
infamous 120 second process hangs, csum errors, multiple separate
catastrophic failures (search me on this list).  Things are MOSTLY
stable but you simply have to glance at a few weeks of history on this
list to see the experimental status is fully justified.  I use Btrfs
because of its intoxicating feature set.  As an IT director though,
I'd never subject my company to these rigors.  If Btrfs on mdraid
isn't an acceptable solution for you, then ZFS is the only responsible
alternative.
-- 
-=[dave]=-

Entropy isn't what it used to be.


Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT

2014-01-03 Thread Duncan
Chris Murphy posted on Fri, 03 Jan 2014 16:22:44 -0700 as excerpted:

 I would not make this option persistent by putting it permanently in the
 grub.cfg; although I don't know the consequences of always mounting with
 degraded even when it's not necessary, it could have some negative effects (?)

Degraded only actually does anything if it's needed.  On a 
normal array it'll be a NOOP, so it should be entirely safe for /normal/ 
operation, but that doesn't mean I'd /recommend/ it for normal operation, 
since it bypasses checks that are there for a reason, thus silently 
bypassing information that an admin needs to know before he boots it 
anyway, in order to recover.

However, I've some other comments to add:

1) Like you, I'm uncomfortable with the whole idea of adding degraded 
permanently at this point.

Mention was made of having to drive down to the data center and actually 
stand in front of the box if something goes wrong, otherwise.  At the 
moment, for btrfs' development state at this point, fine.  Btrfs remains 
under development and there are clear warnings about using it without 
backups one hasn't tested recovery from or is not otherwise prepared to 
actually use.  It's stated in multiple locations on the wiki; it's stated 
on the kernel btrfs config option, and it's stated in mkfs.btrfs output 
when you create the filesystem.  If after all that people are using it in 
a remote situation where they're not prepared to drive down to the data 
center and stab at the keys if they have to, they're using possibly the 
right filesystem, but at too early a point in its development for their 
needs at this moment.


2) As the wiki explains, certain configurations require at least a 
minimum number of devices in order to work undegraded.  The example 
given in the OP was of a 4-device raid10, already the minimum number to 
work undegraded, with one device dropped out, leaving it below the minimum 
number required to mount undegraded, so of /course/ it wouldn't mount 
without that option.

If five or six devices had been used, a device could have been 
dropped and the remaining number of devices would still be greater than 
or equal to the minimum number of devices to run an undegraded raid10, 
and the result would likely have been different, since there are still 
enough devices to mount writable with proper redundancy, even if existing 
information doesn't have that redundancy until a rebalance is done to 
take care of the missing device.

Similarly with a raid1 and its minimum two devices.  Configure with 
three, then drop one, and it should still work as it's above the minimum 
of two for a raid1 configuration.  Configure with two and drop one, and 
you'll have to mount degraded (and it'll drop to read-only if it happens 
in operation) since there's no second device to write the second copy to, 
as required by raid1.

3) Frankly, this whole thread smells of going off half cocked, posting 
before doing the proper research.  I know when I took a look at btrfs 
here, I read up on the wiki, reading the multiple devices stuff, the faq, 
the problem faq, the gotchas, the use cases, the sysadmin guide, the 
getting started and mount options... loading the pages multiple times as 
I followed links back and forth between them.

Because I care about my data and want to understand what I'm doing with 
it before I do it!

And even now I often reread specific parts as I'm trying to help others 
with questions on this list.

Then I still had some questions about how it worked that I couldn't find 
answers for on the wiki, and as is traditional with mailing lists and 
newsgroups before them, I read several weeks' worth of posts (on an 
archive for lists) before actually posting my questions, to see if they 
were FAQs already answered on the list.

Then and only then did I post the questions to the list, and when I did, 
it was "Questions I haven't found answers for on the wiki or list", not 
"THE WORLD IS GOING TO END, OH NOS!!111!!11!111!!!"

Now later on I did post some behavior that had me rather upset, but that 
was AFTER I had already engaged the list in general, and was pretty sure 
by that point that what I was seeing was NOT covered on the wiki, and was 
reasonably new information for at least SOME list users.

4) As a matter of fact, AFAIK that behavior remains relevant today, and 
may well be of interest to the OP.

FWIW my background was Linux kernel md/raid, so I approached the btrfs 
raid expecting similar behavior.  What I found in my testing (and NOT 
covered on the WIKI or in the various documentation other than in a few 
threads on the list to this day, AFAIK), however...

Test:  

a) Create a two device btrfs raid1.

b) Mount it and write some data to it.

c) Unmount it, unplug one device, mount degraded the remaining device.

d) Write some data to a test file on it, noting the path/filename and 
data.

e) Unmount again, switch plugged devices so the formerly unplugged one is 
now the plugged