Re: Scrub priority, am I using it wrong?

2016-04-04 Thread Duncan
Gareth Pye posted on Tue, 05 Apr 2016 13:45:11 +1000 as excerpted:

> On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> CPU bound, 0% IOWait even at idle IO priority, in addition to the
>> hundreds of M/s values per thread/device, here.  You OTOH are showing
>> under 20 M/s per thread/device on spinning rust, with an IOWait near
>> 90%,
>> thus making it IO bound.
> 
> 
> And yes I'd love to switch to SSD, but 12 2TB drives is a bit pricey
> still

No kidding.  That's why my media partition remains spinning rust.  (Tho 
FWIW, not btrfs, I use btrfs only on my ssds, and still use the old and 
stable reiserfs on my spinning rust.)

But my media partition is small enough, and ssd prices now low enough up 
to the 1 TB level, that when I upgrade I'll probably switch to ssd for 
the media partition as well, and leave spinning rust only as second or 
third level backups.

But that's because it all, including first level backups, fits in under a 
TB (and if pressed I could do it under a half TB).  Multi-TB, as you 
have, definitely still spinning rust, for me too.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Scrub priority, am I using it wrong?

2016-04-04 Thread Duncan
Gareth Pye posted on Tue, 05 Apr 2016 13:44:05 +1000 as excerpted:

> On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> 1) It appears btrfs scrub start's -c option only takes numeric class,
>> so try -c3 instead of -c idle.
> 
> 
> Does it count as a bug if it silently accepts the way I was doing it?
> 
> I've switched to -c3 and at least now the idle class listed in iotop is
> idle, so I hope that means it will be more friendly to other processes.

I'd say yes, particularly given that the fact that the value must be the 
numeric class isn't documented in the manpage at all.

Whether the bug is one of documentation (say it must be numeric) or of 
implementation (accept the class name as well) is then up for debate.  
I'd call fixing either one a fix.  If it must be numeric, document that 
(and optionally change the implementation to error out in some way if a 
numeric parameter isn't supplied for -c); otherwise change the 
implementation so the name can be taken as well (and optionally change 
the documentation to explicitly mention that either one can be used).  
Doesn't matter to me which.
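
Just to illustrate how small either fix would be, here's a minimal 
userspace sketch of a -c parser that accepts both spellings and errors 
out on garbage (hypothetical code, not the actual btrfs-progs 
implementation):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Accept the IO class for -c either as a number ("3") or as a name
 * ("idle"), and reject anything else instead of silently accepting it.
 * Classes follow the ioprio convention: 0 none, 1 rt, 2 be, 3 idle. */
static int parse_io_class(const char *arg)
{
        char *end;
        long n = strtol(arg, &end, 10);

        if (*arg && !*end)                      /* purely numeric */
                return (n >= 0 && n <= 3) ? (int)n : -1;
        if (!strcmp(arg, "none")) return 0;
        if (!strcmp(arg, "rt"))   return 1;
        if (!strcmp(arg, "be"))   return 2;
        if (!strcmp(arg, "idle")) return 3;
        return -1;
}

int main(int argc, char **argv)
{
        int class = argc > 1 ? parse_io_class(argv[1]) : -1;

        if (class < 0) {
                fprintf(stderr, "invalid IO class\n");
                return 1;
        }
        printf("IO class %d\n", class);
        return 0;
}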

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-04 Thread Qu Wenruo



Alex Lyakas wrote on 2016/04/03 10:22 +0200:

Hello Qu, Wang,

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo  wrote:



Alex Lyakas wrote on 2016/03/29 19:22 +0200:


Greetings Qu Wenruo,

I have reviewed the dedup patchset found in the github account you
mentioned. I have several questions. Please note that by all means I
am not criticizing your design or code. I just want to make sure that
my understanding of the code is proper.



It's OK to criticize the design or code, and that's how review works.



1) You mentioned in several emails that at some point byte-to-byte
comparison is to be performed. However, I do not see this in the code.
It seems that generic_search() only looks for the hash value match. If
there is a match, it goes ahead and adds a delayed ref.



I mentioned byte-to-byte comparison as "not to be implemented any time
soon".

Considering the lack of a facility to read out extent contents without any
inode structure, it's not going to be done any time soon.



2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
mutex and proceed with the normal COW. What happens if there are
several IO streams to different files writing an identical block, but
we don't have such block in our dedup DB? Then all
btrfs_dedupe_search() calls will not find a match, so all streams will
allocate space for their block (which are all identical). At some
point, they will call insert_reserved_file_extent() and will call
btrfs_dedupe_add(). Since there is a global mutex, the first stream
will insert the dedup hash entries into the DB, and all other streams
will find that such hash entry already exists. So the end result is
that we have the hash entry in the DB, but still we have multiple
copies of the same block allocated, due to timing issues. Is this
correct?



That's right, and that's also unavoidable for the hash initializing stage.
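
That window can be modeled in a few lines of userspace code. This is a
toy sketch, not the patchset's code: the function names only loosely
mirror btrfs_dedupe_search()/btrfs_dedupe_add(), and the "DB" is a
single mutex-protected slot. Both writers can miss, both allocate their
own copy, and exactly one insert lands (build with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static char db_hash[32];
static int db_used;

static int dedupe_search(const char *hash)      /* 1 = hit */
{
        int hit;

        pthread_mutex_lock(&db_lock);
        hit = db_used && !memcmp(db_hash, hash, 32);
        pthread_mutex_unlock(&db_lock);
        return hit;
}

static int dedupe_add(const char *hash)         /* 1 = we inserted */
{
        int inserted = 0;

        pthread_mutex_lock(&db_lock);
        if (!db_used) {
                memcpy(db_hash, hash, 32);
                db_used = 1;
                inserted = 1;
        }
        pthread_mutex_unlock(&db_lock);
        return inserted;
}

static void *writer(void *name)
{
        static const char hash[32] = "identical-block-hash";

        if (!dedupe_search(hash)) {
                /* miss: this stream allocates its own extent ... */
                printf("%s: miss, allocating new extent\n", (char *)name);
                /* ... but only one of the inserts lands in the DB */
                if (!dedupe_add(hash))
                        printf("%s: hash already added by peer\n",
                               (char *)name);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, writer, "stream A");
        pthread_create(&b, NULL, writer, "stream B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}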



3) generic_search() competes with __btrfs_free_extent(). Meaning that
generic_search() wants to add a delayed ref to an existing extent,
whereas __btrfs_free_extent() wants to delete an entry from the dedup
DB. The race is resolved as follows:
- generic_search attempts to lock the delayed ref head
- if it succeeds in locking, then __btrfs_free_extent() is not running
right now. So we can add a delayed ref. Later, when the delayed ref head
is run, it will figure out what needs to be done (free the extent
or not)
- if we fail to lock, then there is delayed ref processing for this
bytenr. We drop all locks and redo the search from the top. If
__btrfs_free_extent() has deleted the dedup hash meanwhile, we will
not find it, and proceed with normal COW.
Is my understanding correct?



Yes that's correct.


Reviewing the code again, it seems that I still lack understanding.
What is special about the dedup code adding a delayed data ref versus
other places doing that? In other places, we do not insist on locking
the delayed ref head, but in dedup we do. For example,
__btrfs_drop_extents calls btrfs_inc_extent_ref, without locking the
ref head. I know that one of your purposes was to draw attention to
delayed ref processing, so you have succeeded.


In the patchset, the delayed_ref-related part is not only there to draw 
attention; it resolves real problems.


For example, there is a case where an extent has a ref in the extent 
tree while it's about to be freed, which means there is a DROP ref in 
delayed_refs:


For extent A:
Extent tree | Delayed refs
1           | -1 (Drop ref)

We call dedupe_del() only at __btrfs_free_extent() time, which means 
that unless the delayed refs are run, we still have the hash for extent A.


If we don't lock the delayed_ref_head, the following case may happen:

Dedupe routine              | run_delayed_refs()
dedupe_search()             |
|- Found hash               |
|                           | btrfs_delayed_ref_lock()
|                           | |- run_one_delayed_ref()
|                           | |  |- __btrfs_free_extent()
|                           | |- btrfs_delayed_ref_unlock()
|- btrfs_inc_extent_ref()   |

In that case, we will increase the extent ref of a non-existent extent.
That will cause the next run_delayed_refs() to return -ENOENT and abort 
the transaction.

We have hit this problem several times in our tests.

If we lock the delayed ref head, we ensure the delayed refs of that 
extent won't be run underneath us.


Either we increase the extent ref before run_one_delayed_ref(), or 
after it.

If we run before the delayed ref on that extent, we will have increased 
the extent ref, so __btrfs_free_extent() won't be reached and the 
extent will still be there.

If we run after the delayed ref, we will not find the hash, which 
causes a hash miss, and we continue writing the data to disk.



In case we can't find a delayed_ref_head, there are no delayed refs for 
that data extent yet.
We directly insert the delayed_data_ref while holding delayed_refs->lock, 
to avoid any possible concurrency.
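
The two safe orderings can be sketched with a few lines of userspace
pthreads code. This is a toy model of the trylock-and-retry rule above,
not the kernel implementation; the names only loosely mirror the
functions in this thread (build with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ref_head = PTHREAD_MUTEX_INITIALIZER;
static int extent_exists = 1, refcount = 1;

/* plays run_delayed_refs(): holds the ref head while dropping the ref */
static void *delayed_ref_runner(void *unused)
{
        (void)unused;
        pthread_mutex_lock(&ref_head);
        usleep(1000);                   /* pretend __btrfs_free_extent() */
        if (--refcount == 0)
                extent_exists = 0;      /* extent and its hash are gone */
        pthread_mutex_unlock(&ref_head);
        return NULL;
}

/* plays generic_search(): never bumps a ref under a running drop */
static void *dedupe_searcher(void *unused)
{
        (void)unused;
again:
        if (pthread_mutex_trylock(&ref_head)) {
                usleep(100);            /* delayed ref in flight: redo */
                goto again;
        }
        if (!extent_exists) {           /* freed meanwhile: hash miss */
                pthread_mutex_unlock(&ref_head);
                puts("hash miss, fall back to normal COW");
                return NULL;
        }
        refcount++;                     /* safe: drop ref can't run now */
        puts("added ref to live extent");
        pthread_mutex_unlock(&ref_head);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, delayed_ref_runner, NULL);
        pthread_create(&b, NULL, dedupe_searcher, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("extent_exists=%d refcount=%d\n", extent_exists, refcount);
        return 0;
}

Whichever thread wins the mutex, the result is consistent: either the
ref is bumped on a live extent, or the search sees a hash miss; the
-ENOENT case cannot occur.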




Re: Scrub priority, am I using it wrong?

2016-04-04 Thread Gareth Pye
On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> CPU bound, 0% IOWait even at idle IO priority, in addition to the
> hundreds of M/s values per thread/device, here.  You OTOH are showing
> under 20 M/s per thread/device on spinning rust, with an IOWait near 90%,
> thus making it IO bound.


And yes I'd love to switch to SSD, but 12 2TB drives is a bit pricey still

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


Re: Scrub priority, am I using it wrong?

2016-04-04 Thread Gareth Pye
On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> 1) It appears btrfs scrub start's -c option only takes numeric class, so
> try -c3 instead of -c idle.


Does it count as a bug if it silently accepts the way I was doing it?

I've switched to -c3 and at least now the idle class listed in iotop
is idle, so I hope that means it will be more friendly to other
processes.

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-04 Thread Qu Wenruo



David Sterba wrote on 2016/04/04 18:55 +0200:

On Fri, Mar 25, 2016 at 09:38:50AM +0800, Qu Wenruo wrote:

Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
key type. As this is the second user of that item, there's no precendent
how to select the subtype. Right now 0 is for the dev stats item, but
I'd like to leave some space between them, so it should be 256 at best.
The space is 64bit so there's enough room but this also means defining
the on-disk format.


After checking BTRFS_PERSISTENT_ITEM_KEY, it seems that its value is
larger than the current DEDUPE_BYTENR/HASH_ITEM_KEY values, and given
the objectids used by DEDUPE_HASH_ITEM_KEY items, the status item won't
be the first item of the tree.

That's not a big problem, but for a user running debug-tree it would be
quite annoying to find it located among tons of other hashes.


You can alternatively store it in the tree_root, but I don't know how
frequently it's supposed to be changed.


Storing it in the tree root sounds pretty good.
Such status doesn't change until we enable/disable (including 
re-configure), so the tree root seems a good fit.


But we still need to consider the key order of the later dedupe rate 
statistics. In that case, I hope to store them both in the dedupe tree.




So personally, if using PERSISTENT_ITEM_KEY, I'd at least prefer to keep
the objectid at 0, and move DEDUPE_BYTENR/HASH_ITEM_KEY to higher
values, to ensure the dedupe status is the first item of the dedupe tree.


0 is unfortunately taken by BTRFS_DEV_STATS_OBJECTID, but I don't see a
problem with the ordering. DEDUPE_BYTENR/HASH_ITEM_KEY store a large
number in the objectid: either part of a hash, which is unlikely to be
almost all zeros, or a bytenr, which will be larger than 1MB.


OK, as long as we can look up the status item with an exactly matching 
key, it shouldn't cause a big problem.
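
For readers following along: btrfs keys compare as the tuple (objectid,
type, offset), so an item with a small fixed objectid sorts ahead of
hash/bytenr items whose objectids are large in practice. A standalone
sketch with illustrative values (not the on-disk constants):

#include <stdint.h>
#include <stdio.h>

struct key {
        uint64_t objectid;
        uint8_t  type;
        uint64_t offset;
};

/* Same ordering rule as the on-disk btrfs key comparison:
 * objectid first, then type, then offset. */
static int cmp_key(const struct key *a, const struct key *b)
{
        if (a->objectid != b->objectid)
                return a->objectid < b->objectid ? -1 : 1;
        if (a->type != b->type)
                return a->type < b->type ? -1 : 1;
        if (a->offset != b->offset)
                return a->offset < b->offset ? -1 : 1;
        return 0;
}

int main(void)
{
        /* status item: small fixed objectid (values made up here) */
        struct key status = { .objectid = 0, .type = 249, .offset = 256 };
        /* hash item: objectid holds part of a SHA256, rarely near 0 */
        struct key hash = { .objectid = 0xdeadbeefcafeULL, .type = 48 };

        printf("status item sorts %s the hash item\n",
               cmp_key(&status, &hash) < 0 ? "before" : "after");
        return 0;
}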





4) Ioctl interface with persist dedup status


I'd like to see the ioctl specified in more detail. So far there's
enable, disable and status. I'd expect some way to control the in-memory
limits, let it "forget" current hash cache, specify the dedupe chunk
size, maybe sync of the in-memory hash cache to disk.


So the current and planned ioctls are the following, with some details
related to your in-memory limit control concerns.

1) Enable
  Enable dedupe if it's not enabled already. (disabled -> enabled)


Ok, so it should also take a parameter specifying which backend is about
to be enabled.


It already has one.
It also has limit_nr and limit_mem parameters for the in-memory backend.
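
For orientation, a sketch of the argument block such an enable call
would need to carry, based only on the parameters named in this thread
(backend, limit_nr, limit_mem, dedupe_bs, hash algorithm). This is a
reading aid, not the actual ABI of the patchset:

#include <stdint.h>

/* Illustrative only: field names follow the discussion above,
 * layout and types are guesses, not the patchset's structure. */
struct dedupe_enable_args {
        uint16_t backend;       /* in-memory or on-disk backend */
        uint16_t hash_algo;     /* only SHA256 so far */
        uint64_t dedupe_bs;     /* dedupe block size */
        uint64_t limit_nr;      /* in-memory: max number of hashes */
        uint64_t limit_mem;     /* in-memory: max memory for hashes */
};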




  Or change current dedupe setting to another. (re-configure)


Doing that in 'enable' sounds confusing; any changes belong to a
separate command.


This depends on the point of view.

For the "enable/config/disable" case, it introduces a state machine for
the end user.


Yes, that's exactly my point.


Personally, I don't like a state machine for the end user. Yes, I also
hate merging the play and pause buttons together on a music player.


I don't see how this reference is relevant; we're not designing a music player.


If we use a state machine, the user must ensure that dedupe is enabled
before doing any configuration.


For user convenience we can copy the configuration options to the dedup
enable subcommand, but it will still do separate enable and configure
ioctl calls.


So, that's to say, a user can assume there is a state machine and use 
the enable-configure method.

And another user can use the stateless enable-enable method.

If so, I'm OK with adding a configure ioctl interface.
(It would still be the stateless enable-enable one beneath the stateful ioctl.)

But in that case, if the user forgets to enable dedupe and calls 
configure directly, btrfs won't give any warning and will just enable dedupe.


Would that design be OK for you? Or do we need to share most of the 
enable and configure ioctl code, with the configure ioctl doing an 
extra check?






For me, the user only needs to care about the result of the operation.
The user can now configure dedupe to their needs without having to know
the previous setting.
  From this point of view, "Enable/Disable" is much easier than
"Enable/Config/Disable".


Getting the usability right is hard, and that's why we're having this
discussion. What suits you does not suit others; we have different
habits, expectations, and existing usage patterns. We had better stick
to something that's not too surprising yet still flexible enough
to cover broad needs. I'm leaving this open, but I strongly disagree
with the current interface proposal.


I'm still open to a new ioctl interface design, as long as we can re-use 
most of the current code.


Anyway, just as you pointed out, the stateless one is just my personal taste.




  For a dedupe_bs/backend/hash algorithm (only SHA256 yet) change, it
  will disable dedupe (dropping all hashes) and then enable it with the
  new settings.

  For the in-memory backend, if only the limit differs from the previous
  setting, the limit can be changed on the fly without dropping any hash.


This is obviously misplaced in 'enable'.


Then, changing the 'enable' to 'configure' or other pr

Re: [PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace

2016-04-04 Thread Duncan
Kai Krakow posted on Mon, 04 Apr 2016 22:15:13 +0200 as excerpted:

> Your argument would be less important if it did copy-back, tho... ;-)

FWIW, I completely misunderstood your description of copy-back in my 
original reply, and didn't realize what you meant (and thus my mistaken 
understanding) until I read some of the other replies today.

What I /thought/ you meant was some totally nonsense/WTF idea of keeping 
the newly substituted hot-spare in place, and taking the newly vacated 
"defective" device and putting it back in the the hot-spare list.

That rightly seemed stupid to me (it's a device just replaced as 
defective, now you're putting it back as a hot-spare? WTF?), but that's 
how I read what you were asking for and saying that other solutions did, 
so...

Of course today when I read the other replies and realized what you were 
/actually/ describing, returning the hot-spare to hot-spare status after 
physically replacing the actually failed drive with a new one and 
logically replacing the hot-spare with it in the filesystem, thereby 
making the hot-spare a spare once again, my reaction was "DUH!! NOW it 
makes sense!"  But I was just going to let it go and go hide my original 
misunderstanding in a hole somewhere.

But now you replied to my reply, so I figured I would reply back, 
explaining what on earth I was thinking when I wrote it, and why it must 
have seemed rather out of left field and didn't make much sense -- 
because what I was thinking you were suggesting /didn't/ make sense, but 
of course that's because I totally misunderstood what you were suggesting.

So now my very-much-former misunderstanding is out of the hole and posted 
for everyone to see and have a good laugh at, and I'm much the wiser on 
what copy-back actually entails. =:^)

Tho it seems I was correct in the one aspect, currently ENotImplemented, 
even if my idea of what you were asking to be implemented was totally and 
completely off-the-wall wrong.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Scrub priority, am I using it wrong?

2016-04-04 Thread Duncan
Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:

> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
> spinning disks. The server is moderately loaded with various regular
> tasks that use a fair bit of disk IO, but I've scheduled my weekly btrfs
> scrub for the best quiet time in the week.
> 
> The command that is run is:
> /usr/local/bin/btrfs scrub start -Bd -c idle /data
> 
> Which is my best attempt to try and get it to have a low impact on user
> operations
> 
> But iotop shows me:
> 
> 1765 be/4 root   14.84 M/s    0.00 B/s  0.00 % 96.65 % btrfs scrub
> start -Bd -c idle /data
>  1767 be/4 root   14.70 M/s    0.00 B/s  0.00 % 95.35 % btrfs
> scrub start -Bd -c idle /data
>  1768 be/4 root   13.47 M/s    0.00 B/s  0.00 % 92.59 % btrfs
> scrub start -Bd -c idle /data
>  1764 be/4 root   12.61 M/s    0.00 B/s  0.00 % 88.77 % btrfs
> scrub start -Bd -c idle /data
>  1766 be/4 root   11.24 M/s    0.00 B/s  0.00 % 85.18 % btrfs
> scrub start -Bd -c idle /data
>  1763 be/4 root    7.79 M/s    0.00 B/s  0.00 % 63.30 % btrfs
> scrub start -Bd -c idle /data
> 28858 be/4 root    0.00 B/s  810.50 B/s  0.00 % 61.32 % [kworker/u16:25]
> 
> 
> Which doesn't look like an idle priority to me. And the system sure
> feels like a system with a lot of heavy io going on. Is there something
> I'm doing wrong?

Two points:

1) It appears btrfs scrub start's -c option only takes numeric class, so 
try -c3 instead of -c idle.

Works for me with the numeric class (same results as you with spelled out 
class), tho I'm on ssd with multiple independent btrfs on partitions, the 
biggest of which is 24 GiB, 18.something GiB used, which scrubs in all of 
20 seconds, so I don't need and hadn't tried the -c option at all until 
now. 

2) What a difference an ssd makes!

$$ sudo btrfs scrub start -c3 /p
scrub started on /p, [...]

$$ sudo iotop -obn1
Total DISK READ : 626.53 M/s | Total DISK WRITE :   0.00 B/s
Actual DISK READ: 596.93 M/s | Actual DISK WRITE:   0.00 B/s
  TID  PRIO  USER  DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
  872 idle  root  268.40 M/s    0.00 B/s  0.00 %  0.00 % btrfs scrub
start -c3 /p
  873 idle  root  358.13 M/s    0.00 B/s  0.00 %  0.00 % btrfs scrub
start -c3 /p

CPU bound, 0% IOWait even at idle IO priority, in addition to the 
hundreds of M/s values per thread/device, here.  You OTOH are showing 
under 20 M/s per thread/device on spinning rust, with an IOWait near 90%, 
thus making it IO bound.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Duncan
Kai Krakow posted on Mon, 04 Apr 2016 21:26:28 +0200 as excerpted:

> I'll go test the soon-to-die SSD as soon as it replaced. I think it's
> still far from failing with bitrot. It was overprovisioned by 30% most
> of the time, with the spare space trimmed.

Same here, FWIW.  In fact, I had expected to get ~128 GB SSDs and ended 
up getting 256 GB, such that I was only using about 130 GiB, so 
depending on what the overprovisioning percentage is calculated 
against, I was and am near 50% or 100% overprovisioned.

So in my case I think the SSD was simply defective, such that the 
overprovisioning and trim simply didn't help.  The other two identical 
brand and model devices, bought from the same store at the same time 
and thus very likely from the same manufacturing lot, were and are just 
fine.  (One shows a trivial non-zero raw value for attribute 5, 
reallocated sector count, and attribute 182, erase fail count total, 
but both remain at 100% "cooked" value; the other, actually the one of 
the original pair that wasn't replaced, has absolutely no issues at all.)

But based on that experience, while overprovisioning may help in terms of 
normal wearout, it doesn't necessarily help at all if the device is 
actually going bad.

> It certainly should have a
> lot of sectors for wear levelling. In addition, smartctl shows no sector
> errors at all - except for one: raw_read_error_rate. I'm not sure what
> all those sensors tell me, but that one I'm also seeing on hard disks
> which show absolutely no data damage.
> 
> In fact, I see those counters for my hard disks. But dd to /dev/null of
> the complete raw hard disk shows no sector errors. It seems good. But
> well, counting 1+1 together: I currently see data damage. But I guess
> that's unrelated.
> 
> Is there some documentation somewhere what each of those sensors
> technically mean and how to read the raw values and thresh values?

Nothing user/admin level that I'm aware of.  I'm sure there's some smart 
docs somewhere that describe them as part of the standard, but they could 
easily be effectively unavailable for those unwilling to pay a big-
corporate-sized consortium membership fee (as was the case with one of 
the CompactDisc specs, Orange Book IIRC, at one point).

I know there's some discussion by allusion in the smartctl manpage and 
docs, but many attributes appear to be manufacturer specific and/or to 
have been reverse-engineered by the smartctl devs, meaning even /they/ 
don't really have access to proper documentation for at least some 
attributes.

Which is sad, but in a majority proprietary or at best don't-care 
market...

> I'm also seeing multi_zone_error_rate on my spinning rust.

> According to smartctl health check and smartctl extended selftest,
> there's no problems at all - and the smart error log is empty. There has
> never been an ATA error in dmesg... No relocated sectors... From my
> naive view the drives still look good.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v3 01/22] btrfs-progs: convert: Introduce functions to read used space

2016-04-04 Thread Qu Wenruo



David Sterba wrote on 2016/04/04 15:35 +0200:

On Fri, Jan 29, 2016 at 01:03:11PM +0800, Qu Wenruo wrote:

Before we do real convert, we need to read and build up used space cache
tree for later data/meta separate chunk layout.

This patch will iterate all used blocks in ext2 filesystem and record it
into cctx->used cache tree, for later use.

This provides the very basic of later btrfs-convert rework.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
  btrfs-convert.c | 80 +
  1 file changed, 80 insertions(+)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index 4baa68e..65841bd 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -81,6 +81,7 @@ struct btrfs_convert_context;
  struct btrfs_convert_operations {
const char *name;
int (*open_fs)(struct btrfs_convert_context *cctx, const char *devname);
+   int (*read_used_space)(struct btrfs_convert_context *cctx);
int (*alloc_block)(struct btrfs_convert_context *cctx, u64 goal,
   u64 *block_ret);
int (*alloc_block_range)(struct btrfs_convert_context *cctx, u64 goal,
@@ -230,6 +231,73 @@ fail:
return -1;
  }

+static int __ext2_add_one_block(ext2_filsys fs, char *bitmap,
+   unsigned long group_nr, struct cache_tree *used)
+{
+   unsigned long offset;
+   unsigned i;
+   int ret = 0;
+
+   offset = fs->super->s_first_data_block;
+   offset /= EXT2FS_CLUSTER_RATIO(fs);


This macro does not exist on my reference host for old distros. The
e2fsprogs version is 1.41.14 and I'd like to keep the compatibility at
least at that level.

The clustering has been added in 1.42 but can we add some compatibility
layer that will work on both version?


No problem.

It's a simple macro. For older versions that don't provide it, we can 
just define it in btrfs-convert.c.
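
Presumably something along these lines; a guess at the shape of the
fallback, not the committed fix (before bigalloc, the block-to-cluster
ratio was always 1):

/* e2fsprogs < 1.42 has no bigalloc and thus no cluster support:
 * every cluster is exactly one block, so the ratio is 1. */
#ifndef EXT2FS_CLUSTER_RATIO
#define EXT2FS_CLUSTER_RATIO(fs)        (1)
#endif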


Thanks,
Qu




Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-04 Thread Qu Wenruo



David Sterba wrote on 2016/04/04 13:18 +0200:

On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:

After another look, why don't we use nodesize directly? Or stripesize
where it applies. With max_size == 0 the test does not make sense; we
ought to know the alignment.



Yes, my first thought was also to use nodesize directly, which should
always be correct.

But the problem is, the related function call stack doesn't have any
member to reach btrfs_root or btrfs_fs_info.


JFYI, there's global_info available, so it's not necessary to pass
fs_info down the callstacks.



Oh, that's good news.

Do I need to re-submit the patch to use fs_info->tree_root->nodesize to 
avoid the false alert?

Or wait for your refactor?
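
In outline, the check would then look something like this standalone
sketch with made-up values; in btrfs-progs the nodesize would come from
the global fs_info (e.g. fs_info->tree_root->nodesize as discussed):

#include <stdint.h>
#include <stdio.h>

static int aligned(uint64_t x, uint64_t a)
{
        return (x % a) == 0;
}

int main(void)
{
        uint64_t nodesize = 16384;      /* stand-in for the fs_info value */
        uint64_t start = 29360128, len = 16384;

        /* metadata extents must be nodesize-aligned; checking against
         * the real nodesize avoids the max_size == 0 false alert */
        if (!aligned(start, nodesize) || !aligned(len, nodesize))
                printf("misaligned metadata extent: real error\n");
        else
                printf("aligned: no false alert\n");
        return 0;
}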

Thanks,
Qu




Re: Qgroups wrong after snapshot create

2016-04-04 Thread Qu Wenruo

Hi,

Thanks for the report.

Mark Fasheh wrote on 2016/04/04 16:06 -0700:

Hi,

Making a snapshot gets us the wrong qgroup numbers. This is very easy to
reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
snapshot. In this example we have mounted a newly created fresh filesystem
and mounted it at /btrfs:

# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
# btrfs qg show /btrfs

qgroupid rfer excl
  
0/5  32.00KiB 32.00KiB
0/25716.00KiB 16.00KiB



Also reproduced it.

My first idea is that the old snapshot qgroup hack is involved.

Unlike btrfs_inc/dec_extent_ref(), snapshotting just uses a dirty hack 
to handle it:

Copy rfer from the source subvolume, and directly set excl to nodesize.

If that work happens before the snapshot inode is added into the source 
subvolume, it may be the cause of the bug.
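
Reduced to its arithmetic, the hack looks roughly like this toy
paraphrase (not the actual kernel function; the field names mirror the
qgroup counters, everything else is illustrative):

#include <stdint.h>
#include <stdio.h>

struct qgroup { uint64_t rfer, excl; };

/* Snapshot shortcut: the new qgroup copies rfer from the source and
 * assumes exactly one tree node (its fresh root) is exclusive,
 * instead of doing a full accounting pass. */
static void snapshot_inherit(struct qgroup *dst, const struct qgroup *src,
                             uint64_t nodesize)
{
        dst->rfer = src->rfer;          /* shares everything with source */
        dst->excl = nodesize;           /* only its own root node */
}

int main(void)
{
        struct qgroup src = { .rfer = 16384, .excl = 16384 }, snap;

        snapshot_inherit(&snap, &src, 16384);
        printf("snap: rfer=%llu excl=%llu\n",
               (unsigned long long)snap.rfer,
               (unsigned long long)snap.excl);
        return 0;
}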




In the example above, the default subvolume (0/5) should read 16KiB
referenced and 16KiB exclusive.

A rescan fixes things, so we know the rescan process is doing the math
right:

# btrfs quota rescan /btrfs
# btrfs qgroup show /btrfs
qgroupid rfer excl
  
0/5  16.00KiB 16.00KiB
0/25716.00KiB 16.00KiB



So the base of the qgroup code is not affected; otherwise we might need 
another painful rework.





The last kernel to get this right was v4.1:

# uname -r
4.1.20
# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
Create a snapshot of '/btrfs/' in '/btrfs/snap1'
# btrfs qg show /btrfs
qgroupid rfer excl
  
0/5  16.00KiB 16.00KiB
0/25716.00KiB 16.00KiB


Which leads me to believe that this was a regression introduced by Qu's
rewrite as that is the biggest change to qgroups during that development
period.


Going back to upstream, I applied my tracing patch from this list
( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
couple changes - I'm printing the rfer/excl bytecounts in
qgroup_update_counters AND I print them twice - once before we make any
changes and once after the changes. If I enable tracing in
btrfs_qgroup_account_extent and qgroup_update_counters just before the
snapshot creation, we get the following trace:


# btrfs quota enable /btrfs
# 
# echo 1 > 
/sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
# echo 1 > //sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
# btrfs sub sna /btrfs/ /btrfs/snap2
Create a snapshot of '/btrfs/' in '/btrfs/snap2'
# btrfs qg show /btrfs
qgroupid rfer excl
  
0/5  32.00KiB 32.00KiB
0/25716.00KiB 16.00KiB
# fstest1:~ # cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
btrfs-10233 [001]  260298.823339: btrfs_qgroup_account_extent: 
bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid 
= 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid 
= 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
btrfs-10233 [001]  260298.823343: btrfs_qgroup_account_extent: 
bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001]  260298.823345: btrfs_qgroup_account_extent: 
bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001]  260298.823347: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1


Now, for extent 29786112, its nr_new_roots is 1.


btrfs-10233 [001]  260298.823347: qgroup_update_counters: qgid 
= 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
btrfs-10233 [001]  260298.823348: qgroup_update_counters: qgid 
= 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs-10233 [001]  260298.823421: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0


Now the problem is here: nr_old_roots should be 1, not 0.
Just as the previous trace line shows, we increased the extent ref on 
that extent, but now it has dropped back to 0.

Since its old_roots == new_roots == 0, the qgroup code doesn't do 
anything with it.
If its nr_old_roots were 1, qgroup would drop its excl/rfer to 0, and 
then the accounting might go back to normal.

[PATCH] Btrfs: fix missing s_id setting

2016-04-04 Thread Tsutomu Itoh
When fs_devices->latest_bdev is deleted or replaced, sb->s_id is not
updated.
As a result, the deleted device name is displayed by btrfs_printk.

[before fix]
 # btrfs dev del /dev/sdc4 /mnt2
 # btrfs dev add /dev/sdb6 /mnt2

 [  217.458249] BTRFS info (device sdc4): found 1 extents
 [  217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
 [  217.941284] BTRFS info (device sdc4): disk added /dev/sdb6

[after fix]
 # btrfs dev del /dev/sdc4 /mnt2
 # btrfs dev add /dev/sdb6 /mnt2

 [   83.835072] BTRFS info (device sdc4): found 1 extents
 [   84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
 [   84.401951] BTRFS info (device sdc3): disk added /dev/sdb6

Signed-off-by: Tsutomu Itoh 
---
 fs/btrfs/dev-replace.c |  5 -
 fs/btrfs/volumes.c | 11 +--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
tgt_device->commit_bytes_used = src_device->bytes_used;
if (fs_info->sb->s_bdev == src_device->bdev)
fs_info->sb->s_bdev = tgt_device->bdev;
-   if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+   if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+   snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+tgt_device->bdev);
+   }
list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
fs_info->fs_devices->rw_devices++;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 struct btrfs_device, dev_list);
if (device->bdev == root->fs_info->sb->s_bdev)
root->fs_info->sb->s_bdev = next_device->bdev;
-   if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+   if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+   snprintf(root->fs_info->sb->s_id,
+sizeof(root->fs_info->sb->s_id), "%pg",
+next_device->bdev);
+   }
 
if (device->bdev) {
device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 struct btrfs_device, dev_list);
if (tgtdev->bdev == fs_info->sb->s_bdev)
fs_info->sb->s_bdev = next_device->bdev;
-   if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+   if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
fs_info->fs_devices->latest_bdev = next_device->bdev;
+   snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+next_device->bdev);
+   }
list_del_rcu(&tgtdev->dev_list);
 
call_rcu(&tgtdev->rcu, free_device);
-- 
2.6.4




Scrub priority, am I using it wrong?

2016-04-04 Thread Gareth Pye
I've got a btrfs file system set up on 6 drbd disks running on 2Tb
spinning disks. The server is moderately loaded with various regular
tasks that use a fair bit of disk IO, but I've scheduled my weekly
btrfs scrub for the best quiet time in the week.

The command that is run is:
/usr/local/bin/btrfs scrub start -Bd -c idle /data

Which is my best attempt to try and get it to have a low impact on
user operations

But iotop shows me:

1765 be/4 root   14.84 M/s    0.00 B/s  0.00 % 96.65 % btrfs scrub
start -Bd -c idle /data
 1767 be/4 root   14.70 M/s    0.00 B/s  0.00 % 95.35 % btrfs
scrub start -Bd -c idle /data
 1768 be/4 root   13.47 M/s    0.00 B/s  0.00 % 92.59 % btrfs
scrub start -Bd -c idle /data
 1764 be/4 root   12.61 M/s    0.00 B/s  0.00 % 88.77 % btrfs
scrub start -Bd -c idle /data
 1766 be/4 root   11.24 M/s    0.00 B/s  0.00 % 85.18 % btrfs
scrub start -Bd -c idle /data
 1763 be/4 root    7.79 M/s    0.00 B/s  0.00 % 63.30 % btrfs
scrub start -Bd -c idle /data
28858 be/4 root    0.00 B/s  810.50 B/s  0.00 % 61.32 % [kworker/u16:25]


Which doesn't look like an idle priority to me. And the system sure
feels like a system with a lot of heavy io going on. Is there
something I'm doing wrong?

System details:

# uname -a
Linux emile 4.4.3-040403-generic #201602251634 SMP Thu Feb 25 21:36:25
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# /usr/local/bin/btrfs --version
btrfs-progs v4.4.1

I'm waiting on the ppa version of 4.5.1 before upgrading, that is my
usual kernel update strategy.

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.4 LTS"

Any other details that people would like to see that are relevant to
this question?

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Chris Murphy
On Mon, Apr 4, 2016 at 2:50 PM, Kai Krakow  wrote:

>> Anyway the 2nd 4 is not possible. The seed is ro by definition so you
>> can't remove snapshots from the seed. If you remove them from the
>> mounted rw sprout volume, they're removed from the sprout, not the
>> seed. If you want them on the sprout, but not on the seed, you need to
>> delete snapshots only after the seed is a.) removed from the sprout
>> and b.) made no longer a seed with btrfstune -S 0 and c.) mounted rw.
>
> If I understand right, the seed device won't change? So whatever action
> I apply to the sprout pool, I can later remove the seed from the pool
> and it will still be kind of untouched. Except, I'll have to return it
> to non-seed mode (step b).

Correct. In a sense, making a volume a seed is like making it a
volume-wide read-only snapshot. Any changes are applied via COW only
to added device(s).

>
> Why couldn't/shouldn't I remove snapshots before detaching the seed
> device? I want to keep them on the seed but they are useless to me on
> the sprout.

You can remove snapshots before or after detaching the seed device, it
doesn't matter, but such snapshot removal only affects the sprout. You
wrote:

"remove all left-over snapshots from the seed"

The seed is read only, you can't modify the contents of the seed device.

What you should do is just delete the snapshots you don't want
migrated over to the sprout right away before you even do the balance
-dconvert -mconvert. That way you aren't wasting time moving things
over that you don't want. To be clear:

btrfstune -S 0
mount /dev/seed /mnt/
btrfs dev add /dev/new1
btrfs dev add /dev/new2
mount -o remount,rw /mnt/
btrfs sub del blah/ blah2/ blah3/ blah4/
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/
btrfs dev del /dev/seed /mnt/

If you're doing any backups once remounting rw, note those backups
will only be on the sprout. Backups will not be on the seed because
it's read-only.


>
> What happens to the UUIDs when I separate seed and sprout?

Nothing. They remain intact and unique, per volume.




>
> I'd now reboot into the system to see if it's working.

Note you'll need to change grub.cfg, possibly fstab, and possibly the
initramfs, all three of which may be referencing the old volume.


> By then, it's
> time for some cleanup (remove the previously deferred "trashes" and
> retention snapshots), then separate the seed from the sprout. During
> that time, I could already use my system again while it's migrating for
> me in the background.
>
> I'd then return the seed back to non-seed, so it can take the role of
> my backup storage again. I'd do a rebalance now.

OK? I don't know why you need to balance the seed at all, let alone
afterward, but it seems like it might be a more efficient replication
if you balanced before making it a seed?


>
> During the whole process, the backup storage will still stay safe for
> me. If something goes wrong, I could easily start over.
>
> Did I miss something? Is it too much of an experimental kind of stuff?

I'm not sure where all the bugs are. It's good to find bugs though and
get them squashed. I have an idea of making live media use Btrfs
instead of using a loop mounted file to back a rw lvm snapshot device
(persistent overlay), which I think is really fragile and a lot more
complicated in the initramfs. It's also good to take advantage of
checksumming after having written an ISO to flash media, where users
often don't verify or something can mount the USB stick rw and
immediately modify the stick in such a way that media verification
will fail anyway. So, a number of plusses, I'd like to see the seed
device be robust.


>
> BTW: The way it is arranged now, the backup storage is bootable by
> setting the scratch area subvolume as the rootfs on kernel cmdline,
> USB drivers are included in the kernel, it's tested and works. I guess,
> this isn't possible while the backup storage acts as a seed device? But
> I have an initrd with latest btrfs-progs on my boot device (which is an
> UEFI ESP, so not related to btrfs at all), I should be able to use that
> to revert changes preventing me from booting.



-- 
Chris Murphy


Qgroups wrong after snapshot create

2016-04-04 Thread Mark Fasheh
Hi,

Making a snapshot gets us the wrong qgroup numbers. This is very easy to
reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a
snapshot. In this example we have mounted a newly created fresh filesystem
and mounted it at /btrfs:

# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
# btrfs qg show /btrfs

qgroupid rfer excl 
   
0/5  32.00KiB 32.00KiB 
0/25716.00KiB 16.00KiB 


In the example above, the default subvolume (0/5) should read 16KiB
referenced and 16KiB exclusive.

A rescan fixes things, so we know the rescan process is doing the math
right:

# btrfs quota rescan /btrfs
# btrfs qgroup show /btrfs
qgroupid rfer excl 
   
0/5  16.00KiB 16.00KiB 
0/25716.00KiB 16.00KiB 



The last kernel to get this right was v4.1:

# uname -r
4.1.20
# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
Create a snapshot of '/btrfs/' in '/btrfs/snap1'
# btrfs qg show /btrfs
qgroupid rfer excl 
   
0/5  16.00KiB 16.00KiB 
0/25716.00KiB 16.00KiB 


Which leads me to believe that this was a regression introduced by Qu's
rewrite as that is the biggest change to qgroups during that development
period.


Going back to upstream, I applied my tracing patch from this list
( http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685 ), with a
couple changes - I'm printing the rfer/excl bytecounts in
qgroup_update_counters AND I print them twice - once before we make any
changes and once after the changes. If I enable tracing in
btrfs_qgroup_account_extent and qgroup_update_counters just before the
snapshot creation, we get the following trace:


# btrfs quota enable /btrfs
# 
# echo 1 > 
/sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
# echo 1 > //sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
# btrfs sub sna /btrfs/ /btrfs/snap2
Create a snapshot of '/btrfs/' in '/btrfs/snap2'
# btrfs qg show /btrfs
qgroupid rfer excl 
   
0/5  32.00KiB 32.00KiB 
0/25716.00KiB 16.00KiB 
# fstest1:~ # cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
   btrfs-10233 [001]  260298.823339: btrfs_qgroup_account_extent: 
bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
   btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid = 
5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
   btrfs-10233 [001]  260298.823342: qgroup_update_counters: qgid = 
5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
   btrfs-10233 [001]  260298.823343: btrfs_qgroup_account_extent: 
bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
   btrfs-10233 [001]  260298.823345: btrfs_qgroup_account_extent: 
bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
   btrfs-10233 [001]  260298.823347: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
   btrfs-10233 [001]  260298.823347: qgroup_update_counters: qgid = 
5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
   btrfs-10233 [001]  260298.823348: qgroup_update_counters: qgid = 
5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
   btrfs-10233 [001]  260298.823421: btrfs_qgroup_account_extent: 
bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
   btrfs-10233 [001]  260298.823422: btrfs_qgroup_account_extent: 
bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
   btrfs-10233 [001]  260298.823425: btrfs_qgroup_account_extent: 
bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
   btrfs-10233 [001]  260298.823426: qgroup_update_counters: qgid = 
5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
   btrfs-10233 [001]  260298.823426: qgroup_update_counters: qgid = 
5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768

If you read through the whole log, we do some... interesting... things: at
the start, we *subtract* from qgroup 5, making its count go to zero. I want
to say that this is kind of unexpected for a snapshot create, but perhaps
there's something I'm missing.

Remember that I'm printing each qgroup tw

Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Kai Krakow
Am Mon, 4 Apr 2016 22:50:18 +0200
schrieb Kai Krakow :

> Am Mon, 4 Apr 2016 13:57:50 -0600
> schrieb Chris Murphy :
> 
> > On Mon, Apr 4, 2016 at 1:36 PM, Kai Krakow 
> > wrote:
> >   
> > >
> >  [...]
>  [...]  
> > >
> > > In the following sense: I should disable the automounter and
> > > backup job for the seed device while I let my data migrate back
> > > to main storage in the background...
> > 
> > The sprout can be written to just fine by the backup, just
> > understand that the seed and sprout volume UUID are different. Your
> > automounter is probably looking for the seed's UUID, and that seed
> > can only be mounted ro. The sprout UUID however can be mounted rw.
> > 
> > I would probably skip the automounter. Do the seed setup, mount it,
> > add all devices you're planning to add, then -o
> > remount,rw,compress... , and then activate the backup. But maybe
> > your backup also is looking for UUID? If so, that needs to be
> > updated first. Once the balance -dconvert=raid1 and -mconvert=raid1
> > is finished, then you can remove the seed device. And now might be
> > a good time to give the raid1 a new label, I think it inherits the
> > label of the seed but I'm not certain of this.
> > 
> >   
> > > My intention is to use fully my system while btrfs migrates the
> > > data from seed to main storage. Then, afterwards I'd like to
> > > continue using the seed device for backups.
> > >
> > > I'd probably do the following:
> > >
> > > 1. create btrfs pool, attach seed
> > 
> > I don't understand that step in terms of commands. Sprouts are made
> > with btrfs dev add, not with mkfs. There is no pool creation. You
> > make a seed. You mount it. Add devices to it. Then remount it.  
> 
> Hmm, yes. I hadn't thought this through in detail yet. It actually
> works that way. I was referring more to the general approach.
> 
> But I think this answers my question... ;-)
> 
> > > 2. recreate my original subvolume structure by snapshotting the
> > > backup scratch area multiple times into each subvolume
> > > 3. rearrange the files in each subvolume to match their intended
> > > use by using rm and mv
> > > 4. reboot into full system
> > > 4. remove all left-over snapshots from the seed
> > > 5. remove (detach) the seed device
> > 
> > You have two 4's.  
> 
> Oh... Sorry... I think one week of 80 work hours, and another of 60
> was a bit too much... ;-)
> 
> > Anyway the 2nd 4 is not possible. The seed is ro by definition so
> > you can't remove snapshots from the seed. If you remove them from
> > the mounted rw sprout volume, they're removed from the sprout, not
> > the seed. If you want them on the sprout, but not on the seed, you
> > need to delete snapshots only after the seed is a.) removed from
> > the sprout and b.) made no longer a seed with btrfstune -S 0 and
> > c.) mounted rw.  
> 
> If I understand right, the seed device won't change? So whatever
> action I apply to the sprout pool, I can later remove the seed from
> the pool and it will still be kind of untouched. Except, I'll have to
> return it to non-seed mode (step b).
> 
> Why couldn't/shouldn't I remove snapshots before detaching the seed
> device? I want to keep them on the seed but they are useless to me on
> the sprout.
> 
> What happens to the UUIDs when I separate seed and sprout?
> 
> This is my layout:
> 
> /dev/sde1 contains my backup storage: btrfs with multiple weeks worth
> of retention in form of ro snapshots, and one scratch area in which
> the backup is performed. Snapshots are created from the scratch area.
> The scratch area is one single subvolume updated by rsync.
> 
> I want to turn this into a seed for my newly created btrfs pool. This
> one has subvolumes for /home, /home/my_user, /distribution_name/rootfs
> and a few more (like var/log etc).
> 
> Since the backup is not split by those subvolumes but contains just
> the single runtime view of my system rootfs, I'm planning to clone
> this single subvolume back into each of my previously used subvolumes
> which in turn of course now contain all the same complete filesystem
> tree. Thus, in the next step, I'm planning to mv/rm the contents to
> get back to the original subvolume structure - mv should be a fast
> operation here, rm probably not so but I don't bother. I could defer
> that until later by moving those rm-candidates into some trash folder
> per subvolume.
> 
> Now, I still have the ro-snapshots worth of multiple weeks of
> retention. I only need those in my backup storage, not in the storage
> proposed to become my bootable system. So I'd simply remove them. I
> could also defer that until later easily.
> 
> This should get my system back into working state pretty fast and
> easily if I didn't miss a point.
> 
> I'd now reboot into the system to see if it's working. By then, it's
> time for some cleanup (remove the previously deferred "trashes" and
> retention snapshots), then separate the seed from the sprout. During
> that time, I co

Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Kai Krakow
Am Mon, 4 Apr 2016 13:57:50 -0600
schrieb Chris Murphy :

> On Mon, Apr 4, 2016 at 1:36 PM, Kai Krakow 
> wrote:
> 
> >  
>  [...]  
> >>
> >> ?  
> >
> > In the following sense: I should disable the automounter and backup
> > job for the seed device while I let my data migrate back to main
> > storage in the background...  
> 
> The sprout can be written to just fine by the backup, just understand
> that the seed and sprout volume UUID are different. Your automounter
> is probably looking for the seed's UUID, and that seed can only be
> mounted ro. The sprout UUID however can be mounted rw.
> 
> I would probably skip the automounter. Do the seed setup, mount it,
> add all devices you're planning to add, then -o remount,rw,compress...
> , and then activate the backup. But maybe your backup also is looking
> for UUID? If so, that needs to be updated first. Once the balance
> -dconvert=raid1 and -mconvert=raid1 is finished, then you can remove
> the seed device. And now might be a good time to give the raid1 a new
> label, I think it inherits the label of the seed but I'm not certain
> of this.
> 
> 
> > My intention is to use fully my system while btrfs migrates the data
> > from seed to main storage. Then, afterwards I'd like to continue
> > using the seed device for backups.
> >
> > I'd probably do the following:
> >
> > 1. create btrfs pool, attach seed  
> 
> I don't understand that step in terms of commands. Sprouts are made
> with btrfs dev add, not with mkfs. There is no pool creation. You make
> a seed. You mount it. Add devices to it. Then remount it.

Hmm, yes. I hadn't thought this through in detail yet. It actually
works that way. I was referring more to the general approach.

But I think this answers my question... ;-)

> > 2. recreate my original subvolume structure by snapshotting the
> > backup scratch area multiple times into each subvolume
> > 3. rearrange the files in each subvolume to match their intended
> > use by using rm and mv
> > 4. reboot into full system
> > 4. remove all left-over snapshots from the seed
> > 5. remove (detach) the seed device  
> 
> You have two 4's.

Oh... Sorry... I think one week of 80 work hours, and another of 60 was
a bit too much... ;-)

> Anyway the 2nd 4 is not possible. The seed is ro by definition so you
> can't remove snapshots from the seed. If you remove them from the
> mounted rw sprout volume, they're removed from the sprout, not the
> seed. If you want them on the sprout, but not on the seed, you need to
> delete snapshots only after the seed is a.) removed from the sprout
> and b.) made no longer a seed with btrfstune -S 0 and c.) mounted rw.

If I understand right, the seed device won't change? So whatever action
I apply to the sprout pool, I can later remove the seed from the pool
and it will still be kind of untouched. Except, I'll have to return it
to non-seed mode (step b).

Why couldn't/shouldn't I remove snapshots before detaching the seed
device? I want to keep them on the seed but they are useless to me on
the sprout.

What happens to the UUIDs when I separate seed and sprout?

This is my layout:

/dev/sde1 contains my backup storage: btrfs with multiple weeks worth
of retention in form of ro snapshots, and one scratch area in which the
backup is performed. Snapshots are created from the scratch area. The
scratch area is one single subvolume updated by rsync.

I want to turn this into a seed for my newly created btrfs pool. This
one has subvolumes for /home, /home/my_user, /distribution_name/rootfs
and a few more (like var/log etc).

Since the backup is not split by those subvolumes but contains just the
single runtime view of my system rootfs, I'm planning to clone this
single subvolume back into each of my previously used subvolumes which
in turn of course now contain all the same complete filesystem tree.
Thus, in the next step, I'm planning to mv/rm the contents to get back
to the original subvolume structure - mv should be a fast operation
here, rm probably not so but I don't bother. I could defer that until
later by moving those rm-candidates into some trash folder per
subvolume.

Now, I still have the ro-snapshots worth of multiple weeks of
retention. I only need those in my backup storage, not in the storage
proposed to become my bootable system. So I'd simply remove them. I
could also defer that until later easily.

This should get my system back into working state pretty fast and
easily if I didn't miss a point.

I'd now reboot into the system to see if it's working. By then, it's
time for some cleanup (remove the previously deferred "trashes" and
retention snapshots), then separate the seed from the sprout. During
that time, I could already use my system again while it's migrating for
me in the background.

I'd then return the seed back to non-seed, so it can take the role of
my backup storage again. I'd do a rebalance now.

During the whole process, the backup storage will still stay safe for
me. If something goes wro

Re: csum failed on innexistent inode

2016-04-04 Thread Kai Krakow
Am Mon, 4 Apr 2016 03:50:54 -0400
schrieb Jérôme Poulin :

> How is it possible to get rid of the referenced csum errors if they do
> not exist? Also, the expected checksum looks suspiciously the same for
> multiple errors. Could it be bad RAM in that case? Can I convince
> BTRFS to update the csum?
> 
> # btrfs inspect-internal logical-resolve -v 1809149952 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory
> # btrfs inspect-internal inode-resolve -v 296 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory

I fell into that pitfall, too. If you have multiple subvolumes, you
need to pass the correct subvolume path for the inode to properly
resolve.

Maybe that's the case for you?

First, take a look at what "btrfs subvol list /mnt/btrfs" shows you.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: [PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace

2016-04-04 Thread Kai Krakow
Am Mon, 4 Apr 2016 04:45:16 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Kai Krakow posted on Mon, 04 Apr 2016 02:00:43 +0200 as excerpted:
> 
> > Does this also implement "copy-back" - thus, it returns the
> > hot-spare device to global hot-spares when the failed device has
> > been replaced?  
> 
> I don't believe it does that in this initial implementation, anyway.
> 
> There's a number of issues with the initial implementation, including
> the fact that the hot-spare is global only and can't be specifically
> assigned to a filesystem or set of filesystems, which means, if you
> have multiple filesystems using different sized devices, the
> hot-spares must be sized to match the largest device they could
> replace, and thus would be mostly wasted if they ended up replacing a
> far smaller device.  If the spares could be associated with specific
> filesystems, then specifically sized spares could be associated
> appropriately, avoiding that waste. Additionally, it would then be
> possible to queue up say 20 spares on an important filesystem, with
> no spares on another that you'd rather just go down if a device fails.
> 
> So obviously the initial implementation isn't seriously
> enterprise-ready and is sub-optimal in many ways, but it's better
> than what is currently available (no automated spare handling at
> all), and an implementation must start somewhere, so as long as it's
> designed to be improved and extended with the missing features over
> time, as has been indicated, it's a reasonable first-implementation.

Your argument would be less important if it did copy-back, tho... ;-)

It's a very welcome and good start; I didn't mean to talk it down as
useless. Not by any means.

But to handle it right, that point should be clear. Currently, if the
global spare jumps in, you can always simulate a classic "hot spare" by
manually adding back a correctly sized drive, then removing the former
spare again to simulate the copy-back, and finally making it a global
spare again.

Since such an incident needs manual investigation anyway, it's totally
reasonable to start with this implementation.

This sort of handling could be made into a guide within the docs.
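
A rough sketch of that manual procedure, using only the existing
device add/remove commands (device names are hypothetical; how the
spare is registered again is whatever the patchset's tooling provides):

  # the global spare /dev/sde has already jumped in for the failed disk
  btrfs device add /dev/sdf /mnt       # add the permanent replacement
  btrfs device remove /dev/sde /mnt    # migrates chunks off the former
                                       # spare - the "copy-back"
  # then re-register /dev/sde as a global spare per the patchset docs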

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread Dan Carpenter
On Mon, Apr 04, 2016 at 05:02:38PM +0100, Filipe Manana wrote:
> It's not serious if it doesn't have all the proper error handling
> and etc, it's just something for debugging purposes.

I'm slowly trying to remove static checker warnings so that we can
detect real bugs.  People sometimes leave little messages for me in
their code because they know I will review the new warning:

foo = kmalloc();
/* error handling deliberately left out */

It makes me quite annoyed because it's like "Oh no, if we added error
handling that would take 40 extra bytes of memory!  Such a waste!"  But
we could instead use __GFP_NOFAIL, or BUG_ON(!foo).  I have gotten
distracted.  What was the question again?

regards,
dan carpenter



Re: [PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace

2016-04-04 Thread Kai Krakow
On Mon, 4 Apr 2016 14:19:23 +0800, Anand Jain wrote:

> > Otherwise, I find "hot spare" misleading and it should be renamed.  
> 
>   I never thought hot spare would be narrowed down to such specifics.
[...]
>   About the naming: the progs call it 'global spare' (device), and
>   the kernel calls it 'spare'. Sorry, this email thread called it
>   hot spare. I should have paid a little more attention here to
>   maintain consistency.
> 
>   Thanks for the note.

I think that's okay. Maybe man pages / doc should put a note that
there's no copy-back and that the spare takes a permanent replacement
role.

Side note: When I started managing hardware RAIDs a few years back,
"hot spare" wasn't very clear to me, and I didn't understand why there
is a copy-back operation (given that seemingly "useless" extra pass of
IO). But in the long term it keeps the drive arrangement where you
expect it - which is good.

RAID board manufacturers seem to differentiate between those two
replacement strategies - and "hot spare" always involved copy-back for
me: The spare drive automatically returns to its hot spare role. I
learned to like this strategy. It has some advantages.

You could instead assign a replacement drive - the drives then end up
rearranged in the array. This is usually done by just onlining one
spare disk, starting a replace action, then offlining the old drive and
pulling it from the array. It's not "hot" in that sense; the new disk
starts out as "unconfigured good". Not sure if this could be automated -
I did it this way only when the array wasn't equipped with a spare
inside the enclosure and the drive was still in its original box. Other
than that, I always used the hot spare method.

That's why I stumbled across...

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Chris Murphy
On Mon, Apr 4, 2016 at 1:36 PM, Kai Krakow  wrote:

>
>> > I guess the
>> > seed source cannot be mounted or modified...
>>
>> ?
>
> In the following sense: I should disable the automounter and backup job
> for the seed device while I let my data migrate back to main storage in
> the background...

The sprout can be written to just fine by the backup, just understand
that the seed and sprout volume UUIDs are different. Your automounter
is probably looking for the seed's UUID, and that seed can only be
mounted ro. The sprout UUID, however, can be mounted rw.

I would probably skip the automounter. Do the seed setup, mount it,
add all devices you're planning to add, then -o remount,rw,compress...,
and then activate the backup. But maybe your backup is also looking
for a UUID? If so, that needs to be updated first. Once the balance
with -dconvert=raid1 and -mconvert=raid1 is finished, you can remove
the seed device. Now might also be a good time to give the raid1 a new
label; I think it inherits the label of the seed, but I'm not certain
of this.
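
Put together as a rough command sketch (device names, the compress
option and the label are placeholders):

  btrfstune -S 1 /dev/sdd                  # turn the backup into a seed
  mount /dev/sdd /mnt                      # seeds mount read-only
  btrfs device add /dev/sda /dev/sdb /dev/sdc /mnt
  mount -o remount,rw /mnt                 # the sprout is now writable
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
  btrfs device remove /dev/sdd /mnt        # drop the seed when done
  btrfs filesystem label /mnt mylabel      # optionally set a new label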


> My intention is to fully use my system while btrfs migrates the data
> from seed to main storage. Afterwards, I'd like to continue using the
> seed device for backups.
>
> I'd probably do the following:
>
> 1. create btrfs pool, attach seed

I don't understand that step in terms of commands. Sprouts are made
with btrfs dev add, not with mkfs. There is no pool creation. You make
a seed. You mount it. Add devices to it. Then remount it.


> 2. recreate my original subvolume structure by snapshotting the backup
>scratch area multiple times into each subvolume
> 3. rearrange the files in each subvolume to match their intended use by
>using rm and mv
> 4. reboot into full system
> 4. remove all left-over snapshots from the seed
> 5. remove (detach) the seed device

You have two 4's.

Anyway, the 2nd 4 is not possible. The seed is ro by definition, so you
can't remove snapshots from the seed. If you remove them from the
mounted rw sprout volume, they're removed from the sprout, not the
seed. If you want them on the sprout, but not on the seed, you need to
delete snapshots only after the seed is a.) removed from the sprout
and b.) made no longer a seed with btrfstune -S 0 and c.) mounted rw.
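
As a sketch, again with hypothetical device names:

  btrfs device remove /dev/sdd /mnt   # (a) separate seed from sprout
  btrfstune -S 0 /dev/sdd             # (b) clear the seed flag
  mount /dev/sdd /mnt2                # (c) now it mounts read-write
  btrfs subvolume delete /mnt2/snap-2016-03-01   # drop old snapshots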




-- 
Chris Murphy


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Kai Krakow
On Sun, 3 Apr 2016 18:51:07 -0600, Chris Murphy wrote:

> > BTW: Is it possible to use my backup drive (it's btrfs single-data
> > dup-metadata, single device) as a seed device for my newly created
> > btrfs pool (raid0-data, raid1-metadata, three devices)?  
> 
> Yes.
> 
> I just tried doing the conversion to raid1 before and after seed
> removal, but with the small amount of data (4GiB) I can't tell a
> difference. It seems like -dconvert=raid1 with seed still connected
> makes two rw copies (i.e. there's a ro copy which is the original, and
> then two rw copies on 2 of the 3 devices I added all at the same time
> to the seed), and the 'btrfs dev remove' command to remove the seed
> happened immediately, suggested the prior balances had already
> migrated copies off the seed. This may or may not be optimal for your
> case.
> 
> Two gotchas.
> 
> I ran into this bug:
> btrfs fi usage crash when volume contains seed device
> https://bugzilla.kernel.org/show_bug.cgi?id=115851
> 
> And there is a phantom single chunk on one of the new rw devices that
> was added. Data,single: Size:1.00GiB, Used:0.00B
>/dev/dm-8   1.00GiB
> 
> It's still there after the -dconvert=raid1 and separate -mconvert=raid1
> and after seed device removal. A balance start without filters removes
> it, chances are had I used -dconvert=raid1,soft it would have vanished
> also but I didn't retest for that.

Good to know, thanks.
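
For reference, a sketch of that cleanup balance (mount point
hypothetical); the "soft" filter only touches chunks that are not yet
raid1, which should sweep up the leftover single chunk:

  btrfs balance start -dconvert=raid1,soft /mnt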

> > I guess the
> > seed source cannot be mounted or modified...  
> 
> ?

In the following sense: I should disable the automounter and backup job
for the seed device while I let my data migrate back to main storage in
the background...

My intention is to fully use my system while btrfs migrates the data
from seed to main storage. Afterwards, I'd like to continue using the
seed device for backups.

I'd probably do the following:

1. create btrfs pool, attach seed
2. recreate my original subvolume structure by snapshotting the backup
   scratch area multiple times into each subvolume
3. rearrange the files in each subvolume to match their intended use by
   using rm and mv
4. reboot into full system
4. remove all left-over snapshots from the seed
5. remove (detach) the seed device
6. rebalance
7. switch bcache to write-back mode (or attach bcache only now)


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Global hotspare functionality

2016-04-04 Thread Yauhen Kharuzhy
2016-04-01 18:15 GMT-07:00 Anand Jain :
 Issue 2.
 At start of autoreplacing a drive by hotspare, the kernel crashes in
 transaction handling code (inside of btrfs_commit_transaction() called
 by the autoreplace initiating routines). I 'fixed' this by removing
 the closing of the bdev in btrfs_close_one_device_dont_free(), see

 https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
 (oops text is attached also). The bdev is closed after replacing by
 btrfs_dev_replace_finishing(), so this is safe but doesn't seem to be
 the right way.
>>>
>>>
>>>   I have sent out V2. I don't see that issue with this,
>>>   could you pls try ?
>>
>>
>> Yes, it reproduced on a v4.4.5 kernel. I will try with the current
>> 'for-linus-4.6' Chris tree soon.
>>
>> To emulate a drive failure, I disconnect the drive in VirtualBox, so the
>> bdev can be freed by the kernel after all references to it are released.
>
>
>   So far the raid group profile would adapt to a lower suitable
>   group profile when a device is missing/failed. This appears not to
>   be happening with RAID56, OR there is stale IO which wasn't
>   flushed out. Anyway, to have this fixed I am moving the patch
>      btrfs: introduce device dynamic state transition to offline or failed
>   to the top in v3 for any potential changes.
>   But first we need a reliable test case, or a very carefully
>   crafted test case which can create this situation.
>
>   Below is the dm-error setup that I am using for testing, which
>   apparently doesn't reproduce this issue. Could you please try on v3?
>   (Please note the device names are hard-coded in the test script,
>   sorry about that.) This would eventually be an fstests script.

Hi,

I have reproduced this oops with the attached script. I don't use any
dm layer, but just detach the drive at the SCSI layer as xfstests does
(the device management functions were copy-pasted from it).
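
The detach boils down to the standard sysfs knobs; a minimal sketch
(device and host numbers are made up):

  # yank the disk out from under the filesystem at the SCSI layer
  echo 1 > /sys/block/sdc/device/delete
  # later, bring it back by rescanning the SCSI host
  echo "- - -" > /sys/class/scsi_host/host2/scan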


test-autoreplace2-mainline.sh
Description: Bourne shell script


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-04 Thread Kai Krakow
On Mon, 4 Apr 2016 04:34:54 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> Meanwhile, putting bcache into write-around mode, so it makes no
> further changes to the ssd and only uses it for reads, is probably
> wise, and should help limit further damage.  Tho if in that mode
> bcache still does writeback of existing dirty and cached data to the
> backing store, some further damage could occur from that.  But I
> don't know enough about bcache to know what its behavior and level of
> available configuration in that regard actually are.  As long as it's
> not trying to write anything from the ssd to the backing store, I
> think further damage should be very limited.

bcache has 0 dirty data most of the time for me - even in writeback
mode. It does write back during idle time and at a reduced rate;
usually that finishes within a few minutes.

Switching the cache to write-around initiates instant write-back of all
dirty data, so within seconds it goes down to zero and the cache
becomes detachable.
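
For reference, a sketch of where to watch and flip this (the bcache0
device name is an assumption):

  # dirty data that exists only on the cache device
  cat /sys/block/bcache0/bcache/dirty_data
  # switch to write-around; remaining dirty data gets flushed
  echo writearound > /sys/block/bcache0/bcache/cache_mode
  # once dirty_data reads 0, the cache can be detached
  echo 1 > /sys/block/bcache0/bcache/detach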

I'll go test the soon-to-die SSD as soon as it is replaced. I think
it's still far from failing with bitrot. It was overprovisioned by 30%
most of the time, with the spare space trimmed. It certainly should
have a lot of sectors left for wear levelling. In addition, smartctl
shows no sector errors at all - except for one attribute:
raw_read_error_rate. I'm not sure what all those attributes tell me,
but that one I'm also seeing on hard disks which show absolutely no
data damage.

In fact, I see those counters for my hard disks, too. But dd to
/dev/null of the complete raw hard disk shows no sector errors. It
seems good. But well, putting two and two together: I currently see
data damage. But I guess that's unrelated.

Is there some documentation somewhere on what each of those attributes
technically means, and how to read the raw and threshold values?

I'm also seeing multi_zone_error_rate on my spinning rust.

According to the smartctl health check and the smartctl extended
selftest, there are no problems at all - and the smart error log is
empty. There has never been an ATA error in dmesg... no relocated
sectors... From my naive view the drives still look good.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-04 Thread David Sterba
On Fri, Mar 25, 2016 at 09:38:50AM +0800, Qu Wenruo wrote:
> > Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
> > key type. As this is the second user of that item, there's no precendent
> > how to select the subtype. Right now 0 is for the dev stats item, but
> > I'd like to leave some space between them, so it should be 256 at best.
> > The space is 64bit so there's enough room but this also means defining
> > the on-disk format.
> 
> After checking BTRFS_PERSISTENT_ITEM_KEY, it seems that its value is
> larger than the current DEDUPE_BYTENR/HASH_ITEM_KEY, and given the
> objectid of DEDUPE_HASH_ITEM_KEY, it won't be the first item of the tree.
> 
> Although that's not a big problem, for a user using debug-tree it
> would be quite annoying to find it located among tons of other hashes.

You can alternatively store it in the tree_root, but I don't know how
frequently it's supposed to be changed.

> So personally, if using PERSISTENT_ITEM_KEY, I'd at least prefer to keep
> the objectid at 0, and move DEDUPE_BYTENR/HASH_ITEM_KEY to a higher value,
> to ensure the dedupe status is the first item of the dedupe tree.

0 is unfortunately taken by BTRFS_DEV_STATS_OBJECTID, but I don't see a
problem with the ordering. DEDUPE_BYTENR/HASH_ITEM_KEY store a large
number in the objectid: either part of a hash, which is unlikely to be
almost all zeros, or a bytenr, which will be larger than 1MB.
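
To illustrate the ordering argument: btrfs keys sort as (objectid,
type, offset), so a status item at objectid 0 sorts ahead of the hash
items. The comments below are a sketch, not the patchset's actual key
values:

  /* keys sort lexicographically as (objectid, type, offset) */
  struct btrfs_key {
          __u64 objectid;
          __u8 type;
          __u64 offset;
  };

  /* status: (0, BTRFS_PERSISTENT_ITEM_KEY, <subtype>)       */
  /* hashes: (<hash tail or bytenr>, DEDUPE_*_ITEM_KEY, ...) */
  /* a bytenr > 1MB or a non-zero hash tail keeps the hash   */
  /* items well above objectid 0                             */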

>  4) Ioctl interface with persist dedup status
> >>>
> >>> I'd like to see the ioctl specified in more detail. So far there's
> >>> enable, disable and status. I'd expect some way to control the in-memory
> >>> limits, let it "forget" current hash cache, specify the dedupe chunk
> >>> size, maybe sync of the in-memory hash cache to disk.
> >>
> >> So current and planned ioctl should be the following, with some details
> >> related to your in-memory limit control concerns.
> >>
> >> 1) Enable
> >>  Enable dedupe if it's not enabled already. (disabled -> enabled)
> >
> > Ok, so it should also take a parameter saying which backend is about to be
> > enabled.
> 
> It already has.
> It also has limit_nr and limit_mem parameters for the in-memory backend.
> 
> >
> >>  Or change current dedupe setting to another. (re-configure)
> >
> > Doing that in 'enable' sounds confusing, any changes belong to a
> > separate command.
> 
> This depends on the point of view.
> 
> For the "enable/config/disable" case, it will introduce a state machine
> for the end user.

Yes, that's exacly my point.

> Personally, I don't want a state machine for the end user. Yes, I also
> hate merging the play and pause buttons together on a music player.

I don't see how this reference is relevant; we're not designing a music player.

> With a state machine, the user must ensure that dedupe is enabled before
> doing any configuration.

For user convenience we can copy the configuration options to the dedup
enable subcommand, but it will still do separate enable and configure
ioctl calls.
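
A hypothetical sketch of what such a split could look like - the names
and fields below are illustrative only, not the patchset's actual ABI:

  #include <linux/types.h>

  /* illustrative only - not the real btrfs dedupe ioctl ABI */
  struct dedupe_args {
          __u16 backend;    /* in-memory or on-disk backend */
          __u64 blocksize;  /* dedupe chunk size */
          __u64 limit_nr;   /* in-memory: max number of hashes */
          __u64 limit_mem;  /* in-memory: max memory for hashes */
  };

  /* one user-facing subcommand, two separate ioctl calls:
   *   ioctl(fd, DEDUPE_ENABLE, NULL);
   *   ioctl(fd, DEDUPE_CONFIGURE, &args);
   */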

> For me, the user only needs to care about the result of the operation.
> The user can configure dedupe to their needs without needing to know
> the previous settings. From this point of view, "Enable/Disable" is
> much easier than "Enable/Config/Disable".

Getting the usability right is hard and that's why we're having this
discussion. What suits you does not suit others; we have different
habits, expectations, and there are existing usage patterns. We'd better
stick to something that's not too surprising yet still flexible enough
to cover broad needs. I'm leaving this open, but I strongly disagree
with the current interface proposal.

> >>  For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
> >>  will disable dedupe(dropping all hash) and then enable with new
> >>  setting.
> >>
> >>  For in-memory backend, if only limit is different from previous
> >>  setting, limit can be changed on the fly without dropping any hash.
> >
> > This is obviously misplaced in 'enable'.
> 
> Then changing 'enable' to 'configure', or some other proper name,
> would be better.
> 
> The point is, the user only needs to care about what they want to do,
> not about the previous setup.
> 
> >
> >> 2) Disable
> >>  Disable will drop all hash and delete the dedupe tree if it exists.
> >>  Imply a full sync_fs().
> >
> > That is again combining too many things into one. Say I want to disable
> > deduplication and want to enable it later. And not lose the whole state
> > between that. Not to speak of deleting the dedup tree.
> >
> > IOW, deleting the tree belongs to a separate command, though in the
> > userspace tools it could be done in one command, but we're talking about
> > the kernel ioctls now.
> >
> > I'm not sure if the sync is required, but it's acceptable for first
> > implementation.
> 
> The design is just to reduce complexity.
> If we want to keep the hashes but disable dedupe, it would make dedupe
> only handle extent removal, but ignore any newly incoming writes.
> 
> It will introduce a new state for dedupe, other than current s

Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread Holger Hoffstätte
On 04/04/16 18:02, Filipe Manana wrote:
> I use this function frequently during development, and there's a good
> reason to use it instead of the user space tool btrfs-debug-tree.

Good to know, that's why I asked. Printing unwritten extents makes sense.

-h



Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread Filipe Manana
On Mon, Apr 4, 2016 at 4:54 PM, Holger Hoffstätte wrote:
> On 04/04/16 15:56, David Sterba wrote:
>> On Fri, Mar 25, 2016 at 03:53:17PM +0100, Holger Hoffstätte wrote:
>>> Dan Carpenter's static checker recently found missing IS_ERR handling
>>> in print-tree.c:btrfs_print_tree(). While looking into this I found that
>>> this function is no longer called anywhere and was moved to btrfs-progs
>>> long ago. It can simply be removed.
>>
>> I'm not sure, the function could be used for debugging, and it's hard to
>
> ..but is it? So far nobody has complained.

I will complain.
I use this function frequently during development, and there's a good
reason to use it instead of the user space tool btrfs-debug-tree.

>
>> say if we'll ever need it.  Printing the whole tree to the system log
>> would produce a lot of text so some manual filtering would be required,
>> the function could serve as a template.
>
> The original problem of missing error handling from btrfs_read_tree_block()
> remains as well. I don't remember if that also was true for the btrfs-progs
> counterpart, but in any case I didn't really know what to do there.
> Print an error? Silently ignore the stripe? Abort? When I realized that the
> function was not called anywhere, deleting it seemed more effective.
>
> Under what circumstances would the in-kernel function be more
> practical or useful than the userland tool?

The userland tool requires the btree nodes to be on disk. With the
in-kernel function we can print nodes that are not yet on disk, which
is very useful during development.

So no, we should not delete it in my opinion. It's not serious if it
doesn't have all the proper error handling and so on, it's just
something for debugging purposes.

> It does the same, won't disturb
> or wedge the kernel further, is up-to-date and can be scripted.
> I agree that in-place filtering (while iterating) would be nice to have,
> but that's also a whole different problem and would IMHO also be better
> suited for userland.
>
> When in doubt cut it out.

When in doubt leave it alone.

>
> Holger
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread Holger Hoffstätte
On 04/04/16 15:56, David Sterba wrote:
> On Fri, Mar 25, 2016 at 03:53:17PM +0100, Holger Hoffstätte wrote:
>> Dan Carpenter's static checker recently found missing IS_ERR handling
>> in print-tree.c:btrfs_print_tree(). While looking into this I found that
>> this function is no longer called anywhere and was moved to btrfs-progs
>> long ago. It can simply be removed.
> 
> I'm not sure, the function could be used for debugging, and it's hard to

..but is it? So far nobody has complained.

> say if we'll ever need it.  Printing the whole tree to the system log
> would produce a lot of text so some manual filtering would be required,
> the function could serve as a template.

The original problem of missing error handling from btrfs_read_tree_block()
remains as well. I don't remember if that was also true for the btrfs-progs
counterpart, but in any case I didn't really know what to do there.
Print an error? Silently ignore the stripe? Abort? When I realized that
the function was not called anywhere, deleting it seemed more effective.

Under what circumstances would the in-kernel function be more
practical or useful than the userland tool? It does the same, won't disturb
or wedge the kernel further, is up-to-date and can be scripted.
I agree that in-place filtering (while iterating) would be nice to have,
but that's also a whole different problem and would IMHO also be better
suited for userland.

When in doubt cut it out.

Holger



[PULL] Misc fixes for 4.6, part 2

2016-04-04 Thread David Sterba
Hi,

please pull the following patches for 4.6. They fix some user-visible
problems and improve error handling, and there are two debugging
enhancements. Thanks.


The following changes since commit 232cad8413a0bfbd25f11cc19fd13dfd85e1d8ad:

  Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 (2016-03-24 17:36:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git misc-4.6

for you to fetch changes up to 7ccefb98ce3e5c4493cd213cd03714b7149cf0cb:

  btrfs: Reset IO error counters before start of device replacing (2016-04-04 16:29:22 +0200)


David Sterba (1):
  btrfs: fallback to vmalloc in btrfs_compare_tree

Davide Italiano (1):
  Btrfs: Improve FL_KEEP_SIZE handling in fallocate

Josef Bacik (1):
  Btrfs: don't use src fd for printk

Liu Bo (1):
  Btrfs: fix invalid reference in replace_path

Mark Fasheh (2):
  btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
  btrfs: Add qgroup tracing

Qu Wenruo (1):
  btrfs: Output more info for enospc_debug mount option

Yauhen Kharuzhy (1):
  btrfs: Reset IO error counters before start of device replacing

 fs/btrfs/ctree.c | 12 --
 fs/btrfs/dev-replace.c   |  2 +
 fs/btrfs/extent-tree.c   | 21 ++-
 fs/btrfs/file.c  |  9 +++--
 fs/btrfs/ioctl.c |  2 +-
 fs/btrfs/qgroup.c| 63 ---
 fs/btrfs/relocation.c|  1 +
 include/trace/events/btrfs.h | 89 +++-
 8 files changed, 166 insertions(+), 33 deletions(-)


Re: [PATCH v4] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread Luis de Bethencourt
On 04/04/16 15:45, David Sterba wrote:
> On Mon, Apr 04, 2016 at 03:31:22PM +0100, Luis de Bethencourt wrote:
>> Correct a typo in the chunk_mutex name to make it grepable.
>>
>> Since it is better to fix several typos at once, fix two more in the
>> same file.
>>
>> Signed-off-by: Luis de Bethencourt 
> 
> Now the subject does not match the patch contents, but I can fix that so
> you don't have to resend it again.
> 

Sorry David. That was a poor decision on my part: I kept the subject
because I considered the typo in btrfs_statfs() the core fix and the
other two as appended corrections.

Thank you for fixing it. I understand what you mean.

On an unrelated note, do you think this bug would be a good one for me
to tackle?
https://bugzilla.kernel.org/show_bug.cgi?id=115851

Thanks,
Luis


Re: [PATCH v4] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread David Sterba
On Mon, Apr 04, 2016 at 03:31:22PM +0100, Luis de Bethencourt wrote:
> Correct a typo in the chunk_mutex name to make it grepable.
> 
> Since it is better to fix several typos at once, fix two more in the
> same file.
> 
> Signed-off-by: Luis de Bethencourt 

Now the subject does not match the patch contents, but I can fix that so
you don't have to resend it again.


[PATCH v4] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread Luis de Bethencourt
Correct a typo in the chunk_mutex name to make it grepable.

Since it is better to fix several typos at once, fix two more in the
same file.

Signed-off-by: Luis de Bethencourt 
---

Hi,

Sorry for sending again. Previous version had a line over 80 characters.

Explanation from previous patch:
David recommended I look around the rest of the file for other typos to fix.

These two more are all I see in the rest of the file without nitpicking.

Thanks and apologies for sending v3 without thoroughly checking,
Luis


 fs/btrfs/super.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7e766ffc..bc060cf 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1484,10 +1484,10 @@ static int setup_security_options(struct btrfs_fs_info *fs_info,
memcpy(&fs_info->security_opts, sec_opts, sizeof(*sec_opts));
} else {
/*
-* Since SELinux(the only one supports security_mnt_opts) does
-* NOT support changing context during remount/mount same sb,
-* This must be the same or part of the same security options,
-* just free it.
+* Since SELinux (the only one supporting security_mnt_opts)
+* does NOT support changing context during remount/mount of
+* the same sb, this must be the same or part of the same
+* security options, just free it.
 */
security_free_mnt_opts(sec_opts);
}
@@ -1665,8 +1665,8 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 unsigned long old_opts)
 {
/*
-* We need cleanup all defragable inodes if the autodefragment is
-* close or the fs is R/O.
+* We need to cleanup all defragable inodes if the autodefragment is
+* close or the filesystem is read only.
 */
if (btrfs_raw_test_opt(old_opts, AUTO_DEFRAG) &&
(!btrfs_raw_test_opt(fs_info->mount_opt, AUTO_DEFRAG) ||
@@ -2050,7 +2050,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
int mixed = 0;
 
/*
-* holding chunk_muext to avoid allocating new chunks, holding
+* holding chunk_mutex to avoid allocating new chunks, holding
 * device_list_mutex to avoid the device being removed
 */
rcu_read_lock();
-- 
2.6.4



[PATCH v3] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread Luis de Bethencourt
Correct a typo in the chunk_mutex name to make it grepable.

Since it is better to fix several typos at once, fix two more in the
same file.

Signed-off-by: Luis de Bethencourt 
---

Hi,

David recommended I look around the rest of the file for other typos to fix.

These two more are all I see in the rest of the file without nitpicking.

Thanks,
Luis

 fs/btrfs/super.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7e766ffc..73bdfd4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1484,9 +1484,9 @@ static int setup_security_options(struct btrfs_fs_info *fs_info,
memcpy(&fs_info->security_opts, sec_opts, sizeof(*sec_opts));
} else {
/*
-* Since SELinux(the only one supports security_mnt_opts) does
-* NOT support changing context during remount/mount same sb,
-* This must be the same or part of the same security options,
+* Since SELinux (the only one supporting security_mnt_opts) does
+* NOT support changing context during remount/mount of the same sb,
+* this must be the same or part of the same security options,
 * just free it.
 */
security_free_mnt_opts(sec_opts);
@@ -1665,8 +1665,8 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 unsigned long old_opts)
 {
/*
-* We need cleanup all defragable inodes if the autodefragment is
-* close or the fs is R/O.
+* We need to cleanup all defragable inodes if the autodefragment is
+* close or the filesystem is read only.
 */
if (btrfs_raw_test_opt(old_opts, AUTO_DEFRAG) &&
(!btrfs_raw_test_opt(fs_info->mount_opt, AUTO_DEFRAG) ||
@@ -2050,7 +2050,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
int mixed = 0;
 
/*
-* holding chunk_muext to avoid allocating new chunks, holding
+* holding chunk_mutex to avoid allocating new chunks, holding
 * device_list_mutex to avoid the device being removed
 */
rcu_read_lock();
-- 
2.6.4



Re: [PATCH] delete obsolete function btrfs_print_tree()

2016-04-04 Thread David Sterba
On Fri, Mar 25, 2016 at 03:53:17PM +0100, Holger Hoffstätte wrote:
> Dan Carpenter's static checker recently found missing IS_ERR handling
> in print-tree.c:btrfs_print_tree(). While looking into this I found that
> this function is no longer called anywhere and was moved to btrfs-progs
> long ago. It can simply be removed.

I'm not sure, the function could be used for debugging, and it's hard to
say if we'll ever need it.  Printing the whole tree to the system log
would produce a lot of text so some manual filtering would be required,
the function could serve as a template.

The function is not so big that removing it would save many bytes, but
putting it under the debug config would help a bit.


Re: [PATCH v3 01/22] btrfs-progs: convert: Introduce functions to read used space

2016-04-04 Thread David Sterba
On Fri, Jan 29, 2016 at 01:03:11PM +0800, Qu Wenruo wrote:
> Before we do real convert, we need to read and build up used space cache
> tree for later data/meta separate chunk layout.
> 
> This patch will iterate all used blocks in ext2 filesystem and record it
> into cctx->used cache tree, for later use.
> 
> This provides the very basic of later btrfs-convert rework.
> 
> Signed-off-by: Qu Wenruo 
> Signed-off-by: David Sterba 
> ---
>  btrfs-convert.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/btrfs-convert.c b/btrfs-convert.c
> index 4baa68e..65841bd 100644
> --- a/btrfs-convert.c
> +++ b/btrfs-convert.c
> @@ -81,6 +81,7 @@ struct btrfs_convert_context;
>  struct btrfs_convert_operations {
>   const char *name;
>   int (*open_fs)(struct btrfs_convert_context *cctx, const char *devname);
> + int (*read_used_space)(struct btrfs_convert_context *cctx);
>   int (*alloc_block)(struct btrfs_convert_context *cctx, u64 goal,
>  u64 *block_ret);
>   int (*alloc_block_range)(struct btrfs_convert_context *cctx, u64 goal,
> @@ -230,6 +231,73 @@ fail:
>   return -1;
>  }
>  
> +static int __ext2_add_one_block(ext2_filsys fs, char *bitmap,
> + unsigned long group_nr, struct cache_tree *used)
> +{
> + unsigned long offset;
> + unsigned i;
> + int ret = 0;
> +
> + offset = fs->super->s_first_data_block;
> + offset /= EXT2FS_CLUSTER_RATIO(fs);

This macro does not exist on my reference host for old distros. The
e2fsprogs version is 1.41.14 and I'd like to keep the compatibility at
least at that level.

Clustering was added in 1.42, but can we add some compatibility layer
that will work on both versions?
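
One possible shim, assuming that pre-1.42 e2fsprogs simply has no
cluster support, so the ratio is effectively 1 (untested sketch):

  /* e2fsprogs < 1.42 predates bigalloc/clusters; treat the
   * cluster ratio as 1 there so the code builds on both */
  #ifndef EXT2FS_CLUSTER_RATIO
  #define EXT2FS_CLUSTER_RATIO(fs)  (1)
  #endif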


Re: [PATCH] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread David Sterba
On Mon, Apr 04, 2016 at 11:13:57AM +0100, Luis de Bethencourt wrote:
> Correct a typo in the chunk_mutex name.
> 
> Signed-off-by: Luis de Bethencourt 
> ---
> 
> Hi,
> 
> I noticed this typo while fixing bug 114281 [0]. If this type of fix is
> not welcome, I could squash it into the patch for that bug.

> -  * holding chunk_muext to avoid allocating new chunks, holding
> +  * holding chunk_mutex to avoid allocating new chunks, holding

In this case it's the name of a mutex, so fixing it makes sense, e.g.
when one is grepping for it. I'm not against fixing typos in comments
in general; it's usually better to fix several at once, e.g. per file,
see bb7ab3b92e46da06b580c6f83abe7894dc449cca. If you find more, then
please send an updated patch; I'll queue this one into cleanups and can
replace it later.


Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-04 Thread David Sterba
On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> > After another look, why don't we use nodesize directly? Or stripesize
> > where applies. With max_size == 0 the test does not make sense, we ought
> > to know the alignment.
> >
> >
> Yes, my first thought is also to use nodesize directly, which should
> always be correct.
> 
> But the problem is, the related function call stack doesn't have any 
> member to reach btrfs_root or btrfs_fs_info.

JFYI, there's global_info available, so it's not necessary to pass
fs_info down the callstacks.
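
A sketch of what that looks like, assuming the check code's
global_info pointer (the nodesize accessor below is illustrative, not
exact):

  extern struct btrfs_fs_info *global_info;

  /* alignment check without threading fs_info through the stack */
  static int is_node_aligned(u64 start)
  {
          return (start % global_info->tree_root->nodesize) == 0;
  }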


[PATCH v2] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread Luis de Bethencourt
Correct a typo in the chunk_mutex name.

Signed-off-by: Luis de Bethencourt 
---

Hi,

I noticed this typo while fixing bug 114281 [0]. Sending a second
version because the first one didn't apply cleanly on top of the latest
changes in the 'for-next' branch.

Thanks,
Luis

[0] https://bugzilla.kernel.org/show_bug.cgi?id=114281

 fs/btrfs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7e766ffc..9c79337 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2050,7 +2050,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
int mixed = 0;
 
/*
-* holding chunk_muext to avoid allocating new chunks, holding
+* holding chunk_mutex to avoid allocating new chunks, holding
 * device_list_mutex to avoid the device being removed
 */
rcu_read_lock();
-- 
2.6.4



[PATCH] btrfs: fix typo in btrfs_statfs()

2016-04-04 Thread Luis de Bethencourt
Correct a typo in the chunk_mutex name.

Signed-off-by: Luis de Bethencourt 
---

Hi,

I noticed this typo while fixing bug 114281 [0]. If this type of fix is
not welcome, I could squash it into the patch for that bug.

Thanks,
Luis

[0] https://bugzilla.kernel.org/show_bug.cgi?id=114281

 fs/btrfs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a8e049a..86364b7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2028,7 +2028,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
u64 thresh = 0;
 
/*
-* holding chunk_muext to avoid allocating new chunks, holding
+* holding chunk_mutex to avoid allocating new chunks, holding
 * device_list_mutex to avoid the device being removed
 */
rcu_read_lock();
-- 
2.6.4



Re: csum failed on nonexistent inode

2016-04-04 Thread Henk Slager
On Mon, Apr 4, 2016 at 9:50 AM, Jérôme Poulin  wrote:
> Hi all,
>
> I have a BTRFS on disks running in RAID10 meta+data; one of the disks
> has been going bad and scrub was showing 18 uncorrectable errors
> (which is weird in RAID10). I tried using --repair-sector with hdparm
> even though it shouldn't be necessary, since BTRFS would overwrite the
> sector. The repair fixed the sector in SMART, but BTRFS was still
> showing 18 uncorrectable errors.
>
> I finally decided to give up this opportunity to test the error
> correction properties of BTRFS (this is a home system, backed up) and
> installed a brand new disk in the machine. After running btrfs
> replace, everything was fine; I decided to run btrfs scrub again, and I
> still have the same 18 uncorrectable errors.

You might want this patch:
http://www.spinics.net/lists/linux-btrfs/msg53552.html

As a workaround, you can reset the counters on the new/healthy device
with:

btrfs device stats [-z] <path>|<device>

> Later on, since I had a new disk with more space, I decided to run a
> balance to free up the new space, but the balance has stopped with csum
> errors too. Here is the output of multiple programs.
>
> How is it possible to get rid of the referenced csum errors if they do
> not exist? Also, the expected checksum looks suspiciously the same for
> multiple errors. Could it be bad RAM in that case? Can I convince
> BTRFS to update the csum?
>
> # btrfs inspect-internal logical-resolve -v 1809149952 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory
> # btrfs inspect-internal inode-resolve -v 296 /mnt/btrfs/
> ioctl ret=-1, error: No such file or directory
>
>
> dmesg after first bad sector:
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368716288 (dev /dev/dm-42 sector
> 2939136)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368720384 (dev /dev/dm-42 sector
> 2939144)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368724480 (dev /dev/dm-42 sector
> 2939152)
> avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
> error corrected: ino 1 off 655368728576 (dev /dev/dm-42 sector
> 2939160)
>
> dmesg after balance:
> [1738474.444648] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809195008 csum 1515428513 expected csum 2566472073
> [1738474.444649] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809084416 csum 4147641019 expected csum 1755301217
> [1738474.444702] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809199104 csum 1927504681 expected csum 2566472073
> [1738474.444717] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809211392 csum 3086571080 expected csum 2566472073
> [1738474.444917] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809084416 csum 4147641019 expected csum 1755301217
> [1738474.444962] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809195008 csum 1515428513 expected csum 2566472073
> [1738474.444998] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809199104 csum 1927504681 expected csum 2566472073
> [1738474.445034] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809211392 csum 3086571080 expected csum 2566472073
> [1738474.473286] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809149952 csum 3254083717 expected csum 2566472073
> [1738474.473357] BTRFS warning (device dm-40): csum failed ino 296 off
> 1809162240 csum 3157020538 expected csum 2566472073
>
> btrfs check:
> ./btrfs check /dev/mapper/luksbtrfsdata2
> Checking filesystem on /dev/mapper/luksbtrfsdata2
> UUID: 805f6ad7-1188-448d-aee4-8ddeeb70c8a7
> checking extents
> bad metadata [1453741768704, 1453741785088) crossing stripe boundary
> bad metadata [1454487764992, 1454487781376) crossing stripe boundary
> bad metadata [1454828552192, 1454828568576) crossing stripe boundary
> bad metadata [1454879735808, 1454879752192) crossing stripe boundary
> bad metadata [1455087222784, 1455087239168) crossing stripe boundary
> bad metadata [1456269426688, 1456269443072) crossing stripe boundary
> bad metadata [1456273227776, 1456273244160) crossing stripe boundary
> bad metadata [1456404234240, 1456404250624) crossing stripe boundary
> bad metadata [1456418914304, 1456418930688) crossing stripe boundary

Those are false alerts; this patch handles that:
https://patchwork.kernel.org/patch/8706891/

> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 689292505473 bytes used err is 0
> total csum bytes: 660112536
> total tree bytes: 1764098048
> total fs tree bytes: 961921024
> total extent tree bytes: 79331328
> btree space waste bytes: 232774315
> file data blocks allocated: 4148513517568
>  referenced 972284129280
>
> btrfs scrub:
> I don't have the output handy, but the dmesg output was pairs of
> logical blocks, as with the balance, and no errors were corrected.

csum failed on nonexistent inode

2016-04-04 Thread Jérôme Poulin
Hi all,

I have a BTRFS on disks running in RAID10 meta+data; one of the disks
has been going bad and scrub was showing 18 uncorrectable errors
(which is weird in RAID10). I tried using --repair-sector with hdparm
even though it shouldn't be necessary, since BTRFS would overwrite the
sector. The repair fixed the sector in SMART, but BTRFS was still
showing 18 uncorrectable errors.

I finally decided to give up this opportunity to test the error
correction properties of BTRFS (this is a home system, backed up) and
installed a brand new disk in the machine. After running btrfs
replace, everything was fine; I decided to run btrfs scrub again, and I
still have the same 18 uncorrectable errors.

Later on, since I had a new disk with more space, I decided to run a
balance to free up the new space, but the balance has stopped with csum
errors too. Here is the output of multiple programs.

How is it possible to get rid of the referenced csum errors if they do
not exist? Also, the expected checksum looks suspiciously the same for
multiple errors. Could it be bad RAM in that case? Can I convince
BTRFS to update the csum?

# btrfs inspect-internal logical-resolve -v 1809149952 /mnt/btrfs/
ioctl ret=-1, error: No such file or directory
# btrfs inspect-internal inode-resolve -v 296 /mnt/btrfs/
ioctl ret=-1, error: No such file or directory


dmesg after first bad sector:
avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
error corrected: ino 1 off 655368716288 (dev /dev/dm-42 sector
2939136)
avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
error corrected: ino 1 off 655368720384 (dev /dev/dm-42 sector
2939144)
avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
error corrected: ino 1 off 655368724480 (dev /dev/dm-42 sector
2939152)
avr 01 18:29:52 p4.i.ticpu.net kernel: BTRFS info (device dm-43): read
error corrected: ino 1 off 655368728576 (dev /dev/dm-42 sector
2939160)

dmesg after balance:
[1738474.444648] BTRFS warning (device dm-40): csum failed ino 296 off
1809195008 csum 1515428513 expected csum 2566472073
[1738474.444649] BTRFS warning (device dm-40): csum failed ino 296 off
1809084416 csum 4147641019 expected csum 1755301217
[1738474.444702] BTRFS warning (device dm-40): csum failed ino 296 off
1809199104 csum 1927504681 expected csum 2566472073
[1738474.444717] BTRFS warning (device dm-40): csum failed ino 296 off
1809211392 csum 3086571080 expected csum 2566472073
[1738474.444917] BTRFS warning (device dm-40): csum failed ino 296 off
1809084416 csum 4147641019 expected csum 1755301217
[1738474.444962] BTRFS warning (device dm-40): csum failed ino 296 off
1809195008 csum 1515428513 expected csum 2566472073
[1738474.444998] BTRFS warning (device dm-40): csum failed ino 296 off
1809199104 csum 1927504681 expected csum 2566472073
[1738474.445034] BTRFS warning (device dm-40): csum failed ino 296 off
1809211392 csum 3086571080 expected csum 2566472073
[1738474.473286] BTRFS warning (device dm-40): csum failed ino 296 off
1809149952 csum 3254083717 expected csum 2566472073
[1738474.473357] BTRFS warning (device dm-40): csum failed ino 296 off
1809162240 csum 3157020538 expected csum 2566472073

btrfs check:
./btrfs check /dev/mapper/luksbtrfsdata2
Checking filesystem on /dev/mapper/luksbtrfsdata2
UUID: 805f6ad7-1188-448d-aee4-8ddeeb70c8a7
checking extents
bad metadata [1453741768704, 1453741785088) crossing stripe boundary
bad metadata [1454487764992, 1454487781376) crossing stripe boundary
bad metadata [1454828552192, 1454828568576) crossing stripe boundary
bad metadata [1454879735808, 1454879752192) crossing stripe boundary
bad metadata [1455087222784, 1455087239168) crossing stripe boundary
bad metadata [1456269426688, 1456269443072) crossing stripe boundary
bad metadata [1456273227776, 1456273244160) crossing stripe boundary
bad metadata [1456404234240, 1456404250624) crossing stripe boundary
bad metadata [1456418914304, 1456418930688) crossing stripe boundary
checking free space cache
checking fs roots
checking csums
checking root refs
found 689292505473 bytes used err is 0
total csum bytes: 660112536
total tree bytes: 1764098048
total fs tree bytes: 961921024
total extent tree bytes: 79331328
btree space waste bytes: 232774315
file data blocks allocated: 4148513517568
 referenced 972284129280

btrfs scrub:
I don't have the output handy, but the dmesg output was pairs of
logical blocks, as with the balance, and no errors were corrected.