BTRFS - Write Barriers

2015-12-31 Thread fugazzi®
Hi everyone :-)

Just one question for the gurus here. 

I was wondering: if I disable write barriers in btrfs with the mount option 
nobarrier, do I just disable the periodic flushes of the hardware disk cache, 
or do I also disable the ordering of the writes sent to the hard disk?

What I mean is: is it safe to disable write barriers with a UPS that will 
likely keep the hardware powered even in the event of a kernel crash, freeze, 
etc.?

I'm asking because, if the ordering of the writes is no longer guaranteed 
either, I guess it would not be safe to disable write barriers even if the 
possibility of an unexpected power-down of the HD were remote: in the event 
of a crash, the order of the writes would be messed up anyway, and we could 
boot up with a completely corrupted fs.

Thank you very much for your kind answers.
Warm Regards,
Mario
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS - Write Barriers

2015-12-31 Thread Duncan
fugazzi® posted on Thu, 31 Dec 2015 09:01:51 +0000 as excerpted:

> Just one question for the gurus here.
> 
> I was wondering: if I disable write barriers in btrfs with the mount
> option nobarrier I just disable the periodic flushes of the hardware
> disk cache or I'm disabling also the order of the writes directed to the
> hard disk?
> 
> What I mean is: is it safe to disable write barrier with a UPS with
> which I will likely have the hardware always powered even in the event
> of a kernel crash, freeze, etc?
> 
> I'm asking because if also the ordering of the write is no more
> guaranteed I guess it would not be safe to disable write barrier even if
> the possibility of an unexpected power down of the HD was remote because
> in the case of a crash the order of the write would be messed up anyway
> and we could boot up with a completely corrupted fs.

From the wiki mount options page:

https://btrfs.wiki.kernel.org/index.php/Mount_options

>

nobarrier

Do not use device barriers. NOTE: Using this option greatly increases the 
chances of you experiencing data corruption during a system crash or a 
power failure. This means full file-system corruption, and not just 
losing or corrupting data that was being written during the event. 

<

IOW, use at your own risk.

In theory you should be fine if you *NEVER* have a system crash (it's not 
just loss of power!).  But btrfs itself is still "stabilizING, not yet fully 
stable and mature", and can itself crash the system occasionally.  Normally 
that's OK: the atomic-transaction nature of btrfs, and the fact that it 
crashes or at minimum forces itself read-only if it thinks something's 
wrong, will usually save the filesystem from too much damage.  But if it 
gets into that state with nobarrier set, all bets are off, because the 
normally atomic transactions that would either all be there or not be there 
at all are no longer atomic, and who knows /what/ sort of state you'll be in 
when you reboot.
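The ordering dependency described above can be sketched in userspace terms: 
a commit record must never become durable before the data it commits.  The 
Python below is illustrative only (btrfs's real commit path uses block-layer 
flush/FUA requests, not fsync); os.fsync() merely stands in for the barrier:

```python
import os
import struct


def commit(path: str, payload: bytes) -> None:
    """Append a journal-style record: data first, then a commit marker.

    The os.fsync() between the two writes plays the role of a write
    barrier: the COMMIT marker can only reach stable storage after the
    payload has.  With barriers disabled, the disk cache or scheduler
    may persist the marker first, and a crash in between leaves a
    marker pointing at garbage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # length-prefixed payload, little-endian
        os.write(fd, struct.pack("<I", len(payload)) + payload)
        os.fsync(fd)             # "barrier": payload is durable...
        os.write(fd, b"COMMIT")  # ...before the marker is written
        os.fsync(fd)             # marker itself made durable
    finally:
        os.close(fd)
```

Drop the first fsync and a crash between the two writes can leave a COMMIT 
marker for a payload that never reached the platter, which is the userspace 
analogue of what can go wrong under nobarrier.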

So while it's there for the foolish, and perhaps the slightly less 
foolish once the filesystem fully stabilizes and matures, right now, you 
really /are/ playing Russian roulette with your data if you turn it on.

I'd personally recommend staying about as far away from it as you do the 
uninsulated live service mains coming into your building... on the SUPPLY 
side of the voltage stepdown transformer!  And for those that don't do 
that, well, there's Darwin awards.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



BTRFS File trace from superblock using memory dumps

2015-12-31 Thread Ahtisham wani
Respected sir,
I have been researching how btrfs works and manages files for a long time. 
What I want to achieve is to trace the path of a file starting from the 
superblock, then to the root, and so on. The problem is I don't know how to 
do it. I have been using dd and hexdump, but I don't know what to look for 
or where to look. I have been able to see some fragments of the superblock 
at 64K, but I don't know how to use them to trace the tree of tree roots 
node or its object_id. Any help is appreciated. Thanks



Re: BTRFS - Write Barriers

2015-12-31 Thread fugazzi®
Thanks Duncan for your answer.

Yes, I read that part of the wiki, and since I saw the "system crash" part I 
just wanted to be sure, because other filesystems, such as XFS for example, 
usually mention only power loss in connection with barriers.

The fact is that since the barrier logic was removed from the elevator and 
put into the filesystem layer in the form of flush, FUA, etc., I suspected 
that it is no longer safe to disable write barriers even if you have a 
battery-backed HD hardware cache.

It could be that, since the fs no longer issues flush commands to the 
elevator (barrier-disabled case) and the elevator's barrier code has been 
removed, the elevator could now reorder those precious writes too, 
destroying the atomic ordering of the fs and, as a consequence, its 
integrity, even in the case where the HD hardware cache never loses power.

That said, my guess was that it may no longer be safe to disable write 
barriers in any filesystem, regardless of its age and supposed stability.

Without barriers enabled, the interaction between the kernel disk scheduler 
and the filesystem could be unpredictable, because the scheduler would now 
be authorized to reorder any write to the underlying hardware, not only 
those unrelated to the write barriers. Did I understand that correctly?

Regards.

On Thursday, December 31, 2015 9:34:24 AM WET Duncan wrote:
> fugazzi® posted on Thu, 31 Dec 2015 09:01:51 +0000 as excerpted:
> > Just one question for the gurus here.
> > 
> > I was wondering: if I disable write barriers in btrfs with the mount
> > option nobarrier I just disable the periodic flushes of the hardware
> > disk cache or I'm disabling also the order of the writes directed to the
> > hard disk?
> > 
> > What I mean is: is it safe to disable write barrier with a UPS with
> > which I will likely have the hardware always powered even in the event
> > of a kernel crash, freeze, etc?
> > 
> > I'm asking because if also the ordering of the write is no more
> > guaranteed I guess it would not be safe to disable write barrier even if
> > the possibility of an unexpected power down of the HD was remote because
> > in the case of a crash the order of the write would be messed up anyway
> > and we could boot up with a completely corrupted fs.
> 
> From the wiki mount options page:
> 
> https://btrfs.wiki.kernel.org/index.php/Mount_options
> 
> 
> 
> nobarrier
> 
> Do not use device barriers. NOTE: Using this option greatly increases the
> chances of you experiencing data corruption during a system crash or a
> power failure. This means full file-system corruption, and not just
> losing or corrupting data that was being written during the event.
> 
> <
> 
> IOW, use at your own risk.
> 
> In theory you should be fine if you *NEVER* have a system crash (it's not
> just loss of power!), but as btrfs itself is still "stabilizING, not yet
> fully stable and mature", it can itself crash the system occasionally,
> and while that's normally OK as the atomic transaction nature of btrfs
> and the fact that it crashes or at minimum forces itself to read-only if
> it thinks something's wrong will normally save the filesystem from too
> much damage, if it happens to get into that state with nobarrier set,
> then all bets are off, because the normally atomic transactions that
> would either all be there or not be there at all are no longer atomic, and
> who knows /what/ sort of state you'll be in when you reboot.
> 
> So while it's there for the foolish, and perhaps the slightly less
> foolish once the filesystem fully stabilizes and matures, right now, you
> really /are/ playing Russian roulette with your data if you turn it on.
> 
> I'd personally recommend staying about as far away from it as you do the
> uninsulated live service mains coming into your building... on the SUPPLY
> side of the voltage stepdown transformer!  And for those that don't do
> that, well, there's Darwin awards.



Re: BTRFS File trace from superblock using memory dumps

2015-12-31 Thread Hugo Mills
On Thu, Dec 31, 2015 at 10:20:17AM +0000, Ahtisham wani wrote:
> Respected sir,
> I have been researching on how btrfs works and manages files for a long 
> time. What I want to achieve here is to trace a path of a file starting 
> from superblock, then to root and so on. The problem is I dont know how to 
> do it. I have been using dd and hexdump but I dont know what to look for 
> and where to look for. I have been able to see some fragments of 
> superblock at 64K but I dont know how to use it to trace the tree of tree 
> roots node or its object_id. Any help is appreciated. Thanks

   Start here:

   https://btrfs.wiki.kernel.org/index.php/Data_Structures

   This will give you the basic high-level data structures. You can
explore those data structures fairly easily using btrfs-debug-tree.

   After that, you'll mostly have to start reading the code a little.
fs/btrfs/ctree.h in the kernel sources is the place to get all of the
data structures you'll need. Those will tell you the layout of the
data items.

   To get from the superblock to the diagram on the data structures
page, the first thing you'll need to do is read the list of system
chunks at the end of the superblock. Those chunks contain the chunk
tree, which contains the mapping from physical device addresses to
internal (virtual) addresses. Everything else is done in terms of
those virtual addresses. Once you have the chunk tree, you can start
using the other addresses in the superblock to find the tree of tree
roots, and then follow that into the other trees (at which point, you
can start using the data structures page).
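A minimal sketch of the very first step above — pulling the tree pointers 
out of a superblock dumped with dd — might look like the Python below. The 
offsets follow struct btrfs_super_block as laid out in fs/btrfs/ctree.h 
(checksum, fsid, bytenr and flags precede the magic); verify them against 
your kernel's headers before relying on them:

```python
import struct

SUPERBLOCK_OFFSET = 0x10000    # primary copy at 64 KiB into the device
BTRFS_MAGIC = b"_BHRfS_M"      # at offset 0x40 inside the superblock


def parse_superblock(sb: bytes) -> dict:
    """Extract the main tree pointers from a raw superblock buffer.

    All three roots are *logical* addresses; translating them to
    physical device offsets requires the chunk tree, bootstrapped
    from the sys_chunk_array later in the superblock.
    """
    if sb[0x40:0x48] != BTRFS_MAGIC:
        raise ValueError("not a btrfs superblock: %r" % sb[0x40:0x48])
    generation, root, chunk_root, log_root = \
        struct.unpack_from("<4Q", sb, 0x48)
    return {
        "generation": generation,
        "root": root,              # tree of tree roots (logical address)
        "chunk_root": chunk_root,  # chunk tree (logical address)
        "log_root": log_root,      # log tree, 0 if clean
    }

# Typical use against a dd image:
#   with open("disk.img", "rb") as f:
#       f.seek(SUPERBLOCK_OFFSET)
#       info = parse_superblock(f.read(4096))
```

From there, btrfs-debug-tree output is a convenient cross-check that the 
addresses you decoded by hand point at the right nodes.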

   Hugo.

-- 
Hugo Mills | Comic Sans goes into a bar, and the barman says, "We
hugo@... carfax.org.uk | don't serve your type here."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




subvolume capacity

2015-12-31 Thread Xavier Romero
Hello,

I create a subvolume, assign a quota to it, and mount the subvolume; the 
capacity reported for that mount point is not the assigned quota but the 
total btrfs filesystem capacity.
I would rather expect it to report the assigned space (unless no quota is 
assigned, obviously); that would fit perfectly in a multi-tenant 
environment, and also when using btrfs for virtual machine datastores, etc. 
Otherwise it will be really confusing, and sometimes problematic, when 
exporting subvolumes through NFS/CIFS/etc.
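For reference, the capacity those consumers see comes from the 
statfs()/statvfs() syscall, which btrfs answers with whole-filesystem 
totals; a quick way to observe the behaviour described above (the helper 
name is hypothetical, just a sketch):

```python
import os


def reported_capacity(path: str) -> int:
    """Bytes of capacity that df, NFS and CIFS will report for `path`.

    They all go through statfs()/statvfs(); on btrfs this returns
    whole-filesystem figures, ignoring any qgroup limit that may be
    set on the subvolume mounted there.
    """
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize
```

Comparing its output for a quota-limited subvolume against `btrfs qgroup 
show` makes the mismatch visible.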

Best regards,
Xavier Romero


Re: quota rescan hangs

2015-12-31 Thread Duncan
Xavier Romero posted on Thu, 31 Dec 2015 10:09:22 +0000 as excerpted:

> Using BTRFS on CentOS 7, I get the filesystem hung by running btrfs
> quota rescan  /mnt/btrfs/
> 
> After that I could not access to the filesystem anymore until system
> restart.

> Additional Info:
> 
> [root@nex-dstrg-ctrl-1 ~]# modinfo btrfs
> filename: 
> /lib/modules/3.10.0-327.3.1.el7.x86_64/kernel/fs/btrfs/btrfs.ko

> btrfs-progs v3.19.1

> I'm just starting with BTRFS so I could be doing something wrong! Any
> ideas?

[OK, the below lays it on kinda thick.  I'm not saying your choice of 
centos /or/ of btrfs is bad, only that if your interest is in something 
so old and stal^hble as centos with its 3.10 kernel, then you really 
should reconsider whether btrfs is an appropriate choice for you, because 
it seems to be a seriously bad mismatch.  The answer to your actual 
question is below that discussion.]

First thing wrong, here's a quote from the Stability Status section right 
on the front page of the btrfs wiki:

https://btrfs.wiki.kernel.org/index.php/Main_Page

>

The Btrfs code base is under heavy development. Every effort is being 
made to keep it stable and fast. Due to the fast development speed, the 
state of development of the filesystem improves noticeably with every new 
Linux version, so it's recommended to run the most modern kernel possible.

<

And here's what the getting started page says:

https://btrfs.wiki.kernel.org/index.php/Getting_started

>

btrfs is a fast-moving target. There are typically a great many bug fixes 
and enhancements between one kernel release and the next. Therefore:
If you have btrfs filesystems, run the latest kernel.

If you are running a kernel two or more versions behind the latest one 
available from kernel.org, the first thing you will be asked to when you 
report a problem is to upgrade to the latest kernel. Some distributions 
keep backports of recent kernels to earlier releases -- see the page 
below for details.

Having the latest user-space tools is also useful, as they contain 
additional features and tools which may be of use in debugging or 
recovering your filesystem if something goes wrong. 

<

Centos, running kernel 3.10, is anything *BUT* "the latest kernel".  With 
five release cycles a year, 4.0 being 10 release cycles beyond 3.10, and 
4.4 very near release, 3.10 is now nearing three years old!

Further, btrfs didn't even have the experimental sticker peeled off until 
IIRC 3.12 or so, so the btrfs in 3.10 isn't just nearly three years 
outdated, it's also still experimental!

OK, so we know that the enterprise distros support btrfs and backport 
stuff, but only they know what they backported, while we're focused on 
the mainline kernel here on this list.  So while the upstream btrfs and 
list recommendation is keep current, you're running what for all we know 
is a three year old experimental btrfs, with who knows what backports?  
If you want support for that, you really should be asking the distro that 
says they support it, not the upstream that says it's now ancient history 
from when the filesystem was still experimental.

Meanwhile, from here, running the still under heavy development 
"stabilizing but not yet entirely stable or mature" btrfs, on an 
enterprise distro that runs years old versions... seems there's some sort 
of bad-match incompatibility there.  If your emphasis is that old and 
stable, you really should reconsider whether the still under heavy 
development btrfs is an appropriate choice for you, or if a filesystem 
more suitably stable is more in keeping with your stability needs.  One 
or the other would seem to be the wrong choice, as they're at rather 
opposite ends of the spectrum and don't work well together.



OK, on to the specific question.  Tho the devs have been and are working 
very hard on quotas, to date (4.3 release kernel) they've never worked 
entirely correctly or reliably in btrfs.  My recommendation has always 
been: unless you're working with the devs on the latest version to help 
test, find and fix problems (and if you are, thanks!), you either need 
quota functionality or you don't.  Since quotas have never worked reliably 
in btrfs, if you need that functionality, you really need to be on a 
filesystem where it's much more stable and reliable than it has been on 
btrfs.  OTOH, if you don't need quota functionality, then I strongly 
recommend turning it off and leaving it off until at least two kernel 
cycles have gone by with it working with no stability-level issues.

Tho I'm not a dev, only a btrfs user and list regular, and my own use-case 
doesn't need quotas, so given their problems I've kept them off, and I'm 
not actually sure what the 4.4 status is.  However, even if there are no 
known problems with btrfs quotas in 4.4, given the history, as I said 
above, I strongly recommend not enabling them until at least two complete 
kernel cycles have passed with no quota issues.

Re: RAID10 question

2015-12-31 Thread Duncan
Hugo Mills posted on Thu, 31 Dec 2015 11:51:53 +0000 as excerpted:

> On Thu, Dec 31, 2015 at 09:52:16AM +0000, Xavier Romero wrote:
>> Hello,
>> 
>> I have 2 completely independent set of 12 disks each, let's name them
>> A1, A2, A3... A12 for first set, and B1, B2, B3...B12 for second set.
>> For availability purposes I want disks to be paired that way:
>> A1 <--> B1: RAID1 A2 <--> B2: RAID1 ...
>> A12 <--> B12: RAID1
>> 
>> And then I want a RAID0 out of all these RAID1.
>> 
>> I know I can achieve that by doing all the RAID1 with MD and then build
>> the RAID0 with BTRFS. But my question is: can I achieve that directly
>> with BTRFS RAID10?
> 
> No, not at the moment.

Additionally, if you're going to put btrfs on mdraid, then you may wish 
to consider reversing the above, doing raid01, which while ordinarily 
discouraged in favor of raid10, has some things going for it when the top 
level is btrfs, that raid10 doesn't.

The btrfs feature in question here is data and metadata checksumming and 
file integrity.  Btrfs normally checksums all data and metadata and 
verifies checksums at read-time, but when there's only one copy, as is 
the case with btrfs single and raid0 modes, if there's a checksum verify 
failure, all it can do is report it and fail the read.  If however, 
there's a second copy, as there is with btrfs raid1, then a checksum 
failure on the first copy will automatically failover to trying the 
second.  Assuming the second copy is good, it will use that instead of 
failing the read, and btrfs scrub can be used to systematically scrub and 
detect (if single/raid0 mode) or repair (if raid1/10 mode and the other 
copy is good) the entire filesystem.

Mdraid doesn't have that sort of integrity verification.  All it does 
with raid1 scrub is check that the copies agree, and pick an arbitrary 
copy to replace the other one with if they don't.  But for all it or you 
know, it can be replacing the good copy with the bad one, since it has no 
checksum verification to tell which is actually the good copy.

If that sort of data integrity verification and repair is of interest to 
you, you obviously want btrfs raid1, not mdraid1.  But btrfs, as the 
filesystem, must be the top layer.  So while raid10 is normally preferred 
over raid01, in this case, you may want to do raid01, putting the btrfs 
raid1 on top of the mdraid0.

Unfortunately that won't let you do a1 <-> b1, a2 <-> b2, etc.  But it 
will let you do a[1-6] <-> b[1-6], if that's good enough for your use-
case.

IOW, you have to choose between btrfs raid1 with data integrity repair on 
top, with only two mdraid0's underneath, or btrfs raid0 with only data 
integrity detection, not repair, on top, and a bunch of mdraid1 that 
don't have data integrity at all, underneath.
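The failover behaviour that makes btrfs raid1 the preferable top layer can 
be sketched like this (illustrative Python only, not the btrfs 
implementation — btrfs stores crc32c checksums per block, not sha256):

```python
import hashlib


def read_with_failover(copies):
    """Return the first copy whose stored checksum verifies.

    Each copy is a (data, stored_csum) pair, standing in for the two
    mirrors of a btrfs raid1 block.  A corrupt first copy fails its
    checksum and the read falls over to the mirror, instead of an
    arbitrary copy winning as in an mdraid1 resync.
    """
    for data, stored in copies:
        if hashlib.sha256(data).digest() == stored:
            return data
    # no verifiable copy left: the raid0/single situation
    raise IOError("all copies failed checksum verification")
```

An mdraid1 scrub, by contrast, has no stored checksum to consult, so when 
the copies disagree it can only pick one arbitrarily.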

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH 00/10] btrfs: reada: Avoid many times of empty loop

2015-12-31 Thread Zhao Lei
This is some cleanup, bugfixes and enhancements for reada, tested with a
script running scrub while checking the related trace log.

Zhao Lei (10):
  btrfs: reada: Avoid many times of empty loop
  btrfs: reada: Move is_need_to_readahead condition earlier
  btrfs: reada: add all reachable mirrors into reada device list
  btrfs: reada: bypass adding extent when all zone failed
  btrfs: reada: Remove level argument in several functions
  btrfs: reada: move reada_extent_put() to place after
__readahead_hook()
  btrfs: reada: Pass reada_extent into __readahead_hook() directly
  btrfs: reada: Use fs_info instead of root in __readahead_hook's
argument
  btrfs: reada: Jump into cleanup in direct way for __readahead_hook()
  btrfs: reada: Fix a debug code typo

 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/disk-io.c |  22 +++
 fs/btrfs/reada.c   | 173 ++---
 3 files changed, 98 insertions(+), 101 deletions(-)

-- 
1.8.5.1





[PATCH 03/10] btrfs: reada: add all reachable mirrors into reada device list

2015-12-31 Thread Zhao Lei
If some device is not reachable, we should bypass it and continue adding the
next, instead of breaking on the bad device.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index dcc5b69..7733a09 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -328,7 +328,6 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
u64 length;
int real_stripes;
int nzones = 0;
-   int i;
unsigned long index = logical >> PAGE_CACHE_SHIFT;
int dev_replace_is_ongoing;
 
@@ -380,9 +379,9 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
dev = bbio->stripes[nzones].dev;
zone = reada_find_zone(fs_info, dev, logical, bbio);
if (!zone)
-   break;
+   continue;
 
-   re->zones[nzones] = zone;
+   re->zones[re->nzones++] = zone;
spin_lock(&zone->lock);
if (!zone->elems)
kref_get(&zone->refcnt);
@@ -392,8 +391,7 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
kref_put(&zone->refcnt, reada_zone_release);
spin_unlock(&fs_info->reada_lock);
}
-   re->nzones = nzones;
-   if (nzones == 0) {
+   if (re->nzones == 0) {
/* not a single zone found, error and out */
goto error;
}
@@ -418,8 +416,9 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
prev_dev = NULL;
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(
&fs_info->dev_replace);
-   for (i = 0; i < nzones; ++i) {
-   dev = bbio->stripes[i].dev;
+   for (nzones = 0; nzones < re->nzones; ++nzones) {
+   dev = re->zones[nzones]->device;
+
if (dev == prev_dev) {
/*
 * in case of DUP, just add the first zone. As both
@@ -450,8 +449,8 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
prev_dev = dev;
ret = radix_tree_insert(&dev->reada_extents, index, re);
if (ret) {
-   while (--i >= 0) {
-   dev = bbio->stripes[i].dev;
+   while (--nzones >= 0) {
+   dev = re->zones[nzones]->device;
BUG_ON(dev == NULL);
/* ignore whether the entry was inserted */
radix_tree_delete(&dev->reada_extents, index);
@@ -470,10 +469,9 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
return re;
 
 error:
-   while (nzones) {
+   for (nzones = 0; nzones < re->nzones; ++nzones) {
struct reada_zone *zone;
 
-   --nzones;
zone = re->zones[nzones];
kref_get(&zone->refcnt);
spin_lock(&zone->lock);
-- 
1.8.5.1





[PATCH 09/10] btrfs: reada: Jump into cleanup in direct way for __readahead_hook()

2015-12-31 Thread Zhao Lei
The current code sets nritems to 0 to make the for loop a no-op in order to
bypass it, and sets generation's value, which is not necessary.
Jumping into cleanup directly is the better choice.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 40 +---
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 869bb1c..902f899 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -130,26 +130,26 @@ static void __readahead_hook(struct btrfs_fs_info *fs_info,
re->scheduled_for = NULL;
spin_unlock(&re->lock);
 
-   if (err == 0) {
-   nritems = level ? btrfs_header_nritems(eb) : 0;
-   generation = btrfs_header_generation(eb);
-   /*
-* FIXME: currently we just set nritems to 0 if this is a leaf,
-* effectively ignoring the content. In a next step we could
-* trigger more readahead depending from the content, e.g.
-* fetch the checksums for the extents in the leaf.
-*/
-   } else {
-   /*
-* this is the error case, the extent buffer has not been
-* read correctly. We won't access anything from it and
-* just cleanup our data structures. Effectively this will
-* cut the branch below this node from read ahead.
-*/
-   nritems = 0;
-   generation = 0;
-   }
+   /*
+* this is the error case, the extent buffer has not been
+* read correctly. We won't access anything from it and
+* just cleanup our data structures. Effectively this will
+* cut the branch below this node from read ahead.
+*/
+   if (err)
+   goto cleanup;
 
+   /*
+* FIXME: currently we just set nritems to 0 if this is a leaf,
+* effectively ignoring the content. In a next step we could
+* trigger more readahead depending from the content, e.g.
+* fetch the checksums for the extents in the leaf.
+*/
+   if (!level)
+   goto cleanup;
+
+   nritems = btrfs_header_nritems(eb);
+   generation = btrfs_header_generation(eb);
for (i = 0; i < nritems; i++) {
struct reada_extctl *rec;
u64 n_gen;
@@ -188,6 +188,8 @@ static void __readahead_hook(struct btrfs_fs_info *fs_info,
reada_add_block(rc, bytenr, &next_key, n_gen);
}
}
+
+cleanup:
/*
 * free extctl records
 */
-- 
1.8.5.1





[PATCH 05/10] btrfs: reada: Remove level argument in several functions

2015-12-31 Thread Zhao Lei
level is not used in several functions; remove it from their arguments,
and remove the related code that obtains its value.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index ef9457e..66409f3 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -101,7 +101,7 @@ static void reada_start_machine(struct btrfs_fs_info *fs_info);
 static void __reada_start_machine(struct btrfs_fs_info *fs_info);
 
 static int reada_add_block(struct reada_control *rc, u64 logical,
-  struct btrfs_key *top, int level, u64 generation);
+  struct btrfs_key *top, u64 generation);
 
 /* recurses */
 /* in case of err, eb might be NULL */
@@ -197,8 +197,7 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
if (rec->generation == generation &&
btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 &&
btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0)
-   reada_add_block(rc, bytenr, &next_key,
-   level - 1, n_gen);
+   reada_add_block(rc, bytenr, &next_key, n_gen);
}
}
/*
@@ -315,7 +314,7 @@ static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
 
 static struct reada_extent *reada_find_extent(struct btrfs_root *root,
  u64 logical,
- struct btrfs_key *top, int level)
+ struct btrfs_key *top)
 {
int ret;
struct reada_extent *re = NULL;
@@ -562,13 +561,13 @@ static void reada_control_release(struct kref *kref)
 }
 
 static int reada_add_block(struct reada_control *rc, u64 logical,
-  struct btrfs_key *top, int level, u64 generation)
+  struct btrfs_key *top, u64 generation)
 {
struct btrfs_root *root = rc->root;
struct reada_extent *re;
struct reada_extctl *rec;
 
-   re = reada_find_extent(root, logical, top, level); /* takes one ref */
+   re = reada_find_extent(root, logical, top); /* takes one ref */
if (!re)
return -1;
 
@@ -921,7 +920,6 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
struct reada_control *rc;
u64 start;
u64 generation;
-   int level;
int ret;
struct extent_buffer *node;
static struct btrfs_key max_key = {
@@ -944,11 +942,10 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
 
node = btrfs_root_node(root);
start = node->start;
-   level = btrfs_header_level(node);
generation = btrfs_header_generation(node);
free_extent_buffer(node);
 
-   ret = reada_add_block(rc, start, &max_key, level, generation);
+   ret = reada_add_block(rc, start, &max_key, generation);
if (ret) {
kfree(rc);
return ERR_PTR(ret);
-- 
1.8.5.1





[PATCH 02/10] btrfs: reada: Move is_need_to_readahead condition earlier

2015-12-31 Thread Zhao Lei
Move the is_need_to_readahead condition earlier to avoid a useless loop
obtaining related data for readahead.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index fb21bf0..dcc5b69 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -665,7 +665,6 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
u64 logical;
int ret;
int i;
-   int need_kick = 0;
 
spin_lock(&fs_info->reada_lock);
if (dev->reada_curr_zone == NULL) {
@@ -701,6 +700,15 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
 
spin_unlock(&fs_info->reada_lock);
 
+   spin_lock(&re->lock);
+   if (re->scheduled_for || list_empty(&re->extctl)) {
+           spin_unlock(&re->lock);
+           reada_extent_put(fs_info, re);
+           return 0;
+   }
+   re->scheduled_for = dev;
+   spin_unlock(&re->lock);
+
/*
 * find mirror num
 */
@@ -712,18 +720,8 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
}
logical = re->logical;
 
-   spin_lock(&re->lock);
-   if (!re->scheduled_for && !list_empty(&re->extctl)) {
-           re->scheduled_for = dev;
-           need_kick = 1;
-   }
-   spin_unlock(&re->lock);
-
reada_extent_put(fs_info, re);
 
-   if (!need_kick)
-   return 0;
-
atomic_inc(&fs_info->reada_in_flight);
ret = reada_tree_block_flagged(fs_info->extent_root, logical,
mirror_num, &eb);
-- 
1.8.5.1





RAID10 question

2015-12-31 Thread Xavier Romero
Hello,

I have 2 completely independent set of 12 disks each, let's name them A1, A2, 
A3... A12 for first set, and B1, B2, B3...B12 for second set. For availability 
purposes I want disks to be paired that way:
A1 <--> B1: RAID1
A2 <--> B2: RAID1
...
A12 <--> B12: RAID1

And then I want a RAID0 out of all these RAID1.

I know I can achieve that by doing all the RAID1 with MD and then build the 
RAID0 with BTRFS. But my question is: can I achieve that directly with BTRFS 
RAID10?

Best regards,
Xavier Romero


RE: quota rescan hangs

2015-12-31 Thread Xavier Romero
After restarting and removing all data, "qgroup show" tells me that a rescan 
is in progress, but "quota rescan -s" tells me the opposite.

[root@nex-dstrg-ctrl-1 btrfs]# btrfs quota rescan -w /mnt/btrfs/
quota rescan started

[root@nex-dstrg-ctrl-1 btrfs]# btrfs quota rescan -s /mnt/btrfs/
no rescan operation in progress

 [root@nex-dstrg-ctrl-1 btrfs]# btrfs qgroup show -pcre /mnt/btrfs/
WARNING: Rescan is running, qgroup data may be incorrect
qgroupid      rfer      excl  max_rfer  max_excl parent  child
--------      ----      ----  --------  -------- ------  -----
0/5       16.00KiB  16.00KiB     0.00B     0.00B ---     ---
0/389        0.00B     0.00B     0.00B     0.00B ---     ---
0/391        0.00B     0.00B  10.00GiB     0.00B ---     ---
0/525      9.99GiB     0.00B     0.00B     0.00B ---     ---
0/599     16.00KiB  16.00KiB   2.00GiB     0.00B ---     ---

[root@nex-dstrg-ctrl-1 btrfs]# btrfs sub list /mnt/btrfs/
ID 599 gen 10754 top level 5 path CLOUD_SSD_02

Not sure how to proceed!

-----Original Message-----
From: linux-btrfs-ow...@vger.kernel.org 
[mailto:linux-btrfs-ow...@vger.kernel.org] On behalf of Xavier Romero
Sent: Thursday, 31 December 2015 11:09
To: linux-btrfs@vger.kernel.org
Subject: quota rescan hangs


quota rescan hangs

2015-12-31 Thread Xavier Romero
Hello,

Using BTRFS on CentOS 7, the filesystem hangs when I run
btrfs quota rescan /mnt/btrfs/

After that I could not access the filesystem anymore until a system restart.

I did the scan because BTRFS suggested it:
[root@nex-dstrg-ctrl-1 btrfs]# btrfs qgroup show -F /mnt/btrfs/
WARNING: Qgroup data inconsistent, rescan recommended
qgroupid rfer excl
  
0/5  27.91GiB 27.91GiB
[root@nex-dstrg-ctrl-1 ~]# btrfs qgroup show -pcre /mnt/btrfs/
WARNING: Qgroup data inconsistent, rescan recommended
qgroupid rfer excl max_rfer max_excl parent  child
     --  -
0/5  27.91GiB 27.91GiB0.00B0.00B --- ---
0/389 9.77GiB  9.77GiB0.00B0.00B --- ---
0/39116.00KiB 16.00KiB 10.00GiB0.00B --- ---
0/525 9.99GiB0.00B0.00B0.00B --- ---
0/59916.00KiB 16.00KiB  2.00GiB0.00B --- ---

dmesg output:
[154385.628385] INFO: task btrfs:22324 blocked for more than 120 seconds.
[154385.629044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[154385.629648] btrfs   D 881fc3bdef20 0 22324   3677 0x0080
[154385.629652]  88114d01bc60 0082 880e2fecae00 
88114d01bfd8
[154385.629656]  88114d01bfd8 88114d01bfd8 880e2fecae00 
881fc3be1100
[154385.629658]  883fc826c9f0 883fc826c9f0 0001 
881fc3bdef20
[154385.629661] Call Trace:
[154385.629673]  [] schedule+0x29/0x70
[154385.629690]  [] wait_current_trans.isra.20+0xe7/0x130 
[btrfs]
[154385.629696]  [] ? wake_up_atomic_t+0x30/0x30
[154385.629704]  [] start_transaction+0x2b8/0x5a0 [btrfs]
[154385.629711]  [] btrfs_join_transaction+0x17/0x20 [btrfs]
[154385.629722]  [] btrfs_qgroup_rescan+0x39/0x90 [btrfs]
[154385.629731]  [] btrfs_ioctl+0x20b2/0x2b70 [btrfs]
[154385.629736]  [] ? __mem_cgroup_commit_charge+0x152/0x390
[154385.629740]  [] ? lru_cache_add+0xe/0x10
[154385.629745]  [] ? page_add_new_anon_rmap+0x91/0x130
[154385.629750]  [] ? handle_mm_fault+0x7c0/0xf50
[154385.629752]  [] ? __vma_link_rb+0xb8/0xe0
[154385.629759]  [] do_vfs_ioctl+0x2e5/0x4c0
[154385.629765]  [] ? file_has_perm+0xae/0xc0
[154385.629769]  [] ? __do_page_fault+0xb1/0x450
[154385.629771]  [] SyS_ioctl+0xa1/0xc0
[154385.629776]  [] system_call_fastpath+0x16/0x1b

Additional Info:

[root@nex-dstrg-ctrl-1 ~]# modinfo btrfs
filename:   /lib/modules/3.10.0-327.3.1.el7.x86_64/kernel/fs/btrfs/btrfs.ko
license:GPL
alias:  devname:btrfs-control
alias:  char-major-10-234
alias:  fs-btrfs
rhelversion:7.2
srcversion: B92059408E7CB90AE2D9A2F
depends:raid6_pq,xor,zlib_deflate
intree: Y
vermagic:   3.10.0-327.3.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:3D:4E:71:B0:42:9A:39:8B:8B:78:3B:6F:8B:ED:3B:AF:09:9E:E9:A7
sig_hashalgo:   sha256
[root@nex-dstrg-ctrl-1 ~]# btrfs filesystem show /mnt/btrfs/
Label: none  uuid: 6289b7b3-ba25-4c9e-af95-4a5fb18eeea1
Total devices 12 FS bytes used 29.86GiB
devid1 size 894.13GiB used 4.52GiB path /dev/md101
devid2 size 894.13GiB used 4.52GiB path /dev/md102
devid3 size 894.13GiB used 4.52GiB path /dev/md103
devid4 size 894.13GiB used 4.52GiB path /dev/md104
devid5 size 894.13GiB used 4.52GiB path /dev/md105
devid6 size 894.13GiB used 4.52GiB path /dev/md106
devid7 size 894.13GiB used 4.52GiB path /dev/md107
devid8 size 894.13GiB used 4.52GiB path /dev/md108
devid9 size 894.13GiB used 4.52GiB path /dev/md109
devid   10 size 894.13GiB used 4.52GiB path /dev/md110
devid   11 size 894.13GiB used 4.52GiB path /dev/md111
devid   12 size 894.13GiB used 4.52GiB path /dev/md112

btrfs-progs v3.19.1
[root@nex-dstrg-ctrl-1 ~]# btrfs quota rescan -s /mnt/btrfs/
rescan operation running (current key 0)
[root@nex-dstrg-ctrl-1 ~]# btrfs subvolume list /mnt/btrfs/
ID 389 gen 10663 top level 5 path vol0
ID 391 gen 10514 top level 5 path vol1
ID 599 gen 10664 top level 5 path CLOUD_SSD_02


I'm just starting with BTRFS so I could be doing something wrong! Any ideas?
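Not an answer to the hang itself, but when the qgroup numbers and the rescan status disagree as above, a workaround sometimes suggested is to drop and re-enable quotas, which discards the stale qgroup state and starts a fresh rescan. A sketch, using the same /mnt/btrfs mount point as in the report; the commands are only printed here, not executed:

```shell
# Workaround sketch only: tearing quotas down and re-enabling them resets
# qgroup accounting and kicks off a new rescan (-w waits, -s shows status).
mnt=/mnt/btrfs
steps=""
for c in "btrfs quota disable $mnt" \
         "btrfs quota enable $mnt" \
         "btrfs quota rescan -w $mnt" \
         "btrfs quota rescan -s $mnt"; do
    steps="${steps}${c}
"
    echo "$c"
done
```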

Best regards,
Xavier Romero



Re: RAID10 question

2015-12-31 Thread Hugo Mills
On Thu, Dec 31, 2015 at 09:52:16AM +, Xavier Romero wrote:
> Hello,
> 
> I have 2 completely independent sets of 12 disks each; let's name them A1, A2, 
> A3... A12 for the first set, and B1, B2, B3... B12 for the second set. For 
> availability purposes I want the disks to be paired this way:
> A1 <--> B1: RAID1
> A2 <--> B2: RAID1
> ...
> A12 <--> B12: RAID1
> 
> And then I want a RAID0 out of all these RAID1.
> 
> I know I can achieve that by doing all the RAID1s with MD and then building 
> the RAID0 with BTRFS. But my question is: can I achieve that directly with 
> BTRFS RAID10?

   No, not at the moment.

   Hugo.

-- 
Hugo Mills | Comic Sans goes into a bar, and the barman says, "We
hugo@... carfax.org.uk | don't serve your type here."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




[PATCH 07/10] btrfs: reada: Pass reada_extent into __readahead_hook directly

2015-12-31 Thread Zhao Lei
reada_start_machine_dev() already has the reada_extent pointer. Passing
it into __readahead_hook() directly, instead of searching the radix tree
again, makes the code run faster.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 7015906..7668066 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -105,33 +105,21 @@ static int reada_add_block(struct reada_control *rc, u64 
logical,
 
 /* recurses */
 /* in case of err, eb might be NULL */
-static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
-   u64 start, int err)
+static void __readahead_hook(struct btrfs_root *root, struct reada_extent *re,
+struct extent_buffer *eb, u64 start, int err)
 {
int level = 0;
int nritems;
int i;
u64 bytenr;
u64 generation;
-   struct reada_extent *re;
struct btrfs_fs_info *fs_info = root->fs_info;
struct list_head list;
-   unsigned long index = start >> PAGE_CACHE_SHIFT;
struct btrfs_device *for_dev;
 
if (eb)
level = btrfs_header_level(eb);
 
-   /* find extent */
-   spin_lock(&fs_info->reada_lock);
-   re = radix_tree_lookup(&fs_info->reada_tree, index);
-   if (re)
-   re->refcnt++;
-   spin_unlock(&fs_info->reada_lock);
-
-   if (!re)
-   return -1;
-
spin_lock(&re->lock);
/*
 * just take the full list from the extent. afterwards we
@@ -221,11 +209,11 @@ static int __readahead_hook(struct btrfs_root *root, 
struct extent_buffer *eb,
 
reada_extent_put(fs_info, re);  /* one ref for each entry */
}
-   reada_extent_put(fs_info, re);  /* our ref */
+
if (for_dev)
atomic_dec(&for_dev->reada_in_flight);
 
-   return 0;
+   return;
 }
 
 /*
@@ -235,12 +223,27 @@ static int __readahead_hook(struct btrfs_root *root, 
struct extent_buffer *eb,
 int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
 u64 start, int err)
 {
-   int ret;
+   int ret = 0;
+   struct reada_extent *re;
+   struct btrfs_fs_info *fs_info = root->fs_info;
 
-   ret = __readahead_hook(root, eb, start, err);
+   /* find extent */
+   spin_lock(&fs_info->reada_lock);
+   re = radix_tree_lookup(&fs_info->reada_tree,
+  start >> PAGE_CACHE_SHIFT);
+   if (re)
+   re->refcnt++;
+   spin_unlock(&fs_info->reada_lock);
+   if (!re) {
+   ret = -1;
+   goto start_machine;
+   }
 
-   reada_start_machine(root->fs_info);
+   __readahead_hook(fs_info, re, eb, start, err);
+   reada_extent_put(fs_info, re);  /* our ref */
 
+start_machine:
+   reada_start_machine(fs_info);
return ret;
 }
 
@@ -726,9 +729,9 @@ static int reada_start_machine_dev(struct btrfs_fs_info 
*fs_info,
ret = reada_tree_block_flagged(fs_info->extent_root, logical,
mirror_num, &eb);
if (ret)
-   __readahead_hook(fs_info->extent_root, NULL, logical, ret);
+   __readahead_hook(fs_info->extent_root, re, NULL, logical, ret);
else if (eb)
-   __readahead_hook(fs_info->extent_root, eb, eb->start, ret);
+   __readahead_hook(fs_info->extent_root, re, eb, eb->start, ret);
 
if (eb)
free_extent_buffer(eb);
-- 
1.8.5.1





[PATCH 04/10] btrfs: reada: bypass adding extent when all zone failed

2015-12-31 Thread Zhao Lei
If adding all dev_zones for a reada_extent fails, the extent has no
chance of being selected to run and stays in memory forever.

Bypass such an extent to avoid the above case.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 7733a09..ef9457e 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -330,6 +330,7 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_root *root,
int nzones = 0;
unsigned long index = logical >> PAGE_CACHE_SHIFT;
int dev_replace_is_ongoing;
+   int have_zone = 0;
 
spin_lock(&fs_info->reada_lock);
re = radix_tree_lookup(&fs_info->reada_tree, index);
@@ -461,10 +462,14 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_root *root,
btrfs_dev_replace_unlock(&fs_info->dev_replace);
goto error;
}
+   have_zone = 1;
}
spin_unlock(&fs_info->reada_lock);
btrfs_dev_replace_unlock(&fs_info->dev_replace);
 
+   if (!have_zone)
+   goto error;
+
btrfs_put_bbio(bbio);
return re;
 
-- 
1.8.5.1





[PATCH 01/10] btrfs: reada: Avoid many times of empty loop

2015-12-31 Thread Zhao Lei
We can see the following loop repeated many times in the trace log:
 [   75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 
comm=kworker/u2:3 re->ref_cnt 88003741e0c0 1 -> 2
 [   75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1
 [   75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 
re->ref_cnt 88003741e0c0 1 -> 2
 [   75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1

 [   75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 
comm=kworker/u2:3 re->ref_cnt 88003741e0c0 1 -> 2
 [   75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1
 [   75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 
re->ref_cnt 88003741e0c0 1 -> 2
 [   75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1

 ...(the same pattern repeats many more times)

 [  124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 
comm=kworker/u2:3 re->ref_cnt 88003741e0c0 1 -> 2
 [  124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1
 [  124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 
re->ref_cnt 88003741e0c0 1 -> 2
 [  124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = 
88003741e0c0, refcnt = 2 -> 1

Reason:
 If more than one user triggers reada on the same extent, the first task
 finishes setting up the reada data structures and calls
 reada_start_machine() to start, but the second task has only added a
 ref_count and has not yet completely added its reada_extctl struct.
 The reada_extent can not finish all its jobs, and keeps being selected
 in __reada_start_machine() over and over.

Fix:
 For a reada_extent without jobs, we don't need to run it; just return
 0 to let the caller break.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index c65b42f..fb21bf0 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -713,7 +713,7 @@ static int reada_start_machine_dev(struct btrfs_fs_info 
*fs_info,
logical = re->logical;
 
spin_lock(&re->lock);
-   if (re->scheduled_for == NULL) {
+   if (!re->scheduled_for && !list_empty(&re->extctl)) {
re->scheduled_for = dev;
need_kick = 1;
}
-- 
1.8.5.1





[PATCH 06/10] btrfs: reada: move reada_extent_put to place after __readahead_hook()

2015-12-31 Thread Zhao Lei
We can't release the reada_extent earlier than __readahead_hook(),
because __readahead_hook() still needs to use it; holding a refcnt is
necessary to keep it from being freed.

Actually it is not a problem after my patch named:
  Avoid many times of empty loop
It makes the reada_extent in the above line include at least one
reada_extctl, which holds one additional refcnt for the reada_extent.

But we still need this patch to keep the code's logic clean.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 66409f3..7015906 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -722,8 +722,6 @@ static int reada_start_machine_dev(struct btrfs_fs_info 
*fs_info,
}
logical = re->logical;
 
-   reada_extent_put(fs_info, re);
-
atomic_inc(&dev->reada_in_flight);
ret = reada_tree_block_flagged(fs_info->extent_root, logical,
mirror_num, &eb);
@@ -735,6 +733,8 @@ static int reada_start_machine_dev(struct btrfs_fs_info 
*fs_info,
if (eb)
free_extent_buffer(eb);
 
+   reada_extent_put(fs_info, re);
+
return 1;
 
 }
-- 
1.8.5.1





[PATCH 08/10] btrfs: reada: Use fs_info instead of root in __readahead_hook's argument

2015-12-31 Thread Zhao Lei
What __readahead_hook() needs is exactly fs_info; there is no need to
convert fs_info to root in the caller and convert back in
__readahead_hook().

Signed-off-by: Zhao Lei 
---
 fs/btrfs/ctree.h   |  4 ++--
 fs/btrfs/disk-io.c | 22 +++---
 fs/btrfs/reada.c   | 23 +++
 3 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 54e7b0d..0912f89 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4358,8 +4358,8 @@ struct reada_control *btrfs_reada_add(struct btrfs_root 
*root,
  struct btrfs_key *start, struct btrfs_key *end);
 int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
-int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
-u64 start, int err);
+int btree_readahead_hook(struct btrfs_fs_info *fs_info,
+struct extent_buffer *eb, u64 start, int err);
 
 static inline int is_fstree(u64 rootid)
 {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 974be09..9d120e4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -604,6 +604,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
int found_level;
struct extent_buffer *eb;
struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+   struct btrfs_fs_info *fs_info = root->fs_info;
int ret = 0;
int reads_done;
 
@@ -629,21 +630,21 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
 
found_start = btrfs_header_bytenr(eb);
if (found_start != eb->start) {
-   btrfs_err_rl(eb->fs_info, "bad tree block start %llu %llu",
-  found_start, eb->start);
+   btrfs_err_rl(fs_info, "bad tree block start %llu %llu",
+found_start, eb->start);
ret = -EIO;
goto err;
}
-   if (check_tree_block_fsid(root->fs_info, eb)) {
-   btrfs_err_rl(eb->fs_info, "bad fsid on block %llu",
-  eb->start);
+   if (check_tree_block_fsid(fs_info, eb)) {
+   btrfs_err_rl(fs_info, "bad fsid on block %llu",
+eb->start);
ret = -EIO;
goto err;
}
found_level = btrfs_header_level(eb);
if (found_level >= BTRFS_MAX_LEVEL) {
-   btrfs_err(root->fs_info, "bad tree block level %d",
-  (int)btrfs_header_level(eb));
+   btrfs_err(fs_info, "bad tree block level %d",
+ (int)btrfs_header_level(eb));
ret = -EIO;
goto err;
}
@@ -651,7 +652,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb),
   eb, found_level);
 
-   ret = csum_tree_block(root->fs_info, eb, 1);
+   ret = csum_tree_block(fs_info, eb, 1);
if (ret) {
ret = -EIO;
goto err;
@@ -672,7 +673,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
 err:
if (reads_done &&
test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(root, eb, eb->start, ret);
+   btree_readahead_hook(fs_info, eb, eb->start, ret);
 
if (ret) {
/*
@@ -691,14 +692,13 @@ out:
 static int btree_io_failed_hook(struct page *page, int failed_mirror)
 {
struct extent_buffer *eb;
-   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
eb = (struct extent_buffer *)page->private;
set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
eb->read_mirror = failed_mirror;
atomic_dec(&eb->io_pages);
if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(root, eb, eb->start, -EIO);
+   btree_readahead_hook(eb->fs_info, eb, eb->start, -EIO);
return -EIO;/* we fixed nothing */
 }
 
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 7668066..869bb1c 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -105,15 +105,15 @@ static int reada_add_block(struct reada_control *rc, u64 
logical,
 
 /* recurses */
 /* in case of err, eb might be NULL */
-static void __readahead_hook(struct btrfs_root *root, struct reada_extent *re,
-struct extent_buffer *eb, u64 start, int err)
+static void __readahead_hook(struct btrfs_fs_info *fs_info,
+struct reada_extent *re, struct extent_buffer *eb,
+u64 start, int err)
 {
int level = 0;
int nritems;
int i;
u64 bytenr;
u64 generation;
-   struct btrfs_fs_info *fs_info = root->fs_info;
struct list_head list;
struct btrfs_device *for_dev;
 
@@ 

[PATCH 10/10] btrfs: reada: Fix a debug code typo

2015-12-31 Thread Zhao Lei
Remove the duplicated copy of the loop, fixing the typo in the debug
code that iterates over zones.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 902f899..53ee7b1 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -898,14 +898,9 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int 
all)
printk(KERN_CONT " zone %llu-%llu devs",
re->zones[i]->start,
re->zones[i]->end);
-   for (i = 0; i < re->nzones; ++i) {
-   printk(KERN_CONT " zone %llu-%llu devs",
-   re->zones[i]->start,
-   re->zones[i]->end);
-   for (j = 0; j < re->zones[i]->ndevs; ++j) {
-   printk(KERN_CONT " %lld",
-   re->zones[i]->devs[j]->devid);
-   }
+   for (j = 0; j < re->zones[i]->ndevs; ++j) {
+   printk(KERN_CONT " %lld",
+  re->zones[i]->devs[j]->devid);
}
}
printk(KERN_CONT "\n");
-- 
1.8.5.1





Re: kernel BUG at fs/btrfs/send.c:1482

2015-12-31 Thread Filipe Manana
On Thu, Dec 31, 2015 at 4:27 PM, Stephen R. van den Berg  wrote:
> Stephen R. van den Berg wrote:
>>I'm running 4.4.0-rc7.
>>This exact problem was present on 4.0.5 and 4.3.3 too though.
>
>>I do a "btrfs send /var/lib/lxc/template64/rootfs", that generates
>>the following error consistently at the same file, over and over again:
>
>>Dec 29 14:49:04 argo kernel: kernel BUG at fs/btrfs/send.c:1482!
>
> Ok, found part of the solution.
> The kernel bug was being triggered by symbolic links in that
> subvolume that have an empty target.  It is unknown how
> these ever ended up on that partition.

Well, they can happen due to a crash, or due to snapshotting at very
specific points in time when an error happens while creating a symlink,
at least.
I've sent a change for send that makes it not BUG_ON() but instead
fail with an EIO error and print a message to dmesg/syslog telling
that an empty symlink exists:

https://patchwork.kernel.org/patch/7936741/

As for fixing the (very) rare cases where we end up creating empty
symlinks, it's not trivial to fix.
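Until that patch lands, the offending links can be located from userspace and recreated by hand before retrying the send. A minimal sketch (the subvolume path is a placeholder):

```shell
# Print every symlink under the given directory whose target is empty --
# the condition that trips the BUG_ON in send. Sketch only: any link it
# reports should be removed and recreated with a proper target.
find_empty_symlinks() {
    find "$1" -type l | while IFS= read -r link; do
        if [ -z "$(readlink "$link")" ]; then
            printf 'empty symlink: %s\n' "$link"
        fi
    done
}

# Example against a scratch directory containing one healthy symlink:
tmp=$(mktemp -d)
ln -s /etc/hostname "$tmp/ok"
find_empty_symlinks "$tmp"    # prints nothing: the link has a target
rm -r "$tmp"
```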

Thanks.

>
> The partitions have been created using regular btrfs.
> The only strange thing that might have happened, is that I ran duperemove
> over those partitions afterward.
> --
> Stephen.



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


btrfs scrub failing

2015-12-31 Thread John Center
Hi,

I run a weekly scrub using Marc Merlin's btrfs-scrub script.
Usually it completes without a problem, but this week it failed.  I
ran the scrub manually and it stops shortly after starting:
john@mariposa:~$ sudo /sbin/btrfs scrub start -BdR /dev/md124p2
ERROR: scrubbing /dev/md124p2 failed for device id 1: ret=-1, errno=5
(Input/output error)
scrub device /dev/md124p2 (id 1) canceled
scrub started at Thu Dec 31 00:26:34 2015 and was aborted after 00:01:29
data_extents_scrubbed: 110967
tree_extents_scrubbed: 99638
data_bytes_scrubbed: 2548817920
tree_bytes_scrubbed: 1632468992
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 1573
csum_discards: 74371
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 4729667584

john@mariposa:~$ sudo /sbin/btrfs scrub status /dev/md124p2
scrub status for 9b5a6959-7df1-4455-a643-d369487d24aa
scrub started at Thu Dec 31 00:29:06 2015, running for 00:01:15
total bytes scrubbed: 3.46GiB with 0 errors

My Ubuntu 14.04 workstation is using the 4.2 kernel (Wily).  I'm using
btrfs-tools v4.3.1.  Btrfs is on top of mdadm raid1 (imsm).
Autodefrag is enabled. Both drives have checked out ok on smart tests.
Some directories are set up with nodatacow for VMs, etc.

john@mariposa:~$ sudo btrfs fi show
Label: none  uuid: 9b5a6959-7df1-4455-a643-d369487d24aa
Total devices 1 FS bytes used 961.46GiB
devid1 size 1.76TiB used 978.04GiB path /dev/md124p2

Funny thing is, if the scrub hadn't failed, I wouldn't know there were
any problems!  I've rebooted twice since the original scrub that
failed w/o a problem.  I've backed up all my files to an ext4
partition, again w/o a problem.

I've been searching for a clue on the wiki, mailing list, etc. on how
to fix this, but I'm at a loss.  From what I read, I shouldn't be able
to boot my workstation.  How should I go about repairing this?  Any
help would be greatly appreciated.

Thanks.

-John


BTW, I did run btrfs-find-root at one point & got the following:

john@mariposa:~$ sudo btrfs-find-root /dev/md124p2
Superblock thinks the generation is 1031315
Superblock thinks the level is 1
Found tree root at 1039015591936 gen 1031315 level 1
Well block 1039013101568(gen: 1031314 level: 1) seems good, but
generation/level doesn't match, want gen: 1031315 level: 1
Well block 1039003533312(gen: 1031313 level: 1) seems good, but
generation/level doesn't match, want gen: 1031315 level: 1
Well block 1039006171136(gen: 1031311 level: 0) seems good, but
generation/level doesn't match, want gen: 1031315 level: 1

... 500+ lines skipped...

Well block 519183810560(gen: 163422 level: 0) seems good, but
generation/level doesn't match, want gen: 1031315 level: 1
Well block 143915876352(gen: 38834 level: 0) seems good, but
generation/level doesn't match, want gen: 1031315 level: 1
Well block 4243456(gen: 3 level: 0) seems good, but generation/level
doesn't match, want gen: 1031315 level: 1
Well block 4194304(gen: 2 level: 0) seems good, but generation/level
doesn't match, want gen: 1031315 level: 1


btrfs fail behavior when a device vanishes

2015-12-31 Thread Chris Murphy
This is a torture test, no data is at risk.

Two devices, btrfs raid1 with some stuff on them.
Copy from that array, elsewhere.
During copy, yank the active device.

dmesg shows many of these:

[ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
652123, rd 697237, flush 0, corrupt 0, gen 0

Why are the write errors nearly as high as the read errors, when there
is only a copy from this device happening?

Is Btrfs trying to write the read error count (for dev stats) of sdc1
onto sdc1, and that causes a write error?

Also, is there a command to make a block device go away? At least in
gnome shell when I eject a USB stick, it isn't just umounted, it no
longer appears with lsblk or blkid, so I'm wondering if there's a way
to vanish a misbehaving device so that Btrfs isn't bogged down with a
flood of retries.
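On the "make a block device go away" question: for SCSI-backed devices (SATA disks, USB sticks) the kernel exposes a per-device delete hook in sysfs, which is presumably what the desktop eject path ends up using. A sketch with a hypothetical device name; the destructive write is only shown, never performed:

```shell
# Writing 1 to this sysfs attribute detaches a SCSI device from the
# system entirely, so it disappears from lsblk/blkid and the kernel
# stops retrying I/O to it. "sdc" is a hypothetical placeholder; the
# command is printed instead of executed.
dev=sdc
knob=/sys/block/$dev/device/delete
echo "echo 1 > $knob"
```

Rescanning the controller (or replugging the device) brings it back.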

In case anyone is curious, the entire dmesg from device insertion,
formatting, mounting, copying to then from, and device yanking is here
(should be permanent):
http://pastebin.com/raw/Wfe1pY4N

And the copy did successfully complete anyway, and the resulting files
have the same hashes as their originals. So, yay, despite the noisy
messages.


-- 
Chris Murphy


kernel BUG at fs/btrfs/send.c:1482

2015-12-31 Thread Stephen R. van den Berg
I'm running 4.4.0-rc7.
This exact problem was present on 4.0.5 and 4.3.3 too though.

I do a "btrfs send /var/lib/lxc/template64/rootfs", that generates
the following error consistently at the same file, over and over again:

Dec 29 14:49:04 argo kernel: kernel BUG at fs/btrfs/send.c:1482!
Dec 29 14:49:04 argo kernel: Modules linked in: nfsd
Dec 29 14:49:04 argo kernel: task: 880041295c40 ti: 88010423c000 
task.ti: 88010423c000
Dec 29 14:49:04 argo kernel: RSP: 0018:88010423fb20  EFLAGS: 00010202
Dec 29 14:49:04 argo kernel: RDX: 0001 RSI:  RDI: 

Dec 29 14:49:04 argo kernel: R10: 88019d53b5e0 R11:  R12: 
8801b35ac510
Dec 29 14:49:04 argo kernel: FS:  7fac9113f8c0() 
GS:88022fd8() knlGS:
Dec 29 14:49:04 argo kernel: CR2: 7f99ba308520 CR3: 000154a4 CR4: 
001006e0
Dec 29 14:49:04 argo kernel: 81ed 0009b15a a1ff 

Dec 29 14:49:04 argo kernel: 0009b15a  1a03 
8801baf2e800
Dec 29 14:49:04 argo kernel: [] 
send_create_inode_if_needed+0x30/0x49
Dec 29 14:49:04 argo kernel: [] ? btrfs_item_key+0x19/0x1b
Dec 29 14:49:04 argo kernel: [] 
btrfs_compare_trees+0x2f2/0x4fe
Dec 29 14:49:04 argo kernel: [] btrfs_ioctl_send+0x846/0xce5
Dec 29 14:49:04 argo kernel: [] ? 
try_to_freeze_unsafe+0x9/0x32
Dec 29 14:49:04 argo kernel: [] ? _raw_spin_lock_irq+0xf/0x11
Dec 29 14:49:04 argo kernel: [] ? ptrace_do_notify+0x84/0x95
Dec 29 14:49:04 argo kernel: [] SyS_ioctl+0x43/0x61
Dec 29 14:49:04 argo kernel: RIP  [] 
send_create_inode+0x1ce/0x30d


On the receiving end, I have a "btrfs receive" which takes the above stream as
input, and *always* reports this:

receiving snapshot 20151230-141324.1451484804.965085668@argo 
uuid=53df0616-5715-ad40-ae81-78a023860fe0, ctransid=649684 
parent_uuid=d3f807da-1e9d-aa4d-ab01-77ce5e2fbcd7, parent_ctransid=649735
utimes 
rename bin -> o257-379784-0
mkdir o257-34888-0
rename o257-34888-0 -> bin
utimes 
chown bin - uid=0, gid=0
chmod bin - mode=0755
utimes bin
rmdir boot
ERROR: rmdir boot failed. No such file or directory
mkdir o258-34888-0
rename o258-34888-0 -> boot
utimes 
chown boot - uid=0, gid=0
chmod boot - mode=0755
utimes boot
rename dev -> o259-379784-0
mkdir o259-34888-0
rename o259-34888-0 -> dev
... rest of the logging follows as normal...
... then we get ...
rmdir media
mkdir o264-34888-0
rename o264-34888-0 -> media
utimes 
chown media - uid=0, gid=0
chmod media - mode=0755
utimes media
rmdir mnt
ERROR: rmdir mnt failed. No such file or directory
rmdir opt
mkdir o266-34888-0
rename o266-34888-0 -> opt
utimes 
... continues as normal ...

It then still creates lots of files, until it encounters the sudden EOF
due to the sending side experiencing the kernel bug and abruptly halting
the send.

Since the problem is consistently and easily reproducible, I can immediately
try any proposed patches or fixes (or provide more insight into the
subvolume this problem occurs with).
Numerous other subvolumes in the same BTRFS partition work flawlessly
using btrfs send/receive.

The sending partition is RAID0 across two 512 GB SSDs.  The receiving
partition is RAID1 across six 6 TB HDDs.
-- 
Stephen.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG at fs/btrfs/send.c:1482

2015-12-31 Thread Stephen R. van den Berg
Stephen R. van den Berg wrote:
>I'm running 4.4.0-rc7.
>This exact problem was present on 4.0.5 and 4.3.3 too though.

>I do a "btrfs send /var/lib/lxc/template64/rootfs", that generates
>the following error consistently at the same file, over and over again:

>Dec 29 14:49:04 argo kernel: kernel BUG at fs/btrfs/send.c:1482!

Ok, found part of the solution.
The kernel bug was being triggered by symbolic links in that
subvolume that have an empty target.  It is unknown how
these ever ended up on that partition.
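For anyone who wants to check a subvolume for the same kind of damage before running send, a short loop like this should list empty-target symlinks (a sketch; the path is just a placeholder, not the reporter's actual mount point):

```shell
# List symlinks whose target is empty. symlink(2) normally rejects an empty
# target with ENOENT, so any hits are almost certainly corruption artifacts.
# /path/to/subvolume is a placeholder; point it at the subvolume to scan.
find /path/to/subvolume -type l | while IFS= read -r link; do
    if [ -z "$(readlink "$link")" ]; then
        printf 'empty symlink: %s\n' "$link"
    fi
done
```

On a healthy filesystem this prints nothing; any line it does print names a symlink that send's read_symlink() would trip over.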

The partitions have been created using regular btrfs.
The only strange thing that might have happened is that I ran duperemove
over those partitions afterward.
-- 
Stephen.


[PATCH] Btrfs: fix number of transaction units required to create symlink

2015-12-31 Thread fdmanana
From: Filipe Manana 

We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2ea2e0e..5dbc07a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9660,9 +9660,11 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
/*
 * 2 items for inode item and ref
 * 2 items for dir items
+* 1 item for updating parent inode item
+* 1 item for the inline extent item
 * 1 item for xattr if selinux is on
 */
-   trans = btrfs_start_transaction(root, 5);
+   trans = btrfs_start_transaction(root, 7);
if (IS_ERR(trans))
return PTR_ERR(trans);
 
-- 
2.1.3



[PATCH] Btrfs: don't leave dangling dentry if symlink creation failed

2015-12-31 Thread fdmanana
From: Filipe Manana 

When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.

Example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt
  $ touch /mnt/foo
  $ ln -s /mnt/foo /mnt/bar
  ln: failed to create symbolic link ‘bar’: Cannot allocate memory
  $ umount /mnt
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, 
no inode ref
  found 131073 bytes used err is 1
  total csum bytes: 0
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 124305
  file data blocks allocated: 262144
   referenced 262144
  btrfs-progs v4.2.3

So fix this by adding the directory index entries as the very last
step of symlink creation.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bdb0008..2ea2e0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9693,10 +9693,6 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
if (err)
goto out_unlock_inode;
 
-   err = btrfs_add_nondir(trans, dir, dentry, inode, 0, index);
-   if (err)
-   goto out_unlock_inode;
-
path = btrfs_alloc_path();
if (!path) {
err = -ENOMEM;
@@ -9733,6 +9729,13 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
inode_set_bytes(inode, name_len);
btrfs_i_size_write(inode, name_len);
err = btrfs_update_inode(trans, root, inode);
+   /*
+* Last step, add directory indexes for our symlink inode. This is the
+* last step to avoid extra cleanup of these indexes if an error happens
+* elsewhere above.
+*/
+   if (!err)
+   err = btrfs_add_nondir(trans, dir, dentry, inode, 0, index);
if (err) {
drop_inode = 1;
goto out_unlock_inode;
-- 
2.1.3



[PATCH] Btrfs: send, don't BUG_ON() when an empty symlink is found

2015-12-31 Thread fdmanana
From: Filipe Manana 

When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
latter can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).

So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.

Reported-by: Stephen R. van den Berg 
Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 355a458..63a6152 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1469,7 +1469,21 @@ static int read_symlink(struct btrfs_root *root,
ret = btrfs_search_slot(NULL, root, , path, 0, 0);
if (ret < 0)
goto out;
-   BUG_ON(ret);
+   if (ret) {
+   /*
+* An empty symlink inode. Can happen in rare error paths when
+* creating a symlink (transaction committed before the inode
+* eviction handler removed the symlink inode items and a crash
+* happened in between or the subvol was snapshoted in between).
+* Print an informative message to dmesg/syslog so that the user
+* can delete the symlink.
+*/
+   btrfs_err(root->fs_info,
+ "Found empty symlink inode %llu at root %llu",
+ ino, root->root_key.objectid);
+   ret = -EIO;
+   goto out;
+   }
 
ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
struct btrfs_file_extent_item);
-- 
2.1.3



Re: kernel BUG at fs/btrfs/send.c:1482

2015-12-31 Thread Christoph Anton Mitterer
On Thu, 2015-12-31 at 18:29 +, Filipe Manana wrote:
> As for fixing the (very) rare cases where we end up creating empty
> symlinks, it's not trivial to fix.
Would it be reasonable to have btrfs-check list such broken symlinks?


Cheers,
Chris.



Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread ronnie sahlberg
Here is a kludge I hacked up.
Someone who cares could clean this up and start building a proper
test suite or something.

This test script creates a 3-disk raid1 filesystem and very slowly
writes a large file onto the filesystem while, one by one, each disk is
disconnected and then reconnected in a loop.
It is fairly trivial to trigger data loss when devices are bounced like this.

You have to run the script as root due to the calls to [u]mount and iscsiadm.




On Thu, Dec 31, 2015 at 1:23 PM, ronnie sahlberg
 wrote:
> On Thu, Dec 31, 2015 at 12:11 PM, Chris Murphy  
> wrote:
>> This is a torture test, no data is at risk.
>>
>> Two devices, btrfs raid1 with some stuff on them.
>> Copy from that array, elsewhere.
>> During copy, yank the active device.
>>
>> dmesg shows many of these:
>>
>> [ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
>> 652123, rd 697237, flush 0, corrupt 0, gen 0
>
> For automated tests a good way could be to build a multi device btrfs 
> filesystem
> ontop of it.
> For example STGT exporting n# volumes and then mount via the loopback 
> interface.
> Then you could just use tgtadm to add / remove the device in a
> controlled fashion and to any filesystem it will look exactly like if
> you pulled the device physically.
>
> This allows you to run fully automated and scripted "how long before
> the filesystem goes into total dataloss mode" tests.
>
>
>
> If you want more fine control than just plug/unplug on a live
> filesystem , you can use
> https://github.com/rsahlberg/flaky-stgt
> Again, this uses iSCSI but it allows you to script event such as
> "this range of blocks are now Uncorrectable read error" etc.
> To automatically stress test that the filesystem can deal with it.
>
>
> I created this STGT fork so that filesystem testers would have a way
> to automate testing of their failure paths.
> In particular for BTRFS which seems to still be incredible fragile
> when devices fail or disconnect.
>
> Unfortunately I don't think anyone cared very much. :-(
> Please BTRFS devs,  please use something like this for testing of
> failure modes and robustness. Please!
>
>
>
>>
>> Why are the write errors nearly as high as the read errors, when there
>> is only a copy from this device happening?
>>
>> Is Btrfs trying to write the read error count (for dev stats) of sdc1
>> onto sdc1, and that causes a write error?
>>
>> Also, is there a command to make a block device go away? At least in
>> gnome shell when I eject a USB stick, it isn't just umounted, it no
>> longer appears with lsblk or blkid, so I'm wondering if there's a way
>> to vanish a misbehaving device so that Btrfs isn't bogged down with a
>> flood of retries.
>>
>> In case anyone is curious, the entire dmesg from device insertion,
>> formatting, mounting, copying to then from, and device yanking is here
>> (should be permanent):
>> http://pastebin.com/raw/Wfe1pY4N
>>
>> And the copy did successfully complete anyway, and the resulting files
>> have the same hashes as their originals. So, yay, despite the noisy
>> messages.
>>
>>
>> --
>> Chris Murphy


test_0100_write_raid1_unplug.sh
Description: Bourne shell script


functions.sh
Description: Bourne shell script


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread Chris Murphy
On Thu, Dec 31, 2015 at 6:09 PM, ronnie sahlberg
 wrote:
> Here is a kludge I hacked up.
> Someone that cares could clean this up and start building a proper
> test suite or something.
>
> This test script creates a 3 disk raid1 filesystem and very slowly
> writes a large file onto the filesystem while, one by one each disk is
> disconnected then reconnected in a loop.
> It is fairly trivial to trigger dataloss when devices are bounced like this.

Yes, it's quite a torture test. I'd expect this to be a problem for
Btrfs at least until this feature is done:

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Take_device_with_heavy_IO_errors_offline_or_mark_as_.22unreliable.22

And maybe this one too
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation

Already we know that Btrfs tries to write indefinitely to missing
devices. If it reappears, what gets written? Will that device be
consistent? And then another one goes missing, comes back, now
possibly two devices with totally different states for identical
generations. It's a mess. We know that trivially causes major
corruption with btrfs raid1 if a user mounts e.g. devid1 rw,degraded
modifies that; then mounts devid2 (only) rw,degraded and modifies it;
and then mounts both devids together. Kablewy. Big mess. And that's
umounting each one in between those steps; not even the abrupt
disconnect/reconnect.


-- 
Chris Murphy


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread Chris Murphy
On Thu, Dec 31, 2015 at 1:24 PM, Hugo Mills  wrote:
> On Thu, Dec 31, 2015 at 01:11:25PM -0700, Chris Murphy wrote:
>> This is a torture test, no data is at risk.
>>
>> Two devices, btrfs raid1 with some stuff on them.
>> Copy from that array, elsewhere.
>> During copy, yank the active device.
>>
>> dmesg shows many of these:
>>
>> [ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
>> 652123, rd 697237, flush 0, corrupt 0, gen 0
>>
>> Why are the write errors nearly as high as the read errors, when there
>> is only a copy from this device happening?
>
>I'm guessing completely here, but maybe it's trying to write
> corrected data to sdc1, because the original read failed?
>

Egads. OK that makes sense.

-- 
Chris Murphy


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread Hugo Mills
On Thu, Dec 31, 2015 at 01:11:25PM -0700, Chris Murphy wrote:
> This is a torture test, no data is at risk.
> 
> Two devices, btrfs raid1 with some stuff on them.
> Copy from that array, elsewhere.
> During copy, yank the active device.
> 
> dmesg shows many of these:
> 
> [ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
> 652123, rd 697237, flush 0, corrupt 0, gen 0
> 
> Why are the write errors nearly as high as the read errors, when there
> is only a copy from this device happening?

   I'm guessing completely here, but maybe it's trying to write
corrected data to sdc1, because the original read failed?

   Hugo.

> Is Btrfs trying to write the read error count (for dev stats) of sdc1
> onto sdc1, and that causes a write error?
> 
> Also, is there a command to make a block device go away? At least in
> gnome shell when I eject a USB stick, it isn't just umounted, it no
> longer appears with lsblk or blkid, so I'm wondering if there's a way
> to vanish a misbehaving device so that Btrfs isn't bogged down with a
> flood of retries.
> 
> In case anyone is curious, the entire dmesg from device insertion,
> formatting, mounting, copying to then from, and device yanking is here
> (should be permanent):
> http://pastebin.com/raw/Wfe1pY4N
> 
> And the copy did successfully complete anyway, and the resulting files
> have the same hashes as their originals. So, yay, despite the noisy
> messages.
> 
> 

-- 
Hugo Mills | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4  | The Goons




btrfs send clone use case

2015-12-31 Thread Chris Murphy
I haven't previously heard of this use case for -c option. It seems to
work (no errors or fs weirdness afterward).

The gist: send a snapshot from drive 1 to drive 2; rw snapshot of the
drive 2 copy, and then make changes to it, then make an ro snapshot;
now send it back to drive 1 *as an incremental* send.

[dated subvolumes are ro, undated ones are rw]


# btrfs send /brick1/chrishome-20151128 | btrfs receive /brick2
# btrfs sub snap /brick2/chrishome-20151128 /brick2/chrishome
## make some modifications to chrishome contents
# btrfs sub snap -r /brick2/chrishome /brick2/chrishome-20151230
# btrfs send -p /brick2/chrishome-20151128 chrishome-20151230 | btrfs receive /brick1
ERROR: check if we support uuid tree fails - Operation not permitted
At subvol chrishome:20151230/

However,

# btrfs send -p /brick2/chrishome-20151128 -c /brick2/chrishome-20151128 chrishome-20151230 | btrfs receive /brick1

works. And it's fast (it's ~100G so I'd know if it weren't sending an
increment).

chrishome-20151128 is obviously identical on both sides in this case;
but I guess -c just acts to explicitly confirm this is true? The
brick2/chrishome-20151128 has a Received UUID that
matches the UUID of brick1/chrishome-20151128, so it seems their
identical states should be known?

Slightly confusing though: brick1/chrishome:20151230 (the one
resulting from the successful -p -c command) has the same Parent UUID
and Received UUID, which is the UUID of brick1/chrishome:20151128.
That's not really its parent; since it's a received subvolume, I'd
expect this to be "-", as it is for any other received subvolume
(which doesn't really have a parent).

Anyway it seems to be working.

-- 
Chris Murphy


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread ronnie sahlberg
On Thu, Dec 31, 2015 at 12:11 PM, Chris Murphy  wrote:
> This is a torture test, no data is at risk.
>
> Two devices, btrfs raid1 with some stuff on them.
> Copy from that array, elsewhere.
> During copy, yank the active device.
>
> dmesg shows many of these:
>
> [ 7179.373245] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
> 652123, rd 697237, flush 0, corrupt 0, gen 0

For automated tests, a good way could be to build a multi-device btrfs
filesystem on top of it.
For example, STGT exporting a number of volumes, which are then mounted via
the loopback interface.
Then you could just use tgtadm to add/remove the device in a
controlled fashion, and to any filesystem it will look exactly as if
you had pulled the device physically.

This allows you to run fully automated and scripted "how long before
the filesystem goes into total dataloss mode" tests.



If you want more fine control than just plug/unplug on a live
filesystem , you can use
https://github.com/rsahlberg/flaky-stgt
Again, this uses iSCSI, but it allows you to script events such as
"this range of blocks now returns uncorrectable read errors" etc.,
to automatically stress-test that the filesystem can deal with it.


I created this STGT fork so that filesystem testers would have a way
to automate testing of their failure paths.
In particular for BTRFS, which still seems to be incredibly fragile
when devices fail or disconnect.

Unfortunately I don't think anyone cared very much. :-(
Please BTRFS devs,  please use something like this for testing of
failure modes and robustness. Please!



>
> Why are the write errors nearly as high as the read errors, when there
> is only a copy from this device happening?
>
> Is Btrfs trying to write the read error count (for dev stats) of sdc1
> onto sdc1, and that causes a write error?
>
> Also, is there a command to make a block device go away? At least in
> gnome shell when I eject a USB stick, it isn't just umounted, it no
> longer appears with lsblk or blkid, so I'm wondering if there's a way
> to vanish a misbehaving device so that Btrfs isn't bogged down with a
> flood of retries.
>
> In case anyone is curious, the entire dmesg from device insertion,
> formatting, mounting, copying to then from, and device yanking is here
> (should be permanent):
> http://pastebin.com/raw/Wfe1pY4N
>
> And the copy did successfully complete anyway, and the resulting files
> have the same hashes as their originals. So, yay, despite the noisy
> messages.
>
>
> --
> Chris Murphy


Re: RAID10 question

2015-12-31 Thread Chris Murphy
On Thu, Dec 31, 2015 at 6:31 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> Additionally, if you're going to put btrfs on mdraid, then you may wish
> to consider reversing the above, doing raid01, which while ordinarily
> discouraged in favor of raid10, has some things going for it when the top
> level is btrfs, that raid10 doesn't.

Yes, although it's a fine balance deciding how to build such a large
volume from so many drives. If you use many drives per raid0, a failure
takes a long time to rebuild. If you use few drives per raid0, rebuilds
are fast, but the exposure/risk from a 2nd failure is higher. E.g., two
extremes:

12x raid0 "bank A" and 12x raid0 "bank B"

If one drive dies, an entire bank is gone, and it's a long rebuild,
but if a 2nd drive dies, nearly 50/50 chance it dies in the same
already dead bank.

2x raid0 "bank A", 2x raid0 "bank B", and so on through "bank L"

If one drive dies in bank A, then A is gone, with a short rebuild time,
but if a 2nd drive dies, almost certainly it will not be the 2nd bank A
drive, meaning it's in another bank, and that means the whole array is
mortally wounded. Depending on what's missing and what needs to be
accessed, it might work OK for seconds, minutes, or hours, and then
totally implode. There's no way to predict it in advance.
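To put rough numbers on the trade-off above (a back-of-the-envelope sketch of my own, assuming the second failure hits a uniformly random surviving drive): once one drive dies, its whole bank is gone, and the array survives a second failure only if that failure lands in the already-dead bank, i.e. with probability (drives_per_bank - 1) / (N - 1):

```shell
# P(2nd failure lands in the already-dead bank) = (n/b - 1) / (n - 1),
# where n = total drives and b = number of raid0 banks.
survive_2nd() {
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (n / b - 1) / (n - 1) }'
}
survive_2nd 24 2    # 2 banks of 12 -> 0.478, the "nearly 50/50" case above
survive_2nd 24 12   # 12 banks of 2 -> 0.043, a 2nd failure is almost always fatal
```

The two extremes match the scenarios described: big banks survive a second failure about half the time but rebuild slowly; small banks rebuild fast but a second failure is nearly always in a live bank.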

Anyway, I'd sooner go with 3x raid5, or 6x raid6, and then pool them
with glusterfs. Even on a single node, using replication across the
separate raid5 bricks is more reliable than a 24x raid10, whether
md+xfs or btrfs. That makes it effectively a raid 51. And if half the
storage is put on another node, you get power supply and some network
redundancy too.

-- 
Chris Murphy


Unrecoverable fs corruption?

2015-12-31 Thread Alexander Duscheleit

Hello,

I had a power failure today at my home server, and after the reboot the
btrfs RAID1 won't come back up.


When trying to mount one of the 2 disks of the array I get the following 
error:

[ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
[ 4126.316402] BTRFS: has skinny extents
[ 4126.337324] BTRFS: failed to read chunk tree on sdb2
[ 4126.353027] BTRFS: open_ctree failed

a btrfs check segfaults after a few seconds with the following message:
(0:29)[root@hera]~  # ❯❯❯ btrfs check /dev/sdb2
warning devid 1 not found already
bad key ordering 68 69
Checking filesystem on /dev/sdb2
UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3
checking extents
bad key ordering 68 69
bad block 6513625202688
Errors found in extent allocation tree or chunk allocation
[1]11164 segmentation fault  btrfs check /dev/sdb2

I have 2 btrfs-images (one with -w, one without), but they are 6.1G and
1.1G respectively; I don't know whether I can upload them at all, nor
where to store such large files.


I did try a btrfs check --repair on one of the disks which gave the 
following result:

enabling repair mode
warning devid 1 not found already
bad key ordering 68 69
repair mode will force to clear out log tree, Are you sure? [y/N]: y
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x44161e]
btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
btrfs(__btrfs_cow_block+0xc4)[0x437d64]
btrfs(btrfs_cow_block+0x35)[0x438365]
btrfs[0x43d3d6]
btrfs(btrfs_commit_transaction+0x95)[0x43f125]
btrfs(cmd_check+0x5ec)[0x429cdc]
btrfs(main+0x82)[0x40ef32]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
btrfs(_start+0x29)[0x40f039]


That's all I tried so far.
btrfs restore -viD seems to find most of the files accessible, but since
I don't have a spare hdd of sufficient size, I would have to break the
array, reformat, and use one of the disks as the restore target. I'm not
prepared to do this before I know there is no other way to fix the
drives, since I'm essentially destroying one more chance at saving the
data.

Is there anything I can do to get the fs out of this mess?
--
Alexander Duscheleit


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread ronnie sahlberg
On Thu, Dec 31, 2015 at 5:27 PM, Chris Murphy  wrote:
> On Thu, Dec 31, 2015 at 6:09 PM, ronnie sahlberg
>  wrote:
>> Here is a kludge I hacked up.
>> Someone that cares could clean this up and start building a proper
>> test suite or something.
>>
>> This test script creates a 3 disk raid1 filesystem and very slowly
>> writes a large file onto the filesystem while, one by one each disk is
>> disconnected then reconnected in a loop.
>> It is fairly trivial to trigger dataloss when devices are bounced like this.
>
> Yes, it's quite a torture test. I'd expect this would be a problem for
> Btrfs until this feature is done at least:
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Take_device_with_heavy_IO_errors_offline_or_mark_as_.22unreliable.22
>
> And maybe this one too
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation
>
> Already we know that Btrfs tries to write indefinitely to missing
> devices. If it reappears, what gets written? Will that device be
> consistent? And then another one goes missing, comes back, now
> possibly two devices with totally different states for identical
> generations. It's a mess. We know that trivially causes major
> corruption with btrfs raid1 if a user mounts e.g. devid1 rw,degraded
> modifies that; then mounts devid2 (only) rw,degraded and modifies it;
> and then mounts both devids together. Kablewy. Big mess. And that's
> umounting each one in between those steps; not even the abrupt
> disconnect/reconnect.

Based on my test_0100..., one could create a test script for that
scenario too.
Even if btrfs cannot handle it yet, it does not hurt to have these
tests for scenarios that MUST work before the filesystem goes officially
"stable+production".
Having these tests may even make the work to close the robustness gap
easier, since the devs will have reproducible test scripts they can
validate new features against.


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread Chris Murphy
On Thu, Dec 31, 2015 at 1:11 PM, Chris Murphy  wrote:

> Also, is there a command to make a block device go away?

Maybe?
echo 1 > /sys/block/device-name/device/delete



-- 
Chris Murphy


Re: Unrecoverable fs corruption?

2015-12-31 Thread Chris Murphy
On Thu, Dec 31, 2015 at 4:36 PM, Alexander Duscheleit
 wrote:
> Hello,
>
> I had a power fail today at my home server and after the reboot the btrfs
> RAID1 won't come back up.
>
> When trying to mount one of the 2 disks of the array I get the following
> error:
> [ 4126.316396] BTRFS info (device sdb2): disk space caching is enabled
> [ 4126.316402] BTRFS: has skinny extents
> [ 4126.337324] BTRFS: failed to read chunk tree on sdb2
> [ 4126.353027] BTRFS: open_ctree failed


Why are you trying to mount only one? What mount options did you use
when you did this?


>
> a btrfs check segfaults after a few seconds with the following message:
> (0:29)[root@hera]~  # ❯❯❯ btrfs check /dev/sdb2
> warning devid 1 not found already
> bad key ordering 68 69
> Checking filesystem on /dev/sdb2
> UUID: d55fa866-3baa-4e73-bf3e-5fda29672df3
> checking extents
> bad key ordering 68 69
> bad block 6513625202688
> Errors found in extent allocation tree or chunk allocation
> [1]11164 segmentation fault  btrfs check /dev/sdb2
>
> I have 2 btrfs-images (one with -w, one without) but they are 6.1G and 1.1G
> repectively, I don't know
> if I can upload them at all and also not where to store such large files.
>
> I did try a btrfs check --repair on one of the disks which gave the
> following result:
> enabling repair mode
> warning devid 1 not found already
> bad key ordering 68 69
> repair mode will force to clear out log tree, Are you sure? [y/N]: y
> Unable to find block group for 0
> extent-tree.c:289: find_search_start: Assertion `1` failed.
> btrfs[0x44161e]
> btrfs(btrfs_reserve_extent+0xa7b)[0x4463db]
> btrfs(btrfs_alloc_free_block+0x5f)[0x44649f]
> btrfs(__btrfs_cow_block+0xc4)[0x437d64]
> btrfs(btrfs_cow_block+0x35)[0x438365]
> btrfs[0x43d3d6]
> btrfs(btrfs_commit_transaction+0x95)[0x43f125]
> btrfs(cmd_check+0x5ec)[0x429cdc]
> btrfs(main+0x82)[0x40ef32]
> /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f881f983610]
> btrfs(_start+0x29)[0x40f039]
>
>
> That's all I tried so far.
> btrfs restore -viD seems to find most of the files accessible but since I
> don't have a spare hdd of
> sufficient size I would have to break the array and reformat and use one of
> the disk as restore
> target. I'm not prepared to do this before I know there is no other way to
> fix the drives since I'm
> essentially destroying one more chance at saving the data.
>
> Is there anything I can do to get the fs out of this mess?

I'm skeptical about the logic of using --repair, which modifies the
filesystem, on just one device of a two-device raid1, while saying
you're reluctant to "break the array." It doesn't make sense to me to
expect such modification of one of the drives to keep it at all
consistent with the other. I hope a dev can say whether --repair with
a missing device is a bad idea, because if so maybe degraded repairs
need a --force flag to keep users from making things worse.

Anyway, in the meantime, my advice is: do not mount either device rw
(together or separately). The fewer changes you make right now, the
better.

What kernel and btrfs-progs version are you using?

Did you try to mount with -o recovery, or -o ro,recovery before trying
'btrfs check --repair' ? If so, post all relevant kernel messages.
Don't try -o recovery now if you haven't previously tried it; it's
probably safe to try -o ro,recovery if you haven't tried that yet. I
would try -o ro,recovery three ways: both devs, and each dev
separately (for which you'll use -o ro,recovery,degraded).

If that doesn't work, it sounds like it might be a task for 'btrfs
rescue chunk-recover' which will take a long time. But I suggest
waiting as long as possible for a reply, and in the meantime I suggest
looking at getting another drive to use as spare so you can keep both
of these drives.


-- 
Chris Murphy


Re: btrfs fail behavior when a device vanishes

2015-12-31 Thread ronnie sahlberg
On Thu, Dec 31, 2015 at 5:27 PM, Chris Murphy  wrote:
> On Thu, Dec 31, 2015 at 6:09 PM, ronnie sahlberg
>  wrote:
>> Here is a kludge I hacked up.
>> Someone that cares could clean this up and start building a proper
>> test suite or something.
>>
>> This test script creates a 3 disk raid1 filesystem and very slowly
>> writes a large file onto the filesystem while, one by one each disk is
>> disconnected then reconnected in a loop.
>> It is fairly trivial to trigger data loss when devices are bounced like this.
>
> Yes, it's quite a torture test. I'd expect this would be a problem for
> Btrfs until this feature is done at least:
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Take_device_with_heavy_IO_errors_offline_or_mark_as_.22unreliable.22
>
> And maybe this one too
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation
>
> Already we know that Btrfs tries to write indefinitely to missing
> devices.

Another question is how it handles writes when the mirror set becomes
degraded that way. I would expect it to:
* immediately emergency-destage any dirty data in the write cache to
the surviving member disks.
* switch any future I/O to that mirror set to ordered, synchronous
writes to the surviving members.

> If it reappears, what gets written? Will that device be
> consistent? And then another one goes missing, comes back, now
> possibly two devices with totally different states for identical
> generations. It's a mess. We know that trivially causes major
> corruption with btrfs raid1 if a user mounts e.g. devid1 rw,degraded
> modifies that; then mounts devid2 (only) rw,degraded and modifies it;
> and then mounts both devids together. Kablewy. Big mess. And that's
> umounting each one in between those steps; not even the abrupt
> disconnect/reconnect.
>
>
> --
> Chris Murphy
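The split-brain recipe Chris describes can be spelled out explicitly.
Device names are assumed, and the sequence is printed rather than run,
because actually executing it against a btrfs raid1 really will wreck
the filesystem:

```shell
#!/bin/sh
# The corruption sequence spelled out. DEV1/DEV2/MNT are assumed names.
# Printed only -- do NOT run this against data you care about.
DEV1=/dev/sdb1
DEV2=/dev/sdc1
MNT=/mnt/test

PLAN=$(cat <<EOF
mount -o rw,degraded $DEV1 $MNT   # mount first member alone...
touch $MNT/a; umount $MNT         # ...and modify it
mount -o rw,degraded $DEV2 $MNT   # mount second member alone...
touch $MNT/b; umount $MNT         # ...diverging from the first
mount $DEV1 $MNT                  # both members together: two
                                  # incompatible states for the same
                                  # generations -> major corruption
EOF
)
printf '%s\n' "$PLAN"
```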


Re: btrfs scrub failing

2015-12-31 Thread Duncan
John Center posted on Thu, 31 Dec 2015 11:20:28 -0500 as excerpted:

> I run a weekly scrub, using Marc Merlin's btrfs-scrub script.
> Usually, it completes without a problem, but this week it failed.  I ran
> the scrub manually & it stops shortly:
> 
> john@mariposa:~$ sudo /sbin/btrfs scrub start -BdR /dev/md124p2
> ERROR: scrubbing /dev/md124p2 failed for device id 1:
> ret=-1, errno=5 (Input/output error)
> scrub device /dev/md124p2 (id 1) canceled
> scrub started at Thu Dec 31 00:26:34 2015
> and was aborted after 00:01:29 [...]

> My Ubuntu 14.04 workstation is using the 4.2 kernel (Wily).
> I'm using btrfs-tools v4.3.1. [...]

A couple of months ago, which would have made it around the 4.2 kernel 
you're running (with 4.3 being current and 4.4 nearly out), there were a 
number of similar scrub-abort reports on the list.

I don't recall seeing any directly related patches, but the reports died 
down. Whether that's because everybody having the problem had already 
reported it, or because a newer kernel fixed it, I'm not sure, as I 
never had the problem myself[1].

So I'd suggest upgrading to either the current 4.3 kernel or the latest 
4.4-rc, and hopefully the problem will be gone.  If I'd had the problem 
myself I could tell you for sure whether it went away for me with 4.3, 
but as I didn't...

In any event, the 4.2 kernel wasn't a long-term-support kernel anyway; 
4.1 was, and 4.4 is scheduled to be.  So upstream 4.2 updates should be 
ended or coming to an end in any case, and an upgrade (or possibly a 
downgrade to the LTS 4.1) is recommended regardless.  Of course Ubuntu 
could be doing its own support, independent of upstream, but then you'd 
need to get support from them, as upstream won't be tracking the patches 
they've backported and thus can't provide proper support.

---
[1]  All my btrfs are relatively small, under 50 GiB per device, and on 
SSD, so scrubs normally complete in under a minute.  Your report of 
scrub aborting after a minute and a half is typical of those earlier 
reports, so it's likely my scrubs were simply done before whatever the 
problem was could trigger.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
