Re: Battling an issue with btrfs quota

2017-01-30 Thread Philipp Kern
On 01/31/2017 01:15 AM, Philipp Kern wrote:
[...]
>>   149 00 RW   [btrfs-transacti]
> 
> So there's always a running btrfs-transaction. The kernel messages start
> off like this:
[...]

As it turns out, it also OOMs the entire machine after about 2h while
consuming the available 4 GB RAM (w/o swap):

> [rootfs ]# [ 6347.417391] Out of memory: Kill process 228 (sh) score 0 or 
> sacrifice child
> [ 6347.450652] Killed process 228 (sh) total-vm:6724kB, anon-rss:112kB, 
> file-rss:1624kB, shmem-rss:0kB
> [ 6347.500015] Kernel panic - not syncing: Out of memory and no killable 
> processes...
> [ 6347.500015] 
> [ 6347.544684] CPU: 0 PID: 149 Comm: btrfs-transacti Not tainted 4.9.6-1-ARCH 
> #1
> [ 6347.580157] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 
> 07/16/2015
> [ 6347.614874]  c9d77738 81305440 c9d77800 
> 81930520
> [ 6347.651089]  c9d777c0 8117eed5 0008 
> c9d777d0
> [ 6347.687037]  c9d77768 09689dbd 0003 
> 8801032adcc0
> [ 6347.723589] Call Trace:
> [ 6347.735614]  [] dump_stack+0x63/0x83
> [ 6347.760845]  [] panic+0xe4/0x22d
> [ 6347.784187]  [] out_of_memory+0x333/0x480
> [ 6347.811209]  [] __alloc_pages_nodemask+0xda6/0xe80
> [ 6347.842223]  [] alloc_pages_current+0x95/0x140
> [ 6347.871859]  [] __page_cache_alloc+0xc4/0xe0
> [ 6347.900582]  [] pagecache_get_page+0xe7/0x290
> [ 6347.929238]  [] alloc_extent_buffer+0x113/0x480 [btrfs]
> [ 6347.962291]  [] read_tree_block+0x20/0x60 [btrfs]
> [ 6347.992642]  [] 
> read_block_for_search.isra.16+0x138/0x300 [btrfs]
> [ 6348.030866]  [] btrfs_search_slot+0x3be/0x9b0 [btrfs]
> [ 6348.063286]  [] find_parent_nodes+0x116/0x14b0 [btrfs]
> [ 6348.096113]  [] __btrfs_find_all_roots+0xbe/0x130 [btrfs]
> [ 6348.131093]  [] btrfs_find_all_roots+0x55/0x70 [btrfs]
> [ 6348.164002]  [] 
> btrfs_qgroup_prepare_account_extents+0x58/0xa0 [btrfs]
> [ 6348.203555]  [] 
> btrfs_commit_transaction.part.12+0x3e4/0xa90 [btrfs]
> [ 6348.241927]  [] ? wake_atomic_t_function+0x60/0x60
> [ 6348.273073]  [] btrfs_commit_transaction+0x3b/0x70 
> [btrfs]
> [ 6348.307196]  [] transaction_kthread+0x1ab/0x1e0 [btrfs]
> [ 6348.340041]  [] ? btrfs_cleanup_transaction+0x570/0x570 
> [btrfs]
> [ 6348.376634]  [] kthread+0xd9/0xf0
> [ 6348.400423]  [] ? __switch_to+0x2d2/0x630
> [ 6348.428162]  [] ? kthread_park+0x60/0x60
> [ 6348.455055]  [] ret_from_fork+0x25/0x30
> [ 6348.481684] Kernel Offset: disabled
> [ 6348.505391] ---[ end Kernel panic - not syncing: Out of memory and no 
> killable processes...
> [ 6348.505391] 
> Killed

Kind regards
Philipp Kern





Re: btrfs recovery

2017-01-30 Thread Duncan
Oliver Freyermuth posted on Sat, 28 Jan 2017 17:46:24 +0100 as excerpted:

>> Just don't count on restore to save your *** and always treat what it
>> can often bring to current as a pleasant surprise, and having it fail
>> won't be a down side, while having it work, if it does, will always be
>> up side.
>> =:^)
>> 
> I'll keep that in mind, and I think that in the future, before trying
> any "btrfs check" (or even repair)
> I will always try restore first if my backup was not fresh enough :-).

That's a wise idea, as long as you have the resources to actually be able 
to write the files somewhere (as people running btrfs really should, 
because it's /not/ fully stable yet).

One of the great things about restore is that all the writing it does is 
to the destination filesystem -- it doesn't attempt to actually write or 
repair anything on the filesystem it's trying to restore /from/, so it's 
far lower risk than anything that /does/ actually attempt to write to or 
repair the potentially damaged filesystem.

That makes it /extremely/ useful as a "first, to the extent possible, 
make sure the backups are safely freshened" tool. =:^)
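In command form, that "freshen the backups first" workflow looks roughly like this sketch (the device and destination paths are hypothetical examples; -D is restore's dry-run flag, which lists what it would recover without writing anything):

```shell
# Dry-run first: see what btrfs restore thinks it can recover,
# without writing a single byte (device and paths are examples).
btrfs restore -D -v /dev/sdb1 /tmp

# Then the real run, writing only to a healthy destination filesystem;
# -i ignores errors on individual files instead of aborting outright.
mkdir -p /mnt/backup/rescued
btrfs restore -v -i /dev/sdb1 /mnt/backup/rescued
```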


Meanwhile, FWIW, restore can also be used as a sort of undelete tool.  
Remember, btrfs is COW and writes any changes to a new location.  The old 
location tends to stick around, no longer referenced by anything 
"live", but still there until some other change happens to overwrite it.
 
Just like undelete on a more conventional filesystem, therefore, as long 
as you notice the problem before the old location has been overwritten 
again, it's often possible to recover it, altho the mechanisms involved 
are rather different on btrfs.  Basically, you use btrfs-find-root to get 
a list of old roots, then point restore at them using the -t option.  
There's a page on the wiki that goes into some detail in a more desperate 
"restore anything" context, but here, once you found a root that looked 
promising, you'd use restore's regex option to restore /just/ the file 
you're interested in, as it existed at the time that root was written.
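Concretely, the sequence sketched above looks something like this (the device, root bytenr, and filename are all hypothetical; -t points restore at an older tree root reported by btrfs-find-root, and --path-regex narrows the restore to the one file, using the nested-alternation form the wiki describes):

```shell
# 1. List candidate old tree roots on the (unmounted!) filesystem.
btrfs-find-root /dev/sdb1

# 2. Probe a promising root bytenr with a dry run (-D writes nothing),
#    restricting the search to the deleted file via a path regex.
btrfs restore -D -t 531628032 \
    --path-regex '^/(|home(|/user(|/precious\.txt)))$' \
    /dev/sdb1 /tmp

# 3. If the dry run lists the file, repeat without -D, restoring
#    to a different, healthy filesystem.
btrfs restore -t 531628032 \
    --path-regex '^/(|home(|/user(|/precious\.txt)))$' \
    /dev/sdb1 /mnt/rescue
```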

There's actually a btrfs-undelete script on github that turns the 
otherwise multiple manual steps into a nice, smooth, undelete operation.  
Or at least it's supposed to.  I've never actually used it, tho I have 
examined the script out of curiosity to see what it did and how, and it /
looks/ like it should work.  I've kept that trick (and knowledge of where 
to look for the script) filed away in the back of my head in case I need 
it someday. =:^)


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs recovery

2017-01-30 Thread Duncan
Michael Born posted on Mon, 30 Jan 2017 22:07:00 +0100 as excerpted:

> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born 
>> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>> 
>> There's still another day :-D
>> 
>> 
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with
>>> dd to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>> 
>> That won't ever work for any file system. It must be unmounted.
> 
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.

There's another angle with btrfs that makes block device image copies 
such as that a big problem, even if the dd was done with the filesystem 
unmounted.  This hasn't yet been mentioned in this thread, that I've 
seen, anyway.

* Btrfs takes the filesystem UUID (universally unique ID) at face value, 
considering it *UNIQUE* and actually identifying the various components
of a possibly multi-device filesystem by the UUID.  Again, this is 
because btrfs, unlike normal filesystems, can be composed of multiple 
devices, so btrfs needs a way to detect what devices form parts of what 
filesystems, and it does this by tracking the UUID and considering 
anything with that UUID (which is supposed to be unique to that 
filesystem, remember, it actually _says_ "unique" in the label, after 
all) to be part of that filesystem.

Now you dd the block device somewhere else, making another copy, and 
btrfs suddenly has more devices that have UUIDs saying they belong to 
this filesystem than it should!

That has actually triggered corruption in some cases, because btrfs gets 
mixed up and writes changes to the wrong device, because after all, it 
*HAS* to be part of the same filesystem, because it has the same 
universally *unique* ID.

Only the supposedly "unique" ID isn't so "unique" any more, because 
someone copied the block device, and now there's two copies of the 
filesystem claiming to be the same one!  "Unique" is no longer "unique" 
and it has created the all too predictable problems as a result.


There are ways to work around the problem.  Basically, don't let btrfs 
see both copies at the same time, and *definitely* don't let it see both 
copies when one is mounted or an attempt is being made to mount it.

(Btrfs "sees" a new device when btrfs device scan is run.  Unfortunately 
for this case, udev tends to run btrfs device scan automatically whenever 
it detects a new device that seems to have btrfs on it.  So it can be 
rather difficult to keep btrfs from seeing it, because udev tends to 
monitor for new devices and see it right away, and tell btrfs about it 
when it does.  But it's possible to avoid damage if you're careful to 
only dd the unmounted filesystem device(s) and to safely hide one copy 
before attempting to mount the other.)
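As a sketch of the safe handling (device names hypothetical): either keep the copy physically detached until needed, or give the copy a fresh UUID before udev and btrfs ever see both at once. Sufficiently new btrfs-progs can rewrite the filesystem UUID with btrfstune -u; whether your progs version has it is an assumption to verify first:

```shell
# Never let original and copy be attached at the same time.  If both
# must be visible, randomize the copy's filesystem UUID first
# (btrfstune -u, on the *unmounted* copy only; it asks to confirm):
btrfstune -u /dev/sdc1

# Only then can the copy be safely mounted alongside the original:
mount -o ro /dev/sdc1 /mnt/image-copy
```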


Of course that wasn't the case here.  With the dd of a live-mounted btrfs 
device, it's quite possible that btrfs detected and started writing to 
the dd-destination device instead of the original at some point, screwing 
things up even more than they would have been for a normal filesystem 
live-mounted dd.

In turn, it's quite possible that's why the old xfs /home still mounted 
but the btrfs / didn't: the xfs, while potentially damaged a bit, didn't 
suffer the abuse of writes to the wrong device that btrfs may well have 
suffered, due to the non-uniqueness of the supposedly universally 
unique IDs and the very confused btrfs that may well have caused.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: File system is oddly full after kernel upgrade, balance doesn't help

2017-01-30 Thread Duncan
MegaBrutal posted on Sat, 28 Jan 2017 19:15:01 +0100 as excerpted:

> Of course I can't retrieve the data from before the balance, but here is
> the data from now:

FWIW, if it's available, btrfs fi usage tends to yield the richest 
information.  But it's also a (relatively) new addition to the btrfs-
tools suite, and the results of btrfs fi show combined with btrfs fi df 
are the older version, together displaying the same critical information, 
tho without quite as much multi-device information.  Meanwhile, both 
btrfs fi usage and btrfs fi df require a mounted btrfs, so when it won't 
mount, btrfs fi show is about the best that can be done, at least staying 
within the normal admin-user targeted commands (there's developer 
diagnostics targeted commands, but I'm not a dev, just a btrfs list 
regular and btrfs user myself, and to date have left those commands for 
the devs to play with).
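In command form, the two generations of reporting tools described above (the mount point is a hypothetical example):

```shell
# Newer, richest single view (needs the filesystem mounted):
btrfs filesystem usage /mnt/data

# Older pair, together showing roughly the same critical numbers:
btrfs filesystem show            # works even when nothing will mount
btrfs filesystem df /mnt/data    # needs a mounted filesystem
```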

But since usage is available, that's all I'm quoting, here:

> root@vmhost:~# btrfs fi usage /tmp/mnt/curlybrace
> Overall:
> Device size:  2.00GiB
> Device allocated: 1.90GiB
> Device unallocated: 103.38MiB
> Device missing: 0.00B
> Used:   789.94MiB
> Free (estimated):   162.18MiB(min: 110.50MiB)
> Data ratio:  1.00
> Metadata ratio:  2.00
> Global reserve: 512.00MiB(used: 0.00B)
> 
> Data,single: Size:773.62MiB, Used:714.82MiB
>/dev/mapper/vmdata--vg-lxc--curlybrace 773.62MiB
> 
> Metadata,DUP: Size:577.50MiB, Used:37.55MiB
>/dev/mapper/vmdata--vg-lxc--curlybrace   1.13GiB
> 
> System,DUP: Size:8.00MiB, Used:16.00KiB
>/dev/mapper/vmdata--vg-lxc--curlybrace  16.00MiB
> 
> Unallocated:
>/dev/mapper/vmdata--vg-lxc--curlybrace 103.38MiB
> 
> 
> So... if I sum the data, metadata, and the global reserve, I see why
> only ~170 MB is left. I have no idea, however, why the global reserve
> sneaked up to 512 MB for such a small file system, and how could I
> resolve this situation. Any ideas?

That's an interesting issue I've not seen before, tho my experience is 
relatively limited compared to say Chris (Murphy)'s or Hugo's, as other 
than my own systems, my experience is limited to the list, while they do 
the IRC channels, etc.

I've no idea how to resolve it, unless per some chance balance removes 
excess global reserve as well (I simply don't know, it has never come up 
that I've seen before).

But IIRC one of the devs (or possibly Hugo) mentioned something about 
global reserve being dynamic, based on... something, IDR what.  Given my 
far lower global reserve on multiple relatively small btrfs and the fact 
that my own use-case doesn't use subvolumes or snapshots, if yours does 
and you have quite a few, that /might/ be the explanation.

FWIW, while I tend to use rather small btrfs as well, in my case they're 
nearly all btrfs dual-device raid1.  However, a usage comparison based on 
my closest sized filesystem can still be useful, particularly the global 
reserve.  Here's my /, as you can see, 8 GiB per device raid1, so one 
copy (comparable to single mode if it were a single device, no dup mode 
metadata as it's a copy on each device) on each:

# btrfs fi u /
Overall:
Device size:  16.00GiB
Device allocated:  7.06GiB
Device unallocated:8.94GiB
Device missing:  0.00B
Used:  4.38GiB
Free (estimated):  5.51GiB  (min: 5.51GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID1: Size:3.00GiB, Used:1.96GiB
   /dev/sda5   3.00GiB
   /dev/sdb5   3.00GiB

Metadata,RAID1: Size:512.00MiB, Used:232.77MiB
   /dev/sda5 512.00MiB
   /dev/sdb5 512.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda5  32.00MiB
   /dev/sdb5  32.00MiB

Unallocated:
   /dev/sda5   4.47GiB
   /dev/sdb5   4.47GiB


It is worth noting that global reserve actually comes from metadata.  
That's why metadata never reports fully used, because global reserve 
isn't included in the used count, but can't normally be used for normal 
metadata.

Also note that under normal conditions, global reserve is always 0 used 
as btrfs is quite reluctant to use it for routine metadata storage, and 
will normally only use it for getting out of COW-based jams due to the 
fact that because of COW, even deleting something means temporarily 
allocating additional space to write the new metadata, without the 
deleted stuff, into.  Normally, btrfs will only write to global reserve 
if metadata space is all used and it thinks that by doing so it can end 
up actually freeing space.  In normal operations it will simply see the 
lack of regular metadata space available and will error out, without 
using the global reserve.

So if at any time btrfs reports 

Re: [PATCH] Btrfs: bulk delete checksum items in the same leaf

2017-01-30 Thread Omar Sandoval
On Mon, Jan 30, 2017 at 05:02:45PM -0800, Omar Sandoval wrote:
> On Sat, Jan 28, 2017 at 06:06:32AM +, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> > 
> > Very often we have the checksums for an extent spread in multiple items
> > in the checksums tree, and currently the algorithm to delete them starts
> > by looking for them one by one and then deleting them one by one, which
> > is not optimal since each deletion involves shifting all the other items
> > in the leaf and when the leaf reaches some low threshold, to move items
> > off the leaf into its left and right neighbor leafs. Also, after each
> > item deletion we release our search path and start a new search for other
> > checksums items.
> > 
> > So optimize this by deleting in bulk all the items in the same leaf that
> > contain checksums for the extent being freed.
> > 
> > Signed-off-by: Filipe Manana 
> > ---
> >  fs/btrfs/file-item.c | 28 +++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index e97e322..d7d6d4a 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -643,7 +643,33 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
> >  
> > /* delete the entire item, it is inside our range */
> > if (key.offset >= bytenr && csum_end <= end_byte) {
> > -   ret = btrfs_del_item(trans, root, path);
> > +   int del_nr = 1;
> > +
> > +   /*
> > +* Check how many csum items preceding this one in this
> > +* leaf correspond to our range and then delete them all
> > +* at once.
> > +*/
> > +   if (key.offset > bytenr && path->slots[0] > 0) {
> > +   int slot = path->slots[0] - 1;
> > +
> > +   while (slot >= 0) {
> > +   struct btrfs_key pk;
> > +
> > +   btrfs_item_key_to_cpu(leaf, , slot);
> > +   if (pk.offset < bytenr ||
> > +   pk.type != BTRFS_EXTENT_CSUM_KEY ||
> > +   pk.objectid !=
> > +   BTRFS_EXTENT_CSUM_OBJECTID)
> > +   break;
> > +   path->slots[0] = slot;
> > +   del_nr++;
> > +   key.offset = pk.offset;
> > +   slot--;
> > +   }
> > +   }
> > +   ret = btrfs_del_items(trans, root, path,
> > + path->slots[0], del_nr);
> > if (ret)
> > goto out;
> > if (key.offset == bytenr)
> 
> Hmm, this seems like the kind of operation that could use a helper.
> btrfs_del_item_range() or something like that, which takes the maximum
> key to delete. What do you think?

Err, or in this case, the minimum key.


Re: [PATCH] Btrfs: bulk delete checksum items in the same leaf

2017-01-30 Thread Omar Sandoval
On Sat, Jan 28, 2017 at 06:06:32AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Very often we have the checksums for an extent spread in multiple items
> in the checksums tree, and currently the algorithm to delete them starts
> by looking for them one by one and then deleting them one by one, which
> is not optimal since each deletion involves shifting all the other items
> in the leaf and when the leaf reaches some low threshold, to move items
> off the leaf into its left and right neighbor leafs. Also, after each
> item deletion we release our search path and start a new search for other
> checksums items.
> 
> So optimize this by deleting in bulk all the items in the same leaf that
> contain checksums for the extent being freed.
> 
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/file-item.c | 28 +++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index e97e322..d7d6d4a 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -643,7 +643,33 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>  
>   /* delete the entire item, it is inside our range */
>   if (key.offset >= bytenr && csum_end <= end_byte) {
> - ret = btrfs_del_item(trans, root, path);
> + int del_nr = 1;
> +
> + /*
> +  * Check how many csum items preceding this one in this
> +  * leaf correspond to our range and then delete them all
> +  * at once.
> +  */
> + if (key.offset > bytenr && path->slots[0] > 0) {
> + int slot = path->slots[0] - 1;
> +
> + while (slot >= 0) {
> + struct btrfs_key pk;
> +
> + btrfs_item_key_to_cpu(leaf, , slot);
> + if (pk.offset < bytenr ||
> + pk.type != BTRFS_EXTENT_CSUM_KEY ||
> + pk.objectid !=
> + BTRFS_EXTENT_CSUM_OBJECTID)
> + break;
> + path->slots[0] = slot;
> + del_nr++;
> + key.offset = pk.offset;
> + slot--;
> + }
> + }
> + ret = btrfs_del_items(trans, root, path,
> +   path->slots[0], del_nr);
>   if (ret)
>   goto out;
>   if (key.offset == bytenr)

Hmm, this seems like the kind of operation that could use a helper.
btrfs_del_item_range() or something like that, which takes the maximum
key to delete. What do you think?


Re: btrfs recovery

2017-01-30 Thread GWB
Hello, Michael,

Yes, you would certainly run the risk of doing more damage with dd, so
if you have an alternative, use that, and avoid dd.  If nothing else
works and you need the files, you might try it as a last resort.

My guess (and it is only a guess) is that if the image is close to the
same size as the root partition, the file data is there.  But that
doesn't do you much good if btrfs cannot read the "container" or the
specific partition and file system information, which btrfs send
provides.

Does someone on the list know if ext3/4 data recovery tools can also
search btrfs data?  That's another option.
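A lower-risk alternative to dd-ing the image back onto a partition is to attach it as a read-only loop device and try mounting it in place (paths hypothetical; -P scans for a partition table inside the image if it is a whole-disk dump):

```shell
# Attach the image file as a read-only loop device.
losetup -f --show -r -P /path/to/imagefile.dd   # prints e.g. /dev/loop0

# Try a read-only mount of the btrfs partition inside it.
mount -o ro /dev/loop0p1 /mnt/image

# If that fails, btrfs restore can read from the loop device too
# (-D dry run lists what it would recover without writing anything).
btrfs restore -D /dev/loop0p1 /tmp
```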

Gordon

On Mon, Jan 30, 2017 at 4:37 PM, Michael Born  wrote:
> Hi Gordon,
>
> I'm quite sure this is not a good idea.
> I do understand that dd-ing a running system will miss some changes
> made to the file system while copying. I'm surprised that I didn't end
> up with some corrupted files, but with no files at all.
> Also, I'm not interested in restoring the old Suse 13.2 system. I just
> want some configuration files from it.
>
> Cheers,
> Michael
>
> Am 30.01.2017 um 23:24 schrieb GWB:
>> <<
>> Hi btrfs experts.
>>
>> Hereby I apply for the stupidity of the month award.

>>
>> I have no doubt that I will mount a serious challenge to you for
>> that title, so you haven't won yet.
>>
>> Why not dd the image back onto the original partition (or another
>> partition identical in size) and see if that is readable?
>>
>> My limited experience with btrfs (I am not an expert) is that read
>> only snapshots work well in this situation, but the initial hurdle is
>> using dd to get the image back onto a partition.  So I wonder if you
>> could dd the image back onto the original media (the hd sdd), then
>> make a read only snapshot, and then send the snapshot with btrfs send
>> to another storage medium.  With any luck, the machine might boot, and
>> you might find other snapshots which you may be able to turn into read
>> only snaps for btrfs send.
>>
>> This has worked for me on Ubuntu 14 for quite some time, but luckily I
>> have not had to restore the image file sent from btrfs send yet.  I
>> say luckily, because I realise now that the image created from btrfs
>> send should be tested, but so far no catastrophic failures with my
>> root partition have occurred (knock on wood).
>>
>> dd is (like dumpfs, ddrescue, and the bsd variations) good for what it
>> tries to do, but not so great on for some file systems for more
>> intricate uses.  But why not try:
>>
>> dd if=imagefile.dd of=/dev/sdaX
>>
>> and see if it boots?  If it does not, then perhaps you have another
>> shot at the one time mount for btrfs rw if that works.
>>
>> Or is your root partition now running fine under Suse 42.2, and you
>> are just looking to recover a few files from the image?  If so, you
>> might try to dd from the image to a partition of original size as the
>> previous root, then adjust with gparted or fpart, and see if it is
>> readable.
>>
>> So instead of trying to restore a btrfs file structure, why not just
>> restore a partition with dd that happens to contain a btrfs file
>> structure, and then adjust the partition size to match the original?
>> btrfs cares about the tree structures, etc.  dd does not.
>>
>> What you did is not unusual, and can work fine with a number of file
>> structures, but the potential for disaster with dd is also great.  The
>> only thing I know of in btrfs that does a similar thing is:
>>
>> btrfs send -f btrfs-send-image-file /mount/read-write-snapshot
>>
>> Chances are, of course, good that without having current backups dd
>> could potentially ruin the rest of your file system set up, so maybe
>> transfer the image over to another machine that is expendable and test
>> this out.  I use btrfs on root and zfs for data, and make lots of
>> snapshots and send them to incremental backups (mostly zfs, but btrfs
>> works nicely with Ubuntu on root, with the occasional balance
>> problem).
>>
>> If dd did it, dd might be able to fix it.  Do that first before you
>> try to restore btrfs file structures.
>>
>> Or is this a terrible idea?  Someone else on the list should say so if
>> they know otherwise.
>>
>> Gordon
>


Battling an issue with btrfs quota

2017-01-30 Thread Philipp Kern
Hi,

my btrfs-based system (~2.5 TiB stored in the filesystem, replicated
onto two disks, running kernel 4.9.6-1-ARCH) locked up after I enabled
quotas and had a btrfs-size tool running. Now the question is how to
recover from that. Whenever I mount the filesystem I end up with
btrfs-cleaner and a kworker hanging:

> [  491.154603] INFO: task kworker/u128:3:105 blocked for more than 120 
> seconds.
> [  491.188559]   Not tainted 4.9.6-1-ARCH #1
> [  491.209443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  491.247188] kworker/u128:3  D0   105  2 0x
> [  491.247208] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper 
> [btrfs]
> [  491.247210]  880103bc8800  8801034ba7c0 
> 8801062580c0
> [  491.247213]  880105fe8d40 c9c63c30 81605cdf 
> 8801034ba7c0
> [  491.247215]  0001 8801062580c0 810aa490 
> 8801034ba7c0
> [  491.247217] Call Trace:
> [  491.247222]  [] ? __schedule+0x22f/0x6e0
> [  491.247224]  [] ? wake_up_q+0x80/0x80
> [  491.247226]  [] schedule+0x3d/0x90
> [  491.247237]  [] wait_current_trans.isra.8+0xbe/0x110 
> [btrfs]
> [  491.247240]  [] ? wake_atomic_t_function+0x60/0x60
> [  491.247249]  [] start_transaction+0x2bc/0x4a0 [btrfs]
> [  491.247258]  [] btrfs_start_transaction+0x18/0x20 [btrfs]
> [  491.247267]  [] btrfs_qgroup_rescan_worker+0x7a/0x610 
> [btrfs]
> [  491.247278]  [] btrfs_scrubparity_helper+0x7d/0x350 
> [btrfs]
> [  491.247288]  [] btrfs_qgroup_rescan_helper+0xe/0x10 
> [btrfs]
> [  491.247291]  [] process_one_work+0x1e5/0x470
> [  491.247292]  [] worker_thread+0x48/0x4e0
> [  491.247294]  [] ? process_one_work+0x470/0x470
> [  491.247296]  [] kthread+0xd9/0xf0
> [  491.247298]  [] ? __switch_to+0x2d2/0x630
> [  491.247299]  [] ? kthread_park+0x60/0x60
> [  491.247301]  [] ret_from_fork+0x25/0x30
> [  491.247306] INFO: task btrfs-cleaner:148 blocked for more than 120 seconds.
> [  491.280723]   Not tainted 4.9.6-1-ARCH #1
> [  491.302026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  491.340471] btrfs-cleaner   D0   148  2 0x
> [  491.340475]  880103bc8800  8801032acf80 
> 8801062580c0
> [  491.340478]  8801032a8d40 c9cc3cf0 81605cdf 
> 8801032acf80
> [  491.340480]  0001 8801062580c0 810aa490 
> 8801032acf80
> [  491.340482] Call Trace:
> [  491.340487]  [] ? __schedule+0x22f/0x6e0
> [  491.340489]  [] ? wake_up_q+0x80/0x80
> [  491.340491]  [] schedule+0x3d/0x90
> [  491.340505]  [] wait_current_trans.isra.8+0xbe/0x110 
> [btrfs]
> [  491.340508]  [] ? wake_atomic_t_function+0x60/0x60
> [  491.340517]  [] start_transaction+0x2bc/0x4a0 [btrfs]
> [  491.340525]  [] btrfs_start_transaction+0x18/0x20 [btrfs]
> [  491.340534]  [] btrfs_drop_snapshot+0x4e9/0x880 [btrfs]
> [  491.340542]  [] 
> btrfs_clean_one_deleted_snapshot+0xbb/0x110 [btrfs]
> [  491.340552]  [] cleaner_kthread+0x141/0x1b0 [btrfs]
> [  491.340560]  [] ? 
> btrfs_destroy_pinned_extent+0x120/0x120 [btrfs]
> [  491.340562]  [] kthread+0xd9/0xf0
> [  491.340564]  [] ? __switch_to+0x2d2/0x630
> [  491.340565]  [] ? kthread_park+0x60/0x60
> [  491.340566]  [] ret_from_fork+0x25/0x30

Unfortunately whenever I try to execute a btrfs command against the
mounted filesystem -- e.g. to disable quota -- the command hangs. And
unfortunately that's in a shell without job control over a serial console.

Relevant output from ps:

>   105 00 DW   [kworker/u128:3]
>   107 00 SW   [kworker/u128:5]
>   111 00 SW<  [bioset]
>   112 00 SW<  [bioset]
>   113 00 SW<  [bioset]
>   115 00 SW   [kworker/1:2]
>   117 00 SW<  [kworker/0:1H]
>   118 00 SW<  [kworker/1:1H]
>   122 00 SW<  [bioset]
>   123 0 6724 Ssh -i
>   128 00 SW<  [btrfs-worker]
>   129 00 SW<  [kworker/u129:0]
>   130 00 SW<  [btrfs-worker-hi]
>   131 00 SW<  [btrfs-delalloc]
>   132 00 SW<  [btrfs-flush_del]
>   133 00 SW<  [btrfs-cache]
>   134 00 SW<  [btrfs-submit]
>   135 00 SW<  [btrfs-fixup]
>   136 00 SW<  [btrfs-endio]
>   137 00 SW<  [btrfs-endio-met]
>   138 00 SW<  [btrfs-endio-met]
>   139 00 SW<  [btrfs-endio-rai]
>   140 00 SW<  [btrfs-endio-rep]
>   141 00 SW<  [btrfs-rmw]
>   142 00 SW<  [btrfs-endio-wri]
>   143 00 SW<  [btrfs-freespace]
>   144 00 SW<  [btrfs-delayed-m]
>   145 00 SW<  [btrfs-readahead]
>   146 00 SW<  [btrfs-qgroup-re]
>   147 00 SW<  [btrfs-extent-re]
>   148 00 DW   [btrfs-cleaner]
>   149 00 RW   [btrfs-transacti]

So there's always a running btrfs-transaction. The kernel 

Re: btrfs recovery

2017-01-30 Thread Michael Born
Hi Gordon,

I'm quite sure this is not a good idea.
I do understand that dd-ing a running system will miss some changes
made to the file system while copying. I'm surprised that I didn't end
up with some corrupted files, but with no files at all.
Also, I'm not interested in restoring the old Suse 13.2 system. I just
want some configuration files from it.

Cheers,
Michael

Am 30.01.2017 um 23:24 schrieb GWB:
> <<
> Hi btrfs experts.
> 
> Hereby I apply for the stupidity of the month award.
>>>
> 
> I have no doubt that I will mount a serious challenge to you for
> that title, so you haven't won yet.
> 
> Why not dd the image back onto the original partition (or another
> partition identical in size) and see if that is readable?
> 
> My limited experience with btrfs (I am not an expert) is that read
> only snapshots work well in this situation, but the initial hurdle is
> using dd to get the image back onto a partition.  So I wonder if you
> could dd the image back onto the original media (the hd sdd), then
> make a read only snapshot, and then send the snapshot with btrfs send
> to another storage medium.  With any luck, the machine might boot, and
> you might find other snapshots which you may be able to turn into read
> only snaps for btrfs send.
> 
> This has worked for me on Ubuntu 14 for quite some time, but luckily I
> have not had to restore the image file sent from btrfs send yet.  I
> say luckily, because I realise now that the image created from btrfs
> send should be tested, but so far no catastrophic failures with my
> root partition have occurred (knock on wood).
> 
> dd is (like dumpfs, ddrescue, and the bsd variations) good for what it
> tries to do, but not so great for some file systems for more
> intricate uses.  But why not try:
> 
> dd if=imagefile.dd of=/dev/sdaX
> 
> and see if it boots?  If it does not, then perhaps you have another
> shot at the one time mount for btrfs rw if that works.
> 
> Or is your root partition now running fine under Suse 42.2, and you
> are just looking to recover a few files from the image?  If so, you
> might try to dd from the image to a partition of original size as the
> previous root, then adjust with gparted or fpart, and see if it is
> readable.
> 
> So instead of trying to restore a btrfs file structure, why not just
> restore a partition with dd that happens to contain a btrfs file
> structure, and then adjust the partition size to match the original?
> btrfs cares about the tree structures, etc.  dd does not.
> 
> What you did is not unusual, and can work fine with a number of file
> structures, but the potential for disaster with dd is also great.  The
> only thing I know of in btrfs that does a similar thing is:
> 
> btrfs send -f btrfs-send-image-file /mount/read-write-snapshot
> 
> Chances are, of course, good that without having current backups dd
> could potentially ruin the rest of your file system set up, so maybe
> transfer the image over to another machine that is expendable and test
> this out.  I use btrfs on root and zfs for data, and make lots of
> snapshots and send them to incremental backups (mostly zfs, but btrfs
> works nicely with Ubuntu on root, with the occasional balance
> problem).
> 
> If dd did it, dd might be able to fix it.  Do that first before you
> try to restore btrfs file structures.
> 
> Or is this a terrible idea?  Someone else on the list should say so if
> they know otherwise.
> 
> Gordon

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs recovery

2017-01-30 Thread GWB
<<
Hi btrfs experts.

Hereby I apply for the stupidity of the month award.
>>

I have no doubt that I will mount a serious challenge to you for
that title, so you haven't won yet.

Why not dd the image back onto the original partition (or another
partition identical in size) and see if that is readable?

My limited experience with btrfs (I am not an expert) is that read
only snapshots work well in this situation, but the initial hurdle is
using dd to get the image back onto a partition.  So I wonder if you
could dd the image back onto the original media (the HD/SSD), then
make a read only snapshot, and then send the snapshot with btrfs send
to another storage medium.  With any luck, the machine might boot, and
you might find other snapshots which you may be able to turn into read
only snaps for btrfs send.
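As a sketch, the round trip suggested above might look like this (device
and path names are illustrative, not from the thread; try it only on an
expendable disk):

```shell
# Hypothetical sketch: write the image to a spare partition at least as
# large as the original root, then try a read-only snapshot and a send.
dd if=imagefile.dd of=/dev/sdX2 bs=4M conv=fsync status=progress
mount /dev/sdX2 /mnt/restore    # may still fail if the image is inconsistent
btrfs subvolume snapshot -r /mnt/restore /mnt/restore/snap-ro
btrfs send /mnt/restore/snap-ro > /backup/root-snap.btrfs
```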

This has worked for me on Ubuntu 14 for quite some time, but luckily I
have not had to restore the image file sent from btrfs send yet.  I
say luckily, because I realise now that the image created from btrfs
send should be tested, but so far no catastrophic failures with my
root partition have occurred (knock on wood).

dd is (like dumpfs, ddrescue, and the BSD variations) good for what it
tries to do, but not so great for more intricate uses on some file
systems.  But why not try:

dd if=imagefile.dd of=/dev/sdaX

and see if it boots?  If it does not, then perhaps you have another
shot at the one time mount for btrfs rw if that works.

Or is your root partition now running fine under Suse 42.2, and you
are just looking to recover a few files from the image?  If so, you
might try to dd from the image to a partition of the same size as the
previous root, then adjust with gparted or fpart, and see if it is
readable.

So instead of trying to restore a btrfs file structure, why not just
restore a partition with dd that happens to contain a btrfs file
structure, and then adjust the partition size to match the original?
btrfs cares about the tree structures, etc.  dd does not.

What you did is not unusual, and can work fine with a number of file
structures, but the potential for disaster with dd is also great.  The
only thing I know of in btrfs that does a similar thing is:

btrfs send -f btrfs-send-image-file /mount/read-write-snapshot

Of course, chances are good that without current backups dd could
potentially ruin the rest of your file system setup, so maybe
transfer the image over to another machine that is expendable and test
this out.  I use btrfs on root and zfs for data, and make lots of
snapshots and send them to incremental backups (mostly zfs, but btrfs
works nicely with Ubuntu on root, with the occasional balance
problem).

If dd did it, dd might be able to fix it.  Do that first before you
try to restore btrfs file structures.

Or is this a terrible idea?  Someone else on the list should say so if
they know otherwise.

Gordon


On Mon, Jan 30, 2017 at 3:16 PM, Hans van Kranenburg
 wrote:
> On 01/30/2017 10:07 PM, Michael Born wrote:
>>
>>
>> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born  
>>> wrote:
 Hi btrfs experts.

 Hereby I apply for the stupidity of the month award.
>>>
>>> There's still another day :-D
>>>
>>>
>>>

 Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
 to an image file - while the system was online/running.
 Now, I can't mount the image.
>>>
>>> That won't ever work for any file system. It must be unmounted.
>>
>> I could mount and copy the data out of my /home image.dd (encrypted
>> xfs). That was also online while dd-ing it.
>>
 Could you give me some instructions how to repair the file system or
 extract some files from it?
>>>
>>> Not possible. The file system was being modified while dd was
>>> happening, so the image you've taken is inconsistent.
>>
>> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
>> change for months. Why would they change in the moment I copy their
>> blocks with dd?
>
> The metadata of btrfs is organized in a bunch of tree structures. The
> top of the trees (the smallest parts, trees are upside-down here /\ )
> and the superblock get modified quite often. Every time a tree gets
> modified, the new modified parts are written as a modified copy in
> unused space.
>
> So even if the files themselves do not change... if you miss those new
> writes which are being done in space that your dd already left behind...
> you end up with old and new parts of trees all over the place.
>
> In other words, a big puzzle with parts that do not connect with each
> other any more.
>
> And that's exactly what you see in all the errors. E.g. "parent transid
> verify failed on 32869482496 wanted 550112 found 550121" <- a part of a
> tree points to another part, but suddenly something else is found which
> should not be there. In this case wanted 550112 found 550121 

Re: btrfs recovery

2017-01-30 Thread Michael Born
Am 30.01.2017 um 22:20 schrieb Chris Murphy:
> On Mon, Jan 30, 2017 at 2:07 PM, Michael Born  wrote:
>> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
>> change for months. Why would they change in the moment I copy their
>> blocks with dd?
> 
> They didn't change. The file system changed. While dd is reading, it
> might be minutes between capturing different parts of the file system,
> and each superblock is in different locations on the disk,
> guaranteeing that if the dd takes more than 30 seconds, your dd image
> has different generation super blocks. Btrfs notices this at mount
> time and will refuse to mount because the file system is inconsistent.
> 
> It is certainly possible to fix this, but it's likely to be really,
> really tedious. The existing tools don't take this use case into
> account.
> 
> Maybe btrfs-find-root can come up with some suggestions and you can use
> btrfs restore -t with the bytenr from find root, to see if you can get
> this old data, ignoring the changes that don't affect the old data.
> 
> What you do with this is btrfs-find-root and see what it comes up
> with. And work with the most recent (highest) generation going
> backward, plugging in the bytenr into btrfs restore with -t option.
> You'll also want to use the dry run to see if you're getting what you
> want. It's best to use the exact path if you know it, this takes much
> less time for it to search all files in a given tree. If you don't
> know the exact path, but you know part of a file name, then you'll
> need to use the regex option; or just let it dump everything it can
> from the image and go dumpster diving...

I really want to try the "btrfs-find-root / btrfs restore -t" method.
But btrfs-find-root gives me just the three lines of output below and
then nothing for 16 hours.
I think I saw a similar report in the mailing list archive of the tool
just not reporting back ("btrfs-find-root duration?", Markus Binsteiner,
Sat, 10 Dec 2016 16:12:25 -0800).

./btrfs-find-root /dev/loop0
Couldn't read tree root
Superblock thinks the generation is 550114
Superblock thinks the level is 1

Hans, thank you as well for the explanation, even though I'm not sure I
understand it.
I would be happy with older parts of the tree, which would then have
lower numbers than the 550112.

Michael




Re: btrfs recovery

2017-01-30 Thread Chris Murphy
On Mon, Jan 30, 2017 at 2:20 PM, Chris Murphy <li...@colorremedies.com> wrote:

> What people do with huge databases, which have this same problem,
> they'll take a volume snapshot. This first commits everything in
> flight, freezes the fs so no more changes can happen, then takes a
> snapshot, then unfreezes the original so the database can stay online.
> The freeze takes maybe a second or maybe a bit longer depending on how
> much stuff needs to be committed to stable media. Then backup the
> snapshot as a read-only volume. Once the backup is done, delete the
> snapshot.

In Btrfs land, the way to do it is snapshot a subvolume, and then
rsync or 'btrfs send' the contents of the snapshot somewhere. I
actually often use this for whole volume backups:

## this will capture /boot and /boot/efi on separate file systems and
put the tar on Btrfs root.
cd /
tar -acf boot.tar.gz boot/

## My subvolumes are at the top level, fstab mounts them specifically,
so mount the top level to get access
sudo mount -o noatime  /mnt
## Take a snapshot of rootfs
sudo btrfs sub snap -r /mnt/root /mnt/root.20170130
## Send it to remote server
sudo btrfs send /mnt/root.20170130 | ssh chris@server "cat - >
~/root.20170130.btrfs"
## Restore it from server, assumes the subvolume/snapshot does not exist
ssh chris@server "cat ~/root.20170130.btrfs" | sudo
btrfs receive /mnt/

The same can be done with incremental images, but of course you need
all the files, named in a sane way, so you know in what order to
restore them, since those incrementals are parent/child specific.
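A minimal incremental chain under the same naming scheme might look like
this (a sketch; it assumes the full snapshot root.20170130 was already
sent as above and exists on both sides):

```shell
# take the next read-only snapshot
sudo btrfs sub snap -r /mnt/root /mnt/root.20170131
# send only the delta against the previous snapshot (-p = parent)
sudo btrfs send -p /mnt/root.20170130 /mnt/root.20170131 \
    | ssh chris@server "cat - > ~/root.20170131.btrfs"
# restore strictly in order: the full stream first, then each incremental
```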

The other thing this avoids, critically, is a form of Btrfs corruption
that can occur whenever two or more copies of the same volume (by UUID)
appear to the kernel at the same time and one of them is mounted. See
the gotchas page on block level copies.

https://btrfs.wiki.kernel.org/index.php/Gotchas

-- 
Chris Murphy


Re: btrfs recovery

2017-01-30 Thread Chris Murphy
On Mon, Jan 30, 2017 at 2:07 PM, Michael Born  wrote:
>
>
> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born  
>> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>>
>> There's still another day :-D
>>
>>
>>
>>>
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>>> to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>>
>> That won't ever work for any file system. It must be unmounted.
>
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.

If there are no substantial writes happening, it's possible it'll
behave like a power failure: read the journal and continue, possibly
with the most recent commits being lost. But any substantial amount of
writes means some part of the volume has changed while the update
reflecting that change is elsewhere, and meanwhile dd is capturing the
volume at different points in time rather than exactly as it is. It's
just not workable.

What people do with huge databases, which have this same problem, is
take a volume snapshot. This first commits everything in flight,
freezes the fs so no more changes can happen, then takes a snapshot,
then unfreezes the original so the database can stay online. The
freeze takes maybe a second, or a bit longer depending on how much
stuff needs to be committed to stable media. Then back up the snapshot
as a read-only volume. Once the backup is done, delete the snapshot.
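That freeze/snapshot/backup cycle can be sketched with fsfreeze and an
LVM snapshot (the volume and mount point names here are made up for
illustration; the exact tooling depends on the storage stack):

```shell
# quiesce the filesystem so everything in flight is committed
fsfreeze --freeze /srv/db
# take a block-level snapshot of the underlying volume
lvcreate --snapshot --name dbsnap --size 5G vg0/dbvol
# unfreeze immediately so the database stays online
fsfreeze --unfreeze /srv/db
# back up the frozen-in-time snapshot, then drop it
dd if=/dev/vg0/dbsnap of=/backup/db.img bs=4M conv=fsync
lvremove -y vg0/dbsnap
```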





>
>>> Could you give me some instructions how to repair the file system or
>>> extract some files from it?
>>
>> Not possible. The file system was being modified while dd was
>> happening, so the image you've taken is inconsistent.
>
> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
> change for months. Why would they change in the moment I copy their
> blocks with dd?

They didn't change. The file system changed. While dd is reading, it
might be minutes between capturing different parts of the file system,
and each superblock is in different locations on the disk,
guaranteeing that if the dd takes more than 30 seconds, your dd image
has different generation super blocks. Btrfs notices this at mount
time and will refuse to mount because the file system is inconsistent.

It is certainly possible to fix this, but it's likely to be really,
really tedious. The existing tools don't take this use case into
account.

Maybe btrfs-find-root can come up with some suggestions and you can use
btrfs restore -t with the bytenr from find root, to see if you can get
this old data, ignoring the changes that don't affect the old data.

What you do with this is btrfs-find-root and see what it comes up
with. And work with the most recent (highest) generation going
backward, plugging in the bytenr into btrfs restore with -t option.
You'll also want to use the dry run to see if you're getting what you
want. It's best to use the exact path if you know it, this takes much
less time for it to search all files in a given tree. If you don't
know the exact path, but you know part of a file name, then you'll
need to use the regex option; or just let it dump everything it can
from the image and go dumpster diving...
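Put together, that workflow might look like the following sketch (the
bytenr is a placeholder to fill in from the find-root output, and the
path regex is just an example for pulling out /etc/fstab):

```shell
# attach the image read-only and look for candidate tree roots
losetup --find --show --read-only imagefile.dd   # prints e.g. /dev/loop0
btrfs-find-root /dev/loop0
# dry run (-D) with a candidate bytenr, highest generation first
btrfs restore -t <bytenr> -D /dev/loop0 /dev/null
# then restore just the files of interest via --path-regex
btrfs restore -t <bytenr> --path-regex '^/(|etc(|/fstab))$' \
    /dev/loop0 /tmp/recovered
```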



-- 
Chris Murphy


Re: btrfs recovery

2017-01-30 Thread Hans van Kranenburg
On 01/30/2017 10:07 PM, Michael Born wrote:
> 
> 
> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born  
>> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>>
>> There's still another day :-D
>>
>>
>>
>>>
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>>> to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>>
>> That won't ever work for any file system. It must be unmounted.
> 
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.
> 
>>> Could you give me some instructions how to repair the file system or
>>> extract some files from it?
>>
>> Not possible. The file system was being modified while dd was
>> happening, so the image you've taken is inconsistent.
> 
> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
> change for months. Why would they change in the moment I copy their
> blocks with dd?

The metadata of btrfs is organized in a bunch of tree structures. The
top of the trees (the smallest parts, trees are upside-down here /\ )
and the superblock get modified quite often. Every time a tree gets
modified, the new modified parts are written as a modified copy in
unused space.

So even if the files themselves do not change... if you miss those new
writes which are being done in space that your dd already left behind...
you end up with old and new parts of trees all over the place.

In other words, a big puzzle with parts that do not connect with each
other any more.

And that's exactly what you see in all the errors. E.g. "parent transid
verify failed on 32869482496 wanted 550112 found 550121" <- a part of a
tree points to another part, but suddenly something else is found which
should not be there. In this case wanted 550112 found 550121 means it's
bumping into something "from the future". Whaaa..
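The generation skew can be seen directly on such an image: btrfs keeps
multiple superblock copies, and on an inconsistent dd image they
typically disagree. A sketch (the loop device name is an example):

```shell
# attach the image read-only, then dump every superblock copy
losetup --read-only /dev/loop0 imagefile.dd
btrfs inspect-internal dump-super --all /dev/loop0 \
    | grep -E '^(superblock:|generation)'
```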

-- 
Hans van Kranenburg


Re: btrfs recovery

2017-01-30 Thread Michael Born


Am 30.01.2017 um 21:51 schrieb Chris Murphy:
> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born  wrote:
>> Hi btrfs experts.
>>
>> Hereby I apply for the stupidity of the month award.
> 
> There's still another day :-D
> 
> 
> 
>>
>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>> to an image file - while the system was online/running.
>> Now, I can't mount the image.
> 
> That won't ever work for any file system. It must be unmounted.

I could mount and copy the data out of my /home image.dd (encrypted
xfs). That was also online while dd-ing it.

>> Could you give me some instructions how to repair the file system or
>> extract some files from it?
> 
> Not possible. The file system was being modified while dd was
> happening, so the image you've taken is inconsistent.

The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
change for months. Why would they change in the moment I copy their
blocks with dd?

Michael





Re: [PATCH] Btrfs: fix leak of subvolume writers counter

2017-01-30 Thread Liu Bo
On Sat, Jan 28, 2017 at 06:06:54AM +, fdman...@kernel.org wrote:
> From: Robbie Ko 
> 
> When falling back from a nocow write to a regular cow write, we were
> leaking the subvolume writers counter in 2 situations, preventing
> snapshot creation from ever completing in the future, as it waits
> for that counter to go down to zero before the snapshot creation
> starts.
> 
> In run_delalloc_nocow, the subv_writers counter may not be released,
> which leads to snapshot creation hanging.

Reviewed-by: Liu Bo 

Thanks,

-liubo
> 
> Signed-off-by: Robbie Ko 
> Reviewed-by: Filipe Manana 
> [Improved changelog and subject]
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/inode.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a713d9d..7221d66 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1404,10 +1404,16 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
>* either valid or do not exist.
>*/
>   if (csum_exist_in_range(fs_info, disk_bytenr,
> - num_bytes))
> + num_bytes)) {
> + if (!nolock)
> + btrfs_end_write_no_snapshoting(root);
>   goto out_check;
> - if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
> + }
> + if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
> + if (!nolock)
> + btrfs_end_write_no_snapshoting(root);
>   goto out_check;
> + }
>   nocow = 1;
>   } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>   extent_end = found_key.offset +
> -- 
> 2.7.0.rc3
> 


Hard crash on 4.9.5, part 2

2017-01-30 Thread Matt McKinnon
I have an error on this file system that I've had in the distant past,
where the mount would fail with a "file exists" error.  Running btrfs
check gives the following over and over again:


Found file extent holes:
start: 0, len: 290816
root 257 inode 28472371 errors 1000, some csum missing
root 257 inode 28472416 errors 1000, some csum missing
root 257 inode 9182183 errors 1000, some csum missing
root 257 inode 9182186 errors 1000, some csum missing
root 257 inode 28419536 errors 1100, file extent discount, some csum missing
Found file extent holes:
start: 0, len: 290816
root 257 inode 28472371 errors 1000, some csum missing
root 257 inode 28472416 errors 1000, some csum missing
root 257 inode 9182183 errors 1000, some csum missing
root 257 inode 9182186 errors 1000, some csum missing
root 257 inode 28419536 errors 1100, file extent discount, some csum missing


Are these found per subvolume snapshot I have and will eventually end?

Here is the crash after the mount (with recovery/usebackuproot):

[  627.233213] BTRFS warning (device sda1): 'recovery' is deprecated, 
use 'usebackuproot' instead
[  627.233216] BTRFS info (device sda1): trying to use backup root at 
mount time

[  627.233218] BTRFS info (device sda1): disk space caching is enabled
[  627.233220] BTRFS info (device sda1): has skinny extents
[  709.234688] [ cut here ]
[  709.234734] WARNING: CPU: 5 PID: 3468 at fs/btrfs/file.c:546 
btrfs_drop_extent_cache+0x3e8/0x400 [btrfs]
[  709.234735] Modules linked in: ipmi_devintf nfsd auth_rpcgss nfs_acl 
nfs lockd grace sunrpc fscache lp parport intel_rapl sb_edac
 edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
xt_tcpudp kvm nf_conntrack_ipv4 nf_defrag_ipv4 irqbypass crct10d
if_pclmul crc32_pclmul ghash_clmulni_intel xt_conntrack aesni_intel 
btrfs nf_conntrack aes_x86_64 lrw gf128mul iptable_filter glue_h
elper ip_tables ablk_helper cryptd x_tables dm_multipath joydev mei_me 
ioatdma mei lpc_ich wmi ipmi_si ipmi_msghandler shpchp mac_hi
d ses enclosure scsi_transport_sas raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor hid_generic megarai
d_sas raid6_pq ahci libcrc32c libahci igb usbhid raid1 hid i2c_algo_bit 
raid0 dca ptp multipath pps_core linear dm_mirror dm_region_

hash dm_log
[  709.234812] CPU: 5 PID: 3468 Comm: mount Not tainted 4.9.5-custom #1
[  709.234813] Hardware name: Supermicro 
X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0b 04/28/2014
[  709.234816]  bd3784bb7568 8e3c8e7c  

[  709.234820]  bd3784bb75a8 8e07d3d1 02220070 
9e5f0ae4d150
[  709.234823]  0002d000 9e5f0bc91f78 9e5f0bc91da8 
0002c000

[  709.234827] Call Trace:
[  709.234837]  [] dump_stack+0x63/0x87
[  709.234846]  [] __warn+0xd1/0xf0
[  709.234850]  [] warn_slowpath_null+0x1d/0x20
[  709.234874]  [] btrfs_drop_extent_cache+0x3e8/0x400 
[btrfs]
[  709.234895]  [] __btrfs_drop_extents+0x5b2/0xd30 
[btrfs]
[  709.234914]  [] ? 
generic_bin_search.constprop.36+0x8b/0x1e0 [btrfs]
[  709.234931]  [] ? btrfs_set_path_blocking+0x36/0x70 
[btrfs]

[  709.234942]  [] ? kmem_cache_alloc+0x194/0x1a0
[  709.234958]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[  709.234977]  [] btrfs_drop_extents+0x79/0xa0 [btrfs]
[  709.235002]  [] replay_one_extent+0x414/0x7b0 [btrfs]
[  709.235007]  [] ? autoremove_wake_function+0x40/0x40
[  709.235030]  [] replay_one_buffer+0x4cc/0x7c0 [btrfs]
[  709.235053]  [] ? 
mark_extent_buffer_accessed+0x4f/0x70 [btrfs]

[  709.235074]  [] walk_down_log_tree+0x1ba/0x3b0 [btrfs]
[  709.235094]  [] walk_log_tree+0xb4/0x1a0 [btrfs]
[  709.235114]  [] btrfs_recover_log_trees+0x20e/0x460 
[btrfs]

[  709.235133]  [] ? replay_one_extent+0x7b0/0x7b0 [btrfs]
[  709.235154]  [] open_ctree+0x2640/0x27f0 [btrfs]
[  709.235171]  [] btrfs_mount+0xca4/0xec0 [btrfs]
[  709.235176]  [] ? find_next_zero_bit+0x1e/0x20
[  709.235180]  [] ? pcpu_next_unpop+0x3e/0x50
[  709.235184]  [] ? find_next_bit+0x19/0x20
[  709.235190]  [] mount_fs+0x39/0x160
[  709.235193]  [] ? __alloc_percpu+0x15/0x20
[  709.235196]  [] vfs_kern_mount+0x67/0x110
[  709.235213]  [] btrfs_mount+0x18b/0xec0 [btrfs]
[  709.235216]  [] ? find_next_zero_bit+0x1e/0x20
[  709.235220]  [] mount_fs+0x39/0x160
[  709.235223]  [] ? __alloc_percpu+0x15/0x20
[  709.235225]  [] vfs_kern_mount+0x67/0x110
[  709.235228]  [] do_mount+0x1bb/0xc80
[  709.235232]  [] ? kmem_cache_alloc_trace+0x14b/0x1b0
[  709.235235]  [] SyS_mount+0x83/0xd0
[  709.235240]  [] entry_SYSCALL_64_fastpath+0x1e/0xad
[  709.235243] ---[ end trace d4e5dcddb432b7d3 ]---
[  709.354972] BTRFS: error (device sda1) in btrfs_replay_log:2506: 
errno=-17 Object already exists (Failed to recover log tree)
[  709.355570] BTRFS error (device sda1): cleaner transaction attach 
returned -30

[  709.548919] BTRFS error (device sda1): open_ctree failed


-Matt

Re: btrfs recovery

2017-01-30 Thread Chris Murphy
On Mon, Jan 30, 2017 at 1:02 PM, Michael Born  wrote:
> Hi btrfs experts.
>
> Hereby I apply for the stupidity of the month award.

There's still another day :-D



>
> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
> to an image file - while the system was online/running.
> Now, I can't mount the image.

That won't ever work for any file system. It must be unmounted.


> Could you give me some instructions how to repair the file system or
> extract some files from it?

Not possible. The file system was being modified while dd was
happening, so the image you've taken is inconsistent.



-- 
Chris Murphy


[PATCH] Btrfs: remove duplicated find_get_pages_contig

2017-01-30 Thread Liu Bo
This creates a helper to manipulate page bits to avoid duplicate uses.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 202 ---
 fs/btrfs/extent_io.h |   3 +-
 2 files changed, 98 insertions(+), 107 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d5f3edb..136fe96 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1549,94 +1549,122 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
return found;
 }
 
-static noinline void __unlock_for_delalloc(struct inode *inode,
-  struct page *locked_page,
-  u64 start, u64 end)
+/*
+ * index_ret:  record where we stop
+ * This only returns errors when page_ops has PAGE_LOCK.
+ */
+static int
+__process_pages_contig(struct address_space *mapping, struct page *locked_page,
+  pgoff_t start_index, pgoff_t end_index,
+  unsigned long page_ops, pgoff_t *index_ret)
 {
-   int ret;
+   unsigned long nr_pages = end_index - start_index + 1;
+   pgoff_t index = start_index;
struct page *pages[16];
-   unsigned long index = start >> PAGE_SHIFT;
-   unsigned long end_index = end >> PAGE_SHIFT;
-   unsigned long nr_pages = end_index - index + 1;
+   unsigned pages_locked = 0;
+   unsigned ret;
+   int err = 0;
int i;
 
-   if (index == locked_page->index && end_index == index)
-   return;
+   /*
+* Do NOT skip locked_page since we may need to set PagePrivate2 on it.
+*/
 
-   while (nr_pages > 0) {
-   ret = find_get_pages_contig(inode->i_mapping, index,
-min_t(unsigned long, nr_pages,
-ARRAY_SIZE(pages)), pages);
-   for (i = 0; i < ret; i++) {
-   if (pages[i] != locked_page)
-   unlock_page(pages[i]);
-   put_page(pages[i]);
-   }
-   nr_pages -= ret;
-   index += ret;
-   cond_resched();
+   /* PAGE_LOCK should not be mixed with other ops. */
+   if (page_ops & PAGE_LOCK) {
+   ASSERT(page_ops == PAGE_LOCK);
+   ASSERT(index_ret);
+   ASSERT(*index_ret == start_index);
}
-}
 
-static noinline int lock_delalloc_pages(struct inode *inode,
-   struct page *locked_page,
-   u64 delalloc_start,
-   u64 delalloc_end)
-{
-   unsigned long index = delalloc_start >> PAGE_SHIFT;
-   unsigned long start_index = index;
-   unsigned long end_index = delalloc_end >> PAGE_SHIFT;
-   unsigned long pages_locked = 0;
-   struct page *pages[16];
-   unsigned long nrpages;
-   int ret;
-   int i;
-
-   /* the caller is responsible for locking the start index */
-   if (index == locked_page->index && index == end_index)
-   return 0;
+   if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+   mapping_set_error(mapping, -EIO);
 
-   /* skip the page at the start index */
-   nrpages = end_index - index + 1;
-   while (nrpages > 0) {
-   ret = find_get_pages_contig(inode->i_mapping, index,
+   while (nr_pages > 0) {
+   ret = find_get_pages_contig(mapping, index,
 min_t(unsigned long,
-nrpages, ARRAY_SIZE(pages)), pages);
+nr_pages, ARRAY_SIZE(pages)), pages);
if (ret == 0) {
-   ret = -EAGAIN;
-   goto done;
-   }
-   /* now we have an array of pages, lock them all */
-   for (i = 0; i < ret; i++) {
/*
-* the caller is taking responsibility for
-* locked_page
+* Only if we're going to lock these pages, can we find
+* nothing at @index.
 */
+   ASSERT(page_ops & PAGE_LOCK);
+   goto out;
+   }
+
+   for (i = 0; i < ret; i++) {
+   if (page_ops & PAGE_SET_PRIVATE2)
+   SetPagePrivate2(pages[i]);
+
if (pages[i] != locked_page) {
-   lock_page(pages[i]);
-   if (!PageDirty(pages[i]) ||
-   pages[i]->mapping != inode->i_mapping) {
-   ret = -EAGAIN;
+   if (page_ops & PAGE_CLEAR_DIRTY)
+   clear_page_dirty_for_io(pages[i]);
+  

Re: btrfs recovery

2017-01-30 Thread Hans van Kranenburg
On 01/30/2017 09:02 PM, Michael Born wrote:
> Hi btrfs experts.
> 
> Hereby I apply for the stupidity of the month award.
> But, maybe you can help me restoring my dd backup or extracting some
> files from it?
> 
> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
> to an image file - while the system was online/running.
> Now, I can't mount the image.

Making a block level copy of a filesystem while it is online and being
modified has a near 100% chance of producing a corrupt result.

Simply think of the fact that something gets written somewhere at the
end of the disk which also relates to something that gets written at the
beginning of the disk, while your dd copy is doing its thing somewhere
in between...
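A toy illustration of that race with plain files (nothing
btrfs-specific): the copy below reads the front of the "volume", a
correlated update then lands at both ends, and the copy finishes with
the back, ending up with two halves that never coexisted on disk:

```shell
# the "filesystem" correlates byte ranges at the front and the back
printf 'old-front...old-back' > volume.img
head -c 9 volume.img > copy.img             # slow dd reads the front first
printf 'NEW-front...NEW-back' > volume.img  # a commit touches both ends
tail -c +10 volume.img >> copy.img          # dd finishes with the back
cat copy.img                                # old-front...NEW-back: inconsistent
```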

-- 
Hans van Kranenburg


[PATCH] Btrfs: create a helper to create em for IO

2017-01-30 Thread Liu Bo
We have similar code to create and insert extent mappings around the IO
path; this merges it into a single helper.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 187 +--
 1 file changed, 72 insertions(+), 115 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3d3753a..5e28355 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -108,11 +108,10 @@ static noinline int cow_file_range(struct inode *inode,
   u64 start, u64 end, u64 delalloc_end,
   int *page_started, unsigned long *nr_written,
   int unlock, struct btrfs_dedupe_hash *hash);
-static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
-  u64 len, u64 orig_start,
-  u64 block_start, u64 block_len,
-  u64 orig_block_len, u64 ram_bytes,
-  int type);
+static struct extent_map *
+create_io_em(struct inode *inode, u64 start, u64 len, u64 orig_start,
+u64 block_start, u64 block_len, u64 orig_block_len,
+u64 ram_bytes, int compress_type, int type);
 
 static int btrfs_dirty_inode(struct inode *inode);
 
@@ -690,7 +689,6 @@ static noinline void submit_compressed_extents(struct inode 
*inode,
struct btrfs_key ins;
struct extent_map *em;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
struct extent_io_tree *io_tree;
int ret = 0;
 
@@ -778,46 +776,19 @@ static noinline void submit_compressed_extents(struct 
inode *inode,
 * here we're doing allocation and writeback of the
 * compressed pages
 */
-   btrfs_drop_extent_cache(inode, async_extent->start,
-   async_extent->start +
-   async_extent->ram_size - 1, 0);
-
-   em = alloc_extent_map();
-   if (!em) {
-   ret = -ENOMEM;
-   goto out_free_reserve;
-   }
-   em->start = async_extent->start;
-   em->len = async_extent->ram_size;
-   em->orig_start = em->start;
-   em->mod_start = em->start;
-   em->mod_len = em->len;
-
-   em->block_start = ins.objectid;
-   em->block_len = ins.offset;
-   em->orig_block_len = ins.offset;
-   em->ram_bytes = async_extent->ram_size;
-   em->bdev = fs_info->fs_devices->latest_bdev;
-   em->compress_type = async_extent->compress_type;
-   set_bit(EXTENT_FLAG_PINNED, &em->flags);
-   set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
-   em->generation = -1;
-
-   while (1) {
-   write_lock(&em_tree->lock);
-   ret = add_extent_mapping(em_tree, em, 1);
-   write_unlock(&em_tree->lock);
-   if (ret != -EEXIST) {
-   free_extent_map(em);
-   break;
-   }
-   btrfs_drop_extent_cache(inode, async_extent->start,
-   async_extent->start +
-   async_extent->ram_size - 1, 0);
-   }
-
-   if (ret)
+   em = create_io_em(inode, async_extent->start,
+ async_extent->ram_size, /* len */
+ async_extent->start, /* orig_start */
+ ins.objectid, /* block_start */
+ ins.offset, /* block_len */
+ ins.offset, /* orig_block_len */
+ async_extent->ram_size, /* ram_bytes */
+ async_extent->compress_type,
+ BTRFS_ORDERED_COMPRESSED);
+   if (IS_ERR(em))
+   /* ret value is not necessary due to void function */
goto out_free_reserve;
+   free_extent_map(em);
 
ret = btrfs_add_ordered_extent_compress(inode,
async_extent->start,
@@ -952,7 +923,6 @@ static noinline int cow_file_range(struct inode *inode,
u64 blocksize = fs_info->sectorsize;
struct btrfs_key ins;
struct extent_map *em;
-   struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
int ret = 0;
 
if (btrfs_is_free_space_inode(inode)) {
@@ -1008,39 +978,18 @@ static noinline int cow_file_range(struct inode *inode,
if (ret < 0)
goto 

[PATCH] Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist

2017-01-30 Thread Liu Bo
run_delalloc_nocow uses @trans in two places that don't actually need
@trans.

For btrfs_lookup_file_extent, we search for file extents without COWing
anything, and for btrfs_cross_ref_exist, the only place where we need @trans is
dereferencing it in order to get the running transaction, which we could easily
get from the global fs_info.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h   |  3 +--
 fs/btrfs/extent-tree.c | 22 --
 fs/btrfs/inode.c   | 38 +++---
 3 files changed, 16 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6a82371..73b2d51 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2577,8 +2577,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_fs_info 
*fs_info,
u64 bytenr, u64 num_bytes);
 int btrfs_exclude_logged_extents(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb);
-int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
+int btrfs_cross_ref_exist(struct btrfs_root *root,
  u64 objectid, u64 offset, u64 bytenr);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(
 struct btrfs_fs_info *info,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ed254b8..097fa4a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3025,8 +3025,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-static noinline int check_delayed_ref(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
+static noinline int check_delayed_ref(struct btrfs_root *root,
  struct btrfs_path *path,
  u64 objectid, u64 offset, u64 bytenr)
 {
@@ -3034,9 +3033,14 @@ static noinline int check_delayed_ref(struct 
btrfs_trans_handle *trans,
struct btrfs_delayed_ref_node *ref;
struct btrfs_delayed_data_ref *data_ref;
struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_transaction *cur_trans;
int ret = 0;
 
-   delayed_refs = &trans->transaction->delayed_refs;
+   cur_trans = root->fs_info->running_transaction;
+   if (!cur_trans)
+   return 0;
+
+   delayed_refs = &cur_trans->delayed_refs;
spin_lock(&delayed_refs->lock);
head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (!head) {
@@ -3087,8 +3091,7 @@ static noinline int check_delayed_ref(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-static noinline int check_committed_ref(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
+static noinline int check_committed_ref(struct btrfs_root *root,
struct btrfs_path *path,
u64 objectid, u64 offset, u64 bytenr)
 {
@@ -3159,9 +3162,8 @@ static noinline int check_committed_ref(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 objectid, u64 offset, u64 bytenr)
+int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
+ u64 bytenr)
 {
struct btrfs_path *path;
int ret;
@@ -3172,12 +3174,12 @@ int btrfs_cross_ref_exist(struct btrfs_trans_handle 
*trans,
return -ENOENT;
 
do {
-   ret = check_committed_ref(trans, root, path, objectid,
+   ret = check_committed_ref(root, path, objectid,
  offset, bytenr);
if (ret && ret != -ENOENT)
goto out;
 
-   ret2 = check_delayed_ref(trans, root, path, objectid,
+   ret2 = check_delayed_ref(root, path, objectid,
 offset, bytenr);
} while (ret2 == -EAGAIN);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 082b968..3d3753a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1250,7 +1250,6 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_trans_handle *trans;
struct extent_buffer *leaf;
struct btrfs_path *path;
struct btrfs_file_extent_item *fi;
@@ -1286,30 +1285,10 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
 
nolock = btrfs_is_free_space_inode(inode);
 
-   if (nolock)
-   trans = btrfs_join_transaction_nolock(root);
-   else
-   trans = btrfs_join_transaction(root);
-
-   if (IS_ERR(trans)) {
-   

[PATCH] Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head

2017-01-30 Thread Liu Bo
All we need is @delayed_refs, and all callers already have it before calling
btrfs_find_delayed_ref_head, since the lock needs to be acquired first; there
is no reason to dereference it again inside the function.

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c | 2 +-
 fs/btrfs/delayed-ref.c | 5 +
 fs/btrfs/delayed-ref.h | 3 ++-
 fs/btrfs/extent-tree.c | 6 +++---
 4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 8299601..db70659 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1284,7 +1284,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
 */
delayed_refs = &trans->transaction->delayed_refs;
spin_lock(&delayed_refs->lock);
-   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (head) {
if (!mutex_trylock(&head->mutex)) {
atomic_inc(&head->node.refs);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ef724a5..aebb48c 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -911,11 +911,8 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
*fs_info,
  * the head node if any where found, or NULL if not.
  */
 struct btrfs_delayed_ref_head *
-btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr)
+btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 
bytenr)
 {
-   struct btrfs_delayed_ref_root *delayed_refs;
-
-   delayed_refs = &trans->transaction->delayed_refs;
return find_ref_head(&delayed_refs->href_root, bytenr, 0);
 }
 
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 50947b5..22ca93b 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -262,7 +262,8 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
*trans,
  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *
-btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
+btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+   u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
   struct btrfs_delayed_ref_head *head);
 static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head 
*head)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e97302f..ed254b8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -888,7 +888,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle 
*trans,
 
delayed_refs = &trans->transaction->delayed_refs;
spin_lock(&delayed_refs->lock);
-   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (head) {
if (!mutex_trylock(&head->mutex)) {
atomic_inc(&head->node.refs);
@@ -3038,7 +3038,7 @@ static noinline int check_delayed_ref(struct 
btrfs_trans_handle *trans,
 
delayed_refs = &trans->transaction->delayed_refs;
spin_lock(&delayed_refs->lock);
-   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (!head) {
spin_unlock(&delayed_refs->lock);
return 0;
@@ -7092,7 +7092,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
 
delayed_refs = &trans->transaction->delayed_refs;
spin_lock(&delayed_refs->lock);
-   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (!head)
goto out_delayed_unlock;
 
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: remove unused trans in read_block_for_search

2017-01-30 Thread Liu Bo
@trans is not used at all, so remove it.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a426dc8..dd8014a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2437,10 +2437,9 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
  * reada.  -EAGAIN is returned and the search must be repeated.
  */
 static int
-read_block_for_search(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root, struct btrfs_path *p,
-  struct extent_buffer **eb_ret, int level, int slot,
-  struct btrfs_key *key, u64 time_seq)
+read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
+ struct extent_buffer **eb_ret, int level, int slot,
+ struct btrfs_key *key, u64 time_seq)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
u64 blocknr;
@@ -2870,8 +2869,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root
goto done;
}
 
-   err = read_block_for_search(trans, root, p,
-   &b, level, slot, key, 0);
+   err = read_block_for_search(root, p, &b, level,
+   slot, key, 0);
if (err == -EAGAIN)
goto again;
if (err) {
@@ -3014,7 +3013,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, struct 
btrfs_key *key,
goto done;
}
 
-   err = read_block_for_search(NULL, root, p, &b, level,
+   err = read_block_for_search(root, p, &b, level,
slot, key, time_seq);
if (err == -EAGAIN)
goto again;
@@ -5784,7 +5783,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
 
next = c;
next_rw_lock = path->locks[level];
-   ret = read_block_for_search(NULL, root, path, &next, level,
+   ret = read_block_for_search(root, path, &next, level,
slot, &key, 0);
if (ret == -EAGAIN)
goto again;
@@ -5834,7 +5833,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
if (!level)
break;
 
-   ret = read_block_for_search(NULL, root, path, &next, level,
+   ret = read_block_for_search(root, path, &next, level,
0, &key, 0);
if (ret == -EAGAIN)
goto again;
-- 
2.5.5



Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure

2017-01-30 Thread Liu Bo
On Fri, Jan 13, 2017 at 03:12:31PM +0900, takafumi-sslab wrote:
> Thanks for your replying.
> 
> I understand this bug is more complicated than I expected.
> I classify error cases under submit_extent_page() below
> 
> A: ENOMEM error at btrfs_bio_alloc() in submit_extent_page()
> I first assumed this case and sent the mail.
> When bio_ret is NULL, submit_extent_page() calls btrfs_bio_alloc().
> Then, btrfs_bio_alloc() may fail and submit_extent_page() returns -ENOMEM.
> In this case, bio_endio() is not called and the page's writeback bit
> remains.
> So, there is a need to call end_page_writeback() in the error handling.
> 
> B: errors under submit_one_bio() of submit_extent_page()
> Errors that occur under submit_one_bio() handles at bio_endio(), and
> bio_endio() would call end_page_writeback().
> 
> Therefore, as you mentioned in the last mail, simply adding
> end_page_writeback() like my last email and commit 55e3bd2e0c2e1 can
> conflict in the case of B.
> To avoid such conflict, one easy solution is adding PageWriteback() check
> too.
> 
> How do you think of this solution?

(sorry for the late reply.)

I think its caller, "__extent_writepage", has covered the above case
by setting page writeback again.

Thanks,

-liubo
> 
> Sincerely,
> 
> On 2016/12/22 15:20, Liu Bo wrote:
> > On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota wrote:
> > > This is actually inspired by Filipe's patch(55e3bd2e0c2e1).
> > > 
> > > When submit_extent_page() in __extent_writepage_io() fails,
> > > Btrfs misses clearing a writeback bit of the failed page.
> > > This causes the false under-writeback page.
> > > Then, another sync task hangs in filemap_fdatawait_range(),
> > > because it waits the false under-writeback page.
> > > 
> > > CPU0CPU1
> > > 
> > > __extent_writepage_io()
> > >ret = submit_extent_page() // fail
> > > 
> > >if (ret)
> > >  SetPageError(page)
> > >  // miss clearing the writeback bit
> > > 
> > >  sync()
> > >...
> > >filemap_fdatawait_range()
> > >  wait_on_page_writeback(page);
> > >  // wait the false under-writeback 
> > > page
> > > 
> > > Signed-off-by: Takafumi Kubota 
> > > ---
> > >   fs/btrfs/extent_io.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 1e67723..ef9793b 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -3443,8 +3443,10 @@ static noinline_for_stack int 
> > > __extent_writepage_io(struct inode *inode,
> > >bdev, &epd->bio, max_nr,
> > >end_bio_extent_writepage,
> > >0, 0, 0, false);
> > > - if (ret)
> > > + if (ret) {
> > >   SetPageError(page);
> > > + end_page_writeback(page);
> > > + }
> > OK...this could be complex as we don't know which part in
> > submit_extent_page gets the error, if the page has been added into bio
> > and bio_end would call end_page_writepage(page) as well, so whichever
> > comes later, the BUG() in end_page_writeback() would complain.
> > 
> > Looks like commit 55e3bd2e0c2e1 also has the same problem although I
> > gave it my reviewed-by.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> > >   cur = cur + iosize;
> > >   pg_offset += iosize;
> > > -- 
> > > 1.9.3
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Keio University
> System Software Laboratory
> Takafumi Kubota
> takafumi.kubota1...@sslab.ics.keio.jp
> 


btrfs recovery

2017-01-30 Thread Michael Born
Hi btrfs experts.

Hereby I apply for the stupidity of the month award.
But, maybe you can help me restoring my dd backup or extracting some
files from it?

Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
to an image file - while the system was online/running.
Now, I can't mount the image.

I tried many commands (some output is below) that are suggested in the
wiki or blog pages without any success.
Unfortunately, the promising tool btrfs-find-root seems not to work. I
let it run on backup1.dd for 16 hours with the only output being:
./btrfs-find-root /dev/loop0
Couldn't read tree root
Superblock thinks the generation is 550114
Superblock thinks the level is 1

I then stopped it manually. (The 60GB dd file is on a ssd and one cpu
core was at 100% load all night)
I also tried the git clone of btrfs-progs which I checked out (the
tagged versions 4.9, 4.7, 4.4, 4.1) and compiled. I always got the
btrfs-find-root output as shown above.

Could you give me some instructions how to repair the file system or
extract some files from it?

Thank you,
Michael

PS: could you please CC me, as I'm not subscribed to the list.

Some commands and their output.

mount -t btrfs -o recovery,ro /dev/loop0 /mnt/oldroot/
mount: wrong fs type, bad option, bad superblock on
/dev/loop0, missing codepage or helper program, or
other error

dmesg -T says:
[Mo Jan 30 01:08:20 2017] BTRFS info (device loop0): enabling auto recovery
[Mo Jan 30 01:08:20 2017] BTRFS info (device loop0): disk space caching
is enabled
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865271808
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865271808
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32862011392
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): parent transid
verify failed on 32869482496 wanted 550112 found 550121
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865353728
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS: open_ctree failed

---

btrfs fi show
Label: none  uuid: 1c203c00-2768-4ea8-9e00-94aba5825394
Total devices 1 FS bytes used 29.28GiB
devid1 size 60.00GiB used 32.07GiB path /dev/sda2

Label: none  uuid: 91a79eeb-08e0-470e-beab-916b38e09aca
Total devices 1 FS bytes used 44.23GiB
devid1 size 60.00GiB used 60.00GiB path /dev/loop0

The 1st one is my now running Suse 42.2 /

---

btrfs check /dev/loop0
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
ERROR: cannot open file system

---

./btrfs restore -l /dev/loop0
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
Could not open root, trying backup super
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
checksum verify failed on 32865271808 found E4E3BDB6 wanted 
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 64428703744
Could not open root, trying backup super

---

uname -a
Linux linux-azo5 4.4.36-8-default #1 SMP Fri Dec 9 16:18:38 UTC 2016
(3ec5648) x86_64 x86_64 x86_64 GNU/Linux


Re: [PATCH] Btrfs: fix -EINVEL in tree log recovering

2017-01-30 Thread Filipe Manana
On Tue, Oct 11, 2016 at 10:01 AM, robbieko  wrote:
> From: Robbie Ko 
>
> when tree log recovery, space_cache rebuild or dirty maybe save the cache.
> and then replay extent with disk_bytenr and disk_num_bytes,
> but disk_bytenr and disk_num_bytes maybe had been use for free space inode,
> will lead to -EINVEL.

-EINVEL -> -EINVAL

More importantly, and sorry to say, but I can't parse nor make sense
of your change log.
It kind of seems you're saying that replaying an extent from the log
tree can collide somehow with the space reserved for a free space
cache, or the other way around, writing a space cache attempts to use
an extent that overlaps an extent that was replayed during log
recovery (presumably during the transaction commit done at the end of
the log recovery).

Now honestly, think of how you would explain the problem in your
native tongue. Do you think a single short sentence like that is
enough to explain such a non-trivial problem? I doubt it, no matter
what language we pick... Or think that in a few months or maybe years
(or whatever time frame) even you forgot what was the problem and
you're trying to remember the details by reading the change log - do
you think this change log would help at all?

At least tell us what (function) is returning -EINVAL, make a function
call graph, give a sample scenario, or better yet, send a test case
(fstests) to reproduce this, since it seems to be a fully
deterministic and 100% reproducible case.

Thanks.

>
> BTRFS: error in btrfs_replay_log:2446: errno=-22 unknown (Failed to recover 
> log tree)
>
> therefore, we not save cache when tree log recovering.
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 665da8f..38b932c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3434,6 +3434,7 @@ again:
>
> spin_lock(&block_group->lock);
> if (block_group->cached != BTRFS_CACHE_FINISHED ||
> +   block_group->fs_info->log_root_recovering ||
> !btrfs_test_opt(root->fs_info, SPACE_CACHE)) {
> /*
>  * don't bother trying to write stuff out _if_
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: [PATCH] Btrfs: fix leak subvol subv_writers conter

2017-01-30 Thread Filipe Manana
On Fri, Oct 7, 2016 at 3:01 AM, robbieko  wrote:
> From: Robbie Ko 
>
> In run_delalloc_nocow, maybe not release subv_writers conter,
> will lead to create snapshot hang.
>
> Signed-off-by: Robbie Ko 

I've picked this into my integration branch for 4.11 and rewrote the
changelog and subject.

Thanks.

> ---
>  fs/btrfs/inode.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6811c4..9722554 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1386,11 +1386,17 @@ next_slot:
>  * this ensure that csum for a given extent are
>  * either valid or do not exist.
>  */
> -   if (csum_exist_in_range(root, disk_bytenr, num_bytes))
> +   if (csum_exist_in_range(root, disk_bytenr, 
> num_bytes)) {
> +   if (!nolock)
> +   btrfs_end_write_no_snapshoting(root);
> goto out_check;
> +   }
> if (!btrfs_inc_nocow_writers(root->fs_info,
> -disk_bytenr))
> +disk_bytenr)) {
> +   if (!nolock)
> +   btrfs_end_write_no_snapshoting(root);
> goto out_check;
> +   }
> nocow = 1;
> } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> extent_end = found_key.offset +
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


[PATCH] Btrfs: fix leak of subvolume writers counter

2017-01-30 Thread fdmanana
From: Robbie Ko 

When falling back from a nocow write to a regular cow write, we were
leaking the subvolume writers counter in 2 situations, preventing
snapshot creation from ever completing in the future, as it waits
for that counter to go down to zero before the snapshot creation
starts.

In run_delalloc_nocow, maybe not release subv_writers conter,
will lead to create snapshot hang.

Signed-off-by: Robbie Ko 
Reviewed-by: Filipe Manana 
[Improved changelog and subject]
Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a713d9d..7221d66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1404,10 +1404,16 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
 * either valid or do not exist.
 */
if (csum_exist_in_range(fs_info, disk_bytenr,
-   num_bytes))
+   num_bytes)) {
+   if (!nolock)
+   btrfs_end_write_no_snapshoting(root);
goto out_check;
-   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
+   }
+   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
+   if (!nolock)
+   btrfs_end_write_no_snapshoting(root);
goto out_check;
+   }
nocow = 1;
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
extent_end = found_key.offset +
-- 
2.7.0.rc3



[PATCH] Btrfs: bulk delete checksum items in the same leaf

2017-01-30 Thread fdmanana
From: Filipe Manana 

Very often we have the checksums for an extent spread in multiple items
in the checksums tree, and currently the algorithm to delete them starts
by looking for them one by one and then deleting them one by one, which
is not optimal since each deletion involves shifting all the other items
in the leaf and when the leaf reaches some low threshold, to move items
off the leaf into its left and right neighbor leafs. Also, after each
item deletion we release our search path and start a new search for other
checksums items.

So optimize this by deleting in bulk all the items in the same leaf that
contain checksums for the extent being freed.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/file-item.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index e97e322..d7d6d4a 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -643,7 +643,33 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 
/* delete the entire item, it is inside our range */
if (key.offset >= bytenr && csum_end <= end_byte) {
-   ret = btrfs_del_item(trans, root, path);
+   int del_nr = 1;
+
+   /*
+* Check how many csum items preceding this one in this
+* leaf correspond to our range and then delete them all
+* at once.
+*/
+   if (key.offset > bytenr && path->slots[0] > 0) {
+   int slot = path->slots[0] - 1;
+
+   while (slot >= 0) {
+   struct btrfs_key pk;
+
+   btrfs_item_key_to_cpu(leaf, &pk, slot);
+   if (pk.offset < bytenr ||
+   pk.type != BTRFS_EXTENT_CSUM_KEY ||
+   pk.objectid !=
+   BTRFS_EXTENT_CSUM_OBJECTID)
+   break;
+   path->slots[0] = slot;
+   del_nr++;
+   key.offset = pk.offset;
+   slot--;
+   }
+   }
+   ret = btrfs_del_items(trans, root, path,
+ path->slots[0], del_nr);
if (ret)
goto out;
if (key.offset == bytenr)
-- 
2.7.0.rc3



Re: [PATCH v3 5/6] btrfs-progs: convert: Switch to new rollback function

2017-01-30 Thread David Sterba
On Wed, Jan 25, 2017 at 08:42:01AM +0800, Qu Wenruo wrote:
> >> So this implies the current implementation is not good enough for review.
> >
> > I'd say the code hasn't been cleaned up for a long time so it's not good
> > enough for adding new features and doing broader fixes. The v2 rework
> > has fixed quite an important issue, but for other issues I'd rather get
> > smaller patches that eg. prepare the code for the final change.
> > Something that I can review without needing to reread the whole convert
> > and refresh memories about all details.
> >
> >> I'll try to extract more more set operation and make the core part more
> >> refined, with more ascii art comment for it.
> >
> > The ascii diagrams help, the overall convert design could be also better
> > documented etc. At the moment I'd rather spend some time on cleaning up
> > the sources but also don't want to block the fixes you've been sending.
> > I need to think about that more.
> 
> Feel free to block the rework.
> 
> I'll start from sending out basic documentations explaining the logic 
> behind convert/rollback, which should help review.

FYI, I've reorganized the convert files a bit, this patchset does not
apply anymore, but I'm expecting some more changes to it so please adapt
it to the new file structure.


Re: Fresh Raid-1 setup, dump-tree shows invalid owner id

2017-01-30 Thread Lakshmipathi.G
>
> Yes, the owner is the number of the tree.
>
> DATA_RELOC_TREE is -9, but then unsigned 64 bits.
>
> >>> -9 + 2**64
> 18446744073709551607L
>
> So the result is a number that's close to the max or 64 bits.
>
> You can find those numbers in the kernel source in
>   include/uapi/linux/btrfs_tree.h
>
> e.g.:
>
> #define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
>

Thanks for the details. This owner number looked different from other
owner ids, so wanted to check on the same, now understood.

Cheers.
Lakshmipathi.G


Re: btrfs recovery

2017-01-30 Thread Austin S. Hemmelgarn

On 2017-01-28 00:00, Duncan wrote:

Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
excerpted:


On 2017-01-27 06:01, Oliver Freyermuth wrote:

I'm also running 'memtester 12G' right now, which at least tests 2/3
of the memory. I'll leave that running for a day or so, but of course
it will not provide a clear answer...


A small update: while the online memtester is without any errors still,
I checked old syslogs from the machine and found something intriguing.



kernel: Corrupted low memory at 88009000 (9000 phys) = 00098d39
kernel: Corrupted low memory at 88009000 (9000 phys) = 00099795
kernel: Corrupted low memory at 88009000 (9000 phys) = 000dd64e


0x9000 = 36K...


This seems to be consistently happening from time to time (I have low
memory corruption checking compiled in).
The numbers always consistently increase, and after a reboot, start
fresh from a small number again.

I suppose this is a BIOS bug and it's storing some counter in low
memory. I am unsure whether this could have triggered the BTRFS
corruption, nor do I know what to do about it (are there kernel quirks
for that?). The vendor does not provide any updates, as usual.

If someone could confirm whether this might cause corruption for btrfs
(and maybe direct me to the correct place to ask for a kernel quirk for
this device - do I ask on MM, or somewhere else?), that would be much
appreciated.



It is a firmware bug, Linux doesn't use stuff in that physical address
range at all.  I don't think it's likely that this specific bug caused
the corruption, but given that the firmware doesn't have its
allocations listed correctly in the e820 table (if they were listed
correctly, you wouldn't be seeing this message), it would not surprise
me if the firmware was involved somehow.


Correct me if I'm wrong (I'm no kernel expert, but I've been building my
own kernel for well over a decade now so having a working familiarity
with the kernel options, of which the following is my possibly incorrect
read), but I believe that's only "fact check: mostly correct" (mostly as
in yes it's the default, but there's a mainline kernel option to change
it).

I was just going over the related kernel options again a couple days ago,
so they're fresh in my head, and AFAICT...

There are THREE semi-related kernel options (config UI option location is
based on the mainline 4.10-rc5+ git kernel I'm presently running):

DEFAULT_MMAP_MIN_ADDR

Config location: Processor type and features:
Low address space to protect from user allocation

This one is virtual memory according to config help, so likely not
directly related, but similar idea.
Yeah, it really only affects userspace.  In effect, it's the lowest 
virtual address that a userspace program can allocate memory at.  By 
default on most systems it only covers the first page (which is to 
protect against NULL pointer bugs).  Most distros set it at 64k to 
provide a bit of extra protection.  There are a handful that set it to 0 
so that vm86 stuff works, but the number of such distros is going down 
over time because vm86 is not a common use case, and this can be 
configured at runtime through /proc/sys/vm/mmap_min_addr.
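
For reference, a quick sketch of inspecting and changing that runtime
knob (the sysctl path is from the kernel documentation; the values shown
are typical defaults, not guaranteed on any given distro):

```shell
# Read the current lowest virtual address userspace may mmap (in bytes).
cat /proc/sys/vm/mmap_min_addr

# Raise it to 64K at runtime (requires root; lost on reboot unless
# persisted via a drop-in under /etc/sysctl.d/):
sysctl -w vm.mmap_min_addr=65536
```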


X86_CHECK_BIOS_CORRUPTION

Location: Same section, a few lines below the first one:
Check for low memory corruption

I guess this is the option you (the OP) have enabled.  Note that according
to the help text, in addition to enabling this config option, a runtime
kernel command-line option must be given as well to actually enable the checks.
There's another config option that controls the default (I forget its 
name and I'm too lazy right now to check), but he obviously either has 
that option enabled or has the check enabled at runtime, otherwise there 
wouldn't be any messages in the kernel log about the check failing. 
FWIW, the reason this defaults to being off is that the check runs every 
60 seconds, and therefore has a significant impact on power usage on 
mobile systems.


X86_RESERVE_LOW

Location: Same section, immediately below the check option:
Amount of low memory, in kilobytes, to reserve for the BIOS

Help for this one suggests enabling the check bios corruption option
above if there are any doubts, so the two are directly related.
Yes.  This specifies both the kernel equivalent of DEFAULT_MMAP_MIN_ADDR 
(so the kernel won't use anything with a physical address between 0 and 
this range), and the upper bound for the corruption check.


All three options apparently default to 64K (as that's what I see here
and I don't believe I've changed them), but can be changed.  See the
kernel options help and where it points for more.
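
For anyone who wants to experiment, here's a sketch of the build-time and
boot-time knobs discussed above (names taken from the mainline Kconfig
and kernel-parameters documentation; treat the exact values as
illustrative, not recommendations):

```shell
# Build-time (kernel .config):
#   CONFIG_DEFAULT_MMAP_MIN_ADDR=65536      # userspace low-address protection
#   CONFIG_X86_CHECK_BIOS_CORRUPTION=y      # compile in the periodic check
#   CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y  # run it by default
#   CONFIG_X86_RESERVE_LOW=64               # KB of low memory kept from the kernel
#
# Boot-time (kernel command line), if the check isn't enabled by default:
#   memory_corruption_check=1
#   memory_corruption_check_period=60       # seconds between scans
#   memory_corruption_check_size=64K        # how much low memory to scan
#   reservelow=64                           # runtime override of X86_RESERVE_LOW
```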

My read of the above is that yes, by default the kernel won't use
physical 0x9000 (36K), as it's well within the 64K default reserve area,
but a blanket "Linux doesn't use stuff in that physical address range at
all" is incorrect: if the defaults have been changed, it /could/ use
that space (#3's minimum is 1 page, 4K, which would leave that 36K
address unprotected).
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.

2017-01-30 Thread Austin S. Hemmelgarn

On 2017-01-28 04:17, Andrei Borzenkov wrote:

27.01.2017 23:03, Austin S. Hemmelgarn пишет:

On 2017-01-27 11:47, Hans Deragon wrote:

On 2017-01-24 14:48, Adam Borowski wrote:


On Tue, Jan 24, 2017 at 01:57:24PM -0500, Hans Deragon wrote:


If I remove 'ro' from the option, I cannot get the filesystem mounted
because of the following error: BTRFS: missing devices(1) exceeds the
limit(0), writeable mount is not allowed So I am stuck. I can only
mount the filesystem as read-only, which prevents me to add a disk.


A known problem: you get only one shot at fixing the filesystem, but
that's not because of some damage; it's because the check for whether the
fs is in good enough shape to mount is oversimplistic.

Here's a patch, if you apply it and recompile, you'll be able to mount
degraded rw.

Note that it removes a safety harness: here, the harness got tangled
up and keeps you from recovering when it shouldn't, but it _has_ valid
uses besides that.

Meow!


Greetings,

Ok, that solution will solve my problem in the short run, i.e. getting
my raid1 up again.

However, as a user, I am seeking for an easy, no maintenance raid
solution.  I wish that if a drive fails, the btrfs filesystem still
mounts rw and leaves the OS running, but warns the user of the failing
disk and easily allow the addition of a new drive to reintroduce
redundancy.  Are there any plans within the btrfs community to implement
such a feature?  In a year from now, when the other drive will fail,
will I hit again this problem, i.e. my OS failing to start, booting into
a terminal, and cannot reintroduce a new drive without recompiling the
kernel?

Before I make any suggestions regarding this, I should point out that
mounting read-write when a device is missing is what caused this issue
in the first place.



How do you replace device when filesystem is mounted read-only?

I'm saying that the use case you're asking to have supported is the 
reason stuff like this happens.  If you're mounting read-write degraded 
and fixing the filesystem _immediately_ then it's not an issue, that's 
exactly what read-write degraded mounts are for.  If you're mounting 
read-write degraded and then having the system run as if nothing was 
wrong, then I have zero sympathy because that's _dangerous_, even with 
LVM, MD-RAID, or even hardware RAID (actually, especially with hardware 
RAID, LVM and MD are smart enough to automatically re-sync, most 
hardware RAID controllers aren't).


That said, as I mentioned further down in my initial reply, you 
absolutely should be monitoring the filesystem and not letting things 
get this bad if at all possible.  It's actually very rare that a storage 
device fails catastrophically with no warning (at least, on the scale 
that most end users are operating).  At a minimum, even if you're using 
ext4 on top of LVM, you should be monitoring SMART attributes on the 
storage devices (or whatever the SCSI equivalent is if you use 
SCSI/SAS/FC devices).  While not 100% reliable (they are getting better 
though), they're generally a pretty good way to tell if a disk is likely 
to fail in the near future.
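
As a concrete starting point using smartmontools (a sketch only; the
device names and the smartd schedule here are hypothetical and need
adapting to the actual system):

```shell
# One-off health verdict and full attribute dump for a device:
smartctl -H /dev/sda
smartctl -A /dev/sda

# Or let smartd watch continuously, e.g. a line in /etc/smartd.conf:
#   /dev/sda -a -o on -S on -s (S/../.././02) -m root@localhost
# (-a: monitor all attributes, -o/-S: enable offline testing and
#  attribute autosave, -s: short self-test daily at 02:00,
#  -m: mail warnings to this address)
```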

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fresh Raid-1 setup, dump-tree shows invalid owner id

2017-01-30 Thread Hans van Kranenburg
On 01/30/2017 02:54 AM, Lakshmipathi.G wrote:
> After creating raid1:
> $./mkfs.btrfs -f -d raid1 -m raid1 /dev/sda6 /dev/sda7
> 
> and using
> $./btrfs inspect-internal dump-tree /dev/sda6  #./btrfs-debug-tree /dev/sda6
> 
> shows possible wrong value for 'owner'? 
> --
> checksum tree key (CSUM_TREE ROOT_ITEM 0) 
> leaf 29425664 items 0 free space 16283 generation 4 owner 7
> fs uuid 94fee00b-00aa-4d69-b947-347f743117f2
> chunk uuid 6477561c-cbca-45e4-980d-56727a8dc9d9
> data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 
> leaf 29442048 items 2 free space 16061 generation 4 owner 
> 18446744073709551607 <<< owner id?
> fs uuid 94fee00b-00aa-4d69-b947-347f743117f2
> chunk uuid 6477561c-cbca-45e4-980d-56727a8dc9d9
> --
> 
> or is that expected output?

Yes, the owner is the number of the tree.

DATA_RELOC_TREE is -9, but stored as an unsigned 64-bit value.

>>> -9 + 2**64
18446744073709551607L

So the result is a number that's close to the max of 64 bits.

You can find those numbers in the kernel source in
  include/uapi/linux/btrfs_tree.h

e.g.:

#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
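
Hans's arithmetic can be wrapped in a tiny helper (hypothetical, for
illustration only) that maps dump-tree owner values back to the signed
objectids used in that header:

```python
def to_signed64(owner):
    """Interpret a u64 btrfs tree objectid as a signed 64-bit value."""
    return owner - 2**64 if owner >= 2**63 else owner

# The data reloc tree owner from the dump-tree output above:
print(to_signed64(18446744073709551607))  # -9 == BTRFS_DATA_RELOC_TREE_OBJECTID
print(to_signed64(7))                     # CSUM_TREE's owner stays 7
```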

-- 
Hans van Kranenburg


Re: Fresh Raid-1 setup, dump-tree shows invalid owner id

2017-01-30 Thread Lakshmipathi.G
Raid1 is irrelevant; it looks like this happens with the simple case too.
$./mkfs.btrfs tests/test.img
$./btrfs-debug-tree tests/test.img

possible issue with ./btrfs-debug-tree stdout?

On Mon, Jan 30, 2017 at 7:24 AM, Lakshmipathi.G
 wrote:
> After creating raid1:
> $./mkfs.btrfs -f -d raid1 -m raid1 /dev/sda6 /dev/sda7
>
> and using
> $./btrfs inspect-internal dump-tree /dev/sda6  #./btrfs-debug-tree /dev/sda6
>
> shows possible wrong value for 'owner'?
> --
> checksum tree key (CSUM_TREE ROOT_ITEM 0)
> leaf 29425664 items 0 free space 16283 generation 4 owner 7
> fs uuid 94fee00b-00aa-4d69-b947-347f743117f2
> chunk uuid 6477561c-cbca-45e4-980d-56727a8dc9d9
> data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0)
> leaf 29442048 items 2 free space 16061 generation 4 owner 
> 18446744073709551607 <<< owner id?
> fs uuid 94fee00b-00aa-4d69-b947-347f743117f2
> chunk uuid 6477561c-cbca-45e4-980d-56727a8dc9d9
> --
>
> or is that expected output?
>
> Cheers.
> Lakshmipathi.G


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-30 Thread Michal Hocko
On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > If this ever turn out to be a problem and with the vmapped stacks we
> > have good chances to get a proper stack traces on a potential overflow
> > we can add the scope API around the problematic code path with the
> > explanation why it is needed.
> 
> Yeah, or maybe we can automate it?  Can the reclaim code check how
> much stack space is left and do the right thing automatically?

I am not sure how to do that. Checking for some magic value sounds quite
fragile to me. It also sounds a bit strange to focus only on the reclaim
while other code paths might suffer from the same problem.

What is actually the deepest possible call chain from the slab reclaim
where I stopped? I have tried to follow that path but hit the callback
wall quite early.
 
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find comforting.

I understand that, but I would be much happier if we made the
decision based on actual data rather than a fear that something would
break down.

-- 
Michal Hocko
SUSE Labs