Re: state of btrfs snapshot limitations?

2018-09-14 Thread James A. Robinson
Wow, thanks to everyone for all that information; I'm going to have to
take some time to digest everything. :)

I just wanted to quickly say one thing: As Duncan surmised, I'm not
treating this as my primary backup, but more of an experimental add-on
feature.  The primary backup goes to an ext4 partition that is then
rsynced to a 'current' btrfs subvolume; the latter is what I'm taking
snapshots of.  Both share a RAID 1+0 enclosure.  The partitions
are oversized, 3TB each, so based on this thread that size is something
to watch out for if I use up a lot more of the space.

For other backups I've also got an Amazon Glacier-based backup (large
"offsite" but painful to recover from) via Arq, and an Apple
TimeCapsule backup (easy point-and-click restore for a single item, but
damn flaky and painfully slow for anything complicated).  So I'm not
trusting any one device or scheme.

As an aside, one of the things I loved about reading the Plan 9 papers
was their description of their three-tier storage, where they had "infinite"
long term storage via a WORM array, SCSI disks on big servers for the
"cache," and then local disk / memory for the files they were working
on right at that moment (http://doc.cat-v.org/plan_9/misc/cw/cw.pdf).

I do like the idea of the periodic "copy to a clean disk" and "mkfs the
old disk" scheme instead of complicated in-place rebuilding and/or
defragmentation; I think I have enough capacity that I will probably
try that if/when it becomes necessary.
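Under stated assumptions, that disk-rotation cycle might look like the
sketch below. Device names, labels, and mountpoints are placeholders I've
made up for illustration, and RUN defaults to echo so running it only
prints the commands:

```shell
# Sketch of the "copy to a clean disk, then mkfs the old one" cycle.
# RUN defaults to echo; unset it to actually execute the commands.
RUN="${RUN:-echo}"

$RUN mkfs.btrfs -L backup-new /dev/sdY1       # fresh filesystem on the new disk
$RUN mount /dev/sdY1 /mnt/new
$RUN rsync -aHAX /mnt/old/ /mnt/new/          # carry the current backup over
$RUN umount /mnt/old
$RUN mkfs.btrfs -f -L backup-spare /dev/sdX1  # wipe the old disk for next cycle
```

This sidesteps in-place balance/defrag entirely, at the cost of needing a
spare disk per cycle.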

I've got to read up a bit more on subvolumes, I am missing some
context from the warnings given by Chris regarding per-subvolume
options.


Re: state of btrfs snapshot limitations?

2018-09-14 Thread Duncan
James A. Robinson posted on Fri, 14 Sep 2018 14:05:29 -0700 as excerpted:

> The mail archive seems to indicate this list is appropriate for not only
> the technical coding issues, but also for user questions, so I wanted to
> pose a question here.  If I'm wrong about that, I apologize in advance.

User questions are fine here.  In fact, there are a number of non-dev 
regulars here who normally take the non-dev level questions.  I'm one of 
them. =:^)

> The page
> 
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
> 
> talks about the basic snapshot capabilities of btrfs and led me to look
> up what, if any, limits might apply.  I find some threads from a few
> years ago that talk about limiting the number of snapshots for a volume
> to 100.

Btrfs is optimized to make snapshotting very fast -- on an atomic copy-on-
write tree-based filesystem like btrfs it's pretty much just taking a new 
reference pointing at the current tree head so nothing in it disappears, 
and that's very fast -- but maintenance that works with existing 
snapshots (and other references) is often slower and doesn't always scale 
so nicely.  While from btrfs' perspective there's nothing "magical" about 
the number 100, in human terms it is of course easy to remember, and it's 
very roughly where the number of snapshots starts to take its toll on the 
time required for various filesystem maintenance tasks, including 
deleting snapshots, balance, fsck, quota maintenance, etc.
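The cheap snapshot-creation side of that trade-off is a single command; a
sketch follows (mountpoint and subvolume names are assumptions, and BTRFS
defaults to a dry-run echo so the sketch is safe to execute as-is):

```shell
# Taking the cheap, atomic read-only snapshot Duncan describes.
BTRFS="${BTRFS:-echo btrfs}"   # drop the echo to actually run it
MNT="${MNT:-/mnt/backup}"

# New reference to the current tree head; cost is near-constant
# regardless of how much data the subvolume holds.
$BTRFS subvolume snapshot -r "$MNT/current" \
    "$MNT/snapshots/$(date +%Y%m%d-%H%M)"
```

Deletion (`btrfs subvolume delete`) is the side that gets slower as
snapshot counts grow, per the maintenance discussion above.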

So the number of snapshots you can get away with depends primarily on 
three things:

1) Easiest and biggest factor:  If you don't need quotas, simply keeping 
that functionality turned off makes a big difference, and if you /do/ 
need them, turning them off temporarily for maintenance such as a 
rebalance, then doing a quota rescan when the balance is completed, can 
be the difference between a balance taking days or weeks with quotas on 
and constantly updating during the balance, vs. hours to a couple days 
with quotas turned off during the balance.  Quite a number of
people have posted about balance not being practical (or
even thinking it was hung) because it was taking "forever", who found
that simply turning quotas off (sometimes they didn't even know they
were on; it was a distro setting) fixed the problem, and balance then
completed in a reasonable time.

(There have recently been patches to avoid some of the worst constant 
rescanning during balance, but as my own use-case doesn't require either 
quotas or snapshotting, I'm not following their status, and if quotas 
aren't required keeping them off will remain simplest and most efficient 
in any case.)
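The "quotas off for the duration of a balance" routine from point 1 can
be sketched with the stock btrfs-progs subcommands. The mountpoint is an
assumption, and BTRFS defaults to a dry-run echo so the sketch only
prints what it would do:

```shell
# Sketch: disable qgroup tracking around a full balance, then rescan.
BTRFS="${BTRFS:-echo btrfs}"   # drop the echo to actually run it
MNT="${MNT:-/mnt/backup}"

$BTRFS quota disable "$MNT"                 # stop qgroup accounting updates
$BTRFS balance start --full-balance "$MNT"  # balance without quota overhead
$BTRFS quota enable "$MNT"                  # turn quotas back on...
$BTRFS quota rescan -w "$MNT"               # ...and wait for the rescan
```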

2) Use-case need for maintenance:  While (almost) any periodic-
snapshotting use-case is going to need snapshot thinning and thus 
snapshot removal as routine maintenance, some use-cases, particularly at 
the large scale, aren't going to find less routine maintenance tasks like 
full balance (converting between raid levels or adding/deleting devices 
to/from an existing filesystem) or check --repair, etc, useful; they'll 
simply swap in a hot-spare backup and mkfs the former working copy they 
would have otherwise needed maintenance on, because it's easier/simpler/
faster for them than trying to repair or change the device config of the 
existing filesystem, and their operating parameters already require the 
hot-spare resources for other reasons.

This is likely why a working fsck repair mechanism wasn't a high priority 
early on, and why it still has "holes" in the types of damage it can 
repair.  The big users funding development, such as Facebook and Oracle,
simply don't find that sort of functionality useful, as they hot-swap
instead.

But even for more "normal/personal" use-cases, if adding a device and 
rebalancing to make efficient use of it, or if repairing a broken 
filesystem when you already have the valuable stuff on it backed up 
anyway, is going to take days, with no guarantee all the problems will be 
fixed in any case for the repair case, even if it's going to take 
dropping by the local computer/electronics (super-)store for a new disk 
or three (remember the multi-device case), it may well make more sense to 
do that than to take days doing the repair/device-add on the existing 
filesystem.

Obviously if you aren't going to be repairing the filesystem or adding/
removing devices, the time that takes isn't a factor you need to worry 
about, and snapshot-deletion times are likely to be the only thing you 
need to worry about in terms of snapshot numbers scaling.

3) Backing-device speed, ssd vs. spinning-rust, etc, matters, but not as 
much as you might think, because for some filesystem maintenance 
operations, particularly with large numbers of snapshots/reflinks, parts 
of them are cpu- or memory-bound, not IO-bound.


So while 100 snapshots is a convenient number as a recommendation, it 
really depends.  On slow systems with quotas on and 

Re: state of btrfs snapshot limitations?

2018-09-14 Thread Chris Murphy
On Fri, Sep 14, 2018 at 3:05 PM, James A. Robinson wrote:

> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
>
> talks about the basic snapshot capabilities of btrfs and led
> me to look up what, if any, limits might apply.  I find some
> threads from a few years ago that talk about limiting the
> number of snapshots for a volume to 100.

It does seem variable, and I'm not certain what the pattern is that
triggers pathological behavior. There's a container thread from about a
year ago with someone using docker on Btrfs with more than 100K
containers per day, but I don't know the turnover rate. That person
does say it's deletion that's expensive, but not intolerably so.

My advice is to come up with as many strategies as you can implement,
because if one starts to implode with terrible performance,
you can just bail on it (or try fixing it, or submit bug reports
to make Btrfs better down the road, etc.), and you still have one
or more other strategies that remain viable.

By strategy, you might want to implement both your ideal and
conservative approaches, and also something in the middle. Also, it's
reasonable to mirror those strategies on a different storage stack,
e.g. LVM thin volumes and XFS. LVM thin volumes are semi-cheap to
create, and semi-cheap to delete; where Btrfs snapshots are almost
free to create, and expensive to delete (varies depending on changes
in it or the subvolume it's created from). But if the LVM thin pool's
metadata pool runs out of space, it's big trouble; I expect to lose
all the LVs if that ever happens. Also, this strategy doesn't have
send/receive, so ordinary use of rsync is expensive, since it reads and
compares both source and destination. The first answer to this
question describes a possible workaround based on hard links.

https://serverfault.com/questions/489289/handling-renamed-files-or-directories-in-rsync
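One common hard-link-based rotation, in the spirit of (though not
necessarily identical to) the linked answer: each run hard-links
unchanged files against the previous backup instead of re-copying them.
Paths are assumptions, and RSYNC defaults to a dry-run echo:

```shell
# Sketch: hard-link rotation with rsync --link-dest.
RSYNC="${RSYNC:-echo rsync}"   # drop the echo to actually run it
SRC="${SRC:-/srv/data/}"
DEST="${DEST:-/backup}"
today=$(date +%Y%m%d)

# Files identical to $DEST/latest are hard-linked, not copied,
# so unchanged data costs only directory entries.
$RSYNC -a --delete --link-dest="$DEST/latest" "$SRC" "$DEST/$today"
```

Note that, like rsnapshot's cp -al scheme, this still pays the full
unlink cost when old snapshot directories are removed.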


With Btrfs, the big scalability issue is the extent tree, which is
shared among all snapshots and subvolumes; in effect, the bigger the
file system gets, the more fragile the extent tree becomes.
The other thing is that btrfs check is super slow with large volumes; some
people have file systems of a dozen or more TiB that take days to check.

I also agree with the noatime suggestion from Hans. Note this is a per-
subvolume mount option, so if you're using the subvol= or
subvolid= mount options, you need to pass noatime every time; once per
file system isn't enough.
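For example, two fstab lines mounting different subvolumes of the same
filesystem would each need the option (the UUID and paths here are
placeholders, not from the thread):

```
# /etc/fstab sketch: noatime repeated on every subvolume mount
UUID=<fs-uuid>  /mnt/backup     btrfs  noatime,subvol=current    0  0
UUID=<fs-uuid>  /mnt/snapshots  btrfs  noatime,subvol=snapshots  0  0
```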



-- 
Chris Murphy


Re: state of btrfs snapshot limitations?

2018-09-14 Thread James A. Robinson
Thanks very much for the useful information.  I'll give the simple
scheme a try, after I adjust mount preferences.

Jim


Re: state of btrfs snapshot limitations?

2018-09-14 Thread Hans van Kranenburg
Hi,

On 09/14/2018 11:05 PM, James A. Robinson wrote:
> The mail archive seems to indicate this list is appropriate
> for not only the technical coding issues, but also for user
> questions, so I wanted to pose a question here.  If I'm
> wrong about that, I apologize in advance.

It's fine. Your observation is correct.

> The page
> 
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
> 
> talks about the basic snapshot capabilities of btrfs and led
> me to look up what, if any, limits might apply.  I find some
> threads from a few years ago that talk about limiting the
> number of snapshots for a volume to 100.
> 
> The reason I'm curious is I wanted to try and use the
> snapshot capability as a way of keeping a 'history' of a
> backup volume I maintain.  The backup doesn't change a
> lot over time, but small changes are made to files within
> it daily.

The 100 above is just a number because users ask "ok, but *how* many?".

As far as I know, the real thing that causes complexity for the
filesystem is how much actual change keeps being made to a subvolume
after it has been snapshotted.

Creating btrfs snapshots is cheap. It's only once you start making
modifications that the subvolume in which you make the changes diverges
from the other ones that share the same history. And changes means not
only changes to data (changing, adding, removing files), but also pure
metadata changes (e.g. using the touch command on a file).

Just using the snapshots, opening and reading files etc., should
however not be a big problem.

But other btrfs-specific actions are affected, like balance and device
remove, and using quota.

In any case, make sure:
- you are not using quota / qgroups (highly affected by this sort of
complexity)
- you *always* mount with noatime (which is not the default; and yes,
noatime, not relatime or anything else) to prevent unnecessary changes to
metadata, which would needlessly cause exactly this kind of complexity.

When doing this, and not having to use btrfs balance and add/remove
disks, and if the data doesn't change much over time (especially if it's
just adding new stuff all the time), you are likely able to have far
more snapshots of the thing.

> The Plan 9 OS has a nice archival filesystem that lets you
> easily maintain snapshots, and has various tools that make
> it simple to keep a /snapshot//mmdd snapshot going back
> for the life of the filesystem.
> 
> I wanted to try and replicate the basic functionality of
> that history using a non-plan-9 filesystem.  At first I
> tried rsnapshot but I find its technique of rotating and
> deleting backups is thrashing the disks to the point that it
> can't keep up with the rotations (the cp -al is fast, but
> the periodic rm -rf of older snapshots kills the disk).

Yes, btrfs snapshots are already a huge improvement compared to that.
(Also, with cp -l, modifications also end up in the "snapshots",
because it's still the same file.)

> With btrfs I was thinking perhaps I could more efficiently
> maintain the archive of changes over time using a snapshot.
> If this is an awful thought and I should just go away,
> please let me know.
> 
> If the limit is 100 or less I'd need to use a more complicated
> rotation scheme.  For example with a layout like the
> following:
> 
> min/
> hour/
> day/
> month/
> year/
> 
> The idea being each bucket, min, hour, day, month, would
> be capped and older snapshots would be removed and replaced
> with newer ones over time.
> 
> so with a 15-minute snapshot cycle I'd end up with
> 
> min/[00,15,30,45]
> hour/[00-23]
> day/[01-31]
> month/[01-12]
> year/[2018,2019,...]
> 
> (72+ snapshots with room for a few years' worth of yearlies).
> 
> But if things have changed with btrfs over the past few
> years and number of snapshots scales much higher, I would
> use the easier scheme:
> 
> /min/[00,15,30,45]
> /hourly/[00-23]
> /daily//
> 
> with 365 snapshots added per additional year.

There are tools available that can do this for you. The one I use is
btrbk, https://github.com/digint/btrbk (probably packaged in your
favorite linux distro).

I'd say, just try it. Add a snapshot schedule in your btrbk config, and
set it to never expire older ones. Then, just see what happens, and only
if you start seeing things slow down a lot, start worrying about what to
do, and let us know how far you got.
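For reference, a minimal btrbk.conf along those lines might look like
the sketch below. The paths and retention values are assumptions, not
from the thread; check the btrbk documentation for the exact syntax of
your version:

```
# btrbk.conf sketch: frequent snapshots thinned into hourly/daily/
# monthly/yearly buckets, mirroring the rotation idea above.
snapshot_preserve_min   latest
snapshot_preserve       48h 31d 12m 10y

volume /mnt/backup
  snapshot_dir snapshots
  subvolume current
```

Setting `snapshot_preserve` generously (or not at all, to start) matches
the "never expire, then watch what happens" experiment suggested above.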

Have fun,

P.S. Here's an unfinished page from a tutorial I'm writing (still
heavily under construction) which touches on snapshotting data and
metadata. Maybe it helps to explain "complexity starts when changing
things" a bit more:

https://github.com/knorrie/python-btrfs/blob/tutorial/tutorial/cows.md

-- 
Hans van Kranenburg


state of btrfs snapshot limitations?

2018-09-14 Thread James A. Robinson
The mail archive seems to indicate this list is appropriate
for not only the technical coding issues, but also for user
questions, so I wanted to pose a question here.  If I'm
wrong about that, I apologize in advance.

The page

https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

talks about the basic snapshot capabilities of btrfs and led
me to look up what, if any, limits might apply.  I find some
threads from a few years ago that talk about limiting the
number of snapshots for a volume to 100.

The reason I'm curious is I wanted to try and use the
snapshot capability as a way of keeping a 'history' of a
backup volume I maintain.  The backup doesn't change a
lot over time, but small changes are made to files within
it daily.

The Plan 9 OS has a nice archival filesystem that lets you
easily maintain snapshots, and has various tools that make
it simple to keep a /snapshot//mmdd snapshot going back
for the life of the filesystem.

I wanted to try and replicate the basic functionality of
that history using a non-plan-9 filesystem.  At first I
tried rsnapshot but I find its technique of rotating and
deleting backups is thrashing the disks to the point that it
can't keep up with the rotations (the cp -al is fast, but
the periodic rm -rf of older snapshots kills the disk).

With btrfs I was thinking perhaps I could more efficiently
maintain the archive of changes over time using a snapshot.
If this is an awful thought and I should just go away,
please let me know.

If the limit is 100 or less I'd need to use a more complicated
rotation scheme.  For example with a layout like the
following:

min/
hour/
day/
month/
year/

The idea being each bucket, min, hour, day, month, would
be capped and older snapshots would be removed and replaced
with newer ones over time.

so with a 15-minute snapshot cycle I'd end up with

min/[00,15,30,45]
hour/[00-23]
day/[01-31]
month/[01-12]
year/[2018,2019,...]

(72+ snapshots with room for a few years' worth of yearlies).

But if things have changed with btrfs over the past few
years and number of snapshots scales much higher, I would
use the easier scheme:

/min/[00,15,30,45]
/hourly/[00-23]
/daily//

with 365 snapshots added per additional year.


[PATCH v2] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-14 Thread Liu Bo
When we delete an inode,

btrfs_evict_inode() {
  truncate_inode_pages_final()
    truncate_inode_pages_range()
      lock_page()
      truncate_cleanup_page()
        btrfs_invalidatepage()
          wait_on_page_writeback()
            btrfs_lookup_ordered_range()
      cancel_dirty_page()
      unlock_page()
  ...
  btrfs_wait_ordered_range()
  ...
}

As VFS has called ->invalidatepage() to get all ordered extents done
(if there are any) and truncated all page cache pages (leaving no dirty
pages to write back after this step), wait_ordered_range() is just a noop.

Reviewed-by: David Sterba 
Signed-off-by: Liu Bo 
---
v2: More details in the description.

 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ff1d2ed2dc94..d3febc3a6bc0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5390,9 +5390,6 @@ void btrfs_evict_inode(struct inode *inode)
 
if (is_bad_inode(inode))
goto no_delete;
-   /* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
-   if (!special_file(inode->i_mode))
-   btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
 
-- 
1.8.3.1



Re: Problem with BTRFS

2018-09-14 Thread Nicholas D Steeves
Hi,

On Fri, Sep 14, 2018 at 10:13:06PM +0200, Rafael Jesús Alcántara Pérez wrote:
> Hi,
> 
> It seems that btrfs-progs_4.17-1 from Sid, includes that feature (at
> least, it says so in the manual page). I don't know if I can install it
> on Stretch but I'll try.
> 
> Greets and thank you very much to both of you ;)
> Rafael J. Alcántara Pérez.

Please do not install btrfs-progs from sid on stretch, it's likely to
break your system.  If you can't wait, here is a link to what I
uploaded.  It includes both the source and the binary packages (see
gpg signed changes file, and please take care to verify the binaries
weren't tampered with):

https://drive.google.com/file/d/1WflwBEn-QN_btrKPiefz7Kxz58VT8kIQ/view?usp=sharing

Of course, this package is unofficial.  The official one will soon
become available through the regular channels.

For best results, set up a local file:// apt repository so that apt
update && apt upgrade will work properly.  Official bpo will
automatically overwrite these, in any case.

Cheers,
Nicholas




Re: Problem with BTRFS

2018-09-14 Thread Rafael Jesús Alcántara Pérez
Hi,

It seems that btrfs-progs_4.17-1 from Sid, includes that feature (at
least, it says so in the manual page). I don't know if I can install it
on Stretch but I'll try.

Greets and thank you very much to both of you ;)
Rafael J. Alcántara Pérez.

El 14/09/18 a las 20:18, Nicholas D Steeves escribió:
> Hi,
> 
> On Fri, Sep 14, 2018 at 10:46:12PM +0500, Roman Mamedov wrote:
>> On Fri, 14 Sep 2018 19:27:04 +0200
>> Rafael Jesús Alcántara Pérez  wrote:
>>
>>> BTRFS info (device sdc1): use lzo compression, level 0
>>> BTRFS warning (device sdc1): 'recovery' is deprecated, use
>>> 'usebackuproot' instead
>>> BTRFS info (device sdc1): trying to use backup root at mount time
>>> BTRFS info (device sdc1): disk space caching is enabled
>>> BTRFS info (device sdc1): has skinny extents
>>> BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
>>> fs_devices total_rw_bytes 601023424
>>
>> There is a recent feature added to "btrfs rescue" to fix this kind of
>> condition: https://patchwork.kernel.org/patch/10011399/
>>
>> You need a recent version of the Btrfs tools for it, not sure which, I see
>> that it's not in version 4.13 but is present in 4.17.
> 
> btrfs-progs_4.17-1~bpo9+1_amd64.changes is waiting to pass through NEW
> (due to the new library packages in the backport).  It will take
> between 24h and several weeks to pass to stretch-backports.  The
> variability in expected delivery is because a Debian FTP Master will
> need to manually ACK the changes.
> 
> If 4.17.1 is required, please 'reportbug btrfs-progs' and take care to
> set the "found" version to 4.17-1, plus provide a citation for why
> 4.17.1 is required--without this justification the bug severity could
> be downgraded to wishlist.  This hypothetical bug would be for the
> "unstable" suite (packages migrate to testing after a period of
> testing, and then become eligible for backporting).
> 
> Take care!
> Nicholas
> 


-- 


Re: [PATCH] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-14 Thread Liu Bo
On Fri, Sep 14, 2018 at 03:14:33PM +0200, David Sterba wrote:
> On Wed, Sep 12, 2018 at 06:06:22AM +0800, Liu Bo wrote:
> > As VFS has called ->invalidatepage() to get all ordered extents done
> > and truncated all page cache pages, wait_ordered_range() is just a
> > noop.
> 
> Agreed, though looking up the exact points when there are no pages to be
> waited for took me some time. More references and hints in the changelog
> would be good. I'll add the patch to misc-next so it can be tested, but
> please send me an update if you have idea how to rephrase the changelog.
> Thanks.
> 
> Reviewed-by: David Sterba 

OK, I'll add some call stack; hopefully that makes it better.

thanks,
-liubo


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-14 Thread Liu Bo
On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > > > more efficient to do this with rb_first_cached() which has the O(1)
> > > > complexity.
> > > 
> > > We'd like to see some numbers, though just by reading the code I think
> > > there's going to be a noticeable improvement for some workloads.
> > > 
> > > There's a common pattern:
> > > 
> > > while(node = rb_first) {
> > >   entry = rb_entry(node)
> > >   next = rb_next(node)
> > >   rb_erase(node)
> > >   cleanup(entry)
> > > }
> > > 
> > > rb_first needs to traverse the tree up to logN depth, rb_erase can
> > > completely reshuffle the tree. With the caching we'll skip the traversal
> > > in rb_first. That's a cached memory access vs looped pointer
> > > dereference trade-off that IMHO has a clear winner.
> > > 
> > > As the pattern in many cases traverses the whole tree and removes all
> > > entries, ideally we could get rid of the rebalancing too. The entries
> > > would be cleaned up in postorder way, ie. reset the data and kfree in
> > > the end. This requires that the order of the freeing does not matter, which
> > > might not apply to some trees.
> > 
> > The order of freeing does not matter, but we need the tree to return
> > the correct next node to free, IOW, we have to maintain the rbtree
> > until the last two nodes.
> > 
> > > 
> > > I did not find suitable rbtree functions or helpers for that,
> > > rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> > > rb_erase itself forces the rebalancing internally. But I think we can
> > > implement that.
> > > 
> > > We could possibly use rb_next_postorder, which makes sure that all
> > > children are traversed before the parent, so this is at least something
> > > that could be considered.
> > > 
> > 
> > Qu was inquiring about the same thing, I haven't got time to dig it
> > further.
> > 
> > The question we need to answer is whether we care about how fast
> > destruction can be, as some are in critical paths while some are
> > not.
> 
> Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> the more efficient tree destruction would shorten the time where a lock
> is held.
> 
> In other cases it might complicate the code too much, though the
> performance is not critical, eg. at umount time. Although, it'd be good
> to optimize that too now that we know how.
> 
> And in the remaining cases it's not possible to avoid rb_erase. I had a
> closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> Based on that I'd like to add this series to for-next, the first node
> caching is clear enough and safe. We can do the other optimizations
> later.

This needs more fine-grained debugging to see where the cost
lies.

About the perf. number of rb_cached_node, I measured the time spent in
extent map loop in btrfs_evict_inode(),

evict_inode_truncate_pages()
while (!RB_EMPTY_ROOT(&map_tree->map)) {
 rb_first();
 rb_erase();
}


for an em tree containing 10,000 ems,
- with rb_first_cached, the loop takes 4880454 ns,
- with rb_first, the loop takes 4584148 ns,

looks like the difference is not that much, ~6%.  After all it's about
scalability: the bigger the rb tree gets, the more rb_first_cached() wins.
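For context, the cached variant of that destruction loop would look
roughly like this kernel-style sketch (not standalone code; field and
helper names follow 2018-era btrfs and the kernel's rb_root_cached API):

```c
/* Same destruction loop with the cached-leftmost rbtree API. */
struct rb_root_cached map = RB_ROOT_CACHED;

while (!RB_EMPTY_ROOT(&map.rb_root)) {
	/* rb_first_cached() returns the cached leftmost node in O(1),
	 * instead of re-walking down the tree like rb_first(). */
	struct rb_node *node = rb_first_cached(&map);
	struct extent_map *em = rb_entry(node, struct extent_map, rb_node);

	rb_erase_cached(node, &map);	/* erase still rebalances the tree */
	free_extent_map(em);
}
```

The rebalancing inside rb_erase_cached() is what the postorder-traversal
idea discussed above would additionally avoid.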

thanks,
-liubo


Re: Problem with BTRFS

2018-09-14 Thread Nicholas D Steeves
Hi,

On Fri, Sep 14, 2018 at 10:46:12PM +0500, Roman Mamedov wrote:
> On Fri, 14 Sep 2018 19:27:04 +0200
> Rafael Jesús Alcántara Pérez  wrote:
> 
> > BTRFS info (device sdc1): use lzo compression, level 0
> > BTRFS warning (device sdc1): 'recovery' is deprecated, use
> > 'usebackuproot' instead
> > BTRFS info (device sdc1): trying to use backup root at mount time
> > BTRFS info (device sdc1): disk space caching is enabled
> > BTRFS info (device sdc1): has skinny extents
> > BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
> > fs_devices total_rw_bytes 601023424
> 
> There is a recent feature added to "btrfs rescue" to fix this kind of
> condition: https://patchwork.kernel.org/patch/10011399/
> 
> You need a recent version of the Btrfs tools for it, not sure which, I see
> that it's not in version 4.13 but is present in 4.17.

btrfs-progs_4.17-1~bpo9+1_amd64.changes is waiting to pass through NEW
(due to the new library packages in the backport).  It will take
between 24h and several weeks to pass to stretch-backports.  The
variability in expected delivery is because a Debian FTP Master will
need to manually ACK the changes.

If 4.17.1 is required, please 'reportbug btrfs-progs' and take care to
set the "found" version to 4.17-1, plus provide a citation for why
4.17.1 is required--without this justification the bug severity could
be downgraded to wishlist.  This hypothetical bug would be for the
"unstable" suite (packages migrate to testing after a period of
testing, and then become eligible for backporting).

Take care!
Nicholas




Re: Problem with BTRFS

2018-09-14 Thread Roman Mamedov
On Fri, 14 Sep 2018 19:27:04 +0200
Rafael Jesús Alcántara Pérez  wrote:

> BTRFS info (device sdc1): use lzo compression, level 0
> BTRFS warning (device sdc1): 'recovery' is deprecated, use
> 'usebackuproot' instead
> BTRFS info (device sdc1): trying to use backup root at mount time
> BTRFS info (device sdc1): disk space caching is enabled
> BTRFS info (device sdc1): has skinny extents
> BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
> fs_devices total_rw_bytes 601023424

There is a recent feature added to "btrfs rescue" to fix this kind of
condition: https://patchwork.kernel.org/patch/10011399/

You need a recent version of the Btrfs tools for it, not sure which, I see
that it's not in version 4.13 but is present in 4.17.
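The subcommand added by that patch series is invoked as below; the
device path is this thread's example, and BTRFS defaults to a dry-run
echo so the sketch only prints the command:

```shell
# Sketch: repair the super_total_bytes mismatch with btrfs-progs >= 4.14.
BTRFS="${BTRFS:-echo btrfs}"   # drop the echo to actually run it

# Run against the unmounted filesystem:
$BTRFS rescue fix-device-size /dev/sdc1
```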

-- 
With respect,
Roman


Problem with BTRFS

2018-09-14 Thread Rafael Jesús Alcántara Pérez
Hi:

After a power outage, my server refuses to mount a BTRFS file system.

This is the result of trying to mount with options recovery,ro:

BTRFS info (device sdc1): use lzo compression, level 0
BTRFS warning (device sdc1): 'recovery' is deprecated, use
'usebackuproot' instead
BTRFS info (device sdc1): trying to use backup root at mount time
BTRFS info (device sdc1): disk space caching is enabled
BTRFS info (device sdc1): has skinny extents
BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
fs_devices total_rw_bytes 601023424
BTRFS error (device sdc1): failed to read chunk tree: -22
BTRFS error (device sdc1): open_ctree failed

And this is the initial output of a "btrfs check --readonly":

  https://pastebin.com/6gs7hJTG

I'm planning to do a "btrfs rescue fix-device-size" but I don't know if
it can be dangerous, or if it's better to do a "btrfs check --repair" first.

Any recommendation?

Greets and thanks in advance.
Rafael J. Alcántara Pérez.
-- 


Re: Problem with BTRFS

2018-09-14 Thread Rafael Jesús Alcántara Pérez
Hi again:

Sorry, I missed some details :)

$ uname -a
Linux gemini 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1
(2018-08-27) x86_64 GNU/Linux

$ btrfs --version
btrfs-progs v4.13.3

$ sudo btrfs fi show
Label: '/srv/dedalo'  uuid: d1071744-ac3b-4926-bb43-27091ea73c05
Total devices 4 FS bytes used 2.57TiB
devid3 size 1.82TiB used 1.80TiB path /dev/sdd1
devid4 size 1.82TiB used 1.81TiB path /dev/sdc1
devid5 size 1.82TiB used 1.81TiB path /dev/sdb1
devid6 size 10.00GiB used 573.94MiB path /dev/loop0

I can't include "btrfs fi df" because I can't mount the file system.

I've attached the dmesg output to the message.

Greets and thanks again.
Rafael J. Alcántara Pérez.

El 14/09/18 a las 19:27, Rafael Jesús Alcántara Pérez escribió:
> Hi:
> 
> After a power outage, my server refuses to mount a BTRFS file system.
> 
> This is the result of trying to mount with options recovery,ro:
> 
> BTRFS info (device sdc1): use lzo compression, level 0
> BTRFS warning (device sdc1): 'recovery' is deprecated, use
> 'usebackuproot' instead
> BTRFS info (device sdc1): trying to use backup root at mount time
> BTRFS info (device sdc1): disk space caching is enabled
> BTRFS info (device sdc1): has skinny extents
> BTRFS error (device sdc1): super_total_bytes 601020864 mismatch with
> fs_devices total_rw_bytes 601023424
> BTRFS error (device sdc1): failed to read chunk tree: -22
> BTRFS error (device sdc1): open_ctree failed
> 
> And this is the initial output of a "btrfs check --readonly":
> 
>   https://pastebin.com/6gs7hJTG
> 
> I'm planning to do a "btrfs rescue fix-device-size" but I don't know if
> it can be dangerous, or if it's better to do a "btrfs check --repair" first.
> 
> Any recommendation?
> 
> Greets and thanks in advance.
> Rafael J. Alcántara Pérez.
> 


-- 
[0.00] Linux version 4.17.0-0.bpo.3-amd64 (debian-ker...@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)
[0.00] Command line: BOOT_IMAGE=/vmlinuz-4.17.0-0.bpo.3-amd64 root=UUID=e51e49de-ad23-436d-bb84-a50bb8ec8c94 ro quiet vsyscall=emulate consoleblank=0
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Enabled xstate features 0x3, context size is 576 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009] usable
[0.00] BIOS-e820: [mem 0x0010-0xbfb4] usable
[0.00] BIOS-e820: [mem 0xbfb5-0xbfb65fff] reserved
[0.00] BIOS-e820: [mem 0xbfb66000-0xbfb85bff] ACPI data
[0.00] BIOS-e820: [mem 0xbfb85c00-0xbfff] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00073fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.5 present.
[0.00] DMI: Dell Inc. PowerEdge 2900/0NX642, BIOS 2.7.0 10/30/2010
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x74 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C write-protect
[0.00]   D-EBFFF uncachable
[0.00]   EC000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 3F8000 write-back
[0.00]   1 base 008000 mask 3FC000 write-back
[0.00]   2 base 01 mask 3F write-back
[0.00]   3 base 02 mask 3E write-back
[0.00]   4 base 04 mask 3C write-back
[0.00]   5 base 00BFC0 mask 3FFFC0 uncachable
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.00] e820: update [mem 0xbfc0-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xbfb50 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fe710-0x000fe71f] mapped at [(ptrval)]
[0.00] Base memory trampoline at [(ptrval)] 99000 size 24576
[0.00] BRK [0x405e3f000, 0x405e3] PGTABLE
[0.00] BRK [0x405e4, 0x405e40fff] PGTABLE
[0.00] BRK [0x405e41000, 0x405e41fff] PGTABLE
[0.00] BRK [0x405e42000, 0x405e42fff] PGTABLE
[0.00] BRK [0x405e43000, 0x405e43fff] PGTABLE
[0.00] BRK [0x405e44000, 0x405e44fff] PGTABLE

Re: [PATCH 05/11] btrfs: dev-replace: avoid useless lock on error handling path

2018-09-14 Thread David Sterba
On Fri, Sep 07, 2018 at 04:55:11PM +0200, David Sterba wrote:
> The exit sequence in btrfs_dev_replace_start does not allow us to simply
> add a label at the right place so that the error handling after a
> transaction start failure can jump there. Currently there's a lock that
> pairs with the unlock in the section, which is unnecessary and only
> raises questions.  Add a variable to track the locking status and avoid
> the extra locking.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/dev-replace.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 8b512dddf727..0a49843b2ee6 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -400,6 +400,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>   int ret;
>   struct btrfs_device *tgt_device = NULL;
>   struct btrfs_device *src_device = NULL;
> + bool need_unlock;
>  
>   ret = btrfs_find_device_by_devspec(fs_info, srcdevid,
>   srcdev_name, &src_device);
> @@ -424,6 +425,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>   return PTR_ERR(trans);
>   }
>  
> + need_unlock = true;
>   btrfs_dev_replace_write_lock(dev_replace);
>   switch (dev_replace->replace_state) {
>   case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> @@ -462,6 +464,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>   atomic64_set(&dev_replace->num_write_errors, 0);
>   atomic64_set(&dev_replace->num_uncorrectable_read_errors, 0);
>   btrfs_dev_replace_write_unlock(dev_replace);
> + need_unlock = false;
>  
>   ret = btrfs_sysfs_add_device_link(tgt_device->fs_devices, tgt_device);
>   if (ret)
> @@ -473,7 +476,6 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>   trans = btrfs_start_transaction(root, 0);
>   if (IS_ERR(trans)) {
>   ret = PTR_ERR(trans);
> - btrfs_dev_replace_write_lock(dev_replace);

There's a merge conflict with Jeff's patch
https://patchwork.kernel.org/patch/10591003/ that adds some dev-replace
state change, so the lock must be here. I'll update my patch to

@@ -474,6 +477,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
+   need_unlock = true;
btrfs_dev_replace_write_lock(dev_replace);
dev_replace->replace_state =
BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED;
---
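The shape of the fix, reduced to its essentials, is a boolean that tracks whether the lock is currently held so that a single exit path can unlock conditionally. A self-contained model of that pattern follows; the counter-based lock and all names are stand-ins, not the btrfs API:

```c
#include <assert.h>
#include <stdbool.h>

static int lock_depth;           /* stand-in for a real rwlock */

static void write_lock(void)   { lock_depth++; }
static void write_unlock(void) { lock_depth--; }

/* Sketch of the btrfs_dev_replace_start exit sequence: every error path
 * jumps to one label, and need_unlock records whether the lock is still
 * held at that point, so the label can pair the unlock correctly. */
static int start_replace(bool fail_early, bool fail_late)
{
	bool need_unlock = false;
	int ret = 0;

	if (fail_early) {
		ret = -1;            /* lock never taken on this path */
		goto leave;
	}

	need_unlock = true;
	write_lock();
	/* ... state checks done under the lock ... */
	write_unlock();
	need_unlock = false;

	if (fail_late)
		ret = -2;            /* lock already dropped on this path */

leave:
	if (need_unlock)
		write_unlock();
	return ret;
}
```

On every path the lock/unlock calls balance, which is exactly what the shared label cannot guarantee without the tracking variable.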


[PATCH 4/4 v2] btrfs: tests: polish ifdefs around testing helper

2018-09-14 Thread David Sterba
Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.

Though there could be no ifdef and an unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline helps to optimize out
any code that would depend on conditions using btrfs_is_testing.

As this is only for the testing code, drop unlikely().

Signed-off-by: David Sterba 
---

v2:
- remove unlikely
- simplify to: return test_bit(...)

 fs/btrfs/ctree.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 32d2fce4ac53..1656ada9200b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3708,17 +3708,17 @@ static inline int btrfs_defrag_cancelled(struct 
btrfs_fs_info *fs_info)
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_inode_set_ops(struct inode *inode);
 void btrfs_test_destroy_inode(struct inode *inode);
-#endif
 
 static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
 {
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-   if (unlikely(test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO,
- &fs_info->fs_state)))
-   return 1;
-#endif
+   return test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
+}
+#else
+static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
+{
return 0;
 }
+#endif
 
 static inline void cond_wake_up(struct wait_queue_head *wq)
 {
-- 
2.18.0



[PATCH 1/4 v2] btrfs: tests: add separate stub for find_lock_delalloc_range

2018-09-14 Thread David Sterba
The helper find_lock_delalloc_range is now conditionally built static,
depending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it and add a public wrapper for the helper needed by
tests.

Signed-off-by: David Sterba 
---

v2:
- add noinline_for_stack back
 fs/btrfs/ctree.h |  6 --
 fs/btrfs/extent_io.c | 13 -
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/tests/extent-io-tests.c | 10 +-
 4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..45b7029d0f23 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -41,12 +41,6 @@ extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
 struct btrfs_ordered_sum;
 
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-#define STATIC noinline
-#else
-#define STATIC static noinline
-#endif
-
 #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
 
 #define BTRFS_MAX_MIRRORS 3
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dd6faab02bb..93108b18b231 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1568,7 +1568,7 @@ static noinline int lock_delalloc_pages(struct inode 
*inode,
  *
  * 1 is returned if we find something, 0 if nothing was in the tree
  */
-STATIC u64 find_lock_delalloc_range(struct inode *inode,
+static noinline_for_stack u64 find_lock_delalloc_range(struct inode *inode,
struct extent_io_tree *tree,
struct page *locked_page, u64 *start,
u64 *end, u64 max_bytes)
@@ -1648,6 +1648,17 @@ STATIC u64 find_lock_delalloc_range(struct inode *inode,
return found;
 }
 
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+u64 btrfs_find_lock_delalloc_range(struct inode *inode,
+   struct extent_io_tree *tree,
+   struct page *locked_page, u64 *start,
+   u64 *end, u64 max_bytes)
+{
+   return find_lock_delalloc_range(inode, tree, locked_page, start, end,
+   max_bytes);
+}
+#endif
+
 static int __process_pages_contig(struct address_space *mapping,
  struct page *locked_page,
  pgoff_t start_index, pgoff_t end_index,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b4d03e677e1d..1a7fdcbca49b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -546,7 +546,7 @@ int free_io_failure(struct extent_io_tree *failure_tree,
struct extent_io_tree *io_tree,
struct io_failure_record *rec);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-noinline u64 find_lock_delalloc_range(struct inode *inode,
+u64 btrfs_find_lock_delalloc_range(struct inode *inode,
  struct extent_io_tree *tree,
  struct page *locked_page, u64 *start,
  u64 *end, u64 max_bytes);
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index d9269a531a4d..9e0f4a01be14 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -106,7 +106,7 @@ static int test_find_delalloc(u32 sectorsize)
set_extent_delalloc(&tmp, 0, sectorsize - 1, 0, NULL);
start = 0;
end = 0;
-   found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
+   found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
 &end, max_bytes);
if (!found) {
test_err("should have found at least one delalloc");
@@ -137,7 +137,7 @@ static int test_find_delalloc(u32 sectorsize)
set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, 0, NULL);
start = test_start;
end = 0;
-   found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
+   found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
 &end, max_bytes);
if (!found) {
test_err("couldn't find delalloc in our range");
@@ -171,7 +171,7 @@ static int test_find_delalloc(u32 sectorsize)
}
start = test_start;
end = 0;
-   found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
+   found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
 &end, max_bytes);
if (found) {
test_err("found range when we shouldn't have");
@@ -192,7 +192,7 @@ static int test_find_delalloc(u32 sectorsize)
set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, 0, NULL);
start = test_start;
end = 0;
-   found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
+   found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
 &end, max_bytes);

Re: [PATCH 0/3] btrfs: trim latency improvements

2018-09-14 Thread David Sterba
On Thu, Sep 06, 2018 at 05:18:13PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> This patch set fixes a few issues with trim.
> 
> 1) Fix device list iteration.  We're iterating the ->alloc_list while
>holding the device_list_mutex.  The ->alloc_list is protected by
>the chunk mutex and we don't want to hold it across the entire
>trim execution.  Instead, use the ->devices list, which is protected
>by the device_list_mutex.
> 
> 2) Skip trim on devices that don't support it.  Rather than letting
>the block layer reject it, bounce out early.
> 
> 3) Don't keep the commit_root_sem locked and the transaction pinned
>across the block layer component of trim.  We only need these to
>ensure the pending chunks list doesn't go away underneath us, so
>it's safe to drop across the trim itself.  Historically, this
>caused issues when fstrim and balance would run at the same time
>since balance would produce lots of transactions and would
>have to wait constantly, causing problems for everything else that
>wanted to start a transaction.
> 
> -Jeff
> ---
> 
> Jeff Mahoney (3):
>   btrfs: use ->devices list instead of ->alloc_list in btrfs_trim_fs
>   btrfs: don't attempt to trim devices that don't support it
>   btrfs: keep trim from interfering with transaction commits

The patches have been in for-next for some time, I'm moving them to
misc-next now and will probably forward them to the next rc (5) with
other trim fixes from Qu.

Please note that the commit subject in the Fixes: tag should be in
"...", I've fixed that.


Re: [PATCH 3/3] btrfs: keep trim from interfering with transaction commits

2018-09-14 Thread David Sterba
On Thu, Sep 06, 2018 at 05:18:16PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Commit 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
> fixed free space trimming, but introduced latency when it was running.
> This is due to it pinning the transaction using both an incremented
> refcount and holding the commit root sem for the duration of a single
> trim operation.
> 
> This was to ensure safety but it's unnecessary.  We already hold the
> chunk mutex so we know that the chunk we're using can't be allocated
> while we're trimming it.
> 
> In order to check against chunks allocated already in this transaction,
> we need to check the pending chunks list.  To do that safely without
> joining the transaction (or attaching and then having to commit it) we
> need to ensure that the dev root's commit root doesn't change underneath
> us and the pending chunks list stays around until we're done with it.
> 
> We can ensure the former by holding the commit root sem and the latter
> by pinning the transaction.  We do this now, but the critical section
> covers the trim operation itself and we don't need to do that.
> 
> This patch moves the pinning and unpinning logic into helpers and
> unpins the transaction after performing the search and check for
> pending chunks.
> 
> Limiting the critical section of the transaction pinning improves
> the latency substantially on slower storage (e.g. image files over NFS).
> 
> Fixes: 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
> Signed-off-by: Jeff Mahoney 

Reviewed-by: David Sterba 
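The change is essentially a narrowing of scope: the transaction pin (refcount plus commit_root_sem) now covers only the search and pending-chunk check, not the discard itself. A toy model of that flow, with a counter standing in for the pin (all names here are invented for illustration):

```c
#include <assert.h>

static int pinned;                 /* stand-in for refcount + sem */
static int max_pinned_during_trim; /* how "blocked" commits were */

static void pin(void)   { pinned++; }
static void unpin(void) { pinned--; }

static void slow_trim(void)
{
	/* The block-layer discard; with the fix it runs unpinned,
	 * so transaction commits are not held up behind it. */
	if (pinned > max_pinned_during_trim)
		max_pinned_during_trim = pinned;
}

/* Model of the fixed btrfs_trim_fs flow: pin, consult the commit root
 * and pending chunks list, unpin, then issue the slow trim. */
static void trim_one_chunk(void)
{
	pin();
	/* ... search commit root and check pending chunks ... */
	unpin();
	slow_trim();
}
```

Before the patch, the `unpin()` would sit after `slow_trim()`, so every commit started during the discard had to wait.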


Re: [PATCH 2/3] btrfs: don't attempt to trim devices that don't support it

2018-09-14 Thread David Sterba
On Thu, Sep 06, 2018 at 05:18:15PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> We check whether any device the file system is using supports discard
> in the ioctl call, but then we attempt to trim free extents on every
> device regardless of whether discard is supported.  Due to the way
> we mask off EOPNOTSUPP, we can end up issuing the trim operations
> on each free range on devices that don't support it, just wasting time.
> 
> Fixes: 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
> Signed-off-by: Jeff Mahoney 

Reviewed-by: David Sterba 
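The effect of the patch is easy to model: devices that cannot discard are skipped once, up front, instead of having every free range bounced by the block layer. A minimal sketch (the field and function names are invented, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_device {
	bool discard_supported;   /* in the kernel: queue discard capability */
	int  discards_issued;
};

/* Count how many discard requests a trim pass sends: devices that do
 * not support discard are skipped up front instead of having every
 * free range rejected (EOPNOTSUPP) by the block layer. */
static int trim_all(struct toy_device *devs, int ndev, int ranges_per_dev)
{
	int issued = 0;

	for (int i = 0; i < ndev; i++) {
		if (!devs[i].discard_supported)
			continue;             /* the early bounce-out */
		devs[i].discards_issued = ranges_per_dev;
		issued += ranges_per_dev;
	}
	return issued;
}
```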


Re: [PATCH 1/3] btrfs: use ->devices list instead of ->alloc_list in btrfs_trim_fs

2018-09-14 Thread David Sterba
On Thu, Sep 06, 2018 at 05:18:14PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> btrfs_trim_fs iterates over the fs_devices->alloc_list while holding
> the device_list_mutex.  The problem is that ->alloc_list is protected
> by the chunk mutex.  We don't want to hold the chunk mutex over
> the trim of the entire file system.  Fortunately, the ->dev_list
> list is protected by the dev_list mutex and while it will give us
> all devices, including read-only devices, we already just skip the
> read-only devices.  Then we can continue to take and release the chunk
> mutex while scanning each device.
> 
> Fixes: 499f377f49f (btrfs: iterate over unused chunk space in FITRIM)
> Signed-off-by: Jeff Mahoney 

Reviewed-by: David Sterba 


Re: [PATCH v3.1 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2018-09-14 Thread David Sterba
On Fri, Sep 07, 2018 at 02:16:24PM +0800, Qu Wenruo wrote:
> [BUG]
> fstrim on some btrfs only trims the unallocated space, not trimming any
> space in existing block groups.
> 
> [CAUSE]
> Before the fstrim_range is passed to btrfs_trim_fs(), it gets truncated
> to the range [0, super->total_bytes).
> So later btrfs_trim_fs() will only be able to trim block groups in the
> range [0, super->total_bytes).
> 
> While for btrfs, any bytenr aligned to the sector size is valid, since
> btrfs uses its own logical address space, there is nothing limiting the
> location where we put block groups.
> 
> For btrfs with routine balance, it's quite easy to relocate all block
> groups so that their bytenr starts beyond super->total_bytes.
> 
> In that case, btrfs will not trim existing block groups.
> 
> [FIX]
> Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
> can get the unmodified range, which is normally set to [0, U64_MAX].
> 
> Reported-by: Chris Murphy 
> Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim 
> ioctl")
> Cc:  # v4.0+
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 
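The bug is easiest to see with numbers: after balance, a block group's bytenr can sit beyond super->total_bytes in the logical address space, so clamping the fstrim range to total_bytes silently excludes it. A toy illustration (the sizes are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* Count block groups whose start bytenr falls inside the trim range
 * [start, end).  Models which groups btrfs_trim_fs would visit. */
static int groups_in_range(const uint64_t *bg_start, int n,
			   uint64_t start, uint64_t end)
{
	int hit = 0;

	for (int i = 0; i < n; i++)
		if (bg_start[i] >= start && bg_start[i] < end)
			hit++;
	return hit;
}
```

With a hypothetical total_bytes of 4096 and a relocated block group at bytenr 5000, the clamped range [0, 4096) misses the group while the unmodified [0, U64_MAX] range covers it — which is the behaviour the fix restores.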


Re: [PATCH v3.1 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-09-14 Thread David Sterba
On Fri, Sep 07, 2018 at 02:16:23PM +0800, Qu Wenruo wrote:
> + if (bg_failed)
> + btrfs_warn(fs_info, "failed to trim %llu block group(s)",
> +bg_failed);

> + if (dev_failed)
> + btrfs_warn(fs_info, "failed to trim %llu device(s)",
> +dev_failed);

Minor update, I've added bg_ret and dev_ret to the messages as this will
end up in the logs and could be useful for later analysis.


Re: [PATCH v3.1 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-09-14 Thread David Sterba
On Fri, Sep 07, 2018 at 02:16:23PM +0800, Qu Wenruo wrote:
> Function btrfs_trim_fs() doesn't handle errors in a consistent way, if
> error happens when trimming existing block groups, it will skip the
> remaining blocks and continue to trim unallocated space for each device.
> 
> And the return value will only reflect the final error from device
> trimming.
> 
> This patch will fix such behavior by:
> 
> 1) Recording last error from block group or device trimming
>So return value will also reflect the last error during trimming.
>Make developer more aware of the problem.
> 
> 2) Continuing trimming if we can
>If we failed to trim one block group or device, we could still try
>next block group or device.
> 
> 3) Report number of failures during block group and device trimming
>So it would be less noisy, but still gives user a brief summary of
>what's going wrong.
> 
> Such behavior can avoid confusion for case like failure to trim the
> first block group and then only unallocated space is trimmed.
> 
> Reported-by: Chris Murphy 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 

> - ret = btrfs_trim_block_group(cache,
> -  &group_trimmed,
> -  start,
> -  end,
> -  range->minlen);
> + ret = btrfs_trim_block_group(cache, &group_trimmed,
> + start, end, range->minlen);

No unrelated changes please, this just reformats the argument list.
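The error-handling strategy the patch describes — keep trimming, count the failures, and return the last error rather than aborting on the first — reduces to a simple loop shape. A sketch with hypothetical names, not the kernel code:

```c
#include <assert.h>

/* "Trim" a list of items where item i fails iff fail[i] is nonzero.
 * Mirrors the patch's behaviour: remember the last error, count how
 * many items failed, but keep going instead of bailing out early. */
static int trim_items(const int *fail, int n, int *nfailed)
{
	int ret = 0;

	*nfailed = 0;
	for (int i = 0; i < n; i++) {
		if (fail[i]) {
			ret = fail[i];   /* record last error, continue */
			(*nfailed)++;
		}
	}
	return ret;
}
```

The caller then gets both a brief summary (the failure count, suitable for a warning message) and a meaningful return value.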


Re: [PATCH] btrfs: fix error handling in btrfs_dev_replace_start

2018-09-14 Thread David Sterba
On Mon, Sep 10, 2018 at 07:47:29PM +0200, David Sterba wrote:
> On Thu, Sep 06, 2018 at 03:52:17PM -0400, je...@suse.com wrote:
> > From: Jeff Mahoney 
> > 
> > When we fail to start a transaction in btrfs_dev_replace_start,
> > we leave dev_replace->replace_state set to STARTED but clear
> > ->srcdev and ->tgtdev.  Later, that can result in an Oops in
> > btrfs_dev_replace_progress when having state set to STARTED or
> > SUSPENDED implies that ->srcdev is valid.
> > 
> > Also fix error handling when the state is already STARTED or
> > SUSPENDED while starting.  That, too, will clear ->srcdev and ->tgtdev
> > even though it doesn't own them.  This should be an impossible case to
> > hit since we should be protected by the BTRFS_FS_EXCL_OP bit being set.
> > Let's add an ASSERT there while we're at it.
> > 
> > Fixes: e93c89c1a (Btrfs: add new sources for device replace code)
> > Signed-off-by: Jeff Mahoney 
> > ---
> >  fs/btrfs/dev-replace.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> > index e2ba0419297a..0581c8570a05 100644
> > --- a/fs/btrfs/dev-replace.c
> > +++ b/fs/btrfs/dev-replace.c
> > @@ -445,6 +445,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info 
> > *fs_info,
> > break;
> > case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> > case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> > +   ASSERT(0);
> 
> I don't like it fixed that way. The asserts get compiled in
> conditionally, so the behaviour might be different depending on the
> config. If the cases are unreachable, then the two cases can be replaced
> by a default: and return EINVAL.

As this is a minor issue and a matter of cleanup, I'm going to merge the
patch as-is because it's a bugfix.


Re: [PATCH 0/8 V2] Add delayed-refs support to btrfs-progs

2018-09-14 Thread David Sterba
On Thu, Aug 16, 2018 at 04:10:27PM +0300, Nikolay Borisov wrote:
> Hello, 
> 
> Here is the second version of the delayed refs for progs support. The first 
> version can be found here [1]. I've taken into account all the feedback from 
> Misono and have verified the code is working and rebased it atop btrfs-progs
> 4.17.1.
> 
> Changes since v1: 
>  * Removed num_entries variable from delayed ref root
>  
>  * Added a patch to refactor btrfs_write_dirty_block_groups and subsequently 
>  changed when this function is called to fix an issue reported by Misono. I
>  verified that 'make test-fsck TEST_ENABLE_OVERRIDE=true 
> TEST_ARGS_CHECK=--mode=lowmem'
>  produces no errors
> 
>  * Added 2 patches which remove the newly added adapter functions at the
>  beginning of the series, following the wiring up of the delayed refs
>  infrastructure. The first one (dealing with __free_extent2) is trivial,
>  while the second one (for alloc_reserved_tree_block2) is a bit more
>  involved, since I've opted to merge the two functions.
> 
>  * Rebased atop latest btrfs-progs release - 4.17.1
> 
>  * Dropped patches which have been merged in the mean time
> 
> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg79173.html
> 
> Nikolay Borisov (8):
>   btrfs-progs: Add __free_extent2 function
>   btrfs-progs: Add alloc_reserved_tree_block2 function
>   btrfs-progs: Add delayed refs infrastructure
>   btrfs-progs: Make btrfs_write_dirty_block_groups take only trans
> argument
>   btrfs-progs: Wire up delayed refs
>   btrfs-progs: Remove old delayed refs infrastructure
>   btrfs-progs: Remove __free_extent2
>   btrfs-progs: Merge alloc_reserved_tree_block(2|)

For the record, the v2 patchset has been merged to devel.


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-14 Thread David Sterba
On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > > more efficient to do this with rb_first_cached() which has the O(1)
> > > complexity.
> > 
> > We'd like to see some numbers, though just by reading the code I think
> > there's going to be a noticeable improvement for some workloads.
> > 
> > There's a common pattern:
> > 
> > while(node = rb_first) {
> > entry = rb_entry(node)
> > next = rb_next(node)
> > rb_erase(node)
> > cleanup(entry)
> > }
> > 
> > rb_first needs to traverse the tree up to logN depth, rb_erase can
> > completely reshuffle the tree. With the caching we'll skip the traversal
> > in rb_first. That's a cached memory access vs looped pointer
> > dereference trade-off that IMHO has a clear winner.
> > 
> > As the pattern in many cases traverses the whole tree and removes all
> > entries, ideally we could get rid of the rebalancing too. The entries
> > would be cleaned up in postorder way, ie. reset the data and kfree in
> > the end. This requires that order of the freeing does not matter, which
> > might not apply to some trees.
> 
> The order of freeing does not matter, but we need the tree to return
> the correct next node to free, IOW, we have to maintain the rbtree
> until the last two nodes.
> 
> > 
> > I did not find suitable rbtree functions or helpers for that,
> > rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> > rb_erase itself forces the rebalancing internally. But I think we can
> > implement that.
> > 
> > We could possibly use rb_next_postorder that makes sure that all
> > children are traversed before the parent so this is at least something
> > that could considerd.
> > 
> 
> Qu was inquiring about the same thing, I haven't got time to dig it
> further.
> 
> The question we need to answer is that whether we care about how fast
> destruction can make, as some are in critical paths while some are
> not.

Yeah, I take __btrfs_return_cluster_to_free_space as an example where
the more efficient tree destruction would shorten the time where a lock
is held.

In other cases it might complicate the code too much, though the
performance is not critical, eg. at umount time. Although, it'd be good
to optimize that too now that we know how.

And in the remaining cases it's not possible to avoid rb_erase. I had a
closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
Based on that I'd like to add this series to for-next, the first node
caching is clear enough and safe. We can do the other optimizations
later.
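For readers unfamiliar with the cached-first idea: the tree keeps a pointer to its leftmost node, updated on insert and erase, so first() is O(1) instead of a walk down the left spine on every loop iteration. A self-contained miniature of the technique follows — an unbalanced BST rather than an rbtree, so there is no rebalancing, only the caching logic:

```c
#include <assert.h>
#include <stddef.h>

/* A minimal unbalanced BST with a cached leftmost pointer, modelling
 * what rb_first_cached()/rb_root_cached give rbtrees. */
struct node {
	long key;
	struct node *left, *right, *parent;
};

struct cached_tree {
	struct node *root;
	struct node *leftmost;   /* the O(1) cache */
};

static void ct_insert(struct cached_tree *t, struct node *n)
{
	struct node **link = &t->root, *parent = NULL;
	int is_leftmost = 1;

	n->left = n->right = NULL;
	while (*link) {
		parent = *link;
		if (n->key < parent->key) {
			link = &parent->left;
		} else {
			link = &parent->right;
			is_leftmost = 0;    /* went right at least once */
		}
	}
	n->parent = parent;
	*link = n;
	if (is_leftmost)
		t->leftmost = n;            /* O(1) cache update */
}

static struct node *ct_first(struct cached_tree *t)
{
	return t->leftmost;                 /* no left-spine descent */
}

/* Erase the leftmost node (the common drain pattern); the new leftmost
 * is the right child's leftmost descendant, or else the parent. */
static void ct_erase_first(struct cached_tree *t)
{
	struct node *n = t->leftmost;

	if (!n)
		return;
	/* the leftmost node has no left child by definition */
	if (n->parent)
		n->parent->left = n->right;
	else
		t->root = n->right;
	if (n->right) {
		n->right->parent = n->parent;
		t->leftmost = n->right;
		while (t->leftmost->left)
			t->leftmost = t->leftmost->left;
	} else {
		t->leftmost = n->parent;
	}
}
```

In the drain loops discussed above the win is that each pass of "first, erase, cleanup" no longer pays the logN descent for "first".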


Re: [PATCH] btrfs-progs: change filename limit to 255 when creating subvolume

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 03:39:03PM +0800, Su Yanjun wrote:
> Modify the file name length limit to meet the Linux naming convention.
> In addition, the file name length is always bigger than 0, so there is
> no need to compare it with 0 again.
> 
> Issue: #145
> Signed-off-by: Su Yanjun 

Looks good, please send a test, thanks. You can copy portions of the
misc-tests/014-filesystem-label that does similar string length check
for the label.


Re: [PATCH] btrfs-progs: Continue checking even we found something wrong in free space cache

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 07:43:27AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2018/9/11 at 10:48 PM, David Sterba wrote:
> > On Tue, Sep 04, 2018 at 08:41:28PM +0800, Qu Wenruo wrote:
> >> No need to abort checking; especially for an RO check the free space
> >> cache result is meaningless, while the errors in the fs/extent trees
> >> are more interesting for developers.
> >>
> >> So continue checking even something in free space cache is wrong.
> >>
> >> Reported-by: Etienne Champetier 
> >> Signed-off-by: Qu Wenruo 
> >> ---
> >>  check/main.c | 1 -
> >>  1 file changed, 1 deletion(-)
> >>
> >> diff --git a/check/main.c b/check/main.c
> >> index b361cd7e26a0..4f720163221e 100644
> >> --- a/check/main.c
> >> +++ b/check/main.c
> >> @@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv)
> >>error("errors found in free space tree");
> >>else
> >>error("errors found in free space cache");
> > 
> > Please make it a warning and update the message so it says that it will
> > continue despite the errors found.
> 
> I'm fine to add warning, but isn't it the expected behavior?
> 
> In fact I'm quite surprised that when we found something wrong we just
> abort checking.
> 
> It's never the case for file/extent tree check. We always report all
> errors we found, not only the first error and exit.

You're right, in context of 'check' this is the expected behaviour. So
let's keep error().


Re: [PATCH 0/6] btrfs-progs: Variant fixes for fuzz-tests

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 07:59:34AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2018/9/12 at 12:38 AM, David Sterba wrote:
> > On Wed, Aug 29, 2018 at 01:27:39PM +0800, Qu Wenruo wrote:
> >> Gentle ping.
> >>
> >> These fixes are pretty small, I'd like to see them merged before I need
> >> to rebase them again and again.
> > 
> > I've merged them now, thanks. I had to edit all changelogs and remove
> > the - lines, shortened the paths in the test logs and fixed typos.
> 
> Thanks for the editing.
> 
> I'll keep the log format as default for all later patches.
> 
> BTW, is there any difference between 2 spaces indent and 1 space indent
> for log?

No, it's for some visual separation, so at first glance it's obvious
what's the changelog text and what are logs or commands to reproduce.
One space is OK too, as long as the log doesn't drift away.


Re: [PATCH 4/4] btrfs: tests: polish ifdefs around testing helper

2018-09-14 Thread David Sterba
On Tue, Sep 11, 2018 at 12:14:47PM -0700, Omar Sandoval wrote:
> > The unlikely can go away, sure.
> > 
> > I would still like to remove the test_bit call when tests are compiled
> > out. There are about 10 calls to btrfs_is_testing in various core
> > functions, followed by further statements. This would have a
> > (negligible) runtime penalty but generates effectively unused code on
> > production builds.
> > 
> > The static inline function returning 0 allows the compiler to optimize
> > out the unused code, so smaller code, fewer instructions, etc.
> 
> Absolutely, I just mean that the CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> version can be cleaner:
> 
> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
> {
> >   return test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
> }
> #else
> static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
> {
>   return 0;
> }
> #endif
> 
> I find `if (1) return 1; else return 0;` really icky.

I see what you mean, will fix it.


Re: [PATCH] Btrfs: skip setting path to blocking mode if balance is not needed

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 09:51:33AM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.09.2018 01:06, Liu Bo wrote:
> > balance_level() may return early in some cases, but these checks don't
> > have to be done with a blocking write lock.
> > 
> > This puts together these checks into a helper and the benefit is to
> > avoid switching spinning locks to blocking locks (in these particular
> > cases) which slows down btrfs overall.
> 
> Performance patches without numbers are frowned upon. You need to
> substantiate your claims.
> 
> 
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/ctree.c | 41 ++---
> >  1 file changed, 30 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 858085490e23..ba267a069ca1 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1758,6 +1758,29 @@ static void root_sub_used(struct btrfs_root *root, 
> > u32 size)
> > return eb;
> >  }
> >  
> > +static bool need_balance_level(struct btrfs_fs_info *fs_info,
> 
> nit: I think should_balance_level seems more readable, but it could be
> just me so won't insist on that.

Quick grep shows that should_ is used more frequently, so I'd go with
that.


Re: [PATCH] btrfs: Remove logically dead code from btrfs_orphan_cleanup

2018-09-14 Thread David Sterba
On Thu, Sep 13, 2018 at 11:35:00AM +0300, Nikolay Borisov wrote:
> In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
> executed. This is due to the last assignment to 'ret' depending on
> the return value of btrfs_iget. If an error other than -ENOENT is
> returned then the loop is prematurely terminated by 'goto out'.
> On the other hand, if the error value is ENOENT then a subsequent
> if branch is executed that always re-assigns 'ret' and in case it's
> an error just terminates the loop. No functional changes.
> 
> CID: 1437392
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 06:06:22AM +0800, Liu Bo wrote:
> As VFS has called ->invalidatepage() to get all ordered extents done
> and truncated all page cache pages, wait_ordered_range() is just a
> noop.

Agreed, though looking up the exact points when there are no pages to be
waited for took me some time. More references and hints in the changelog
would be good. I'll add the patch to misc-next so it can be tested, but
please send me an update if you have idea how to rephrase the changelog.
Thanks.

Reviewed-by: David Sterba 


Re: [PATCH v2] Btrfs: skip set_page_dirty if eb is dirty

2018-09-14 Thread David Sterba
On Fri, Sep 14, 2018 at 01:44:42AM +0800, Liu Bo wrote:
> As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
> are dirty, so no need to set pages dirty again.
> 
> Ftrace showed that the loop took 10us on my dev box, so removing this
> can save us at least 10us if eb is already dirty.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH v2] Btrfs: assert page dirty bit

2018-09-14 Thread David Sterba
On Fri, Sep 14, 2018 at 01:46:08AM +0800, Liu Bo wrote:
> Just in case someone breaks the rule that pages are dirty as long
> as eb is dirty.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

I've reordered the patches so the assert comes first, before the actual
conditional dirtying.


Re: [PATCH v2] Btrfs: remove level==0 check in balance_level

2018-09-14 Thread David Sterba
On Fri, Sep 14, 2018 at 01:55:59AM +0800, Liu Bo wrote:
> btrfs_search_slot()
>if (level != 0)
>   setup_nodes_for_search()
>   balance_level()
> 
> It is just impossible to have level=0 in balance_level.
> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: btrfs filesystem show takes a long time

2018-09-14 Thread David Sterba
Hi,

thanks for the report, I've forwarded it to the issue tracker
https://github.com/kdave/btrfs-progs/issues/148

The show command uses the information provided by blkid, which presumably
caches it. The default behaviour of 'fi show' is to skip mount checks,
so the delays are likely caused by blkid, but that's not the only
possible reason.

You may try to run the show command under strace to see where it blocks.


btrfs filesystem show takes a long time

2018-09-14 Thread Stefan K
Dear Maintainer,

the command btrfs fi show takes too much time:

time btrfs fi show  
Label: none  uuid: 513dc574-e8bc-4336-b181-00d1e9782c1c
Total devices 2 FS bytes used 2.34GiB
devid1 size 927.79GiB used 4.03GiB path /dev/sdv2
devid2 size 927.79GiB used 4.03GiB path /dev/sdar2

real12m59.763s
user0m0.008s
sys 0m0.044s


time btrfs fi show
Label: none  uuid: 513dc574-e8bc-4336-b181-00d1e9782c1c
Total devices 2 FS bytes used 2.34GiB
devid1 size 927.79GiB used 4.03GiB path /dev/sdv2
devid2 size 927.79GiB used 4.03GiB path /dev/sdar2

real6m22.498s
user0m0.012s
sys 0m0.024s


time btrfs fi show
Label: none  uuid: 513dc574-e8bc-4336-b181-00d1e9782c1c
Total devices 2 FS bytes used 2.34GiB
devid1 size 927.79GiB used 4.03GiB path /dev/sdv2
devid2 size 927.79GiB used 4.03GiB path /dev/sdar2

real6m19.796s
user0m0.012s
sys 0m0.024s


Maybe it's related to the number of hard disks in my system:
ls /dev/disk/by-path/ |grep -v part |wc -l
44

This is also a known Debian Bug #891717

-- System Information:
Debian Release: 9.3
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-5-amd64 (SMP w/20 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages btrfs-progs depends on:
ii  e2fslibs1.43.4-2
ii  libblkid1   2.29.2-1
ii  libc6   2.24-11+deb9u1
ii  libcomerr2  1.43.4-2
ii  liblzo2-2   2.08-1.2+b2
ii  libuuid12.29.2-1
ii  zlib1g  1:1.2.8.dfsg-5


[PATCH v1.1] btrfs-progs: Do metadata prealloc as long as we're not modifying extent tree

2018-09-14 Thread Qu Wenruo
In GitHub issues, one user reported an unexpected ENOSPC error when
enabling datasum.
After some investigation, it looks like during ext2_saved/image
creation we can create a large file extent whose size can be up to 128M
(the max data extent size).

In that case, its csum will take at least 128K. In certain cases we need
to allocate extra metadata chunks to fulfill such a space requirement.

However, we only do metadata prealloc if we're reserving extents for fs
trees
(we use btrfs_root::ref_cows to determine whether we should do metadata
prealloc, and that member is only set for fs trees).

There is no explanation of why we only do metadata prealloc for fs
trees, but at least from my investigation, it could be related to
avoiding nested extent tree modification.

At least extent reservation for csum tree shouldn't be a problem with
metadata block group preallocation.

So change the metadata block group preallocation check from
"root->ref_cows" to "root->root_key.objectid !=
BTRFS_EXTENT_TREE_OBJECTID", and add a comment for it.

Issue: 123
Signed-off-by: Qu Wenruo 
---
changelog:
v1.1
  Fix stupid subject, from "csum tree" to "extent tree"
---
 extent-tree.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/extent-tree.c b/extent-tree.c
index 5d49af5a901e..bdf1b0e94c5f 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2652,7 +2652,12 @@ int btrfs_reserve_extent(struct btrfs_trans_handle 
*trans,
profile = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
}
 
-   if (root->ref_cows) {
+   /*
+* Do metadata preallocate if we're not modifying extent tree.
+* Allocating chunk while modify extent tree could lead to tranid
+* mismatch, as do_chunk_alloc() could commit transaction.
+*/
+   if (root->root_key.objectid != BTRFS_EXTENT_TREE_OBJECTID) {
if (!(profile & BTRFS_BLOCK_GROUP_METADATA)) {
ret = do_chunk_alloc(trans, info,
 num_bytes,
-- 
2.19.0



[PATCH] btrfs-progs: Do metadata prealloc as long as we're not modifying csum tree

2018-09-14 Thread Qu Wenruo
In GitHub issues, one user reported an unexpected ENOSPC error when
enabling datasum.
After some investigation, it looks like during ext2_saved/image
creation we can create a large file extent whose size can be up to 128M
(the max data extent size).

In that case, its csum will take at least 128K. In certain cases we need
to allocate extra metadata chunks to fulfill such a space requirement.

However, we only do metadata prealloc if we're reserving extents for fs
trees
(we use btrfs_root::ref_cows to determine whether we should do metadata
prealloc, and that member is only set for fs trees).

There is no explanation of why we only do metadata prealloc for fs
trees, but at least from my investigation, it could be related to
avoiding nested extent tree modification.

At least extent reservation for csum tree shouldn't be a problem with
metadata block group preallocation.

So change the metadata block group preallocation check from
"root->ref_cows" to "root->root_key.objectid !=
BTRFS_EXTENT_TREE_OBJECTID", and add a comment for it.

Issue: 123
Signed-off-by: Qu Wenruo 
---
 extent-tree.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/extent-tree.c b/extent-tree.c
index 5d49af5a901e..bdf1b0e94c5f 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2652,7 +2652,12 @@ int btrfs_reserve_extent(struct btrfs_trans_handle 
*trans,
profile = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
}
 
-   if (root->ref_cows) {
+   /*
+* Do metadata preallocate if we're not modifying extent tree.
+* Allocating chunk while modify extent tree could lead to tranid
+* mismatch, as do_chunk_alloc() could commit transaction.
+*/
+   if (root->root_key.objectid != BTRFS_EXTENT_TREE_OBJECTID) {
if (!(profile & BTRFS_BLOCK_GROUP_METADATA)) {
ret = do_chunk_alloc(trans, info,
 num_bytes,
-- 
2.19.0



Re: [PATCH 2/2] btrfs-progs: convert: Output meaningful error messages for create_image()

2018-09-14 Thread Nikolay Borisov



On 14.09.2018 10:25, Qu Wenruo wrote:
> When convert fails, the error message looks like:
> 
>   create btrfs filesystem:
>   blocksize: 4096
>   nodesize:  16384
>   features:  extref, skinny-metadata (default)
>   creating ext2 image file
>   ERROR: failed to create ext2_saved/image: -1
>   WARNING: an error occurred during conversion, filesystem is partially
>   created but not finalized and not mountable
> 
> We can only know that something went wrong during "ext2_saved/image" file
> creation, but are unable to tell what exactly went wrong.
> 
> This patch adds the following error messages to create_image() and
> its callees:
> 
> 1) Sanity test error
> 2) Csum calculation error
> 3) Free ino number allocation error
> 4) Inode creation error
> 5) Inode mode change error
> 6) Inode link error
> 
> With all these error messages, it should be pretty easy to locate the
> error without extra debugging.
> 
> Reported-by: Serhat Sevki Dincer 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  convert/main.c | 37 ++---
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/convert/main.c b/convert/main.c
> index 3736a14955d1..f100699b652c 100644
> --- a/convert/main.c
> +++ b/convert/main.c
> @@ -290,10 +290,16 @@ static int create_image_file_range(struct 
> btrfs_trans_handle *trans,
>   if (disk_bytenr) {
>   /* Check if the range is in a data block group */
>   bg_cache = btrfs_lookup_block_group(root->fs_info, bytenr);
> - if (!bg_cache)
> + if (!bg_cache) {
> + error("missing data block for bytenr %llu", bytenr);
>   return -ENOENT;
> - if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_DATA))
> + }
> + if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_DATA)) {
> + error(
> + "data bytenr %llu is covered by non-data block group %llu flags 0x%llu",
> +   bytenr, bg_cache->key.objectid, bg_cache->flags);
>   return -EINVAL;
> + }
>  
>   /* The extent should never cross block group boundary */
>   len = min_t(u64, len, bg_cache->key.objectid +
> @@ -310,8 +316,13 @@ static int create_image_file_range(struct 
> btrfs_trans_handle *trans,
>   if (ret < 0)
>   return ret;
>  
> - if (datacsum)
> + if (datacsum) {
>   ret = csum_disk_extent(trans, root, bytenr, len);
> + if (ret < 0)
> + error(
> + "failed to calculate csum for bytenr %llu len %llu, %s\n",
> +   bytenr, len, strerror(-ret));
> + }
>   *ret_len = len;
>   return ret;
>  }
> @@ -759,18 +770,30 @@ static int create_image(struct btrfs_root *root,
>  
>   ret = btrfs_find_free_objectid(trans, root, BTRFS_FIRST_FREE_OBJECTID,
>  );
> - if (ret < 0)
> + if (ret < 0) {
> + error("failed to find free objectid for root %llu, %s",
> + root->root_key.objectid, strerror(-ret));
>   goto out;
> + }
>   ret = btrfs_new_inode(trans, root, ino, 0400 | S_IFREG);
> - if (ret < 0)
> + if (ret < 0) {
> + error("failed to create new inode for root %llu, %s",
> + root->root_key.objectid, strerror(-ret));
>   goto out;
> + }
>   ret = btrfs_change_inode_flags(trans, root, ino, flags);
> - if (ret < 0)
> + if (ret < 0) {
> + error("failed to change inode flag for ino %llu root %llu, %s",
> + ino, root->root_key.objectid, strerror(-ret));
>   goto out;
> + }
>   ret = btrfs_add_link(trans, root, ino, BTRFS_FIRST_FREE_OBJECTID, name,
>strlen(name), BTRFS_FT_REG_FILE, NULL, 1, 0);
> - if (ret < 0)
> + if (ret < 0) {
> + error("failed to link ino %llu to '/%s' in root %llu, %s",
> + ino, name, root->root_key.objectid, strerror(-ret));
>   goto out;
> + }
>  
>   key.objectid = ino;
>   key.type = BTRFS_INODE_ITEM_KEY;
> 


Re: [PATCH 1/2] btrfs: convert: Make read_disk_extent() return more meaningful -EIO instead of -1

2018-09-14 Thread Nikolay Borisov



On 14.09.2018 10:25, Qu Wenruo wrote:
> When pread64() returns a value smaller than expected, it normally means
> EIO, so just return -EIO in place of the intermediate number.
> This way, when IO fails, we get a more meaningful error number
> than EPERM.
> 
> Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 

> ---
>  convert/source-fs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/convert/source-fs.c b/convert/source-fs.c
> index b6d08370428a..5660a22cc652 100644
> --- a/convert/source-fs.c
> +++ b/convert/source-fs.c
> @@ -201,7 +201,7 @@ int read_disk_extent(struct btrfs_root *root, u64 bytenr,
>   ret = 0;
>  fail:
>   if (ret > 0)
> - ret = -1;
> + ret = -EIO;
>   return ret;
>  }
>  
> 


[PATCH 2/2] btrfs-progs: convert: Output meaningful error messages for create_image()

2018-09-14 Thread Qu Wenruo
When convert fails, the error message looks like:

  create btrfs filesystem:
  blocksize: 4096
  nodesize:  16384
  features:  extref, skinny-metadata (default)
  creating ext2 image file
  ERROR: failed to create ext2_saved/image: -1
  WARNING: an error occurred during conversion, filesystem is partially
  created but not finalized and not mountable

We can only know that something went wrong during "ext2_saved/image" file
creation, but are unable to tell what exactly went wrong.

This patch adds the following error messages to create_image() and
its callees:

1) Sanity test error
2) Csum calculation error
3) Free ino number allocation error
4) Inode creation error
5) Inode mode change error
6) Inode link error

With all these error messages, it should be pretty easy to locate the
error without extra debugging.

Reported-by: Serhat Sevki Dincer 
Signed-off-by: Qu Wenruo 
---
 convert/main.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 3736a14955d1..f100699b652c 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -290,10 +290,16 @@ static int create_image_file_range(struct 
btrfs_trans_handle *trans,
if (disk_bytenr) {
/* Check if the range is in a data block group */
bg_cache = btrfs_lookup_block_group(root->fs_info, bytenr);
-   if (!bg_cache)
+   if (!bg_cache) {
+   error("missing data block for bytenr %llu", bytenr);
return -ENOENT;
-   if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_DATA))
+   }
+   if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_DATA)) {
+   error(
+   "data bytenr %llu is covered by non-data block group %llu flags 0x%llu",
+ bytenr, bg_cache->key.objectid, bg_cache->flags);
return -EINVAL;
+   }
 
/* The extent should never cross block group boundary */
len = min_t(u64, len, bg_cache->key.objectid +
@@ -310,8 +316,13 @@ static int create_image_file_range(struct 
btrfs_trans_handle *trans,
if (ret < 0)
return ret;
 
-   if (datacsum)
+   if (datacsum) {
ret = csum_disk_extent(trans, root, bytenr, len);
+   if (ret < 0)
+   error(
+   "failed to calculate csum for bytenr %llu len %llu, %s\n",
+ bytenr, len, strerror(-ret));
+   }
*ret_len = len;
return ret;
 }
@@ -759,18 +770,30 @@ static int create_image(struct btrfs_root *root,
 
ret = btrfs_find_free_objectid(trans, root, BTRFS_FIRST_FREE_OBJECTID,
   );
-   if (ret < 0)
+   if (ret < 0) {
+   error("failed to find free objectid for root %llu, %s",
+   root->root_key.objectid, strerror(-ret));
goto out;
+   }
ret = btrfs_new_inode(trans, root, ino, 0400 | S_IFREG);
-   if (ret < 0)
+   if (ret < 0) {
+   error("failed to create new inode for root %llu, %s",
+   root->root_key.objectid, strerror(-ret));
goto out;
+   }
ret = btrfs_change_inode_flags(trans, root, ino, flags);
-   if (ret < 0)
+   if (ret < 0) {
+   error("failed to change inode flag for ino %llu root %llu, %s",
+   ino, root->root_key.objectid, strerror(-ret));
goto out;
+   }
ret = btrfs_add_link(trans, root, ino, BTRFS_FIRST_FREE_OBJECTID, name,
 strlen(name), BTRFS_FT_REG_FILE, NULL, 1, 0);
-   if (ret < 0)
+   if (ret < 0) {
+   error("failed to link ino %llu to '/%s' in root %llu, %s",
+   ino, name, root->root_key.objectid, strerror(-ret));
goto out;
+   }
 
key.objectid = ino;
key.type = BTRFS_INODE_ITEM_KEY;
-- 
2.19.0



[PATCH 1/2] btrfs: convert: Make read_disk_extent() return more meaningful -EIO instead of -1

2018-09-14 Thread Qu Wenruo
When pread64() returns a value smaller than expected, it normally means
EIO, so just return -EIO in place of the intermediate number.
This way, when IO fails, we get a more meaningful error number than
EPERM.

Signed-off-by: Qu Wenruo 
---
 convert/source-fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/convert/source-fs.c b/convert/source-fs.c
index b6d08370428a..5660a22cc652 100644
--- a/convert/source-fs.c
+++ b/convert/source-fs.c
@@ -201,7 +201,7 @@ int read_disk_extent(struct btrfs_root *root, u64 bytenr,
ret = 0;
 fail:
if (ret > 0)
-   ret = -1;
+   ret = -EIO;
return ret;
 }
 
-- 
2.19.0



[PATCH 0/2] Enhanced error messages for btrfs-convert

2018-09-14 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/convert_error_messages
As usual, it's based on the latest stable tag (v4.17.1).

There is one error report for btrfs-convert where the error message looks
pretty meaningless:

  create btrfs filesystem:
  blocksize: 4096
  nodesize:  16384
  features:  extref, skinny-metadata (default)
  creating ext2 image file
  ERROR: failed to create ext2_saved/image: -1
  WARNING: an error occurred during conversion, filesystem is partially
  created but not finalized and not mountable

After some investigation, the problem turns out to be a read failure.
But the error number reported is the intermediate value (-1) returned
from read_disk_extent().

This patchset first fixes the intermediate return value of
read_disk_extent(), then adds more error messages to btrfs-convert (at
least the convert part) to make it easier to identify the problem.

In this particular case, it should output things like:

  create btrfs filesystem:
  blocksize: 4096
  nodesize:  16384
  features:  extref, skinny-metadata (default)
  creating ext2 image file
  ERROR: failed to calculate csum for bytenr 2732765184 len 4096, Input/output error
  ERROR: failed to create ext2_saved/image: -5
  WARNING: an error occurred during conversion, filesystem is partially
  created but not finalized and not mountable


Qu Wenruo (2):
  btrfs: convert: Make read_disk_extent() return more meaningful -EIO
instead of -1
  btrfs-progs: convert: Output meaningful error messages for
create_image()

 convert/main.c  | 37 ++---
 convert/source-fs.c |  2 +-
 2 files changed, 31 insertions(+), 8 deletions(-)

-- 
2.19.0



Re: [PATCH v2 4/7] btrfs-progs: lowmem: search key of root again after check_fs_root() under repair

2018-09-14 Thread Su Yue




On 09/14/2018 02:27 PM, Nikolay Borisov wrote:



On 14.09.2018 03:58, Su Yue wrote:



On 09/14/2018 07:37 AM, Qu Wenruo wrote:



On 2018/9/13 上午4:49, damenly...@gmail.com wrote:

From: Su Yue 

In check_fs_roots_lowmem(), we do search and follow the resulted path
to call check_fs_root(), then call btrfs_next_item() to check next
root.
However, if repair is enabled, the root tree can be cowed, and the
existing path can cause strange errors.

Solution:
    If repair, save the key before calling check_fs_root,
    search the saved key again before checking next root.


Both the reason and the solution look good.



Signed-off-by: Su Yue 
---
   check/mode-lowmem.c | 15 +++
   1 file changed, 15 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 89a304bbdd69..8fc9edab1d66 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4967,9 +4967,13 @@ int check_fs_roots_lowmem(struct btrfs_fs_info
*fs_info)
   }
     while (1) {
+    struct btrfs_key saved_key;
+
   node = path.nodes[0];
   slot = path.slots[0];
   btrfs_item_key_to_cpu(node, , slot);
+    if (repair)
+    saved_key = key;
   if (key.objectid > BTRFS_LAST_FREE_OBJECTID)
   goto out;
   if (key.type == BTRFS_ROOT_ITEM_KEY &&
@@ -5000,6 +5004,17 @@ int check_fs_roots_lowmem(struct btrfs_fs_info
*fs_info)
   err |= ret;
   }
   next:
+    /*
+ * Since root tree can be cowed during repair,
+ * here search the saved key again.
+ */
+    if (repair) {
+    btrfs_release_path();
+    ret = btrfs_search_slot(NULL, fs_info->tree_root,
+    _key, , 0, 0);
+    /* Repair never deletes trees, search must succeed. */
+    BUG_ON(ret);


But this doesn't look good to me.

Your assumption here is valid (at least for now), but it's possible that
some tree blocks get corrupted in a large root tree, and in that case,
we could still read part of the root tree, but btrfs_search_slot() could
still return -EIO for certain search key.

So I still prefer to do some error handling rather than BUG_ON(ret).


Okay, will try it.


Just to emphasize Qu's point - we should strive to remove existing
BUG_ON and should never introduce new ones. btrfs-progs is already quite
messy and we should be improving that.


Understood, thanks for your emphasis.

Su


Thanks,
Su

Thanks,
Qu


+    }
   ret = btrfs_next_item(tree_root, );
   if (ret > 0)
   goto out;

Re: btrfs convert problem

2018-09-14 Thread Qu Wenruo



On 2018/9/14 下午1:52, Nikolay Borisov wrote:
> 
> 
> On 14.09.2018 02:17, Qu Wenruo wrote:
>>
>>
>> On 2018/9/14 上午12:37, Nikolay Borisov wrote:
>>>
>>>
>>> On 13.09.2018 19:15, Serhat Sevki Dincer wrote:
> -1 seems to be EPERM, is your device write-protected, readonly or
> something like that ?

 I am able to use ext4 partition read/write, created a file and wrote
 in it, un/re mounted it, all is ok.
 The drive only has a microusb port and a little led light; no rw
 switch or anything like that..

>>>
>>> What might help is running btrfs convert under strace. So something like :
>>>
>>> sudo strace -f -o strace.log btrfs-convert /dev/sdb1 and then send the log.
>>>
>> strace would greatly help in this case.
>>
>> My guess is something wrong happened in migrate_super_block(), which
>> doesn't handle ret > 0 case well.
>> In that case, it means pwrite 4K doesn't finish in one call, which looks
>> pretty strange.
> 
> So Qu,
> 
> I'm not seeing any EPERM errors, though I see the following so you might
> be right:
> 
> openat(AT_FDCWD, "/dev/sdb1", O_RDWR) = 5
> 
> followed by a lot of preads, the last one of which is:
> 
> pread64(5, 0x557a0abd10b0, 4096, 2732765184) = -1 EIO (Input/output error)

With context, it's pretty easy to locate the problem.

It's a bad block on your device.

Please try to use the following command to verify it:

dd if=/dev/sdb1 of=/dev/null bs=1 count=4096 skip=2732765184

And it would be nicer to check kernel dmesg to see if there is clue there.

The culprit code is in read_disk_extent(), which resets ret to -1 when
an error happens.
I'll send patch(es) to fix it and enhance the error messages for
btrfs-convert.


There is a workaround to let btrfs-convert continue: you can use the
--no-datasum option to skip datasum generation, so btrfs-convert won't
try to read that data.
But that's just the ostrich algorithm.

I'd recommend reading all the files in the original fs to locate which
file(s) are affected, and backing up your data ASAP.

Thanks,
Qu

> 
> This is then followed by a lot of pwrites; the last 10 or so syscalls
> are:
> 
> pwrite64(5, "\220\255\262\244\0\0\0\0\0\0"..., 16384, 91127808) = 16384
> 
> pwrite64(5, "|IO\233\0\0\0\0\0\0"..., 16384, 91144192) = 16384
> 
> pwrite64(5, "x\2501n\0\0\0\0\0\0"..., 16384, 91160576) = 16384
> 
> pwrite64(5, "\252\254l)\0\0\0\0\0\0"..., 16384, 91176960) = 16384
> 
> pwrite64(5, "P\256\331\373\0\0\0\0\0\0"..., 16384, 91193344) = 16384
> 
> pwrite64(5, "\3\230\230+\0\0\0\0\0\0"..., 4096, 83951616) = 4096
> 
> write(2, "ERROR: ", 7) = 7
> 
> write(2, "failed to "..., 37) = 37
> 
> write(2, "\n", 1) = 1
> 
> close(4) = 0
> 
> write(2, "WARNING: ", 9) = 9
> 
> write(2, "an error o"..., 104) = 104
> 
> write(2, "\n", 1) = 1
> 
> exit_group(1) = ?
> 
> but looking at migrate_super_block it's not doing that many writes -
> just writing the sb at BTRFS_SUPER_INFO_OFFSET and then zeroing out
> 0..BTRFS_SUPER_INFO_OFFSET. Also the number of pwrites:
> grep -c pwrite ~/Downloads/btrfs-issue/convert-strace.log
> 198
> 
> So the failure is not originating from migrate_super_block.
> 
>>
>> Thanks,
>> Qu
>>


Re: [PATCH v2] Btrfs: skip set_page_dirty if eb is dirty

2018-09-14 Thread Nikolay Borisov



On 13.09.2018 20:44, Liu Bo wrote:
> As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
> are dirty, so no need to set pages dirty again.
> 
> Ftrace showed that the loop took 10us on my dev box, so removing this
> can save us at least 10us if eb is already dirty.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Nikolay Borisov 

> ---
> v2: Update the patch's description.
> 
>  fs/btrfs/extent_io.c | 11 +++
>  fs/btrfs/extent_io.h |  2 +-
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 628f1aef34b0..fb2bf50134a1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5165,11 +5165,11 @@ void clear_extent_buffer_dirty(struct extent_buffer 
> *eb)
>   WARN_ON(atomic_read(>refs) == 0);
>  }
>  
> -int set_extent_buffer_dirty(struct extent_buffer *eb)
> +bool set_extent_buffer_dirty(struct extent_buffer *eb)
>  {
>   int i;
>   int num_pages;
> - int was_dirty = 0;
> + bool was_dirty;
>  
>   check_buffer_tree_ref(eb);
>  
> @@ -5179,8 +5179,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
>   WARN_ON(atomic_read(>refs) == 0);
>   WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, >bflags));
>  
> - for (i = 0; i < num_pages; i++)
> - set_page_dirty(eb->pages[i]);
> + if (!was_dirty) {
> + for (i = 0; i < num_pages; i++)
> + set_page_dirty(eb->pages[i]);
> + }
> +
>   return was_dirty;
>  }
>  
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index b4d03e677e1d..f2ab42d57f02 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -479,7 +479,7 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, 
> unsigned long start,
>  void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long 
> start,
>   unsigned long pos, unsigned long len);
>  void clear_extent_buffer_dirty(struct extent_buffer *eb);
> -int set_extent_buffer_dirty(struct extent_buffer *eb);
> +bool set_extent_buffer_dirty(struct extent_buffer *eb);
>  void set_extent_buffer_uptodate(struct extent_buffer *eb);
>  void clear_extent_buffer_uptodate(struct extent_buffer *eb);
>  int extent_buffer_under_io(struct extent_buffer *eb);
> 


Re: [PATCH v2 4/7] btrfs-progs: lowmem: search key of root again after check_fs_root() under repair

2018-09-14 Thread Nikolay Borisov



On 14.09.2018 03:58, Su Yue wrote:
> 
> 
> On 09/14/2018 07:37 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/9/13 上午4:49, damenly...@gmail.com wrote:
>>> From: Su Yue 
>>>
>>> In check_fs_roots_lowmem(), we do search and follow the resulted path
>>> to call check_fs_root(), then call btrfs_next_item() to check next
>>> root.
>>> However, if repair is enabled, the root tree can be cowed, and the
>>> existing path can cause strange errors.
>>>
>>> Solution:
>>>    If repair, save the key before calling check_fs_root,
>>>    search the saved key again before checking next root.
>>
>> Both the reason and the solution look good.
>>
>>>
>>> Signed-off-by: Su Yue 
>>> ---
>>>   check/mode-lowmem.c | 15 +++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
>>> index 89a304bbdd69..8fc9edab1d66 100644
>>> --- a/check/mode-lowmem.c
>>> +++ b/check/mode-lowmem.c
>>> @@ -4967,9 +4967,13 @@ int check_fs_roots_lowmem(struct btrfs_fs_info
>>> *fs_info)
>>>   }
>>>     while (1) {
>>> +    struct btrfs_key saved_key;
>>> +
>>>   node = path.nodes[0];
>>>   slot = path.slots[0];
>>>   btrfs_item_key_to_cpu(node, , slot);
>>> +    if (repair)
>>> +    saved_key = key;
>>>   if (key.objectid > BTRFS_LAST_FREE_OBJECTID)
>>>   goto out;
>>>   if (key.type == BTRFS_ROOT_ITEM_KEY &&
>>> @@ -5000,6 +5004,17 @@ int check_fs_roots_lowmem(struct btrfs_fs_info
>>> *fs_info)
>>>   err |= ret;
>>>   }
>>>   next:
>>> +    /*
>>> + * Since root tree can be cowed during repair,
>>> + * here search the saved key again.
>>> + */
>>> +    if (repair) {
>>> +    btrfs_release_path();
>>> +    ret = btrfs_search_slot(NULL, fs_info->tree_root,
>>> +    _key, , 0, 0);
>>> +    /* Repair never deletes trees, search must succeed. */
>>> +    BUG_ON(ret);
>>
>> But this doesn't look good to me.
>>
>> Your assumption here is valid (at least for now), but it's possible that
>> some tree blocks get corrupted in a large root tree, and in that case,
>> we could still read part of the root tree, but btrfs_search_slot() could
>> still return -EIO for certain search key.
>>
>> So I still prefer to do some error handling rather than BUG_ON(ret).
>>
> Okay, will try it.

Just to emphasize Qu's point - we should strive to remove existing
BUG_ON and should never introduce new ones. btrfs-progs is already quite
messy and we should be improving that.

> 
> Thanks,
> Su
>> Thanks,
>> Qu
>>
>>> +    }
>>>   ret = btrfs_next_item(tree_root, );
>>>   if (ret > 0)
>>>   goto out;
>>>
>>
>>
> 
> 
> 


Re: [PATCH v2] Btrfs: assert page dirty bit

2018-09-14 Thread Nikolay Borisov



On 13.09.2018 20:46, Liu Bo wrote:
> Just in case that someone breaks the rule that pages are dirty as long
> as eb is dirty.
> 
> Signed-off-by: Liu Bo 
Reviewed-by: Nikolay Borisov 
> ---
> v2: fix typo of CONFIG_BTRFS_DEBUG
> 
>  fs/btrfs/extent_io.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fb2bf50134a1..f88231171009 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5184,6 +5184,11 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
>   set_page_dirty(eb->pages[i]);
>   }
>  
> +#ifdef CONFIG_BTRFS_DEBUG
> + for (i = 0; i < num_pages; i++)
> + ASSERT(PageDirty(eb->pages[i]));
> +#endif
> +
>   return was_dirty;
>  }
>  
> 


Re: btrfs convert problem

2018-09-14 Thread Qu Wenruo



On 2018/9/14 下午1:52, Nikolay Borisov wrote:
> 
> 
> On 14.09.2018 02:17, Qu Wenruo wrote:
>>
>>
>> On 2018/9/14 上午12:37, Nikolay Borisov wrote:
>>>
>>>
>>> On 13.09.2018 19:15, Serhat Sevki Dincer wrote:
> -1 seems to be EPERM, is your device write-protected, readonly or
> something like that ?

 I am able to use ext4 partition read/write, created a file and wrote
 in it, un/re mounted it, all is ok.
 The drive only has a microusb port and a little led light; no rw
 switch or anything like that..

>>>
>>> What might help is running btrfs convert under strace. So something like :
>>>
>>> sudo strace -f -o strace.log btrfs-convert /dev/sdb1 and then send the log.
>>>
>> strace would greatly help in this case.
>>
>> My guess is something wrong happened in migrate_super_block(), which
>> doesn't handle ret > 0 case well.
>> In that case, it means pwrite 4K doesn't finish in one call, which looks
>> pretty strange.
> 
> So Qu,
> 
> I'm not seeing any EPERM errors, though I see the following so you might
> be right:
> 
> openat(AT_FDCWD, "/dev/sdb1", O_RDWR) = 5
> 
> followed by a lot of preads, the last one of which is:
> 
> pread64(5, 0x557a0abd10b0, 4096, 2732765184) = -1 EIO (Input/output error)

Would you please provide the full log?

This pread() looks like some data read.

But without context, it's a little hard to say.

Thanks,
Qu

> 
> This is then followed by a lot of pwrites; the last 10 or so syscalls
> are:
> 
> pwrite64(5, "\220\255\262\244\0\0\0\0\0\0"..., 16384, 91127808) = 16384
> 
> pwrite64(5, "|IO\233\0\0\0\0\0\0"..., 16384, 91144192) = 16384
> 
> pwrite64(5, "x\2501n\0\0\0\0\0\0"..., 16384, 91160576) = 16384
> 
> pwrite64(5, "\252\254l)\0\0\0\0\0\0"..., 16384, 91176960) = 16384
> 
> pwrite64(5, "P\256\331\373\0\0\0\0\0\0"..., 16384, 91193344) = 16384
> 
> pwrite64(5, "\3\230\230+\0\0\0\0\0\0"..., 4096, 83951616) = 4096
> 
> write(2, "ERROR: ", 7) = 7
> 
> write(2, "failed to "..., 37) = 37
> 
> write(2, "\n", 1) = 1
> 
> close(4) = 0
> 
> write(2, "WARNING: ", 9) = 9
> 
> write(2, "an error o"..., 104) = 104
> 
> write(2, "\n", 1) = 1
> 
> exit_group(1) = ?
> 
> but looking at migrate_super_block it's not doing that many writes -
> just writing the sb at BTRFS_SUPER_INFO_OFFSET and then zeroing out
> 0..BTRFS_SUPER_INFO_OFFSET. Also the number of pwrites:
> grep -c pwrite ~/Downloads/btrfs-issue/convert-strace.log
> 198
> 
> So the failure is not originating from migrate_super_block.
> 
>>
>> Thanks,
>> Qu
>>