Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Chris Murphy
On Tue, Nov 25, 2014 at 8:22 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> From my perspective, however, btrfs is simply incompatible with lvm
> snapshots, because the basic assumptions are incompatible.  Btrfs assumes
> UUIDs will be exactly what they say on the label, /unique/, while lvm's
> snapshot feature directly breaks that uniqueness by copying the (former)
> UUID, thus making the former UUID no longer unique and thus no longer
> truly UUID.

The seed device has a mechanism to change the volume UUID without
rewriting a bunch of stuff in the original; the gotcha is that it
requires adding a device.

man fsfreeze says "fsfreeze is unnecessary for device-mapper devices.
The device-mapper (and LVM) automatically freezes filesystem on the
device when a snapshot creation is requested." So if it's possible to
communicate snapshotting/freezing to the fs at snapshot time, then
maybe btrfs could 'btrfstune -S 1' the volume in the snapshot. That
way that snapshot actually contains a btrfs seed device, which is read
only. At least the snapshot copy isn't going to get obliterated in an
accident; even though most people would probably want the origin LV to
be protected while considering the snapshot disposable.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Chris Murphy
On Tue, Nov 25, 2014 at 7:11 PM, Zygo Blaxell  wrote:
> On Tue, Nov 25, 2014 at 03:46:32PM -0700, Chris Murphy wrote:
>> What happens when all btrfs LVs are unmounted, and you lvchange -an
>> the LVs (the pair) you do not want mounted; and then btrfs dev scan;
>> and then mount one of the devices? It should only find the matching LV
>> because the others are deactivated. I know this isn't ideal, but it's
>> better than corruption.
>
> This is one of two possible ways to assemble the btrfs correctly.
> The other is to explicitly name all of the devices when mounting.

OK, I didn't realize it was possible to explicitly name all of them.
The last time I tried this (about 9 epochs ago), mount didn't
understand being passed two devices before the mount point.

>
> The challenge for the poor end-user (or inexperienced sysadmin) is to
> defeat all the defaults in system installers, initramfs-tools, lvm2,
> udev, etc. to prevent btrfs from destroying a filesystem accidentally.

I agree if it finds two identical volumes it should fail to mount with
some coherent error.
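
To make the "fail with a coherent error" idea concrete, here is a minimal
sketch of a duplicate check over scanned devices. It is not the kernel's
device scanning code; struct scanned_dev, check_unique_devices() and the
device paths are hypothetical names invented for illustration. The only
assumption taken from this thread is that an LVM snapshot and its origin
present the same filesystem UUID and device id.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define FSID_SIZE 16

/* Hypothetical record of one scanned block device. */
struct scanned_dev {
	unsigned char fsid[FSID_SIZE];	/* filesystem UUID from the superblock */
	unsigned long long devid;	/* device id within that filesystem */
	const char *path;		/* e.g. "/dev/vg/origin" */
};

/*
 * Return 0 if every (fsid, devid) pair is unique, or -EEXIST if two
 * devices claim the same identity, as an LVM snapshot and its origin do.
 * On a clash, *a and *b point at the two conflicting entries so the
 * caller can name both paths in its error message.
 */
static int check_unique_devices(const struct scanned_dev *devs, size_t n,
				const struct scanned_dev **a,
				const struct scanned_dev **b)
{
	for (size_t i = 0; i < n; i++) {
		for (size_t j = i + 1; j < n; j++) {
			if (devs[i].devid == devs[j].devid &&
			    memcmp(devs[i].fsid, devs[j].fsid, FSID_SIZE) == 0) {
				*a = &devs[i];
				*b = &devs[j];
				return -EEXIST;
			}
		}
	}
	return 0;
}
```

A caller would refuse the mount on -EEXIST and print both paths, rather
than silently picking one of the two devices.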

-- 
Chris Murphy


Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Duncan
Goffredo Baroncelli posted on Tue, 25 Nov 2014 22:59:53 +0100 as
excerpted:

> However, I still don't understand why you want btrfs with multiple disks
> over LVM?

While I'm not an LVM person here, and he already replied with essentially 
the same point, I think it's worth repeating...

Btrfs' checksummed error detection and automatic rewrite from a different 
copy isn't a small thing, and simply isn't available at all with most 
would-be alternatives (zfs being the only similar thing I know of for 
Linux, and of course it has its own issues both technical and social/
legal/license).  That alone is worth running multi-device btrfs to get.  
That makes btrfs a near-mandatory part of the picture, whatever it's on.

And for people wanting LVM's volume management (including partitioning 
without many of the limitations), the direct result is multi-device btrfs 
on lvm.

From my perspective, however, btrfs is simply incompatible with lvm 
snapshots, because the basic assumptions are incompatible.  Btrfs assumes 
UUIDs will be exactly what they say on the label, /unique/, while lvm's 
snapshot feature directly breaks that uniqueness by copying the (former) 
UUID, thus making the former UUID no longer unique and thus no longer 
truly UUID.  Thus, part of the lvm /feature/ of snapshots is in direct 
contradiction to a basic assumption of btrfs, that UUIDs are exactly 
that, unique, making that feature directly incompatible with btrfs on a 
very basic level.

So people can have their btrfs on lvm, but if they do, they have to forgo 
LVM snapshots because btrfs isn't compatible with their usage.  To me 
it's as simple as that, and people can choose either btrfs or lvm 
snapshots, but not both, it's one XOR the other.  So for me it's simply 
choose the one you will have the most difficulty doing without and forgo 
the other one.  Not a problem, just make your choice and move on.

OTOH, there's that common signature about the reasonable man folding to 
the circumstance while the unreasonable man insists on folding the 
circumstance to his wishes instead, so progress depends on the 
unreasonable man...

But that's exactly what I see here, an unreasonable man insisting that 
entirely logical circumstance bend to his will.  Which, given someone to 
actually code it up, it might well do. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Apparent metadata corruption (file that simultaneously does/does not exist) on kernel 3.17.3

2014-11-25 Thread Daniel Miranda
Alright, I'll just have to understand how to build btrfs-progs now,
since I'm currently just using the packages from the Fedora repo.

Thanks for all the help and time spent so far,
Daniel


On Wed, Nov 26, 2014 at 12:41 AM, Qu Wenruo  wrote:
> Hi Daniel,
>
> With your btrfs-image dump, I tested the patchset I sent to the mailing
> list, and it succeeds in fixing the image.
>
> You can get the patchset and apply it on top of 3.17.2; --repair should
> then fix it.
> The file with the nlink error will be moved to the 'lost+found' dir.
>
> Although the best fix would be to just add the missing dir_index,
> the patchset currently does quite well and does not need any
> modification.
>
> The patchset can be extracted using patchwork:
> 0001: https://patchwork.kernel.org/patch/5364131/mbox/
> 0002: https://patchwork.kernel.org/patch/5364141/mbox/
> 0003: https://patchwork.kernel.org/patch/5364101/mbox/
> 0004 v2: https://patchwork.kernel.org/patch/5383611/mbox/
> 0005 v2: https://patchwork.kernel.org/patch/5383601/mbox/
> 0006: https://patchwork.kernel.org/patch/5364151/mbox
>
> Any feedback is welcomed to improve the patches.
>
> Thanks,
> Qu
>
>
>  Original Message 
> Subject: Re: Apparent metadata corruption (file that simultaneously
> does/does not exist) on kernel 3.17.3
> From: Daniel Miranda 
> To: Qu Wenruo 
> Date: 2014-11-25 15:42
>>
>> I just ran the repair but the ghost file has not disappeared,
>> unfortunately.
>>
>> On Tue, Nov 25, 2014 at 5:26 AM, Qu Wenruo 
>> wrote:
>>>
>>>  Original Message 
>>> Subject: Re: Apparent metadata corruption (file that simultaneously
>>> does/does not exist) on kernel 3.17.3
>>> From: Daniel Miranda 
>>> To: Qu Wenruo 
>>> Date: 2014-11-25 15:20

 Here are the logs. I'll send you a link to my dump directly after I
 finish uploading it. Please notify me when you have downloaded it so I
 can delete it.

 checking extents
 checking free space cache
 checking fs roots
 root 5 inode 17149868 errors 2000, link count wrong
   unresolved ref dir 17182377 index 245 namelen 8 name string.h
 filetype 1 errors 1, no dir item
>>>
>>> link count error seems resolved by Josef's patch commit already in
>>> 3.17.2.
>>> If using 3.17.2, josef's commit will rebuild the dir item and dir index.

 root 5 inode 17182377 errors 200, dir isize wrong
>>>
>>> This isize error seems caused by previous line.
>>> If 3.17.2 can repair above problem, it should not be a problem and will
>>> disappear.
>>>
>>> According to the above output, btrfsck --repair with btrfs-progs 3.17.2
>>> has
>>> a good chance repairing it.
>>> Just have a try.
>>>
>>> Thanks,
>>> Qu
>>>
 Checking filesystem on /dev/mapper/fedora_daniel--pc-root
 UUID: fef8f718-0622-4cb1-9597-749650d366a4
 found 55108022156 bytes used err is 1
 total csum bytes: 89787396
 total tree bytes: 2303455232
 total fs tree bytes: 2024841216
 total extent tree bytes: 145272832
 btree space waste bytes: 529672422
 file data blocks allocated: 253414481920
referenced 94127726592
 Btrfs v3.17


 Regards,
 Daniel

 On Tue, Nov 25, 2014 at 3:20 AM, Qu Wenruo 
 wrote:
>
>  Original Message 
> Subject: Re: Apparent metadata corruption (file that simultaneously
> does/does not exist) on kernel 3.17.3
> From: Daniel Miranda 
> To: Qu Wenruo 
> Date: 2014-11-25 13:14
>>
>> I'll go run that and get you the output.
>
> Thanks.
>>
>>
>> I can do the image dump, sure. I don't know how long it might take to
>> upload it somewhere though. Right now `btrfs fi df` shows about 2GiB
>> of metadata (it's a 120GiB volume). I'll see how large it ends up
>> after compression.
>
> A 120G volume seems quite small compared to the images I received
> recently (1T x2 RAID1 and 4T single).
> With '-c 9' it shouldn't be too huge, I think (the 1T RAID1 is about 1G
> of metadata with -c9).
>
> BTW, a btrfs-image dump contains all the filenames and the hierarchy,
> even without the file data, so it is better to think twice about your
> privacy before uploading.
>
> Thanks,
> Qu
>
>> Thanks for the quick response,
>> Daniel
>>
>> On Tue, Nov 25, 2014 at 3:10 AM, Qu Wenruo 
>> wrote:
>>>
>>> Hi,
>>>
>>> What's the btrfsck output? Without --repair option.
>>>
>>> Also, if it is OK for you, would you please dump the btrfs with
>>> 'btrfs-image' command?
>>> '-c 9' option is highly recommended considering the size of it.
>>> This will help developers a lot in testing the btrfsck repair
>>> function.
>>>
>>> Thanks,
>>> Qu
>>>
>>>
>>>  Original Message 
>>> Subject: Apparent metadata corruption (file that simultaneously
>>> does/does
>>> not exist) on kernel 3.17.3
>>> From: Daniel Miranda 

Re: [PATCH 3/3] Btrfs-progs, fsck: move root items repair after root rebuilding

2014-11-25 Thread Wang Shilong
Oops, sorry for sending this twice, just entered one more *.patch…


> 
> I sent these patches a long while ago, but for some reason they were
> not merged. They are important fixes for fsck: without them, extent
> tree rebuilding does not work with snapshots.
> 
> Also, the extent tree rebuild test in /tests/fsck-tests.sh would always
> detect the failure without these patches. Unluckily, it needs extra
> environment settings, so I suppose it was always 'NOTRUN'!
> 
> The last patch fixes a regression in root rebuilding: root rebuilding
> should come first, because if a root (e.g. the extent root) is
> corrupted, repairing root items won't succeed either.
> 
> patches are based on David's integration-20141125
> 
> Wang Shilong (3):
>  Btrfs-progs: fsck: deal with snapshot one by one when rebuilding
>extent tree
>  Btrfs-progs: fsck: add ability to rebuild extent tree with snapshots
>  Btrfs-progs, fsck: move root items repair after root rebuilding
> 
> cmds-check.c | 412 +++
> 1 file changed, 303 insertions(+), 109 deletions(-)
> 
> -- 
> 2.1.0
> 

Best Regards,
Wang Shilong



[PATCH 3/3] Btrfs-progs, fsck: move root items repair after root rebuilding

2014-11-25 Thread Wang Shilong
If some critical roots are corrupt, repair_root_items() will fail;
this is detected by fsck-tests.sh's extent rebuilding tests.

Signed-off-by: Wang Shilong 
---
 cmds-check.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index a102752..ae9005e 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7987,22 +7987,6 @@ int cmd_check(int argc, char **argv)
 
root = info->fs_root;
 
-   ret = repair_root_items(info);
-   if (ret < 0)
-   goto close_out;
-   if (repair) {
-   fprintf(stderr, "Fixed %d roots.\n", ret);
-   ret = 0;
-   } else if (ret > 0) {
-   fprintf(stderr,
-  "Found %d roots with an outdated root item.\n",
-  ret);
-   fprintf(stderr,
-   "Please run a filesystem check with the option --repair to fix them.\n");
-   ret = 1;
-   goto close_out;
-   }
-
/*
 * repair mode will force us to commit transaction which
 * will make us fail to load log tree when mounting.
@@ -8101,6 +8085,22 @@ int cmd_check(int argc, char **argv)
if (ret)
fprintf(stderr, "Errors found in extent allocation tree or chunk allocation\n");
 
+   ret = repair_root_items(info);
+   if (ret < 0)
+   goto close_out;
+   if (repair) {
+   fprintf(stderr, "Fixed %d roots.\n", ret);
+   ret = 0;
+   } else if (ret > 0) {
+   fprintf(stderr,
+  "Found %d roots with an outdated root item.\n",
+  ret);
+   fprintf(stderr,
+   "Please run a filesystem check with the option --repair to fix them.\n");
+   ret = 1;
+   goto close_out;
+   }
+
fprintf(stderr, "checking free space cache\n");
ret = check_space_cache(root);
if (ret)
-- 
2.1.0



[PATCH resend 1/3] Btrfs-progs: fsck: deal with snapshot one by one when rebuilding extent tree

2014-11-25 Thread Wang Shilong
From: Wang Shilong 

Previously, we dealt with node blocks first and then leaf blocks, which
maximizes readahead. However, to rebuild the extent tree, we need to deal with
snapshots one by one.

This patch makes us deal with snapshots one by one if we need to rebuild the
extent tree; otherwise we fall back to the previous way.
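
The "one root at a time" ordering described above can be pictured with a
small FIFO of root records, each drained completely before the next one
starts. This is only an illustrative sketch, not the code in this patch:
struct root_rec is a simplified stand-in for root_item_record, and
queue_root(), drain_roots(), struct visit_log and log_visit() are
hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's root_item_record. */
struct root_rec {
	unsigned long long objectid;
	unsigned long long bytenr;
	struct root_rec *next;
};

struct root_queue {
	struct root_rec *head;
	struct root_rec *tail;
};

/* Append a root at the tail so roots are processed in discovery order. */
static int queue_root(struct root_queue *q, unsigned long long objectid,
		      unsigned long long bytenr)
{
	struct root_rec *rec = malloc(sizeof(*rec));

	if (!rec)
		return -1;
	rec->objectid = objectid;
	rec->bytenr = bytenr;
	rec->next = NULL;
	if (q->tail)
		q->tail->next = rec;
	else
		q->head = rec;
	q->tail = rec;
	return 0;
}

/*
 * Pop roots one at a time and hand each to walk(), which is expected to
 * finish the whole tree of that root before returning. After the first
 * error no further roots are walked, but all records are still freed.
 * Returns the number of roots walked, or the negative value from walk().
 */
static int drain_roots(struct root_queue *q,
		       int (*walk)(const struct root_rec *rec, void *ctx),
		       void *ctx)
{
	int done = 0;
	int ret = 0;

	while (q->head) {
		struct root_rec *rec = q->head;

		q->head = rec->next;
		if (ret >= 0) {
			ret = walk(rec, ctx);
			if (ret >= 0)
				done++;
		}
		free(rec);
	}
	q->tail = NULL;
	return ret < 0 ? ret : done;
}

/* Example callback: records the order in which roots are visited. */
struct visit_log {
	unsigned long long ids[8];
	int n;
};

static int log_visit(const struct root_rec *rec, void *ctx)
{
	struct visit_log *log = ctx;

	if (log->n >= 8)
		return -1;
	log->ids[log->n++] = rec->objectid;
	return 0;
}
```

Queuing roots 5, 256 and 257 and draining with log_visit() visits them in
that order, which is the property the rebuild path relies on.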

Signed-off-by: Wang Shilong 
Signed-off-by: Wang Shilong 
---
 cmds-check.c | 248 +--
 1 file changed, 158 insertions(+), 90 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 24b6b59..ff2795d 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -128,10 +128,14 @@ struct inode_backref {
char name[0];
 };
 
-struct dropping_root_item_record {
+struct root_item_record {
struct list_head list;
-   struct btrfs_root_item ri;
-   struct btrfs_key found_key;
+   u64 objectid;
+   u64 bytenr;
+   u8 level;
+   u8 drop_level;
+   int level_size;
+   struct btrfs_key drop_key;
 };
 
 #define REF_ERR_NO_DIR_ITEM(1 << 0)
@@ -4618,7 +4622,7 @@ static int run_next_block(struct btrfs_trans_handle *trans,
  struct rb_root *dev_cache,
  struct block_group_tree *block_group_cache,
  struct device_extent_tree *dev_extent_cache,
- struct btrfs_root_item *ri)
+ struct root_item_record *ri)
 {
struct extent_buffer *buf;
u64 bytenr;
@@ -4851,11 +4855,8 @@ static int run_next_block(struct btrfs_trans_handle *trans,
size = btrfs_level_size(root, level - 1);
btrfs_node_key_to_cpu(buf, &key, i);
if (ri != NULL) {
-   struct btrfs_key drop_key;
-   btrfs_disk_key_to_cpu(&drop_key,
- &ri->drop_progress);
if ((level == ri->drop_level)
-   && is_dropped_key(&key, &drop_key)) {
+   && is_dropped_key(&key, &ri->drop_key)) {
continue;
}
}
@@ -4896,7 +4897,7 @@ static int add_root_to_pending(struct extent_buffer *buf,
   struct cache_tree *pending,
   struct cache_tree *seen,
   struct cache_tree *nodes,
-  struct btrfs_key *root_key)
+  u64 objectid)
 {
if (btrfs_header_level(buf) > 0)
add_pending(nodes, seen, buf->start, buf->len);
@@ -4905,13 +4906,12 @@ static int add_root_to_pending(struct extent_buffer *buf,
add_extent_rec(extent_cache, NULL, 0, buf->start, buf->len,
   0, 1, 1, 0, 1, 0, buf->len);
 
-   if (root_key->objectid == BTRFS_TREE_RELOC_OBJECTID ||
+   if (objectid == BTRFS_TREE_RELOC_OBJECTID ||
btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV)
add_tree_backref(extent_cache, buf->start, buf->start,
 0, 1);
else
-   add_tree_backref(extent_cache, buf->start, 0,
-root_key->objectid, 1);
+   add_tree_backref(extent_cache, buf->start, 0, objectid, 1);
return 0;
 }
 
@@ -6481,6 +6481,99 @@ static int check_devices(struct rb_root *dev_cache,
return ret;
 }
 
+static int add_root_item_to_list(struct list_head *head,
+ u64 objectid, u64 bytenr,
+ u8 level, u8 drop_level,
+ int level_size, struct btrfs_key *drop_key)
+{
+
+   struct root_item_record *ri_rec;
+   ri_rec = malloc(sizeof(*ri_rec));
+   if (!ri_rec)
+   return -ENOMEM;
+   ri_rec->bytenr = bytenr;
+   ri_rec->objectid = objectid;
+   ri_rec->level = level;
+   ri_rec->level_size = level_size;
+   ri_rec->drop_level = drop_level;
+   if (drop_key)
+   memcpy(&ri_rec->drop_key, drop_key, sizeof(*drop_key));
+   list_add_tail(&ri_rec->list, head);
+
+   return 0;
+}
+
+static int deal_root_from_list(struct list_head *list,
+  struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct block_info *bits,
+  int bits_nr,
+  struct cache_tree *pending,
+  struct cache_tree *seen,
+  struct cache_tree *reada,
+  struct cache_tree *nodes,
+  struct cache_tree *extent_cache,
+  struct cache_tree *chunk_cache,
+  struct rb_root *dev_cache,
+   

[PATCH 0/3] Extent tree rebuilding with snapshots patches

2014-11-25 Thread Wang Shilong
I sent these patches a long while ago, but for some reason they were
not merged. They are important fixes for fsck: without them, extent
tree rebuilding does not work with snapshots.

Also, the extent tree rebuild test in /tests/fsck-tests.sh would always
detect the failure without these patches. Unluckily, it needs extra
environment settings, so I suppose it was always 'NOTRUN'!

The last patch fixes a regression in root rebuilding: root rebuilding
should come first, because if a root (e.g. the extent root) is
corrupted, repairing root items won't succeed either.

patches are based on David's integration-20141125

Wang Shilong (3):
  Btrfs-progs: fsck: deal with snapshot one by one when rebuilding
extent tree
  Btrfs-progs: fsck: add ability to rebuild extent tree with snapshots
  Btrfs-progs, fsck: move root items repair after root rebuilding

 cmds-check.c | 412 +++
 1 file changed, 303 insertions(+), 109 deletions(-)

-- 
2.1.0



[PATCH resend 2/3] Btrfs-progs: fsck: add ability to rebuild extent tree with snapshots

2014-11-25 Thread Wang Shilong
From: Wang Shilong 

This patch makes it possible to rebuild a really corrupt extent tree with
snapshots. To implement this, we have to verify whether a block is FULL BACKREF.

This idea comes from Josef Bacik:

1) We walk down the original tree, every eb we encounter has
btrfs_header_owner(eb) == root->objectid.  We add normal references
for this root (BTRFS_TREE_BLOCK_REF_KEY) for this root.  World peace
is achieved.

2) We walk down the snapshotted tree.  Say we didn't change anything
at all, it was just a clean snapshot and then boom.  So the
btrfs_header_owner(root->node) == root->objectid, so normal backref.
We walk down to the next level, where btrfs_header_owner(eb) !=
root->objectid, but the level above did, so we add normal refs for all
of these blocks.  We go down the next level, now our
btrfs_header_owner(parent) != root->objectid and
btrfs_header_owner(eb) != root->objectid.  This is where we need to
now go back and see if btrfs_header_owner(eb) currently has a ref on
eb.  If it does we are done, move on to the next block in this same
level, we don't have to go further down.

3) Harder case, we snapshotted and then changed things in the original
root.  Do the same thing as in step 2, but now we get down to
btrfs_header_owner(eb) != root->objectid && btrfs_header_owner(parent)
!= root->objectid.  We lookup the references we have for eb and notice
that btrfs_header_owner(eb) no longer refers to eb.  So now we must
set FULL_BACKREF on this extent reference and add a
SHARED_BLOCK_REF_KEY for this eb using the parent->start as the
offset.  And we need to keep walking down and doing the same thing
until we either hit level 0 or btrfs_header_owner(eb) has a ref on the
block.
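
The three cases above boil down to a small per-block decision. The sketch
below restates that decision only; it is not calc_extent_flag() from this
patch. The enum, classify_block() and its boolean owner_has_ref argument
(standing in for "btrfs_header_owner(eb) currently has a ref on eb") are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum backref_action {
	BACKREF_NORMAL,		/* add a normal, root-keyed tree block ref */
	BACKREF_ALREADY_OWNED,	/* owner still references eb: stop descending */
	BACKREF_FULL,		/* set FULL_BACKREF, add a shared ref keyed by parent */
};

/*
 * root_objectid: the root being walked
 * eb_owner:      btrfs_header_owner(eb)
 * parent_owner:  btrfs_header_owner(parent)
 * owner_has_ref: whether eb_owner's root still holds a ref on eb
 */
static enum backref_action classify_block(uint64_t root_objectid,
					  uint64_t eb_owner,
					  uint64_t parent_owner,
					  bool owner_has_ref)
{
	/* Case 1, and the first shared level in case 2. */
	if (eb_owner == root_objectid || parent_owner == root_objectid)
		return BACKREF_NORMAL;
	/* Case 2: the original root still owns this subtree. */
	if (owner_has_ref)
		return BACKREF_ALREADY_OWNED;
	/* Case 3: the original root moved on; keep walking down. */
	return BACKREF_FULL;
}
```

For example, walking snapshot root 256 over a block owned by root 5 whose
parent is also owned by root 5 yields BACKREF_FULL only when root 5 no
longer holds a ref on that block.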

Signed-off-by: Wang Shilong 
Signed-off-by: Wang Shilong 
---
 cmds-check.c | 132 +--
 1 file changed, 129 insertions(+), 3 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ff2795d..a102752 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -112,6 +112,7 @@ struct extent_record {
unsigned int owner_ref_checked:1;
unsigned int is_root:1;
unsigned int metadata:1;
+   unsigned int flag_block_full_backref:1;
 };
 
 struct inode_backref {
@@ -4608,6 +4609,127 @@ static int is_dropped_key(struct btrfs_key *key,
return 0;
 }
 
+static int calc_extent_flag(struct btrfs_root *root,
+  struct cache_tree *extent_cache,
+  struct extent_buffer *buf,
+  struct root_item_record *ri,
+  u64 *flags)
+{
+   int i;
+   int nritems = btrfs_header_nritems(buf);
+   struct btrfs_key key;
+   struct extent_record *rec;
+   struct cache_extent *cache;
+   struct data_backref *dback;
+   struct tree_backref *tback;
+   struct extent_buffer *new_buf;
+   u64 owner = 0;
+   u64 bytenr;
+   u64 offset;
+   u64 ptr;
+   int size;
+   int ret;
+   u8 level;
+
+   /*
+* Except file/reloc tree, we can not have
+* FULL BACKREF MODE
+*/
+   if (ri->objectid < BTRFS_FIRST_FREE_OBJECTID)
+   goto normal;
+   /*
+* root node
+*/
+   if (buf->start == ri->bytenr)
+   goto normal;
+   if (btrfs_is_leaf(buf)) {
+   /*
+* we are searching from original root, world
+* peace is achieved, we use normal backref.
+*/
+   owner = btrfs_header_owner(buf);
+   if (owner == ri->objectid)
+   goto normal;
+   /*
+* we check every eb here, and if any of
+* eb dosen't have original root refers
+* to this eb, we set full backref flag for
+* this extent, otherwise normal backref.
+*/
+   for (i = 0; i < nritems; i++) {
+   struct btrfs_file_extent_item *fi;
+   btrfs_item_key_to_cpu(buf, &key, i);
+
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   continue;
+   fi = btrfs_item_ptr(buf, i,
+   struct btrfs_file_extent_item);
+   if (btrfs_file_extent_type(buf, fi) ==
+   BTRFS_FILE_EXTENT_INLINE)
+   continue;
+   if (btrfs_file_extent_disk_bytenr(buf, fi) == 0)
+   continue;
+   bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
+   cache = lookup_cache_extent(extent_cache, bytenr, 1);
+   if (!cache)
+   goto full_backref;
+   offset = btrfs_file_extent_offset(buf, fi);
+   rec = container_of(cache, struct extent_record, cache);
+   dback = find_da

[PATCH 3/3] Btrfs-progs, fsck: move root items repair after root rebuilding

2014-11-25 Thread Wang Shilong
If some critical roots are corrupt, reapr_root_items() will fail,
this is detected by fsck_tests.sh's extent rebuilding tests.

Signed-off-by: Wang Shilong 
---
 cmds-check.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index a102752..ae9005e 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7987,22 +7987,6 @@ int cmd_check(int argc, char **argv)
 
root = info->fs_root;
 
-   ret = repair_root_items(info);
-   if (ret < 0)
-   goto close_out;
-   if (repair) {
-   fprintf(stderr, "Fixed %d roots.\n", ret);
-   ret = 0;
-   } else if (ret > 0) {
-   fprintf(stderr,
-  "Found %d roots with an outdated root item.\n",
-  ret);
-   fprintf(stderr,
-   "Please run a filesystem check with the option --repair 
to fix them.\n");
-   ret = 1;
-   goto close_out;
-   }
-
/*
 * repair mode will force us to commit transaction which
 * will make us fail to load log tree when mounting.
@@ -8101,6 +8085,22 @@ int cmd_check(int argc, char **argv)
if (ret)
fprintf(stderr, "Errors found in extent allocation tree or 
chunk allocation\n");
 
+   ret = repair_root_items(info);
+   if (ret < 0)
+   goto close_out;
+   if (repair) {
+   fprintf(stderr, "Fixed %d roots.\n", ret);
+   ret = 0;
+   } else if (ret > 0) {
+   fprintf(stderr,
+  "Found %d roots with an outdated root item.\n",
+  ret);
+   fprintf(stderr,
+   "Please run a filesystem check with the option --repair 
to fix them.\n");
+   ret = 1;
+   goto close_out;
+   }
+
fprintf(stderr, "checking free space cache\n");
ret = check_space_cache(root);
if (ret)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend 2/3] Btrfs-progs: fsck: add ability to rebuild extent tree with snapshots

2014-11-25 Thread Wang Shilong
From: Wang Shilong 

This patch makes us to rebuild a really corrupt extent tree with snapshots.
To implement this, we have to verify whether a block is FULL BACKREF.

This idea come from Josef Bacik:

1) We walk down the original tree, every eb we encounter has
btrfs_header_owner(eb) == root->objectid.  We add normal references
for this root (BTRFS_TREE_BLOCK_REF_KEY) for this root.  World peace
is achieved.

2) We walk down the snapshotted tree.  Say we didn't change anything
at all, it was just a clean snapshot and then boom.  So the
btrfs_header_owner(root->node) == root->objectid, so normal backref.
We walk down to the next level, where btrfs_header_owner(eb) !=
root->objectid, but the level above did, so we add normal refs for all
of these blocks.  We go down the next level, now our
btrfs_header_owner(parent) != root->objectid and
btrfs_header_owner(eb) != root->objectid.  This is where we need to
now go back and see if btrfs_header_owner(eb) currently has a ref on
eb.  If it does we are done, move on to the next block in this same
level, we don't have to go further down.

3) Harder case, we snapshotted and then changed things in the original
root.  Do the same thing as in step 2, but now we get down to
btrfs_header_owner(eb) != root->objectid && btrfs_header_owner(parent)
!= root->objectid.  We lookup the references we have for eb and notice
that btrfs_header_owner(eb) no longer refers to eb.  So now we must
set FULL_BACKREF on this extent reference and add a
SHARED_BLOCK_REF_KEY for this eb using the parent->start as the
offset.  And we need to keep walking down and doing the same thing
until we either hit level 0 or btrfs_header_owner(eb) has a ref on the
block.

Signed-off-by: Wang Shilong 
Signed-off-by: Wang Shilong 
---
 cmds-check.c | 132 +--
 1 file changed, 129 insertions(+), 3 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ff2795d..a102752 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -112,6 +112,7 @@ struct extent_record {
unsigned int owner_ref_checked:1;
unsigned int is_root:1;
unsigned int metadata:1;
+   unsigned int flag_block_full_backref:1;
 };
 
 struct inode_backref {
@@ -4608,6 +4609,127 @@ static int is_dropped_key(struct btrfs_key *key,
return 0;
 }
 
+static int calc_extent_flag(struct btrfs_root *root,
+  struct cache_tree *extent_cache,
+  struct extent_buffer *buf,
+  struct root_item_record *ri,
+  u64 *flags)
+{
+   int i;
+   int nritems = btrfs_header_nritems(buf);
+   struct btrfs_key key;
+   struct extent_record *rec;
+   struct cache_extent *cache;
+   struct data_backref *dback;
+   struct tree_backref *tback;
+   struct extent_buffer *new_buf;
+   u64 owner = 0;
+   u64 bytenr;
+   u64 offset;
+   u64 ptr;
+   int size;
+   int ret;
+   u8 level;
+
+   /*
+* Except file/reloc tree, we can not have
+* FULL BACKREF MODE
+*/
+   if (ri->objectid < BTRFS_FIRST_FREE_OBJECTID)
+   goto normal;
+   /*
+* root node
+*/
+   if (buf->start == ri->bytenr)
+   goto normal;
+   if (btrfs_is_leaf(buf)) {
+   /*
+* we are searching from original root, world
+* peace is achieved, we use normal backref.
+*/
+   owner = btrfs_header_owner(buf);
+   if (owner == ri->objectid)
+   goto normal;
+   /*
+* we check every eb here, and if any of
+* eb dosen't have original root refers
+* to this eb, we set full backref flag for
+* this extent, otherwise normal backref.
+*/
+   for (i = 0; i < nritems; i++) {
+   struct btrfs_file_extent_item *fi;
+   btrfs_item_key_to_cpu(buf, &key, i);
+
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   continue;
+   fi = btrfs_item_ptr(buf, i,
+   struct btrfs_file_extent_item);
+   if (btrfs_file_extent_type(buf, fi) ==
+   BTRFS_FILE_EXTENT_INLINE)
+   continue;
+   if (btrfs_file_extent_disk_bytenr(buf, fi) == 0)
+   continue;
+   bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
+   cache = lookup_cache_extent(extent_cache, bytenr, 1);
+   if (!cache)
+   goto full_backref;
+   offset = btrfs_file_extent_offset(buf, fi);
+   rec = container_of(cache, struct extent_record, cache);
+   dback = find_da

[PATCH resend 1/3] Btrfs-progs: fsck: deal with snapshot one by one when rebuilding extent tree

2014-11-25 Thread Wang Shilong
From: Wang Shilong 

Previously, we deal with node block firstly and then leaf block which can
maximize readahead. However, to rebuild extent tree, we need deal with snapshot
one by one.

This patch makes us deal with snapshot one by one if we need rebuild extent
tree otherwise we drop into previous way.

Signed-off-by: Wang Shilong 
Signed-off-by: Wang Shilong 
---
 cmds-check.c | 248 +--
 1 file changed, 158 insertions(+), 90 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 24b6b59..ff2795d 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -128,10 +128,14 @@ struct inode_backref {
char name[0];
 };
 
-struct dropping_root_item_record {
+struct root_item_record {
struct list_head list;
-   struct btrfs_root_item ri;
-   struct btrfs_key found_key;
+   u64 objectid;
+   u64 bytenr;
+   u8 level;
+   u8 drop_level;
+   int level_size;
+   struct btrfs_key drop_key;
 };
 
 #define REF_ERR_NO_DIR_ITEM(1 << 0)
@@ -4618,7 +4622,7 @@ static int run_next_block(struct btrfs_trans_handle 
*trans,
  struct rb_root *dev_cache,
  struct block_group_tree *block_group_cache,
  struct device_extent_tree *dev_extent_cache,
- struct btrfs_root_item *ri)
+ struct root_item_record *ri)
 {
struct extent_buffer *buf;
u64 bytenr;
@@ -4851,11 +4855,8 @@ static int run_next_block(struct btrfs_trans_handle 
*trans,
size = btrfs_level_size(root, level - 1);
btrfs_node_key_to_cpu(buf, &key, i);
if (ri != NULL) {
-   struct btrfs_key drop_key;
-   btrfs_disk_key_to_cpu(&drop_key,
- &ri->drop_progress);
if ((level == ri->drop_level)
-   && is_dropped_key(&key, &drop_key)) {
+   && is_dropped_key(&key, &ri->drop_key)) {
continue;
}
}
@@ -4896,7 +4897,7 @@ static int add_root_to_pending(struct extent_buffer *buf,
   struct cache_tree *pending,
   struct cache_tree *seen,
   struct cache_tree *nodes,
-  struct btrfs_key *root_key)
+  u64 objectid)
 {
if (btrfs_header_level(buf) > 0)
add_pending(nodes, seen, buf->start, buf->len);
@@ -4905,13 +4906,12 @@ static int add_root_to_pending(struct extent_buffer 
*buf,
add_extent_rec(extent_cache, NULL, 0, buf->start, buf->len,
   0, 1, 1, 0, 1, 0, buf->len);
 
-   if (root_key->objectid == BTRFS_TREE_RELOC_OBJECTID ||
+   if (objectid == BTRFS_TREE_RELOC_OBJECTID ||
btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV)
add_tree_backref(extent_cache, buf->start, buf->start,
 0, 1);
else
-   add_tree_backref(extent_cache, buf->start, 0,
-root_key->objectid, 1);
+   add_tree_backref(extent_cache, buf->start, 0, objectid, 1);
return 0;
 }
 
@@ -6481,6 +6481,99 @@ static int check_devices(struct rb_root *dev_cache,
return ret;
 }
 
+static int add_root_item_to_list(struct list_head *head,
+ u64 objectid, u64 bytenr,
+ u8 level, u8 drop_level,
+ int level_size, struct btrfs_key *drop_key)
+{
+
+   struct root_item_record *ri_rec;
+   ri_rec = malloc(sizeof(*ri_rec));
+   if (!ri_rec)
+   return -ENOMEM;
+   ri_rec->bytenr = bytenr;
+   ri_rec->objectid = objectid;
+   ri_rec->level = level;
+   ri_rec->level_size = level_size;
+   ri_rec->drop_level = drop_level;
+   if (drop_key)
+   memcpy(&ri_rec->drop_key, drop_key, sizeof(*drop_key));
+   list_add_tail(&ri_rec->list, head);
+
+   return 0;
+}
+
+static int deal_root_from_list(struct list_head *list,
+  struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct block_info *bits,
+  int bits_nr,
+  struct cache_tree *pending,
+  struct cache_tree *seen,
+  struct cache_tree *reada,
+  struct cache_tree *nodes,
+  struct cache_tree *extent_cache,
+  struct cache_tree *chunk_cache,
+  struct rb_root *dev_cache,
+   

[PATCH 0/3] Extent tree rebuilding with snapshots patches

2014-11-25 Thread Wang Shilong
I sent these patches a long while ago, but for some reason they were
not merged. They are important fixes for fsck: without them, extent
tree rebuilding does not work with snapshots.

Also, the extent tree rebuild test in /tests/fsck-tests.sh would always
detect the failure without these patches. Unfortunately, it needs extra
environment settings, so I suppose it has always been 'NOTRUN'.

The last patch fixes a regression in root rebuilding. Root rebuilding
should come first, because if a root (e.g. the extent root) is corrupted,
repairing root items won't succeed either.

The patches are based on David's integration-20141125.

Wang Shilong (3):
  Btrfs-progs: fsck: deal with snapshot one by one when rebuilding
extent tree
  Btrfs-progs: fsck: add ability to rebuild extent tree with snapshots
  Btrfs-progs, fsck: move root items repair after root rebuilding

 cmds-check.c | 412 +++
 1 file changed, 303 insertions(+), 109 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: convert: fix unable to rollback case with removed empty block groups

2014-11-25 Thread Gui Hecheng
Running fstests btrfs/012 fails with the message:
unable to do rollback

This is because the rollback function checks each piece of space in
sequence to see that it maps to a block group. If some piece does not,
rollback refuses to continue.

After kernel commit:
commit 47ab2a6c689913db23ccae38349714edf8365e0a
Btrfs: remove empty block groups automatically

Empty block groups are removed, so there are possible gaps:

|--block group 1--|     |--block group 2--|
                     ^
                     |
                    gap

So the space in the gap belongs to a removed empty block group;
rollback should detect this case and continue.

Signed-off-by: Gui Hecheng 
---
 btrfs-convert.c | 13 +++--
 volumes.c   |  2 ++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index a544fc6..504c7b3 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2368,8 +2368,17 @@ static int may_rollback(struct btrfs_root *root)
while (1) {
ret = btrfs_map_block(&info->mapping_tree, WRITE, bytenr,
  &length, &multi, 0, NULL);
-   if (ret)
+   if (ret) {
+   if (ret == -ENOENT) {
+   /* removed block group at the tail */
+   if (length == (u64)-1)
+   break;
+
+   /* removed block group in the middle */
+   goto next;
+   }
goto fail;
+   }
 
num_stripes = multi->num_stripes;
physical = multi->stripes[0].physical;
@@ -2377,7 +2386,7 @@ static int may_rollback(struct btrfs_root *root)
 
if (num_stripes != 1 || physical != bytenr)
goto fail;
-
+next:
bytenr += length;
if (bytenr >= total_bytes)
break;
diff --git a/volumes.c b/volumes.c
index a1fd162..a988cdb 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1318,10 +1318,12 @@ again:
ce = search_cache_extent(&map_tree->cache_tree, logical);
if (!ce) {
kfree(multi);
+   *length = (u64)-1;
return -ENOENT;
}
if (ce->start > logical) {
kfree(multi);
+   *length = ce->start - logical;
return -ENOENT;
}
 
-- 
1.8.1.4
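
For readers following along, here is a minimal standalone sketch of the gap handling described in the commit message above. It is a simplified model, not the btrfs-progs code: struct chunk, map_block() and walk_space() are hypothetical names, and the stripe and physical-address checks that may_rollback() performs are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block group covering [start, start + len). */
struct chunk {
	uint64_t start;
	uint64_t len;
};

/*
 * Simplified model of the lookup after the patch ('chunks' sorted by start).
 * On success, *length is the number of bytes left in the chunk containing
 * 'logical'. On -ENOENT, *length is the size of the gap before the next
 * chunk, or (uint64_t)-1 when no chunk follows.
 */
static int map_block(const struct chunk *chunks, size_t nr,
		     uint64_t logical, uint64_t *length)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t end = chunks[i].start + chunks[i].len;

		if (end <= logical)
			continue;
		if (chunks[i].start > logical) {
			*length = chunks[i].start - logical;
			return -ENOENT;
		}
		*length = end - logical;
		return 0;
	}
	*length = (uint64_t)-1;
	return -ENOENT;
}

/*
 * Walk [0, total_bytes). Returns 1 if the walk may proceed to the end,
 * 0 otherwise, and stores the number of mapped bytes in *mapped.
 */
static int walk_space(const struct chunk *chunks, size_t nr,
		      uint64_t total_bytes, uint64_t *mapped)
{
	uint64_t bytenr = 0;
	uint64_t length;
	int ret;

	*mapped = 0;
	while (bytenr < total_bytes) {
		ret = map_block(chunks, nr, bytenr, &length);
		if (ret == -ENOENT) {
			if (length == (uint64_t)-1)
				break;		/* removed block group at the tail */
			bytenr += length;	/* removed block group in the middle */
			continue;
		}
		if (ret)
			return 0;		/* any other error: refuse */
		*mapped += length;
		bytenr += length;
	}
	return 1;
}
```

The point of returning the gap size through *length is that the caller can step over the hole without a second lookup.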



[PATCH 1/2] btrfs-progs: remove dead condition for btrfs_map_block

2014-11-25 Thread Gui Hecheng
search_cache_extent() only ever returns the next cache_extent or NULL;
it never returns the previous cache_extent.
So remove the dead condition that handled a previous cache_extent.

Signed-off-by: Gui Hecheng 
---
 volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 5b007fc..a1fd162 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1320,7 +1320,7 @@ again:
kfree(multi);
return -ENOENT;
}
-   if (ce->start > logical || ce->start + ce->size < logical) {
+   if (ce->start > logical) {
kfree(multi);
return -ENOENT;
}
-- 
1.8.1.4
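
A small sketch of why the removed condition is dead, under the assumption that the lookup returns either the extent containing the address or the next one after it. The names struct extent, search_extent() and dead_condition_fires() are illustrative only and are not part of btrfs-progs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cache extent covering [start, start + size). */
struct extent {
	uint64_t start;
	uint64_t size;
};

/*
 * Model of the assumed lookup contract: return the extent containing
 * 'logical', otherwise the next extent after it, otherwise NULL.
 * 'sorted' must be ordered by start and must not overlap.
 */
static const struct extent *search_extent(const struct extent *sorted,
					  size_t nr, uint64_t logical)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (sorted[i].start + sorted[i].size > logical)
			return &sorted[i];
	}
	return NULL;
}

/* Returns 1 if the removed condition would fire for this lookup, else 0. */
static int dead_condition_fires(const struct extent *sorted, size_t nr,
				uint64_t logical)
{
	const struct extent *ce = search_extent(sorted, nr, logical);

	return ce && ce->start + ce->size < logical;
}
```

Under that contract, every returned extent ends after 'logical', so the second half of the old test can never be true.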



Re: Apparent metadata corruption (file that simultaneously does/does not exist) on kernel 3.17.3

2014-11-25 Thread Qu Wenruo

Hi Daniel,

With your btrfs-image dump, I tested the patchset I sent to the mailing
list, and it succeeds in fixing the image.


You can get the patchset and apply it on 3.17.2; --repair should then
fix it.

The file with the nlink error will be moved to the 'lost+found' dir.

The best fix would be to just add the missing dir_index, but for now
the patchset does quite well and does not need any modification.


The patchset can be extracted using patchwork:
0001: https://patchwork.kernel.org/patch/5364131/mbox/
0002: https://patchwork.kernel.org/patch/5364141/mbox/
0003: https://patchwork.kernel.org/patch/5364101/mbox/
0004 v2: https://patchwork.kernel.org/patch/5383611/mbox/
0005 v2: https://patchwork.kernel.org/patch/5383601/mbox/
0006: https://patchwork.kernel.org/patch/5364151/mbox

Any feedback to improve the patches is welcome.

Thanks,
Qu


 Original Message 
Subject: Re: Apparent metadata corruption (file that simultaneously 
does/does not exist) on kernel 3.17.3

From: Daniel Miranda 
To: Qu Wenruo 
Date: 2014-11-25 15:42

I just ran the repair but the ghost file has not disappeared, unfortunately.

On Tue, Nov 25, 2014 at 5:26 AM, Qu Wenruo  wrote:

 Original Message 
Subject: Re: Apparent metadata corruption (file that simultaneously
does/does not exist) on kernel 3.17.3
From: Daniel Miranda 
To: Qu Wenruo 
Date: 2014-11-25 15:20

Here are the logs. I'll send you a link to my dump directly after I
finish uploading it. Please notify me when you have downloaded it so I
can delete it.

checking extents
checking free space cache
checking fs roots
root 5 inode 17149868 errors 2000, link count wrong
  unresolved ref dir 17182377 index 245 namelen 8 name string.h
filetype 1 errors 1, no dir item

The link count error seems to be resolved by Josef's commit, which is already
in 3.17.2. With 3.17.2, Josef's commit will rebuild the dir item and dir index.

root 5 inode 17182377 errors 200, dir isize wrong

This isize error seems to be caused by the previous line.
If 3.17.2 can repair the problem above, this one should disappear
as well.

According to the above output, btrfsck --repair with btrfs-progs 3.17.2 has
a good chance of repairing it.
Just give it a try.

Thanks,
Qu


Checking filesystem on /dev/mapper/fedora_daniel--pc-root
UUID: fef8f718-0622-4cb1-9597-749650d366a4
found 55108022156 bytes used err is 1
total csum bytes: 89787396
total tree bytes: 2303455232
total fs tree bytes: 2024841216
total extent tree bytes: 145272832
btree space waste bytes: 529672422
file data blocks allocated: 253414481920
   referenced 94127726592
Btrfs v3.17


Regards,
Daniel

On Tue, Nov 25, 2014 at 3:20 AM, Qu Wenruo 
wrote:

 Original Message 
Subject: Re: Apparent metadata corruption (file that simultaneously
does/does not exist) on kernel 3.17.3
From: Daniel Miranda 
To: Qu Wenruo 
Date: 2014-11-25 13:14

I'll go run that and get you the output.

Thanks.


I can do the image dump, sure. I don't know how long it might take to
upload it somewhere though. Right now `btrfs fi df` shows about 2GiB
of metadata (it's a 120GiB volume). I'll see how large it ends up
after compression.

A 120G volume seems quite small compared to the images I received recently
(1T x2 RAID1 and 4T single).
With '-c 9' it shouldn't be too large, I think (the 1T RAID1 image is about 1G
of metadata with -c9).

BTW, a btrfs-image dump contains all the filenames and the directory
hierarchy, even without the file data, so it is worth thinking twice about
your privacy before uploading.

Thanks,
Qu


Thanks for the quick response,
Daniel

On Tue, Nov 25, 2014 at 3:10 AM, Qu Wenruo 
wrote:

Hi,

What's the btrfsck output, without the --repair option?

Also, if it is OK with you, would you please dump the filesystem with the
'btrfs-image' command?
The '-c 9' option is highly recommended considering its size.
This will help developers a lot in testing the btrfsck repair
function.

Thanks,
Qu


 Original Message 
Subject: Apparent metadata corruption (file that simultaneously
does/does
not exist) on kernel 3.17.3
From: Daniel Miranda 
To: 
Date: 2014-11-25 13:04

Hello,

After I had some brief stability issues with my computer, it seems
some form of metadata corruption took place in my BTRFS filesystem,
and now a particular file seems to exist, but I cannot access any
details on it or delete it.

If I try to `ls` in the directory it is in, that's what I get:

ls: cannot access string.h: No such file or directory
total 0
drwxr-xr-x. 1 danielkza mock 16 Nov 21 14:18 ./
drwxr-xr-x. 1 danielkza mock  6 Nov 21 14:18 ../
-????????? ? ? ? ?            ? string.h

If I try to delete it I get:

rm: cannot remove ‘string.h’: No such file or directory

I'm using kernel 3.17.3 from Fedora 21. I got no messages on dmesg or
anything of the sort. I know the btrfs fsck situation is complicated,
but is there any utility I should use to try and repair this? Losing
this file is not a problem, it's just one header from the kernel I was
build

[PATCH v2 5/6] btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found' fsck mechanism.

2014-11-25 Thread Qu Wenruo
With the previous btrfs inode operations patches, now we can use
btrfs_mkdir() to create the 'lost+found' dir to do some data salvage in
btrfsck.

This patch along with previous ones will make data salvage easier.

Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
   Fix a bug that returned the parent ino rather than the existing dir
   ino.
---
 ctree.h |  2 ++
 inode.c | 92 +
 2 files changed, 94 insertions(+)

diff --git a/ctree.h b/ctree.h
index 17b3b20..ec969ab 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2456,4 +2456,6 @@ int btrfs_add_orphan_item(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct btrfs_path *path,
  u64 ino);
+int btrfs_mkdir(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   char *name, int namelen, u64 parent_ino, u64 *ino, int mode);
 #endif
diff --git a/inode.c b/inode.c
index f085d30..e738cb8 100644
--- a/inode.c
+++ b/inode.c
@@ -374,3 +374,95 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+/* Fill inode item with 'mode'. Uid/gid to root/root */
+static void fill_inode_item(struct btrfs_trans_handle *trans,
+   struct btrfs_inode_item *inode_item,
+   u32 mode, u32 nlink)
+{
+   time_t now = time(NULL);
+
+   btrfs_set_stack_inode_generation(inode_item, trans->transid);
+   btrfs_set_stack_inode_uid(inode_item, 0);
+   btrfs_set_stack_inode_gid(inode_item, 0);
+   btrfs_set_stack_inode_size(inode_item, 0);
+   btrfs_set_stack_inode_mode(inode_item, mode);
+   btrfs_set_stack_inode_nlink(inode_item, nlink);
+   btrfs_set_stack_timespec_sec(&inode_item->atime, now);
+   btrfs_set_stack_timespec_nsec(&inode_item->atime, 0);
+   btrfs_set_stack_timespec_sec(&inode_item->mtime, now);
+   btrfs_set_stack_timespec_nsec(&inode_item->mtime, 0);
+   btrfs_set_stack_timespec_sec(&inode_item->ctime, now);
+   btrfs_set_stack_timespec_nsec(&inode_item->ctime, 0);
+}
+
+/*
+ * Unlike kernel btrfs_new_inode(), we only create the INODE_ITEM, without
+ * its backref.
+ * The backref is added by btrfs_add_link().
+ */
+static int btrfs_new_inode(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  u64 ino, u32 mode)
+{
+   struct btrfs_inode_item inode_item = {0};
+   int ret = 0;
+
+   fill_inode_item(trans, &inode_item, mode, 0);
+   ret = btrfs_insert_inode(trans, root, ino, &inode_item);
+   return ret;
+}
+
+/*
+ * Make a dir under the parent inode 'parent_ino' with 'name'
+ * and 'mode', The owner will be root/root.
+ */
+int btrfs_mkdir(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   char *name, int namelen, u64 parent_ino, u64 *ino, int mode)
+{
+   struct btrfs_dir_item *dir_item;
+   struct btrfs_path *path;
+   u64 ret_ino;
+   int ret = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   dir_item = btrfs_lookup_dir_item(NULL, root, path, parent_ino,
+name, namelen, 0);
+   if (IS_ERR(dir_item)) {
+   ret = PTR_ERR(dir_item);
+   goto out;
+   }
+
+   if (dir_item) {
+   struct btrfs_key found_key;
+
+   /*
+* Already have conflicting name, check if it is a dir.
+* Either way, no need to continue.
+*/
+   btrfs_dir_item_key_to_cpu(path->nodes[0], dir_item, &found_key);
+   ret_ino = found_key.objectid;
+   if (btrfs_dir_type(path->nodes[0], dir_item) != BTRFS_FT_DIR)
+   ret = -EEXIST;
+   goto out;
+   }
+
+   ret = btrfs_find_free_objectid(NULL, root, parent_ino, &ret_ino);
+   if (ret)
+   goto out;
+   ret = btrfs_new_inode(trans, root, ret_ino, mode | S_IFDIR);
+   if (ret)
+   goto out;
+   ret = btrfs_add_link(trans, root, ret_ino, parent_ino, name, namelen,
+BTRFS_FT_DIR, NULL, 1);
+   if (ret)
+   goto out;
+out:
+   btrfs_free_path(path);
+   if (ret == 0 && ino)
+   *ino = ret_ino;
+   return ret;
+}
-- 
2.1.3



[PATCH v2 4/6] btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.

2014-11-25 Thread Qu Wenruo
Add btrfs_unlink() and btrfs_add_link() functions in inode.c,
for the incoming btrfs_mkdir() and later inode operations functions.

Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
   Check for dir name conflicts before adding the inode_backref or
   dir_item/index.
---
 Makefile |   2 +-
 cmds-check.c |   7 +-
 ctree.h  |  12 ++
 inode.c  | 376 +++
 4 files changed, 390 insertions(+), 7 deletions(-)
 create mode 100644 inode.c

diff --git a/Makefile b/Makefile
index 4cae30c..d7a5cbe 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
- ulist.o qgroup-verify.o backref.o
+ ulist.o qgroup-verify.o backref.o inode.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/cmds-check.c b/cmds-check.c
index 9fc1410..6419caf 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1599,14 +1599,9 @@ static int repair_inode_orphan_item(struct btrfs_trans_handle *trans,
struct btrfs_path *path,
struct inode_record *rec)
 {
-   struct btrfs_key key;
int ret;
 
-   key.objectid = BTRFS_ORPHAN_OBJECTID;
-   key.type = BTRFS_ORPHAN_ITEM_KEY;
-   key.offset = rec->ino;
-
-   ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+   ret = btrfs_add_orphan_item(trans, root, path, rec->ino);
btrfs_release_path(path);
if (!ret)
rec->errors &= ~I_ERR_NO_ORPHAN_ITEM;
diff --git a/ctree.h b/ctree.h
index 32b1286..17b3b20 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2444,4 +2444,16 @@ static inline int is_fstree(u64 rootid)
return 1;
return 0;
 }
+
+/* inode.c */
+int btrfs_add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+  u64 ino, u64 parent_ino, char *name, int namelen,
+  u8 type, u64 *index, int add_backref);
+int btrfs_unlink(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+u64 ino, u64 parent_ino, u64 index, const char *name,
+int namelen, int add_orphan);
+int btrfs_add_orphan_item(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_path *path,
+ u64 ino);
 #endif
diff --git a/inode.c b/inode.c
new file mode 100644
index 000..f085d30
--- /dev/null
+++ b/inode.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (C) 2014 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/*
+ * Unlike inode.c in kernel, which can use most of the kernel infrastructure
+ * like inode/dentry things, in user-land, we can only use inode number to
+ * do directly operation on extent buffer, which may cause extra searching,
+ * but should not be a huge problem since progs is less performence sensitive.
+ */
+#include 
+#include "ctree.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "time.h"
+
+/*
+ * Find a free inode index for later btrfs_add_link().
+ * Currently just search from the largest dir_index and +1.
+ */
+static int btrfs_find_free_dir_index(struct btrfs_root *root, u64 dir_ino,
+u64 *ret_ino)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   u64 ret_val = 2;
+   int ret = 0;
+
+   if (!ret_ino)
+   return 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = dir_ino;
+   key.type = BTRFS_DIR_INDEX_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   ret = 0;
+   if (path->slots[0] == 0) {
+   ret = btrfs_prev_leaf(root, path);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Rich Freeman
On Tue, Nov 25, 2014 at 6:13 PM, Chris Murphy  wrote:
> A few years ago companies including Western Digital started shipping
> large cheap drives, think of the "green" drives. These had very high
> TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later
> they completely took out the ability to configure this error recovery
> timing so you only get the upward of 2 minutes to actually get a read
> error reported by the drive.

Why sell an $80 hard drive when you can change a few bytes in the
firmware and sell a crippled $80 drive and an otherwise-identical
non-crippled $130 drive?

--
Rich


Re: [PATCH v2] Btrfs: deal with all 'subvol=xxx' options once

2014-11-25 Thread Wang Shilong

> 
> On Tue, Nov 25, 2014 at 04:20:11PM +0800, Wang Shilong wrote:
>> Steps to reproduce:
>> # mkfs.btrfs -f /dev/sdb
>> # mount -t btrfs /dev/sdb /mnt
>> # btrfs sub create /mnt/dir
>> # mount -t btrfs /dev/sdb /mnt -o subvol=dir,subvol=dir
>> 
>> It fails with:
>> mount: mount(2) failed: No such file or directory
> 
> The bug is real, but I don't like the fix. The mount path is hard to
> read already, and I'm afraid your fix adds another unobvious step to the
> whole processing.
> 
> setup_root_args replaces subvol= with subvolid=0 once. I suggest
> replacing all occurrences of subvol= here and not relying on the recursive
> behaviour of the mount callbacks.

OK, if you prefer it this way, I will do it.


> 
> The (buggy) way it works now is that the first occurrence of subvol
> gets parsed and passed as
> 
> newroot = vfs_kern_mount(",subvol=second,...,subvolid=0")
> 
> and this will call back again to btrfs_mount and will try to mount the
> subvol 'second' but now relative to 'newroot'.
> 
> Try this:
> 
> # mkfs.btrfs -f /dev/sdb
> # mount -t btrfs /dev/sdb /mnt
> # btrfs sub create /mnt/dir
> # btrfs sub create /mnt/dir/dir2
> # mount -t btrfs /dev/sdb /mnt -o subvol=dir,subvol=dir2
> 
> mount succeeds and the mounted subvolume is dir2.
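
A rough userspace sketch of the "replace all occurrences" idea from David's comment above. This is not the kernel's setup_root_args(); the helper name replace_subvol_opts() and its buffer-based interface are made up for illustration, and it does not deal with a subvolid= option that is already present.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy 'opts' (comma-separated mount options) into 'out', dropping every
 * "subvol=..." option and appending a single "subvolid=0".
 * Returns 0 on success, -1 if 'out' is too small.
 */
static int replace_subvol_opts(const char *opts, char *out, size_t outlen)
{
	size_t used = 0;
	const char *p = opts;

	if (outlen == 0)
		return -1;
	out[0] = '\0';

	while (*p) {
		size_t len = strcspn(p, ",");

		if (len > 0 && strncmp(p, "subvol=", 7) != 0) {
			/* keep this option: room for ',' + option + NUL */
			if (used + len + 2 > outlen)
				return -1;
			if (used)
				out[used++] = ',';
			memcpy(out + used, p, len);
			used += len;
			out[used] = '\0';
		}
		p += len;
		if (*p == ',')
			p++;
	}

	if (used + strlen(",subvolid=0") + 1 > outlen)
		return -1;
	snprintf(out + used, outlen - used, "%ssubvolid=0", used ? "," : "");
	return 0;
}
```

With this, "subvol=dir,subvol=dir2" is rewritten in one pass, so nothing is left over for a recursive mount callback to parse a second time.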

Best Regards,
Wang Shilong



Re: [PATCH 0/6] btrfs-progs: New 'lost+found' infrastructrue with

2014-11-25 Thread Qu Wenruo


 Original Message 
Subject: Re: [PATCH 0/6] btrfs-progs: New 'lost+found' infrastructrue with
From: David Sterba 
To: Qu Wenruo 
Date: 2014-11-26 02:32

> On Mon, Nov 24, 2014 at 05:06:59PM +0800, Qu Wenruo wrote:
> > Introduce the new 'lost+found' dir and related infrastructure to create it
> > in btrfs-progs.
> >
> > [BUG]
> > With the new infrastructure, fix a bug that some people reported in both
> > the kernel BZ and the mailing list, in which some files have an nlink of 1
> > but a backref that points to a non-existent parent.
> > Both reporters reported a missing file (a Chrome config file), so we'd
> > better not delete such files but move them to the 'lost+found' dir.

> Well, I don't like introducing the lost+found directory.
>
> My idea is to extend the rescue utilities to extract the unlinked files and
> copy them to a user-defined directory, without touching the filesystem.

Personally (and as others have mentioned, maybe Chris?), I think btrfs
should have only two fsck facilities: btrfsck for offline check and
recovery, and scrub for online check and recovery.

So rescue may eventually be merged into btrfsck, and extending rescue may
not be a good idea.

Also, such an nlink mismatch is not a huge bug that destroys the whole fs
or makes it unmountable. End users may not be happy that they need an extra
command besides btrfsck to fix such a small problem.


> Or, at least make the in-filesystem lost+found directory creation
> optional.

This seems better, but when users give the '--repair' option, they should be
aware that the fs may be modified by btrfsck.

Still, your idea of making creation of the 'lost+found' dir optional is
indeed important for end users, just like e2fsck's annoying but solid prompt.

What about prompting the user that we are going to modify the fs and asking
for y/n?

> You've split the patches well so I'm going to pull 1-5
> directly. Patch 6 should be updated a bit, I'll look closer and will let
> you know.

0004 and 0005 have some small fixes, I'll send the v2 patches soon.

Thanks,
Qu



> > 2. Unify the repair framework.
> > While writing the 6th patch, I came to think it is better to build a
> > framework that unifies check and repair.
> > In the 6th patch, my patchset and Josef's commit 2dc4c001 in fact have some
> > similar functions, but do the repair at different times and in different
> > functions.
> >
> > I will try to build a unified framework for repair, in which each repair is
> > independent and has its own error number.
> > Each repair function should work as follows:
> > 1) Check the error number
> > 2) Do the repair
> > 3) Update the related btrfsck record (such as a newly created or deleted inode)
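
To make the three steps above concrete, here is a minimal sketch of a table-driven repair loop. Everything in it is hypothetical: the error bits, struct inode_rec and the two handlers are illustrative stand-ins and do not reproduce the real btrfsck flags or records.

```c
#include <stddef.h>

/* Hypothetical error bits; the real btrfsck flags are not reproduced here. */
#define ERR_NO_ORPHAN_ITEM	(1U << 0)
#define ERR_LINK_COUNT_WRONG	(1U << 1)

/* Hypothetical per-inode record kept by the checker. */
struct inode_rec {
	unsigned int errors;		/* bitmask of outstanding errors */
	unsigned int nlink;		/* link count stored in the inode item */
	unsigned int found_link;	/* links actually found during the scan */
	int has_orphan_item;
};

struct repair_op {
	unsigned int err;			/* error this handler repairs */
	int (*repair)(struct inode_rec *rec);	/* returns 0 on success */
};

static int repair_orphan_item(struct inode_rec *rec)
{
	rec->has_orphan_item = 1;
	return 0;
}

static int repair_link_count(struct inode_rec *rec)
{
	rec->nlink = rec->found_link;
	return 0;
}

/* Each repair is independent and is keyed by its own error number. */
static const struct repair_op repair_ops[] = {
	{ ERR_NO_ORPHAN_ITEM, repair_orphan_item },
	{ ERR_LINK_COUNT_WRONG, repair_link_count },
};

/* Run every applicable repair; return the number of repairs that failed. */
static unsigned int run_repairs(struct inode_rec *rec)
{
	unsigned int failed = 0;
	size_t i;

	for (i = 0; i < sizeof(repair_ops) / sizeof(repair_ops[0]); i++) {
		const struct repair_op *op = &repair_ops[i];

		if (!(rec->errors & op->err))		/* 1) check the error number */
			continue;
		if (op->repair(rec) == 0)		/* 2) do the repair */
			rec->errors &= ~op->err;	/* 3) update the record */
		else
			failed++;
	}
	return failed;
}
```

Adding a new repair then means adding one table entry, without touching the loop.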

> The unification is most welcome, feel free to send me anything that could be
> merged as preparatory work (cleanups, safe changes, etc).




Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Dave Chinner
On Mon, Nov 24, 2014 at 11:33:35PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
>   ...
> > > +}
> > 
> > This just seems wrong to me, not to mention extremely expensive when we have
> > millions of cached inodes on the superblock.
> 
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
> 
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
> 
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
> 
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes that the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).

Create another list - we already have multiple dirty inode lists in
the struct bdi_writeback.

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat. 

We already have an i_wb_list in the inode for tracking the dirty
list the inode belongs to.

> Given #1 and #2 above,
> yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
> we're now iterating over all of the inodes twice, but I believe that
> is a better trade-off than bloating the inode structure with an extra
> set of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).

There is no need to bloat the inode at all, therefore we shouldn't
be causing sync/syncfs regressions by enabling lazytime...

> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
> 
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.

That's indicative of a bug - if it's dirty then you shouldn't be
evicting it. The LRU shrinker explicitly avoids reclaiming dirty
inodes. Also, evict_inodes() is only called in the unmount path,
and that happens after a sync_filesystem() call so that shouldn't be
finding dirty inodes, either.

> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.

I get the feeling from your responses that you really don't
understand the VFS inode lifecycle or how the writeback code works.
Inodes don't get dropped from the inode cache when the last open FD
on them is closed unless they are an unlinked file. The dentry cache
still holds a reference to the inode.

> For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had only been read on every
> close --- which in the case of say, all of the files in the kernel
> tree, would be a little unfortunate.
> 
> Of course, the combination of strict atime and lazytime would result
> in a pretty heavy write load on a umount or sync(2), so I suspect
> keeping the relatime mode would make sense for most people, but for
> those people who need strict Posix compliance, it seemed like doing
> something that worked well for strictatime plus lazytime would be a
> good thing, which is why I tried to defer things as much as possible.
> 
> > if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> > 
> > > + spin_lock(&inode->i_lock);
> > > + inode->i_state |= I_DIRTY_SYNC;
> > > + spin_unlock(&inode->i_lock);
> > > + }
> > >   return file->f_op->fsync(file, start, end, datasync);
> > 
> > When we mark the inode I_DIRTY_TIME, we should also be marking it
> > I_DIRTY_SYNC so that all the sync operations know that they should
> > be writing this inode. That's partly why I also think these inodes
> > should be tracked on the dirty inode list
> 
> The whole point of what I was doing is that I_DIRTY_TIME was not part
> of I_DIRTY, and that when in update_time() we set I_DIRTY_TIME instead
> of I_DIRTY_SYNC.

I_DIRTY_SYNC only matters once you get the inode into the fsync code
or deep into the inode writeback code (i.e.
__writeback_single_inode()). If we don't expire the inode at the
high level writeback code, then the only time we'll get into
__writeback_single_inode() is through specific foreground attempts
to write the inode. In which case, we should be writing the inode if
it is I_DIRTY_TIME, and so I_DIRTY_SYNC will trigger all the correct
code paths to be taken to get us to writ

Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Dave Chinner
On Mon, Nov 24, 2014 at 11:45:08PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 25, 2014 at 12:53:32PM +1100, Dave Chinner wrote:
> > On Fri, Nov 21, 2014 at 02:59:23PM -0500, Theodore Ts'o wrote:
> > > Guarantee that the on-disk timestamps will be no more than 24 hours
> > > stale.
> > > 
> > > Signed-off-by: Theodore Ts'o 
> > 
> > If we put these inodes on the dirty inode list with a writeback
> > time of 24 hours, this is completely unnecessary.
> 
> What do you mean by "a writeback time of 24 hours"?  Do you mean
> creating a new field in the inode which specifies when the writeback
> should happen? 

No.

> I still worry about the dirty inode list getting
> somewhat large in the strictatime && lazytime case, and the inode
> bloat nazi's coming after us for adding a new field to struct inode
> structure.

Use another pure inode time dirty list, and move the inode to the
existing dirty list when it gets marked I_DIRTY.

> Or do you mean trying to abuse the dirtied_when field in some way?

No abuse necessary at all. Just a different inode_dirtied_after()
check is required if the inode is on the time dirty list in
move_expired_inodes().
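
To make the two-list idea concrete, here is a toy model in Python. It is not kernel code: `Inode`, `Writeback`, `mark_dirty_time`, `mark_dirty` and the two expiry constants are names invented for this sketch, and the only things taken from the thread are the 30 second and 24 hour intervals and the name `move_expired_inodes`.

```python
from dataclasses import dataclass, field

DIRTY_EXPIRE = 30              # seconds; normal writeback interval from the thread
DIRTYTIME_EXPIRE = 24 * 3600   # seconds; timestamp-only inodes may lag up to a day


@dataclass
class Inode:
    ino: int
    dirtied_when: float


@dataclass
class Writeback:
    dirty: list = field(default_factory=list)       # data/metadata dirty inodes
    dirty_time: list = field(default_factory=list)  # timestamp-only dirty inodes

    def mark_dirty_time(self, inode):
        """Timestamp update only: park the inode on the time-dirty list."""
        if inode not in self.dirty and inode not in self.dirty_time:
            self.dirty_time.append(inode)

    def mark_dirty(self, inode, now):
        """Real dirtying: move the inode over to the normal dirty list."""
        if inode in self.dirty_time:
            self.dirty_time.remove(inode)
        if inode not in self.dirty:
            inode.dirtied_when = now
            self.dirty.append(inode)

    def move_expired_inodes(self, now):
        """Collect inodes due for writeback, with a per-list expiry check."""
        expired = []
        for lst, limit in ((self.dirty, DIRTY_EXPIRE),
                           (self.dirty_time, DIRTYTIME_EXPIRE)):
            due = [i for i in lst if now - i.dirtied_when >= limit]
            for i in due:
                lst.remove(i)
            expired.extend(due)
        return expired
```

In this model the regular writeback pass never has to skip over timestamp-only inodes on the normal list, and no extra per-inode field is needed beyond the existing list linkage and `dirtied_when`.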

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread John Williams
On Tue, Nov 25, 2014 at 2:30 AM, Liu Bo  wrote:
> On Mon, Nov 24, 2014 at 11:34:46AM -0800, John Williams wrote:
>> For example, Spooky V2 hash is 128 bits and is very fast. It is
>> noncryptographic, but it is more than adequate for data checksums.
>>
>> http://burtleburtle.net/bob/hash/spooky.html
>>
>> SnapRAID uses this hash, and it runs at about 15 GB/sec on my machine
>> (Xeon E3-1270 V2 @ 3.50GHz)
>
> Thanks for the suggestion, I'll take a look.
>
> Btw, it's not in kernel yet, is it?

No, as far as I know, it is not in the kernel.

By the way, as for the suggestion of blake2 hash, note that it is much
slower than Spooky V2 hash. That is to be expected, since blake2 is a
cryptographic hash (even if it is one that is fast relative to other
cryptographic hashes) and as a class, cryptographic hashes tend to be
an order of magnitude slower than the fastest noncryptographic hashes.

The hashes that I would recommend for use with btrfs checksums are:

1) SpookyHash V2 : for 128 bit hashes on 64-bit systems
http://burtleburtle.net/bob/hash/spooky.html

2) CityHash : for 256-bit hashes on all systems
https://code.google.com/p/cityhash/

3) Murmur3 : for 128-bit hashes on 32-bit systems (since Spooky and
City are not the fastest on most 32-bit systems)
https://code.google.com/p/smhasher/wiki/MurmurHash3

All of those are noncryptographic, but they all have good properties
that should make them more than adequate for data checksums and dedup
usage.

For more information, here are some comparisons of fast hash functions
(note that these comparisons were written 2 to 3 years ago):

http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html
http://research.neustar.biz/tag/spooky-hash/


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Chris Murphy
On Tue, Nov 25, 2014 at 2:34 PM, Phillip Susi  wrote:

> I have seen plenty of error logs of people with drives that do
> properly give up and return an error instead of timing out so I get
> the feeling that most drives are properly behaved.  Is there a
> particular make/model of drive that is known to exhibit this silly
> behavior?

The drive will only issue a read error when its ECC absolutely cannot
recover the data, hard fail.

A few years ago companies including Western Digital started shipping
large cheap drives, think of the "green" drives. These had very high
TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later
they completely took out the ability to configure this error recovery
timing so you only get the upward of 2 minutes to actually get a read
error reported by the drive. Presumably if the ECC determines it's a
hard fail and no point in reading the same sector 14000 times, it
would issue a read error much sooner. But again, the linux-raid list
is full of cases where this doesn't happen, and merely by changing the
linux SCSI command timer from 30 to 121 seconds, now the drive reports
an explicit read error with LBA information included, and now md can
correct the problem.




>
>>> IIRC, this is true when the drive returns failure as well.  The
>>> whole bio is marked as failed, and the page cache layer then
>>> begins retrying with progressively smaller requests to see if it
>>> can get *some* data out.
>>
>> Well that's very coarse. It's not at a sector level, so as long as
>> the drive continues to try to read from a particular LBA, but fails
>> to either succeed reading or give up and report a read error,
>> within 30 seconds, then you just get a bunch of wonky system
>> behavior.
>
> I don't understand this response at all.  The drive isn't going to
> keep trying to read the same bad lba; after the kernel times out, it
> resets the drive, and tries reading different smaller parts to see
> which it can read and which it can't.

That's my whole point. When the link is reset, no read error is
submitted by the drive, the md driver has no idea what the drive's
problem was, no idea that it's a read problem, no idea what LBA is
affected, and thus no way of writing over the affected bad sector. If
the SCSI command timer is raised well above 30 seconds, this problem
is resolved. Also replacing the drive with one that definitively
errors out (or can be configured with smartctl -l scterc) before 30
seconds is another option.
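
A rough way to state the rule of thumb: whichever timeout is shorter decides what the kernel sees. The Python sketch below only models that comparison; `read_failure_outcome` is a made-up helper, not a smartctl or kernel interface, and the 1200 decisecond figure used in the usage note is an assumed stand-in for "upward of 2 minutes".

```python
def read_failure_outcome(erc_deciseconds, scsi_timeout_s=30):
    """Classify what the kernel sees when a sector is unreadable.

    erc_deciseconds: the drive's SCT ERC read limit in tenths of a second,
                     or None if the drive has no limit we can rely on.
    scsi_timeout_s:  the kernel's SCSI command timer in seconds.
    """
    if erc_deciseconds is None:
        return "unknown: drive may retry past the command timer"
    drive_gives_up_s = erc_deciseconds / 10
    if drive_gives_up_s < scsi_timeout_s:
        # The drive reports the bad LBA first, so md can rewrite the sector.
        return "read error"
    # The command timer fires first: link reset, no LBA, nothing to repair.
    return "link reset"
```

With a 70 decisecond ERC and the default 30 second timer this gives "read error"; with an assumed 1200 decisecond recovery it gives "link reset" at 30 seconds and "read error" once the timer is raised to 121 seconds.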


>
>> Conversely what I've observed on Windows in such a case, is it
>> tolerates these deep recoveries on consumer drives. So they just
>> get really slow but the drive does seem to eventually recover
>> (until it doesn't). But yeah 2 minutes is a long time. So then the
>> user gets annoyed and reinstalls their system. Since that means
>> writing to the affected drive, the firmware logic causes bad
>> sectors to be dereferenced when the write error is persistent.
>> Problem solved, faster system.
>
> That seems like rather unsubstantiated guesswork.  i.e. the 2 minute+
> delays are likely not on an individual request, but from several
> requests that each go into deep recovery, possibly because windows is
> retrying the same sector or a few consecutive sectors are bad.

It doesn't really matter, clearly its time out for drive commands is
much higher than the linux default of 30 seconds.

>
>> Because now you have a member drive that's inconsistent. At least
>> in the md raid case, a certain number of read failures causes the
>> drive to be ejected from the array. Anytime there's a write
>> failure, it's ejected from the array too. What you want is for the
>> drive to give up sooner with an explicit read error, so md can help
>> fix the problem by writing good data to the affected LBA. That
>> doesn't happen when there are a bunch of link resets happening.
>
> What?  It is no different than when it does return an error, with the
> exception that the error is incorrectly applied to the entire request
> instead of just the affected sector.

OK that doesn't actually happen and it would be completely f'n wrong
behavior if it were happening. All the kernel knows is the command
timer has expired, it doesn't know why the drive isn't responding. It
doesn't know there are uncorrectable sector errors causing the
problem. To just assume link resets are the same thing as bad sectors
and to just wholesale start writing possibly a metric shit ton of data
when you don't know what the problem is would be asinine. It might
even be sabotage. Jesus...




>
>> Again, if your drive SCT ERC is configurable, and set to something
>> sane like 70 deciseconds, that read failure happens at MOST 7
>> seconds after the read attempt. And md is notified of *exactly*
>> what sectors are affected, it immediately goes to mirror data, or
>> rebuilds it from parity, and then writes the correct data to the
>> previously reported bad sectors. And that will fix the problem.
>
> Yes... I'm talking abou

Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Chris Murphy
What happens when all btrfs LVs are unmounted, and you lvchange -an
the LVs (the pair) you do not want mounted; and then btrfs dev scan;
and then mount one of the devices? It should only find the matching LV
because the others are deactivated. I know this isn't ideal, but it's
better than corruption.


Chris Murphy


Re: 3.17.0-rc7: kernel BUG at fs/btrfs/relocation.c:931!

2014-11-25 Thread Tomasz Chmielewski

I'm still seeing this when running balance with 3.18-rc6:

[95334.066898] BTRFS info (device sdd1): relocating block group 
6468350771200 flags 17

[95344.384279] BTRFS info (device sdd1): found 5371 extents
[95373.555640] BTRFS (device sdd1): parent transid verify failed on 
5568935395328 wanted 70315 found 89269
[95373.574208] BTRFS (device sdd1): parent transid verify failed on 
5568935395328 wanted 70315 found 89269

[95373.574483] [ cut here ]
[95373.574542] kernel BUG at fs/btrfs/relocation.c:242!
[95373.574601] invalid opcode:  [#1] SMP
[95373.574661] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ip_tables x_tables cpufreq_ondemand cpufreq_conservative 
cpufreq_powersave cpufreq_stats nfsd auth_rpcgss oid_registry exportfs 
nfs_acl nfs lockd grace fscache sunrpc ipv6 btrfs xor raid6_pq 
zlib_deflate coretemp hwmon loop pcspkr i2c_i801 i2c_core battery 
tpm_infineon tpm_tis tpm 8250_fintek video parport_pc parport ehci_pci 
lpc_ich ehci_hcd mfd_core button acpi_cpufreq ext4 crc16 jbd2 mbcache 
raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii

[95373.576506] CPU: 1 PID: 6089 Comm: btrfs Not tainted 3.18.0-rc6 #1
[95373.576568] Hardware name: System manufacturer System Product 
Name/P8H77-M PRO, BIOS 1101 02/04/2013
[95373.576683] task: 8807e9b91810 ti: 8807da1b8000 task.ti: 
8807da1b8000
[95373.576794] RIP: 0010:[]  [] 
relocate_block_group+0x432/0x4de [btrfs]

[95373.576933] RSP: 0018:8807da1bbb18  EFLAGS: 00010202
[95373.576993] RAX: 8806327a70f8 RBX: 8806327a7000 RCX: 
00018020
[95373.577056] RDX: 8806327a70d8 RSI: 8806327a70e8 RDI: 
8807ff403900
[95373.577118] RBP: 8807da1bbb88 R08: 0001 R09: 

[95373.577181] R10: 0003 R11: a031f2aa R12: 
8804601de5a0
[95373.577243] R13: 8806327a7108 R14: fff4 R15: 
8806327a7020
[95373.577307] FS:  7f9ccfa99840() GS:88081fa4() 
knlGS:

[95373.577418] CS:  0010 DS:  ES:  CR0: 80050033
[95373.577479] CR2: 7f98c4133000 CR3: 0007dd7bf000 CR4: 
001407e0

[95373.577540] Stack:
[95373.577594]  ea0004962e80 8806327a70e8 ea000c7fdb80 

[95373.577708]  ea000d289600 00ffea000d289640 a805e22b2a30 
1000
[95373.577822]  8802eb7b0240 8806327a7000  
8807f3b5a5a8

[95373.577937] Call Trace:
[95373.578009]  [] 
btrfs_relocate_block_group+0x158/0x278 [btrfs]
[95373.578137]  [] 
btrfs_relocate_chunk.isra.70+0x35/0xa5 [btrfs]

[95373.578263]  [] btrfs_balance+0xa66/0xc6b [btrfs]
[95373.578329]  [] ? 
__alloc_pages_nodemask+0x137/0x702
[95373.578407]  [] btrfs_ioctl_balance+0x220/0x29f 
[btrfs]

[95373.578483]  [] btrfs_ioctl+0x1134/0x22f6 [btrfs]
[95373.578547]  [] ? handle_mm_fault+0x44d/0xa00
[95373.578610]  [] ? avc_has_perm+0x2e/0xf7
[95373.578672]  [] ? __vm_enough_memory+0x25/0x13c
[95373.578736]  [] do_vfs_ioctl+0x3f2/0x43c
[95373.578798]  [] SyS_ioctl+0x4e/0x7d
[95373.578859]  [] ? do_page_fault+0xc/0x11
[95373.578920]  [] system_call_fastpath+0x12/0x17
[95373.578981] Code: 00 00 00 48 39 83 f8 00 00 00 74 02 0f 0b 4c 39 ab 
08 01 00 00 74 02 0f 0b 48 83 7b 20 00 74 02 0f 0b 83 bb 20 01 00 00 00 
74 02 <0f> 0b 83 bb 24 01 00 00 00 74 02 0f 0b 48 8b 73 18 48 8b 7b 08
[95373.579226] RIP  [] 
relocate_block_group+0x432/0x4de [btrfs]

[95373.579352]  RSP 




On 2014-10-04 00:06, Tomasz Chmielewski wrote:

On 2014-10-03 20:17 (Fri), Josef Bacik wrote:

On 10/02/2014 03:27 AM, Tomasz Chmielewski wrote:

Got this when running balance with 3.17.0-rc7:



Give these two patches a try

https://patchwork.kernel.org/patch/4938281/
https://patchwork.kernel.org/patch/4939761/


With these two patches applied on top of 3.17-rc7, it BUGs somewhere 
else now:


[ 2030.858792] BTRFS info (device sdd1): relocating block group
6469424513024 flags 17
[ 2039.674077] BTRFS info (device sdd1): found 20937 extents
[ 2066.726661] BTRFS info (device sdd1): found 20937 extents
[ 2068.048208] BTRFS info (device sdd1): relocating block group
6468350771200 flags 17
[ 2080.796412] BTRFS info (device sdd1): found 46927 extents
[ 2092.703850] parent transid verify failed on 5568935395328 wanted
70315 found 71183
[ 2092.714622] parent transid verify failed on 5568935395328 wanted
70315 found 71183
[ 2092.725269] parent transid verify failed on 5568935395328 wanted
70315 found 71183
[ 2092.725680] [ cut here ]
[ 2092.725740] kernel BUG at fs/btrfs/relocation.c:242!
[ 2092.725800] invalid opcode:  [#1] SMP
[ 2092.725860] Modules linked in: ipt_MASQUERADE iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables cpufreq_ondemand cpufreq_conservative
cpufreq_powersave cpufreq_stats bridge stp llc ipv6 btrfs xor raid6_pq
zlib_deflate coretemp hwmon loop i2c_i801 parpor

Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Zygo Blaxell
On Tue, Nov 25, 2014 at 10:59:53PM +0100, Goffredo Baroncelli wrote:
> On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
> > On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
> >> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
> >> [...]
> >>> md-raid works as long as you specify the devices, and because it's always
> >>> the lowest layer it can ignore LVs (snapshot or otherwise).  It's also
> >>> not a particularly common use case, while making an LV snapshot of a
> >>> filesystem is a typical use case.
> >>
> >> I fully agree; but you still consider a *multi-device* btrfs over lvm...
> >> This is like a dm over lvm... which doesn't make sense at all (as you 
> >> already wrote)
> > 
> > It makes sense for btrfs because btrfs can productively use LVs on
> > different PVs (e.g. btrfs-raid1 on two LVs, one on each PV).  LVM is
> > the bottom layer because not everything in the world is btrfs--things
> > like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
> > (e.g.  before running btrfsck) have to live on the same physical drives
> > as the btrfs filesystems.
> 
> Let me summarize
> 
> 1) btrfs-single-disk on lvm works fine
> 2) btrfs-w/multiple-disk on lvm works fine
> 3) btrfs-single-disk on lvm works fine even with snapshot
> 
> 4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot
> 
> However, I still don't understand why you want btrfs-w/multiple disk over 
> LVM ?

I want to split a few disks into partitions, but I want to create,
move, and resize the partitions from time to time.  Only LVM can do
that without taking the machine down, reducing RAID integrity levels,
hotplugging drives, or leaving installed drives idle most of the time.

I want btrfs-raid1 because of its ability to replace corrupted or lost
data from one disk using the other.  If I run a single-volume btrfs
on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
stack), I can detect lost data, but not replace it automatically from
the other mirror.

Since I want both things at the same time, I have btrfs w/multiple disks
on LVM.

The LVM snapshots are for providing an 'undo' capability when I experiment
with some btrfs or btrfsck feature that destroys the filesystem.

> > and mounting the filesystem fails at 3.  
>  Are you sure ?
> >>>
> >>> Yes, I'm sure.  I've had to replace filesystems destroyed this way.
> 
> In a previous email you wrote:
> >> Multi-device btrfs fails at 2, 
> So I assumed that the point 3 onwards were related to a "single-disk" btrfs.
> 
> 
> 
> [...]
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 6:59 PM, Duncan wrote:
> It's not physical spinup, but electronic device-ready.  It happens
> on SSDs too and they don't have anything to spinup.

If you have an SSD that isn't handling IO within 5 seconds or so of
power on, it is badly broken.

> But, for instance on my old seagate 300-gigs that I used to have in
> 4-way mdraid, when I tried to resume from hibernate the drives
> would be spunup and talking to the kernel, but for some seconds to
> a couple minutes or so after spinup, they'd sometimes return
> something like (example) "Seagrte3x0" instead of "Seagate300".  Of
> course that wasn't the exact string, I think it was the model
> number or perhaps the serial number or something, but looking at
> dmesg I could see the ATA layer up for each of the four devices, the
> connection establish and seem to be returning good data, then the
> mdraid layer would try to assemble and would kick out a drive or
> two due to the device string mismatch compared to what was there 
> before the hibernate.  With the string mismatch, from its
> perspective the device had disappeared and been replaced with
> something else.

Again, these drives were badly broken then.  Even if it needs extra
time to come up for some reason, it shouldn't be reporting that it is
ready and returning incorrect information.

> And now I've seen similar behavior resuming from suspend (the old
> hardware wouldn't resume from suspend to ram, only hibernate, the
> new hardware resumes from suspend to ram just fine, but I had
> trouble getting it to resume from hibernate back when I first setup
> and tried it; I've not tried hibernate since and didn't even setup
> swap to hibernate to when I got the SSDs so I've not tried it for a
> couple years) on SSDs with btrfs raid.  Btrfs isn't as informative
> as was mdraid on why it kicks a device, but dmesg says both devices
> are up, while btrfs is suddenly spitting errors on one device.  A
> reboot later and both devices are back in the btrfs and I can do a
> scrub to resync, which generally finds and fixes errors on the
> btrfs that were writable (/home and /var/log), but of course not on
> the btrfs mounted as root, since it's read-only by default.

Several months back I was working on some patches to avoid blocking a
resume until after all disks had spun up ( someone else ended up
getting a different version merged to the mainline kernel ).  I looked
quite hard at the timings of things during suspend and found that my
ssd was ready and handling IO darn near instantly and the hd ( 5900
rpm wd green at the time ) took something like 7 seconds before it was
completing IO.  These days I'm running a raid10 on 3 7200 rpm blues
and it comes right up from suspend with no problems, just as it should.

> The paper specifically mentioned that it wasn't necessarily the
> more expensive devices that were the best, either, but the ones
> that fared best did tend to have longer device-ready times.  The
> conclusion was that a lot of devices are cutting corners on
> device-ready, gambling that in normal use they'll work fine,
> leading to an acceptable return rate, and evidently, the gamble
> pays off most of the time.

I believe I read the same study and don't recall any such conclusion.
 Instead the conclusion was that the badly behaving drives aren't
ordering their internal writes correctly and flushing their metadata
from ram to flash before completing the write request.  The problem
was on the power *loss* side, not the power application.

> The spinning rust in that study fared far better, with I think
> none of the devices scrambling their own firmware, and while there
> was some damage to storage, it was generally far better confined.

That is because they don't have a flash translation layer to get
mucked up and prevent them from knowing where the blocks are on disk.
 The worst thing you get out of a hdd losing power during a write is
the sector it was writing is corrupted and you have to re-write it.

> My experience says otherwise.  Else explain why those problems
> occur in the first two minutes, but don't occur if I hold it at the
> grub prompt "to stabilize" for two minutes, and never during normal
> "post- stabilization" operation.  Of course perhaps there's another
> explanation for that, and I'm conflating the two things.  But so
> far, experience matches the theory.

I don't know what was broken about these drives, only that it wasn't
capacitors since those charge in milliseconds, not seconds.  Further,
all systems using microprocessors ( like the one in the drive that
controls it ) have reset circuitry that prevents them from running
until after any caps have charged enough to get the power rail up to
the required voltage.



Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Goffredo Baroncelli
On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
> On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
>> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
>> [...]
>>> md-raid works as long as you specify the devices, and because it's always
>>> the lowest layer it can ignore LVs (snapshot or otherwise).  It's also
>>> not a particularly common use case, while making an LV snapshot of a
>>> filesystem is a typical use case.
>>
>> I fully agree; but you still consider a *multi-device* btrfs over lvm...
>> This is like a dm over lvm... which doesn't make sense at all (as you 
>> already wrote)
> 
> It makes sense for btrfs because btrfs can productively use LVs on
> different PVs (e.g. btrfs-raid1 on two LVs, one on each PV).  LVM is
> the bottom layer because not everything in the world is btrfs--things
> like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
> (e.g.  before running btrfsck) have to live on the same physical drives
> as the btrfs filesystems.

Let me summarize

1) btrfs-single-disk on lvm works fine
2) btrfs-w/multiple-disk on lvm works fine
3) btrfs-single-disk on lvm works fine even with snapshot

4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot

However, I still don't understand why you want btrfs-w/multiple disk over LVM ?



> 
> and mounting the filesystem fails at 3.  
 Are you sure ?
>>>
>>> Yes, I'm sure.  I've had to replace filesystems destroyed this way.

In a previous email you wrote:
>> Multi-device btrfs fails at 2, 
So I assumed that the point 3 onwards were related to a "single-disk" btrfs.



[...]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 7:05 PM, Chris Murphy wrote:
> I'm not a hard drive engineer, so I can't argue either point. But 
> consumer drives clearly do behave this way. On Linux, the kernel's 
> default 30 second command timer eventually results in what look
> like link errors rather than drive read errors. And instead of the
> problems being fixed with the normal md and btrfs recovery
> mechanisms, the errors simply get worse and eventually there's data
> loss. Exhibits A, B, C, D - the linux-raid list is full to the brim
> of such reports and their solution.

I have seen plenty of error logs of people with drives that do
properly give up and return an error instead of timing out so I get
the feeling that most drives are properly behaved.  Is there a
particular make/model of drive that is known to exhibit this silly
behavior?

>> IIRC, this is true when the drive returns failure as well.  The
>> whole bio is marked as failed, and the page cache layer then
>> begins retrying with progressively smaller requests to see if it
>> can get *some* data out.
> 
> Well that's very coarse. It's not at a sector level, so as long as
> the drive continues to try to read from a particular LBA, but fails
> to either succeed reading or give up and report a read error,
> within 30 seconds, then you just get a bunch of wonky system
> behavior.

I don't understand this response at all.  The drive isn't going to
keep trying to read the same bad lba; after the kernel times out, it
resets the drive, and tries reading different smaller parts to see
which it can read and which it can't.

> Conversely what I've observed on Windows in such a case, is it 
> tolerates these deep recoveries on consumer drives. So they just
> get really slow but the drive does seem to eventually recover
> (until it doesn't). But yeah 2 minutes is a long time. So then the
> user gets annoyed and reinstalls their system. Since that means
> writing to the affected drive, the firmware logic causes bad
> sectors to be dereferenced when the write error is persistent.
> Problem solved, faster system.

That seems like rather unsubstantiated guesswork.  i.e. the 2 minute+
delays are likely not on an individual request, but from several
requests that each go into deep recovery, possibly because windows is
retrying the same sector or a few consecutive sectors are bad.

> Because now you have a member drive that's inconsistent. At least
> in the md raid case, a certain number of read failures causes the
> drive to be ejected from the array. Anytime there's a write
> failure, it's ejected from the array too. What you want is for the
> drive to give up sooner with an explicit read error, so md can help
> fix the problem by writing good data to the affected LBA. That
> doesn't happen when there are a bunch of link resets happening.

What?  It is no different than when it does return an error, with the
exception that the error is incorrectly applied to the entire request
instead of just the affected sector.

> Again, if your drive SCT ERC is configurable, and set to something 
> sane like 70 deciseconds, that read failure happens at MOST 7
> seconds after the read attempt. And md is notified of *exactly*
> what sectors are affected, it immediately goes to mirror data, or
> rebuilds it from parity, and then writes the correct data to the
> previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.


Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Zygo Blaxell
On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
> [...]
> > md-raid works as long as you specify the devices, and because it's always
> > the lowest layer it can ignore LVs (snapshot or otherwise).  It's also
> > not a particularly common use case, while making an LV snapshot of a
> > filesystem is a typical use case.
> 
> I fully agree; but you still consider a *multi-device* btrfs over lvm...
> This is like a dm over lvm... which doesn't make sense at all (as you 
> already wrote)

It makes sense for btrfs because btrfs can productively use LVs on
different PVs (e.g. btrfs-raid1 on two LVs, one on each PV).  LVM is
the bottom layer because not everything in the world is btrfs--things
like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
(e.g.  before running btrfsck) have to live on the same physical drives
as the btrfs filesystems.

> >>> and mounting the filesystem fails at 3.  
> >> Are you sure ?
> > 
> > Yes, I'm sure.  I've had to replace filesystems destroyed this way.
> > 
> >> [working instance snipped]
> > 
> >> On the basis of the example above, in case you want to mount a 
> >> "single-disk", BTRFS seems me to work properly. You have to pay
> >> attention only to not mount the two filesystem at the same time.
> > 
> > The problem is btrfs stops searching when it sees one disk with each UUID,
> 
> BTRFS doesn't search for anything. It is udev which "pushes" the information
> to the kernel module. The btrfs module groups this information by UUID.
> When a new disk is inserted, it overwrites the information of the old one.

Same result:  when presented with multiple devices with the same UUID,
one is chosen arbitrarily instead of rejecting all of them.

> > so the set of disks (snapshot vs origin) that you get is *random*.
> > For a pair of origin + snapshots, there's a 50% chance it works, 50%
> > chance it eats your data.
> 
> Sorry but I have to disagree: the code is quite clear 
> (see fs/btrfs/volumes.c, near line 512):
> 
> [...]
> 
> } else if (!device->name || strcmp(device->name->str, path)) {
> /*
>  * When FS is already mounted.
>  * 1. If you are here and if the device->name is NULL that
>  *    means this device was missing at time of FS mount.
>  * 2. If you are here and if the device->name is different
>  *    from 'path' that means either
>  *      a. The same device disappeared and reappeared with
>  *         different name. or
>  *      b. The missing-disk-which-was-replaced, has
>  *         reappeared now.

If the FS is already mounted then there is no issue.  It's when you're trying
to mount the FS that the fun occurs.

>  *
>  * We must allow 1 and 2a above. But 2b would be a spurious
>  * and unintentional.
> 
> [...]
> 
> This is case 2a; here btrfs stores the new name and mounts it.
> 
> Anyway I made a small test: I created one btrfs filesystem and
> made an lvm snapshot of it. Then I created two different files, one in the
> snapshot and one in the original. I ran a program which randomly mounts the
> former or the latter and checks that the correct file is present; after more
> than 130 tests I never saw your "50% chance it works": it always works.

One btrfs filesystem on two LVs with a snapshot of each LV also present.
So you'd have:

lv00 - btrfs device 1
lv01 - btrfs device 2
lv00snap - snapshot of lv00
lv01snap - snapshot of lv01

If you mount by device UUID then you get one of these results at random:

lv00 + lv01 - OK
lv00snap + lv01snap - also OK
lv00 + lv01snap - failure
lv00snap + lv01 - failure

2 failures, 2 successes = 50% failure rate.

If you mount by the name of one of the devices then you only get the two
rows of the above table that match the device you named, but you still
get one success row and one failure row.

Which result you get seems to depend on the order in which LVM enumerates
the LVs, so if you are doing a mount/umount loop then you won't see any
problems as btrfs will consistently make the same choice of LVs over
and over again.  Rebooting or creating other LVs in between mounts will
definitely cause problems.
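
The table above can be reproduced with a small enumeration. This is only a
sketch of the counting argument, not of what the kernel does: it assumes each
btrfs device slot is filled independently by either the origin or the snapshot
LV with equal likelihood (in practice the choice seems to follow LVM
enumeration order), and the helper names are made up for illustration.

```python
from itertools import product

# LV names from the table above; each btrfs device exists twice
# (origin and snapshot) carrying the same UUID.
CANDIDATES = {
    "btrfs device 1": ["lv00", "lv00snap"],
    "btrfs device 2": ["lv01", "lv01snap"],
}


def is_consistent(picks):
    """Usable only if every pick is an origin or every pick is a snapshot."""
    kinds = {name.endswith("snap") for name in picks}
    return len(kinds) == 1


def enumerate_outcomes(candidates):
    """Return (picks, ok) for every way of choosing one LV per btrfs device."""
    return [(picks, is_consistent(picks))
            for picks in product(*candidates.values())]


def failure_rate(candidates, named=None):
    """Fraction of failing combinations, optionally restricted to the
    combinations that contain the LV named on the mount command line."""
    outcomes = enumerate_outcomes(candidates)
    if named is not None:
        outcomes = [o for o in outcomes if named in o[0]]
    failures = sum(1 for _, ok in outcomes if not ok)
    return failures / len(outcomes)


if __name__ == "__main__":
    for picks, ok in enumerate_outcomes(CANDIDATES):
        print(" + ".join(picks), "-", "OK" if ok else "failure")
    print("mount by UUID, failure rate:", failure_rate(CANDIDATES))
    print("mount by name lv00, failure rate:",
          failure_rate(CANDIDATES, named="lv00"))
```

Naming one device only pins that slot; the other slot is still picked among
two same-UUID candidates, which is why the rate stays the same in this model.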

> BR
> G.Baroncelli
> 
> > 
> >> BR
> >> G.Baroncelli
> >>
> >>
> >> -- 
> >> gpg @keyserver.linux.it: Goffredo Baroncelli 
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Tue 25-11-14 12:57:16, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> >   Actually, I'd also prefer to do the writing from iput_final(). My main
> > reason is that shrinker starts behaving very differently when you put
> > inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
> > particular:
> >         /*
> >          * Referenced or dirty inodes are still in use. Give them another
> >          * pass through the LRU as we cannot reclaim them now.
> >          */
> >         if (atomic_read(&inode->i_count) ||
> >             (inode->i_state & ~I_REFERENCED)) {
> >                 list_del_init(&inode->i_lru);
> >                 spin_unlock(&inode->i_lock);
> >                 this_cpu_dec(nr_unused);
> >                 return LRU_REMOVED;
> >         }
> 
> I must be missing something; how would the shrinker behave
> differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
> note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
> deliberate, because I didn't want I_DIRTY_TIME to have any effect on
> any of the other parts of the writeback or inode management parts.
  Sure, but the test tests whether the inode has *any other* bit than
I_REFERENCED set. So I_DIRTY_TIME will trigger the test and we just remove
the inode from lru list. You could exclude I_DIRTY_TIME from this test to
avoid this problem but then the shrinker latency would get much larger
because it will suddenly do IO in evict(). So I still think doing the
write in iput_final() is the best solution.
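
To make the mask test concrete, here is a minimal Python model of it. The bit
positions are illustrative assumptions (in particular the value used for
I_DIRTY_TIME is hypothetical, since the flag is only being proposed in this
series), and both helper functions are made-up names, not kernel functions.

```python
# Bit positions are assumptions for illustration, not copied from a tree.
I_DIRTY_SYNC = 1 << 0
I_DIRTY_DATASYNC = 1 << 1
I_DIRTY_PAGES = 1 << 2
I_REFERENCED = 1 << 8
I_DIRTY_TIME = 1 << 11  # hypothetical value for the proposed flag

I_DIRTY = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES


def lru_isolate_removes(i_count, i_state):
    """Model of the quoted test: True means the inode is dropped from the
    LRU list without being reclaimed."""
    return bool(i_count or (i_state & ~I_REFERENCED))


def lru_isolate_removes_ignoring_time(i_count, i_state):
    """Variant that also masks out I_DIRTY_TIME, so such inodes stay
    reclaimable (and the write would then have to happen in evict())."""
    return bool(i_count or (i_state & ~(I_REFERENCED | I_DIRTY_TIME)))


if __name__ == "__main__":
    print("I_DIRTY_TIME part of I_DIRTY:", bool(I_DIRTY_TIME & I_DIRTY))
    print("current test removes a time-dirty inode:",
          lru_isolate_removes(0, I_DIRTY_TIME))
    print("masked test removes a time-dirty inode:",
          lru_isolate_removes_ignoring_time(0, I_DIRTY_TIME))
```

So even though I_DIRTY_TIME is kept out of I_DIRTY, any bit other than
I_REFERENCED is enough to take the first branch.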

> > Regarding your concern that we'd write the inode when file is closed -
> > that's not true. We'll write the inode only after corresponding dentry is
> > evicted and thus drops inode reference. That doesn't seem too bad to me.
> 
> True, fair enough.  It's not quite so lazy, but it should be close
> enough.
> 
> I'm still not seeing the benefit in waiting until the last possible
> minute to write out the timestamps; evict() can block as it is if
> there is any writeback that needs to be completed, and if the
> writeback happens to pages subject to delalloc, the timestamp update
> could happen for free at that point.
  Yeah, doing IO from evict is OK in principle but the change in shrinker
success rate / latency worries me... It would certainly need careful
testing under memory pressure & IO load with lots of outstanding timestamp
updates and see how shrinker behaves (change in CPU consumption, numbers of
evicted inodes, etc.).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread Bardur Arantsson
On 2014-11-25 17:47, David Sterba wrote:
> On Mon, Nov 24, 2014 at 03:07:45PM -0500, Chris Mason wrote:
>> On Mon, Nov 24, 2014 at 12:23 AM, Liu Bo  wrote:
>>> This brings a strong-but-slow checksum algorithm, sha256.
>>>
>>> Actually btrfs used sha256 at the early time, but then moved to 
>>> crc32c for
>>> performance purposes.
>>>
>>> As crc32c is sort of weak due to its hash collision issue, we need a 
>>> stronger
>>> algorithm as an alternative.
>>>
>>> Users can choose sha256 from mkfs.btrfs via
>>>
>>> $ mkfs.btrfs -C 256 /device
>>
>> Agree with others about -C 256...-C sha256 is only three letters more ;)
>>
>> What's the target for this mode?  Are we trying to find evil people 
>> scribbling on the drive, or are we trying to find bad hardware?
> 
> We could provide an interface for external applications that would make
> use of the strong checksums. Eg. external dedup, integrity db. The
> benefit here is that the checksum is always up to date, so there's no
> need to compute the checksums again. At the obvious cost.

Yes, please!

Regards,




Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Theodore Ts'o
On Tue, Nov 25, 2014 at 06:30:40PM +0100, Jan Kara wrote:
>   This would be possible and as Boaz says, it might be possible to reuse
> the same list_head in the inode for this. Getting rid of the full scan of
> all superblock inodes would be nice (as the scan gets really expensive for
> large numbers of inodes (think of i_sb_lock contention) and this makes it
> twice as bad) so I'd prefer to do this if possible.

Fair enough, I'll give this a try.  Personally, I've never been that
solicitous towards the efficiency of sync, since if you ever use it,
you tend to destroy performance just because of contention of the disk
drive head caused by the writeback, never mind the i_sb_lock
contention.  ("I am sync(2), the destroyer of tail latency SLO's...")

In fact there has sometimes been discussion about disabling sync(2)
for non-root users, because the opportunity for mischief when a
developer logs in and types sync out of reflex is too high.  Of course,
if we ever did this, I'm sure such a patch would never be accepted
upstream, but that's OK, most people don't seem to care about tail
latency outside of Facebook and Google anyway.

- Ted


[PATCH] btrfs-progs: fix spacing in error messages

2014-11-25 Thread David Sterba
Signed-off-by: David Sterba 
---
 btrfs-list.c | 4 ++--
 qgroup.c | 2 +-
 send-utils.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/btrfs-list.c b/btrfs-list.c
index b6b84935109d..50edcf493869 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -1711,7 +1711,7 @@ int btrfs_list_find_updated_files(int fd, u64 root_id, u64 oldest_gen)
ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
e = errno;
if (ret < 0) {
-   fprintf(stderr, "ERROR: can't perform the search- %s\n",
+   fprintf(stderr, "ERROR: can't perform the search - %s\n",
strerror(e));
break;
}
@@ -1911,7 +1911,7 @@ int btrfs_list_get_path_rootid(int fd, u64 *treeid)
ret = ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args);
if (ret < 0) {
fprintf(stderr,
-   "ERROR: can't perform the search -%s\n",
+   "ERROR: can't perform the search - %s\n",
strerror(errno));
return ret;
}
diff --git a/qgroup.c b/qgroup.c
index 368b26284544..1a4866cb7345 100644
--- a/qgroup.c
+++ b/qgroup.c
@@ -1160,7 +1160,7 @@ u64 btrfs_get_path_rootid(int fd)
ret = ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args);
if (ret < 0) {
fprintf(stderr,
-   "ERROR: can't perform the search -%s\n",
+   "ERROR: can't perform the search - %s\n",
strerror(errno));
return ret;
}
diff --git a/send-utils.c b/send-utils.c
index 71b6ec1528e5..cbaf2e90acb4 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -556,7 +556,7 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
ret = is_uuid_tree_supported(mnt_fd);
if (ret < 0) {
fprintf(stderr,
-   "ERROR: check if we support uuid tree fails- %s\n",
+   "ERROR: check if we support uuid tree fails - %s\n",
strerror(errno));
return ret;
} else if (ret) {
@@ -579,7 +579,7 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
ret = ioctl(mnt_fd, BTRFS_IOC_TREE_SEARCH, &args);
e = errno;
if (ret < 0) {
-   fprintf(stderr, "ERROR: can't perform the search- %s\n",
+   fprintf(stderr, "ERROR: can't perform the search - %s\n",
strerror(e));
return ret;
}
-- 
2.1.3



Re: [PATCH 0/6] btrfs-progs: New 'lost+found' infrastructrue with

2014-11-25 Thread David Sterba
On Mon, Nov 24, 2014 at 05:06:59PM +0800, Qu Wenruo wrote:
> Introduce the new 'lost+found' dir and related infrastructure to create it
> in btrfs-progs.
> 
> [BUG]
> With the new infrastructure, fix a bug that some people reported in both
> the kernel BZ and the mailing list, in which some files have nlink 1 but a
> backref that points to a non-existent parent.
> Both reporters report a missing file (a chrome config file), so we'd
> better not delete such files but use the 'lost+found' dir instead.

Well, I don't like introducing the lost+found directory.

My idea is to extend the rescue utilities to extract the unlinked files and
copy them to a user-defined directory without touching the filesystem.

Or, at least make the in-filesystem lost+found directory creation
optional. You've split the patches well so I'm going to pull 1-5
directly. Patch 6 should be updated a bit, I'll look closer and will let
you know.

> 2. Unify the repair framework.
> When writing the 6th patch, I thought it would be better to build a
> framework that unifies check and repair.
> In the 6th patch, my patchset and Josef's commit 2dc4c001 in fact have some
> similar functions but do the repair at different times and in different
> functions.
> 
> I will try to build a unified framework for repair, each repair will be
> independent and have its own err number.
> And each repair function should work like the following:
> 1) Check the error number
> 2) Do the repair
> 3) Update the related btrfsck record(like newly created inode, deleted inode)

The unification is most welcome, feel free to send me anything that could be
merged as preparatory work (cleanups, safe changes, etc).


Re: [PATCH 2/2] btrfs-progs: convert: use task for progress indication of metadata creation

2014-11-25 Thread David Sterba
Fixed locally, no need to resend the patch. JFYI, these are the changes I made:

On Sun, Nov 09, 2014 at 11:16:56PM +0100, Silvio Fricke wrote:
> ---
>  Documentation/btrfs-convert.txt |  2 ++
>  Makefile                        |  6 ++--
>  btrfs-convert.c                 | 64 +
>  3 files changed, 64 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/btrfs-convert.txt b/Documentation/btrfs-convert.txt
> index 555fb35..9cc326c 100644
> --- a/Documentation/btrfs-convert.txt
> +++ b/Documentation/btrfs-convert.txt
> @@ -29,6 +29,8 @@ Roll back to ext2fs.
>  set filesystem label during conversion.
>  -L::
>  use label from the converted filesystem.
> +-p::
> +Show progress of convertation.

conversion

>  EXIT STATUS
>  ---
> diff --git a/Makefile b/Makefile
> index 203597c..f76c6b2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -8,7 +8,7 @@ AM_CFLAGS = -Wall -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES -fno-strict-alias
>  CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
>  objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
> root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
> -   extent-cache.o extent_io.o volumes.o utils.o repair.o \
> +   extent-cache.o extent_io.o volumes.o utils.o repair.o task-util.o \
> qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
> ulist.o qgroup-verify.o backref.o
>  cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
> @@ -20,13 +20,13 @@ libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
>  uuid-tree.o utils-lib.o rbtree-utils.o
>  libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
>  crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
> -extent_io.h ioctl.h ctree.h btrfsck.h version.h
> +extent_io.h ioctl.h ctree.h btrfsck.h version.h task-util.h

It's not used in library, removed

>  TESTS = fsck-tests.sh convert-tests.sh
>  
>  INSTALL = install
>  prefix ?= /usr/local
>  bindir = $(prefix)/bin
> -lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
> +lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L. -pthread

ditto

>  libdir ?= $(prefix)/lib
>  incdir = $(prefix)/include/btrfs
>  LIBS = $(lib_LIBS) $(libs_static)
> diff --git a/btrfs-convert.c b/btrfs-convert.c
> index a544fc6..f8ba920 100644
> --- a/btrfs-convert.c
> +++ b/btrfs-convert.c
> @@ -38,6 +38,7 @@
>  #include "transaction.h"
>  #include "crc32c.h"
>  #include "utils.h"
> +#include "task-util.h"
>  #include 
>  #include 
>  #include 
> @@ -45,6 +46,41 @@
>  #define INO_OFFSET (BTRFS_FIRST_FREE_OBJECTID - EXT2_ROOT_INO)
>  #define EXT2_IMAGE_SUBVOL_OBJECTID BTRFS_FIRST_FREE_OBJECTID
>  
> +struct private {

We'd still like to keep the namespace C++ clean, renamed to task_ctx

> + uint32_t max_copy_inodes;
> + uint32_t cur_copy_inodes;
> + struct task_info *info;
> +};
> +
> +static void *print_copied_inodes(void *p)
> +{
> + struct private *priv = p;
> + const char work_indicator[] = { '.', 'o', 'O', 'o' };
> + uint32_t count = 0;
> +
> + task_period_start(priv->info, 1000 /* 1s */);
> + while (1) {
> + count++;
> + printf("copy inodes [%c] [%10d/%10d]\r",
> +work_indicator[count % 4], priv->cur_copy_inodes,
> +priv->max_copy_inodes);
> + fflush(stdout);
> + task_period_wait(priv->info);
> + }
> +
> + return NULL;
> +}
> +
> +static int after_copied_inodes(void *p)
> +{
> + struct private *priv = p;
> +
> + printf("\n");
> + task_period_stop(priv->info);
> +
> + return 0;
> +}
> +
>  /*
>   * Open Ext2fs in readonly mode, read block allocation bitmap and
>   * inode bitmap into memory.
> @@ -1036,7 +1072,7 @@ fail:
>   * scan ext2's inode bitmap and copy all used inodes.
>   */
>  static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
> -int datacsum, int packing, int noxattr)
> +int datacsum, int packing, int noxattr, struct private *p)
>  {
>   int ret;
>   errcode_t err;
> @@ -1068,6 +1104,7 @@ static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
>   objectid, ext2_fs, ext2_ino,
>   &ext2_inode, datacsum, packing,
>   noxattr);
> + p->cur_copy_inodes++;
>   if (ret)
>   return ret;
>   if (trans->blocks_used >= 4096) {
> @@ -2197,7 +2234,7 @@ err:
>  }
>  
>  static int do_convert(const char *devname, int datacsum, int packing, int noxattr,
> -int copylabel, const char *fslabel)
> +int copylabel, const char *fslabel, int progress)
>  {
>   int i, ret;
>   int fd = -1;
> @@ -2275,11 +2312,23 @@ static int do_convert(const char *devname, int 

Re: [PATCH 1/2] btrfs-progs: add task-utils

2014-11-25 Thread David Sterba
Hi,

so I started reviewing the patches for inclusion in the 3.18 branch and
found a few things that I've fixed locally, this is just FYI.

On Sun, Nov 09, 2014 at 11:16:55PM +0100, Silvio Fricke wrote:
> Signed-off-by: Silvio Fricke 
> ---
>  task-util.c | 121
>  task-util.h |  33 +
>  2 files changed, 154 insertions(+)
>  create mode 100644 task-util.c
>  create mode 100644 task-util.h
> 
> diff --git a/task-util.c b/task-util.c
> new file mode 100644
> index 0000000..9268df7
> --- /dev/null
> +++ b/task-util.c

Renamed the file to task-utils.c

> diff --git a/task-util.h b/task-util.h
> new file mode 100644
> index 0000000..95f7b5b
> --- /dev/null
> +++ b/task-util.h
> @@ -0,0 +1,33 @@
> +
> +#ifndef __PROGRESS_
> +#define __PROGRESS_

__TASK_UTILS_H__

> +
> +#include 
> +
> +struct periodic_info {
> + int timer_fd;
> + unsigned long long wakeups_missed;
> +};
> +
> +struct task_info {
> + struct periodic_info periodic;
> + pthread_t id;
> + void *private_data;
> + void *(*threadfn)(void *);
> + int (*postfn)(void *);
> +};
> +
> +/* task life cycle */
> +struct task_info *task_init(void *(*threadfn)(void *), int (*postfn)(void *),
> + void *thread_private);
> +int task_start(struct task_info *info);
> +void task_stop(struct task_info *info);
> +void task_deinit(struct task_info *info);
> +
> +/* periodic life cycle */
> +int task_period_start(struct task_info *info, unsigned int period_ms);
> +void task_period_wait(struct task_info *info);
> +void task_period_stop(struct task_info *info);
> +
> +#endif /* __PROGRESS_ */

ditto


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Theodore Ts'o
On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
>   Actually, I'd also prefer to do the writing from iput_final(). My main
> reason is that shrinker starts behaving very differently when you put
> inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
> particular:
>         /*
>          * Referenced or dirty inodes are still in use. Give them another
>          * pass through the LRU as we cannot reclaim them now.
>          */
>         if (atomic_read(&inode->i_count) ||
>             (inode->i_state & ~I_REFERENCED)) {
>                 list_del_init(&inode->i_lru);
>                 spin_unlock(&inode->i_lock);
>                 this_cpu_dec(nr_unused);
>                 return LRU_REMOVED;
>         }

I must be missing something; how would the shrinker behave
differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
deliberate, because I didn't want I_DIRTY_TIME to have any effect on
any of the other parts of the writeback or inode management parts.

> Regarding your concern that we'd write the inode when file is closed -
> that's not true. We'll write the inode only after corresponding dentry is
> evicted and thus drops inode reference. That doesn't seem too bad to me.

True, fair enough.  It's not quite so lazy, but it should be close
enough.

I'm still not seeing the benefit in waiting until the last possible
minute to write out the timestamps; evict() can block as it is if
there is any writeback that needs to be completed, and if the
writeback happens to pages subject to delalloc, the timestamp update
could happen for free at that point.

- Ted


Re: [PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread David Sterba
On Tue, Nov 25, 2014 at 06:10:29PM +0100, Goffredo Baroncelli wrote:
> > Can the chunk list also display the usage inside the chunks?
> Unfortunately not. I don't know how it would be possible to get this info.

It takes some more parsing of the fs structures, doable via the
SEARCH_TREE ioctl. I have an unpolished patch to show the logical chunks
(though without actual chunk usage), similar can be done for per-device
chunks. The patch was a debugging helper so it hasn't been submitted yet
until it's more user friendly. I'll keep the per-device request in mind.


Re: [PATCH v2] Btrfs: deal with all 'subvol=xxx' options once

2014-11-25 Thread David Sterba
On Tue, Nov 25, 2014 at 04:20:11PM +0800, Wang Shilong wrote:
> Steps to reproduce:
>  # mkfs.btrfs -f /dev/sdb
>  # mount -t btrfs /dev/sdb /mnt
>  # btrfs sub create /mnt/dir
>  # mount -t btrfs /dev/sdb /mnt -o subvol=dir,subvol=dir
> 
> It fails with:
>  mount: mount(2) failed: No such file or directory

The bug is real, but I don't like the fix. The mount path is hard to
read already, and I'm afraid your fix adds another unobvious step to the
whole processing.

setup_root_args replaces subvol= with subvolid=0 once. I suggest
replacing all occurrences of subvol= here and not relying on the recursive
behaviour of the mount callbacks.

The (buggy) way it works now is that the first occurrence of subvol
will get parsed and passed as

newroot = vfs_kern_mount(",subvol=second,...,subvolid=0")

and this will call back again to btrfs_mount and will try to mount the
subvol 'second' but now relative to 'newroot'.

Try this:

# mkfs.btrfs -f /dev/sdb
# mount -t btrfs /dev/sdb /mnt
# btrfs sub create /mnt/dir
# btrfs sub create /mnt/dir/dir2
# mount -t btrfs /dev/sdb /mnt -o subvol=dir,subvol=dir2

mount succeeds and the mounted subvolume is dir2.
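
The observable behaviour of both reproducers can be modelled in a few lines
of Python. This is a sketch of the behaviour described above, not of the
actual kernel code path: resolve_subvol_options() is a made-up helper, and
subvolumes are represented simply as a set of paths relative to the top
level.

```python
def resolve_subvol_options(options, subvolumes):
    """Model of the recursive lookup: every subvol= occurrence is resolved
    relative to the root produced by the previous one.

    options:    mount option string, e.g. "subvol=dir,subvol=dir2"
    subvolumes: set of existing subvolume paths relative to the top level
    Returns the path of the subvolume that ends up mounted ("" means the
    top level), or raises FileNotFoundError to stand in for ENOENT.
    """
    current = ""  # start at the top-level subvolume
    for opt in options.split(","):
        if not opt.startswith("subvol="):
            continue  # other options do not affect the lookup in this model
        name = opt[len("subvol="):]
        candidate = f"{current}/{name}" if current else name
        if candidate not in subvolumes:
            raise FileNotFoundError(candidate)
        current = candidate
    return current


if __name__ == "__main__":
    # Second reproducer: dir and dir/dir2 both exist.
    print(resolve_subvol_options("subvol=dir,subvol=dir2",
                                 {"dir", "dir/dir2"}))
    # First reproducer: only dir exists, so dir/dir is not found.
    try:
        resolve_subvol_options("subvol=dir,subvol=dir", {"dir"})
    except FileNotFoundError as exc:
        print("No such file or directory:", exc)
```

Under this model a repeated subvol=dir only works if dir/dir happens to
exist, which matches the ENOENT from the original report.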


Re: [PATCH 4/4] ext4: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Fri 21-11-14 14:59:24, Ted Tso wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
> 
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> added support.
...
> diff --git a/fs/inode.c b/fs/inode.c
> index f0d6232..89cfca7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
>  }
>  EXPORT_SYMBOL(ilookup);
>  
> +/**
> + * find_active_inode_nowait - find an active inode in the inode cache
> + * @sb:  super block of file system to search
> + * @ino: inode number to search for
> + *
> + * Search for an active inode @ino in the inode cache, and if the
> + * inode is in the cache, the inode is returned with an incremented
> + * reference count.  If the inode is being freed or is newly
> + * initialized, return nothing instead of trying to wait for the inode
> + * initialization or destruction to be complete.
> + */
> +struct inode *find_active_inode_nowait(struct super_block *sb,
> +unsigned long ino)
> +{
> + struct hlist_head *head = inode_hashtable + hash(sb, ino);
> + struct inode *inode, *ret_inode = NULL;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, head, i_hash) {
> + if ((inode->i_ino != ino) ||
> + (inode->i_sb != sb))
> + continue;
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
> + __iget(inode);
> + ret_inode = inode;
> + }
> + spin_unlock(&inode->i_lock);
> + goto out;
> + }
> +out:
> + spin_unlock(&inode_hash_lock);
> + return ret_inode;
> +}
> +EXPORT_SYMBOL(find_active_inode_nowait);
> +
  Please move this function definition into a separate patch so that it
isn't hidden in an ext4-specific patch. Otherwise it looks good.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Jan Kara
On Fri 21-11-14 14:59:23, Ted Tso wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
  Hm, how about reusing i_dirtied_when for this? Using that field even
makes good sense to me...

Honza

> Signed-off-by: Theodore Ts'o 
> ---
>  fs/fs-writeback.c  | 1 +
>  fs/inode.c | 7 ++-
>  include/linux/fs.h | 1 +
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ce7de22..eb04277 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>   if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>   trace_writeback_dirty_inode_start(inode, flags);
>  
> + inode->i_ts_dirty_day = 0;
>   if (sb->s_op->dirty_inode)
>   sb->s_op->dirty_inode(inode, flags);
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 6e91aca..f0d6232 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
>   */
>  static int update_time(struct inode *inode, struct timespec *time, int flags)
>  {
> + int days_since_boot = jiffies / (HZ * 86400);
>   int ret;
>  
>   if (inode->i_op->update_time) {
> @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
>   if (flags & S_MTIME)
>   inode->i_mtime = *time;
>   }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> + (!inode->i_ts_dirty_day ||
> +  inode->i_ts_dirty_day == days_since_boot)) {
>   spin_lock(&inode->i_lock);
>   inode->i_state |= I_DIRTY_TIME;
>   spin_unlock(&inode->i_lock);
> + inode->i_ts_dirty_day = days_since_boot;
>   return 0;
>   }
> + inode->i_ts_dirty_day = 0;
>   if (inode->i_op->write_time)
>   return inode->i_op->write_time(inode);
>   mark_inode_dirty_sync(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 489b2f2..e3574cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -575,6 +575,7 @@ struct inode {
>   struct timespec i_ctime;
>   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
>   unsigned short  i_bytes;
> + unsigned short  i_ts_dirty_day;
>   unsigned int    i_blkbits;
>   blkcnt_t        i_blocks;
>  
> -- 
> 2.1.0
> 
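For readers following the quoted hunk, the day-bucket test it adds to update_time() can be modelled in user space. This is only a sketch under stated assumptions: struct ts_model and ts_update_is_lazy() are hypothetical names, the caller supplies the day number instead of deriving it from jiffies, and no locking or I_DIRTY_TIME handling is shown.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the quoted update_time() hunk. */
struct ts_model {
	unsigned short dirty_day;	/* stands in for i_ts_dirty_day; 0 = unset */
};

/*
 * Returns true if the timestamp update can stay in memory (lazy),
 * false if the caller should write the inode out now.
 */
static bool ts_update_is_lazy(struct ts_model *m, unsigned short today)
{
	if (m->dirty_day == 0 || m->dirty_day == today) {
		m->dirty_day = today;	/* remember the day of the deferral */
		return true;
	}
	m->dirty_day = 0;		/* deferred on an earlier day: flush */
	return false;
}
```

In this model an update made on a later day than the first deferred one is what forces the write, which is how the patch bounds the staleness of the on-disk timestamps.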
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
>   ...
> > > +}
> > 
> > This just seems wrong to me, not to mention extremely expensive when we have
> > millions of cached inodes on the superblock.
> 
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
> 
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
> 
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
> 
> I was concerned about putting them on the dirty inode list because
> they would be extra inodes that the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).
  I agree this isn't going to work easily. Currently the flusher relies on
the dirty list being sorted by i_dirtied_when, and that gets harder to maintain
if we ever have inodes with i_dirtied_when in the future (we'd have to sort in
newly dirtied inodes).

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.  Given #1 and #2 above,
> yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
> we're now iterating over all of the inodes twice, but I believe that
> is a better trade-off than bloating the inode structure with an extra
> set of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).
  This would be possible and, as Boaz says, it might be possible to reuse
the same list_head in the inode for this. Getting rid of the full scan of
all superblock inodes would be nice: the scan gets really expensive for
large numbers of inodes (think of i_sb_lock contention), and this patch makes
it twice as bad. So I'd prefer to do this if possible.
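To illustrate the sorting concern raised earlier in this reply, here is a small user-space model of a flusher that walks an oldest-first list and stops at the first entry that has not yet expired. It is a sketch only: count_expired() and the plain array are hypothetical stand-ins, not the kernel's writeback code.

```c
#include <stddef.h>

/*
 * Hypothetical model: dirtied_when[] is expected to be sorted oldest
 * first. Returns how many entries the scan treats as expired.
 */
static size_t count_expired(const unsigned long *dirtied_when, size_t n,
			    unsigned long now, unsigned long expire)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* The scan stops at the first entry that is still too young. */
		if (dirtied_when[i] + expire > now)
			break;
	}
	return i;
}
```

With a sorted list every expired entry is found. If an entry stamped far in the future sits in the middle, the scan stops there and older entries behind it are missed, which is why such inodes would have to be sorted in.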
 
Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Jan Kara
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
> 
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.
> 
> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.  For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had only been read on every
> close --- which in the case of say, all of the files in the kernel
> tree, would be a little unfortunate.
  Actually, I'd also prefer to do the writing from iput_final(). My main
reason is that shrinker starts behaving very differently when you put
inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
particular:
	/*
	 * Referenced or dirty inodes are still in use. Give them another
	 * pass through the LRU as we cannot reclaim them now.
	 */
	if (atomic_read(&inode->i_count) ||
	    (inode->i_state & ~I_REFERENCED)) {
		list_del_init(&inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Regarding your concern that we'd write the inode when the file is closed:
that's not true. We'll write the inode only after the corresponding dentry is
evicted and thus drops its inode reference. That doesn't seem too bad to me.
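The effect of the quoted test on an inode that only carries I_DIRTY_TIME can be shown with a tiny model. This is a sketch under assumptions: the X_* bit values and lru_can_reclaim() are made up for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Bit values are illustrative assumptions, not the kernel's definitions. */
#define X_REFERENCED	(1u << 0)
#define X_DIRTY_TIME	(1u << 1)

/*
 * Models only the quoted condition: a held reference, or any state bit
 * other than "referenced", means the inode is not reclaimed now.
 */
static bool lru_can_reclaim(int i_count, unsigned int i_state)
{
	return i_count == 0 && (i_state & ~X_REFERENCED) == 0;
}
```

Under this model an otherwise clean, unreferenced inode with only the dirty-time bit set fails the test, so the shrinker would skip it until something writes the timestamps out.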

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread Goffredo Baroncelli
On 11/25/2014 05:08 PM, Martin Steigerwald wrote:
> Hi Goffredo,
> 
> Am Dienstag, 25. November 2014, 16:57:21 schrieb Goffredo Baroncelli:
>> This is a revamp of my previous patch set[1]. After more than a
>> year of attempts these patches were never merged, so I tried to
>> simplify them and to change the focus a bit. The previous patch set
>> focused on disk usage.
>> The aim of these patches now is to show the distribution of chunks
>> among the disks. So a new command 'btrfs chunk list' is added:
>>
>> $ sudo ./btrfs chunk list /mnt/btrfs1/
>> Data,RAID6: Size:3.00GiB, Used:1.02MiB
>>    /dev/vdb                1.00GiB
>>    /dev/vdd                1.00GiB
>>    /dev/vde                1.00GiB
>>    /dev/vdf                1.00GiB
>>    /dev/vdg                1.00GiB
> 
> I still like your previous patch set *a lot* and wonder why exactly it has 
> never been merged. I know there has been quite some discussion about it, but 
> in the end I lost track.

David wrote a few minutes ago that this patch set is still in the queue.


> 
> Can the chunk list also display the usage inside the chunks?
Unfortunately not. I don't know how it would be possible to get this info.

> 
> Thanks,
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread Goffredo Baroncelli
On 11/25/2014 05:13 PM, David Sterba wrote:
> On Tue, Nov 25, 2014 at 04:57:21PM +0100, Goffredo Baroncelli wrote:
>> This is a revamp of my previous patch set[1]. After more than a
>> year of attempts these patches were never merged, so I tried to
>> simplify them and to change the focus a bit. The previous patch set
>> focused on disk usage.
> 
> The patches have been pending for the last few releases but I did not manage to
> fix all accounting problems in time and did not want to delay the
> release due to other important fixes. The 'new df' patchset is being
> refreshed each time and is part of the integration branches which might
> signify that they haven't been lost.
> 
Thanks, David, for taking care of that. Unfortunately I hadn't heard anything,
so I feared that this patch set was lost.

I hoped that simplifying the patch set would make it easier to
integrate.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-25 Thread Christoph Hellwig
On Tue, Nov 25, 2014 at 04:51:41PM +0100, David Sterba wrote:
> Does not work because the security.* and system.* namespaces do not call
> the permission() hook, so no patch. However, if the proposed
> inode_is_readonly callback is merged, we can replace the btrfs-specific
> checks with is_readonly check in xattr_permission().

I think that patch should go first in the series.  But I really need to
find some time to review the whole thing before commenting with profound
half-knowledge..


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread David Sterba
On Mon, Nov 24, 2014 at 03:07:45PM -0500, Chris Mason wrote:
> On Mon, Nov 24, 2014 at 12:23 AM, Liu Bo  wrote:
> > This brings a strong-but-slow checksum algorithm, sha256.
> > 
> > Actually btrfs used sha256 in its early days, but then moved to
> > crc32c for performance reasons.
> > 
> > As crc32c is sort of weak due to its hash collision issue, we need a
> > stronger algorithm as an alternative.
> > 
> > Users can choose sha256 from mkfs.btrfs via
> > 
> > $ mkfs.btrfs -C 256 /device
> 
> Agree with others about -C 256...-C sha256 is only three letters more ;)
> 
> What's the target for this mode?  Are we trying to find evil people 
> scribbling on the drive, or are we trying to find bad hardware?

We could provide an interface for external applications that would make
use of the strong checksums. Eg. external dedup, integrity db. The
benefit here is that the checksum is always up to date, so there's no
need to compute the checksums again. At the obvious cost.


Re: Changing label few times killed filesystem?

2014-11-25 Thread Boris Chernov


In an attempt to get more information, I commented out 
BUG_ON(rec->is_root) in cmds-check.c to let btrfsck check my file system 
without failing on this assertion. The output is below. I would 
appreciate any help or ideas...


# btrfsck /dev/sdb1  # Full log can be downloaded here: 
http://pastebin.com/D68vr69J

Checking filesystem on /dev/sdb1
UUID: 787e3bc1-7583-4bd8-a52e-e57fd7fc9243
checking extents
...
ref mismatch on [20987904 16384] extent item 0, found 1
Backref 20987904 parent 3 root 3 not found in extent tree
backpointer mismatch on [20987904 16384]
owner ref check failed [20987904 16384]
...messages like these repeat many times, download full log to see them 
all...

ref mismatch on [29540352 16384] extent item 0, found 1
Backref 29540352 parent 18446744073709551607 root 18446744073709551607 
not found in extent tree

backpointer mismatch on [29540352 16384]
owner ref check failed [29540352 16384]
...
Errors found in extent allocation tree or chunk allocation
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 5 root dir 256 not found
found 409600 bytes used err is 1
total csum bytes: 0
total tree bytes: 49152
total fs tree bytes: 0
total extent tree bytes: 16384
btree space waste bytes: 48246
file data blocks allocated: 0
 referenced 0
Btrfs v3.17


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread David Sterba
On Mon, Nov 24, 2014 at 01:23:05PM +0800, Liu Bo wrote:
> This brings a strong-but-slow checksum algorithm, sha256.
> 
> Actually btrfs used sha256 in its early days, but then moved to crc32c for
> performance reasons.
> 
> As crc32c is sort of weak due to its hash collision issue, we need a stronger
> algorithm as an alternative.
> 
> Users can choose sha256 from mkfs.btrfs via
> 
> $ mkfs.btrfs -C 256 /device

There's already some good feedback so I'll try to cover what hasn't been
mentioned yet.

I think it's better to separate the preparatory work from adding the
algorithm itself. The former can be merged (and tested) independently.

There are several checksum algorithms that trade off speed and strength
so we may want to support more than just sha256. They are easy to add, but I'd
rather see them added all at once than one by one.

Another question is whether we'd like to use a different checksum for data and
metadata. This would not cost any format change if we use the 2 bytes in
the super block csum_type.


Optional/crazy/format change stuff:

* per-file checksum algorithm - unlike compression, the whole file would
  have to use the same csum algo
  reflink would work iff the algos match
  snapshotting is unaffected

* per-subvolume checksum algorithm - specify the csum type at creation
  time, or afterwards unless it's modified
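One way to picture the "2 bytes in csum_type" idea is to give each byte its own algorithm id. The sketch below is purely illustrative: the byte assignment (data in the low byte, metadata in the high byte), the helper names and the CSUM_SHA256 id are assumptions, not part of any posted patch. The only value taken as given is 0 for crc32c.

```c
#include <stdint.h>

/* crc32c is 0 in btrfs today; the sha256 id here is hypothetical. */
enum { CSUM_CRC32C = 0, CSUM_SHA256 = 1 };

/* Assumed layout: low byte = data algorithm, high byte = metadata. */
static uint16_t csum_type_pack(uint8_t data_algo, uint8_t meta_algo)
{
	return (uint16_t)(data_algo | (meta_algo << 8));
}

static uint8_t csum_type_data(uint16_t csum_type)
{
	return (uint8_t)(csum_type & 0xff);
}

static uint8_t csum_type_meta(uint16_t csum_type)
{
	return (uint8_t)(csum_type >> 8);
}
```

With this layout an existing filesystem whose csum_type is 0 still decodes as crc32c for both data and metadata, which is the "no format change" property mentioned above.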


Re: BTRFS messes up snapshot LV with origin

2014-11-25 Thread Goffredo Baroncelli
On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
[...]
> md-raid works as long as you specify the devices, and because it's always
> the lowest layer it can ignore LVs (snapshot or otherwise).  It's also
> not a particularly common use case, while making an LV snapshot of a
> filesystem is a typical use case.

I fully agree; but you still consider a *multi-device* btrfs over lvm...
This is like a dm over lvm... which doesn't make sense at all (as you 
already wrote)

> 
>>> and mounting the filesystem fails at 3.  
>> Are you sure ?
> 
> Yes, I'm sure.  I've had to replace filesystems destroyed this way.
> 
>> [working instance snipped]
> 
>> On the basis of the example above, in case you want to mount a 
>> "single-disk", BTRFS seems to me to work properly. You only have to pay
>> attention not to mount the two filesystems at the same time.
> 
> The problem is btrfs stops searching when it sees one disk with each UUID,

BTRFS doesn't search for anything. It is udev which "pushes" the information
to the kernel module. The btrfs module groups this information by UUID.
When a new disk is inserted, its information overwrites that of the old one.


> so the set of disks (snapshot vs origin) that you get is *random*.
> For a pair of origin + snapshots, there's a 50% chance it works, 50%
> chance it eats your data.

Sorry but I have to disagree: the code is quite clear 
(see fs/btrfs/volumes.c, near line 512):

[...]

	} else if (!device->name || strcmp(device->name->str, path)) {
		/*
		 * When FS is already mounted.
		 * 1. If you are here and if the device->name is NULL that
		 *    means this device was missing at time of FS mount.
		 * 2. If you are here and if the device->name is different
		 *    from 'path' that means either
		 *      a. The same device disappeared and reappeared with
		 *         different name. or
		 *      b. The missing-disk-which-was-replaced, has
		 *         reappeared now.
		 *
		 * We must allow 1 and 2a above. But 2b would be a spurious
		 * and unintentional.

[...]

This is case 2a; here btrfs stores the new name and mounts it.

Anyway, I made a small test: I created one btrfs filesystem and 
made an LVM snapshot of it. Then I created two different files, one in the snapshot and one in
the origin. I ran a program which randomly mounts the former or
the latter and checks that the correct file is present; after more than 130 tests I
never saw your "50% chance it works": it always works.
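The registration behaviour described above (entries grouped by UUID, with a later scan of the same device id replacing the stored path) can be modelled as follows. This is a user-space sketch with hypothetical names (dev_table, dev_scan); it is not the kernel's device_list_add() and ignores generations, locking and mounted state.

```c
#include <string.h>

#define MAX_DEVS 8

struct dev_entry {
	char fsid[37];		/* filesystem UUID as text */
	unsigned long devid;
	char path[64];		/* last path reported for this device */
};

struct dev_table {
	struct dev_entry devs[MAX_DEVS];
	int count;
};

/* Register a scanned device; a repeat of (fsid, devid) replaces the path. */
static int dev_scan(struct dev_table *t, const char *fsid,
		    unsigned long devid, const char *path)
{
	struct dev_entry *d;
	int i;

	for (i = 0; i < t->count; i++) {
		d = &t->devs[i];
		if (d->devid == devid && strcmp(d->fsid, fsid) == 0) {
			strncpy(d->path, path, sizeof(d->path) - 1);
			d->path[sizeof(d->path) - 1] = '\0';
			return 0;
		}
	}
	if (t->count == MAX_DEVS)
		return -1;
	d = &t->devs[t->count++];
	strncpy(d->fsid, fsid, sizeof(d->fsid) - 1);
	d->fsid[sizeof(d->fsid) - 1] = '\0';
	d->devid = devid;
	strncpy(d->path, path, sizeof(d->path) - 1);
	d->path[sizeof(d->path) - 1] = '\0';
	return 0;
}
```

Scanning an origin LV and then its snapshot, which share UUID and device id, leaves a single entry pointing at the path scanned last.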

BR
G.Baroncelli

> 
>> BR
>> G.Baroncelli
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli 
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



Re: [PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread Martin Steigerwald
Hi Goffredo,

Am Dienstag, 25. November 2014, 16:57:21 schrieb Goffredo Baroncelli:
> This is a revamp of my previous patch set[1]. After more than a
> year of attempts these patches were never merged, so I tried to
> simplify them and to change the focus a bit. The previous patch set
> focused on disk usage.
> The aim of these patches now is to show the distribution of chunks
> among the disks. So a new command 'btrfs chunk list' is added:
> 
> $ sudo ./btrfs chunk list /mnt/btrfs1/
> Data,RAID6: Size:3.00GiB, Used:1.02MiB
>    /dev/vdb                1.00GiB
>    /dev/vdd                1.00GiB
>    /dev/vde                1.00GiB
>    /dev/vdf                1.00GiB
>    /dev/vdg                1.00GiB

I still like your previous patch set *a lot* and wonder why exactly it has 
never been merged. I know there has been quite some discussion about it, but 
in the end I lost track.

Can the chunk list also display the usage inside the chunks?

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: [PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread David Sterba
On Tue, Nov 25, 2014 at 04:57:21PM +0100, Goffredo Baroncelli wrote:
> This is a revamp of my previous patch set[1]. After more than a
> year of attempts these patches were never merged, so I tried to
> simplify them and to change the focus a bit. The previous patch set
> focused on disk usage.

The patches have been pending for the last few releases but I did not manage to
fix all accounting problems in time and did not want to delay the
release due to other important fixes. The 'new df' patchset is being
refreshed each time and is part of the integration branches which might
signify that they haven't been lost.


Re: [PATCH-v3 6/6] btrfs: add an is_readonly() so btrfs can use common code for update_time()

2014-11-25 Thread David Sterba
On Tue, Nov 25, 2014 at 12:34:34AM -0500, Theodore Ts'o wrote:
> The only reason btrfs cloned code from the VFS layer was so it could
> add a check to see if a subvolume is read-only.  Instead of doing that,
> let's add a new inode operation which allows a file system to return
> an error if the inode is read-only, and use that in update_time().
> There may be other places where the VFS layer may want to know that
> btrfs would want to treat an inode as read-only.
> 
> With this commit, there are no remaining users of update_time() in the
> inode operations structure, so we can remove it and simplify things
> further.
> 
> Signed-off-by: Theodore Ts'o 
> Cc: linux-btrfs@vger.kernel.org

Reviewed-by: David Sterba 


Re: [PATCH-v3 1/6] fs: split update_time() into update_time() and write_time()

2014-11-25 Thread David Sterba
On Tue, Nov 25, 2014 at 12:34:29AM -0500, Theodore Ts'o wrote:
> In preparation for adding support for the lazytime mount option, we
> need to be able to separate out the update_time() and write_time()
> inode operations.  Currently, only btrfs and xfs use update_time().
> 
> We needed to preserve update_time() because btrfs wants to have a
> special btrfs_root_readonly() check; otherwise we could drop the
> update_time() inode operation entirely.
> 
> Signed-off-by: Theodore Ts'o 
> Cc: x...@oss.sgi.com
> Cc: linux-btrfs@vger.kernel.org

For the btrfs changes

Acked-by: David Sterba 


[PATCH 4/4] Add man page for the 'btrfs chunk' family commands.

2014-11-25 Thread Goffredo Baroncelli
Add btrfs-chunk(8) man page, and update btrfs(8) man page.

Signed-off-by: Goffredo Baroncelli 
---
 Documentation/Makefile|  1 +
 Documentation/btrfs-chunk.txt | 58 +++
 Documentation/btrfs.txt   |  5 
 3 files changed, 64 insertions(+)
 create mode 100644 Documentation/btrfs-chunk.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ef4f1bd..f47f62b 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -18,6 +18,7 @@ MAN8_TXT += mkfs.btrfs.txt
 MAN8_TXT += btrfs-subvolume.txt
 MAN8_TXT += btrfs-filesystem.txt
 MAN8_TXT += btrfs-balance.txt
+MAN8_TXT += btrfs-chunk.txt
 MAN8_TXT += btrfs-device.txt
 MAN8_TXT += btrfs-scrub.txt
 MAN8_TXT += btrfs-check.txt
diff --git a/Documentation/btrfs-chunk.txt b/Documentation/btrfs-chunk.txt
new file mode 100644
index 0000000..85d1ea6
--- /dev/null
+++ b/Documentation/btrfs-chunk.txt
@@ -0,0 +1,58 @@
+btrfs-chunk(8)
+===============
+
+NAME
+----
+btrfs-chunk - control btrfs chunks
+
+SYNOPSIS
+--------
+*btrfs chunk*  
+
+DESCRIPTION
+-----------
+*btrfs chunk* is used to control the btrfs chunks.
+
+CHUNK DESCRIPTION
+-----------------
+Block devices are divided into chunks. Chunks may be
+mirrored or striped across multiple devices. The mirroring/striping
+arrangement is transparent to the rest of the filesystem, which simply
+sees the single, logical address space that chunks are mapped into.
+
+There are three main types of chunks: Data, Metadata and System. Sometimes
+(when the device is small) Data and Metadata can be grouped together; in
+this case the chunk is called 'Data+Metadata'. Usually the Data chunks
+are used to store the file content; but if the data is small enough, this
+may be stored in the Metadata chunk together with the filesystem data
+structures.
+
+The striping/mirroring levels (called profiles) may be different
+among the chunk types. The chunk profiles are assigned during the filesystem
+creation and these can be changed by the 'btrfs balance' commands.
+
+See `mkfs.btrfs`(8) and `btrfs-balance`(8) for more details.
+
+SUBCOMMAND
+----------
+*list*  ::
+Shows the chunks grouped by type and profile; after each group
+the command lists the devices which host the chunks.
+If a device is not present (i.e. the filesystem is mounted in 'degraded' mode),
+it is marked as 'Missing:'.
+
+EXIT STATUS
+-----------
+*btrfs chunk* returns a zero exit status if it succeeds. Non-zero is
+returned in case of failure.
+
+AVAILABILITY
+------------
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+--------
+`mkfs.btrfs`(8),
+`btrfs-balance`(8)
diff --git a/Documentation/btrfs.txt b/Documentation/btrfs.txt
index 3bdc6b4..fc02004 100644
--- a/Documentation/btrfs.txt
+++ b/Documentation/btrfs.txt
@@ -41,6 +41,10 @@ COMMANDS
Balance btrfs filesystem chunks across single or several devices. +
See `btrfs-balance`(8) for details.
 
+*chunk*::
+   Manage filesystem chunks; currently only listing is supported. +
+   See `btrfs-chunk`(8) for details.
+
 *device*::
Manage devices managed by btrfs, including add/delete/scan and so
on. +
@@ -102,6 +106,7 @@ SEE ALSO
 `btrfs-subvolume`(8),
 `btrfs-filesystem`(8),
 `btrfs-balance`(8),
+`btrfs-chunk`(8),
 `btrfs-device`(8),
 `btrfs-scrub`(8),
 `btrfs-check`(8),
-- 
2.1.3



[PATCH 1/4] Move group_type_str() and group_profile_str().

2014-11-25 Thread Goffredo Baroncelli
Move the group_type_str() and group_profile_str() functions to
utils.c, because they are now also used by the 'btrfs chunk list'
command.

Signed-off-by: Goffredo Baroncelli 
---
 cmds-filesystem.c | 43 ---
 utils.c   | 43 +++
 utils.h   |  3 +++
 3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index cd6b3c6..269e758 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -136,49 +136,6 @@ static const char * const cmd_df_usage[] = {
NULL
 };
 
-static char *group_type_str(u64 flag)
-{
-   u64 mask = BTRFS_BLOCK_GROUP_TYPE_MASK |
-   BTRFS_SPACE_INFO_GLOBAL_RSV;
-
-   switch (flag & mask) {
-   case BTRFS_BLOCK_GROUP_DATA:
-   return "Data";
-   case BTRFS_BLOCK_GROUP_SYSTEM:
-   return "System";
-   case BTRFS_BLOCK_GROUP_METADATA:
-   return "Metadata";
-   case BTRFS_BLOCK_GROUP_DATA|BTRFS_BLOCK_GROUP_METADATA:
-   return "Data+Metadata";
-   case BTRFS_SPACE_INFO_GLOBAL_RSV:
-   return "GlobalReserve";
-   default:
-   return "unknown";
-   }
-}
-
-static char *group_profile_str(u64 flag)
-{
-   switch (flag & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
-   case 0:
-   return "single";
-   case BTRFS_BLOCK_GROUP_RAID0:
-   return "RAID0";
-   case BTRFS_BLOCK_GROUP_RAID1:
-   return "RAID1";
-   case BTRFS_BLOCK_GROUP_RAID5:
-   return "RAID5";
-   case BTRFS_BLOCK_GROUP_RAID6:
-   return "RAID6";
-   case BTRFS_BLOCK_GROUP_DUP:
-   return "DUP";
-   case BTRFS_BLOCK_GROUP_RAID10:
-   return "RAID10";
-   default:
-   return "unknown";
-   }
-}
-
 static int get_df(int fd, struct btrfs_ioctl_space_args **sargs_ret)
 {
u64 count = 0;
diff --git a/utils.c b/utils.c
index 2a92416..c9b9e0e 100644
--- a/utils.c
+++ b/utils.c
@@ -2450,3 +2450,46 @@ int find_next_key(struct btrfs_path *path, struct btrfs_key *key)
}
return 1;
 }
+
+char *group_type_str(u64 flag)
+{
+   u64 mask = BTRFS_BLOCK_GROUP_TYPE_MASK |
+   BTRFS_SPACE_INFO_GLOBAL_RSV;
+
+   switch (flag & mask) {
+   case BTRFS_BLOCK_GROUP_DATA:
+   return "Data";
+   case BTRFS_BLOCK_GROUP_SYSTEM:
+   return "System";
+   case BTRFS_BLOCK_GROUP_METADATA:
+   return "Metadata";
+   case BTRFS_BLOCK_GROUP_DATA|BTRFS_BLOCK_GROUP_METADATA:
+   return "Data+Metadata";
+   case BTRFS_SPACE_INFO_GLOBAL_RSV:
+   return "GlobalReserve";
+   default:
+   return "unknown";
+   }
+}
+
+char *group_profile_str(u64 flag)
+{
+   switch (flag & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+   case 0:
+   return "single";
+   case BTRFS_BLOCK_GROUP_RAID0:
+   return "RAID0";
+   case BTRFS_BLOCK_GROUP_RAID1:
+   return "RAID1";
+   case BTRFS_BLOCK_GROUP_RAID5:
+   return "RAID5";
+   case BTRFS_BLOCK_GROUP_RAID6:
+   return "RAID6";
+   case BTRFS_BLOCK_GROUP_DUP:
+   return "DUP";
+   case BTRFS_BLOCK_GROUP_RAID10:
+   return "RAID10";
+   default:
+   return "unknown";
+   }
+}
diff --git a/utils.h b/utils.h
index 289e86b..21e3e83 100644
--- a/utils.h
+++ b/utils.h
@@ -161,4 +161,7 @@ static inline u64 btrfs_min_dev_size(u32 leafsize)
 
 int find_next_key(struct btrfs_path *path, struct btrfs_key *key);
 
+char *group_type_str(u64 flag);
+char *group_profile_str(u64 flag);
+
 #endif
-- 
2.1.3
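As an illustration of how the two moved helpers combine into a group label such as "Data,RAID6", here is a self-contained sketch. The BG_* macros are local stand-ins whose bit values are assumed to match the BTRFS_BLOCK_GROUP_* flags, the helpers are trimmed to a few cases, and format_group() is a hypothetical name that is not part of this patch.

```c
#include <stdio.h>
#include <stdint.h>

/* Local stand-ins for BTRFS_BLOCK_GROUP_* bits; values are assumptions. */
#define BG_DATA		(1ULL << 0)
#define BG_SYSTEM	(1ULL << 1)
#define BG_METADATA	(1ULL << 2)
#define BG_RAID1	(1ULL << 4)
#define BG_RAID6	(1ULL << 8)

/* Trimmed-down analogue of group_type_str(). */
static const char *type_str(uint64_t flag)
{
	switch (flag & (BG_DATA | BG_SYSTEM | BG_METADATA)) {
	case BG_DATA:
		return "Data";
	case BG_SYSTEM:
		return "System";
	case BG_METADATA:
		return "Metadata";
	default:
		return "unknown";
	}
}

/* Trimmed-down analogue of group_profile_str(). */
static const char *profile_str(uint64_t flag)
{
	switch (flag & (BG_RAID1 | BG_RAID6)) {
	case 0:
		return "single";
	case BG_RAID1:
		return "RAID1";
	case BG_RAID6:
		return "RAID6";
	default:
		return "unknown";
	}
}

/* Hypothetical helper: builds the "Type,Profile" label for one group. */
static int format_group(char *buf, size_t len, uint64_t flag)
{
	return snprintf(buf, len, "%s,%s", type_str(flag), profile_str(flag));
}
```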



[PATCH 0/4] New 'btrfs chunk list' command

2014-11-25 Thread Goffredo Baroncelli
This is a revamp of my previous patch set[1]. After more than a
year of attempts these patches were never merged, so I tried to
simplify them and to change the focus a bit. The previous patch set
focused on disk usage.
The aim of these patches now is to show the distribution of chunks
among the disks. So a new command 'btrfs chunk list' is added:

$ sudo ./btrfs chunk list /mnt/btrfs1/
Data,RAID6: Size:3.00GiB, Used:1.02MiB
   /dev/vdb                1.00GiB
   /dev/vdd                1.00GiB
   /dev/vde                1.00GiB
   /dev/vdf                1.00GiB
   /dev/vdg                1.00GiB

Metadata,RAID5: Size:1.00GiB, Used:112.00KiB
   /dev/vdb              256.00MiB
   /dev/vdd              256.00MiB
   /dev/vde              256.00MiB
   /dev/vdf              256.00MiB
   /dev/vdg              256.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/vde               32.00MiB
   /dev/vdg               32.00MiB

Unallocated:
   /dev/vdb               48.75GiB
   /dev/vdd               48.75GiB
   /dev/vde               48.72GiB
   /dev/vdf               48.75GiB
   /dev/vdg               48.72GiB


This command is smart enough to highlight that a disk is not present
(this happens when mount -o degraded is used). In case 
/dev/vdd disappears:

$ sudo ./btrfs chunk list /mnt/btrfs1/
Data,RAID6: Size:3.00GiB, Used:1.02MiB
   /dev/vdb                1.00GiB
   /dev/vde                1.00GiB
   /dev/vdf                1.00GiB
   /dev/vdg                1.00GiB
   [Missing: /dev/vdd]     1.00GiB

Metadata,RAID5: Size:1.00GiB, Used:112.00KiB
   /dev/vdb              256.00MiB
   /dev/vde              256.00MiB
   /dev/vdf              256.00MiB
   /dev/vdg              256.00MiB
   [Missing: /dev/vdd]   256.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/vde               32.00MiB
   /dev/vdg               32.00MiB

Unallocated:
   /dev/vdb               48.75GiB
   /dev/vde               48.72GiB
   /dev/vdf               48.75GiB
   /dev/vdg               48.72GiB
   [Missing: /dev/vdd]    48.75GiB


It is interesting to see what happens after a 
'btrfs balance':

$ sudo ./btrfs chunk list /mnt/btrfs1/
Data,RAID6: Size:2.00GiB, Used:792.00KiB
   /dev/vdb                1.00GiB
   /dev/vde                1.00GiB
   /dev/vdf                1.00GiB
   /dev/vdg                1.00GiB

Metadata,RAID5: Size:1.03GiB, Used:112.00KiB
   /dev/vdb              352.00MiB
   /dev/vde              352.00MiB
   /dev/vdf              352.00MiB
   /dev/vdg              352.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/vdb               32.00MiB
   /dev/vdf               32.00MiB

Unallocated:
   /dev/vdb               48.63GiB
   /dev/vde               48.66GiB
   /dev/vdf               48.63GiB
   /dev/vdg               48.66GiB
   [Missing: /dev/vdd]    50.00GiB

And what happens after a 'btrfs device delete missing':

$ sudo ./btrfs chunk list /mnt/btrfs1/
Data,RAID6: Size:2.00GiB, Used:792.00KiB
   /dev/vdb                1.00GiB
   /dev/vde                1.00GiB
   /dev/vdf                1.00GiB
   /dev/vdg                1.00GiB

Metadata,RAID5: Size:1.03GiB, Used:112.00KiB
   /dev/vdb              352.00MiB
   /dev/vde              352.00MiB
   /dev/vdf              352.00MiB
   /dev/vdg              352.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/vdb               32.00MiB
   /dev/vdf               32.00MiB

Unallocated:
   /dev/vdb               48.63GiB
   /dev/vde               48.66GiB
   /dev/vdf               48.63GiB
   /dev/vdg               48.66GiB

Comments are welcome.
BR
G.Baroncelli
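P.S. The Size values in the listings above are consistent with "per-device allocation times the number of non-parity devices" for the striped profiles. The arithmetic can be checked with a short sketch; striped_usable_mib() is a hypothetical helper for illustration, not part of the patches, and the parity counts (1 for RAID5, 2 for RAID6) are the usual ones.

```c
/*
 * Usable size of a parity-striped group, in MiB, given the space
 * allocated on each device. parity is 1 for RAID5, 2 for RAID6.
 */
static unsigned long striped_usable_mib(unsigned long per_dev_mib,
					unsigned int ndevs,
					unsigned int parity)
{
	if (ndevs <= parity)
		return 0;	/* not enough devices to hold any data */
	return per_dev_mib * (ndevs - parity);
}
```

For example, five devices with 1.00GiB each in RAID6 give 3072 MiB (3.00GiB), and four devices with 352.00MiB each in RAID5 give 1056 MiB, which is displayed as 1.03GiB.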



[1] My last attempt was done in 
http://comments.gmane.org/gmane.comp.file-systems.btrfs/32647,
then David resent the patches in 
http://www.spinics.net/lists/linux-btrfs/msg33630.html


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5



[PATCH 3/4] Add the btrfs chunk list command

2014-11-25 Thread Goffredo Baroncelli
This patch adds the 'btrfs chunk' command group.

Signed-off-by: Goffredo Baroncelli 
---
 Makefile   | 2 +-
 btrfs.c| 1 +
 commands.h | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 4cae30c..1744f9c 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
-  cmds-property.o
+  cmds-property.o cmds-chunk.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index e83349c..25fbb71 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -197,6 +197,7 @@ static const struct cmd_group btrfs_cmd_group = {
{ "device", cmd_device, NULL, &device_cmd_group, 0 },
{ "scrub", cmd_scrub, NULL, &scrub_cmd_group, 0 },
{ "check", cmd_check, cmd_check_usage, NULL, 0 },
+   { "chunk", cmd_chunk, NULL, &chunk_cmd_group, 0 },
{ "rescue", cmd_rescue, NULL, &rescue_cmd_group, 0 },
{ "restore", cmd_restore, cmd_restore_usage, NULL, 0 },
{ "inspect-internal", cmd_inspect, NULL, &inspect_cmd_group, 0 },
diff --git a/commands.h b/commands.h
index 4d870f6..41d69b0 100644
--- a/commands.h
+++ b/commands.h
@@ -80,6 +80,7 @@ extern const struct cmd_group filesystem_cmd_group;
 extern const struct cmd_group balance_cmd_group;
 extern const struct cmd_group device_cmd_group;
 extern const struct cmd_group scrub_cmd_group;
+extern const struct cmd_group chunk_cmd_group;
 extern const struct cmd_group inspect_cmd_group;
 extern const struct cmd_group property_cmd_group;
 extern const struct cmd_group quota_cmd_group;
@@ -101,6 +102,7 @@ int cmd_balance(int argc, char **argv);
 int cmd_device(int argc, char **argv);
 int cmd_scrub(int argc, char **argv);
 int cmd_check(int argc, char **argv);
+int cmd_chunk(int argc, char **argv);
 int cmd_chunk_recover(int argc, char **argv);
 int cmd_super_recover(int argc, char **argv);
 int cmd_inspect(int argc, char **argv);
-- 
2.1.3



[PATCH 2/4] Add the code for the btrfs chunk list command

2014-11-25 Thread Goffredo Baroncelli
Add the code for the btrfs chunk list command. The code iterates
over the chunks, grouping them by chunk type (data, metadata,
system) and chunk profile (linear, dup, raid1, raid5, raid6,
raid10...); it then displays the devices belonging to each group.
If a device is missing, it is marked as 'Missing'.

Signed-off-by: Goffredo Baroncelli 
---
 cmds-chunk.c | 699 +++
 1 file changed, 699 insertions(+)
 create mode 100644 cmds-chunk.c

diff --git a/cmds-chunk.c b/cmds-chunk.c
new file mode 100644
index 000..3afa2b1
--- /dev/null
+++ b/cmds-chunk.c
@@ -0,0 +1,699 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "kerncompat.h"
+#include "ctree.h"
+
+#include "commands.h"
+
+#include "version.h"
+
+/*
+ * To store the size information about the chunks:
+ * the chunks info are grouped by the tuple (type, devid, num_stripes),
+ * i.e. if two chunks are of the same type (RAID1, DUP...), are on the
+ * same disk, have the same stripes then their sizes are grouped
+ */
+struct chunk_info {
+   u64 type;
+   u64 size;
+   u64 devid;
+   u64 num_stripes;
+};
+
+/* to store information about the disks */
+struct disk_info {
+   u64 devid;
+   char path[BTRFS_DEVICE_PATH_NAME_MAX];
+   u64 size;
+};
+
+/*
+ * Add the chunk info to the chunk_info list
+ */
+static int add_info_to_list(struct chunk_info **info_ptr,
+   int *info_count,
+   struct btrfs_chunk *chunk)
+{
+
+   u64 type = btrfs_stack_chunk_type(chunk);
+   u64 size = btrfs_stack_chunk_length(chunk);
+   int num_stripes = btrfs_stack_chunk_num_stripes(chunk);
+   int j;
+
+   for (j = 0 ; j < num_stripes ; j++) {
+   int i;
+   struct chunk_info *p = 0;
+   struct btrfs_stripe *stripe;
+   u64 devid;
+
+   stripe = btrfs_stripe_nr(chunk, j);
+   devid = btrfs_stack_stripe_devid(stripe);
+
+   for (i = 0 ; i < *info_count ; i++)
+   if ((*info_ptr)[i].type == type &&
+   (*info_ptr)[i].devid == devid &&
+   (*info_ptr)[i].num_stripes == num_stripes ) {
+   p = (*info_ptr) + i;
+   break;
+   }
+
+   if (!p) {
+   int size = sizeof(struct chunk_info) * (*info_count+1);
+   struct chunk_info *res = realloc(*info_ptr, size);
+
+   if (!res) {
+   fprintf(stderr, "ERROR: not enough memory\n");
+   return -1;
+   }
+
+   *info_ptr = res;
+   p = res + *info_count;
+   (*info_count)++;
+
+   p->devid = devid;
+   p->type = type;
+   p->size = 0;
+   p->num_stripes = num_stripes;
+   }
+
+   p->size += size;
+
+   }
+
+   return 0;
+
+}
+
+/*
+ *  Helper to sort the chunk type
+ */
+static int cmp_chunk_block_group(u64 f1, u64 f2)
+{
+
+   u64 mask;
+
+   if ((f1 & BTRFS_BLOCK_GROUP_TYPE_MASK) ==
+   (f2 & BTRFS_BLOCK_GROUP_TYPE_MASK))
+   mask = BTRFS_BLOCK_GROUP_PROFILE_MASK;
+   else if (f2 & BTRFS_BLOCK_GROUP_SYSTEM)
+   return -1;
+   else if (f1 & BTRFS_BLOCK_GROUP_SYSTEM)
+   return +1;
+   else
+   mask = BTRFS_BLOCK_GROUP_TYPE_MASK;
+
+   if ((f1 & mask) > (f2 & mask))
+   return +1;
+   else if ((f1 & mask) < (f2 & mask))
+   return -1;
+   else
+   return 0;
+}
+
+/*
+ * Helper to sort the chunk
+ */
+static int cmp_chunk_info(const void *a, const void *b)
+{
+   return cmp_chunk_block_group(
+   ((struct chunk_info *)a)->type,
+   ((struct chunk_info *)b)->type);
+}
+
+/*
+ * This function load all the chunk info from the 'fd' filesystem
+ */
+static int load_chunk_

Re: [PATCH-v3 3/6] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Theodore Ts'o
On Tue, Nov 25, 2014 at 03:58:01PM +0100, Rasmus Villemoes wrote:
> 
> I think days_since_boot was a lot clearer than daycode. In any case,
> please make the comment and the code consistent.

Yeah, I was going back and forth between days since the epoch and days
since boot, but found it was more efficient to calculate the days
since boot.  Sure, I'll go back to days_since_boot.

> You should probably divide by the number of seconds in a day, not the
> number of jiffies in a day.

Right, brain hiccup on my part, will fix.

> Isn't div_u64 mostly for when the divisor is not known at compile time?
> Technically, "(u64)uptime.tv_sec / 86400" is of course a u64/u64 division,
> but the compiler should see that the divisor is only 32 bits and hence
> should be able to generate code which is at least as efficient as
> whatever inline asm the arch provides for u64/u32 divisions.

The problem with doing u64/u64 divisions is that the compiler will
call out to a (non-existent) library function on some architectures.
For example, try compiling the following with "gcc -S -m32 foo.c":

#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
	unsigned long long t = time(0);

	printf("%llu\n", t / 86400);
	return 0;
}

You will find in the .S file the following:

...
pushl   $0
pushl   $86400
pushl   %edx
pushl   %eax
call    __udivdi3
...

It will work fine when compiling for x86_64, but if you do this and push
out a git branch, you will soon get a very nice e-mail from the ktest
bot explaining how you've broken the build on the i386 architecture
since the kernel doesn't provide __udivdi3.

Hence if you are doing any kind of divisions involving u64, you have
to use the functions in include/linux/math64.h, and div_u64 is, per
math64.h:

/**
 * div_u64 - unsigned 64bit divide with 32bit divisor
 *
 * This is the most common 64bit divide and should be used if possible,
 * as many 32bit archs can optimize this variant better than a full 64bit
 * divide.
 */
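
To make the distinction concrete, here is a rough userspace sketch of a
u64-by-u32 divide written with shifts and subtracts only, so the C source
contains no 64-bit division operator. The name div_u64_sketch is made up
for illustration; it is not the math64.h implementation, which relies on
arch-specific code where available. In kernel code, just call div_u64().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for a 64-bit by 32-bit unsigned divide.
 * Classic bit-at-a-time long division: no '/' or '%' on 64-bit operands.
 */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	uint64_t quotient = 0;
	uint64_t rem = 0;
	int bit;

	assert(divisor != 0);

	for (bit = 63; bit >= 0; bit--) {
		/* Bring down the next bit of the dividend. */
		rem = (rem << 1) | ((dividend >> bit) & 1);
		if (rem >= divisor) {
			rem -= divisor;
			quotient |= (uint64_t)1 << bit;
		}
	}
	return quotient;
}
```

With a divisor of 86400 this turns an uptime in seconds into whole days,
which is the calculation under discussion.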

Cheers,

- Ted


Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-25 Thread David Sterba
On Mon, Nov 24, 2014 at 06:34:30PM +0100, David Sterba wrote:
> The btrfs_root_readonly checks in setxattr and removexattr are
> unnecessary because they're done through xattr_permission. I'll send a
> patch to remove them.

Does not work because the security.* and system.* namespaces do not call
the permission() hook, so no patch. However, if the proposed
inode_is_readonly callback is merged, we can replace the btrfs-specific
checks with is_readonly check in xattr_permission().


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Boaz Harrosh
On 11/25/2014 06:33 AM, Theodore Ts'o wrote:
<>
> 
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes that the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).
> 
> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.  

You could use the same list-head for both lists. 

If the inode is on the dirty-inode-list then there is no need to add it
to the list-for-dirtytime; it will be written soon anyway.
Otherwise you add it to the list-for-dirtytime.

If you (really) dirty an inode then you first remove it from the
list-for-dirtytime, and then add it to the dirty-inode-list.

So at any given time it is on only one list.
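
A minimal userspace sketch of that idea, assuming a hand-rolled circular
list rather than the kernel's list_head API. All names here (sketch_inode,
mark_time_dirty, mark_really_dirty, the two list heads) are hypothetical,
and locking is left out entirely:

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

/* Two list heads, but each inode carries only ONE link field. */
static struct list_node dirty_list = { &dirty_list, &dirty_list };
static struct list_node dirtytime_list = { &dirtytime_list, &dirtytime_list };

enum sketch_state { SKETCH_CLEAN, SKETCH_DIRTYTIME, SKETCH_DIRTY };

struct sketch_inode {
	struct list_node link;		/* shared by both lists */
	enum sketch_state state;
};

static void node_init(struct list_node *n)
{
	n->prev = n;
	n->next = n;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void node_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void sketch_inode_init(struct sketch_inode *inode)
{
	assert(inode != NULL);
	node_init(&inode->link);
	inode->state = SKETCH_CLEAN;
}

/* Timestamp-only change: queue on dirtytime unless already queued. */
static void mark_time_dirty(struct sketch_inode *inode)
{
	if (inode->state != SKETCH_CLEAN)
		return;	/* already on one of the two lists */
	node_add_tail(&inode->link, &dirtytime_list);
	inode->state = SKETCH_DIRTYTIME;
}

/* Real dirtying: leave the dirtytime list first, then join the dirty list. */
static void mark_really_dirty(struct sketch_inode *inode)
{
	if (inode->state == SKETCH_DIRTY)
		return;
	node_unlink(&inode->link);	/* harmless on a self-linked node */
	node_add_tail(&inode->link, &dirty_list);
	inode->state = SKETCH_DIRTY;
}
```

The single link field is what keeps the inode structure from growing; the
state field only stands in for whatever flag would record which list the
inode is currently on.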

<>

Cheers
Boaz


Re: [PATCH 0/9] Implement device scrub/replace for RAID56

2014-11-25 Thread Chris Mason

On Fri, Nov 14, 2014 at 8:50 AM, Miao Xie  wrote:
> This patchset implements the device scrub/replace function for RAID56;
> most of the implementation for the common data is similar to the other
> RAID types.
>
> The difference, and the difficulty, is the parity processing. To avoid
> problems with data that can easily change outside the stripe lock,
> we do most of the work in the RAID56 stripe lock context.
>
> And to avoid making the code more and more complex, we copy some
> code from the common data processing for the parity; the cleanup work
> is on my TODO list.
>
> We have done some testing and the patchset worked well. Of course, more
> tests are welcome. If you are interested in using or testing it, you can
> pull the patchset from
>
>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace


I'm getting crashes from btrfs/060 with these in place:

[ 1649.712413] BTRFS: assertion failed: logical + PAGE_SIZE <= rbio->raid_map[0] + rbio->stripe_len * rbio->nr_data, file: fs/btrfs/raid56.c, line: 2248
[ 1649.738982] [ cut here ]
[ 1649.748727] kernel BUG at fs/btrfs/ctree.h:4020!
[ 1649.758039] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC
[ 1649.768977] Modules linked in: fuse loop btrfs raid6_pq zlib_deflate lzo_compress xor k10temp coretemp hwmon xfs exportfs libcrc32c tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables nfsv3 nfs lockd grace mptctl netconsole autofs4 rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc ipv6 ext3 jbd dm_mod iTCO_wdt iTCO_vendor_support rtc_cmos ipmi_si ipmi_msghandler pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci ehci_hcd mlx4_en ptp pps_core mlx4_core ses enclosure sg button megaraid_sas
[ 1649.872917] CPU: 0 PID: 16687 Comm: kworker/u65:0 Not tainted 3.18.0-rc6-mason+ #3
[ 1649.888171] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1649.903962] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[ 1649.916588] task: 88072557dd90 ti: 88070fdc4000 task.ti: 88070fdc4000
[ 1649.931669] RIP: 0010:[]  [] raid56_parity_add_scrub_pages+0x8f/0xa0 [btrfs]
[ 1649.952169] RSP: 0018:88070fdc7b68  EFLAGS: 00010292
[ 1649.962852] RAX: 0089 RBX: 8804cf681f30 RCX: 4b4a
[ 1649.977177] RDX: 004a RSI: 0001 RDI:
[ 1649.991496] RBP: 88070fdc7b68 R08: 0001 R09:
[ 1650.005819] R10: 0001 R11:  R12: 880689b62800
[ 1650.020140] R13: 88024d85cf80 R14: 88075d0dd800 R15: 0003
[ 1650.034459] FS:  () GS:88085fc0() knlGS:
[ 1650.050757] CS:  0010 DS:  ES:  CR0: 80050033
[ 1650.062306] CR2: 7f445b6d0e78 CR3: 01c14000 CR4: 000407f0
[ 1650.076625] Stack:
[ 1650.080716]  88070fdc7bc8 a05f2e50 8804cf681fc8 88070010
[ 1650.095761]  88070fdc7b98 0001 880290f92340 88074eda9f00
[ 1650.110792]  8807edbb1700 880639910e20 8807edbb1700 1000
[ 1650.125865] Call Trace:
[ 1650.130845]  [] scrub_parity_check_and_repair+0x140/0x1e0 [btrfs]
[ 1650.146286]  [] scrub_block_put+0x8d/0x90 [btrfs]
[ 1650.158884]  [] ? cpuacct_account_field+0xd0/0xd0
[ 1650.171493]  [] scrub_bio_end_io_worker+0xe9/0x870 [btrfs]
[ 1650.185725]  [] normal_work_helper+0x84/0x330 [btrfs]
[ 1650.199041]  [] btrfs_scrub_helper+0x12/0x20 [btrfs]
[ 1650.212165]  [] process_one_work+0x1bf/0x520
[ 1650.223892]  [] ? process_one_work+0x13d/0x520
[ 1650.235988]  [] worker_thread+0x11e/0x4b0
[ 1650.247204]  [] ? __schedule+0x389/0x880
[ 1650.258242]  [] ? process_one_work+0x520/0x520
[ 1650.270314]  [] kthread+0xde/0x100
[ 1650.280302]  [] ? __init_kthread_worker+0x70/0x70
[ 1650.292894]  [] ret_from_fork+0x7c/0xb0
[ 1650.303746]  [] ? __init_kthread_worker+0x70/0x70
[ 1650.316359] Code: c0 e8 1d 5a 04 e1 0f 0b eb fe b9 c8 08 00 00 48 c7 c2 71 6b 62 a0 48 c7 c6 b8 c4 62 a0 48 c7 c7 80 c4 62 a0 31 c0 e8 f8 59 04 e1 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
[ 1650.356466] RIP  [] raid56_parity_add_scrub_pages+0x8f/0xa0 [btrfs]
[ 1650.372307]  RSP
[ 1650.381427] ---[ end trace 14445249faa12848 ]---




Re: [PATCH-v3 3/6] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Rasmus Villemoes
On Tue, Nov 25 2014, Theodore Ts'o  wrote:

>  static int update_time(struct inode *inode, struct timespec *time, int flags)
>  {
> + struct timespec uptime;
> + unsigned short daycode;
>   int ret;
>  
>   if (inode->i_op->update_time) {
> @@ -1525,17 +1527,33 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
>   if (flags & S_CTIME)
>   inode->i_ctime = *time;
>   if (flags & S_MTIME)
> - inode->i_mtime = *time;
> + inode->i_mtime = *time;
>   }
> + /*
> +  * If i_ts_dirty_day is zero, then either we have not deferred
> +  * timestamp updates, or the system has been up for less than
> +  * a day (so days_since_boot is zero), so we defer timestamp
> +  * updates in that case and set the I_DIRTY_TIME flag.  If a
> +  * day or more has passed, then i_ts_dirty_day will be
> +  * different from days_since_boot, and then we should update
> +  * the on-disk inode and then we can clear i_ts_dirty_day.
> +  */

I think days_since_boot was a lot clearer than daycode. In any case,
please make the comment and the code consistent.

>   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
>   !(flags & S_VERSION)) {
>   if (inode->i_state & I_DIRTY_TIME)
>   return 0;
> - spin_lock(&inode->i_lock);
> - inode->i_state |= I_DIRTY_TIME;
> - spin_unlock(&inode->i_lock);
> - return 0;
> + get_monotonic_boottime(&uptime);
> + daycode = div_u64(uptime.tv_sec, (HZ * 86400));

You should probably divide by the number of seconds in a day, not the
number of jiffies in a day.
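
To put numbers on it, a small userspace sketch. The tick rate of 1000 is
only an assumed value for illustration, and the function names are
hypothetical. Plain '/' is fine here because this is userspace.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ        1000u	/* assumed tick rate, illustration only */
#define SECONDS_PER_DAY  86400u

/* Divisor from the quoted hunk: jiffies per day, applied to seconds. */
static uint64_t days_by_jiffies_divisor(uint64_t uptime_sec)
{
	return uptime_sec / ((uint64_t)SKETCH_HZ * SECONDS_PER_DAY);
}

/* Suggested divisor: seconds per day. */
static uint64_t days_by_seconds_divisor(uint64_t uptime_sec)
{
	return uptime_sec / SECONDS_PER_DAY;
}

/* After three days of uptime, only the second form has advanced. */
static void divisor_sketch_selfcheck(void)
{
	uint64_t three_days = 3ull * SECONDS_PER_DAY;

	assert(days_by_seconds_divisor(three_days) == 3);
	assert(days_by_jiffies_divisor(three_days) == 0);
}
```

Under that assumed tick rate the first form would only tick over once
every 1000 days of uptime, so the day counter would effectively stay at
zero.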

Isn't div_u64 mostly for when the divisor is not known at compile time?
Technically, "(u64)uptime.tv_sec / 86400" is of course a u64/u64 division,
but the compiler should see that the divisor is only 32 bits and hence
should be able to generate code which is at least as efficient as
whatever inline asm the arch provides for u64/u32 divisions.


Rasmus


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread Liu Bo
On Mon, Nov 24, 2014 at 03:07:45PM -0500, Chris Mason wrote:
> On Mon, Nov 24, 2014 at 12:23 AM, Liu Bo  wrote:
> >This brings a strong-but-slow checksum algorithm, sha256.
> >
> >Actually btrfs used sha256 at the early time, but then moved to
> >crc32c for
> >performance purposes.
> >
> >As crc32c is sort of weak due to its hash collision issue, we need
> >a stronger
> >algorithm as an alternative.
> >
> >Users can choose sha256 from mkfs.btrfs via
> >
> >$ mkfs.btrfs -C 256 /device
> 
> Agree with others about -C 256...-C sha256 is only three letters more ;)

That's right, #stupidme

> 
> What's the target for this mode?  Are we trying to find evil people
> scribbling on the drive, or are we trying to find bad hardware?

This is actually inspired by ZFS, which offers checksum functions ranging
from the simple-and-fast fletcher2 to the slower-but-secure sha256.

Back in btrfs, crc32c is currently the only choice.

As for the slowness of sha256, Intel has a set of instructions for
it, "Intel SHA Extensions", that may help a lot.

Not insisting on it, I'm always open to any suggestions.

Btw, having played with merkle trees for a while, making good use
of our existing scrub looks more promising for implementing the feature
that detects changes between mounts.

thanks,
-liubo


strange device stats message

2014-11-25 Thread Russell Coker
I am in the middle of replacing /dev/sdb (which is a 3TB SATA disk that gives a 
few read errors on every scrub) with /dev/sdc2 (a partition on a new 4TB SATA 
disk).  I am running btrfs-tools version 3.17-1.1 from Debian/Unstable and 
Debian kernel 3.16.0-4-amd64.  I get the following, the last section of which 
seems wrong.  Would this be a bug in the kernel or btrfs-tools?

# btrfs device stats /big
[/dev/sdc2].write_io_errs   0
[/dev/sdc2].read_io_errs    0
[/dev/sdc2].flush_io_errs   0
[/dev/sdc2].corruption_errs 0
[/dev/sdc2].generation_errs 0
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    16
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[].write_io_errs   0
[].read_io_errs    0
[].flush_io_errs   0
[].corruption_errs 0
[].generation_errs 0

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


Re: Changing label few times killed filesystem?

2014-11-25 Thread Boris Chernov

On 2014-11-24 02:46, Duncan wrote:
> if you were using gmane's web service, that explains things as weaverd,
> the process that does the threading on the web side, was down for some days

Yes, I have used gmane blog. Good to know it is not down anymore.

Back on topic. Even after updating to the latest version, btrfsck still
does not work, with or without options such as --repair. Does anyone
know what "Assertion `rec->is_root` failed" means? Is it worth trying to
compile my own version of btrfsck without this assertion?
With or without the --repair option, it looks like this assertion stops
btrfsck very early, preventing it from checking the filesystem or
attempting to repair it.


# btrfsck /dev/sdb1
Checking filesystem on /dev/sdb1
UUID: 787e3bc1-7583-4bd8-a52e-e57fd7fc9243
checking extents
cmds-check.c:2645: check_owner_ref: Assertion `rec->is_root` failed.
btrfs check[0x41a081]
btrfs check[0x41a0a5]
btrfs check[0x409783]
btrfs check[0x40a45e]
btrfs check[0x41bfa9]
btrfs check[0x40b46a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fb275f24b45]
btrfs check[0x40b497]

# btrfsck --repair /dev/sdb1
enabling repair mode
Fixed 0 roots.
Checking filesystem on /dev/sdb1
UUID: 787e3bc1-7583-4bd8-a52e-e57fd7fc9243
checking extents
cmds-check.c:2645: check_owner_ref: Assertion `rec->is_root` failed.
btrfs check[0x41a081]
btrfs check[0x41a0a5]
btrfs check[0x409783]
btrfs check[0x40a45e]
btrfs check[0x41bfa9]
btrfs check[0x40b46a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fbc5b8dab45]
btrfs check[0x40b497]


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread Daniel Cegiełka
2014-11-25 11:30 GMT+01:00 Liu Bo :
> On Mon, Nov 24, 2014 at 11:34:46AM -0800, John Williams wrote:
>> On Mon, Nov 24, 2014 at 12:23 AM, Holger Hoffstätte
>>  wrote:
>>
>> > Would there be room for a compromise with e.g. 128 bits?
>>
>> For example, Spooky V2 hash is 128 bits and is very fast. It is
>> noncryptographic, but it is more than adequate for data checksums.
>>
>> http://burtleburtle.net/bob/hash/spooky.html
>>
>> SnapRAID uses this hash, and it runs at about 15 GB/sec on my machine
>> (Xeon E3-1270 V2 @ 3.50Ghz)
>
> Thanks for the suggestion, I'll take a look.
>
> Btw, it's not in kernel yet, is it?
>

The best option would be blake2b, but it isn't implemented in the
kernel. It is not a problem to use it locally (I can upload the code
stripped for usage in kernel).

from https://blake2.net/

Q: Why do you want BLAKE2 to be fast? Aren't fast hashes bad?

A: You want your hash function to be fast if you are using it to
compute the secure hash of a large amount of data, such as in
distributed filesystems (e.g. Tahoe-LAFS), cloud storage systems (e.g.
OpenStack Swift), intrusion detection systems (e.g. Samhain),
integrity-checking local filesystems (e.g. ZFS), peer-to-peer
file-sharing tools (e.g. BitTorrent), or version control systems (e.g.
git). You only want your hash function to be slow if you're using it
to "stretch" user-supplied passwords, in which case see the next
question.

https://blake2.net/
https://github.com/floodyberry/blake2b-opt

Best regards,
Daniel


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread Liu Bo
On Mon, Nov 24, 2014 at 11:34:46AM -0800, John Williams wrote:
> On Mon, Nov 24, 2014 at 12:23 AM, Holger Hoffstätte
>  wrote:
> 
> > Would there be room for a compromise with e.g. 128 bits?
> 
> For example, Spooky V2 hash is 128 bits and is very fast. It is
> noncryptographic, but it is more than adequate for data checksums.
> 
> http://burtleburtle.net/bob/hash/spooky.html
> 
> SnapRAID uses this hash, and it runs at about 15 GB/sec on my machine
> (Xeon E3-1270 V2 @ 3.50Ghz)

Thanks for the suggestion, I'll take a look.

Btw, it's not in kernel yet, is it?

thanks,
-liubo


Re: [RFC PATCH] Btrfs: add sha256 checksum option

2014-11-25 Thread Liu Bo
On Mon, Nov 24, 2014 at 08:23:25AM +, Holger Hoffstätte wrote:
> On Mon, 24 Nov 2014 13:23:05 +0800, Liu Bo wrote:
> 
> > This brings a strong-but-slow checksum algorithm, sha256.
> > 
> > Actually btrfs used sha256 at the early time, but then moved to crc32c for
> > performance purposes.
> > 
> > As crc32c is sort of weak due to its hash collision issue, we need a 
> > stronger
> > algorithm as an alternative.
> 
> I'm curious - did you see actual cases where this happened, i.e. a corrupt
> block that would pass crc32 validation? I know some high-integrity use
> cases require a stronger algorithm - just wondering.

Haven't seen that so far, but here is a link about a crc32c hash collision
in btrfs, http://lwn.net/Articles/529077/. It's not about data checksums,
though: btrfs's DIR_ITEM also uses a crc32c hash. If such collisions
happened on data blocks, something interesting would happen.
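
For reference, a bit-at-a-time sketch of the standard CRC-32C (Castagnoli)
checksum in userspace C. This uses the usual all-ones initial value and
final inversion; how btrfs seeds its own crc32c calls may differ, so treat
the seeding here as an assumption. The function name is hypothetical. The
point is only that the result is 32 bits wide, so there are at most 2^32
distinct values for any amount of input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32C polynomial (Castagnoli). */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Slow but simple: processes one bit per inner loop iteration. */
static uint32_t crc32c_sketch(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	int bit;

	assert(data != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			/* XOR in the polynomial only when the low bit is set. */
			crc = (crc >> 1) ^
			      (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}
```

The well-known check value for the nine ASCII bytes "123456789" under
these parameters is 0xE3069283.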

> 
> Would there be room for a compromise with e.g. 128 bits?

Yeah, we're good if it's not larger than 256 bits.

> 
> > Users can choose sha256 from mkfs.btrfs via
> > 
> > $ mkfs.btrfs -C 256 /device
> 
> Not sure how others feel about this, but it's probably easier for
> sysadmins to specify the algorithm by name from the set of supported
> ones, similar to how ssh does it ("ssh -C arcfour256").

Urr, my bad, I've made the change locally but didn't 'git add' it.

thanks,
-liubo


[PATCH] Btrfs-progs: fix a bug of converting sparse ext2/3/4

2014-11-25 Thread Liu Bo
When converting a sparse ext* filesystem, btrfs-convert adds checksum extents
for empty extents, whose disk_bytenr = 0, which can cause some weird
problems; one of them is a failure to read the free space cache inode when
mounting the converted btrfs.

The fix is simple: just skip checksumming empty extents.

Signed-off-by: Liu Bo 
---
 btrfs-convert.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index a544fc6..02c5e94 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -393,7 +393,7 @@ static int record_file_blocks(struct btrfs_trans_handle *trans,
ret = btrfs_record_file_extent(trans, root, objectid, inode, file_pos,
disk_bytenr, num_bytes);
 
-   if (ret || !checksum)
+   if (ret || !checksum || disk_bytenr == 0)
return ret;
 
return csum_disk_extent(trans, root, disk_bytenr, num_bytes);
-- 
1.8.1.4



[PATCH v2] Btrfs: deal with all 'subvol=xxx' options once

2014-11-25 Thread Wang Shilong
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount -t btrfs /dev/sdb /mnt
 # btrfs sub create /mnt/dir
 # mount -t btrfs /dev/sdb /mnt -o subvol=dir,subvol=dir

It fails with:
 mount: mount(2) failed: No such file or directory

Btrfs deals with subvolume mounting recursively. To avoid looping,
it strips out the 'subvol=' string so that the next pass stops.
The problem is that it only handles one such string at a time, so if
users specify the mount option multiple times it loops several
times, which is not good, and the reproduction steps above also
return confusing results.

Fix this problem by stripping out all 'subvol=xxx' options;
only the last one is valid.

Signed-off-by: Wang Shilong 
---
v1->v2: error handling and comment update
---
 fs/btrfs/super.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..42f3176 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1115,7 +1115,7 @@ static inline int is_subvolume_inode(struct inode *inode)
  * subvolid=0 to make sure we get the actual tree root for path walking to the
  * subvol we want.
  */
-static char *setup_root_args(char *args)
+static char *__setup_root_args(char *args)
 {
unsigned len = strlen(args) + 2 + 1;
char *src, *dst, *buf;
@@ -1129,13 +1129,12 @@ static char *setup_root_args(char *args)
 */
 
src = strstr(args, "subvol=");
-   /* This shouldn't happen, but just in case.. */
if (!src)
return NULL;
 
buf = dst = kmalloc(len, GFP_NOFS);
if (!buf)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
/*
 * If the subvol= arg is not at the start of the string,
@@ -1161,6 +1160,27 @@ static char *setup_root_args(char *args)
return buf;
 }
 
+static char *setup_root_args(char *args)
+{
+   char *p, *new_args;
+
+   p = new_args = __setup_root_args(args);
+   /* in case users specify subvol=xxx option multiple times */
+   while (p) {
+   p = __setup_root_args(new_args);
+   if (!p)
+   break;
+   if (!IS_ERR(p)) {
+   kfree(new_args);
+   new_args = p;
+   } else {
+   kfree(new_args);
+   return NULL;
+   }
+   }
+   return new_args;
+}
+
 static struct dentry *mount_subvol(const char *subvol_name, int flags,
   const char *device_name, char *data)
 {
-- 
1.7.12.4
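
For illustration only, a userspace sketch of dropping every 'subvol=xxx'
option from a comma-separated option string in a single pass. This walks
the string token by token rather than calling strstr() repeatedly as the
patch does, and it only removes the options (it does not insert the
subvolid=0 the kernel helper is concerned with). The name
strip_subvol_opts is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of opts with every "subvol=..." token removed,
 * or NULL on allocation failure. The caller frees the result.
 */
static char *strip_subvol_opts(const char *opts)
{
	const char *p = opts;
	char *out, *dst;

	assert(opts != NULL);

	out = malloc(strlen(opts) + 1);
	if (!out)
		return NULL;
	dst = out;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t n = end ? (size_t)(end - p) : strlen(p);

		/* Keep non-empty tokens that are not subvol=... */
		if (n > 0 && strncmp(p, "subvol=", 7) != 0) {
			if (dst != out)
				*dst++ = ',';
			memcpy(dst, p, n);
			dst += n;
		}
		p += n;
		if (*p == ',')
			p++;
	}
	*dst = '\0';
	return out;
}
```

Note that "subvolid=..." tokens are left alone, since only the exact
"subvol=" prefix is matched.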
