Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 8:24 PM, Tomasz Kusmierz  wrote:

> you are throwing out a lot of useful data, maybe divert some of it into the wiki?
> You know, us normal people might find it useful for making an educated choice
> in the future? :)

There is a wiki, and it's difficult to keep up to date as it is.
There are just too many changes happening in Btrfs, and really only
the devs have a bird's-eye view of what's going on and what will happen
sooner rather than later.



-- 
Chris Murphy


Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Eric Sandeen
On 7/6/16 8:35 AM, Holger Hoffstätte wrote:
> On 07/06/16 14:25, Wang Shilong wrote:

...

>> After patch, it will look like:
>>Total   Exclusive  Set shared  Filename
>> skipping not btrfs dir/file: boot
>> skipping not btrfs dir/file: dev
>> skipping not btrfs dir/file: proc
>> skipping not btrfs dir/file: run
>> skipping not btrfs dir/file: sys
>>  0.00B   0.00B   -  //root/.bash_logout
>>  0.00B   0.00B   -  //root/.bash_profile
>>  0.00B   0.00B   -  //root/.bashrc
>>  0.00B   0.00B   -  //root/.cshrc
>>  0.00B   0.00B   -  //root/.tcshrc
>>
>> This works for me to analyze system usage and
>> performance.
> 
> This is great, but can we please skip the "skipping .." messages?
> Maybe it's just me but I really don't see the value of printing them
> when they don't contribute to the result.
> They also mess up the display. :)


I agree, those messages add no value.


-Eric

> thanks,
> Holger
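
A possible interim workaround, assuming the command under discussion is the
btrfs-progs "btrfs filesystem du" subcommand: only hand it btrfs mount points,
so there is nothing to skip in the first place. A minimal sketch (findmnt from
util-linux is assumed to be available):

#!/bin/bash
# Sketch: run btrfs filesystem du only on btrfs mount points, so non-btrfs
# paths like /boot, /dev or /proc never have to be skipped.
mapfile -t btrfs_mounts < <(findmnt -t btrfs -n -o TARGET)
if [ "${#btrfs_mounts[@]}" -gt 0 ]; then
    btrfs filesystem du -s "${btrfs_mounts[@]}"
fi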



Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()

2016-07-06 Thread Wang Xiaoguang

hello,

On 07/06/2016 08:27 PM, Holger Hoffstätte wrote:

On 07/06/16 12:37, Wang Xiaoguang wrote:

The test script below can reproduce this false ENOSPC:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile

The above fallocate(2) operation will fail with ENOSPC, but the fs actually
still has free space to satisfy the request. The reason is that
btrfs_fallocate() does not decrease btrfs_space_info's bytes_may_use in
time; it only calls btrfs_free_reserved_data_space_noquota() at the end of
btrfs_fallocate(), which is too late and has already put false, unnecessary
pressure on the enospc system. See the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
 It will add btrfs_space_info's bytes_may_use accordingly.
|-> btrfs_prealloc_file_range()
 It will call btrfs_reserve_extent(), but note that alloc type is
 RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will
 only increase btrfs_space_info's bytes_reserved accordingly, but
 will not decrease btrfs_space_info's bytes_may_use. We have therefore
 overestimated the disk space really needed, which impacts other
 processes doing write(2) or fallocate(2) operations, and can also
 impact metadata reservation in mixed mode; bytes_may_use is only
 decreased at the very end of btrfs_fallocate(). To fix this false
 ENOSPC, we need to decrease btrfs_space_info's bytes_may_use in
 btrfs_prealloc_file_range() in time, as is done in cow_file_range().
 See the call graph:
 cow_file_range()
 |-> extent_clear_unlock_delalloc()
 |-> clear_extent_bit()
 |-> btrfs_clear_bit_hook()
 |-> btrfs_free_reserved_data_space_noquota()
 This function will decrease bytes_may_use accordingly.

So this patch chooses to call btrfs_free_reserved_data_space() in
__btrfs_prealloc_file_range() on both the successful and the failed paths.

Also this patch removes some old and useless comments.
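
For reference, the over-reservation described above can also be watched from
userspace while the reproducer runs; a rough sketch, assuming a kernel that
exposes the per-space-info counters under /sys/fs/btrfs (file availability
varies by kernel version):

#!/bin/bash
# Sketch: dump the data space-info counters for the loop filesystem mounted
# at /tmp/mntpoint (the mount point used by the reproducer script).
uuid=$(findmnt -n -o UUID /tmp/mntpoint)
sysfs=/sys/fs/btrfs/$uuid/allocation/data
for f in bytes_may_use bytes_reserved bytes_used total_bytes; do
    [ -r "$sysfs/$f" ] && printf '%-16s %s\n' "$f:" "$(cat "$sysfs/$f")"
done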

Signed-off-by: Wang Xiaoguang 

Verified that the reproducer script indeed fails (with btrfs ~4.7) and
the patch (on top of 1/2) fixes it. Also ran a bunch of other fallocating
things without problem. Free space also still seems sane, as far as I
could tell.

So for both patches:

Tested-by: Holger Hoffstätte 

Thanks very much :)

Regards,
Xiaoguang Wang



cheers,
Holger









Re: [PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()

2016-07-06 Thread Wang Xiaoguang

hello,

On 07/07/2016 03:54 AM, Liu Bo wrote:

On Wed, Jul 06, 2016 at 06:37:52PM +0800, Wang Xiaoguang wrote:

In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the
wrong file offset for reloc_inode: it uses cluster->start and cluster->end,
which are actually extent bytenrs. The correct values are
cluster->[start|end] minus the block group's start bytenr.

    start bytenr      cluster->start
        |                  |
        |  | extent | extent | ... | extent |
        |<-------- block group (reloc_inode) -------->|

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/relocation.c | 27 +++
  1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..abc2f69 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
u64 num_bytes;
int nr = 0;
int ret = 0;
+   u64 prealloc_start, prealloc_end;
  
  	BUG_ON(cluster->start != cluster->boundary[0]);

inode_lock(inode);
  
-	ret = btrfs_check_data_free_space(inode, cluster->start,

- cluster->end + 1 - cluster->start);
+   start = cluster->start - offset;
+   end = cluster->end - offset;
+   ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
if (ret)
goto out;
  
  	while (nr < cluster->nr) {

-   start = cluster->boundary[nr] - offset;
+   prealloc_start = cluster->boundary[nr] - offset;
if (nr + 1 < cluster->nr)
-   end = cluster->boundary[nr + 1] - 1 - offset;
+   prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
else
-   end = cluster->end - offset;
+   prealloc_end = cluster->end - offset;
  
-		lock_extent(&BTRFS_I(inode)->io_tree, start, end);

-   num_bytes = end + 1 - start;
-   ret = btrfs_prealloc_file_range(inode, 0, start,
+   lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+   prealloc_end);
+   num_bytes = prealloc_end + 1 - prealloc_start;
+   ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
num_bytes, num_bytes,
-   end + 1, &alloc_hint);
-   unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
+   prealloc_end + 1, &alloc_hint);
+   unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+ prealloc_end);

Changing names is unnecessary, we can pick up other names for 
btrfs_{check/free}_data_free_space().

OK, then the changes will be small, thanks.

Regards,
Xiaoguang Wang



Thanks,

-liubo


if (ret)
break;
nr++;
}
-   btrfs_free_reserved_data_space(inode, cluster->start,
-  cluster->end + 1 - cluster->start);
+   btrfs_free_reserved_data_space(inode, start, end + 1 - start);
  out:
inode_unlock(inode);
return ret;
--
2.9.0











Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 02:46, Chris Murphy  wrote:
> 

Chaps, I didn't want this to turn into an argument about btrfs performance,

BUT 

you are throwing out a lot of useful data, maybe divert some of it into the wiki?
You know, us normal people might find it useful for making an educated choice
in the future? :)

Interestingly on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s


filefrag -v
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..2471: 2101940598..2101943069:   2472:
   1: 2472..   12583: 1938312686..1938322797:  10112: 2101943070:
   2:12584..   12837: 1937534654..1937534907:254: 1938322798:
   3:12838..   12839: 1937534908..1937534909:  2:
   4:12840..   34109: 1902954063..1902975332:  21270: 1937534910:
   5:34110..   53671: 1900857931..1900877492:  19562: 1902975333:
   6:53672..   54055: 1900877493..1900877876:384:
   7:54056..   54063: 1900877877..1900877884:  8:
   8:54064..   98041: 1900877885..1900921862:  43978:
   9:98042..  117671: 1900921863..1900941492:  19630:
  10:   117672..  118055: 1900941493..1900941876:384:
  11:   118056..  161833: 1900941877..1900985654:  43778:
  12:   161834..  204013: 1900985655..1901027834:  42180:
  13:   204014..  214269: 1901027835..1901038090:  10256:
  14:   214270..  214401: 1901038091..1901038222:132:
  15:   214402..  214407: 1901038223..1901038228:  6:
  16:   214408..  258089: 1901038229..1901081910:  43682:
  17:   258090..  300139: 1901081911..1901123960:  42050:
  18:   300140..  310559: 1901123961..1901134380:  10420:
  19:   310560..  310695: 1901134381..1901134516:136:
  20:   310696..  354251: 1901134517..1901178072:  43556:
  21:   354252..  396389: 1901178073..1901220210:  42138:
  22:   396390..  406353: 1901220211..1901230174:   9964:
  23:   406354..  406515: 1901230175..1901230336:162:
  24:   406516..  406519: 1901230337..1901230340:  4:
  25:   406520..  450115: 1901230341..1901273936:  43596:
  26:   450116..  492161: 1901273937..1901315982:  42046:
  27:   492162..  524199: 1901315983..1901348020:  32038:
  28:   524200..  535355: 1901348021..1901359176:  11156:
  29:   535356..  535591: 1901359177..1901359412:236:
  30:   535592.. 1315369: 1899830240..1900610017: 779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:  42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:  10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:140:
  34:  1368232.. 2113959: 1899043808..1899789535: 745728: 1928111866:
   35:  2113960.. 2899082: 1898257376..1899042498: 785123: 1899789536: last,eof


If it were possible to read from 6 disks at once, maybe the linear read
performance would be better.

Anyway, this is a huge diversion from the original question, so maybe we
will end here?




[PATCH v1] btrfs: Avoid reading out unnecessary extent tree blocks when mounting

2016-07-06 Thread Qu Wenruo
The btrfs_read_block_groups() function is the most time-consuming function
at mount if the fs is large and filled with small extents.

For a btrfs filled with 100,000,000 16K files, mounting the fs takes
about 10 seconds.

Meanwhile, ftrace shows that btrfs_read_block_groups() takes about 9
seconds, i.e. about 90% of the mount time.
So it's worth speeding up btrfs_read_block_groups() to reduce the
overall mount time.
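
For context, one way such a per-function timing can be taken is with the
function_graph tracer; a rough sketch (device and mount point are
placeholders, debugfs is assumed at /sys/kernel/debug, and the function must
be traceable on the running kernel):

#!/bin/bash
# Sketch: trace how long btrfs_read_block_groups() takes during a mount.
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
echo btrfs_read_block_groups > set_graph_function
echo 1 > tracing_on
mount /dev/sdb1 /mnt
echo 0 > tracing_on
head -n 40 trace    # total duration is reported at the function's closing brace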

btrfs_read_block_groups() calls btrfs_search_slot() to find block group
items. However, the search key is (objectid, BTRFS_BLOCK_GROUP_ITEM_KEY, 0),
while a real block group item's offset is the block group length, which is
non-zero, so btrfs_search_slot() always returns the previous slot.

In most cases that's OK, since the block group item and the previous slot
are in the same leaf. But when the block group item is the first item of
a leaf, we must read out the next leaf to get the block group item.
This needs extra IO and makes btrfs_read_block_groups() slower.

In fact, before we call btrfs_read_block_groups(), we have already read
out all chunks, which are 1:1 mapped to block group items.
So we can put the exact block group length into the search key for
btrfs_search_slot(), avoiding any possible btrfs_next_leaf() and speeding
up btrfs_read_block_groups().

With this patch, time spent in btrfs_read_block_groups() is reduced to
7.56s, compared to the old 8.94s, about a 15% improvement.

Reported-by: Tsutomu Itoh 
Signed-off-by: Qu Wenruo 
---
v2:
  Update commit message
---
 fs/btrfs/extent-tree.c | 61 --
 fs/btrfs/extent_map.h  | 22 ++
 2 files changed, 46 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..874f5b3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9648,39 +9648,20 @@ out:
return ret;
 }
 
-static int find_first_block_group(struct btrfs_root *root,
-   struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct extent_map *chunk_em)
 {
int ret = 0;
-   struct btrfs_key found_key;
-   struct extent_buffer *leaf;
-   int slot;
-
-   ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-   if (ret < 0)
-   goto out;
+   struct btrfs_key key;
 
-   while (1) {
-   slot = path->slots[0];
-   leaf = path->nodes[0];
-   if (slot >= btrfs_header_nritems(leaf)) {
-   ret = btrfs_next_leaf(root, path);
-   if (ret == 0)
-   continue;
-   if (ret < 0)
-   goto out;
-   break;
-   }
-   btrfs_item_key_to_cpu(leaf, &found_key, slot);
+   key.objectid = chunk_em->start;
+   key.offset = chunk_em->len;
+   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 
-   if (found_key.objectid >= key->objectid &&
-   found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-   ret = 0;
-   goto out;
-   }
-   path->slots[0]++;
-   }
-out:
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret > 0)
+   ret = -ENOENT;
return ret;
 }
 
@@ -9899,16 +9880,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
struct btrfs_block_group_cache *cache;
struct btrfs_fs_info *info = root->fs_info;
struct btrfs_space_info *space_info;
-   struct btrfs_key key;
+   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+   struct extent_map *chunk_em;
struct btrfs_key found_key;
struct extent_buffer *leaf;
int need_clear = 0;
u64 cache_gen;
 
root = info->extent_root;
-   key.objectid = 0;
-   key.offset = 0;
-   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -9921,10 +9900,16 @@ int btrfs_read_block_groups(struct btrfs_root *root)
if (btrfs_test_opt(root, CLEAR_CACHE))
need_clear = 1;
 
+   /* Here we don't lock the map tree, as we are the only reader */
+   chunk_em = first_extent_mapping(&map_tree->map_tree);
+   /* Not really possible */
+   if (!chunk_em) {
+   ret = -ENOENT;
+   goto error;
+   }
+
while (1) {
-   ret = find_first_block_group(root, path, &key);
-   if (ret > 0)
-   break;
+   ret = find_block_group(root, path, chunk_em);
if (ret != 0)
goto error;
 
@@ -9958,7 +9943,6 @@ int btrfs_read_block_groups(struct btrfs_root *root)
   sizeof(cache->item));
cache->flags = btrfs_block_group_flags(>item);
 
-   key.objectid = found_key.objectid + 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 5:22 PM, Kai Krakow  wrote:

> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **.

Stripe element (a.k.a. strip, a.k.a. md chunk) size in Btrfs is fixed at 64KiB.

>That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit.

Most any write pattern benefits from raid0 due to less disk
contention, even if the typical file size is smaller than stripe size.
Parallelization is improved even if it's suboptimal. This is really no
different than md raid striping with a 64KiB chunk size.

On Btrfs, it might be that some workloads benefit from metadata
raid10, and others don't. I also think it's hard to estimate without
benchmarking an actual workload with metadata as raid1 vs raid10.



> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.

I think that too would need to be benchmarked and I think it'd need to
be aged as well to see the effect of both file and block group free
space fragmentation. The devil will be in really minute details, all
you have to do is read a few weeks of XFS list stuff with people
talking about optimization or bad performance and almost always it's
not the fault of the file system. And when it is, it depends on the
kernel version as XFS has had substantial changes even over its long
career, including (somewhat) recent changes for metadata heavy
workloads.

> In conclusion: RAID1 is simpler than RAID10 and thus it's less likely to
> contain flaws or bugs.

I don't know about that. I think it's about the same. All multiple
device support, except raid56, was introduced at the same time
practically from day 2. Btrfs raid1 and raid10 tolerate only exactly 1
device loss, *maybe* two if you're very lucky, so neither of them are
really scalable.


-- 
Chris Murphy


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Kai Krakow
On Thu, 7 Jul 2016 00:51:16 +0100, Tomasz Kusmierz wrote:

> > On 7 Jul 2016, at 00:22, Kai Krakow  wrote:
> > 
> > On Wed, 6 Jul 2016 13:20:15 +0100, Tomasz Kusmierz wrote:
> >   
> >> When I think of it, I did move this folder first when filesystem
> >> was RAID 1 (or not even RAID at all) and then it was upgraded to
> >> RAID 1 then RAID 10. Was there a faulty balance around August
> >> 2014 ? Please remember that I’m using Ubuntu so it was probably
> >> kernel from Ubuntu 14.04 LTS
> >> 
> >> Also, I would like to hear it from the horse's mouth: dos & don'ts for
> >> long-term storage where you moderately care about the data: RAID10
> >> - flaky ? would RAID1 give similar performance ?  
> > 
> > The current implementation of RAID0 in btrfs is probably not very
> > optimized. RAID0 is a special case anyways: Stripes have a defined
> > width - I'm not sure what it is for btrfs, probably it's per chunk,
> > so it's 1GB, maybe it's 64k **. That means your data is usually not
> > read from multiple disks in parallel anyways as long as requests
> > are below stripe width (which is probably true for most access
> > patterns except copying files) - there's no immediate performance
> > benefit. This holds true for any RAID0 with read and write patterns
> > below the stripe size. Data is just more evenly distributed across
> > devices and your application will only benefit performance-wise if
> > accesses spread semi-random across the span of the whole file. And
> > at least last time I checked, it was stated that btrfs raid0 does
> > not submit IOs in parallel yet but first reads one stripe, then the
> > next - so it doesn't submit IOs to different devices in parallel.
> > 
> > Getting to RAID1, btrfs is even less optimized: Stripe decision is
> > based on process pids instead of device load, read accesses won't
> > distribute evenly to different stripes per single process, it's
> > only just reading from the same single device - always. Write
> > access isn't faster anyways: Both stripes need to be written -
> > writing RAID1 is single device performance only.
> > 
> > So I guess, at this stage there's no big difference between RAID1
> > and RAID10 in btrfs (except maybe for large file copies), not for
> > single process access patterns and neither for multi process access
> > patterns. Btrfs can only benefit from RAID1 in multi process access
> > patterns currently, as can btrfs RAID0 by design for usual small
> > random access patterns (and maybe large sequential operations). But
> > RAID1 with more than two disks and multi process access patterns is
> > more or less equal to RAID10 because stripes are likely to be on
> > different devices anyways.
> > 
> > In conclusion: RAID1 is simpler than RAID10 and thus it's less
> > likely to contain flaws or bugs.
> > 
> > **: Please enlighten me, I couldn't find docs on this matter.  
> 
> :O 
> 
> It’s an eye opener - I think that this should end up on btrfs WIKI …
> seriously !
> 
> Anyway my use case for this is “storage” therefore I predominantly
> copy large files. 

Then RAID10 may be your best option - for local operations. For copying
large files, even a single modern SATA spindle can saturate a gigabit
link. So if your use case is NAS, and you don't use server-side copies
(which modern versions of NFS and Samba support), you won't benefit from
RAID10 vs RAID1 - just use the simpler implementation.

My personal recommendation: Add a small, high quality SSD to your array
and configure btrfs on top of bcache, configure it for write-around
caching to get best life-time and data safety. This should cache mostly
meta data access in your usecase and improve performance much better
than RAID10 over RAID1. I can recommend Crucial MX series from
personal experience, choose 250GB or higher as 120GB versions of
Crucial MX suffer much lower durability for caching purposes. Adding
bcache to an existing btrfs array is a little painful but easily doable
if you have enough free space to temporarily sacrifice one disk.
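
For what it's worth, a rough sketch of what converting one evacuated member
disk looks like (device names are placeholders; the disk has to be reformatted
as a bcache backing device, which is why the temporary free space is needed,
and the procedure is repeated for each member disk):

#!/bin/bash
# Sketch: move one btrfs member disk behind bcache and put it back.
# /dev/sdX is the evacuated HDD, /dev/sdY the caching SSD, /mnt/array the fs.
btrfs device delete /dev/sdX /mnt/array      # free the disk from the array
make-bcache -B /dev/sdX                      # format it as a bcache backing device
make-bcache -C /dev/sdY                      # format the SSD as a cache device (once)
cset=$(bcache-super-show /dev/sdY | awk '/cset.uuid/ {print $2}')
echo "$cset" > /sys/block/bcache0/bcache/attach
echo writearound > /sys/block/bcache0/bcache/cache_mode
btrfs device add /dev/bcache0 /mnt/array     # add the now-cached device back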

BTW: I'm using 3x 1TB btrfs mraid1/draid0 with a single 500GB bcache
SSD in write-back mode and local operation (it's my desktop machine).
The performance is great, bcache decouples some of the performance
downsides the current btrfs raid implementation has. I do daily
backups, so write-back caching is not a real problem (in case it
fails), and btrfs draid0 is also not a problem (mraid1 ensures meta
data integrity, so only file contents are at risk, and covered by
backups). With this setup I can easily saturate my 6Gb onboard SATA
controller, the system boots to usable desktop in 30 seconds from cold
start (including EFI firmware), including autologin to full-blown
KDE, autostart of Chrome and Steam, 2 virtual machine containers
(nspawn-based, one MySQL instance, one ElasticSearch instance), plus
local MySQL and ElasticSearch service (used for development and staging
purposes), and a 

Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 7 Jul 2016, at 00:22, Kai Krakow  wrote:
> 
> On Wed, 6 Jul 2016 13:20:15 +0100, Tomasz Kusmierz wrote:
> 
>> When I think of it, I did move this folder first when filesystem was
>> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1
>> then RAID 10. Was there a faulty balance around August 2014 ? Please
>> remember that I’m using Ubuntu so it was probably kernel from Ubuntu
>> 14.04 LTS
>> 
>> Also, I would like to hear it from the horse's mouth: dos & don'ts for
>> long-term storage where you moderately care about the data: RAID10 -
>> flaky ? would RAID1 give similar performance ?
> 
> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **. That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit. This holds
> true for any RAID0 with read and write patterns below the stripe size.
> Data is just more evenly distributed across devices and your
> application will only benefit performance-wise if accesses spread
> semi-random across the span of the whole file. And at least last time I
> checked, it was stated that btrfs raid0 does not submit IOs in parallel
> yet but first reads one stripe, then the next - so it doesn't submit
> IOs to different devices in parallel.
> 
> Getting to RAID1, btrfs is even less optimized: Stripe decision is based
> on process pids instead of device load, read accesses won't distribute
> evenly to different stripes per single process, it's only just reading
> from the same single device - always. Write access isn't faster anyways:
> Both stripes need to be written - writing RAID1 is single device
> performance only.
> 
> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.
> 
> In conclusion: RAID1 is simpler than RAID10 and thus it's less likely to
> contain flaws or bugs.
> 
> **: Please enlighten me, I couldn't find docs on this matter.

:O 

It’s an eye opener - I think that this should end up on btrfs WIKI … seriously !

Anyway my use case for this is “storage” therefore I predominantly copy large 
files. 


> -- 
> Regards,
> Kai
> 
> Replies to list-only preferred.
> 
> 



Re: [PATCH 19/20] xfs: run xfs_repair at the end of each test

2016-07-06 Thread Darrick J. Wong
On Thu, Jul 07, 2016 at 09:13:40AM +1000, Dave Chinner wrote:
> On Mon, Jul 04, 2016 at 09:11:34PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 05, 2016 at 11:56:17AM +0800, Eryu Guan wrote:
> > > On Thu, Jun 16, 2016 at 06:48:01PM -0700, Darrick J. Wong wrote:
> > > > Run xfs_repair twice at the end of each test -- once to rebuild
> > > > the btree indices, and again with -n to check the rebuild work.
> > > > 
> > > > Signed-off-by: Darrick J. Wong 
> > > > ---
> > > >  common/rc |3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/common/rc b/common/rc
> > > > index 1225047..847191e 100644
> > > > --- a/common/rc
> > > > +++ b/common/rc
> > > > @@ -2225,6 +2225,9 @@ _check_xfs_filesystem()
> > > >  ok=0
> > > >  fi
> > > >  
> > > > +$XFS_REPAIR_PROG $extra_options $extra_log_options 
> > > > $extra_rt_options $device >$tmp.repair 2>&1
> > > > +cat $tmp.repair | _fix_malloc  >>$seqres.full
> > > > +
> > > 
> > > Won't this hide fs corruptions? Did I miss anything?
> > 
> > I could've sworn it did:
> > 
> > xfs_repair -n
> > (complain if corrupt)
> > 
> > xfs_repair
> > 
> > xfs_repair -n
> > (complain if still corrupt)
> > 
> > But that first xfs_repair -n hunk disappeared. :(
> > 
> > Ok, will fix and resend.
> 
> Not sure this is the best idea - when repair on an aged test device
> takes 10s, this means the test harness overhead increases by a
> factor of 3. i.e. test takes 1s to run, checking the filesystem
> between tests now takes 30s. i.e. this will badly blow out the run
> time of the test suite on aged test devices
> 
> What does this overhead actually gain us that we couldn't encode
> explicitly into a single test or two? e.g the test itself runs
> repair on the aged test device

I'm primarily using it as a way to expose the new rmap/refcount/rtrmap btree
rebuilding code to a wider variety of filesystems.  But you're right, there's
no need to expose /everyone/ to this behavior.  Shall I rework the change
so that one can turn it on or off as desired?

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Kai Krakow
On Wed, 6 Jul 2016 13:20:15 +0100, Tomasz Kusmierz wrote:

> When I think of it, I did move this folder first when filesystem was
> RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1
> then RAID 10. Was there a faulty balance around August 2014 ? Please
> remember that I’m using Ubuntu so it was probably kernel from Ubuntu
> 14.04 LTS
> 
> Also, I would like to hear it from the horse's mouth: dos & don'ts for
> long-term storage where you moderately care about the data: RAID10 -
> flaky ? would RAID1 give similar performance ?

The current implementation of RAID0 in btrfs is probably not very
optimized. RAID0 is a special case anyways: Stripes have a defined
width - I'm not sure what it is for btrfs, probably it's per chunk, so
it's 1GB, maybe it's 64k **. That means your data is usually not read
from multiple disks in parallel anyways as long as requests are below
stripe width (which is probably true for most access patterns except
copying files) - there's no immediate performance benefit. This holds
true for any RAID0 with read and write patterns below the stripe size.
Data is just more evenly distributed across devices and your
application will only benefit performance-wise if accesses spread
semi-random across the span of the whole file. And at least last time I
checked, it was stated that btrfs raid0 does not submit IOs in parallel
yet but first reads one stripe, then the next - so it doesn't submit
IOs to different devices in parallel.

Getting to RAID1, btrfs is even less optimized: Stripe decision is based
on process pids instead of device load, read accesses won't distribute
evenly to different stripes per single process, it's only just reading
from the same single device - always. Write access isn't faster anyways:
Both stripes need to be written - writing RAID1 is single device
performance only.

So I guess, at this stage there's no big difference between RAID1 and
RAID10 in btrfs (except maybe for large file copies), not for single
process access patterns and neither for multi process access patterns.
Btrfs can only benefit from RAID1 in multi process access patterns
currently, as can btrfs RAID0 by design for usual small random access
patterns (and maybe large sequential operations). But RAID1 with more
than two disks and multi process access patterns is more or less equal
to RAID10 because stripes are likely to be on different devices anyways.

In conclusion: RAID1 is simpler than RAID10 and thus it's less likely to
contain flaws or bugs.

**: Please enlighten me, I couldn't find docs on this matter.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: [PATCH 19/20] xfs: run xfs_repair at the end of each test

2016-07-06 Thread Dave Chinner
On Mon, Jul 04, 2016 at 09:11:34PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 05, 2016 at 11:56:17AM +0800, Eryu Guan wrote:
> > On Thu, Jun 16, 2016 at 06:48:01PM -0700, Darrick J. Wong wrote:
> > > Run xfs_repair twice at the end of each test -- once to rebuild
> > > the btree indices, and again with -n to check the rebuild work.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > >  common/rc |3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > 
> > > diff --git a/common/rc b/common/rc
> > > index 1225047..847191e 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -2225,6 +2225,9 @@ _check_xfs_filesystem()
> > >  ok=0
> > >  fi
> > >  
> > > +$XFS_REPAIR_PROG $extra_options $extra_log_options $extra_rt_options 
> > > $device >$tmp.repair 2>&1
> > > +cat $tmp.repair | _fix_malloc>>$seqres.full
> > > +
> > 
> > Won't this hide fs corruptions? Did I miss anything?
> 
> I could've sworn it did:
> 
> xfs_repair -n
> (complain if corrupt)
> 
> xfs_repair
> 
> xfs_repair -n
> (complain if still corrupt)
> 
> But that first xfs_repair -n hunk disappeared. :(
> 
> Ok, will fix and resend.

Not sure this is the best idea - when repair on an aged test device
takes 10s, this means the test harness overhead increases by a
factor of 3. i.e. test takes 1s to run, checking the filesystem
between tests now takes 30s. i.e. this will badly blow out the run
time of the test suite on aged test devices

What does this overhead actually gain us that we couldn't encode
explicitly into a single test or two? e.g the test itself runs
repair on the aged test device

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: raid1 has failing disks, but smart is clear

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 23:14, Corey Coughlin  wrote:
> 
> Hi all,
>Hoping you all can help, have a strange problem, think I know what's going 
> on, but could use some verification.  I set up a raid1 type btrfs filesystem 
> on an Ubuntu 16.04 system, here's what it looks like:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>    Total devices 10 FS bytes used 3.42TiB
>    devid  1 size 1.82TiB used 1.18TiB path /dev/sdd
>    devid  2 size 698.64GiB used 47.00GiB path /dev/sdk
>    devid  3 size 931.51GiB used 280.03GiB path /dev/sdm
>    devid  4 size 931.51GiB used 280.00GiB path /dev/sdl
>    devid  5 size 1.82TiB used 1.17TiB path /dev/sdi
>    devid  6 size 1.82TiB used 823.03GiB path /dev/sdj
>    devid  7 size 698.64GiB used 47.00GiB path /dev/sdg
>    devid  8 size 1.82TiB used 1.18TiB path /dev/sda
>    devid  9 size 1.82TiB used 1.18TiB path /dev/sdb
>    devid 10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> I added a couple disks, and then ran a balance operation, and that took about 
> 3 days to finish.  When it did finish, tried a scrub and got this message:
> 
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>total bytes scrubbed: 926.45GiB with 18849935 errors
>error details: read=18849935
>corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
> 
> So that seems bad.  Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs 0
> [/dev/sdj].write_io_errs   289436740
> [/dev/sdj].read_io_errs289492820
> [/dev/sdj].flush_io_errs   12411
> [/dev/sdj].corruption_errs 0
> [/dev/sdj].generation_errs 0
> [/dev/sdg].write_io_errs   0
> ...
> [/dev/sda].generation_errs 0
> [/dev/sdb].write_io_errs   3490143
> [/dev/sdb].read_io_errs111
> [/dev/sdb].flush_io_errs   268
> [/dev/sdb].corruption_errs 0
> [/dev/sdb].generation_errs 0
> [/dev/sdh].write_io_errs   5839
> [/dev/sdh].read_io_errs2188
> [/dev/sdh].flush_io_errs   11
> [/dev/sdh].corruption_errs 1
> [/dev/sdh].generation_errs 16373
> 
> So I checked the smart data for those disks, they seem perfect, no 
> reallocated sectors, no problems.  But one thing I did notice is that they 
> are all WD Green drives.  So I'm guessing that if they power down and get 
> reassigned to a new /dev/sd* letter, that could lead to data corruption.  I 
> used idle3ctl to turn off the shut down mode on all the green drives in the 
> system, but I'm having trouble getting the filesystem working without the 
> errors.  I tried a 'check --repair' command on it, and it seems to find a lot 
> of verification errors, but it doesn't look like things are getting fixed.
>  But I have all the data on it backed up on another system, so I can recreate 
> this if I need to.  But here's what I want to know:
> 
> 1.  Am I correct about the issues with the WD Green drives, if they change 
> mounts during disk operations, will that corrupt data?
I just wanted to chip in about WD Green drives. I have a RAID10 running on 6x2TB 
of those, and have had it for ~3 years. If a disk goes down for spin-down and you 
try to access something, the kernel & FS & whole system will wait for the drive 
to re-spin and everything works OK. I've never had a drive reassigned to a 
different /dev/sdX due to spin down / up. 
2 years ago I was getting corruption because I was not using ECC RAM, and one of 
the RAM modules started producing errors that were never caught by the CPU / 
MoBo. Long story short, a guy here managed to point me in the right direction and 
I started shifting my data to a hopefully new and uncorrupted FS … but I was 
sceptical about an issue similar to the one you describe, AND I had raid1, and 
while it was mounted I moved a disk from one SATA port to another and the FS 
picked up the disk in its new location and did not even blink (as far as I 
remember there was a syslog entry saying the disk vanished and then that it was 
added back).

Last word: you've got plenty of errors in your SMART for transfer-related stuff; 
please be advised that this may mean:
- a faulty cable
- a faulty mobo controller
- a faulty drive controller
- bad RAM - yes, the motherboard CAN use your RAM for storing data and 
transfer-related stuff … especially the cheaper ones. 

> 2.  If that is the case:
>a.) Is there any way I can stop the /dev/sd* mount points from changing?  
> Or can I set up the filesystem using UUIDs or something more solid?  I 
> googled about it, but found conflicting info
Don't take it the wrong way, but I'm personally surprised that anybody still uses 
/dev/sd* device names rather than UUIDs. Devices change from boot to boot for a 
lot of people and most distros moved to UUIDs (2 years ago? even swap is mounted 
via UUID now)
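
For the record, here is what referring to the filesystem by UUID rather than by
device node looks like (the UUID is the one from the btrfs fi show output above;
the mount point is an example):

# Any member device of a multi-device btrfs reports the same filesystem UUID:
blkid -s UUID -o value /dev/sdd

# /etc/fstab entry using the UUID instead of a /dev/sd* path:
# UUID=597ee185-36ac-4b68-8961-d4adc13f95d4  /mnt/array  btrfs  defaults  0  0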

>b.) Or, is there something else changing my drive devices?  I have most of 
> drives on an LSI SAS 9201-16i card, is there something I need to 

Re: [PATCH v6 00/20] xfstests: minor fixes for the reflink/dedupe tests

2016-07-06 Thread Darrick J. Wong
On Tue, Jul 05, 2016 at 12:31:30PM +0800, Eryu Guan wrote:
> Hi Darrick,
> 
> On Thu, Jun 16, 2016 at 06:46:02PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > This is the sixth revision of a patchset that adds to xfstests
> > support for testing reverse-mappings of physical blocks to file and
> > metadata (rmap); support for testing multiple file logical blocks to
> > the same physical block (reflink); and implements the beginnings of
> > online metadata scrubbing.
> > 
> > The first eight patches are in Eryu Guan's pull request on 2016-06-15.
> > Those patches haven't changed, but they're not yet in the upstream
> > repo.
> > 
> > If you're going to start using this mess, you probably ought to just
> > pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
> > There are also updates for xfs-docs[4].  The kernel patches should
> > apply to dchinner's for-next; xfsprogs patches to for-next; and
> > xfstest to master.  The kernel git tree already has for-next included.
> > 
> > The patches have been xfstested with x64, i386, and armv7l--arm64,
> > ppc64, and ppc64le no longer boot in qemu.  All three architectures
> > pass all 'clone' group tests except xfs/128 (which is the swapext
> > test), and AFAICT don't cause any new failures for the 'auto' group.
> > 
> > This is an extraordinary way to eat your data.  Enjoy! 
> > Comments and questions are, as always, welcome.
> 
> I tested your xfstests patches with your kernel(HEAD f0b34b6 xfs: add
> btree scrub tracepoints) and xfsprogs(HEAD 34bd754 xfs_scrub: create
> online filesystem scrub program), with x86_64 host & 4k block size XFS.
> 
> A './check -g auto' run looked fine overall. Besides the comments I
> replied to some patches, other common minor issues are:
> - space indention in _cleanup not tab
> - bare 'umount $SCRATCH_MNT' not _scratch_unmount
> - whitespace issues in _test|scratch_inject_error
> 
> (I can fix all these minor issues at commit time, if you don't have
> other major updates to these patches).

I don't have any major updates to any of those patches; go ahead.

FWIW I usually have unposted patches at all points in time, so if you want to
fix minor nits in things I've already posted for review and commit them to
upstream, that's fine.  I pull down the latest xfstest git and rebase prior to
sending a new patch series, so I'll absorb whatever you change. :)

When I'm getting ready to do another big release, I inquire with the
maintainers if they're about to push commits upstream to avoid the race
post patches -> upstream push -> rebase patches -> repost patches.

> And the review of changes to xfs/122 needs help from other XFS
> developers :) (09/20 and 10/20)

09/20 (remove rmapx cruft) should be pretty straightforward, since I withdrew
'rmapx' and related changes from xfs.

10/20 (new log items) will probably remain outstanding for a while since
those changes haven't really made it upstream yet.

> And besides the first 8 patches, 15/20 has been in upstream as well.

Oh, ok.

> Thanks,
> Eryu
> 
> P.S.
> The failed tests I saw when testing with reflink-enabled kernel &
> xfsprogs:
> 
> Failures: generic/054 generic/055 generic/108 generic/204 generic/356 
> generic/357 xfs/004 xfs/096 xfs/122 xfs/293
> 
> generic/108 generic/204 and xfs/004 are new failures compared to stock
> kernel and xfsprogs (kernel 4.7-rc5, xfsprogs 4.7-rc1).

I think I have fixes for some of those that will go out during the next
patchbomb.  But thanks for the heads up, I'll have a look at a -g auto
run before I submit again.

--D

> 
> Just FYI.
> 


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 22:41, Henk Slager  wrote:
> 
> On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz  
> wrote:
>> 
>>> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
>>> 
>>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
>>> wrote:
 
 On 6 Jul 2016, at 00:30, Henk Slager  wrote:
 
 On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
 wrote:
 
 I did consider that, but:
 - some files were NOT accessed by anything with 100% certainty (well, if
 there is a rootkit on my system or something of that shape then maybe yes)
 - the only application that could access those files is totem (well,
 Nautilus checks the extension -> directs it to totem), so in that case we
 would hear about an outbreak of totem killing people's files.
 - if it was a kernel bug then other large files would be affected.
 
 Maybe I’m wrong and it’s actually related to the fact that all those files
 are located in single location on file system (single folder) that might
 have a historical bug in some structure somewhere ?
 
 
 I find it hard to imagine that this has something to do with the
 folderstructure, unless maybe the folder is a subvolume with
 non-default attributes or so. How the files in that folder are created
 (at full disktransferspeed or during a day or even a week) might give
 some hint. You could run filefrag and see if that rings a bell.
 
 files that are 4096 show:
 1 extent found
>>> 
>>> I actually meant filefrag for the files that are not (yet) truncated
>>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>>> an MBR write.
>> 117 extents found
>> filesize 15468645003
>> 
>> good / bad ?
> 
> 117 extents for a 1.5G file is fine, with -v option you could see the
> fragmentation at the start, but this won't lead to any hint why you
> have the truncate issue.
> 
 I forgot to add that the file system was created a long time ago and it was
 created with leaf & node size = 16k.
 
 
 If this long time ago is >2 years then you have likely specifically
 set node size = 16k, otherwise with older tools it would have been 4K.
 
 You are right I used -l 16K -n 16K
 
 Have you created it as raid10 or has it undergone profile conversions?
 
 Due to lack of spare disks
 (it may sound odd for some but spending for more than 6 disks for home use
 seems like an overkill)
 and due to last I’ve had I had to migrate all data to new file system.
 This played that way that I’ve:
 1. from original FS I’ve removed 2 disks
 2. Created RAID1 on those 2 disks,
 3. shifted 2TB
 4. removed 2 disks from source FS and adde those to destination FS
 5 shifted 2 further TB
 6 destroyed original FS and adde 2 disks to destination FS
 7 converted destination FS to RAID10
 
 FYI, when I convert to raid 10 I use:
 btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
 /path/to/FS
 
 this filesystem has 5 sub volumes. Files affected are located in separate
 folder within a “victim folder” that is within a one sub volume.
 
 
 It could also be that the ondisk format is somewhat corrupted (btrfs
 check should find that ) and that that causes the issue.
 
 
 root@noname_server:/mnt# btrfs check /dev/sdg1
 Checking filesystem on /dev/sdg1
 UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
 checking extents
 checking free space cache
 checking fs roots
 checking csums
 checking root refs
 found 4424060642634 bytes used err is 0
 total csum bytes: 4315954936
 total tree bytes: 4522786816
 total fs tree bytes: 61702144
 total extent tree bytes: 41402368
 btree space waste bytes: 72430813
 file data blocks allocated: 4475917217792
 referenced 4420407603200
 
 No luck there :/
>>> 
>>> Indeed looks all normal.
>>> 
 In-lining on raid10 has caused me some trouble (I had 4k nodes) over
 time, it has happened over a year ago with kernels recent at that
 time, but the fs was converted from raid5
 
 Could you please elaborate on that ? you also ended up with files that got
 truncated to 4096 bytes ?
>>> 
>>> I did not have truncated to 4k files, but your case lets me think of
>>> small files inlining. Default max_inline mount option is 8k and that
>>> means that 0 to ~3k files end up in metadata. I had size corruptions
>>> for several of those small sized files that were updated quite
>>> frequent, also within commit time AFAIK. Btrfs check lists this as
>>> errors 400, although fs operation is not disturbed. I don't know what
>>> happens if those small files are being updated/rewritten and are just
>>> below or just above the max_inline limit.
>>> 
>>> The only thing I was 

raid1 has failing disks, but smart is clear

2016-07-06 Thread Corey Coughlin

Hi all,
Hoping you all can help, have a strange problem, think I know 
what's going on, but could use some verification.  I set up a raid1 type 
btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:


btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
Total devices 10 FS bytes used 3.42TiB
devid  1 size 1.82TiB used 1.18TiB path /dev/sdd
devid  2 size 698.64GiB used 47.00GiB path /dev/sdk
devid  3 size 931.51GiB used 280.03GiB path /dev/sdm
devid  4 size 931.51GiB used 280.00GiB path /dev/sdl
devid  5 size 1.82TiB used 1.17TiB path /dev/sdi
devid  6 size 1.82TiB used 823.03GiB path /dev/sdj
devid  7 size 698.64GiB used 47.00GiB path /dev/sdg
devid  8 size 1.82TiB used 1.18TiB path /dev/sda
devid  9 size 1.82TiB used 1.18TiB path /dev/sdb
devid 10 size 1.36TiB used 745.03GiB path /dev/sdh

I added a couple disks, and then ran a balance operation, and that took 
about 3 days to finish.  When it did finish, tried a scrub and got this 
message:


scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 
01:16:35

total bytes scrubbed: 926.45GiB with 18849935 errors
error details: read=18849935
corrected errors: 5860, uncorrectable errors: 18844075, unverified 
errors: 0


So that seems bad.  Took a look at the devices and a few of them have 
errors:

...
[/dev/sdi].generation_errs 0
[/dev/sdj].write_io_errs   289436740
[/dev/sdj].read_io_errs289492820
[/dev/sdj].flush_io_errs   12411
[/dev/sdj].corruption_errs 0
[/dev/sdj].generation_errs 0
[/dev/sdg].write_io_errs   0
...
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs   3490143
[/dev/sdb].read_io_errs111
[/dev/sdb].flush_io_errs   268
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdh].write_io_errs   5839
[/dev/sdh].read_io_errs2188
[/dev/sdh].flush_io_errs   11
[/dev/sdh].corruption_errs 1
[/dev/sdh].generation_errs 16373

So I checked the smart data for those disks, they seem perfect, no 
reallocated sectors, no problems.  But one thing I did notice is that 
they are all WD Green drives.  So I'm guessing that if they power down 
and get reassigned to a new /dev/sd* letter, that could lead to data 
corruption.  I used idle3ctl to turn off the shut down mode on all the 
green drives in the system, but I'm having trouble getting the 
filesystem working without the errors.  I tried a 'check --repair' 
command on it, and it seems to find a lot of verification errors, but it 
doesn't look like things are getting fixed.  But I have all the data on 
it backed up on another system, so I can recreate this if I need to.  
But here's what I want to know:


1.  Am I correct about the issues with the WD Green drives, if they 
change mounts during disk operations, will that corrupt data?

2.  If that is the case:
a.) Is there any way I can stop the /dev/sd* mount points from 
changing?  Or can I set up the filesystem using UUIDs or something more 
solid?  I googled about it, but found conflicting info
b.) Or, is there something else changing my drive devices?  I have 
most of drives on an LSI SAS 9201-16i card, is there something I need to 
do to make them fixed?
c.) Or, is there a script or something I can use to figure out if 
the disks will change mounts?
d.) Or, if I wipe everything and rebuild, will the disks with the 
idle3ctl fix work now?


Regardless of whether or not it's a WD Green drive issue, should I just 
wipefs all the disks and rebuild it?  Is there any way to recover this?  
Thanks for any help!



--- Corey


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Henk Slager
On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz  wrote:
>
>> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
>>
>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
>> wrote:
>>>
>>> On 6 Jul 2016, at 00:30, Henk Slager  wrote:
>>>
>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
>>> wrote:
>>>
>>> I did consider that, but:
>>> - some files were NOT accessed by anything with 100% certainty (well, if
>>> there is a rootkit on my system or something of that shape then maybe yes)
>>> - the only application that could access those files is totem (well,
>>> Nautilus checks the extension -> directs it to totem), so in that case we
>>> would hear about an outbreak of totem killing people's files.
>>> - if it was a kernel bug then other large files would be affected.
>>>
>>> Maybe I’m wrong and it’s actually related to the fact that all those files
>>> are located in single location on file system (single folder) that might
>>> have a historical bug in some structure somewhere ?
>>>
>>>
>>> I find it hard to imagine that this has something to do with the
>>> folderstructure, unless maybe the folder is a subvolume with
>>> non-default attributes or so. How the files in that folder are created
>>> (at full disktransferspeed or during a day or even a week) might give
>>> some hint. You could run filefrag and see if that rings a bell.
>>>
>>> files that are 4096 show:
>>> 1 extent found
>>
>> I actually meant filefrag for the files that are not (yet) truncated
>> to 4k. For example for virtual machine imagefiles (CoW), one could see
>> an MBR write.
> 117 extents found
> filesize 15468645003
>
> good / bad ?

117 extents for a 1.5G file is fine, with -v option you could see the
fragmentation at the start, but this won't lead to any hint why you
have the truncate issue.

>>> I forgot to add that the file system was created a long time ago and it was
>>> created with leaf & node size = 16k.
>>>
>>>
>>> If this long time ago is >2 years then you have likely specifically
>>> set node size = 16k, otherwise with older tools it would have been 4K.
>>>
>>> You are right I used -l 16K -n 16K
>>>
>>> Have you created it as raid10 or has it undergone profile conversions?
>>>
>>> Due to lack of spare disks
>>> (it may sound odd for some but spending for more than 6 disks for home use
>>> seems like an overkill)
>>> and due to last I’ve had I had to migrate all data to new file system.
>>> This played that way that I’ve:
>>> 1. from original FS I’ve removed 2 disks
>>> 2. Created RAID1 on those 2 disks,
>>> 3. shifted 2TB
>>> 4. removed 2 disks from source FS and adde those to destination FS
>>> 5 shifted 2 further TB
>>> 6 destroyed original FS and adde 2 disks to destination FS
>>> 7 converted destination FS to RAID10
>>>
>>> FYI, when I convert to raid 10 I use:
>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>>> /path/to/FS
>>>
>>> this filesystem has 5 sub volumes. Files affected are located in separate
>>> folder within a “victim folder” that is within a one sub volume.
>>>
>>>
>>> It could also be that the ondisk format is somewhat corrupted (btrfs
>>> check should find that ) and that that causes the issue.
>>>
>>>
>>> root@noname_server:/mnt# btrfs check /dev/sdg1
>>> Checking filesystem on /dev/sdg1
>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> found 4424060642634 bytes used err is 0
>>> total csum bytes: 4315954936
>>> total tree bytes: 4522786816
>>> total fs tree bytes: 61702144
>>> total extent tree bytes: 41402368
>>> btree space waste bytes: 72430813
>>> file data blocks allocated: 4475917217792
>>> referenced 4420407603200
>>>
>>> No luck there :/
>>
>> Indeed looks all normal.
>>
>>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>>> time, it has happened over a year ago with kernels recent at that
>>> time, but the fs was converted from raid5
>>>
>>> Could you please elaborate on that ? you also ended up with files that got
>>> truncated to 4096 bytes ?
>>
>> I did not have truncated to 4k files, but your case lets me think of
>> small files inlining. Default max_inline mount option is 8k and that
>> means that 0 to ~3k files end up in metadata. I had size corruptions
>> for several of those small sized files that were updated quite
>> frequent, also within commit time AFAIK. Btrfs check lists this as
>> errors 400, although fs operation is not disturbed. I don't know what
>> happens if those small files are being updated/rewritten and are just
>> below or just above the max_inline limit.
>>
>> The only thing I was thinking of is that your files were started as
>> small, so inline, then extended to multi-GB. In the past, there were
>> 'bad extent/chunk type' issues and it was suggested that the fs would
>> have been an ext4-converted one (which 

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:

>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure work around to avoid link
>> resets in the face of media defects. Rare.
>
> Not as much as you think, once someone has this issue, they usually put
> preventative measures in place on any system where it applies.  I'd be
> willing to bet that most sysadmins at big companies like RedHat or Oracle
> are setting this.

SCT ERC yes. Changing the kernel's command timer? I think almost zero.



>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long standing indicator on Windows
>> to just reinstall the OS and restore data from backups -> the user has
>> an opportunity to freshen up user data backup, and the reinstallation
>> and restore from backup results in freshly written sectors which is
>> how bad sectors get fixed. The marginally bad sectors get new writes
>> and now read fast (or fast enough), and the persistently bad sectors
>> result in the drive firmware remapping to reserve sectors.
>>
>> The main thing in my opinion is less extension of drive life, as it is
>> the user gets to use the system, albeit sluggish, to make a backup of
>> their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what my
> point was here.  For people in this particular case, it will almost
> certainly only make things better (although at first it may make performance
> worse).

I'm not sure why it makes performance worse. The options are slower
reads vs a file system that almost certainly face plants upon a link
reset.




>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if not
>> supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable than
>> consumer drives over 180 seconds.
>
> I don't see why you would think this is not common.

I was not clear. Single device SAS is probably not common. They're
typically being used in arrays where data is redundant. Using such a
drive with short error recovery as a single boot drive? Probably not
that common.



> Separately, USB gets _really_ complicated if you want to cover everything,
> USB drives may or may not present as non-rotational, may or may not show up
> as SATA or SCSI bridges (there are some of the more expensive flash drives
> that actually use SSD controllers plus USB-SAT chips internally), if they do
> show up as such, may or may not support the required commands (most don't,
> but it's seemingly hit or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? They
can still be polled for features including SCT ERC and if it's not
supported or configurable then fallback to increasing the command
timer. I'm not sure what else can be done anyway.

The main obstacle is squaring the device capability (low level) with
storage stack redundancy 0 or 1 (high level). Something has to be
aware of both to get all devices ideally configured.
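
Polling for what is there is at least cheap, e.g. for some /dev/sdX:

smartctl -l scterc /dev/sdX            # shows whether SCT ERC is supported/enabled
cat /sys/block/sdX/device/timeout      # current SCSI command timer, in seconds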



>> Yep it's imperfect unless there's the proper cross communication
>> between layers. There are some such things like hardware raid geometry
>> that optionally poke through (when supported by hardware raid drivers)
>> so that things like mkfs.xfs can automatically provide the right sunit
>> swidth for optimized layout; which the device mapper already does
>> automatically. So it could be done it's just a matter of how big of a
>> problem is this to build it, vs just going with a new one size fits
>> all default command timer?
>
> The other problem though is that the existing things pass through
> _read-only_ data, while this requires writable data to be passed through,
> which leads to all kinds of complicated issues potentially.

I'm aware. There are also plenty of bugs even if write were to pass
through. I've encountered more drives than not which accept only one
SCT ERC change per poweron. A 2nd change causes the drive to go offline
and vanish off the bus. So no doubt this whole area is fragile enough
not even the drive, controller, enclosure vendors are aware of where
all the bodies are buried.

What I think is fairly well established is that at least on Windows
their lower level stuff including kernel 

Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 1:17 PM, Austin S. Hemmelgarn
 wrote:

> In bash or most other POSIX compliant shells, you can run this:
> echo $?
> to get the return code of the previous command.
>
> In your case though, it may be reporting the FS ready because it had already
> seen all the devices; IIUC, the flag that ioctl checks is only set once, and
> never unset, which is not a good design in this case.

Oh dear.

[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi---tz-- 50.00g thintastic
  2  VG Vwi---tz-- 50.00g thintastic
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs dev scan
Scanning for Btrfs filesystems
[root@f24s ~]# echo $?
0
[root@f24s ~]# btrfs device ready /dev/mapper/VG-3
[root@f24s ~]# echo $?
0
[root@f24s ~]# btrfs fi show
warning, device 2 is missing
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
*** Some devices missing


Cute, device 1 is also missing but that's not mentioned. In any case,
the device is still ready even after a dev scan. I guess this isn't
exactly testable all that easily unless I reboot.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()

2016-07-06 Thread Liu Bo
On Wed, Jul 06, 2016 at 06:37:52PM +0800, Wang Xiaoguang wrote:
> In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
> wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
> which indeed are extent's bytenr. The correct value should be
> cluster->[start|end] minus block group's start bytenr.
> 
> start bytenr          cluster->start
> |                     |
> |                     | extent  |   extent   | ... | extent |
> |------------------------------------------------------------|
> |               block group  /  reloc_inode                  |
> 
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/relocation.c | 27 +++
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 0477dca..abc2f69 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
>   u64 num_bytes;
>   int nr = 0;
>   int ret = 0;
> + u64 prealloc_start, prealloc_end;
>  
>   BUG_ON(cluster->start != cluster->boundary[0]);
>   inode_lock(inode);
>  
> - ret = btrfs_check_data_free_space(inode, cluster->start,
> -   cluster->end + 1 - cluster->start);
> + start = cluster->start - offset;
> + end = cluster->end - offset;
> + ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
>   if (ret)
>   goto out;
>  
>   while (nr < cluster->nr) {
> - start = cluster->boundary[nr] - offset;
> + prealloc_start = cluster->boundary[nr] - offset;
>   if (nr + 1 < cluster->nr)
> - end = cluster->boundary[nr + 1] - 1 - offset;
> + prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
>   else
> - end = cluster->end - offset;
> + prealloc_end = cluster->end - offset;
>  
> - lock_extent(&BTRFS_I(inode)->io_tree, start, end);
> - num_bytes = end + 1 - start;
> - ret = btrfs_prealloc_file_range(inode, 0, start,
> + lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
> + prealloc_end);
> + num_bytes = prealloc_end + 1 - prealloc_start;
> + ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
>   num_bytes, num_bytes,
> - end + 1, &alloc_hint);
> - unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
> + prealloc_end + 1, &alloc_hint);
> + unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
> +   prealloc_end);

Changing the names is unnecessary; we can pick other names for
btrfs_{check/free}_data_free_space().

Thanks,

-liubo

>   if (ret)
>   break;
>   nr++;
>   }
> - btrfs_free_reserved_data_space(inode, cluster->start,
> -cluster->end + 1 - cluster->start);
> + btrfs_free_reserved_data_space(inode, start, end + 1 - start);
>  out:
>   inode_unlock(inode);
>   return ret;
> -- 
> 2.9.0
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 14:23, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 12:04 PM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 13:19, Chris Murphy wrote:


On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov 
wrote:


3) can we query btrfs whether it is mountable in degraded mode?
according to documentation, "btrfs device ready" (which udev builtin
follows) checks "if it has ALL of it’s devices in cache for mounting".
This is required for proper systemd ordering of services.



Where does udev builtin use btrfs itself? I see "btrfs ready $device"
which is not a valid btrfs user space command.

I never get any errors from "btrfs device ready" even when too many
devices are missing. I don't know what it even does or if it's broken.

This is a three device raid1 where I removed 2 devices and "btrfs
device ready" does not complain, it always returns silent for me no
matter what. It's been this way for years as far as I know.

[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi-a-tz-- 50.00g thintastic2.55
  2  VG Vwi-a-tz-- 50.00g thintastic4.00
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1
devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3

[root@f24s ~]# btrfs device ready /dev/mapper/VG-1
[root@f24s ~]#
[root@f24s ~]# lvchange -an VG/1
[root@f24s ~]# lvchange -an VG/2
[root@f24s ~]# btrfs dev scan
Scanning for Btrfs filesystems
[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi---tz-- 50.00g thintastic
  2  VG Vwi---tz-- 50.00g thintastic
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
warning, device 2 is missing
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
*** Some devices missing

[root@f24s ~]# btrfs device ready /dev/mapper/VG-3
[root@f24s ~]#


You won't get any output from it regardless, you have to check the return
code as it's intended to be a tool for scripts and such.


How do I check the return code? When I use strace, no matter what I'm getting

+++ exited with 0 +++

I see both 'btrfs device ready' and the udev btrfs builtin test are
calling BTRFS_IOC_DEVICES_READY, so it looks like udev is not using
user space tools to check but rather a btrfs ioctl. So clearly that
works or I wouldn't have stalled boots when all devices aren't
present.


In bash or most other POSIX compliant shells, you can run this:
echo $?
to get the return code of the previous command.

In your case though, it may be reporting the FS ready because it had
already seen all the devices; IIUC, the flag that ioctl checks is only set
once, and never unset, which is not a good design in this case.
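
A minimal sketch of using it from a script, with the device name from the
example above:

btrfs device ready /dev/mapper/VG-3
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "btrfs reports all devices present"
else
    echo "btrfs reports not ready (rc=$rc)"
fi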

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 14:45, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 12:43, Chris Murphy wrote:



So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.


Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.


I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure work around to avoid link
resets in the face of media defects. Rare.
Not as much as you think; once someone has this issue, they usually put
preventative measures in place on any system where it applies.  I'd be
willing to bet that most sysadmins at big companies like RedHat or
Oracle are setting this.




2. People using single disks which have lots of errors may or may not see an
apparent degradation of performance, but will likely have the life
expectancy of their device extended.


Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less extension of drive life, as it is
the user gets to use the system, albeit sluggish, to make a backup of
their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my 
point was here.  For people in this particular case, it will almost 
certainly only make things better (although at first it may make 
performance worse).




3. Individuals who are not setting this but should be will on average be no
worse off than before other than seeing a bigger performance hit on a disk
error.
4. People with single disks which are new will see no functional change
until the disk has an error.


I follow.




In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give roughly
10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the
expected disk timeout to account for things like latency outside of the
disk).


Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make 
judgements based on userspace usage.  Also, the first situation while 
not optimal is still better than what happens now, at least there you 
will get an I/O error in a reasonable amount of time (as opposed to 
after a really long time if ever).


Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
I don't see why you would think this is not common.  If you count just 
by systems, then it's absolutely outnumbered at least 100 to 1 by 
regular ATA disks.  If you look at individual disks though, the reverse 
is true, because people who use SCSI drives tend to use _lots_ of disks 
(think big data centers, NAS and SAN systems and such).  OTOH, both are 
probably vastly outnumbered by stuff that doesn't use either standard 
for storage...


Separately, USB gets _really_ complicated if you want to cover 
everything, USB drives may or may not present as non-rotational, may or 
may not show 

Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 12:24 PM, Andrei Borzenkov  wrote:
> On Wed, Jul 6, 2016 at 8:19 PM, Chris Murphy  wrote:
>>
>> I'm mainly concerned with rootfs. And I'm mainly concerned with a very
>> simple 2 disk raid1. With a simple user opt in using
>> rootflags=degraded, it should be possible to boot the system. Right
>> now it's not possible. Maybe just deleting 64-btrfs.rules would fix
>> this problem, I haven't tried it.
>>
>
> While deleting this rule will fix your specific degraded 2-disk raid1,
> it will break non-degraded multi-device filesystems. The logic currently
> implemented by systemd assumes that mount is called after the
> prerequisites have been fulfilled. Deleting this rule will call mount
> as soon as the very first device is seen; such a filesystem is obviously
> not mountable.

Seems like we need more granularity by btrfs ioctl for device ready,
e.g. some way to indicate:

0 all devices ready
1 devices not ready (don't even try to mount)
2 minimum devices ready (degraded mount possible)


Btrfs multiple-device single and raid0 would only return code 0 or 1,
whereas raid1, 5, and 6 could return code 2. The systemd default policy for code
2 could be to wait some amount of time to see if state goes to 0. At
the timeout, try to mount anyway. If rootflags=degraded, it mounts. If
not, mount fails, and we get a dracut prompt.

That's better behavior than now.
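
As a sketch, a mount helper in the initramfs could then do something like
this ($dev is a placeholder, and exit code 2 is the hypothetical new value,
it does not exist today):

btrfs device ready "$dev"
case $? in
    0) mount "$dev" /sysroot ;;
    2) if grep -q rootflags=degraded /proc/cmdline; then
           mount -o degraded "$dev" /sysroot
       fi ;;
    *) ;;    # not ready, keep waiting
esac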

> An equivalent of this rule is required under systemd and desired in
> general to avoid polling. On the systemd list I outlined a possible
> alternative implementation as a systemd service instead of the really
> hackish udev rule.

I'll go read it there. Thanks.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 12:43, Chris Murphy wrote:

>> So does it make sense to just set the default to 180? Or is there a
>> smarter way to do this? I don't know.
>
> Just thinking about this:
> 1. People who are setting this somewhere will be functionally unaffected.

I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure work around to avoid link
resets in the face of media defects. Rare.


> 2. People using single disks which have lots of errors may or may not see an
> apparent degradation of performance, but will likely have the life
> expectancy of their device extended.

Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less extension of drive life, as it is
the user gets to use the system, albeit sluggish, to make a backup of
their data rather than possibly losing it.


> 3. Individuals who are not setting this but should be will on average be no
> worse off than before other than seeing a bigger performance hit on a disk
> error.
> 4. People with single disks which are new will see no functional change
> until the disk has an error.

I follow.


>
> In an ideal situation, what I'd want to see is:
> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
> percentage over that (probably something like 25%, which would give roughly
> 10 seconds for the normal 7 second ERC timer).
> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
> this is reasonable for SCSI disks).
> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
> expected disk timeout to account for things like latency outside of the
> disk).

Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.

Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
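
A rough shell sketch of that policy, where $dev is a whole disk and
$redundant is whatever the admin or tooling knows about the layout (both
are placeholders; the grep assumes smartctl prints the new value, e.g.
"(7.0 seconds)", when setting ERC succeeds):

name=${dev##*/}
if [ "$redundant" = yes ]; then
    smartctl -l scterc,70,70 "$dev" | grep -q seconds ||
        echo 200 > /sys/block/$name/device/timeout
else
    smartctl -l scterc,0,0 "$dev" > /dev/null
    echo 200 > /sys/block/$name/device/timeout
fi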




>
>>
>>
 I suspect, but haven't tested, that ZFS On Linux would be equally
 affected, unless they're completely reimplementing their own block
 layer (?) So there are quite a few parties now negatively impacted by
 the current default behavior.
>>>
>>>
>>> OTOH, I would not be surprised if the stance there is 'you get no support
>>> if
>>> you're not using enterprise drives', not because of the project itself, but
>>> because it's ZFS.  Part of their minimum recommended hardware
>>> requirements
>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>>> there too.
>>
>>
>> http://open-zfs.org/wiki/Hardware
>> "Consistent performance requires hard drives that support error
>> recovery control. "
>>
>> "Drives that lack such functionality can be expected to have
>> arbitrarily high limits. Several minutes is not impossible. Drives
>> with this functionality typically default to 7 seconds. ZFS does not
>> currently adjust this setting on drives. However, it is advisable to
>> write a script to set the error recovery time to a low value, such as
>> 0.1 seconds until ZFS is modified to control it. This must be done on
>> every boot. "
>>
>> They do not explicitly require enterprise drives, but they clearly
>> expect SCT ERC enabled to some sane value.
>>
>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>> 

Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Andrei Borzenkov
On Wed, Jul 6, 2016 at 9:23 PM, Chris Murphy  wrote:
>>> [root@f24s ~]# btrfs fi show
>>> warning, device 2 is missing
>>> Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
>>> Total devices 3 FS bytes used 2.26GiB
>>> devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
>>> *** Some devices missing
>>>
>>> [root@f24s ~]# btrfs device ready /dev/mapper/VG-3
>>> [root@f24s ~]#
>>
>> You won't get any output from it regardless, you have to check the return
>> code as it's intended to be a tool for scripts and such.
>
> How do I check the return code? When I use strace, no matter what I'm getting
>
> +++ exited with 0 +++
>
> I see both 'btrfs device ready' and the udev btrfs builtin test are
> calling BTRFS_IOC_DEVICES_READY, so it looks like udev is not using
> user space tools to check but rather a btrfs ioctl.

Correct. It is possible that the ioctl returns the correct result only the
very first time; notice that in your example btrfs had seen all the other
devices at least once, while at boot the other devices really are missing
so far.

Which returns us to the question - how we can reliably query kernel
about mountability of filesystem.

> So clearly that
> works or I wouldn't have stalled boots when all devices aren't
> present.
>
> --
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Andrei Borzenkov
On Wed, Jul 6, 2016 at 8:19 PM, Chris Murphy  wrote:
>
> I'm mainly concerned with rootfs. And I'm mainly concerned with a very
> simple 2 disk raid1. With a simple user opt in using
> rootflags=degraded, it should be possible to boot the system. Right
> now it's not possible. Maybe just deleting 64-btrfs.rules would fix
> this problem, I haven't tried it.
>

While deleting this rule will fix your specific degraded 2-disk raid1,
it will break non-degraded multi-device filesystems. The logic currently
implemented by systemd assumes that mount is called after the
prerequisites have been fulfilled. Deleting this rule will call mount
as soon as the very first device is seen; such a filesystem is obviously
not mountable.

An equivalent of this rule is required under systemd and desired in
general to avoid polling. On the systemd list I outlined a possible
alternative implementation as a systemd service instead of the really
hackish udev rule.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 12:04 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 13:19, Chris Murphy wrote:
>>
>> On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov 
>> wrote:
>>>
>>> 3) can we query btrfs whether it is mountable in degraded mode?
>>> according to documentation, "btrfs device ready" (which udev builtin
>>> follows) checks "if it has ALL of it’s devices in cache for mounting".
>>> This is required for proper systemd ordering of services.
>>
>>
>> Where does udev builtin use btrfs itself? I see "btrfs ready $device"
>> which is not a valid btrfs user space command.
>>
>> I never get any errors from "btrfs device ready" even when too many
>> devices are missing. I don't know what it even does or if it's broken.
>>
>> This is a three device raid1 where I removed 2 devices and "btrfs
>> device ready" does not complain, it always returns silent for me no
>> matter what. It's been this way for years as far as I know.
>>
>> [root@f24s ~]# lvs
>>   LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
>> Log Cpy%Sync Convert
>>   1  VG Vwi-a-tz-- 50.00g thintastic2.55
>>   2  VG Vwi-a-tz-- 50.00g thintastic4.00
>>   3  VG Vwi-a-tz-- 50.00g thintastic2.54
>>   thintastic VG twi-aotz-- 90.00g   5.05   2.92
>> [root@f24s ~]# btrfs fi show
>> Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
>> Total devices 3 FS bytes used 2.26GiB
>> devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1
>> devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2
>> devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
>>
>> [root@f24s ~]# btrfs device ready /dev/mapper/VG-1
>> [root@f24s ~]#
>> [root@f24s ~]# lvchange -an VG/1
>> [root@f24s ~]# lvchange -an VG/2
>> [root@f24s ~]# btrfs dev scan
>> Scanning for Btrfs filesystems
>> [root@f24s ~]# lvs
>>   LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
>> Log Cpy%Sync Convert
>>   1  VG Vwi---tz-- 50.00g thintastic
>>   2  VG Vwi---tz-- 50.00g thintastic
>>   3  VG Vwi-a-tz-- 50.00g thintastic2.54
>>   thintastic VG twi-aotz-- 90.00g   5.05   2.92
>> [root@f24s ~]# btrfs fi show
>> warning, device 2 is missing
>> Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
>> Total devices 3 FS bytes used 2.26GiB
>> devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
>> *** Some devices missing
>>
>> [root@f24s ~]# btrfs device ready /dev/mapper/VG-3
>> [root@f24s ~]#
>
> You won't get any output from it regardless, you have to check the return
> code as it's intended to be a tool for scripts and such.

How do I check the return code? When I use strace, no matter what I'm getting

+++ exited with 0 +++

I see both 'btrfs device ready' and the udev btrfs builtin test are
calling BTRFS_IOC_DEVICES_READY, so it looks like udev is not using
user space tools to check but rather a btrfs ioctl. So clearly that
works or I wouldn't have stalled boots when all devices aren't
present.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount degraded RAID5

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 11:12 AM, Gonzalo Gomez-Arrue Azpiazu
 wrote:
> Hello,
>
> I had a RAID5 with 3 disks and one failed; now the filesystem cannot be 
> mounted.
>
> None of the recommendations that I found seem to work. The situation
> seems to be similar to this one:
> http://www.spinics.net/lists/linux-btrfs/msg56825.html
>
> Any suggestion on what to try next?

Basically if you are degraded *and* it runs into additional errors,
then it's broken because raid5 only protects against one device error.
The main problem is if it can't read the chunk root it's hard for any
tool to recover data because the chunk tree mapping is vital to
finding data.

What do you get for:
btrfs rescue super-recover -v /dev/sdc1

It's a problem with the chunk tree because all of your super blocks
point to the same chunk tree root so there isn't another one to try.

>sudo btrfs-find-root /dev/sdc1
>warning, device 2 is missing
>Couldn't read chunk root
>Open ctree failed

It's bad news. I'm not even sure 'btrfs restore' can help this case.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount degraded RAID5

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 11:50 AM, Tomáš Hrdina  wrote:
> sudo mount -o ro /dev/sdc /shares
> mount: wrong fs type, bad option, bad superblock on /dev/sdc,
>missing codepage or helper program, or other error
>
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
>
>
> sudo mount -o ro,recovery /dev/sdc /shares
> mount: wrong fs type, bad option, bad superblock on /dev/sdc,
>missing codepage or helper program, or other error
>
>In some cases useful info is found in syslog - try
>dmesg | tail or so.


[ 275.688919] BTRFS error (device sda): parent transid verify failed
on 7008533413888 wanted 70175 found 70132

Looks like the generation is too far back for backup roots.

Just for grins, now that all drives are present, what do you get for

# btrfs rescue super-recover -v /dev/sda

Next I suggest btrfs-image -c9 -t4 and optionally -s to sanitize file
names. And also btrfs-debug-tree (this time no -d) redirected to a
file. These two files can be big, about the size of the used amount of
metadata chunks. These go in the cloud at some point, reference them
in a bugzilla.kernel.org bug report by URL. Expect it to be months
before a dev looks at it.
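
Roughly (the output paths are just examples):

btrfs-image -c9 -t4 -s /dev/sda /var/tmp/metadata.img
btrfs-debug-tree /dev/sda > /var/tmp/debug-tree.txt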

So now what you want to try to do is use restore.
https://btrfs.wiki.kernel.org/index.php/Restore

You can use the information from btrfs-find-root to give restore a -t
value to try. For example:

>Found tree root at 6062830010368 gen 70182 level 1
>Well block 6062434418688(gen: 70181 level: 1) seems good, but
>generation/level doesn't match, want gen: 70182 level: 1
>Well block 6062497202176(gen: 69186 level: 0) seems good, but
>generation/level doesn't match, want gen: 70182 level: 1
>Well block 6062470332416(gen: 69186 level: 0) seems good, but
>generation/level doesn't match, want gen: 70182 level: 1


btrfs restore -t 6062830010368 -v -i /dev/sda 

If that fails totally you can try the next bytenr, for the -t value,
6062434418688. And then the next. Each value down is going backward in
time, so it implies some data loss.

This is not the end. It's just that it's the safest since no changes
to the fs have happened. If you set up some kind of overlay you can be
more aggressive like going right for btrfs check --repair and seeing
if it can fix things, but without the overlay it's possible to totally
break the fs such that even restore won't work.
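
A minimal overlay sketch with device-mapper, one per member device, where
the COW file has to be big enough to absorb whatever --repair writes:

truncate -s 20G /var/tmp/cow.img
loop=$(losetup --show -f /var/tmp/cow.img)
size=$(blockdev --getsz /dev/sda)
dmsetup create sda-overlay --table "0 $size snapshot /dev/sda $loop N 8"

Then point btrfs check --repair at /dev/mapper/sda-overlay instead of
/dev/sda; every other member needs its own overlay so the originals stay
untouched.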

Once you pretty much have everything important off the volume, you can
get more aggressive with trying to fix it. OR just blow it away and
start over. But I think it's valid to gather as much information about
the file system and try to fix it because the autopsy is the main way
to make Btrfs better.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 13:19, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov  wrote:

3) can we query btrfs whether it is mountable in degraded mode?
according to documentation, "btrfs device ready" (which udev builtin
follows) checks "if it has ALL of it’s devices in cache for mounting".
This is required for proper systemd ordering of services.


Where does udev builtin use btrfs itself? I see "btrfs ready $device"
which is not a valid btrfs user space command.

I never get any errors from "btrfs device ready" even when too many
devices are missing. I don't know what it even does or if it's broken.

This is a three device raid1 where I removed 2 devices and "btrfs
device ready" does not complain, it always returns silent for me no
matter what. It's been this way for years as far as I know.

[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi-a-tz-- 50.00g thintastic2.55
  2  VG Vwi-a-tz-- 50.00g thintastic4.00
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1
devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3

[root@f24s ~]# btrfs device ready /dev/mapper/VG-1
[root@f24s ~]#
[root@f24s ~]# lvchange -an VG/1
[root@f24s ~]# lvchange -an VG/2
[root@f24s ~]# btrfs dev scan
Scanning for Btrfs filesystems
[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi---tz-- 50.00g thintastic
  2  VG Vwi---tz-- 50.00g thintastic
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
warning, device 2 is missing
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
*** Some devices missing

[root@f24s ~]# btrfs device ready /dev/mapper/VG-3
[root@f24s ~]#
You won't get any output from it regardless; you have to check the
return code, as it's intended to be a tool for scripts and such.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount degraded RAID5

2016-07-06 Thread Tomáš Hrdina
sudo mount -o ro /dev/sdc /shares
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.


sudo mount -o ro,recovery /dev/sdc /shares
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.


dmesg
http://sebsauvage.net/paste/?04d1162dc44d7e55#uY0kIaX66o7Kh+TZAGK2T+CKdRk2jorIWM3w5gfXp8I=

Do you want any other log to see?


For all 3 disks:
sudo smartctl -l scterc,70,70 /dev/sdx
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control set to:
   Read: 70 (7.0 seconds)
  Write: 70 (7.0 seconds)

Thank you
Tomas



 *From:* Chris Murphy
 *Sent:*  Wednesday, July 06, 2016 6:08PM
 *To:* Tomáš Hrdina
*Cc:* Chris Murphy, Btrfs Btrfs
 *Subject:* Re: Unable to mount degraded RAID5

On Wed, Jul 6, 2016 at 2:07 AM, Tomáš Hrdina  wrote:
> Now with 3 disks:
> 
> sudo btrfs check /dev/sda
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> checksum verify failed on 7008807157760 found F192848C wanted 1571393A
> checksum verify failed on 7008807157760 found F192848C wanted 1571393A
> bytenr mismatch, want=7008807157760, have=65536
> Checking filesystem on /dev/sda
> UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e
> checking extents
> parent transid verify failed on 7009468874752 wanted 70180 found 70133
> parent transid verify failed on 7009468874752 wanted 70180 found 70133
> checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
> checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
> bytenr mismatch, want=7009468874752, have=65536
> parent transid verify failed on 7008859045888 wanted 70175 found 70133
> parent transid verify failed on 7008859045888 wanted 70175 found 70133
> checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
> checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
> bytenr mismatch, want=7008859045888, have=65536
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> Ignoring transid failure
> leaf parent key incorrect 7008899547136
> bad block 7008899547136
> Errors found in extent allocation tree or chunk allocation
> parent transid verify failed on 7009074167808 wanted 70175 found 70133
> parent transid verify failed on 7009074167808 wanted 70175 found 70133
> checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
> checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
> bytenr mismatch, want=7009074167808, have=65536

Ok much better than before, these all seem sane with a limited number
of problems. Maybe --repair can fix it, but don't do that yet.




> sudo btrfs-debug-tree -d /dev/sdc
> http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA=

OK good, so now it finds the chunk tree OK. This is good news. I would
try to mount it ro first, if you need to make or refresh a backup. So
in order:

mount -o ro
mount -o ro,recovery

If those don't work lets see what the user and kernel errors are.



> 
>>
> sudo btrfs-find-root /dev/sdc
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> Superblock thinks the generation is 70182
> Superblock thinks the level is 1
> Found tree root at 6062830010368 gen 70182 level 1
> Well block 6062434418688(gen: 70181 level: 1) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1
> Well block 6062497202176(gen: 69186 level: 0) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1
> Well block 6062470332416(gen: 69186 level: 0) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1

This is also a good sign that you can probably get btrfs rescue to
work and point it to one of these older tree roots, if mount won't
work.


> 
>>
> sudo smartctl -l scterc /dev/sda
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> SCT Error Recovery Control:
>Read: Disabled
>   Write: Disabled
> 
>>
> sudo smartctl -l scterc /dev/sdb
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)

Re: Out of space error even though there's 100 GB unused?

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 3:55 AM, Stanislaw Kaminski
 wrote:

> Device unallocated:   97.89GiB

There should be no problem creating any type of block group from this
much space. It's a bug.

I would try regression testing. Kernel 4.5.7 has some changes that may
or may not relate to this (they should only relate when there is no
unallocated space left) so you could try 4.5.6 and 4.5.7. And also
4.4.14.

But also the kernel messages are important. There is this obscure
enospc with error -28, so trying 4.6.3 both with and without the
enospc_debug mount option is useful (I think it's less useful in older
kernels).

But do try nospace_cache first. If that works, you could then mount
with clear_cache one time and see if that provides an enduring fix. It
can take some time to rebuild the cache after clear_cache is used.
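
i.e. something along these lines (adjust the device and mountpoint):

mount -o nospace_cache /dev/sdb1 /mnt
# if that avoids the ENOSPC, rebuild the cache once with:
mount -o clear_cache /dev/sdb1 /mnt
# and for more detail in dmesg while debugging:
mount -o enospc_debug /dev/sdb1 /mnt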



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Andreas Dilger

> On Jul 6, 2016, at 10:33 AM, Joerg Schilling 
>  wrote:
> 
> "Austin S. Hemmelgarn"  wrote:
> 
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>> 
>>> 
>>> You are mistaken.
>>> 
>>> stat /proc/$$/as
>>>  File: `/proc/6518/as'
>>>  Size: 2793472 Blocks: 5456   IO Block: 512regular file
>>> Device: 544h/88342528d  Inode: 7557Links: 1
>>> Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> stat /proc/$$/auxv
>>>  File: `/proc/6518/auxv'
>>>  Size: 168 Blocks: 1  IO Block: 512regular file
>>> Device: 544h/88342528d  Inode: 7568Links: 1
>>> Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>> 
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
> 
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner), who, sadly, passed
> away last weekend.
> 
> You may have done your tests on an inofficial procfs implementation

So, what you are saying is that you don't care about star working properly
on Linux, because it has an "inofficial" procfs implementation, while Solaris
has an "official" implementation?

>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>> 
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
> 
> If it returns 0, it would be lying or it would be wrong anyway as it did not
> check the available space.
> 
> Also note that I mentioned already that the availability of SEEK_HOLE in
> principle does not help as there is e.g. NFS...

So, it's OK that NFS is not POSIX compliant in various ways, and star will
deal with it, but you aren't willing to fix a heuristic used by star for a
behaviour that is unspecified by POSIX but has caused users to lose data
when archiving from several modern filesystems?

That's fine, so long as GNU tar is fixed to use the safe fallback in such
cases (i.e. trying to archive data from files that are newly created, even
if they report st_blocks == 0).
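
For what it's worth, the effect is easy to demonstrate (a sketch, assuming
/mnt/btrfs is a btrfs mount):

dd if=/dev/zero of=/mnt/btrfs/newfile bs=1M count=1
stat -c 'size=%s blocks=%b' /mnt/btrfs/newfile   # blocks may well still be 0 here
sync
stat -c 'size=%s blocks=%b' /mnt/btrfs/newfile   # non-zero once writeback has run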

Cheers, Andreas









Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov  wrote:
> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy  wrote:
>> I started a systemd-devel@ thread since that's where most udev stuff
>> gets talked about.
>>
>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>>
>
> Before discussing how to implement it in systemd, we need to decide
> what to implement. I.e.

Fair.


> 1) do you always want to mount filesystem in degraded mode if not
> enough devices are present or only if explicit hint is given?

Right now on Btrfs, it should be explicit. The faulty device concept,
handling, and notification are not mature. It's not a good idea to
silently mount degraded considering Btrfs does not actively catch up
the devices that are behind the next time there's a normal mount. It
only fixes things passively. So the user must opt into degraded mounts
rather than opt out.

The problem is the current udev rule is doing its own check for device
availability. So the mount command with explicit hint doesn't even get
attempted.



> 2) do you want to restrict degrade handling to root only or to other
> filesystems as well? Note that there could be more early boot
> filesystems that absolutely need same treatment (enters separate
> /usr), and there are also normal filesystems that may need be mounted
> even degraded.

I'm mainly concerned with rootfs. And I'm mainly concerned with a very
simple 2 disk raid1. With a simple user opt in using
rootflags=degraded, it should be possible to boot the system. Right
now it's not possible. Maybe just deleting 64-btrfs.rules would fix
this problem, I haven't tried it.
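
The opt-in amounts to no more than this (sketch, with /dev/sdX standing in
for one of the raid1 members):

# on the kernel command line, for the root filesystem:
rootflags=degraded
# or for an ordinary mount:
mount -o degraded /dev/sdX /mnt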


> 3) can we query btrfs whether it is mountable in degraded mode?
> according to documentation, "btrfs device ready" (which udev builtin
> follows) checks "if it has ALL of it’s devices in cache for mounting".
> This is required for proper systemd ordering of services.

Where does udev builtin use btrfs itself? I see "btrfs ready $device"
which is not a valid btrfs user space command.

I never get any errors from "btrfs device ready" even when too many
devices are missing. I don't know what it even does or if it's broken.

This is a three device raid1 where I removed 2 devices and "btrfs
device ready" does not complain, it always returns silent for me no
matter what. It's been this way for years as far as I know.

[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi-a-tz-- 50.00g thintastic2.55
  2  VG Vwi-a-tz-- 50.00g thintastic4.00
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1
devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3

[root@f24s ~]# btrfs device ready /dev/mapper/VG-1
[root@f24s ~]#
[root@f24s ~]# lvchange -an VG/1
[root@f24s ~]# lvchange -an VG/2
[root@f24s ~]# btrfs dev scan
Scanning for Btrfs filesystems
[root@f24s ~]# lvs
  LV VG Attr   LSize  Pool   Origin Data%  Meta%  Move
Log Cpy%Sync Convert
  1  VG Vwi---tz-- 50.00g thintastic
  2  VG Vwi---tz-- 50.00g thintastic
  3  VG Vwi-a-tz-- 50.00g thintastic2.54
  thintastic VG twi-aotz-- 90.00g   5.05   2.92
[root@f24s ~]# btrfs fi show
warning, device 2 is missing
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
Total devices 3 FS bytes used 2.26GiB
devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
*** Some devices missing

[root@f24s ~]# btrfs device ready /dev/mapper/VG-3
[root@f24s ~]#




-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 12:43, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-05 19:05, Chris Murphy wrote:


Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.


The thing I don't get about this is that while the per-device settings on a
given system are policy, the default value is not, and should be expected to
work correctly (but not necessarily optimally) on as many systems as
possible, so any claim that this should be fixed in udev are bogus by the
regular kernel rules.


Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.

Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
2. People using single disks which have lots of errors may or may not 
see an apparent degradation of performance, but will likely have the 
life expectancy of their device extended.
3. Individuals who are not setting this but should be will on average be 
no worse off than before other than seeing a bigger performance hit on a 
disk error.
4. People with single disks which are new will see no functional change 
until the disk has an error.


In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give
roughly 10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer
(IIRC, this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the 
expected disk timeout to account for things like latency outside of the 
disk).
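
A rough per-device sketch of that heuristic (it assumes the value smartctl
prints on the "Read:" line is in tenths of a second, and it leaves out the
SAS case):

dev=/dev/sdX                                   # example device
t=/sys/block/${dev##*/}/device/timeout
erc=$(smartctl -l scterc "$dev" | awk '/Read:/ { print $2 }')
case "$erc" in
    ''|Disabled)
        echo 200 > "$t" ;;      # no usable ERC limit: allow long in-drive recovery
    *)
        echo $(( erc / 10 + erc / 40 + 1 )) > "$t" ;;   # ~25% headroom over the ERC time
esac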




I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


OTOH, I would not be surprised if the stance there is 'you get no support if
you're not using enterprise drives', not because of the project itself, but
because it's ZFS.  Part of their minimum recommended hardware requirements
is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
there too.


http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; where raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
I"m pretty certain they don't create rules, or even try to check the 
drive for SCT ERC support.  The problem with doing this is that you 
can't be certain that your underlying device is actually a physical 
storage device or not, and thus you have to check more than just the SCT 
ERC commands, and many people (myself included) don't like tools doing 
things that modify the persistent functioning of their system that the 
tool itself is not intended to do (and messing with block layer settings 
falls into that category for a mkfs tool).


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Fwd: Unable to mount degraded RAID5

2016-07-06 Thread Gonzalo Gomez-Arrue Azpiazu
Hello,

I had a RAID5 with 3 disks and one failed; now the filesystem cannot be mounted.

None of the recommendations that I found seem to work. The situation
seems to be similar to this one:
http://www.spinics.net/lists/linux-btrfs/msg56825.html

Any suggestion on what to try next?

Thanks a lot beforehand!

sudo btrfs version
btrfs-progs v4.4

uname -a
Linux ubuntu 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

sudo btrfs fi show
warning, device 2 is missing
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
bytenr mismatch, want=2339175972864, have=65536
Couldn't read chunk root
Label: none  uuid: 495efbc6-2f62-4cd7-962b-7ae3d0e929f1
Total devices 3 FS bytes used 1.29TiB
devid1 size 2.73TiB used 674.03GiB path /dev/sdc1
devid3 size 2.73TiB used 674.03GiB path /dev/sdd1
*** Some devices missing

sudo mount -t btrfs -o ro,degraded,recovery /dev/sdc1 /btrfs
mount: wrong fs type, bad option, bad superblock on /dev/sdc1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

dmesg | tail
[ 2440.036368] BTRFS info (device sdd1): allowing degraded mounts
[ 2440.036383] BTRFS info (device sdd1): enabling auto recovery
[ 2440.036390] BTRFS info (device sdd1): disk space caching is enabled
[ 2440.037928] BTRFS warning (device sdd1): devid 2 uuid
0c7d7db2-6a27-4b19-937b-b6266ba81257 is missing
[ 2440.652085] BTRFS info (device sdd1): bdev (null) errs: wr 1413, rd
362, flush 471, corrupt 0, gen 0
[ 2441.359066] BTRFS error (device sdd1): bad tree block start 0 833766391808
[ 2441.359306] BTRFS error (device sdd1): bad tree block start 0 833766391808
[ 2441.359330] BTRFS: Failed to read block groups: -5
[ 2441.383793] BTRFS: open_ctree failed

sudo btrfs restore /dev/sdc1 /bkp
warning, device 2 is missing
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
bytenr mismatch, want=2339175972864, have=65536
Couldn't read chunk root
Could not open root, trying backup super
warning, device 2 is missing
warning, device 3 is missing
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
bytenr mismatch, want=2339175972864, have=65536
Couldn't read chunk root
Could not open root, trying backup super
warning, device 2 is missing
warning, device 3 is missing
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074
bytenr mismatch, want=2339175972864, have=65536
Couldn't read chunk root
Could not open root, trying backup super

sudo btrfs-show-super -fa /dev/sdc1
http://sebsauvage.net/paste/?d79e9e9c385cf1a5#fNwoEj5o2aQ6T7nDl4vjrFqEJG0SHeVpmGknbbCVnd0=

sudo btrfs-find-root /dev/sdc1
warning, device 2 is missing
Couldn't read chunk root
Open ctree failed
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-05 19:05, Chris Murphy wrote:
>>
>> Related:
>> http://www.spinics.net/lists/raid/msg52880.html
>>
>> Looks like there is some traction to figuring out what to do about
>> this, whether it's a udev rule or something that happens in the kernel
>> itself. Pretty much the only hardware setup unaffected by this are
>> those with enterprise or NAS drives. Every configuration of a consumer
>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
>> RAID Levels are adversely affected by this.
>
> The thing I don't get about this is that while the per-device settings on a
> given system are policy, the default value is not, and should be expected to
> work correctly (but not necessarily optimally) on as many systems as
> possible, so any claim that this should be fixed in udev are bogus by the
> regular kernel rules.

Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.
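
For reference, the knob in question is the per-device SCSI command timer in
sysfs; it can be inspected and changed at runtime, though it does not persist
across reboots, and the device name below is only a placeholder:

cat /sys/block/sda/device/timeout    # kernel default is 30 seconds
echo 180 > /sys/block/sda/device/timeout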


>> I suspect, but haven't tested, that ZFS On Linux would be equally
>> affected, unless they're completely reimplementing their own block
>> layer (?) So there are quite a few parties now negatively impacted by
>> the current default behavior.
>
> OTOH, I would not be surprised if the stance there is 'you get no support if
> you're not using enterprise drives', not because of the project itself, but
> because it's ZFS.  Part of their minimum recommended hardware requirements
> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
> there too.

http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries, whereas raid1, 5, and 6 need to set
things up for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
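
As a rough sketch of what such a rule or boot script would end up doing for
each profile (device names are placeholders, and the values are assumptions
rather than recommendations):

# raid1/5/6: make the drive give up quickly so the filesystem can repair
smartctl -l scterc,70,70 /dev/sdX
# single/raid0: let the drive keep trying, and give the kernel more patience
smartctl -l scterc,0,0 /dev/sdX
echo 180 > /sys/block/sdX/device/timeout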


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Joerg Schilling
"Austin S. Hemmelgarn"  wrote:

> On 2016-07-06 11:22, Joerg Schilling wrote:
> > "Austin S. Hemmelgarn"  wrote:
> >
> >>> It should be obvious that a file that offers content also has allocated 
> >>> blocks.
> >> What you mean then is that POSIX _implies_ that this is the case, but
> >> does not say whether or not it is required.  There are all kinds of
> >> counterexamples to this too, procfs is a POSIX compliant filesystem
> >> (every POSIX certified system has it), yet does not display the behavior
> >> that you expect, every single file in /proc for example reports 0 for
> >> both st_blocks and st_size, and yet all of them very obviously have 
> >> content.
> >
> > You are mistaken.
> >
> > stat /proc/$$/as
> >   File: `/proc/6518/as'
> >   Size: 2793472 Blocks: 5456   IO Block: 512regular file
> > Device: 544h/88342528d  Inode: 7557Links: 1
> > Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > stat /proc/$$/auxv
> >   File: `/proc/6518/auxv'
> >   Size: 168 Blocks: 1  IO Block: 512regular file
> > Device: 544h/88342528d  Inode: 7568Links: 1
> > Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > Any correct implementation of /proc returns the expected numbers in st_size 
> > as
> > well as in st_blocks.
> Odd, because I get 0 for both values on all the files in /proc/self and 
> all the top level files on all kernels I tested prior to sending that 

I tested this with an official PROCFS-2 implementation that was written by 
the inventor of the PROC filesystem (Roger Faulkner), who sadly passed 
away last weekend.

You may have done your tests on an unofficial procfs implementation.

> > Now you know why BTRFS is still an incomplete filesystem. In a few years 
> > when
> > it turns 10, this may change. People who implement filesystems of course 
> > need
> > to learn that they need to hide implementation details from the official 
> > user
> > space interfaces.
> So in other words you think we should be lying about how much is 
> actually allocated on disk and thus violating the standard directly (and 
> yes, ext4 and everyone else who does this with delayed allocation _is_ 
> strictly speaking violating the standard, because _nothing_ is allocated 
> yet)?

If it returns 0, it would be lying, or it would be wrong anyway as it did not 
check the available space.

Also note that I already mentioned that the availability of SEEK_HOLE in 
principle does not help, as there is e.g. NFS...

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ 
http://sourceforge.net/projects/schilytools/files/'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Hugo Mills
On Wed, Jul 06, 2016 at 05:42:33PM +0200, Holger Hoffstätte wrote:
> On 07/06/16 17:20, Hugo Mills wrote:
> > On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote:
> >> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte
> >>  wrote:
> >>> On 07/06/16 14:25, Wang Shilong wrote:
>  'btrfs file du' is a very useful tool to watch my system
>  file usage with snapshot aware.
> 
>  when trying to run following commands:
>  [root@localhost btrfs-progs]# btrfs file du /
>   Total   Exclusive  Set shared  Filename
>  ERROR: Failed to lookup root id - Inappropriate ioctl for device
>  ERROR: cannot check space of '/': Unknown error -1
> 
>  and My Filesystem looks like this:
>  [root@localhost btrfs-progs]# df -Th
>  Filesystem Type  Size  Used Avail Use% Mounted on
>  devtmpfs   devtmpfs   16G 0   16G   0% /dev
>  tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
>  tmpfs  tmpfs  16G  1.4M   16G   1% /run
>  tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
>  /dev/sda3  btrfs  60G   19G   40G  33% /
>  tmpfs  tmpfs  16G  332K   16G   1% /tmp
>  /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
>  /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
>  /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
>  tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
> 
>  So I installed Btrfs as my root partition, but boot partition
>  can be other fs.
> 
>  We can Let btrfs tool aware of this is not a btrfs file or
>  directory and skip those files, so that someone like me
>  could just run 'btrfs file du /' to scan all btrfs filesystems.
> 
>  After patch, it will look like:
> Total   Exclusive  Set shared  Filename
>  skipping not btrfs dir/file: boot
>  skipping not btrfs dir/file: dev
>  skipping not btrfs dir/file: proc
>  skipping not btrfs dir/file: run
>  skipping not btrfs dir/file: sys
>   0.00B   0.00B   -  //root/.bash_logout
>   0.00B   0.00B   -  //root/.bash_profile
>   0.00B   0.00B   -  //root/.bashrc
>   0.00B   0.00B   -  //root/.cshrc
>   0.00B   0.00B   -  //root/.tcshrc
> 
>  This works for me to analysis system usage and analysis
>  performaces.
> >>>
> >>> This is great, but can we please skip the "skipping .." messages?
> >>> Maybe it's just me but I really don't see the value of printing them
> >>> when they don't contribute to the result.
> >>> They also mess up the display. :)
> >>
> >> I don't have a taste whether it needed or not, because it is somehow
> >> useful to let users know some files/directories skipped
> 
> When you run "find /path -type d" you don't get messages for all the
> things you just didn't want to find either.

   No, but you do get messages about unreadable directories from find.

   Your example above would be "You asked for X, and this isn't an
X". That's not what these messages are about -- what we're seeing here
is "I tried to do what you asked, but couldn't".

   Hugo.

> >At the absolute minimum, I think that these messages should go to
> > stderr (like du does when it deosn't have permissions), and should go
> > away with -q. They're still irritating, but at least you can get rid
> > of them easily.
> 
> If anything this should require a --verbose, not the other way
> around. Maybe instead of breaking the output just indicate the
> special status via "-- --" values, or default to 0.00?
> Still, we're explicitly only interested in btrfs stuff and not
> anything else, so printing non-information can only yield noise.
> 
> This is very much orthogonal to not printing anything after an
> otherwise successful command execution.
> 
> -h
> 
> 




-- 
Hugo Mills | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Stephen Franklin, Babylon 5


signature.asc
Description: Digital signature


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 12:05, Austin S. Hemmelgarn wrote:

On 2016-07-06 11:22, Joerg Schilling wrote:

"Austin S. Hemmelgarn"  wrote:


It should be obvious that a file that offers content also has
allocated blocks.

What you mean then is that POSIX _implies_ that this is the case, but
does not say whether or not it is required.  There are all kinds of
counterexamples to this too, procfs is a POSIX compliant filesystem
(every POSIX certified system has it), yet does not display the behavior
that you expect, every single file in /proc for example reports 0 for
both st_blocks and st_size, and yet all of them very obviously have
content.


You are mistaken.

stat /proc/$$/as
  File: `/proc/6518/as'
  Size: 2793472 Blocks: 5456   IO Block: 512regular file
Device: 544h/88342528d  Inode: 7557Links: 1
Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
  File: `/proc/6518/auxv'
  Size: 168 Blocks: 1  IO Block: 512regular file
Device: 544h/88342528d  Inode: 7568Links: 1
Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in
st_size as
well as in st_blocks.

Odd, because I get 0 for both values on all the files in /proc/self and
all the top level files on all kernels I tested prior to sending that
e-mail, for reference, they include:
* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)

Further ones I've now tested that behave like the others listed above:
* 2.4.20-8 from RedHat 9
* 2.6.18-1.2798.fc6 from Fedora Core 6
* 3.11.10-301.fc20 from Fedora 20

IOW, it looks like whatever you're running is an exception here.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount degraded RAID5

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 2:07 AM, Tomáš Hrdina  wrote:
> Now with 3 disks:
>
> sudo btrfs check /dev/sda
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> checksum verify failed on 7008807157760 found F192848C wanted 1571393A
> checksum verify failed on 7008807157760 found F192848C wanted 1571393A
> bytenr mismatch, want=7008807157760, have=65536
> Checking filesystem on /dev/sda
> UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e
> checking extents
> parent transid verify failed on 7009468874752 wanted 70180 found 70133
> parent transid verify failed on 7009468874752 wanted 70180 found 70133
> checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
> checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
> bytenr mismatch, want=7009468874752, have=65536
> parent transid verify failed on 7008859045888 wanted 70175 found 70133
> parent transid verify failed on 7008859045888 wanted 70175 found 70133
> checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
> checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
> bytenr mismatch, want=7008859045888, have=65536
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3
> parent transid verify failed on 7008899547136 wanted 70175 found 70133
> Ignoring transid failure
> leaf parent key incorrect 7008899547136
> bad block 7008899547136
> Errors found in extent allocation tree or chunk allocation
> parent transid verify failed on 7009074167808 wanted 70175 found 70133
> parent transid verify failed on 7009074167808 wanted 70175 found 70133
> checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
> checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
> bytenr mismatch, want=7009074167808, have=65536

OK, much better than before; these all seem sane with a limited number
of problems. Maybe --repair can fix it, but don't do that yet.




> sudo btrfs-debug-tree -d /dev/sdc
> http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA=

OK good, so now it finds the chunk tree OK. This is good news. I would
try to mount it ro first, if you need to make or refresh a backup. So
in order:

mount -o ro
mount -o ro,recovery
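
For example, filled in for this filesystem (the mount point is an assumption):

sudo mount -o ro /dev/sdc /mnt
sudo mount -o ro,recovery /dev/sdc /mnt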

If those don't work, let's see what the user and kernel errors are.



>
>
> sudo btrfs-find-root /dev/sdc
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> parent transid verify failed on 7008807157760 wanted 70175 found 70133
> Superblock thinks the generation is 70182
> Superblock thinks the level is 1
> Found tree root at 6062830010368 gen 70182 level 1
> Well block 6062434418688(gen: 70181 level: 1) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1
> Well block 6062497202176(gen: 69186 level: 0) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1
> Well block 6062470332416(gen: 69186 level: 0) seems good, but
> generation/level doesn't match, want gen: 70182 level: 1

This is also a good sign that you can probably get btrfs rescue to
work and point it to one of these older tree roots, if mount won't
work.
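
If mount keeps failing, one way to make use of those older roots is btrfs
restore, which accepts an explicit tree root bytenr via -t (a different
subcommand than the rescue group; the bytenr below is the "seems good" block
from the btrfs-find-root output above, and the destination path is only a
placeholder):

sudo btrfs restore -t 6062434418688 /dev/sdc /mnt/backup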


>
>
> sudo smartctl -l scterc /dev/sda
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: Disabled
>   Write: Disabled
>
>
> sudo smartctl -l scterc /dev/sdb
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: 70 (7.0 seconds)
>   Write: 70 (7.0 seconds)
>
>
> sudo smartctl -l scterc /dev/sdc
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: Disabled
>   Write: Disabled



There's good news and bad news. The good news is all the drives
support SCT ERC. The bad news is two of the drives have the wrong
setting for raid1+, including raid5. Issue:

smartctl -l scterc,70,70 /dev/sdX   #for each drive


This is not a persistent setting. The drive being powered off (maybe
even reset) will revert the setting to drive default. Some people use
a udev rule to set this during startup. I think it can also be done
with a systemd unit. You'd want to specify the drives by id, wwn if
available, so that it's always consistent across boots.
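
A minimal sketch of a rule that does this (the file name, the smartctl path
and the 7 second value are assumptions, and the kernel-name match is only to
keep the example short; matching on ID_WWN or ID_SERIAL is the more robust
approach):

cat > /etc/udev/rules.d/60-scterc.rules <<'EOF'
# set SCT ERC to 7.0 seconds on whole SATA disks as they appear
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"
EOF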

The point of this setting is to force the drive to give up on errors
quickly, allowing Btrfs in this case to be informed of the exact
problem (media error and what sector) 

Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 11:22, Joerg Schilling wrote:

"Austin S. Hemmelgarn"  wrote:


It should be obvious that a file that offers content also has allocated blocks.

What you mean then is that POSIX _implies_ that this is the case, but
does not say whether or not it is required.  There are all kinds of
counterexamples to this too, procfs is a POSIX compliant filesystem
(every POSIX certified system has it), yet does not display the behavior
that you expect, every single file in /proc for example reports 0 for
both st_blocks and st_size, and yet all of them very obviously have content.


You are mistaken.

stat /proc/$$/as
  File: `/proc/6518/as'
  Size: 2793472 Blocks: 5456   IO Block: 512regular file
Device: 544h/88342528d  Inode: 7557Links: 1
Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
  File: `/proc/6518/auxv'
  Size: 168 Blocks: 1  IO Block: 512regular file
Device: 544h/88342528d  Inode: 7568Links: 1
Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in st_size as
well as in st_blocks.
Odd, because I get 0 for both values on all the files in /proc/self and 
all the top level files on all kernels I tested prior to sending that 
e-mail, for reference, they include:

* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)
It's probably notable that I don't see /proc/$PID/as on any of these 
systems, which implies you're running some significantly different 
kernel version to begin with, and therefore it's not unreasonable to 
assume that what you see is because of some misguided patch that got 
added to allow tar to archive /proc.



In all seriousness though, this started out because stuff wasn't cached
to anywhere near the degree it is today, and there was no such thing as
delayed allocation.  When you said to write, the filesystem allocated
the blocks, regardless of when it actually wrote the data.  IOW, the
behavior that GNU tar is relying on is an implementation detail, not an
API.  Just like df, this breaks under modern designs, not because they
chose to break it, but because it wasn't designed for use with such
implementations.


This seems to be a strange interpretation of what a standard is.
Except what I'm talking about is the _interpretation_ of the standard, 
not the standard itself.  I said nothing about the standard, all it 
requires is that st_blocks be the number of 512 byte blocks allocated by 
the filesystem for the file.  There is nothing in there about it having 
to reflect the expected size of the allocated content on disk.  In fact, 
there's technically nothing in there about how to handle sparse files 
either.


To further explain what I'm trying to say, here's a rough description of 
what happens in SVR4 UFS (and other non-delayed allocation filesystems) 
when you issue a write:
1. The number of new blocks needed to fulfill the write request is 
calculated.
2. If this number is greater than 0, that many new blocks are allocated, 
and st_blocks for that file is functionally updated (I don't recall if 
it was dynamically calculated per call or not)
3. At some indeterminate point in the future, the decision is made to 
flush the cache.

4. The data is written to the appropriate place in the file.

By comparison, in a delayed allocation scenario, 3 happens before 1 and 
2.  1 and 2 obviously have to be strictly ordered WRT each other and 4, 
but based on the POSIX standard, 3 does not have to be strictly ordered 
with regards to any of them (although it is illogical to have it between 
1 and 2 or after 4).  Because it is not required by the standard to have 
3 be strictly ordered and the ordering isn't part of the API itself, 
where it happens in the sequence is an implementation detail.
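
A quick way to watch this from a shell on a delayed-allocation filesystem
(the mount point is an assumption, and the exact block counts will vary):

dd if=/dev/urandom of=/mnt/testfile bs=1M count=4
stat -c 'size=%s blocks=%b' /mnt/testfile   # may report blocks=0 before writeback
sync
stat -c 'size=%s blocks=%b' /mnt/testfile   # blocks now reflects the on-disk allocation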



A new filesystem cannot introduce new rules just because people believe it would
save time.

Saying the file has no blocks when there are no blocks allocated for it
is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a
way to pack file data into the inode if it was small enough.  In that
case, it would be perfectly reasonable to return 0 for st_blocks
because the inode table in UFS is a fixed pre-allocated structure, and


Given that inode size is 128, such a change would not break things as the
heuristics would not imply a sparse file here.
OK, so change the heuristic 

Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Holger Hoffstätte
On 07/06/16 17:20, Hugo Mills wrote:
> On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote:
>> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte
>>  wrote:
>>> On 07/06/16 14:25, Wang Shilong wrote:
 'btrfs file du' is a very useful tool to watch my system
 file usage with snapshot aware.

 when trying to run following commands:
 [root@localhost btrfs-progs]# btrfs file du /
  Total   Exclusive  Set shared  Filename
 ERROR: Failed to lookup root id - Inappropriate ioctl for device
 ERROR: cannot check space of '/': Unknown error -1

 and My Filesystem looks like this:
 [root@localhost btrfs-progs]# df -Th
 Filesystem Type  Size  Used Avail Use% Mounted on
 devtmpfs   devtmpfs   16G 0   16G   0% /dev
 tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
 tmpfs  tmpfs  16G  1.4M   16G   1% /run
 tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
 /dev/sda3  btrfs  60G   19G   40G  33% /
 tmpfs  tmpfs  16G  332K   16G   1% /tmp
 /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
 /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
 /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
 tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000

 So I installed Btrfs as my root partition, but boot partition
 can be other fs.

 We can Let btrfs tool aware of this is not a btrfs file or
 directory and skip those files, so that someone like me
 could just run 'btrfs file du /' to scan all btrfs filesystems.

 After patch, it will look like:
Total   Exclusive  Set shared  Filename
 skipping not btrfs dir/file: boot
 skipping not btrfs dir/file: dev
 skipping not btrfs dir/file: proc
 skipping not btrfs dir/file: run
 skipping not btrfs dir/file: sys
  0.00B   0.00B   -  //root/.bash_logout
  0.00B   0.00B   -  //root/.bash_profile
  0.00B   0.00B   -  //root/.bashrc
  0.00B   0.00B   -  //root/.cshrc
  0.00B   0.00B   -  //root/.tcshrc

 This works for me to analysis system usage and analysis
 performaces.
>>>
>>> This is great, but can we please skip the "skipping .." messages?
>>> Maybe it's just me but I really don't see the value of printing them
>>> when they don't contribute to the result.
>>> They also mess up the display. :)
>>
>> I don't have a taste whether it needed or not, because it is somehow
>> useful to let users know some files/directories skipped

When you run "find /path -type d" you don't get messages for all the
things you just didn't want to find either.

>At the absolute minimum, I think that these messages should go to
> stderr (like du does when it deosn't have permissions), and should go
> away with -q. They're still irritating, but at least you can get rid
> of them easily.

If anything this should require a --verbose, not the other way
around. Maybe instead of breaking the output just indicate the
special status via "-- --" values, or default to 0.00?
Still, we're explicitly only interested in btrfs stuff and not
anything else, so printing non-information can only yield noise.

This is very much orthogonal to not printing anything after an
otherwise successful command execution.

-h




signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Wang Shilong
On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte
 wrote:
> On 07/06/16 14:25, Wang Shilong wrote:
>> 'btrfs file du' is a very useful tool to watch my system
>> file usage with snapshot aware.
>>
>> when trying to run following commands:
>> [root@localhost btrfs-progs]# btrfs file du /
>>  Total   Exclusive  Set shared  Filename
>> ERROR: Failed to lookup root id - Inappropriate ioctl for device
>> ERROR: cannot check space of '/': Unknown error -1
>>
>> and My Filesystem looks like this:
>> [root@localhost btrfs-progs]# df -Th
>> Filesystem Type  Size  Used Avail Use% Mounted on
>> devtmpfs   devtmpfs   16G 0   16G   0% /dev
>> tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
>> tmpfs  tmpfs  16G  1.4M   16G   1% /run
>> tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
>> /dev/sda3  btrfs  60G   19G   40G  33% /
>> tmpfs  tmpfs  16G  332K   16G   1% /tmp
>> /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
>> /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
>> /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
>> tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
>>
>> So I installed Btrfs as my root partition, but boot partition
>> can be other fs.
>>
>> We can Let btrfs tool aware of this is not a btrfs file or
>> directory and skip those files, so that someone like me
>> could just run 'btrfs file du /' to scan all btrfs filesystems.
>>
>> After patch, it will look like:
>>Total   Exclusive  Set shared  Filename
>> skipping not btrfs dir/file: boot
>> skipping not btrfs dir/file: dev
>> skipping not btrfs dir/file: proc
>> skipping not btrfs dir/file: run
>> skipping not btrfs dir/file: sys
>>  0.00B   0.00B   -  //root/.bash_logout
>>  0.00B   0.00B   -  //root/.bash_profile
>>  0.00B   0.00B   -  //root/.bashrc
>>  0.00B   0.00B   -  //root/.cshrc
>>  0.00B   0.00B   -  //root/.tcshrc
>>
>> This works for me to analysis system usage and analysis
>> performaces.
>
> This is great, but can we please skip the "skipping .." messages?
> Maybe it's just me but I really don't see the value of printing them
> when they don't contribute to the result.
> They also mess up the display. :)

I don't have a strong preference on whether it's needed or not, because it is
somewhat useful to let users know that some files/directories were skipped.

Let's wait for some other people's opinions on this...

thanks,
Shilong

>
> thanks,
> Holger
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Joerg Schilling
"Austin S. Hemmelgarn"  wrote:

> > It should be obvious that a file that offers content also has allocated 
> > blocks.
> What you mean then is that POSIX _implies_ that this is the case, but 
> does not say whether or not it is required.  There are all kinds of 
> counterexamples to this too, procfs is a POSIX compliant filesystem 
> (every POSIX certified system has it), yet does not display the behavior 
> that you expect, every single file in /proc for example reports 0 for 
> both st_blocks and st_size, and yet all of them very obviously have content.

You are mistaken.

stat /proc/$$/as
  File: `/proc/6518/as'
  Size: 2793472 Blocks: 5456   IO Block: 512regular file
Device: 544h/88342528d  Inode: 7557Links: 1
Access: (0600/-rw---)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
  File: `/proc/6518/auxv'
  Size: 168 Blocks: 1  IO Block: 512regular file
Device: 544h/88342528d  Inode: 7568Links: 1
Access: (0400/-r)  Uid: (   xx/   joerg)   Gid: (  xx/  bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in st_size as 
well as in st_blocks.

> In all seriousness though, this started out because stuff wasn't cached 
> to anywhere near the degree it is today, and there was no such thing as 
> delayed allocation.  When you said to write, the filesystem allocated 
> the blocks, regardless of when it actually wrote the data.  IOW, the 
> behavior that GNU tar is relying on is an implementation detail, not an 
> API.  Just like df, this breaks under modern designs, not because they 
> chose to break it, but because it wasn't designed for use with such 
> implementations.

This seems to be a strange interpretation of what a standard is.

> > A new filesystem cannot introduce new rules just because people believe it 
> > would
> > save time.
> Saying the file has no blocks when there are no blocks allocated for it 
> is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a 
> way to pack file data into the inode if it was small enough.  In that 
> case, it woulod be perfectly reasonable to return 0 for st_blocks 
> because the inode table in UFS is a fixed pre-allocated structure, and 

Given that inode size is 128, such a change would not break things as the 
heuristics would not imply a sparse file here.

> therefore nothing is allocated to the file itself except the inode.  The 
> same applies in the case of a file packed into it's own metadata block 
> on BTRFS, nothing is allocated to that file beyond the metadata block it 
> has to have to store the inode.  In the case of delayed allocation where 
> the file hasn't been flushed, there is nothing allocated, so st_blocks 
> based on a strict interpretation of it's description in POSIX _should_ 
> be 0, because nothing is allocated yet.

Now you know why BTRFS is still an incomplete filesystem. In a few years when 
it turns 10, this may change. People who implement filesystems of course need 
to learn that they need to hide implementation details from the official user 
space interfaces.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ 
http://sourceforge.net/projects/schilytools/files/'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Hugo Mills
On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote:
> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte
>  wrote:
> > On 07/06/16 14:25, Wang Shilong wrote:
> >> 'btrfs file du' is a very useful tool to watch my system
> >> file usage with snapshot aware.
> >>
> >> when trying to run following commands:
> >> [root@localhost btrfs-progs]# btrfs file du /
> >>  Total   Exclusive  Set shared  Filename
> >> ERROR: Failed to lookup root id - Inappropriate ioctl for device
> >> ERROR: cannot check space of '/': Unknown error -1
> >>
> >> and My Filesystem looks like this:
> >> [root@localhost btrfs-progs]# df -Th
> >> Filesystem Type  Size  Used Avail Use% Mounted on
> >> devtmpfs   devtmpfs   16G 0   16G   0% /dev
> >> tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
> >> tmpfs  tmpfs  16G  1.4M   16G   1% /run
> >> tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
> >> /dev/sda3  btrfs  60G   19G   40G  33% /
> >> tmpfs  tmpfs  16G  332K   16G   1% /tmp
> >> /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
> >> /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
> >> /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
> >> tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
> >>
> >> So I installed Btrfs as my root partition, but boot partition
> >> can be other fs.
> >>
> >> We can Let btrfs tool aware of this is not a btrfs file or
> >> directory and skip those files, so that someone like me
> >> could just run 'btrfs file du /' to scan all btrfs filesystems.
> >>
> >> After patch, it will look like:
> >>Total   Exclusive  Set shared  Filename
> >> skipping not btrfs dir/file: boot
> >> skipping not btrfs dir/file: dev
> >> skipping not btrfs dir/file: proc
> >> skipping not btrfs dir/file: run
> >> skipping not btrfs dir/file: sys
> >>  0.00B   0.00B   -  //root/.bash_logout
> >>  0.00B   0.00B   -  //root/.bash_profile
> >>  0.00B   0.00B   -  //root/.bashrc
> >>  0.00B   0.00B   -  //root/.cshrc
> >>  0.00B   0.00B   -  //root/.tcshrc
> >>
> >> This works for me to analysis system usage and analysis
> >> performaces.
> >
> > This is great, but can we please skip the "skipping .." messages?
> > Maybe it's just me but I really don't see the value of printing them
> > when they don't contribute to the result.
> > They also mess up the display. :)
> 
> I don't have a taste whether it needed or not, because it is somehow
> useful to let users know some files/directories skipped

   At the absolute minimum, I think that these messages should go to
stderr (like du does when it deosn't have permissions), and should go
away with -q. They're still irritating, but at least you can get rid
of them easily.

   Hugo.

> Wait some other guys opinion for this...
> 
> thanks,
> Shilong
> 
> >
> > thanks,
> > Holger
> >

-- 
Hugo Mills | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Stephen Franklin, Babylon 5


signature.asc
Description: Digital signature


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 10:53, Joerg Schilling wrote:

Antonio Diaz Diaz  wrote:


Joerg Schilling wrote:

POSIX requires st_blocks to be != 0 in case that the file contains data.


Please, could you provide a reference? I can't find such requirement at
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html


blkcnt_t st_blocks  Number of blocks allocated for this object.

It should be obvious that a file that offers content also has allocated blocks.
What you mean then is that POSIX _implies_ that this is the case, but 
does not say whether or not it is required.  There are all kinds of 
counterexamples to this too, procfs is a POSIX compliant filesystem 
(every POSIX certified system has it), yet does not display the behavior 
that you expect, every single file in /proc for example reports 0 for 
both st_blocks and st_size, and yet all of them very obviously have content.


Blocks are "allocated" when the OS decides whether the new data will fit on the
medium. The fact that some filesystems may have data in a cache but not yet on
the medium does not matter here. This is how UNIX worked since st_block has
been introduced nearly 40 years ago.
Tradition is the corpse of wisdom.  Backwards compatibility is a problem 
just as much as a good thing.


In all seriousness though, this started out because stuff wasn't cached 
to anywhere near the degree it is today, and there was no such thing as 
delayed allocation.  When you said to write, the filesystem allocated 
the blocks, regardless of when it actually wrote the data.  IOW, the 
behavior that GNU tar is relying on is an implementation detail, not an 
API.  Just like df, this breaks under modern designs, not because they 
chose to break it, but because it wasn't designed for use with such 
implementations.


In the case of tar and similar things though, I'd argue that it's not 
sensible to special case files that are 'sparse', it should store any 
long enough run of zeroes as a sparse region, then provide an option to 
say to not make those files sparse when restored.


A new filesystem cannot introduce new rules just because people believe it would
save time.
Saying the file has no blocks when there are no blocks allocated for it 
is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a 
way to pack file data into the inode if it was small enough.  In that 
case, it would be perfectly reasonable to return 0 for st_blocks 
because the inode table in UFS is a fixed pre-allocated structure, and 
therefore nothing is allocated to the file itself except the inode.  The 
same applies in the case of a file packed into its own metadata block 
on BTRFS: nothing is allocated to that file beyond the metadata block it 
has to have to store the inode.  In the case of delayed allocation where 
the file hasn't been flushed, there is nothing allocated, so st_blocks 
based on a strict interpretation of its description in POSIX _should_ 
be 0, because nothing is allocated yet.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Paul Eggert

On 07/06/2016 05:09 PM, Joerg Schilling wrote:

you concur that a delayed assignment of the "correct" value for
st_blocks while the content of the file does not change is not permitted.
I'm not sure I agree even with that. A file system may undergo garbage 
collection and compaction, for instance, in which a file's data do not 
change but its internal representation does.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Joerg Schilling
Paul Eggert  wrote:

> On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> > Antonio Diaz Diaz  wrote:
> >
> >> >Joerg Schilling wrote:
> >>> > >POSIX requires st_blocks to be != 0 in case that the file contains 
> >>> > >data.
> >> >
> >> >Please, could you provide a reference? I can't find such requirement at
> >> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
> > blkcnt_t st_blocks  Number of blocks allocated for this object.
>
> This doesn't require that st_blocks must be nonzero if the file contains 
> nonzero data, any more that it requires that st_blocks must be nonzero 
> if the file contains zero data. In either case, metadata outside the 
> scope of st_blocks might contain enough information for the file system 
> to represent all the file's data.

In other words, you concur that a delayed assignment of the "correct" value for 
st_blocks while the content of the file does not change is not permitted.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ 
http://sourceforge.net/projects/schilytools/files/'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Paul Eggert

On 07/06/2016 04:53 PM, Joerg Schilling wrote:

Antonio Diaz Diaz  wrote:


>Joerg Schilling wrote:

> >POSIX requires st_blocks to be != 0 in case that the file contains data.

>
>Please, could you provide a reference? I can't find such requirement at
>http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

blkcnt_t st_blocks  Number of blocks allocated for this object.


This doesn't require that st_blocks must be nonzero if the file contains 
nonzero data, any more than it requires that st_blocks must be nonzero 
if the file contains zero data. In either case, metadata outside the 
scope of st_blocks might contain enough information for the file system 
to represent all the file's data.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Joerg Schilling
Antonio Diaz Diaz  wrote:

> Joerg Schilling wrote:
> > POSIX requires st_blocks to be != 0 in case that the file contains data.
>
> Please, could you provide a reference? I can't find such requirement at 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

blkcnt_t st_blocks  Number of blocks allocated for this object.

It should be obvious that a file that offers content also has allocated blocks.

Blocks are "allocated" when the OS decides whether the new data will fit on the 
medium. The fact that some filesystems may have data in a cache but not yet on 
the medium does not matter here. This is how UNIX worked since st_block has 
been introduced nearly 40 years ago. 

A new filesystem cannot introduce new rules just because people believe it 
would save time.



Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ 
http://sourceforge.net/projects/schilytools/files/'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Antonio Diaz Diaz

Joerg Schilling wrote:

POSIX requires st_blocks to be != 0 in case that the file contains data.


Please, could you provide a reference? I can't find such requirement at 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html



Thanks.
Antonio.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Holger Hoffstätte
On 07/06/16 14:25, Wang Shilong wrote:
> 'btrfs file du' is a very useful tool to watch my system
> file usage with snapshot aware.
> 
> when trying to run following commands:
> [root@localhost btrfs-progs]# btrfs file du /
>  Total   Exclusive  Set shared  Filename
> ERROR: Failed to lookup root id - Inappropriate ioctl for device
> ERROR: cannot check space of '/': Unknown error -1
> 
> and My Filesystem looks like this:
> [root@localhost btrfs-progs]# df -Th
> Filesystem Type  Size  Used Avail Use% Mounted on
> devtmpfs   devtmpfs   16G 0   16G   0% /dev
> tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
> tmpfs  tmpfs  16G  1.4M   16G   1% /run
> tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
> /dev/sda3  btrfs  60G   19G   40G  33% /
> tmpfs  tmpfs  16G  332K   16G   1% /tmp
> /dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
> /dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
> /dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
> tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000
> 
> So I installed Btrfs as my root partition, but boot partition
> can be other fs.
> 
> We can Let btrfs tool aware of this is not a btrfs file or
> directory and skip those files, so that someone like me
> could just run 'btrfs file du /' to scan all btrfs filesystems.
> 
> After patch, it will look like:
>Total   Exclusive  Set shared  Filename
> skipping not btrfs dir/file: boot
> skipping not btrfs dir/file: dev
> skipping not btrfs dir/file: proc
> skipping not btrfs dir/file: run
> skipping not btrfs dir/file: sys
>  0.00B   0.00B   -  //root/.bash_logout
>  0.00B   0.00B   -  //root/.bash_profile
>  0.00B   0.00B   -  //root/.bashrc
>  0.00B   0.00B   -  //root/.cshrc
>  0.00B   0.00B   -  //root/.tcshrc
> 
> This works for me to analysis system usage and analysis
> performaces.

This is great, but can we please skip the "skipping .." messages?
Maybe it's just me but I really don't see the value of printing them
when they don't contribute to the result.
They also mess up the display. :)

thanks,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: du: fix to skip not btrfs dir/file

2016-07-06 Thread Wang Shilong
'btrfs file du' is a very useful tool to watch my system
file usage with snapshot awareness.

When trying to run the following commands:
[root@localhost btrfs-progs]# btrfs file du /
 Total   Exclusive  Set shared  Filename
ERROR: Failed to lookup root id - Inappropriate ioctl for device
ERROR: cannot check space of '/': Unknown error -1

and My Filesystem looks like this:
[root@localhost btrfs-progs]# df -Th
Filesystem Type  Size  Used Avail Use% Mounted on
devtmpfs   devtmpfs   16G 0   16G   0% /dev
tmpfs  tmpfs  16G  368K   16G   1% /dev/shm
tmpfs  tmpfs  16G  1.4M   16G   1% /run
tmpfs  tmpfs  16G 0   16G   0% /sys/fs/cgroup
/dev/sda3  btrfs  60G   19G   40G  33% /
tmpfs  tmpfs  16G  332K   16G   1% /tmp
/dev/sdc   btrfs 2.8T  166G  1.7T   9% /data
/dev/sda2  xfs   2.0G  452M  1.6G  23% /boot
/dev/sda1  vfat  1.9G   11M  1.9G   1% /boot/efi
tmpfs  tmpfs 3.2G   24K  3.2G   1% /run/user/1000

So I installed Btrfs as my root partition, but the boot partition
can be another fs.

We can let the btrfs tool be aware that this is not a btrfs file or
directory and skip those files, so that someone like me
could just run 'btrfs file du /' to scan all btrfs filesystems.

After patch, it will look like:
   Total   Exclusive  Set shared  Filename
skipping not btrfs dir/file: boot
skipping not btrfs dir/file: dev
skipping not btrfs dir/file: proc
skipping not btrfs dir/file: run
skipping not btrfs dir/file: sys
 0.00B   0.00B   -  //root/.bash_logout
 0.00B   0.00B   -  //root/.bash_profile
 0.00B   0.00B   -  //root/.bashrc
 0.00B   0.00B   -  //root/.cshrc
 0.00B   0.00B   -  //root/.tcshrc

This works for me to analyze system usage and
performance.

Signed-off-by: Wang Shilong 
---
 cmds-fi-du.c   | 11 ++-
 cmds-inspect.c |  2 +-
 utils.c|  8 
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index 12855a5..bf0e62c 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -389,8 +389,17 @@ static int du_walk_dir(struct du_dir_ctxt *ctxt, struct rb_root *shared_extents)
  dirfd(dirstream),
  shared_extents, &tot, &shr,
  0);
-   if (ret)
+   if (ret == -ENOTTY) {
+   fprintf(stdout,
+   "skipping not btrfs dir/file: %s\n",
+   entry->d_name);
+   continue;
+   } else if (ret) {
+   fprintf(stderr,
+   "failed to walk dir/file: %s :%s\n",
+   entry->d_name, strerror(-ret));
break;
+   }
 
ctxt->bytes_total += tot;
ctxt->bytes_shared += shr;
diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..2ae44be 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -323,7 +323,7 @@ static int cmd_inspect_rootid(int argc, char **argv)
 
ret = lookup_ino_rootid(fd, );
if (ret) {
-   error("rootid failed with ret=%d", ret);
+   error("failed to lookup root id: %s", strerror(-ret));
goto out;
}
 
diff --git a/utils.c b/utils.c
index 578fdb0..f73b048 100644
--- a/utils.c
+++ b/utils.c
@@ -2815,6 +2815,8 @@ path:
if (fd < 0)
goto err;
ret = lookup_ino_rootid(fd, );
+   if (ret)
+   error("failed to lookup root id: %s", strerror(-ret));
close(fd);
if (ret < 0)
goto err;
@@ -3497,10 +3499,8 @@ int lookup_ino_rootid(int fd, u64 *rootid)
args.objectid = BTRFS_FIRST_FREE_OBJECTID;
 
ret = ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args);
-   if (ret < 0) {
-   error("failed to lookup root id: %s", strerror(errno));
-   return ret;
-   }
+   if (ret < 0)
+   return -errno;
 
*rootid = args.treeid;
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Tomasz Torcz
On Wed, Jul 06, 2016 at 02:55:37PM +0300, Andrei Borzenkov wrote:
> On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-07-06 05:51, Andrei Borzenkov wrote:
> >>
> >> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy 
> >> wrote:
> >>>
> >>> I started a systemd-devel@ thread since that's where most udev stuff
> >>> gets talked about.
> >>>
> >>>
> >>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
> >>>
> >>
> >> Before discussing how to implement it in systemd, we need to decide
> >> what to implement. I.e.
> >>
> >> 1) do you always want to mount filesystem in degraded mode if not
> >> enough devices are present or only if explicit hint is given?
> >> 2) do you want to restrict degrade handling to root only or to other
> >> filesystems as well? Note that there could be more early boot
> >> filesystems that absolutely need same treatment (enters separate
> >> /usr), and there are also normal filesystems that may need be mounted
> >> even degraded.
> >> 3) can we query btrfs whether it is mountable in degraded mode?
> >> according to documentation, "btrfs device ready" (which udev builtin
> >> follows) checks "if it has ALL of it’s devices in cache for mounting".
> >> This is required for proper systemd ordering of services.
> >
> >
> > To be entirely honest, if it were me, I'd want systemd to fsck off.  If the
> > kernel mount(2) call succeeds, then the filesystem was ready enough to
> > mount, and if it doesn't, then it wasn't, end of story.
> 
> How should user space know when to try mount? What user space is
> supposed to do during boot if mount fails? Do you suggest
> 
> while true; do
>   mount /dev/foo && exit 0
> done
> 
> as part of startup sequence? And note that nowhere is systemd involved so far.

  Getting rid of such loops was the original motivation for the ioctl:
http://www.spinics.net/lists/linux-btrfs/msg17372.html

  Maybe the ioctl needs extending? Instead of returning 1/0, it could
take a flag saying ”return 1 as soon as degraded mount is possible”?
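
For reference, what exists today is the check behind "btrfs device ready", so
the best a mount helper can currently do is something like (the device path is
a placeholder):

# exit status 0: all devices are known to the kernel, a normal mount should work
# non-zero: keep waiting, or eventually give up and try -o degraded
btrfs device ready /dev/sdc1 && mount /dev/sdc1 /mnt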
  
-- 
Tomasz Torcz Morality must always be based on practicality.
xmpp: zdzich...@chrome.pl-- Baron Vladimir Harkonnen

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 08:39, Andrei Borzenkov wrote:



Отправлено с iPhone


6 июля 2016 г., в 15:14, Austin S. Hemmelgarn  написал(а):


On 2016-07-06 07:55, Andrei Borzenkov wrote:
On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 05:51, Andrei Borzenkov wrote:


On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy 
wrote:


I started a systemd-devel@ thread since that's where most udev stuff
gets talked about.


https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html


Before discussing how to implement it in systemd, we need to decide
what to implement. I.e.

1) do you always want to mount filesystem in degraded mode if not
enough devices are present or only if explicit hint is given?
2) do you want to restrict degrade handling to root only or to other
filesystems as well? Note that there could be more early boot
filesystems that absolutely need same treatment (enters separate
/usr), and there are also normal filesystems that may need be mounted
even degraded.
3) can we query btrfs whether it is mountable in degraded mode?
according to documentation, "btrfs device ready" (which udev builtin
follows) checks "if it has ALL of it’s devices in cache for mounting".
This is required for proper systemd ordering of services.



To be entirely honest, if it were me, I'd want systemd to fsck off.  If the
kernel mount(2) call succeeds, then the filesystem was ready enough to
mount, and if it doesn't, then it wasn't, end of story.


How should user space know when to try mount? What user space is
supposed to do during boot if mount fails? Do you suggest

while true; do
 mount /dev/foo && exit 0
done

as part of startup sequence? And note that nowhere is systemd involved so far.

Nowhere there, except if you have a filesystem in fstab (or a mount unit, which 
I hate for other reasons that I will not go into right now), and you mount it 
and systemd thinks the device isn't ready, it unmounts it _immediately_.  In 
the case of boot, it's because of systemd thinking the device isn't ready that 
you can't mount degraded with a missing device.  In the case of the root 
filesystem at least, the initramfs is expected to handle this, and most of them 
do poll in some way, or have other methods of determining this.  I occasionally 
have issues with it with dracut without systemd, but that's due to a separate 
bug there involving the device mapper.



How this systemd bashing answers my question - how user space knows when it can 
call mount at startup?
You mentioned that systemd wasn't involved, which is patently false if 
it's being used as your init system, and I was admittedly mostly 
responding to that.


Now, to answer the primary question which I forgot to answer:
Userspace doesn't.  Systemd doesn't either but assumes it does and 
checks in a flawed way.  Dracut's polling loop assumes it does but 
sometimes fails in a different way.  There is no way other than calling 
mount right now to know for sure if the mount will succeed, and that 
actually applies to a certain degree to any filesystem (because any 
number of things that are outside of even the kernel's control might 
happen while trying to mount the device).
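
For what it's worth, the usual workaround in initramfs land is exactly that
kind of bounded polling, something along these lines (purely illustrative,
not what any particular initrd actually ships; the device path is a
hypothetical placeholder):

# wait a bounded time for all member devices, then fall back to degraded
dev=/dev/disk/by-uuid/XXXXXXXX
for i in $(seq 1 30); do
    btrfs device ready "$dev" && break
    sleep 1
done
mount "$dev" /sysroot || mount -o degraded "$dev" /sysroot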






The whole concept
of trying to track in userspace something the kernel itself tracks and knows
a whole lot more about is absolutely stupid.


It need not be user space. If kernel notifies user space when
filesystem is mountable, problem solved. It could be udev event,
netlink, whatever. Until kernel does it, user space need to either
poll or somehow track it based on available events.

This I agree could be done better, but it absolutely should not be in 
userspace, the notification needs to come from the kernel, but that leads to 
the problem of knowing whether or not the FS can mount degraded, or only ro, or 
any number of other situations.



It makes some sense when
dealing with LVM or MD, because that is potentially a security issue
(someone could inject a bogus device node that you then mount instead of
your desired target),


I do not understand it at all. MD and LVM has exactly the same problem
- they need to know when they can assemble MD/VG. I miss what it has
to do with security, sorry.

If you don't track whether or not the device is assembled, then someone could 
create an arbitrary device node with the same name and then get you to mount 
that, possibly causing all kinds of issues depending on any number of other 
factors.


Device node is created as soon as array is seen for the first time. If you 
imply someone may replace it, what prevents doing it at any arbitrary time in 
the future?
It's still possible, but it's not as easy because replacing it after 
it's mounted would require a remount to have any effect.  The most 
reliable time to do something like this is during boot before the mount. 
 LVM and/or MD may or may not replace the node properly when they start 
(I don't have enough 

Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Andrei Borzenkov


Sent from my iPhone

> On 6 July 2016, at 15:14, Austin S. Hemmelgarn  
> wrote:
> 
>> On 2016-07-06 07:55, Andrei Borzenkov wrote:
>> On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
>>  wrote:
>>> On 2016-07-06 05:51, Andrei Borzenkov wrote:
 
 On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy 
 wrote:
> 
> I started a systemd-devel@ thread since that's where most udev stuff
> gets talked about.
> 
> 
> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
 
 Before discussing how to implement it in systemd, we need to decide
 what to implement. I.e.
 
 1) do you always want to mount filesystem in degraded mode if not
 enough devices are present or only if explicit hint is given?
 2) do you want to restrict degrade handling to root only or to other
 filesystems as well? Note that there could be more early boot
 filesystems that absolutely need same treatment (enters separate
 /usr), and there are also normal filesystems that may need be mounted
 even degraded.
 3) can we query btrfs whether it is mountable in degraded mode?
 according to documentation, "btrfs device ready" (which udev builtin
 follows) checks "if it has ALL of it’s devices in cache for mounting".
 This is required for proper systemd ordering of services.
>>> 
>>> 
>>> To be entirely honest, if it were me, I'd want systemd to fsck off.  If the
>>> kernel mount(2) call succeeds, then the filesystem was ready enough to
>>> mount, and if it doesn't, then it wasn't, end of story.
>> 
>> How should user space know when to try mount? What user space is
>> supposed to do during boot if mount fails? Do you suggest
>> 
>> while true; do
>>  mount /dev/foo && exit 0
>> done
>> 
>> as part of startup sequence? And note that nowhere is systemd involved so 
>> far.
> Nowhere there, except if you have a filesystem in fstab (or a mount unit, 
> which I hate for other reasons that I will not go into right now), and you 
> mount it and systemd thinks the device isn't ready, it unmounts it 
> _immediately_.  In the case of boot, it's because of systemd thinking the 
> device isn't ready that you can't mount degraded with a missing device.  In 
> the case of the root filesystem at least, the initramfs is expected to handle 
> this, and most of them do poll in some way, or have other methods of 
> determining this.  I occasionally have issues with it with dracut without 
> systemd, but that's due to a separate bug there involving the device mapper.
> 

How does this systemd bashing answer my question - how does user space know when 
it can call mount at startup?


>> 
>>> The whole concept
>>> of trying to track in userspace something the kernel itself tracks and knows
>>> a whole lot more about is absolutely stupid.
>> 
>> It need not be user space. If kernel notifies user space when
>> filesystem is mountable, problem solved. It could be udev event,
>> netlink, whatever. Until kernel does it, user space need to either
>> poll or somehow track it based on available events.
> This I agree could be done better, but it absolutely should not be in 
> userspace, the notification needs to come from the kernel, but that leads to 
> the problem of knowing whether or not the FS can mount degraded, or only ro, 
> or any number of other situations.
>> 
>>> It makes some sense when
>>> dealing with LVM or MD, because that is potentially a security issue
>>> (someone could inject a bogus device node that you then mount instead of
>>> your desired target),
>> 
>> I do not understand it at all. MD and LVM has exactly the same problem
>> - they need to know when they can assemble MD/VG. I miss what it has
>> to do with security, sorry.
> If you don't track whether or not the device is assembled, then someone could 
> create an arbitrary device node with the same name and then get you to mount 
> that, possibly causing all kinds of issues depending on any number of other 
> factors.

Device node is created as soon as array is seen for the first time. If you 
imply someone may replace it, what prevents doing it at any arbitrary time in 
the future?

>> 
>>> but it makes no sense here, because there's no way to
>>> prevent the equivalent from happening in BTRFS.
>>> 
>>> As far as the udev rules, I'm pretty certain that _we_ ship those with
>>> btrfs-progs,
>> 
>> No, you do not. You ship rule to rename devices to be more
>> "user-friendly". But the rule in question has always been part of
>> udev.
> Ah, you're right, I was mistaken about this.
>> 
>>> I have no idea why they're packaged with udev in CentOS (oh
>>> wait, I bet they package every single possible udev rule in that package
>>> just in case, don't they?).
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()

2016-07-06 Thread Holger Hoffstätte
On 07/06/16 12:37, Wang Xiaoguang wrote:
> Below test scripts can reproduce this false ENOSPC:
>   #!/bin/bash
>   dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
>   dev=$(losetup --show -f fs.img)
>   mkfs.btrfs -f -M $dev
>   mkdir /tmp/mntpoint
>   mount /dev/loop0 /tmp/mntpoint
>   cd mntpoint
>   xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile
> 
> Above fallocate(2) operation will fail for ENOSPC reason, but indeed
> fs still has free space to satisfy this request. The reason is
> btrfs_fallocate() dose not decrease btrfs_space_info's bytes_may_use
> just in time, and it calls btrfs_free_reserved_data_space_noquota() in
> the end of btrfs_fallocate(), which is too late and have already added
> false unnecessary pressure to enospc system. See call graph:
> btrfs_fallocate()
> |-> btrfs_alloc_data_chunk_ondemand()
> It will add btrfs_space_info's bytes_may_use accordingly.
> |-> btrfs_prealloc_file_range()
> It will call btrfs_reserve_extent(), but note that alloc type is
> RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will
> only increase btrfs_space_info's bytes_reserved accordingly, but
> will not decrease btrfs_space_info's bytes_may_use, then obviously
> we have overestimated real needed disk space, and it'll impact
> other processes who do write(2) or fallocate(2) operations, also
> can impact metadata reservation in mixed mode, and bytes_max_use
> will only be decreased in the end of btrfs_fallocate(). To fix
> this false ENOSPC, we need to decrease btrfs_space_info's
> bytes_may_use in btrfs_prealloc_file_range() in time, as what we
> do in cow_file_range(),
> See call graph in :
> cow_file_range()
> |-> extent_clear_unlock_delalloc()
> |-> clear_extent_bit()
> |-> btrfs_clear_bit_hook()
> |-> btrfs_free_reserved_data_space_noquota()
> This function will decrease bytes_may_use accordingly.
> 
> So this patch choose to call btrfs_free_reserved_data_space() in
> __btrfs_prealloc_file_range() for both successful and failed path.
> 
> Also this patch removes some old and useless comments.
> 
> Signed-off-by: Wang Xiaoguang 

Verified that the reproducer script indeed fails (with btrfs ~4.7) and
the patch (on top of 1/2) fixes it. Also ran a bunch of other fallocating
things without problem. Free space also still seems sane, as far as I
could tell.
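
For anyone who wants to watch the accounting while the reproducer runs, the
per-space-info counters are exported in sysfs (this assumes a kernel that has
the /sys/fs/btrfs/<uuid>/allocation files; the mountpoint and the awk filter
are just illustrative):

uuid=$(btrfs filesystem show /tmp/mntpoint | awk '/uuid:/ {print $NF}')
watch -n1 "grep . /sys/fs/btrfs/$uuid/allocation/data/bytes_may_use \
                  /sys/fs/btrfs/$uuid/allocation/data/bytes_reserved"
# Per the analysis in the commit message, without the patch bytes_may_use
# keeps growing while the fallocate runs and is only released when
# btrfs_fallocate() returns.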

So for both patches:

Tested-by: Holger Hoffstätte 

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.

2016-07-06 Thread Tomasz Kusmierz

> On 6 Jul 2016, at 02:25, Henk Slager  wrote:
> 
> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz  
> wrote:
>> 
>> On 6 Jul 2016, at 00:30, Henk Slager  wrote:
>> 
>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz 
>> wrote:
>> 
>> I did consider that, but:
>> - some files were NOT accessed by anything with 100% certainty (well if
>> there is a rootkit on my system or something in that shape than maybe yes)
>> - the only application that could access those files is totem (well
>> Nautilius checks extension -> directs it to totem) so in that case we would
>> hear about out break of totem killing people files.
>> - if it was a kernel bug then other large files would be affected.
>> 
>> Maybe I’m wrong and it’s actually related to the fact that all those files
>> are located in single location on file system (single folder) that might
>> have a historical bug in some structure somewhere ?
>> 
>> 
>> I find it hard to imagine that this has something to do with the
>> folderstructure, unless maybe the folder is a subvolume with
>> non-default attributes or so. How the files in that folder are created
>> (at full disktransferspeed or during a day or even a week) might give
>> some hint. You could run filefrag and see if that rings a bell.
>> 
>> files that are 4096 show:
>> 1 extent found
> 
> I actually meant filefrag for the files that are not (yet) truncated
> to 4k. For example for virtual machine imagefiles (CoW), one could see
> an MBR write.
117 extents found
filesize 15468645003

good / bad ?  
> 
>> I did forgot to add that file system was created a long time ago and it was
>> created with leaf & node size = 16k.
>> 
>> 
>> If this long time ago is >2 years then you have likely specifically
>> set node size = 16k, otherwise with older tools it would have been 4K.
>> 
>> You are right I used -l 16K -n 16K
>> 
>> Have you created it as raid10 or has it undergone profile conversions?
>> 
>> Due to lack of spare disks
>> (it may sound odd for some but spending for more than 6 disks for home use
>> seems like an overkill)
>> and due to the last issue I've had, I had to migrate all data to a new file system.
>> This played that way that I’ve:
>> 1. from original FS I’ve removed 2 disks
>> 2. Created RAID1 on those 2 disks,
>> 3. shifted 2TB
>> 4. removed 2 disks from source FS and added those to destination FS
>> 5. shifted 2 further TB
>> 6. destroyed original FS and added 2 disks to destination FS
>> 7. converted destination FS to RAID10
>> 
>> FYI, when I convert to raid 10 I use:
>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f
>> /path/to/FS
>> 
>> this filesystem has 5 sub volumes. Files affected are located in separate
>> folder within a “victim folder” that is within a one sub volume.
>> 
>> 
>> It could also be that the ondisk format is somewhat corrupted (btrfs
>> check should find that ) and that that causes the issue.
>> 
>> 
>> root@noname_server:/mnt# btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 4424060642634 bytes used err is 0
>> total csum bytes: 4315954936
>> total tree bytes: 4522786816
>> total fs tree bytes: 61702144
>> total extent tree bytes: 41402368
>> btree space waste bytes: 72430813
>> file data blocks allocated: 4475917217792
>> referenced 4420407603200
>> 
>> No luck there :/
> 
> Indeed looks all normal.
> 
>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>> time, it has happened over a year ago with kernels recent at that
>> time, but the fs was converted from raid5
>> 
>> Could you please elaborate on that ? you also ended up with files that got
>> truncated to 4096 bytes ?
> 
> I did not have truncated to 4k files, but your case lets me think of
> small files inlining. Default max_inline mount option is 8k and that
> means that 0 to ~3k files end up in metadata. I had size corruptions
> for several of those small sized files that were updated quite
> frequent, also within commit time AFAIK. Btrfs check lists this as
> errors 400, although fs operation is not disturbed. I don't know what
> happens if those small files are being updated/rewritten and are just
> below or just above the max_inline limit.
> 
> The only thing I was thinking of is that your files were started as
> small, so inline, then extended to multi-GB. In the past, there were
> 'bad extent/chunk type' issues and it was suggested that the fs would
> have been an ext4-converted one (which had non-compliant mixed
> metadata and data) but for most it was not the case. So there was/is
> something unclear, but full balance or so fixed it as far as I
> remember. But it is guessing, I do not have any failure cases like the
> one you see.
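
(Side note on the max_inline limit mentioned above: it is an ordinary mount
option, so the inlining threshold can be tuned or disabled per mount. A purely
illustrative example, not a recommendation for this particular filesystem:

mount -o remount,max_inline=0 /mnt/pool   # disable inlining of small files entirely
)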

When I think of it, I did move this folder first when filesystem was RAID 1 (or 
not 

Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 07:55, Andrei Borzenkov wrote:

On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 05:51, Andrei Borzenkov wrote:


On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy 
wrote:


I started a systemd-devel@ thread since that's where most udev stuff
gets talked about.


https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html



Before discussing how to implement it in systemd, we need to decide
what to implement. I.e.

1) do you always want to mount filesystem in degraded mode if not
enough devices are present or only if explicit hint is given?
2) do you want to restrict degrade handling to root only or to other
filesystems as well? Note that there could be more early boot
filesystems that absolutely need same treatment (enters separate
/usr), and there are also normal filesystems that may need be mounted
even degraded.
3) can we query btrfs whether it is mountable in degraded mode?
according to documentation, "btrfs device ready" (which udev builtin
follows) checks "if it has ALL of it’s devices in cache for mounting".
This is required for proper systemd ordering of services.



To be entirely honest, if it were me, I'd want systemd to fsck off.  If the
kernel mount(2) call succeeds, then the filesystem was ready enough to
mount, and if it doesn't, then it wasn't, end of story.


How should user space know when to try mount? What user space is
supposed to do during boot if mount fails? Do you suggest

while true; do
  mount /dev/foo && exit 0
done

as part of startup sequence? And note that nowhere is systemd involved so far.
Nowhere there, except if you have a filesystem in fstab (or a mount 
unit, which I hate for other reasons that I will not go into right now), 
and you mount it and systemd thinks the device isn't ready, it unmounts 
it _immediately_.  In the case of boot, it's because of systemd thinking 
the device isn't ready that you can't mount degraded with a missing 
device.  In the case of the root filesystem at least, the initramfs is 
expected to handle this, and most of them do poll in some way, or have 
other methods of determining this.  I occasionally have issues with it 
with dracut without systemd, but that's due to a separate bug there 
involving the device mapper.





The whole concept
of trying to track in userspace something the kernel itself tracks and knows
a whole lot more about is absolutely stupid.


It need not be user space. If kernel notifies user space when
filesystem is mountable, problem solved. It could be udev event,
netlink, whatever. Until kernel does it, user space need to either
poll or somehow track it based on available events.
THis I agree could be done better, but it absolutely should not be in 
userspace, the notification needs to come from the kernel, but that 
leads to the problem of knowing whether or not the FS can mount 
degraded, or only ro, or any number of other situations.



It makes some sense when
dealing with LVM or MD, because that is potentially a security issue
(someone could inject a bogus device node that you then mount instead of
your desired target),


I do not understand it at all. MD and LVM has exactly the same problem
- they need to know when they can assemble MD/VG. I miss what it has
to do with security, sorry.
If you don't track whether or not the device is assembled, then someone 
could create an arbitrary device node with the same name and then get 
you to mount that, possibly causing all kinds of issues depending on any 
number of other factors.



but it makes no sense here, because there's no way to
prevent the equivalent from happening in BTRFS.

As far as the udev rules, I'm pretty certain that _we_ ship those with
btrfs-progs,


No, you do not. You ship rule to rename devices to be more
"user-friendly". But the rule in question has always been part of
udev.

Ah, you're right, I was mistaken about this.



I have no idea why they're packaged with udev in CentOS (oh
wait, I bet they package every single possible udev rule in that package
just in case, don't they?).


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Andrei Borzenkov
On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 05:51, Andrei Borzenkov wrote:
>>
>> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy 
>> wrote:
>>>
>>> I started a systemd-devel@ thread since that's where most udev stuff
>>> gets talked about.
>>>
>>>
>>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>>>
>>
>> Before discussing how to implement it in systemd, we need to decide
>> what to implement. I.e.
>>
>> 1) do you always want to mount filesystem in degraded mode if not
>> enough devices are present or only if explicit hint is given?
>> 2) do you want to restrict degrade handling to root only or to other
>> filesystems as well? Note that there could be more early boot
>> filesystems that absolutely need same treatment (enters separate
>> /usr), and there are also normal filesystems that may need be mounted
>> even degraded.
>> 3) can we query btrfs whether it is mountable in degraded mode?
>> according to documentation, "btrfs device ready" (which udev builtin
>> follows) checks "if it has ALL of it’s devices in cache for mounting".
>> This is required for proper systemd ordering of services.
>
>
> To be entirely honest, if it were me, I'd want systemd to fsck off.  If the
> kernel mount(2) call succeeds, then the filesystem was ready enough to
> mount, and if it doesn't, then it wasn't, end of story.

How should user space know when to try mount? What user space is
supposed to do during boot if mount fails? Do you suggest

while true; do
  mount /dev/foo && exit 0
done

as part of startup sequence? And note that nowhere is systemd involved so far.

> The whole concept
> of trying to track in userspace something the kernel itself tracks and knows
> a whole lot more about is absolutely stupid.

It need not be user space. If kernel notifies user space when
filesystem is mountable, problem solved. It could be udev event,
netlink, whatever. Until kernel does it, user space need to either
poll or somehow track it based on available events.

> It makes some sense when
> dealing with LVM or MD, because that is potentially a security issue
> (someone could inject a bogus device node that you then mount instead of
> your desired target),

I do not understand it at all. MD and LVM has exactly the same problem
- they need to know when they can assemble MD/VG. I miss what it has
to do with security, sorry.

> but it makes no sense here, because there's no way to
> prevent the equivalent from happening in BTRFS.
>
> As far as the udev rules, I'm pretty certain that _we_ ship those with
> btrfs-progs,

No, you do not. You ship rule to rename devices to be more
"user-friendly". But the rule in question has always been part of
udev.

> I have no idea why they're packaged with udev in CentOS (oh
> wait, I bet they package every single possible udev rule in that package
> just in case, don't they?).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-05 19:05, Chris Murphy wrote:

Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.
The thing I don't get about this is that while the per-device settings 
on a given system are policy, the default value is not, and should be 
expected to work correctly (but not necessarily optimally) on as many 
systems as possible, so any claim that this should be fixed in udev is 
bogus by the regular kernel rules.
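
For reference, the usual per-device mitigation for that mismatch (consumer
drives spending far longer on internal error recovery than the kernel's
default 30 second command timer) looks roughly like this; the values are
illustrative and it has to be reapplied on every boot, e.g. from a udev rule:

smartctl -l scterc,70,70 /dev/sdX          # cap drive error recovery at 7.0s, if SCT ERC is supported
echo 180 > /sys/block/sdX/device/timeout   # otherwise, give the SCSI command timer more headroom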


I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.
OTOH, I would not be surprised if the stance there is 'you get no 
support if you're not using enterprise drives', not because of the project 
itself, but because it's ZFS.  Part of their minimum recommended 
hardware requirements is ECC RAM, so it wouldn't surprise me if 
enterprise storage devices are there too.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Joerg Schilling
"Austin S. Hemmelgarn"  wrote:

> > A broken filesystem is a broken filesystem.
> >
> > If you try to change gtar to work around a specific problem, it may fail in
> > other situations.
> The problem with this is that tar is assuming things that are not 
> guaranteed to be true.  There is absolutely nothing that says that 
> st_blocks has to be non-zero if there's data in the file.  In fact, the 

This is not true: POSIX requires st_blocks to be != 0 in case that the file 
contains data.

> behavior that BTRFS used to have of reporting st_blocks to be 0 for 
> files entirely inlined in the metadata is absolutely correct given the 
> description of the field by POSIX, because there _are_ no blocks 
> allocated to the file (because the metadata block is technically 
> equivalent to the inode, which isn't counted by st_blocks).  This is yet 
> another example of an old interface (in this case, sparse file 
> detection) being short-sighted (read in this case as non-existent).

The internal state of a file system is irrelevant. The only thing that counts 
is the user space view and if a file contains data (read succeeds in user 
space), it needs to report st_blocks != 0.

> The proper fix for this is that tar (and anything else that handles 
> sparse files differently) should be parsing the file regardless.  It has 
> to anyway for a normal sparse file to figure out where the sparse 
> regions are, and optimizing for a file that's completely sparse (and 
> therefore probably pre-allocated with fallocate) is not all that 
> reasonable considering that this is going to be a very rare case in 
> normal usage.

This does not help.

Even on a decent OS (e.g. Solaris since Summer 2005) and a decent tar 
implementation (star) that supports SEEK_HOLE since Summer 2005, this method 
will not work for all filesystems as there may be old filesystem 
implementations and as there may be NFS...

For this reason, star still checks st_blocks in case that SEEK_HOLE did not 
work.
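
A rough sketch of that detection order -- prefer SEEK_HOLE, fall back to the
st_blocks heuristic only when hole seeking is unsupported -- might look like
the following. This is not star's or gtar's actual code, just an illustration
of the logic being discussed (and the fallback branch is exactly the heuristic
that the delayed st_blocks update on btrfs trips up):

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

static bool file_looks_sparse(int fd, const struct stat *st)
{
	off_t hole = lseek(fd, 0, SEEK_HOLE);

	if (hole != (off_t)-1)
		return hole < st->st_size;	/* a real hole before EOF */
	if (errno == EINVAL)			/* no SEEK_HOLE support */
		return (off_t)st->st_blocks * 512 < st->st_size;
	return false;
}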

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ 
http://sourceforge.net/projects/schilytools/files/'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 05:51, Andrei Borzenkov wrote:

On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy  wrote:

I started a systemd-devel@ thread since that's where most udev stuff
gets talked about.

https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html



Before discussing how to implement it in systemd, we need to decide
what to implement. I.e.

1) do you always want to mount filesystem in degraded mode if not
enough devices are present or only if explicit hint is given?
2) do you want to restrict degrade handling to root only or to other
filesystems as well? Note that there could be more early boot
filesystems that absolutely need same treatment (enters separate
/usr), and there are also normal filesystems that may need be mounted
even degraded.
3) can we query btrfs whether it is mountable in degraded mode?
according to documentation, "btrfs device ready" (which udev builtin
follows) checks "if it has ALL of it’s devices in cache for mounting".
This is required for proper systemd ordering of services.


To be entirely honest, if it were me, I'd want systemd to fsck off.  If 
the kernel mount(2) call succeeds, then the filesystem was ready enough 
to mount, and if it doesn't, then it wasn't, end of story.  The whole 
concept of trying to track in userspace something the kernel itself 
tracks and knows a whole lot more about is absolutely stupid.  It makes 
some sense when dealing with LVM or MD, because that is potentially a 
security issue (someone could inject a bogus device node that you then 
mount instead of your desired target), but it makes no sense here, 
because there's no way to prevent the equivalent from happening in BTRFS.


As far as the udev rules, I'm pretty certain that _we_ ship those with 
btrfs-progs, I have no idea why they're packaged with udev in CentOS (oh 
wait, I bet they package every single possible udev rule in that package 
just in case, don't they?).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-05 05:28, Joerg Schilling wrote:

Andreas Dilger  wrote:


I think in addition to fixing btrfs (because it needs to work with existing
tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
to handle this situation more robustly.  One option is if st_blocks == 0 then
tar should also check if st_mtime is less than 60s in the past, and if yes
then it should call fsync() on the file to flush any unwritten data to disk,
or assume the file is not sparse and read the whole file, so that it doesn't
incorrectly assume that the file is sparse and skip archiving the file data.


A broken filesystem is a broken filesystem.

If you try to change gtar to work around a specific problem, it may fail in
other situations.
The problem with this is that tar is assuming things that are not 
guaranteed to be true.  There is absolutely nothing that says that 
st_blocks has to be non-zero if there's data in the file.  In fact, the 
behavior that BTRFS used to have of reporting st_blocks to be 0 for 
files entirely inlined in the metadata is absolutely correct given the 
description of the field by POSIX, because there _are_ no blocks 
allocated to the file (because the metadata block is technically 
equivalent to the inode, which isn't counted by st_blocks).  This is yet 
another example of an old interface (in this case, sparse file 
detection) being short-sighted (read in this case as non-existent).


The proper fix for this is that tar (and anything else that handles 
sparse files differently) should be parsing the file regardless.  It has 
to anyway for a normal sparse file to figure out where the sparse 
regions are, and optimizing for a file that's completely sparse (and 
therefore probably pre-allocated with fallocate) is not all that 
reasonable considering that this is going to be a very rare case in 
normal usage.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Out of space error even though there's 100 GB unused?

2016-07-06 Thread Stanislaw Kaminski
Hi Hugo,
I agree that it seems to be a bug, and I'll be glad to help nail that
down - if only because I have no other drive to move the data to :-)

As for your suggestion - no change:
[root@archb3 stan]# mount | grep home
/dev/sda4 on /home type btrfs
(rw,relatime,nospace_cache,clear_cache,subvolid=5,subvol=/)
[root@archb3 stan]# touch test
touch: cannot touch 'test': No space left on device

Cheers,
Stan

2016-07-06 12:34 GMT+02:00 Hugo Mills :
> On Wed, Jul 06, 2016 at 11:55:42AM +0200, Stanislaw Kaminski wrote:
>> Hi,
>> I am fighting with this since at least Monday - see
>> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
>>
>> Here's the data:
>> #   uname -a
>> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016
>> armv5tel GNU/Linux
>>
>> #   btrfs --version
>> btrfs-progs v4.6
>>
>> #   btrfs fi show
>> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
>> Total devices 1 FS bytes used 1.71TiB
>> devid1 size 1.81TiB used 1.71TiB path /dev/sda4
>
>In this state, you should definitely not be seeing out of space
> errors. This is, therefore, a bug you're seeing.
>
>I've not been following things as closely as I'd like of late, but
> I think there was a bug recently involving the free space cache. It
> might be worth unmounting the FS and mounting again with the
> nospace_cache option, just to see if that helps.
>
>Hugo.
>
>> #   btrfs fi df /home
>> Data, single: total=1.71TiB, used=1.71TiB
>> System, DUP: total=32.00MiB, used=224.00KiB
>> Metadata, DUP: total=4.00GiB, used=2.07GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs f usage -T  /home
>> Overall:
>> Device size:   1.81TiB
>> Device allocated:  1.71TiB
>> Device unallocated:   97.89GiB
>> Device missing:  0.00B
>> Used:  1.71TiB
>> Free (estimated): 98.22GiB  (min: 49.27GiB)
>> Data ratio:   1.00
>> Metadata ratio:   2.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>>  DataMetadata System
>> Id Path  single  DUP  DUP   Unallocated
>> -- - ---  - ---
>>  1 /dev/sda4 1.71TiB  8.00GiB  64.00MiB97.89GiB
>> -- - ---  - ---
>>Total 1.71TiB  4.00GiB  32.00MiB97.89GiB
>>Used  1.71TiB  2.07GiB 224.00KiB
>>
>> # btrfs fi du -s /home
>> Total Exclusive Set shared Filename
>> 1.60TiB 1.60TiB 0.00B /home
>>
>> # btrfs f resize 1:+1G /home/
>> Resize '/home/' of '1:+1G'
>> ERROR: unable to resize '/home/': no enough free space
>>
>> This all is after closely following:
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>
>> So, already did full volume rebalance, defrag, rebooted multiple times
>> - still, "Error: out of disk space".
>>
>> To sum up:
>> - my files sum to 1.6 TiB
>> - disk usage is shown to be 1.71 TiB
>> - volume size is 1.81 TiB
>> - btrfs util shows I have ~98 GiB free space on the volume
>> - I am getting "out of space" message
>>
>> Bonus:
>> - I removed 50 GB of data from the drive and I still get "out of
>> space" message after writing ~1 GB.
>>
>> Help would be very appreciated.
>>
>> Cheers,
>> Stan
>
> --
> Hugo Mills | You can play with your friends' privates, but you
> hugo@... carfax.org.uk | can't play with your friends' childrens' privates.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |   C++ coding rule
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()

2016-07-06 Thread Wang Xiaoguang
Below test scripts can reproduce this false ENOSPC:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount /dev/loop0 /tmp/mntpoint
cd mntpoint
xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile

The above fallocate(2) operation will fail with ENOSPC, but the fs
actually still has free space to satisfy the request. The reason is that
btrfs_fallocate() does not decrease btrfs_space_info's bytes_may_use
in time; it only calls btrfs_free_reserved_data_space_noquota() at
the end of btrfs_fallocate(), which is too late and has already added
false, unnecessary pressure to the enospc system. See call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
It will add btrfs_space_info's bytes_may_use accordingly.
|-> btrfs_prealloc_file_range()
It will call btrfs_reserve_extent(), but note that alloc type is
RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will
only increase btrfs_space_info's bytes_reserved accordingly, but
will not decrease btrfs_space_info's bytes_may_use, so obviously
we have overestimated the disk space really needed. This impacts
other processes that do write(2) or fallocate(2) operations, and can
also impact metadata reservation in mixed mode; bytes_may_use
will only be decreased at the end of btrfs_fallocate(). To fix
this false ENOSPC, we need to decrease btrfs_space_info's
bytes_may_use in btrfs_prealloc_file_range() in time, as we
do in cow_file_range().
See the call graph:
cow_file_range()
|-> extent_clear_unlock_delalloc()
|-> clear_extent_bit()
|-> btrfs_clear_bit_hook()
|-> btrfs_free_reserved_data_space_noquota()
This function will decrease bytes_may_use accordingly.

So this patch chooses to call btrfs_free_reserved_data_space() in
__btrfs_prealloc_file_range() for both successful and failed path.

Also this patch removes some old and useless comments.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/extent-tree.c |  1 -
 fs/btrfs/file.c| 23 ---
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c   | 12 
 fs/btrfs/relocation.c  | 10 +-
 5 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..b0c86d2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3490,7 +3490,6 @@ again:
dcs = BTRFS_DC_SETUP;
else if (ret == -ENOSPC)
set_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags);
-   btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
iput(inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2234e88..f872113 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2669,6 +2669,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 
alloc_start = round_down(offset, blocksize);
alloc_end = round_up(offset + len, blocksize);
+   cur_offset = alloc_start;
 
/* Make sure we aren't being give some crap mode */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2761,7 +2762,6 @@ static long btrfs_fallocate(struct file *file, int mode,
 
/* First, check if we exceed the qgroup limit */
INIT_LIST_HEAD(&reserve_list);
-   cur_offset = alloc_start;
while (1) {
em = btrfs_get_extent(inode, NULL, 0, cur_offset,
  alloc_end - cur_offset, 0);
@@ -2788,6 +2788,14 @@ static long btrfs_fallocate(struct file *file, int mode,
last_byte - cur_offset);
if (ret < 0)
break;
+   } else {
+   /*
+* Do not need to reserve unwritten extent for this
+* range, free reserved data space first, otherwise
+* it'll result false ENOSPC error.
+*/
+   btrfs_free_reserved_data_space(inode, cur_offset,
+   last_byte - cur_offset);
}
free_extent_map(em);
cur_offset = last_byte;
@@ -2839,18 +2847,11 @@ out_unlock:
unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 &cached_state, GFP_KERNEL);
 out:
-   /*
-* As we waited the extent range, the data_rsv_map must be empty
-* in the range, as written data range will be released from it.
-* And for prealloacted extent, it will also be released when
-* its metadata is written.
-* So this is completely used as cleanup.
-*/
-   btrfs_qgroup_free_data(inode, alloc_start, alloc_end - alloc_start);
inode_unlock(inode);
/* 

[PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()

2016-07-06 Thread Wang Xiaoguang
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the
wrong file offset for reloc_inode: it uses cluster->start and cluster->end,
which are in fact extent bytenrs. The correct value should be
cluster->[start|end] minus the block group's start bytenr.

start bytenr   cluster->start
|  | extent  |   extent   | ...| extent |
||
|block group reloc_inode |
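
With illustrative numbers (not taken from a real trace), the off-by-offset
looks like this:

block group start bytenr (offset)  = 100GiB
cluster->start                     = 100GiB + 8MiB
file offset in reloc_inode         = cluster->start - offset = 8MiB
(the old code passed the raw bytenr, 100GiB + 8MiB, to btrfs_check_data_free_space)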

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/relocation.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..abc2f69 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
u64 num_bytes;
int nr = 0;
int ret = 0;
+   u64 prealloc_start, prealloc_end;
 
BUG_ON(cluster->start != cluster->boundary[0]);
inode_lock(inode);
 
-   ret = btrfs_check_data_free_space(inode, cluster->start,
- cluster->end + 1 - cluster->start);
+   start = cluster->start - offset;
+   end = cluster->end - offset;
+   ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
if (ret)
goto out;
 
while (nr < cluster->nr) {
-   start = cluster->boundary[nr] - offset;
+   prealloc_start = cluster->boundary[nr] - offset;
if (nr + 1 < cluster->nr)
-   end = cluster->boundary[nr + 1] - 1 - offset;
+   prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
else
-   end = cluster->end - offset;
+   prealloc_end = cluster->end - offset;
 
-   lock_extent(&BTRFS_I(inode)->io_tree, start, end);
-   num_bytes = end + 1 - start;
-   ret = btrfs_prealloc_file_range(inode, 0, start,
+   lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+   prealloc_end);
+   num_bytes = prealloc_end + 1 - prealloc_start;
+   ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
num_bytes, num_bytes,
-   end + 1, &alloc_hint);
-   unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
+   prealloc_end + 1, &alloc_hint);
+   unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+ prealloc_end);
if (ret)
break;
nr++;
}
-   btrfs_free_reserved_data_space(inode, cluster->start,
-  cluster->end + 1 - cluster->start);
+   btrfs_free_reserved_data_space(inode, start, end + 1 - start);
 out:
inode_unlock(inode);
return ret;
-- 
2.9.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Out of space error even though there's 100 GB unused?

2016-07-06 Thread Hugo Mills
On Wed, Jul 06, 2016 at 11:55:42AM +0200, Stanislaw Kaminski wrote:
> Hi,
> I am fighting with this since at least Monday - see
> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
> 
> Here's the data:
> #   uname -a
> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016
> armv5tel GNU/Linux
> 
> #   btrfs --version
> btrfs-progs v4.6
> 
> #   btrfs fi show
> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
> Total devices 1 FS bytes used 1.71TiB
> devid1 size 1.81TiB used 1.71TiB path /dev/sda4

   In this state, you should definitely not be seeing out of space
errors. This is, therefore, a bug you're seeing.

   I've not been following things as closely as I'd like of late, but
I think there was a bug recently involving the free space cache. It
might be worth unmounting the FS and mounting again with the
nospace_cache option, just to see if that helps.

   Hugo.

> #   btrfs fi df /home
> Data, single: total=1.71TiB, used=1.71TiB
> System, DUP: total=32.00MiB, used=224.00KiB
> Metadata, DUP: total=4.00GiB, used=2.07GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> # btrfs f usage -T  /home
> Overall:
> Device size:   1.81TiB
> Device allocated:  1.71TiB
> Device unallocated:   97.89GiB
> Device missing:  0.00B
> Used:  1.71TiB
> Free (estimated): 98.22GiB  (min: 49.27GiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
>  DataMetadata System
> Id Path  single  DUP  DUP   Unallocated
> -- - ---  - ---
>  1 /dev/sda4 1.71TiB  8.00GiB  64.00MiB97.89GiB
> -- - ---  - ---
>Total 1.71TiB  4.00GiB  32.00MiB97.89GiB
>Used  1.71TiB  2.07GiB 224.00KiB
> 
> # btrfs fi du -s /home
> Total Exclusive Set shared Filename
> 1.60TiB 1.60TiB 0.00B /home
> 
> # btrfs f resize 1:+1G /home/
> Resize '/home/' of '1:+1G'
> ERROR: unable to resize '/home/': no enough free space
> 
> This all is after closely following:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> 
> So, already did full volume rebalance, defrag, rebooted multiple times
> - still, "Error: out of disk space".
> 
> To sum up:
> - my files sum to 1.6 TiB
> - disk usage is shown to be 1.71 TiB
> - volume size is 1.81 TiB
> - btrfs util shows I have ~98 GiB free space on the volume
> - I am getting "out of space" message
> 
> Bonus:
> - I removed 50 GB of data from the drive and I still get "out of
> space" message after writing ~1 GB.
> 
> Help would be very appreciated.
> 
> Cheers,
> Stan

-- 
Hugo Mills | You can play with your friends' privates, but you
hugo@... carfax.org.uk | can't play with your friends' childrens' privates.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   C++ coding rule




Re: Out of space error even though there's 100 GB unused?

2016-07-06 Thread Stanislaw Kaminski
Hi Alex,
Thanks for having a look.

"You're trying to resize a fs that is probably already fully using the
block device it's on. I don't see anything incorrect happening here,
but I might be missing something."
This was just to show that I can't do this, I know that it is already
utilizing the entire block device.

"The unallocated space will be allocated if you start writing files to it."
That's what I would expect, unfortunately it's kind of hard to write
files to it, as I get "Out of space" error. Tongue-in-cheek, if you
know how to ignore the issue and start writing files, it would solve
my issue.

Bottom line: if the disk is really full, then none of the tools shows
that. If it is not (and I suspect it's not - as I mentioned, I just
removed 50 GB of data from it), then why am I getting "out of space"?

As for block device size:
# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6070B645-D738-4730-BEF7-989210EF1DD7

DeviceStartEndSectors  Size Type
/dev/sda1  2048 133119 131072   64M Linux filesystem
/dev/sda2133120223027120971521G Linux swap
/dev/sda3   2230272   19007487   167772168G Linux filesystem
/dev/sda4  19007488 3907029134 3888021647  1.8T Linux home


Cheers,
Stan

2016-07-06 12:10 GMT+02:00 Alexander Fougner :
>
> Den 6 juli 2016 12:03 em skrev "Stanislaw Kaminski"
> :
>>
>
>> Hi,
>> I am fighting with this since at least Monday - see
>>
>> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
>>
>> Here's the data:
>> #   uname -a
>> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016
>> armv5tel GNU/Linux
>>
>> #   btrfs --version
>> btrfs-progs v4.6
>>
>> #   btrfs fi show
>> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
>> Total devices 1 FS bytes used 1.71TiB
>> devid1 size 1.81TiB used 1.71TiB path /dev/sda4
>>
>> #   btrfs fi df /home
>> Data, single: total=1.71TiB, used=1.71TiB
>> System, DUP: total=32.00MiB, used=224.00KiB
>> Metadata, DUP: total=4.00GiB, used=2.07GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs f usage -T  /home
>> Overall:
>> Device size:   1.81TiB
>> Device allocated:  1.71TiB
>> Device unallocated:   97.89GiB
>> Device missing:  0.00B
>> Used:  1.71TiB
>> Free (estimated): 98.22GiB  (min: 49.27GiB)
>> Data ratio:   1.00
>> Metadata ratio:   2.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>>  DataMetadata System
>> Id Path  single  DUP  DUP   Unallocated
>> -- - ---  - ---
>>  1 /dev/sda4 1.71TiB  8.00GiB  64.00MiB97.89GiB
>> -- - ---  - ---
>>Total 1.71TiB  4.00GiB  32.00MiB97.89GiB
>>Used  1.71TiB  2.07GiB 224.00KiB
>>
>> # btrfs fi du -s /home
>> Total Exclusive Set shared Filename
>> 1.60TiB 1.60TiB 0.00B /home
>>
>> # btrfs f resize 1:+1G /home/
>> Resize '/home/' of '1:+1G'
>> ERROR: unable to resize '/home/': no enough free space
>>
>
> You're trying to resize a fs that is probably already fully using the block
> device it's on. I don't see anything incorrect happening here, but I might
> be missing something.
>
> The used space amounting to 1.6TiB is not as reliable as the btrfs fi df
> tool.
> The unallocated space will be allocated if you start writing files to it.
> What size is the parent block device?
>
>> This all is after closely following:
>>
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>>
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>
>> So, already did full volume rebalance, defrag, rebooted multiple times
>> - still, "Error: out of disk space".
>>
>> To sum up:
>> - my files sum to 1.6 TiB
>> - disk usage is shown to be 1.71 TiB
>> - volume size is 1.81 TiB
>> - btrfs util shows I have ~98 GiB free space on the volume
>> - I am getting "out of space" message
>>
>> Bonus:
>> - I removed 50 GB of data from the drive and I still get "out of
>> space" message after writing ~1 GB.
>>
>> Help would be very appreciated.
>>
>> Cheers,
>> Stan
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: fix fsfreeze hang caused by delayed iputs deal

2016-07-06 Thread Wang Xiaoguang

hello,

On 07/05/2016 01:35 AM, David Sterba wrote:

On Wed, Jun 29, 2016 at 01:15:10PM +0800, Wang Xiaoguang wrote:

When running fstests generic/068, sometimes we got below WARNING:
   xfs_io  D 8800331dbb20 0  6697   6693 0x0080
   8800331dbb20 88007acfc140 880034d895c0 8800331dc000
   880032d243e8 fffe 880032d24400 0001
   8800331dbb38 816a9045 880034d895c0 8800331dbba8
   Call Trace:
   [] schedule+0x35/0x80
   [] rwsem_down_read_failed+0xf2/0x140
   [] ? __filemap_fdatawrite_range+0xd1/0x100
   [] call_rwsem_down_read_failed+0x18/0x30
   [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
   [] percpu_down_read+0x35/0x50
   [] __sb_start_write+0x2c/0x40
   [] start_transaction+0x2a5/0x4d0 [btrfs]
   [] btrfs_join_transaction+0x17/0x20 [btrfs]
   [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
   [] evict+0xba/0x1a0
   [] iput+0x196/0x200
   [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
   [] btrfs_commit_transaction+0x928/0xa80 [btrfs]
   [] btrfs_freeze+0x30/0x40 [btrfs]
   [] freeze_super+0xf0/0x190
   [] do_vfs_ioctl+0x4a5/0x5c0
   [] ? do_audit_syscall_entry+0x66/0x70
   [] ? syscall_trace_enter_phase1+0x11f/0x140
   [] SyS_ioctl+0x79/0x90
   [] do_syscall_64+0x62/0x110
   [] entry_SYSCALL64_slow_path+0x25/0x25

From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again; if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll call start_transaction(), which will try to take the SB_FREEZE_FS
lock again, and a deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. See below race window in freeze_super():
sync_filesystem(sb);
|
| race window
| In this period, cleaner_kthread() may be scheduled to
| run, and it call btrfs_delete_unused_bgs() which will
| add some delayed iputs.
|
sb->s_writers.frozen = SB_FREEZE_FS;
sb_wait_write(sb, SB_FREEZE_FS);
if (sb->s_op->freeze_fs) {
/* freeze_fs will call btrfs_commit_transaction() */
ret = sb->s_op->freeze_fs(sb);

So if btrfs is doing a freeze job, we should block
btrfs_delete_unused_bgs() to avoid adding delayed iputs.

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/disk-io.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 863bf7a..fdbe0df 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1846,8 +1846,11 @@ static int cleaner_kthread(void *arg)
 * after acquiring fs_info->delete_unused_bgs_mutex. So we
 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
 * unused block groups.
+*

Extra line, but I think you intended to write a comment that explains
why the freeze protection is required here :)

Yes, but forgot to... :)



 */
+   __sb_start_write(root->fs_info->sb, SB_FREEZE_WRITE, true);

There's opencoding an existing wrapper sb_start_write, please use it
instead.

OK, I can submit a new version using this wrapper.
Also could you please have a look at my reply to Filipe Manana in
last mail? I suggest another solution, thanks.
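
For reference, the hunk with the wrapper David points at would look roughly
like this (a sketch only, not the resubmitted patch):

	/* Block SB_FREEZE_WRITE around the cleanup so freeze_super() cannot
	 * race in and then find newly added delayed iputs. */
	sb_start_write(root->fs_info->sb);
	btrfs_delete_unused_bgs(root->fs_info);
	sb_end_write(root->fs_info->sb);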

Regards,
Xiaoguang Wang



btrfs_delete_unused_bgs(root->fs_info);
+   __sb_end_write(root->fs_info->sb, SB_FREEZE_WRITE);
  sleep:
if (!again) {
set_current_state(TASK_INTERRUPTIBLE);
--
2.9.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: fix free space calculation in dump_space_info()

2016-07-06 Thread Wang Xiaoguang

hello,

On 07/05/2016 01:10 AM, David Sterba wrote:

On Wed, Jun 29, 2016 at 01:12:16PM +0800, Wang Xiaoguang wrote:

Can you please describe in more detail what this patch is fixing?

In the original dump_space_info(), the free space value is calculated as
info->total_bytes - info->bytes_used - info->bytes_pinned - 
info->bytes_reserved - info->bytes_readonly,

but I think the free space calculation should also subtract info->bytes_may_use :)

Regards,
Xiaoguang Wang




Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/extent-tree.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8550a0e..520ba8f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7747,8 +7747,8 @@ static void dump_space_info(struct btrfs_space_info 
*info, u64 bytes,
printk(KERN_INFO "BTRFS: space_info %llu has %llu free, is %sfull\n",
   info->flags,
   info->total_bytes - info->bytes_used - info->bytes_pinned -
-  info->bytes_reserved - info->bytes_readonly,
-  (info->full) ? "" : "not ");
+  info->bytes_reserved - info->bytes_readonly -
+  info->bytes_may_use, (info->full) ? "" : "not ");
printk(KERN_INFO "BTRFS: space_info total=%llu, used=%llu, pinned=%llu, 
"
   "reserved=%llu, may_use=%llu, readonly=%llu\n",
   info->total_bytes, info->bytes_used, info->bytes_pinned,
--
2.9.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Out of space error even though there's 100 GB unused?

2016-07-06 Thread Stanislaw Kaminski
Hi,
I am fighting with this since at least Monday - see
https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left

Here's the data:
#   uname -a
Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016
armv5tel GNU/Linux

#   btrfs --version
btrfs-progs v4.6

#   btrfs fi show
Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
Total devices 1 FS bytes used 1.71TiB
devid1 size 1.81TiB used 1.71TiB path /dev/sda4

#   btrfs fi df /home
Data, single: total=1.71TiB, used=1.71TiB
System, DUP: total=32.00MiB, used=224.00KiB
Metadata, DUP: total=4.00GiB, used=2.07GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs f usage -T  /home
Overall:
Device size:   1.81TiB
Device allocated:  1.71TiB
Device unallocated:   97.89GiB
Device missing:  0.00B
Used:  1.71TiB
Free (estimated): 98.22GiB  (min: 49.27GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

 DataMetadata System
Id Path  single  DUP  DUP   Unallocated
-- - ---  - ---
 1 /dev/sda4 1.71TiB  8.00GiB  64.00MiB97.89GiB
-- - ---  - ---
   Total 1.71TiB  4.00GiB  32.00MiB97.89GiB
   Used  1.71TiB  2.07GiB 224.00KiB

# btrfs fi du -s /home
Total Exclusive Set shared Filename
1.60TiB 1.60TiB 0.00B /home

# btrfs f resize 1:+1G /home/
Resize '/home/' of '1:+1G'
ERROR: unable to resize '/home/': no enough free space

This all is after closely following:
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

So, already did full volume rebalance, defrag, rebooted multiple times
- still, "Error: out of disk space".

To sum up:
- my files sum to 1.6 TiB
- disk usage is shown to be 1.71 TiB
- volume size is 1.81 TiB
- btrfs util shows I have ~98 GiB free space on the volume
- I am getting "out of space" message

Bonus:
- I removed 50 GB of data from the drive and I still get "out of
space" message after writing ~1 GB.

Help would be very appreciated.

Cheers,
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 64-btrfs.rules and degraded boot

2016-07-06 Thread Andrei Borzenkov
On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy  wrote:
> I started a systemd-devel@ thread since that's where most udev stuff
> gets talked about.
>
> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>

Before discussing how to implement it in systemd, we need to decide
what to implement. I.e.

1) Do you always want to mount the filesystem in degraded mode when not
enough devices are present, or only when an explicit hint is given?
2) Do you want to restrict degraded handling to the root filesystem only,
or apply it to other filesystems as well? Note that there can be more
early-boot filesystems that absolutely need the same treatment (enter a
separate /usr), and there are also normal filesystems that may need to be
mounted even when degraded.
3) Can we query btrfs as to whether it is mountable in degraded mode?
According to the documentation, "btrfs device ready" (which the udev
builtin follows) checks "if it has ALL of it’s devices in cache for
mounting". This is required for proper systemd ordering of services. (The
rule file itself is summarized below.)
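
For context, the rule file named in the subject is tiny. Paraphrasing from
memory of systemd's rules/64-btrfs.rules (the exact content differs between
versions), it skips non-block events, remove events and anything whose
ID_FS_TYPE is not btrfs, runs the builtin via
IMPORT{builtin}="btrfs ready $devnode", and then sets ENV{SYSTEMD_READY}="0"
whenever ID_BTRFS_READY comes back as 0. So the only policy it currently
encodes is "hold the device back until the kernel reports the filesystem
complete"; any degraded-mount behaviour would have to be decided and
expressed elsewhere.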
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to mount degraded RAID5

2016-07-06 Thread Tomáš Hrdina
Now with 3 disks:

sudo btrfs check /dev/sda
parent transid verify failed on 7008807157760 wanted 70175 found 70133
parent transid verify failed on 7008807157760 wanted 70175 found 70133
checksum verify failed on 7008807157760 found F192848C wanted 1571393A
checksum verify failed on 7008807157760 found F192848C wanted 1571393A
bytenr mismatch, want=7008807157760, have=65536
Checking filesystem on /dev/sda
UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e
checking extents
parent transid verify failed on 7009468874752 wanted 70180 found 70133
parent transid verify failed on 7009468874752 wanted 70180 found 70133
checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
bytenr mismatch, want=7009468874752, have=65536
parent transid verify failed on 7008859045888 wanted 70175 found 70133
parent transid verify failed on 7008859045888 wanted 70175 found 70133
checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
bytenr mismatch, want=7008859045888, have=65536
parent transid verify failed on 7008899547136 wanted 70175 found 70133
parent transid verify failed on 7008899547136 wanted 70175 found 70133
checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3
parent transid verify failed on 7008899547136 wanted 70175 found 70133
Ignoring transid failure
leaf parent key incorrect 7008899547136
bad block 7008899547136
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 7009074167808 wanted 70175 found 70133
parent transid verify failed on 7009074167808 wanted 70175 found 70133
checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
bytenr mismatch, want=7009074167808, have=65536


sudo btrfs-debug-tree -d /dev/sdc
http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA=
 

sudo btrfs-find-root /dev/sdc
parent transid verify failed on 7008807157760 wanted 70175 found 70133
parent transid verify failed on 7008807157760 wanted 70175 found 70133
Superblock thinks the generation is 70182
Superblock thinks the level is 1
Found tree root at 6062830010368 gen 70182 level 1
Well block 6062434418688(gen: 70181 level: 1) seems good, but
generation/level doesn't match, want gen: 70182 level: 1
Well block 6062497202176(gen: 69186 level: 0) seems good, but
generation/level doesn't match, want gen: 70182 level: 1
Well block 6062470332416(gen: 69186 level: 0) seems good, but
generation/level doesn't match, want gen: 70182 level: 1


sudo smartctl -l scterc /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
   Read: Disabled
  Write: Disabled


sudo smartctl -l scterc /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
   Read: 70 (7.0 seconds)
  Write: 70 (7.0 seconds)


sudo smartctl -l scterc /dev/sdc
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
   Read: Disabled
  Write: Disabled


sudo smartctl -a /dev/sdx
http://sebsauvage.net/paste/?aab1d282ceb1e1cf#auxFRkK5GCW8j1gR7mwgzR1z92Qn9oqtc6EEC2C6sEE=


cat /sys/block/sda/device/timeout
30


cat /sys/block/sdb/device/timeout
30


cat /sys/block/sdc/device/timeout
30
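
(General background rather than a diagnosis: the reason these particular
values are usually asked for is the ERC-versus-timeout interaction -- with
SCT ERC disabled, as on sda and sdc above, a drive can spend much longer
than the 30-second kernel command timeout retrying a bad sector, and the
kernel may then reset the link before the drive ever returns a read error.)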

Thank you
Tomas



From: Chris Murphy
Sent: Wednesday, July 06, 2016 1:19 AM
To: Tomáš Hrdina
Cc: Chris Murphy, Btrfs Btrfs
Subject: Re: Unable to mount degraded RAID5

btrfs check


---
This message has been checked for viruses by Avast Antivirus.
https://www.avast.com/antivirus

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html