Re: Disk "failed" while doing scrub
Dāvis Mosāns posted on Tue, 14 Jul 2015 04:54:27 +0300 as excerpted:

> 2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
>> You say five disks, but nowhere in your post do you mention what raid
>> mode you were using, neither do you post btrfs filesystem show and
>> btrfs filesystem df, as suggested on the wiki and which list that
>> information.
>
> Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1.
> Using RAID1 for metadata and single for data, with features
> big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata,
> and mounted with noatime,compress=zlib,space_cache,autodefrag

Thanks.  FWIW, pretty similar here, but running gentoo, now with
btrfs-progs v4.1.1 and the mainline 4.2-rc1+ kernel.

BTW, note that space_cache has been the default for quite some time now.
I've never actually manually mounted with space_cache on any of my
filesystems over several years, yet they all report it when I check
/proc/mounts, etc.  So if you're adding that manually, you can kill that
option and save the commandline/fstab space. =:^)

> Label: 'Data'  uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
>         Total devices 5 FS bytes used 7.16TiB
>         devid    1 size 2.73TiB used 2.35TiB path /dev/sdc
>         devid    2 size 1.82TiB used 1.44TiB path /dev/sdd
>         devid    3 size 1.82TiB used 1.44TiB path /dev/sde
>         devid    4 size 1.82TiB used 1.44TiB path /dev/sdg
>         devid    5 size 931.51GiB used 539.01GiB path /dev/sdh
>
> Data, single: total=7.15TiB, used=7.15TiB
> System, RAID1: total=8.00MiB, used=784.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=16.00GiB, used=14.37GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B

And note that you can easily and quickly remove those empty single-mode
system and metadata chunks, which are an artifact of the way mkfs.btrfs
works, using balance filters.  btrfs balance start -mprofiles=single ...
should do it.
They're actually working on mkfs.btrfs patches right now to fix it not
to do that.  There are active patch and testing threads discussing it.
Hopefully for btrfs-progs v4.2.  (4.1.1 has the patches for
single-device and prep work for multi-device, according to the
changelog.)

>>> Because filesystem still mounts, I assume I should do "btrfs device
>>> delete /dev/sdd /mntpoint" and then restore damaged files from
>>> backup.
>>
>> You can try a replace, but with a failing drive still connected,
>> people report mixed results.  It's likely to fail as it can't read
>> certain blocks to transfer them to the new device.
>
> As I understand, device delete will copy data from that disk and
> distribute it across the rest of the disks, while btrfs replace will
> copy to a new disk which must be at least the size of the disk I'm
> replacing.

Sorry.  You wrote delete, I read replace.  How'd I do that? =:^(

You are absolutely correct.  Delete would be better here.  I guess I had
just been reading a thread discussing the problems I mentioned with
replace, and saw what I expected to see, not what you actually wrote.

>> There's no such partial-file-with-null-fill tool shipped just yet.

> From journal I have only 14 files mentioned where errors occurred.
> Now 13 files of them don't throw any errors and their SHAs match my
> backups, so they're fine.

Good.  I was going on the assumption that the questionable device was in
much worse shape than that.

> And actually btrfs does allow to copy/read that one damaged file, only
> I get an I/O error when trying to read data from those broken sectors

Good, and good to know.  Thanks. =:^)

> best and correct way to recover a file is using ddrescue

I was just going to mention ddrescue. =:^)

> $ du -m /tmp/damaged_file
> 6251    /tmp/damaged_file
>
> so basically only like 8K bytes are unrecoverable from this file.
> Probably a tool could be created that could get even more data back,
> knowing about btrfs.
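The null-fill tool wished for above is simple in outline. Here is a minimal sketch of the core loop in C, with a hypothetical read_block() callback standing in for the actual device reads (no btrfs awareness here, just the zero-fill-on-error behavior that ddrescue provides); fake_read is likewise a stand-in "disk" so the logic can be exercised without failing hardware:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-block reader: returns 0 on success, -1 on a media
 * error.  In a real tool this would be a pread() on the device. */
typedef int (*read_block_fn)(size_t idx, unsigned char *buf,
                             size_t block_size, void *ctx);

/* Copy nblocks into out, zero-filling any block that cannot be read so
 * that file offsets stay intact.  Returns how many blocks were bad. */
static size_t copy_with_null_fill(read_block_fn read_block, void *ctx,
                                  unsigned char *out, size_t nblocks,
                                  size_t block_size)
{
    size_t bad = 0;
    for (size_t i = 0; i < nblocks; i++) {
        unsigned char *dst = out + i * block_size;
        if (read_block(i, dst, block_size, ctx) != 0) {
            memset(dst, 0, block_size); /* null-fill, don't skip */
            bad++;
        }
    }
    return bad;
}

/* Stand-in "disk" with one unreadable block (index 2), for testing. */
static int fake_read(size_t idx, unsigned char *buf, size_t block_size,
                     void *ctx)
{
    (void)ctx;
    if (idx == 2)
        return -1;                     /* simulated media error */
    memset(buf, 0xAB, block_size);     /* readable data */
    return 0;
}
```

The important property is that the bad region is replaced rather than dropped, so everything after it stays at the right offset.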
>
>> There /is/, however, a command that can be used to either regenerate
>> or zero-out the checksum tree.  See btrfs check --init-csum-tree.
>
> Seems you can't specify a path/file for it, and it's quite a
> destructive action if you only want data about one specific file.

Yes.  It's whole-filesystem-all-or-nothing, unfortunately. =:^(

> I did scrub a second time and this time there aren't that many
> uncorrectable errors and also there are no csum_errors, so
> --init-csum-tree is useless here I think.

Agreed.

> Most likely previously scrub got that many errors because it still
> continued for a bit even if disk didn't respond.

Yes.

> scrub status [...]
>         read_errors: 2
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 89600
>         csum_discards: 656214
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 2
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 2590041112576

OK, that matches up with
[PATCH] Revert "btrfs-progs: mkfs: create only desired block groups for single device"
This reverts commit 5f8232e5c8f0b0de0ef426274911385b0e877392.

This commit causes a regression:

---
$ mkfs.btrfs -f /dev/sda6
$ btrfsck /dev/sda6
Checking filesystem on /dev/sda6
UUID: 2ebb483c-1986-4610-802a-c6f3e6ab4b76
checking extents
Chunk[256, 228, 0]: length(4194304), offset(0), type(2) mismatch with
block group[0, 192, 4194304]: offset(4194304), objectid(0), flags(34)
Chunk[256, 228, 4194304]: length(8388608), offset(4194304), type(4)
mismatch with block group[4194304, 192, 8388608]: offset(8388608),
objectid(4194304), flags(36)
Block group[0, 4194304] (flags = 34) didn't find the relative chunk.
Block group[4194304, 8388608] (flags = 36) didn't find the relative
chunk.
...
---

The commit has the following bugs causing the problem:

1) A typo forgets to add meta/data_profile for alloc_chunk.
   meta/data_profile is only added when allocating a block group, but
   not the chunk.

2) The type of the first system chunk is impossible to modify yet.
   The type of the first chunk and its stripe is hard coded into the
   make_btrfs() function.  So even if we try to modify the type of the
   block group, we are unable to change the type of the first chunk,
   causing the chunk type mismatch problem.

The 1st bug can be fixed quite easily, but the second cannot.
The good news is, the last patch, "btrfs-progs: mkfs: Cleanup temporary
chunk to avoid strange balance behavior.", from my patchset can handle
it quite well alone, so just revert the patch.

A new bug fix for btrfsck (err is 0 even when the chunk/extent tree is
corrupted) and new test cases for mkfs will follow soon.
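The mismatch btrfsck reports above is, at bottom, an equality check: a chunk's type bits must match the flags of its paired block group, and here the DUP profile bit got into one but not the other. A small model of that check follows (the flag values are the usual on-disk BTRFS_BLOCK_GROUP_* bits; this is an illustration of the rule, not btrfsck's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* On-disk block group / chunk type bits (subset). */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_DUP      (1ULL << 5)

/* A chunk and the block group that covers it must agree exactly; any
 * difference (here: missing profile bits) is the fsck error above. */
static int chunk_bg_mismatch(uint64_t chunk_type, uint64_t bg_flags)
{
    return chunk_type != bg_flags;
}
```

In the transcript, flags(34) is SYSTEM|DUP and flags(36) is METADATA|DUP, while the chunks carry the bare type(2)/type(4) — exactly the "typo forgets to add meta/data_profile" bug in point 1.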
Signed-off-by: Qu Wenruo
---
 mkfs.c | 34 +++----
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index ee8a3cb..afecf00 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -59,9 +59,8 @@ struct mkfs_allocation {
 	u64 system;
 };
 
-static int create_metadata_block_groups(struct btrfs_root *root,
-		u64 metadata_profile, int mixed,
-		struct mkfs_allocation *allocation)
+static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
+		struct mkfs_allocation *allocation)
 {
 	struct btrfs_trans_handle *trans;
 	u64 bytes_used;
@@ -74,7 +73,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	root->fs_info->system_allocs = 1;
 	ret = btrfs_make_block_group(trans, root, bytes_used,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_SYSTEM,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     0, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
@@ -93,7 +91,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
@@ -110,7 +107,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -126,7 +122,7 @@ err:
 }
 
 static int create_data_block_groups(struct btrfs_trans_handle *trans,
-		struct btrfs_root *root, u64 data_profile, int mixed,
+		struct btrfs_root *root, int mixed,
 		struct mkfs_allocation *allocation)
 {
 	u64 chunk_start = 0;
@@ -143,7 +139,6 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     data_profile |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -1337,8 +1332,6 @@ int main(int ac, char **av)
 	u64 alloc_start = 0;
 	u64 metadata_profile = 0;
 	u64 data_profile = 0;
-	u64 default_metadata_profile = 0;
-	u64 default_data_profile = 0;
 	u32 nodesize = max_t(u32, sysconf(_SC_PAGESIZE),
 			BTRFS_MKFS_DEFAULT_NODE_SIZE);
 	u32 sectorsize = 4096;
@@ -1697,19 +1690,7 @@ int main(int ac, char **av)
 	}
 	root->fs_info->alloc_start = alloc_start;
-	if (dev_cnt == 0) {
-		default_metadata_profile = metadata_profile;
-		default_
Re: Disk "failed" while doing scrub
2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.dun...@cox.net>:
> You say five disks, but nowhere in your post do you mention what raid
> mode you were using, neither do you post btrfs filesystem show and
> btrfs filesystem df, as suggested on the wiki and which list that
> information.

Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1.
Using RAID1 for metadata and single for data, with features
big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata,
and mounted with noatime,compress=zlib,space_cache,autodefrag

Label: 'Data'  uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
        Total devices 5 FS bytes used 7.16TiB
        devid    1 size 2.73TiB used 2.35TiB path /dev/sdc
        devid    2 size 1.82TiB used 1.44TiB path /dev/sdd
        devid    3 size 1.82TiB used 1.44TiB path /dev/sde
        devid    4 size 1.82TiB used 1.44TiB path /dev/sdg
        devid    5 size 931.51GiB used 539.01GiB path /dev/sdh

Data, single: total=7.15TiB, used=7.15TiB
System, RAID1: total=8.00MiB, used=784.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=16.00GiB, used=14.37GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

>> Because filesystem still mounts, I assume I should do "btrfs device
>> delete /dev/sdd /mntpoint" and then restore damaged files from
>> backup.
>
> You can try a replace, but with a failing drive still connected,
> people report mixed results.  It's likely to fail as it can't read
> certain blocks to transfer them to the new device.

As I understand, device delete will copy data from that disk and
distribute it across the rest of the disks, while btrfs replace will
copy to a new disk which must be at least the size of the disk I'm
replacing.  Assuming the other existing disks are good, why would
replace be preferable over delete?  Because delete could fail, but
replace not?

> There's no such partial-file-with-null-fill tool shipped just yet.
> Those files normally simply trigger errors trying to read them,
> because btrfs won't let you at them if the checksum doesn't verify.

From journal I have only 14 files mentioned where errors occurred. Now
13 files of them don't throw any errors and their SHAs match my
backups, so they're fine.

And actually btrfs does allow to copy/read that one damaged file, only
I get an I/O error when trying to read data from those broken sectors:

kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [0] tag[0], task [88011c8c9900]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0001, slot [0].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:33:a1/0f:00:ab:00:00/40 tag 14 ncq 1966080 in
        res 41/40:00:48:40:a1/00:0f:ab:00:00/00 Emask 0x409 (media error)
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 33 00 00 0f 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

but all other sectors can be copied fine

$ du -m ./damaged_file
6250    ./damaged_file
$ cp ./damaged_file /tmp/
cp: error reading ‘damaged_file’: Input/output error
$ du -m /tmp/damaged_file
4335    /tmp/damaged_file

cp copies the first part of the file correctly, and I verified that
both the start of the file (first 4336M) and the end
of the file (last 1890M) SHAs match the backup:

$ head -c 4336M ./damaged_file | sha256sum
e81b20bfa7358c9f5a0ed165bffe43185abc59e35246e52a7be1d43e6b7e040d  -
$ head -c 4337M ./damaged_file | sha256sum
head: error reading ‘./damaged_file’: Input/output error
$ tail -c 1890M ./damaged_file | sha256sum
941568f4b614077858cb8c8dd262bb431bf4c45eca936af728ecffc95619cb60  -
$ tail -c 1891M ./damaged_file | sha256sum
tail: error reading ‘./damaged_file’: Input/output error

With dd I can also copy almost all of the file, only with the noerror
option it excludes those regions from the target file rather than
filling them with nulls, so this isn't good for recovery:

$ dd conv=noerror if=damaged_file of=/tmp/damaged_file
dd: error reading ‘damaged_file’: Input/output error
8880328+0 records in
8880328+0 records out
4546727936 bytes (4,5 GB) copied, 69,7282 s, 65,2 MB/s
dd: error reading ‘damaged_file’: Input/output error
8930824+0 records in
8930824+0 records out
4572581888 bytes (4,6 GB) copied, 113,648 s, 40,2 MB/s
12801720+0
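The head -c 4336M/4337M probing above is a manual binary search for the end of the readable prefix. A sketch of the same search in C, with a hypothetical readable() probe standing in for "head -c off file succeeds"; the constant in the stand-in probe is borrowed from the first dd error point above, purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical probe: returns 1 if bytes [0, off) are all readable. */
typedef int (*probe_fn)(uint64_t off, void *ctx);

/* Binary search on [lo, hi], with lo known readable and hi known not:
 * returns the largest fully readable prefix length. */
static uint64_t largest_readable_prefix(probe_fn readable, void *ctx,
                                        uint64_t lo, uint64_t hi)
{
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (readable(mid, ctx))
            lo = mid;   /* prefix up to mid still reads fine */
        else
            hi = mid;   /* error lies at or before mid */
    }
    return lo;
}

/* Stand-in probe: pretend the first unreadable byte sits right after
 * offset 4546727936 (where the first dd error was reported above). */
static int fake_probe(uint64_t off, void *ctx)
{
    (void)ctx;
    return off <= 4546727936ULL;
}
```

About 33 probes pin down the exact byte where a multi-gigabyte file stops being readable, instead of guessing offsets by hand.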
Re: Can't mount btrfs volume on rbd
Thanks a lot, Steve!

With this binary dump, we can find out the cause of your problem and
make btrfsck handle and repair it.

Furthermore, this provides a good hint on what's going wrong in the
kernel.

I'll start investigating this right now.

Thanks,
Qu

Steve Dainard wrote on 2015/07/13 13:22 -0700:

Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this. There is no content at all on this
disk; basically it was created with a btrfs filesystem, mounted, and
now, some reboots later (and possibly hard resets), it won't mount,
with a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you in
another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo wrote:

Steve Dainard wrote on 2015/06/15 09:19 -0700:

Hi Qu,

# btrfs --version
btrfs-progs v4.0.1
# btrfs check /dev/rbd30
Checking filesystem on /dev/rbd30
UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
checking extents
cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x41aee6]
btrfs[0x423f5d]
btrfs[0x424c99]
btrfs[0x4258f6]
btrfs(cmd_check+0x14a3)[0x42893d]
btrfs(main+0x15d)[0x409c71]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
btrfs[0x409829]

# btrfs-image /dev/rbd30 rbd30.image -c9
# btrfs-image -r rbd30.image rbd30.image.2
# mount rbd30.image.2 temp
mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle

OK, my assumptions are all wrong.

I'd better check the debug-tree output more carefully.

BTW, is rbd30 the block device which you took the debug-tree output
from?

If so, would you please do a dd dump of it and send it to me?
If it contains important/secret info, just forget this.

Maybe I can improve the btrfsck tool to fix it.
I have a suspicion this was caused by pacemaker starting
ceph/filesystem resources on two nodes at the same time; I haven't been
able to replicate the issue after hard poweroff if ceph/btrfs are not
being controlled by pacemaker.

Did you mean mounting the same device on different systems?

Thanks,
Qu

Thanks for your help.

On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo wrote:

The debug result seems valid.
So I'm afraid the problem is not in btrfs.

Would you please try the following 2 things to eliminate btrfs
problems?

1) btrfsck from 4.0.1 on the rbd

If the assert still happens, please upload the image of the volume (dd
image), to help us improve btrfs-progs.

2) btrfs-image dump and rebuild the fs in another place.

# btrfs-image <dev> <dump_file> -c9
# btrfs-image -r <dump_file> <new_image>
# mount <new_image> <mnt>

This will dump all metadata from <dev> to <dump_file>, and then use
<dump_file> to rebuild an image called <new_image>.

If <new_image> can be mounted, then the metadata in the RBD device is
completely OK, and we can conclude the problem is not caused by btrfs
(maybe ceph?).

BTW, all the commands are recommended to be executed on the device
which you got the debug info from.
As it's a small and almost empty device, command execution should be
quite fast on it.

Thanks,
Qu

On 2015年06月13日 00:09, Steve Dainard wrote:

Hi Qu,

I have another volume with the same error; btrfs-debug-tree output from
btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE

I'm not sure how to interpret the output, but the exit status is 0, so
it looks like btrfs doesn't think there's an issue with the file
system.

I get the same mount error with options ro,recovery.
On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo wrote:

-------- Original Message --------
Subject: Can't mount btrfs volume on rbd
From: Steve Dainard
To:
Date: 2015年06月11日 23:26

Hello,

I'm getting an error when attempting to mount a volume on a host that
was forcibly powered off:

# mount /dev/rbd4 climate-downscale-CMIP5/
mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
file handle

/var/log/messages:
Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table

# parted /dev/rbd4 print
Model: Unknown (unknown)
Disk /dev/rbd4: 36.5TB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  36.5TB  36.5TB  btrfs

# btrfs check --repair /dev/rbd4
enabling repair mode
Checking filesystem on /dev/rbd4
UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
checking extents
cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x4175cc]
btrfs[0x41b873]
btrfs[0x41c3fe]
btrfs[0x41dc1d]
btrfs[0x406922]

OS: CentOS 7.1
btrfs-progs: 3.16.2

The btrfs-progs seems quite old, and the above btrfsck error seems
quite possibly related to the old version.

Would you please upgrade btrfs-progs to 4.0 and see what will happen?
Hope it can give better info.

BTW, it's a good idea to call btrfs-debug-tree /dev/rbd4 to see the
output.

Thanks,
Qu

Ceph: version: 0.94.1/CentOS 7.1

I haven't found any references to 'stale file handle' on btrfs. The
underlying block device is a ceph rbd, so I've posted to both lists for
any feedback.

Also once I ref
Re: [GIT PULL] More btrfs bug fixes
On Sun, Jul 12, 2015 at 02:50:47AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> Please consider the following changes for the kernel 4.2 release. All
> these patches have been available on the mailing list for some time.
>
> One of the patches is a fix for a regression in the delayed references
> code that landed in 4.2-rc1. Two of them are for issues reported by
> users on the list and IRC recently (which I've cc'ed for stable), and
> the final one is just a missing update of an inode's on-disk size
> after truncating a file if the no_holes feature is enabled, which I
> found some time ago.
>
> I have rebased them on top of your current integration-4.2 branch,
> re-tested them, and incorporated any tags people have added through
> the mailing list (Reviewed-by, Acked-by).

Thanks Filipe, I've pulled these in along with a few more.  I'll test
overnight and push out in the morning.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't mount btrfs volume on rbd
Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this. There is no content at all on this
disk; basically it was created with a btrfs filesystem, mounted, and
now, some reboots later (and possibly hard resets), it won't mount,
with a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you in
another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo wrote:
>
> Steve Dainard wrote on 2015/06/15 09:19 -0700:
>>
>> Hi Qu,
>>
>> # btrfs --version
>> btrfs-progs v4.0.1
>> # btrfs check /dev/rbd30
>> Checking filesystem on /dev/rbd30
>> UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>> checking extents
>> cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
>> btrfs[0x41aee6]
>> btrfs[0x423f5d]
>> btrfs[0x424c99]
>> btrfs[0x4258f6]
>> btrfs(cmd_check+0x14a3)[0x42893d]
>> btrfs(main+0x15d)[0x409c71]
>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
>> btrfs[0x409829]
>>
>> # btrfs-image /dev/rbd30 rbd30.image -c9
>> # btrfs-image -r rbd30.image rbd30.image.2
>> # mount rbd30.image.2 temp
>> mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle
>
> OK, my assumptions are all wrong.
>
> I'd better check the debug-tree output more carefully.
>
> BTW, is rbd30 the block device which you took the debug-tree output
> from?
>
> If so, would you please do a dd dump of it and send it to me?
> If it contains important/secret info, just forget this.
>
> Maybe I can improve the btrfsck tool to fix it.
>
>> I have a suspicion this was caused by pacemaker starting
>> ceph/filesystem resources on two nodes at the same time; I haven't
>> been able to replicate the issue after hard poweroff if ceph/btrfs
>> are not being controlled by pacemaker.
>
> Did you mean mounting the same device on different systems?
>
> Thanks,
> Qu
>
>> Thanks for your help.
>>
>> On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo
>> wrote:
>>>
>>> The debug result seems valid.
>>> So I'm afraid the problem is not in btrfs.
>>>
>>> Would you please try the following 2 things to eliminate btrfs
>>> problems?
>>>
>>> 1) btrfsck from 4.0.1 on the rbd
>>>
>>> If the assert still happens, please upload the image of the volume
>>> (dd image), to help us improve btrfs-progs.
>>>
>>> 2) btrfs-image dump and rebuild the fs in another place.
>>>
>>> # btrfs-image <dev> <dump_file> -c9
>>> # btrfs-image -r <dump_file> <new_image>
>>> # mount <new_image> <mnt>
>>>
>>> This will dump all metadata from <dev> to <dump_file>, and then use
>>> <dump_file> to rebuild an image called <new_image>.
>>>
>>> If <new_image> can be mounted, then the metadata in the RBD device
>>> is completely OK, and we can conclude the problem is not caused by
>>> btrfs (maybe ceph?).
>>>
>>> BTW, all the commands are recommended to be executed on the device
>>> which you got the debug info from.
>>> As it's a small and almost empty device, command execution should
>>> be quite fast on it.
>>>
>>> Thanks,
>>> Qu
>>>
>>> On 2015年06月13日 00:09, Steve Dainard wrote:
>>>>
>>>> Hi Qu,
>>>>
>>>> I have another volume with the same error; btrfs-debug-tree output
>>>> from btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE
>>>>
>>>> I'm not sure how to interpret the output, but the exit status is
>>>> 0, so it looks like btrfs doesn't think there's an issue with the
>>>> file system.
>>>>
>>>> I get the same mount error with options ro,recovery.
On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo wrote:
>
> -------- Original Message --------
> Subject: Can't mount btrfs volume on rbd
> From: Steve Dainard
> To:
> Date: 2015年06月11日 23:26
>
>> Hello,
>>
>> I'm getting an error when attempting to mount a volume on a host
>> that was forcibly powered off:
>>
>> # mount /dev/rbd4 climate-downscale-CMIP5/
>> mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
>> file handle
>>
>> /var/log/messages:
>> Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table
>>
>> # parted /dev/rbd4 print
>> Model: Unknown (unknown)
>> Disk /dev/rbd4: 36.5TB
>> Sector size (logical/physical): 512B/512B
>> Partition Table: loop
>> Disk Flags:
>>
>> Number  Start  End     Size    File system  Flags
>>  1      0.00B  36.5TB  36.5TB  btrfs
>>
>> # btrfs check --repair /dev/rbd4
>> enabling repair mode
>> Checking filesystem on /dev/rbd4
>> UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
>> checking extents
>> cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
>> btrfs[0x4175cc]
>> btrfs[0x41b873]
>> btrfs[0x41c3fe]
>> btrfs[0x41dc1d]
>> btrfs[0x406922]
>>
>> OS: CentOS 7.1
>> btrfs-progs: 3.16.2
>
> The btrfs-progs seems quite old, and the above btrfsck error seems
> quite possibly related to the old version.
>
> Would you please upgrade btrfs-prog
Re: Wiki suggestions
On Mon, 13 Jul 2015 19:21:54 +0200, Marc Joliet wrote:
> OK, I'll make the changes then (sans kernel log).

Just a heads up: I accepted the terms of service, but the link goes to
a non-existent wiki page.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who
know we don't" - Bjarne Stroustrup
Re: Wiki suggestions
On Mon, 13 Jul 2015 18:30:09 +0200, David Sterba wrote:
> On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
>> On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan
>> <1i5t5.dun...@cox.net> wrote:
>>
>>> Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
>>>
>>>> I hope it's not out of place, but I have a few suggestions for the
>>>> Wiki:
>>>
>>> Just in case it wasn't obvious... The wiki is open to user editing.
>>> You can, if you like, get an account and make the changes yourself.
>>> =:^)
>>>
>>> Of course, it's understandable if your reaction to web and wiki
>>> technologies is similar to mine; newsgroups and mailing lists (in
>>> my case via gmane.org's list2news service, so they too are
>>> presented as newsgroups) are your primary domain, and you tend to
>>> treat the web as read-only, so you rarely reply on a web forum, let
>>> alone edit a wiki. I've never gotten a wiki account here for that
>>> reason, either, or I'd have probably gone ahead and made the
>>> suggested changes...
>>>
>>> But with a bit of luck someone with an existing (or even new)
>>> account will be along to make the changes...
>>
>> It's partially a "read-only" habit, but it's also that I'm just not
>> confident in deciding whether those actually *are* good suggestions,
>> or put differently: it's the public face of btrfs, and I don't want
>> to accidentally do something to "ruin" it (to use some hyperbole).
>
> All your suggestions are good; adding more articles/videos/talks
> should be easy as there's a section for that already. The news
> section is mostly written by me, but if you keep your entries
> consistent with the rest then it's ok.
>
> There are a few people who watch over new wiki edits and fix/enhance
> them if needed. You can't do too much damage unless you really want
> to.

OK, I'll make the changes then (sans kernel log).
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who
know we don't" - Bjarne Stroustrup
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious for me.
>
> Would it make sense not to COW the block in case-1, if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not
> belong to a log tree?

Hi Alex,

The crc rules are the most important, we have to make sure the block
isn't changed while it is in flight.  Also, think about something like
this:

transaction writes block A, puts a pointer to it in the btree,
generation Y
transaction rewrites block A, same generation Y

Later on, we try to read block A again.  We find it has the correct
crc and the correct generation number, but the contents are actually
wrong.

> The second case is more difficult. One problem is that the
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a
> block due to memory pressure (this is what I see happening), we
> complete the writeback, release the extent buffer, and the pages are
> evicted from the page cache of btree_inode. After some time we read
> the block again (because we want to modify it in the same
> transaction), but its header is already marked as
> BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at this point it
> should be safe to avoid COW, we will re-COW.
>
> Would it make sense to have some runtime-only mechanism to lock out
> the writeback for an eb? I.e., if we know that an eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW. Then we allow the writeback to start when it
> wants to.
>
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction a total
> of 10GB of metadata (again, before raid-induced overhead) was written
> to disk. (This is the total of all ebs having
> header->generation==curr_transid, not only during commit of the
> transaction). This particular run was with "notreelog".
>
> Machine had 8GB of RAM.
> Linux allows the btree_inode to grow its page cache up to ~6.9GB
> (judging by btree_inode->i_mapping->nrpages). But even though the
> used amount of metadata is less than that, this re-COW'ing of
> already-COW'ed blocks seems to cause page-cache thrashing...

Interesting.  We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping
memory-pressure-related writeback if there isn't that much dirty.
There is a lot of room to improve those decisions, like preferring to
write leaves over nodes, especially full leaves that are not likely to
change again.

-chris
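The condition quoted from should_cow_block() in this thread boils down to a two-part predicate: modify in place only if the block was allocated in the running transaction and writeback has not yet been triggered for it. A simplified model of just that decision (the WRITTEN bit value here is a stand-in, and the real function checks several more conditions):

```c
#include <assert.h>
#include <stdint.h>

#define HEADER_FLAG_WRITTEN (1ULL << 0) /* stand-in for the real bit */

struct eb_model {
    uint64_t generation; /* transid that last COW'ed this block */
    uint64_t flags;
};

/* Returns 1 if the block must be COW'ed, 0 if it may be rewritten in
 * place, mirroring the quoted generation/WRITTEN checks. */
static int must_cow(const struct eb_model *buf, uint64_t transid)
{
    if (buf->generation == transid &&
        !(buf->flags & HEADER_FLAG_WRITTEN))
        return 0;
    return 1;
}
```

The second case below is exactly Alex's re-COW scenario: same transaction, but the WRITTEN flag already set by memory-pressure writeback forces another COW.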
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
Filipe,

Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that the
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and the pages are evicted from
the page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock out
the writeback for an eb? I.e., if we know that an eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation==curr_transid, not only during commit of the
transaction). This particular run was with "notreelog".

Machine had 8GB of RAM. Linux allows the btree_inode to grow its page
cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages). But
even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.

On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana wrote:
> On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas wrote:
>> Greetings,
>> Looking at the code of should_cow_block(), I see:
>>
>> if (btrfs_header_generation(buf) == trans->transid &&
>>     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>> ...
>> So if the extent buffer has been written to disk, and now is changed
>> again in the same transaction, we insist on COW'ing it. Can anybody
>> explain why COW is needed in this case? The transaction has not
>> committed yet, so what is the danger of rewriting to the same
>> location on disk? My understanding was that a tree block needs to be
>> COW'ed at most once in the same transaction. But I see that this is
>> not the case.
>
> That logic is there, as far as I can see, for at least 2 obvious
> reasons:
>
> 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
> have the same transaction id/generation, and you can have multiple
> fsyncs (log transaction commits) per transaction, so you need to
> ensure consistency. If we skipped the COWing in the example below,
> you would get an inconsistent log tree at log replay time when the fs
> is mounted:
>
> transaction N start
>
>    fsync inode A start
>    creates tree block X
>    flush X to disk
>    write a new superblock
>    fsync inode A end
>
>    fsync inode B start
>    skip COW of X because its generation == current transaction id
>    and modify it in place
>    flush X to disk
>
> === crash ===
>
>    write a new superblock
>    fsync inode B end
>
> transaction N commit
>
> 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
> written to disk but instead when we trigger writeback for it. So
> while the writeback is ongoing we want to make sure the block's
> content isn't concurrently modified (we don't keep the eb write
> locked to allow concurrent reads during the writeback).
>
> All tree blocks that don't belong to a log tree are normally written
> only at the end of a transaction commit. But often, due to memory
> pressure for example, the VM can call the writepages() callback of
> the btree inode to force dirty tree blocks to be written to disk
> before the transaction commit.
>
>> I am asking because I am doing some profiling of btrfs metadata work under
>> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
>> blocks than the total metadata size.
>>
>> Thanks,
>> Alex.
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Wiki suggestions
On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
> On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:
> > Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
> >
> > > I hope it's not out of place, but I have a few suggestions for the Wiki:
> >
> > Just in case it wasn't obvious... The wiki is open to user editing. You
> > can, if you like, get an account and make the changes yourself. =:^)
> >
> > Of course, it's understandable if your reaction to web and wiki
> > technologies is similar to mine; newsgroups and mailing lists (in my case
> > via gmane.org's list2news service, so they too are presented as
> > newsgroups) are your primary domain, and you tend to treat the web as
> > read-only, so you rarely reply on a web forum, let alone edit a wiki. I've
> > never gotten a wiki account here for that reason, either, or I'd have
> > probably gone ahead and made the suggested changes...
> >
> > But with a bit of luck someone with an existing (or even new) account
> > will be along to make the changes...
>
> It's partially a "read-only" habit, but it's also that I'm just not confident
> in deciding whether those actually *are* good suggestions, or put differently:
> it's the public face of btrfs, and I don't want to accidentally do something to
> "ruin" it (to use some hyperbole).

All your suggestions are good; adding more articles/videos/talks should be easy as there's a section for that already. The news section is mostly written by me, but if you keep your entries consistent with the rest then it's ok. There are a few people who watch over new wiki edits and fix/enhance them if needed. You can't do too much damage unless you really want to.
Re: Btrfs read-only after btrfs-convert from Ext4 & workaround
On Jul 12, 2015 at 2026 -0600, Chris Murphy appeared and said:
> On Sun, Jul 12, 2015 at 7:23 PM, René Pfeiffer wrote:
> >...
> > Output of uname, btrfs, and the dmesg log is attached. Let me know if you
> > need anything else. The old Btrfs is still on another disk, and I can
> > extract information from it.
>
> If you can run 'btrfs check' on it (without repair) using btrfs-progs
> 4.0, and 3.19.1, and report the results of each, that would be really
> useful.

Here we go.

Best,
René.
--
)\._.,--,'``. fL Let GNU/Linux work for you while you take a nap.
/, _.. \ _\ (`._ ,. R. Pfeiffer + http://web.luchs.at/
`._.-(,_..'--(,_..'`-.;.' - System administration + Consulting + Teaching -
Got mail delivery problems? http://web.luchs.at/information/blockedmail.php

Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
There is no free space entry for 163242479616-163242483712
There is no free space entry for 163242479616-167537278976
cache appears valid but isnt 162168569856
found 358532382931 bytes used err is -22
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v3.19.1

Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
block group 162168569856 has wrong amount of free space
failed to load free space cache for block group 162168569856
checking fs roots
root 5 inode 39321856 errors 200, dir isize wrong
root 5 inode 40898635 errors 200, dir isize wrong
found 358532382931 bytes used err is 1
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v4.0
[GIT PULL] More btrfs bug fixes
From: Filipe Manana

Hi Chris,

Please consider the following changes for the kernel 4.2 release. All these patches have been available on the mailing list for some time.

One of the patches is a fix for a regression in the delayed references code that landed in 4.2-rc1. Two of them are for issues reported by users on the list and IRC recently (which I've cc'ed for stable), and the final one is just a missing update of an inode's on-disk size after truncating a file if the no_holes feature is enabled, which I found some time ago.

I have rebased them on top of your current integration-4.2 branch, re-tested them and incorporated any tags people have added through the mailing list (Reviewed-by, Acked-by).

Thanks.

The following changes since commit 9689457b5b0a2b69874c421a489d3fb50ca76b7b:

  Btrfs: fix wrong check for btrfs_force_chunk_alloc() (2015-07-01 17:17:22 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git integration-4.2

for you to fetch changes up to cffc3374e567ef42954f3c7070b3fa83f20f9684:

  Btrfs: fix order by which delayed references are run (2015-07-11 22:36:44 +0100)

Filipe Manana (4):
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix order by which delayed references are run

 fs/btrfs/extent-tree.c | 13 +
 fs/btrfs/inode.c | 5 ++---
 fs/btrfs/ioctl.c | 4 +++-
 fs/btrfs/transaction.c | 4 ++--
 4 files changed, 20 insertions(+), 6 deletions(-)

--
2.1.3
kernel crash - btrfs check shows "extent buffer leak"; Suggestions?
Last time something happened and I poked at it myself, I ended up ruining the pool, so I thought I'd ask here before doing anything. I'm not sure if this really indicates that anything needs doing or not. The filesystem will mount like normal. It doesn't look like the core dump was written anywhere, but I've never actually looked for it before. I'm still Googling where it might be.

[root@san01 ~]# btrfs check /dev/sdi
Checking filesystem on /dev/sdi
UUID: 6848df32-bd2a-49b8-b3b9-40038f98ef8a
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 352 are different
our: referenced 699122253824 referenced compressed 699122253824
disk: referenced 699151855616 referenced compressed 699151855616
diff: referenced -29601792 referenced compressed -29601792
our: exclusive 1279471616 exclusive compressed 1279471616
disk: exclusive 1279471616 exclusive compressed 1279471616
Counts for qgroup id: 844 are different
our: referenced 699130273792 referenced compressed 699130273792
disk: referenced 699159875584 referenced compressed 699159875584
diff: referenced -29601792 referenced compressed -29601792
our: exclusive 81920 exclusive compressed 81920
disk: exclusive 81920 exclusive compressed 81920
found 875663138891 bytes used err is 0
total csum bytes: 790806028
total tree bytes: 1950498816
total fs tree bytes: 436994048
total extent tree bytes: 526237696
btree space waste bytes: 439941098
file data blocks allocated: 981391929344
 referenced 1086835916800
btrfs-progs v4.1
extent buffer leak: start 3338550263808 len 16384
extent buffer leak: start 3338550165504 len 16384
extent buffer leak: start 3100998254592 len 16384
extent buffer leak: start 3100998270976 len 16384
extent buffer leak: start 3100998287360 len 16384
extent buffer leak: start 3100998303744 len 16384
extent buffer leak: start 3100998320128 len 16384
extent buffer leak: start 3100998336512 len 16384
extent buffer leak: start 3100998352896 len 16384
extent buffer leak: start 3100998369280 len 16384
extent buffer leak: start 3338550149120 len 16384
extent buffer leak: start 3338550132736 len 16384
extent buffer leak: start 2756246339584 len 16384
extent buffer leak: start 2756284366848 len 16384
extent buffer leak: start 3339485298688 len 16384
extent buffer leak: start 3339485347840 len 16384
extent buffer leak: start 3339485429760 len 16384
extent buffer leak: start 3339485446144 len 16384
extent buffer leak: start 3339485462528 len 16384
extent buffer leak: start 3339485528064 len 16384
extent buffer leak: start 333948558 len 16384
extent buffer leak: start 3339485560832 len 16384
extent buffer leak: start 3339488018432 len 16384
extent buffer leak: start 3339489361920 len 16384
extent buffer leak: start 3339504140288 len 16384
extent buffer leak: start 3339504156672 len 16384
extent buffer leak: start 3339504435200 len 16384
extent buffer leak: start 3339504467968 len 16384
extent buffer leak: start 3339504484352 len 16384
extent buffer leak: start 3339505778688 len 16384
extent buffer leak: start 3339505811456 len 16384
extent buffer leak: start 3339507105792 len 16384
extent buffer leak: start 3339507187712 len 16384
extent buffer leak: start 3339507204096 len 16384
extent buffer leak: start 3339518394368 len 16384
Re: slowdown after one week
On 13.07.2015 at 13:20, Austin S Hemmelgarn wrote:
> On 2015-07-11 02:46, Stefan Priebe wrote:
>> Hi,
>>
>> while using a 40TB btrfs partition for VM backups, I see a massive
>> slowdown after around one week.
>>
>> The backup task usually takes 2-3 hours. After one week it takes 20
>> hours. If I umount and remount the btrfs volume, it takes 2-3 hours again.
>>
>> Kernel 4.1.1
>>
> I've been seeing similar (although much less drastic) slowdowns over
> time myself, pretty much since I started using BTRFS (IIRC, sometime
> around 3.16). If you're not constantly writing to that backup volume,
> you might want to consider setting up automounting for it.

Yes, but that's awful; it's a bug. It would be very nice if someone involved in btrfs development could comment on this one.
[RFC] Add an option to disable automatic chunk reclamation.
Since upgrading to a kernel with the automatic chunk reclamation patches, I've noticed a number of issues with BTRFS that all seem to either be caused by, or be further exacerbated by, this 'feature'. The four big issues I've seen regarding it are:

1. TRIM/DISCARD support is broken as a (partial?) result of this.
2. It appears to expose underlying issues with the defrag code (stuff ending up more fragmented after defrag).
3. Since upgrading to a kernel with this patch, most of the BTRFS filesystems I have that are very rewrite-heavy have gotten very noticeably slower than they were beforehand (and this goes away when I run them on a kernel without auto-reclaim).
4. All of my filesystems are experiencing seemingly non-deterministic delays around most large-scale VFS-level operations (e.g., deleting or relocating lots of files).

While I understand that this feature does solve (at least partially) a very real issue with BTRFS, there were a number of people who never had this issue to begin with because we ran regular balance operations on our filesystems. Based on this, I would like to propose that some method be provided to disable auto-reclaim. Personally, I would prefer to leave it on as the default and have a mount option to disable it.
Re: Did btrfs filesystem defrag just make things worse?
On 2015-07-11 11:24, Duncan wrote:
> I'm not a coder, only a list regular and btrfs user, and I'm not sure on
> this, but there have been several reports of this nature on the list
> recently, and I have a theory. Maybe the devs can step in and either
> confirm or shoot it down.

While I am a coder, I'm not a BTRFS developer, so what I say below may still be incorrect.

[...trimmed for brevity...]

> Of course during normal use, files get deleted as well, thereby clearing
> space in existing chunks. But this space will be fragmented, with a mix
> of unallocated extents and still-remaining files. The allocator will, I
> /believe/ (this is where people who can actually read the code come in),
> try to use up space in existing chunks before allocating additional
> space, possibly subject to some reasonable extent minimum size, below
> which btrfs will simply allocate another chunk.

AFAICT, this is in fact the case.

> 1) Prioritize reduced fragmentation, at the expense of higher data chunk
> allocation. In the extreme, this would mean always choosing to allocate
> a new chunk and use it if the file (or remainder of the file not yet
> defragged) was larger than the largest free extent in existing data
> chunks.
>
> The problem with this is that over time, the number of partially used
> data chunks goes up as new ones are allocated to defrag into, but
> sub-1 GiB files that are already defragged are left where they are. Of
> course a balance can help here, by combining multiple partial chunks
> into fewer full chunks, but unless a balance is run...
>
> 2) Prioritize chunk utilization, at the expense of leaving some
> fragmentation, despite massive amounts of unallocated space.
>
> This is what I've begun to suspect defrag does. With a bunch of free but
> fragmented space in existing chunks, defrag could actually increase
> fragmentation, as the space in existing chunks is so fragmented that a
> rewrite is forced to use more, smaller extents, because that's all there
> is free, until another chunk is allocated.
> As I mentioned above for normal file allocation, it's quite possible
> that there's some minimum extent size (greater than the bare minimum
> 4 KiB block size) where the allocator will give up and allocate a new
> data chunk, but if so, perhaps this size needs bumping upward, as it
> seems a bit low today.

If I'm reading the code correctly, defrag does indeed try to avoid allocating a new chunk if at all possible.

> Meanwhile, there are a number of exacerbating factors to consider as well.
>
> * Snapshots and other shared references lock extents in place.
>
> Defrag doesn't touch anything but the subvolume it's actually pointed at
> for the defrag. Other subvolumes and shared-reference files will
> continue to keep the extents they reference locked in place. And COW
> will rewrite blocks of a file, but the old reference extent remains
> locked until all references to it are cleared -- the entire file (or at
> least all blocks that were in that extent) must be rewritten, and no
> snapshots or other references to it remain, before it can be freed.
>
> For a few kernel cycles btrfs had snapshot-aware-defrag, but that
> implementation didn't scale well at all, so it was disabled until it
> could be rewritten, and that rewrite hasn't occurred yet. So
> snapshot-aware-defrag remains disabled, and defrag only works on the
> subvolume it's actually pointed at.
>
> As a result, if defrag rewrites a snapshotted file, it actually doubles
> the space that file takes, as it makes a new copy, breaking the
> reference link between it and the copy in the snapshot. Of course, with
> the space not freed up, this will, over time, tend to fragment space
> that is freed even more heavily.

To mitigate this, one can run offline data deduplication (duperemove is the tool I'd suggest for this), although there are caveats to doing that as well.

> * Chunk reclamation. This is the relatively new development that I think
> is triggering the surge in defrag-not-defragging reports we're seeing now.
> Until quite recently, btrfs could allocate new chunks, but it couldn't,
> on its own, deallocate empty chunks. What tended to happen over time was
> that people would find all the filesystem space taken up by empty or
> mostly empty data chunks, and btrfs would start spitting ENOSPC errors
> when it needed to allocate new metadata chunks but couldn't, as all the
> space was in empty data chunks. A balance could fix it, often relatively
> quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a
> manual process; btrfs wouldn't do it on its own.
>
> Recently the devs (mostly) fixed that, and btrfs will automatically
> reclaim entirely empty chunks on its own now. It still doesn't reclaim
> partially empty chunks automatically; a manual rebalance must still be
> used to combine multiple partially empty chunks into fewer full chunks;
> but it does well enough to make the previous problem pretty rare -- we
> don't see the hundreds of GiB of empty data chunks allocated any more,
> like we used to.
>
> Which fixed the one problem, but if my theory is co
Re: slowdown after one week
On 2015-07-11 02:46, Stefan Priebe wrote:
> Hi,
>
> while using a 40TB btrfs partition for VM backups, I see a massive
> slowdown after around one week.
>
> The backup task usually takes 2-3 hours. After one week it takes 20
> hours. If I umount and remount the btrfs volume, it takes 2-3 hours again.
>
> Kernel 4.1.1

I've been seeing similar (although much less drastic) slowdowns over time myself, pretty much since I started using BTRFS (IIRC, sometime around 3.16). If you're not constantly writing to that backup volume, you might want to consider setting up automounting for it.
Re: Wiki suggestions
On Mon, 13 Jul 2015 06:56:17 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:
> Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
>
> > I hope it's not out of place, but I have a few suggestions for the Wiki:
>
> Just in case it wasn't obvious... The wiki is open to user editing. You
> can, if you like, get an account and make the changes yourself. =:^)
>
> Of course, it's understandable if your reaction to web and wiki
> technologies is similar to mine; newsgroups and mailing lists (in my case
> via gmane.org's list2news service, so they too are presented as
> newsgroups) are your primary domain, and you tend to treat the web as
> read-only, so you rarely reply on a web forum, let alone edit a wiki. I've
> never gotten a wiki account here for that reason, either, or I'd have
> probably gone ahead and made the suggested changes...
>
> But with a bit of luck someone with an existing (or even new) account
> will be along to make the changes...

It's partially a "read-only" habit, but it's also that I'm just not confident in deciding whether those actually *are* good suggestions, or put differently: it's the public face of btrfs, and I don't want to accidentally do something to "ruin" it (to use some hyperbole).

However, if somebody gives me the go-ahead, I might just edit the wiki myself (though I don't know enough to be able to edit the kernel news entry ;-) ).

--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> if (btrfs_header_generation(buf) == trans->transid &&
>     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> ...
>
> So if the extent buffer has been written to disk, and now is changed again
> in the same transaction, we insist on COW'ing it. Can anybody explain why
> COW is needed in this case? The transaction has not committed yet, so what
> is the danger of rewriting to the same location on disk? My understanding
> was that a tree block needs to be COW'ed at most once in the same
> transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree have the same transaction id/generation, and you can have multiple fsyncs (log transaction commits) per transaction, so you need to ensure consistency. If we skipped the COWing in the example below, you would get an inconsistent log tree at log replay time when the fs is mounted:

transaction N start

    fsync inode A start
    creates tree block X
    flush X to disk
    write a new superblock
    fsync inode A end

    fsync inode B start
    skip COW of X because its generation == current transaction id and
    modify it in place
    flush X to disk

    == crash ===

    write a new superblock
    fsync inode B end

transaction N commit

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is written to disk but instead when we trigger writeback for it. So while the writeback is ongoing we want to make sure the block's content isn't concurrently modified (we don't keep the eb write locked to allow concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written only at the end of a transaction commit.
But often, e.g. due to memory pressure, the VM can call the writepages() callback of the btree inode to force dirty tree blocks to be written to disk before the transaction commit.

> I am asking because I am doing some profiling of btrfs metadata work under
> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
> blocks than the total metadata size.
>
> Thanks,
> Alex.

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
Re: Disk "failed" while doing scrub
Dāvis Mosāns posted on Mon, 13 Jul 2015 09:26:05 +0300 as excerpted:

> Short version: while doing scrub on 5 disk btrfs filesystem, /dev/sdd
> "failed" and also had some error on other disk (/dev/sdh)

You say five disks, but nowhere in your post do you mention what raid mode you were using, nor do you post btrfs filesystem show and btrfs filesystem df output, as suggested on the wiki, which would list that information.

FWIW, btrfs defaults for a multi-device filesystem are raid1 metadata, raid0 data. If you didn't specify a raid level at mkfs time, it's very likely that's what you're using. The scrub results seem to support this, as if the data had been raid1 or raid10, nearly all the errors should have been correctable by pulling from the second copy. And raid5/6 should have been able to recover from parity, tho this mode is new enough that it's still not recommended, as the chances of bugs and thus failure to work properly are much higher.

So you really should have been using raid1/10 if you wanted device failure tolerance, but you didn't say, and if you're using defaults as seems reasonably likely, your data was raid0, and thus it's likely many/most files are either gone or damaged beyond repair.

(As it happens, I have a number of btrfs raid1 data/metadata on a pair of partitioned ssds, with each btrfs on a corresponding partition on both of them, with one of the ssds developing bad sectors and basically slowly failing. But the other member of the raid1 pair is solid and I have backups, as well as a spare I can replace the failing one with when I decide it's time, so I've been letting the bad one stick around due as much as anything to morbid curiosity, watching it slowly fail. So I know exactly how scrub on btrfs raid1 behaves in a bad-sector case, pulling the copy from the good device to overwrite the bad copy with, triggering the device's sector remapping in the process.
Despite all the read errors, they've all been correctable, because I'm using raid1 for both data and metadata.)

> Because filesystem still mounts, I assume I should do "btrfs device
> delete /dev/sdd /mntpoint" and then restore damaged files from backup.

You can try a replace, but with a failing drive still connected, people report mixed results. It's likely to fail as it can't read certain blocks to transfer them to the new device.

With raid1 or better, physically disconnecting the failing device and doing a device delete missing (or replace missing, but AFAIK this doesn't work with released versions and I'm not sure if it's even in integration yet, but there are patches on-list that should make it work) can work. With raid0/single, you can mount with a missing device if you use degraded,ro, but obviously that'll only let you try to copy files off, and you'll likely not have a lot of luck with raid0, with files missing, but a bit more luck with single.

In the likely raid0/single case, your best bet is probably to try copying off what you can, and/or restoring from backups. See the discussion below.

> Are all affected files listed in journal? there's messages about "x
> callbacks suppressed" so I'm not sure and if there aren't how to get
> full list of damaged files?
> Also I wonder if there are any tools to recover partial file fragments
> and reconstruct file? (where missing fragments filled with nulls)
> I assume that there's no point in running "btrfs check
> --check-data-csum" because scrub already does check that?

There's no such partial-file with null-fill tools shipped just yet. Those files normally simply trigger errors trying to read them, because btrfs won't let you at them if the checksum doesn't verify.

There /is/, however, a command that can be used to either regenerate or zero-out the checksum tree. See btrfs check --init-csum-tree.
Current versions recalculate the csums, older versions (btrfsck as that was before btrfs check) simply zeroed it out. Then you can read the file despite bad checksums, tho you'll still get errors if the block physically cannot be read. There's also btrfs restore, which works on the unmounted filesystem without actually writing to it, copying the files it can read to a new location, which of course has to be a filesystem with enough room to restore the files to, altho it's possible to tell restore to do only specific subdirs, for instance. What I'd recommend depends on how complete and how recent your backup is. If it's complete and recent enough, probably the easiest thing is to simply blow away the bad filesystem and start over, recovering from the backup to a new filesystem. If there's files you'd like to get back that weren't backed up or where the backup is old, since the filesystem is mountable, I'd probably copy everything off it I could. Then, I'd try restore, letting it restore to the same location I had copied to, but NOT using the --overwrite option, so it only wrote any files it could restore that the copy wasn
Re: Can't remove missing device
On 10 July 2015 at 06:05, None None wrote: > According to dmesg sda returns bad data but the smart values for it seem fine. > # smartctl -a /dev/sda ... > SMART Self-test log structure revision number 1 > No self-tests have been logged. [To run self-tests, use: smartctl -t] Run smartctl -t long /dev/sda -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html