So, wipe it out and start over or keep debugging?
To those of you who have been helping out with my 4-drive RAID1 situation: is there anything further we should do to investigate this, in case we can uncover any more bugs, or should I just wipe everything out and restore from backup?

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lockup in BTRFS_IOC_CLONE/Kernel 4.2.0-rc5
On 08/08/2015 01:29 AM, Elias Probst wrote:

On 08/07/2015 06:01 AM, Liu Bo wrote:
Could you do 'echo w > /proc/sysrq-trigger' to gather the whole hang call stack?

Thanks a lot for the feedback. Full call stack output is attached (pasting inline makes no sense due to the size of 2423 lines/135k). In case VGER strips attachments of this size, I made it available as a pastebin here: https://bpaste.net/show/e5e0fd4bbb9f

Here's a quick patch that may address your problem; can you give it a shot after getting the sysrq-w output?

Will rebuild now with your patch and keep you updated about the outcome.

So far I've been unable to reproduce the lockup using your patch on 4.2.0-rc6. Thanks a lot!

- Elias
Re: Deleted files cause btrfs-send to fail
On Sat, 15 Aug 2015 05:10:57 +0000 (UTC), Duncan 1i5t5.dun...@cox.net wrote:

> Marc Joliet posted on Fri, 14 Aug 2015 23:37:37 +0200 as excerpted:
>> (One other thing I found interesting was that btrfs scrub didn't care about the link count errors.)
>
> A lot of people are confused about exactly what btrfs scrub does, and expect it to detect and possibly fix stuff it has nothing to do with. It's *not* an fsck.
>
> Scrub does one very useful, but limited, thing. It systematically verifies that the computed checksums for all data and metadata covered by checksums match the corresponding recorded checksums. For dup/raid1/raid10 modes, if there's a match failure, it will look up the other copy and see if it matches, replacing the invalid block with a new copy of the other one, assuming it's valid. For raid56 modes, it attempts to compute the valid copy from parity and, again assuming a match after doing so, does the replace. If a valid copy cannot be found or computed, either because it's damaged too or because there's no second copy or parity to fall back on (single and raid0 modes), then scrub will detect, but cannot correct, the error.
>
> In routine usage, btrfs automatically does the same thing if it happens to come across checksum errors in its normal IO stream, but it has to come across them first. Scrub's benefit is that it systematically verifies (and corrects errors where it can) checksums on the entire filesystem, not just the parts that happen to appear in the normal IO stream.

I know all that; I just thought it was interesting and wanted to remark on it. After thinking about it a bit, of course, it makes perfect sense and is not very interesting at all: scrub will just verify that the checksums match, no matter whether the underlying (meta)data is valid or not.

> Such checksum errors can occur for a few reasons... I have one ssd that's gradually failing and returns checksum errors fairly regularly. Were I using a normal filesystem I'd have had to replace it some time ago.
>
> But with btrfs in raid1 mode and regular scrubs (and backups, should they be needed; sometimes I let them get a bit stale, but I do have them and am prepared to live with the stale restored data if I have to), I've been able to keep using the failing device. When the scrubs hit errors and btrfs does the rewrite from the good copy, a block relocation on the failing device is triggered as well, with the bad block taken out of service and a new one, from the set of spares all modern devices have, taking its place.
>
> Currently, smartctl -A reports 904 reallocated sectors raw value, with a standardized value of 92. Before the first reallocated sector, the standardized value was 253, perfect. With the first reallocated sector, it immediately dropped to 100, apparently the rounded percentage of spare sectors left. It has gradually dropped since then to its current 92, with a threshold value of 36. So while it's gradually failing, there are still plenty of spare sectors left.
>
> Normally I would have replaced the device even so, but I've never had the opportunity to actually watch a slow failure get worse over time, and now that I do, I'm a bit curious how things will go, so I'm just letting it happen, tho I do have a replacement device already purchased and ready for when the time comes.

I'm curious how that will pan out. My experience with HDDs is that at some point the sector reallocations start picking up at a somewhat constant (maybe even accelerating) rate. I wonder how SSDs behave in this regard.

> So real media failure, bitrot, is one reason for bad checksums. The data read back from the device simply isn't the same data that was stored to it, and the checksum fails as a result. Of course bad connector cables or storage chipset firmware or hardware is another hardware cause.
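Duncan's reading of the standardized value as "apparently the rounded percentage of spare sectors left" implies a rough back-of-the-envelope estimate of the drive's total spare pool. A minimal sketch of that arithmetic, assuming the linear percentage model holds (it is Duncan's guess, not something the vendor documents; the function name is mine):

```python
def estimate_spare_pool(raw_reallocated: int, normalized: int) -> int:
    """Estimate total spare sectors, assuming the SMART normalized value
    is the rounded percentage of spares remaining (an assumption; drive
    vendors do not document the exact scaling)."""
    consumed_pct = 100 - normalized
    if consumed_pct <= 0:
        raise ValueError("normalized value shows no spare consumption yet")
    # raw_reallocated sectors consumed consumed_pct percent of the pool
    return round(raw_reallocated * 100 / consumed_pct)

# The figures from the thread: raw value 904, standardized value 92,
# suggesting a pool of very roughly 11,300 spare sectors.
print(estimate_spare_pool(904, 92))  # -> 11300
```

With a threshold value of 36, that model would put the failure point near two thirds of the pool consumed, which is consistent with Duncan's "still plenty of spare sectors left".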
> Sudden reboot or power loss, with data being actively written and one copy either already updated or not yet touched, while the other is actually being written at the time of the crash so the write isn't completed, is yet another reason for checksum failure.
>
> This one is actually why a scrub can appear to do so much more than it does: where there's a second copy (or parity) of the data available, scrub can use it to recover the partially written copy (which, being partially written, fails its checksum verification), either to the completed-write state, if the other copy was already written, or to the pre-write state, if the other copy hadn't been written at all yet. In this way the result is often the same one an fsck would normally produce, detecting and fixing the error, but the mechanism is entirely different: it only detected and fixed the error because the checksum was bad and it had a good copy to replace it with, not because it had any smarts about how the filesystem actually worked.
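The scrub behavior described above (verify every checksum; on a mismatch, fetch the other copy, and rewrite the bad block only if that copy itself verifies) can be modeled in a few lines. This is purely an illustrative sketch, not kernel code: real btrfs uses crc32c and operates on extents and stripes, while here zlib's crc32 and in-memory byte strings stand in for both.

```python
import zlib

def scrub_block(copies, recorded_csum):
    """Toy model of scrub on a dup/raid1 block.

    Verifies each copy against the recorded checksum; if at least one
    copy verifies, the bad copies can be rewritten from it.  Returns
    (good_data, bad_copy_count).  With no valid copy available
    (single/raid0, or all copies damaged), scrub can only detect,
    not correct."""
    good = [c for c in copies if zlib.crc32(c) == recorded_csum]
    if not good:
        raise IOError("checksum mismatch and no valid copy to repair from")
    bad = sum(1 for c in copies if zlib.crc32(c) != recorded_csum)
    return good[0], bad

data = b"some metadata block"
csum = zlib.crc32(data)
# One copy bitrotted: scrub detects it and repairs from the good copy.
repaired, fixed = scrub_block([data, b"bitrotted garbage!!"], csum)
print(repaired == data, fixed)  # -> True 1
```

Note how the model also captures the crash case Duncan describes: a partially written copy simply fails its checksum and gets replaced by whichever copy still verifies, with no filesystem-structure knowledge involved.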
Re: RAID0 wrong (raw) device?
On Sat 2015-08-15 (08:02), Anand Jain wrote:

> First of all, there is a known issue in handling multiple paths/instances of the same device image in btrfs. Fixing this caused a regression earlier, and my survey "[survey] BTRFS_IOC_DEVICES_READY return status" almost told me not to fix the bug.

I have subscribed to this list this week; I am a newbie :-)

>> There is now a new behaviour: after the btrfs mount, I briefly see the wrong raw device /dev/sde, and a few seconds later there is the correct /dev/drbd3:
>
> yep, possible. but it does not mean that the btrfs kernel is using the new path; it's just a reporting bug.

What is the reporting bug: /dev/sde or /dev/drbd3?

root@toy02:/etc# btrfs filesystem show -m
Label: data  uuid: 411af13f-6cae-4f03-99dc-5941acb3135b
        Total devices 2  FS bytes used 109.56GiB
        devid 3 size 1.82TiB used 63.03GiB path /dev/drbd2
        devid 4 size 1.82TiB used 63.03GiB path /dev/drbd3

Still, the kernel sees 3 instead of (really) 2 HGST drives:

root@toy02:/etc# hdparm -I /dev/sdb | grep Number:
        Model Number:  HGST HUS724020ALA640
        Serial Number: PN2134P5G2P2AX
root@toy02:/etc# hdparm -I /dev/sde | grep Number:
        Model Number:  HGST HUS724020ALA640
        Serial Number: PN2134P5G2P2AX

> This is important to know but not a btrfs issue. Do you have multiple host paths reaching this device with serial # PN2134P5G2P2AX?
root@toy02:~# find /dev -ls | grep PN2134P5G2P2AX
143540 lrwxrwxrwx 1 root root 17 Aug 14 09:00 /dev/drbd/by-disk/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../../../drbd3
136400 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../sdb

root@toy02:~# find /dev -ls | grep sdb
74170 brw-rw---- 1 root disk Aug 13 16:25 /dev/sdb
123660 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-path/pci-:08:00.0-sas-0x12210200-lun-0 -> ../../sdb
136410 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-id/wwn-0x5000cca24ec137db -> ../../sdb
136400 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../sdb
123560 lrwxrwxrwx 1 root root 6 Aug 13 16:25 /dev/block/8:16 -> ../sdb

root@toy02:~# find /dev -ls | grep sde
133530 brw-rw---- 1 root disk Aug 13 16:24 /dev/sde
157250 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-uuid/411af13f-6cae-4f03-99dc-5941acb3135b -> ../../sde
157240 lrwxrwxrwx 1 root root 9 Aug 13 16:25 /dev/disk/by-label/data -> ../../sde
93940 lrwxrwxrwx 1 root root 9 Aug 13 16:24 /dev/disk/by-path/pci-:08:00.0-scsi-0:1:2:0 -> ../../sde
93870 lrwxrwxrwx 1 root root 6 Aug 13 16:24 /dev/block/8:64 -> ../sde

--
Ullrich Horlacher           Server und Virtualisierung
Rechenzentrum IZUS/TIK      E-Mail: horlac...@tik.uni-stuttgart.de
Universitaet Stuttgart      Tel: ++49-711-68565868
Allmandring 30a             Fax: ++49-711-682357
70550 Stuttgart (Germany)   WWW: http://www.tik.uni-stuttgart.de/
REF:55ce81a6.5070...@oracle.com
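Regarding Anand's question about multiple host paths reaching the drive with serial PN2134P5G2P2AX: one way to spot such cases is to group candidate device nodes by drive serial number and flag any serial seen behind more than one node. A hedged sketch (the helper name and the pairing of paths to serials are illustrative; in practice the pairs would come from parsing per-device `hdparm -I` or `smartctl -i` output):

```python
from collections import defaultdict

def find_duplicate_serials(devices):
    """Group device paths by serial number; serials that appear behind
    more than one path indicate the multiple-paths-per-device situation
    discussed in this thread.  `devices` is an iterable of
    (path, serial) pairs."""
    by_serial = defaultdict(list)
    for path, serial in devices:
        by_serial[serial].append(path)
    return {s: paths for s, paths in by_serial.items() if len(paths) > 1}

# Illustrative pairing based on the serials reported above:
devs = [
    ("/dev/sdb", "PN2134P5G2P2AX"),
    ("/dev/sde", "PN2134P5G2P2AX"),  # same physical HGST drive, second path
    ("/dev/sdc", "SOME-OTHER-SERIAL"),
]
print(find_duplicate_serials(devs))  # -> {'PN2134P5G2P2AX': ['/dev/sdb', '/dev/sde']}
```

This only identifies which kernel block devices are the same physical drive; which path the btrfs kernel code actually uses is a separate question, per Anand's point about the reporting bug.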
Re: The performance is not as expected when used several disks on raid0.
Austin S Hemmelgarn posted on Fri, 14 Aug 2015 15:58:30 -0400 as excerpted:

> FWIW, running BTRFS on top of MDRAID actually works very well, especially for BTRFS raid1 on top of MD-RAID0 (I get an almost 50% performance increase for this usage over BTRFS raid10, although most of this is probably due to how btrfs dispatches I/O's to disks in multi-disk setups).

Of course that's effectively a raid01, which is normally considered a mistakenly reversed raid10, due to the IO cost of the rebuild should a device fail: the whole raid0 on the one raid1 side would have to be re-replicated to the other, vs. only having to re-replicate one device to its mirror in a raid10 arrangement.

However, in this case it's actually a very smart arrangement, the only md-raid-under-btrfs-raid arrangement that makes real sense (well, other than perhaps raid00, raid0 at both levels), in particular because the btrfs raid1 on top still gives you the full benefit of btrfs file integrity features as well as the usual raid1 redundancy, tho in this case the pair of btrfs raid1 copies is the one raid0 set against the other.

And the mdraid0 is much better optimized than btrfs raid0, so there's that bonus, while at the same time the btrfs raid1 redundancy nicely balances the usual Russian-roulette quality of raid0. Very nice configuration! =:^)

Thanks for mentioning it; I guess I was effectively ruling it out as an option before even really considering it, due to the usual raid10-is-better-than-raid01 thing, and thus was entirely blind to the possibility. Which was bad because, as I alluded to, mdraid's lack of file integrity features, and thus the lack of any way for btrfs scrub to properly reach down to the mdraid level when there's mdraid-level redundancy, otherwise makes a mess of things.
But btrfs raid1 on mdraid0 effectively balances out the negatives at each level with the strengths of the other, and is really quite an awesome solution that until now I was entirely blinded to! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
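Duncan's rebuild-cost argument against classic raid01 can be made concrete with a toy model: after a single-disk failure, raid10 re-replicates only the failed disk's mirror partner, while a mirror-of-stripes (raid01) must resync the entire surviving stripe onto the rebuilt one. A sketch under those simplifying assumptions (it deliberately ignores that btrfs raid1 resyncs only used chunks, and that md can sometimes avoid a full-member resync):

```python
def rebuild_bytes(layout: str, disks_per_stripe: int, disk_bytes: int) -> int:
    """Bytes copied to restore redundancy after one disk fails.

    raid10: striped mirror pairs; only the failed disk's partner is
            re-read and copied, so one disk's worth of data moves.
    raid01: two raid0 stripes mirrored against each other; the whole
            surviving stripe must be resynced onto the rebuilt stripe."""
    if layout == "raid10":
        return disk_bytes
    if layout == "raid01":
        return disks_per_stripe * disk_bytes
    raise ValueError(f"unknown layout: {layout}")

TB = 10**12
# Four 2TB disks either way: raid10 rebuilds 2TB, raid01 rebuilds 4TB,
# and the gap grows linearly with the number of disks per stripe.
print(rebuild_bytes("raid10", 2, 2 * TB))  # -> 2000000000000
print(rebuild_bytes("raid01", 2, 2 * TB))  # -> 4000000000000
```

The btrfs-raid1-on-md-raid0 setup discussed above pays this larger raid01-style rebuild cost, but in exchange keeps btrfs checksumming and self-healing above the redundancy layer, which plain mdraid redundancy under btrfs cannot offer.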
Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction
Oh, it went read-only because it OOPSed:

[39710.419966] ------------[ cut here ]------------
[39710.419969] WARNING: CPU: 1 PID: 5624 at fs/btrfs/extent-tree.c:6226 __btrfs_free_extent+0x873/0xc80()
[39710.419970] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl ipv6 binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek ppdev snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel snd_hda_intel snd_hda_controller kvm snd_hda_codec snd_hda_core microcode snd_hwdep pcspkr snd_pcm snd_timer i2c_i801 snd lpc_ich mfd_core parport_pc battery xts gf128mul aes_x86_64 cbc sha256_generic libiscsi scsi_transport_iscsi tg3 ptp pps_core libphy sky2 r8169 pcnet32 mii e1000 bnx2 fuse nfs lockd grace sunrpc reiserfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod firewire_core hid_sunplus hid_sony hid_samsung hid_pl hid_petalynx hid_gyration usbhid uhci_hcd usb_storage ehci_pci
[39710.419991] ehci_hcd aic94xx libsas qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 cciss 3w_9xxx 3w_ mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg sata_mv sata_sil24 sata_sil pata_marvell
[39710.420003] CPU: 1 PID: 5624 Comm: kworker/u8:7 Tainted: G W 4.1.4-gentoo #1
[39710.420003] Hardware name: ECS H87H3-M/H87H3-M, BIOS 4.6.5 07/16/2013
[39710.420005] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[39710.420006] 8197e672 81794418
[39710.420008] 81049cbc 01846cc5e000 880064d12000 e000
[39710.420009] fffe 8127bc03 000fc277
[39710.420010] Call Trace:
[39710.420012] [81794418] ? dump_stack+0x40/0x50
[39710.420014] [81049cbc] ? warn_slowpath_common+0x7c/0xb0
[39710.420015] [8127bc03] ? __btrfs_free_extent+0x873/0xc80
[39710.420018] [81353ef0] ? cpumask_next_and+0x30/0x50
[39710.420019] [81075c93] ? enqueue_task_fair+0x2c3/0xdb0
[39710.420021] [812e054c] ? btrfs_delayed_ref_lock+0x2c/0x260
[39710.420022] [81280ffc] ? __btrfs_run_delayed_refs+0x42c/0x1280
[39710.420024] [8113cedd] ? __sb_start_write+0x3d/0xe0
[39710.420025] [81285f7e] ? btrfs_run_delayed_refs.part.58+0x5e/0x270
[39710.420026] [81286228] ? delayed_ref_async_start+0x78/0x90
[39710.420028] [812c56f3] ? normal_work_helper+0x73/0x2a0
[39710.420029] [8105ebbc] ? process_one_work+0x13c/0x3d0
[39710.420031] [8105eeb3] ? worker_thread+0x63/0x480
[39710.420032] [8105ee50] ? process_one_work+0x3d0/0x3d0
[39710.420033] [81063a5e] ? kthread+0xce/0xf0
[39710.420034] [81063990] ? kthread_create_on_node+0x180/0x180
[39710.420036] [8179ced2] ? ret_from_fork+0x42/0x70
[39710.420037] [81063990] ? kthread_create_on_node+0x180/0x180
[39710.420038] ---[ end trace 0b4fe6057cd7a1a4 ]---

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller theo...@gmail.com wrote:
> So I tried deleting the files that I think are the problem, and the file system went suddenly read-only, and I got this in dmesg:
>
> A bunch of these first messages:
>
> [39710.420118] item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
> [39710.420118] extent refs 1 gen 166914 flags 1
> [39710.420119] extent data backref root 949 objectid 440675 offset 2621440 count 1
> [39710.420120] item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
> [39710.420120] extent refs 1 gen 166914 flags 1
> [39710.420121] extent data backref root 949 objectid 440675 offset 3145728 count 1
> [39710.420121] item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
> [39710.420122] extent refs 1 gen 166914 flags 1
> [39710.420122] extent data backref root 949 objectid 440675 offset 3670016 count 1
> [39710.420123] item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
> [39710.420123] extent refs 1 gen 166914 flags 1
> [39710.420124] extent data backref root 949 objectid 440675 offset 4194304 count 1
> [39710.420125] item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
> [39710.420125] extent refs 1 gen 166914 flags 1
> [39710.420126] extent data backref root 949 objectid 440675 offset 4718592 count 1
> [39710.420126] item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
> [39710.420127] extent refs 1 gen 166914 flags 1
> [39710.420127] extent data backref root 949 objectid 440675 offset 5242880 count 1
> [39710.420128] BTRFS error (device sdc): unable to find ref byte nr 1668272218112 parent 0 root 949 owner 1032823 offset 655360
> [39710.420129] BTRFS: error (device sdc)
Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction
So I tried deleting the files that I think are the problem, and the file system went suddenly read-only, and I got this in dmesg:

A bunch of these first messages:

[39710.420118] item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
[39710.420118] extent refs 1 gen 166914 flags 1
[39710.420119] extent data backref root 949 objectid 440675 offset 2621440 count 1
[39710.420120] item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
[39710.420120] extent refs 1 gen 166914 flags 1
[39710.420121] extent data backref root 949 objectid 440675 offset 3145728 count 1
[39710.420121] item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
[39710.420122] extent refs 1 gen 166914 flags 1
[39710.420122] extent data backref root 949 objectid 440675 offset 3670016 count 1
[39710.420123] item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
[39710.420123] extent refs 1 gen 166914 flags 1
[39710.420124] extent data backref root 949 objectid 440675 offset 4194304 count 1
[39710.420125] item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
[39710.420125] extent refs 1 gen 166914 flags 1
[39710.420126] extent data backref root 949 objectid 440675 offset 4718592 count 1
[39710.420126] item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
[39710.420127] extent refs 1 gen 166914 flags 1
[39710.420127] extent data backref root 949 objectid 440675 offset 5242880 count 1
[39710.420128] BTRFS error (device sdc): unable to find ref byte nr 1668272218112 parent 0 root 949 owner 1032823 offset 655360
[39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232: errno=-2 No such entry
[39710.420131] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2821: errno=-2 No such entry
[39710.431108] pending csums is 5795840

On Sat, Aug 15, 2015 at 8:51 AM, Timothy Normand Miller theo...@gmail.com wrote:
> I didn't quite understand "profile" and "convert", since I can't find a profile option. Is this something your patch adds?
>
> Before I do that, however, I have to deal with this:
>
> compute0 ~ # btrfs device delete missing /mnt/btrfs
> ERROR: error removing the device 'missing' - Input/output error
>
> [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
> [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
> [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
> [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off 623230976 csum 3298529275 expected csum 1155389604
> [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off 623235072 csum 2603391790 expected csum 1861925401
> [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off 623239168 csum 2044148708 expected csum 3227559459
> [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off 623243264 csum 615351306 expected csum 2720021058
> [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
> [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
> [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
>
> Because of this, it won't delete the missing device. How do I get past this? I'm pretty sure the problem is in some files I want to delete anyhow. Would deleting them solve the problem?
>
> On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain anand.j...@oracle.com wrote:
>>> BTW, when this is all over with, how do I make sure there are really two copies of everything? Will a scrub verify this? Should I run a balance operation?
>>
>> pls use 'btrfs bal' profile and convert to migrate single chunks (if any were created when there were fewer RW-able devices) back to your desired raid1. Do this when all the devices are back online.
>>
>> Kindly note there is a bug in the btrfs VM (volume manager) that you won't be able to bring a device online without an unmount -> mount (I am working on a fix). btrfs-progs will be wrong in this case; don't depend too much on it. So to inspect the inside of the btrfs kernel volume state I generally use: https://patchwork.kernel.org/patch/5816011/ In there, if bdev is null, it indicates the device is scanned but not part of the VM yet. Then an unmount -> mount will bring the device back into the VM.
>>
>>> After applying Anand's patch, I was able to mount my 4-drive RAID1 and bring a new fourth drive online. However, something weird happened where the first "delete missing" only deleted one missing drive and only did a partial duplication. I've posted a bug report here:
>>
>> that seems to be normal to me, unless I am missing something else / clarity.
>>
>> Thanks, Anand

--
Timothy Normand Miller, PhD
Assistant
Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction
I didn't quite understand "profile" and "convert", since I can't find a profile option. Is this something your patch adds?

Before I do that, however, I have to deal with this:

compute0 ~ # btrfs device delete missing /mnt/btrfs
ERROR: error removing the device 'missing' - Input/output error

[13058.298763] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
[13058.298775] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
[13058.298782] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
[13058.298788] BTRFS warning (device sdc): csum failed ino 596 off 623230976 csum 3298529275 expected csum 1155389604
[13058.298794] BTRFS warning (device sdc): csum failed ino 596 off 623235072 csum 2603391790 expected csum 1861925401
[13058.298801] BTRFS warning (device sdc): csum failed ino 596 off 623239168 csum 2044148708 expected csum 3227559459
[13058.298807] BTRFS warning (device sdc): csum failed ino 596 off 623243264 csum 615351306 expected csum 2720021058
[13058.329747] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
[13058.329759] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
[13058.329770] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799

Because of this, it won't delete the missing device. How do I get past this? I'm pretty sure the problem is in some files I want to delete anyhow. Would deleting them solve the problem?

On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain anand.j...@oracle.com wrote:
>> BTW, when this is all over with, how do I make sure there are really two copies of everything? Will a scrub verify this? Should I run a balance operation?
>
> pls use 'btrfs bal' profile and convert to migrate single chunks (if any were created when there were fewer RW-able devices) back to your desired raid1. Do this when all the devices are back online.
>
> Kindly note there is a bug in the btrfs VM (volume manager) that you won't be able to bring a device online without an unmount -> mount (I am working on a fix). btrfs-progs will be wrong in this case; don't depend too much on it. So to inspect the inside of the btrfs kernel volume state I generally use: https://patchwork.kernel.org/patch/5816011/ In there, if bdev is null, it indicates the device is scanned but not part of the VM yet. Then an unmount -> mount will bring the device back into the VM.
>
>> After applying Anand's patch, I was able to mount my 4-drive RAID1 and bring a new fourth drive online. However, something weird happened where the first "delete missing" only deleted one missing drive and only did a partial duplication. I've posted a bug report here:
>
> that seems to be normal to me, unless I am missing something else / clarity.
>
> Thanks, Anand

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
>> Is this "if (!committed_data) {" check now dead code? I also see other similar suspected dead sites in the rest of the series.
>
> You are absolutely right. I have updated the patches.

Have you sent out an updated version of these patches? Maybe I missed it, but I don't think I saw them.

Thanks,

- Ted
Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction
Here's the associated bug report with the full dmesg: https://bugzilla.kernel.org/show_bug.cgi?id=102941

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller theo...@gmail.com wrote:
> So I tried deleting the files that I think are the problem, and the file system went suddenly read-only, and I got this in dmesg:
> [snip]