So, wipe it out and start over or keep debugging?

2015-08-15 Thread Timothy Normand Miller
To those of you who have been helping out with my 4-drive RAID1
situation, is there anything further we should do to investigate this,
in case we can uncover any more bugs, or should I just wipe everything
out and restore from backup?
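If it ends up being a wipe and restore, it might still be worth capturing a
metadata image first so any remaining bugs can be chased afterwards; a minimal
sketch, with the device path and output file as placeholders:

  # btrfs-image -c9 -t4 /dev/sdX /tmp/btrfs-metadata.img

(btrfs-image captures metadata only, not file contents, so the image stays
comparatively small.)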

-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lockup in BTRFS_IOC_CLONE/Kernel 4.2.0-rc5

2015-08-15 Thread Elias Probst
On 08/08/2015 01:29 AM, Elias Probst wrote:
 On 08/07/2015 06:01 AM, Liu Bo wrote:
 Could you do 'echo w > /proc/sysrq-trigger' to gather the whole hang call 
 stack?
 
 Thanks a lot for the feedback. Full call stack output is attached
 (pasting inline makes no sense due to the size of 2423 lines/135k).
 
 In case VGER will strip attachments of this size, I made it available as
 a pastebin here: https://bpaste.net/show/e5e0fd4bbb9f
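For reference, the capture sequence is just the standard sysrq-w dance; a
minimal sketch, with the output file name chosen arbitrarily:

  # echo w > /proc/sysrq-trigger   # ask the kernel to dump stacks of all blocked tasks
  # dmesg > sysrq-w.txt            # save the resulting traces from the kernel log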
 
 
 Here's a quick patch that may address your problem; can you give it a shot
 after getting sysrq-w output?
 
 Will rebuild now with your patch and keep you updated about the outcome.

So far I've been unable to reproduce the lockup using your patch on
4.2.0-rc6.

Thanks a lot!

- Elias






Re: Deleted files cause btrfs-send to fail

2015-08-15 Thread Marc Joliet
On Sat, 15 Aug 2015 05:10:57 +0000 (UTC), Duncan 1i5t5.dun...@cox.net wrote:

 Marc Joliet posted on Fri, 14 Aug 2015 23:37:37 +0200 as excerpted:
 
  (One other thing I found interesting was that btrfs scrub didn't care
  about the link count errors.)
 
 A lot of people are confused about exactly what btrfs scrub does, and 
 expect it to detect and possibly fix stuff it has nothing to do with.  
 It's *not* an fsck.
 
 Scrub does one very useful, but limited, thing.  It systematically 
 verifies that the computed checksums for all data and metadata covered by 
 checksums match the corresponding recorded checksums.  For dup/raid1/
 raid10 modes, if there's a match failure, it will look up the other copy 
 and see if it matches, replacing the invalid block with a new copy of the 
 other one, assuming it's valid.  For raid56 modes, it attempts to compute 
 the valid copy from parity and, again assuming a match after doing so, 
 does the replace.  If a valid copy cannot be found or computed, either 
 because it's damaged too or because there's no second copy or parity to 
 fall back on (single and raid0 modes), then scrub will detect but cannot 
 correct the error.
 
 In routine usage, btrfs automatically does the same thing if it happens 
 to come across checksum errors in its normal IO stream, but it has to 
 come across them first.  Scrub's benefit is that it systematically 
 verifies (and corrects errors where it can) checksums on the entire 
 filesystem, not just the parts that happen to appear in the normal IO 
 stream.

I know all that; I just thought it was interesting and wanted to remark on it.
After thinking about it a bit, of course, it makes perfect sense and is not very
interesting at all: scrub will just verify that the checksums match, no matter
whether the underlying (meta)data is valid or not.
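For reference, the whole scrub interaction is just the following (mount point is
a placeholder):

  # btrfs scrub start /mnt    # systematically verify checksums on all devices
  # btrfs scrub status /mnt   # progress plus corrected/uncorrectable error counts

so there is nothing in it that would ever look at link counts or other fsck-style
consistency.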

 Such checksum errors can be for a few reasons...
 
 I have one ssd that's gradually failing and returns checksum errors 
 fairly regularly.  Were I using a normal filesystem I'd have had to 
 replace it some time ago.  But with btrfs in raid1 mode and regular 
 scrubs (and backups, should they be needed; sometimes I let them get a 
 bit stale, but I do have them and am prepared to live with the stale 
 restored data if I have to), I've been able to keep using the failing 
 device.  When the scrubs hit errors and btrfs does the rewrite from the 
 good copy, a block relocation on the failing device is triggered as well, 
 with the bad block taken out of service and a new one, from the set of 
 spares all modern devices have, taking its place.  Currently, smartctl -A 
 reports a raw value of 904 reallocated sectors, with a standardized value of 
 92.  Before the first reallocated sector, the standardized value was 253, 
 perfect.  With the first reallocated sector, it immediately dropped to 
 100, apparently the rounded percentage of spare sectors left.  It has 
 gradually dropped since then to its current 92, with a threshold value of 
 36.  So while it's gradually failing, there's still plenty of spare 
 sectors left.  Normally I would have replaced the device even so, but 
 I've never actually had the opportunity to watch a slow failure 
 continue to get worse over time, and now that I do I'm a bit curious how 
 things will go, so I'm just letting it happen, tho I do have a 
 replacement device already purchased and ready, when the time comes. 

I'm curious how that will pan out.  My experience with HDDs is that at some
point the sector reallocations start picking up at a somewhat constant (maybe
even accelerating) rate.  I wonder how SSDs behave in this regard.
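The attribute in question is easy enough to watch directly; a sketch, with the
device name as a placeholder:

  # smartctl -A /dev/sdX | grep -i reallocated

which prints the Reallocated_Sector_Ct line(s) with the normalized value,
threshold, and raw count.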

 So real media failure, bitrot, is one reason for bad checksums.  The data 
 read back from the device simply isn't the same data that was stored to 
 it, and the checksum fails as a result.
 
 Of course bad connector cables or storage chipset firmware or hardware is 
 another hardware cause.
 
 Sudden reboot or power loss, with data being actively written and one 
 copy either already updated or not yet touched, while the other is 
 actually being written at the time of the crash so the write isn't 
 completed, is yet another reason for checksum failure.  This one is 
 actually why a scrub can appear to do so much more than it does, because 
 where there's a second copy (or parity) of the data available, scrub can 
 use it to recover the partially written copy (which being partially 
 written fails its checksum verification) to either the completed write 
 state, if the other copy was already written, or the pre-write state, if 
 the other copy hadn't been written at all, yet.  In this way the result 
 is often the same one an fsck would normally produce, detecting and 
 fixing the error, but the mechanism is entirely different -- it only 
 detected and fixed the error because the checksum was bad and it had a 
 good copy it could replace it with, not because it had any smarts about 
 how the filesystem actually worked, and could 

Re: RAID0 wrong (raw) device?

2015-08-15 Thread Ulli Horlacher
On Sat 2015-08-15 (08:02), Anand Jain wrote:

 First of all there is a known issue in handling multiple paths /
 instances of the same device image in btrfs. Fixing this caused
  regression earlier. And my survey thread,
  '[survey] BTRFS_IOC_DEVICES_READY return status',
  almost told me not to fix the bug.

I have subscribed to this list this week; I am a newbie :-)


  There is now a new behaviour: after the btrfs mount, I briefly see the
  wrong raw device /dev/sde, and a few seconds later there is the correct
  /dev/drbd3:
 
  Yep, possible. But it does not mean that the btrfs kernel is using the new
  path; it's just a reporting (bug).

What is the reporting bug: /dev/sde or /dev/drbd3?
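(Is there a way to see which path the kernel itself holds for each devid? I
would guess the sysfs view; a sketch, using the filesystem UUID from the output
below:

  root@toy02:~# ls /sys/fs/btrfs/411af13f-6cae-4f03-99dc-5941acb3135b/devices/

which should list one entry per block device the kernel is actually using.)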

  root@toy02:/etc# btrfs filesystem show -m
  Label: data  uuid: 411af13f-6cae-4f03-99dc-5941acb3135b
   Total devices 2 FS bytes used 109.56GiB
   devid    3 size 1.82TiB used 63.03GiB path /dev/drbd2
   devid    4 size 1.82TiB used 63.03GiB path /dev/drbd3
 
 
  Still, the kernel sees 3 instead of (really) 2 HGST drives:
 
  root@toy02:/etc# hdparm -I /dev/sdb | grep Number:
   Model Number:   HGST HUS724020ALA640
   Serial Number:  PN2134P5G2P2AX
 
  root@toy02:/etc# hdparm -I /dev/sde | grep Number:
   Model Number:   HGST HUS724020ALA640
   Serial Number:  PN2134P5G2P2AX
 
  This is important to know but not a btrfs issue. Do you have multiple
  host paths reaching this device with serial # PN2134P5G2P2AX?

root@toy02:~# find /dev -ls | grep PN2134P5G2P2AX
 143540 lrwxrwxrwx   1 root root   17 Aug 14 09:00 /dev/drbd/by-disk/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../../../drbd3
 136400 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../sdb

root@toy02:~# find /dev -ls | grep sdb
  74170 brw-rw----   1 root disk 8, 16 Aug 13 16:25 /dev/sdb
 123660 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-path/pci-:08:00.0-sas-0x12210200-lun-0 -> ../../sdb
 136410 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-id/wwn-0x5000cca24ec137db -> ../../sdb
 136400 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-id/ata-HGST_HUS724020ALA640_PN2134P5G2P2AX -> ../../sdb
 123560 lrwxrwxrwx   1 root root    6 Aug 13 16:25 /dev/block/8:16 -> ../sdb

root@toy02:~# find /dev -ls | grep sde
 133530 brw-rw----   1 root disk 8, 64 Aug 13 16:24 /dev/sde
 157250 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-uuid/411af13f-6cae-4f03-99dc-5941acb3135b -> ../../sde
 157240 lrwxrwxrwx   1 root root    9 Aug 13 16:25 /dev/disk/by-label/data -> ../../sde
  93940 lrwxrwxrwx   1 root root    9 Aug 13 16:24 /dev/disk/by-path/pci-:08:00.0-scsi-0:1:2:0 -> ../../sde
  93870 lrwxrwxrwx   1 root root    6 Aug 13 16:24 /dev/block/8:64 -> ../sde
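(Would it help to name the intended devices explicitly at mount time? btrfs has
a device= mount option for multi-device filesystems; a sketch with my drbd
paths, mount point being a placeholder:

  root@toy02:~# mount -o device=/dev/drbd2,device=/dev/drbd3 /dev/drbd2 /mnt/data

though, per Anand's note above, the path the kernel reports may still lag behind
what it actually uses.)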

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum IZUS/TIK E-Mail: horlac...@tik.uni-stuttgart.de
Universitaet Stuttgart Tel:++49-711-68565868
Allmandring 30a        Fax:++49-711-682357
70550 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:55ce81a6.5070...@oracle.com


Re: The performance is not as expected when used several disks on raid0.

2015-08-15 Thread Duncan
Austin S Hemmelgarn posted on Fri, 14 Aug 2015 15:58:30 -0400 as
excerpted:

 FWIW, running BTRFS on top of MDRAID actually works very well,
 especially for BTRFS raid1 on top of MD-RAID0 (I get an almost 50%
 performance increase for this usage over BTRFS raid10, although most of
 this is probably due to how btrfs dispatches I/O's to disks in
 multi-disk setups).

Of course that's effectively a raid01, which is normally considered a 
mistakenly reversed raid10, because of the IO cost of a rebuild should a 
device fail: the whole raid0 on the failed raid1 side has to be re-replicated 
from the other side, versus only having to re-replicate the one failed device 
in a raid10 arrangement.

However, in this case it's a very smart arrangement, actually, the only 
md-raid-under-btrfs-raid arrangement that makes real sense (well, other 
than raid00, raid0 at both levels, perhaps), in particular because the 
btrfs raid1 on top still gives you the full benefit of btrfs file 
integrity features as well as the usual raid1 redundancy, tho in this 
case it's only at the one raid0 against the other as the pair of btrfs 
raid1 copies.  And the mdraid0 is much better optimized than btrfs raid0, 
so there's that bonus, while at the same time the btrfs raid1 redundancy 
nicely balances the usual Russian Roulette quality of raid0.

Very nice configuration! =:^)

Thanks for mentioning it, as I guess I was effectively ruling it out as 
an option before even really considering it due to the usual raid10's 
better than raid01 thing, and thus was entirely blind to the 
possibility.  Which was bad, because as I alluded to, mdraid's lack of 
file integrity features and thus lack of any way to have btrfs scrub 
properly filter down to the mdraid level when there's mdraid level 
redundancy, kind of makes a mess of things, otherwise.  But btrfs raid1 
on mdraid0 effectively balances and eliminates the negatives at each 
level with the strengths of the other level, and is really a quite 
awesome solution, that until now I was entirely blinded to! =:^)
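For anyone wanting to replicate Austin's layout, a rough sketch (device names
are of course placeholders) is simply two mdraid0 stripes with a btrfs raid1
across them:

  # mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  # mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  # mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  # mount /dev/md0 /mnt

That way btrfs keeps one copy of everything on each stripe set and can repair a
bad checksum from the other, while md does the striping it is well optimized for.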

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction

2015-08-15 Thread Timothy Normand Miller
Oh, it went read-only because it OOPSed:

[39710.419966] [ cut here ]
[39710.419969] WARNING: CPU: 1 PID: 5624 at
fs/btrfs/extent-tree.c:6226 __btrfs_free_extent+0x873/0xc80()
[39710.419970] Modules linked in: nfsd auth_rpcgss oid_registry
nfs_acl ipv6 binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek
ppdev snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel
snd_hda_intel snd_hda_controller kvm snd_hda_codec snd_hda_core
microcode snd_hwdep pcspkr snd_pcm snd_timer i2c_i801 snd lpc_ich
mfd_core parport_pc battery xts gf128mul aes_x86_64 cbc sha256_generic
libiscsi scsi_transport_iscsi tg3 ptp pps_core libphy sky2 r8169
pcnet32 mii e1000 bnx2 fuse nfs lockd grace sunrpc reiserfs multipath
linear raid10 raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror
dm_region_hash dm_log dm_mod firewire_core hid_sunplus hid_sony
hid_samsung hid_pl hid_petalynx hid_gyration usbhid uhci_hcd
usb_storage ehci_pci
[39710.419991]  ehci_hcd aic94xx libsas qla2xxx megaraid_sas
megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 cciss 3w_9xxx
3w_ mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi
mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx
gdth advansys initio BusLogic arcmsr aic7xxx aic79xx
scsi_transport_spi sg sata_mv sata_sil24 sata_sil pata_marvell
[39710.420003] CPU: 1 PID: 5624 Comm: kworker/u8:7 Tainted: GW
  4.1.4-gentoo #1
[39710.420003] Hardware name: ECS H87H3-M/H87H3-M, BIOS 4.6.5 07/16/2013
[39710.420005] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[39710.420006]   8197e672 81794418

[39710.420008]  81049cbc 01846cc5e000 880064d12000
e000
[39710.420009]  fffe  8127bc03
000fc277
[39710.420010] Call Trace:
[39710.420012]  [81794418] ? dump_stack+0x40/0x50
[39710.420014]  [81049cbc] ? warn_slowpath_common+0x7c/0xb0
[39710.420015]  [8127bc03] ? __btrfs_free_extent+0x873/0xc80
[39710.420018]  [81353ef0] ? cpumask_next_and+0x30/0x50
[39710.420019]  [81075c93] ? enqueue_task_fair+0x2c3/0xdb0
[39710.420021]  [812e054c] ? btrfs_delayed_ref_lock+0x2c/0x260
[39710.420022]  [81280ffc] ? __btrfs_run_delayed_refs+0x42c/0x1280
[39710.420024]  [8113cedd] ? __sb_start_write+0x3d/0xe0
[39710.420025]  [81285f7e] ? btrfs_run_delayed_refs.part.58+0x5e/0x270
[39710.420026]  [81286228] ? delayed_ref_async_start+0x78/0x90
[39710.420028]  [812c56f3] ? normal_work_helper+0x73/0x2a0
[39710.420029]  [8105ebbc] ? process_one_work+0x13c/0x3d0
[39710.420031]  [8105eeb3] ? worker_thread+0x63/0x480
[39710.420032]  [8105ee50] ? process_one_work+0x3d0/0x3d0
[39710.420033]  [81063a5e] ? kthread+0xce/0xf0
[39710.420034]  [81063990] ? kthread_create_on_node+0x180/0x180
[39710.420036]  [8179ced2] ? ret_from_fork+0x42/0x70
[39710.420037]  [81063990] ? kthread_create_on_node+0x180/0x180
[39710.420038] ---[ end trace 0b4fe6057cd7a1a4 ]---

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller
theo...@gmail.com wrote:
 So I tried deleting the files that I think are the problem, and the
 file system went suddenly read-only, and I got this in dmesg:

 A bunch of these first messages:
 [39710.420118]  item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 
 53
 [39710.420118]  extent refs 1 gen 166914 flags 1
 [39710.420119]  extent data backref root 949 objectid 440675
 offset 2621440 count 1
 [39710.420120]  item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 
 53
 [39710.420120]  extent refs 1 gen 166914 flags 1
 [39710.420121]  extent data backref root 949 objectid 440675
 offset 3145728 count 1
 [39710.420121]  item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 
 53
 [39710.420122]  extent refs 1 gen 166914 flags 1
 [39710.420122]  extent data backref root 949 objectid 440675
 offset 3670016 count 1
 [39710.420123]  item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 
 53
 [39710.420123]  extent refs 1 gen 166914 flags 1
 [39710.420124]  extent data backref root 949 objectid 440675
 offset 4194304 count 1
 [39710.420125]  item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 
 53
 [39710.420125]  extent refs 1 gen 166914 flags 1
 [39710.420126]  extent data backref root 949 objectid 440675
 offset 4718592 count 1
 [39710.420126]  item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 
 53
 [39710.420127]  extent refs 1 gen 166914 flags 1
 [39710.420127]  extent data backref root 949 objectid 440675
 offset 5242880 count 1
 [39710.420128] BTRFS error (device sdc): unable to find ref byte nr
 1668272218112 parent 0 root 949  owner 1032823 offset 655360
 [39710.420129] BTRFS: error (device sdc) 

Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction

2015-08-15 Thread Timothy Normand Miller
So I tried deleting the files that I think are the problem, and the
file system went suddenly read-only, and I got this in dmesg:

A bunch of these first messages:
[39710.420118]  item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
[39710.420118]  extent refs 1 gen 166914 flags 1
[39710.420119]  extent data backref root 949 objectid 440675
offset 2621440 count 1
[39710.420120]  item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
[39710.420120]  extent refs 1 gen 166914 flags 1
[39710.420121]  extent data backref root 949 objectid 440675
offset 3145728 count 1
[39710.420121]  item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
[39710.420122]  extent refs 1 gen 166914 flags 1
[39710.420122]  extent data backref root 949 objectid 440675
offset 3670016 count 1
[39710.420123]  item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
[39710.420123]  extent refs 1 gen 166914 flags 1
[39710.420124]  extent data backref root 949 objectid 440675
offset 4194304 count 1
[39710.420125]  item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
[39710.420125]  extent refs 1 gen 166914 flags 1
[39710.420126]  extent data backref root 949 objectid 440675
offset 4718592 count 1
[39710.420126]  item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
[39710.420127]  extent refs 1 gen 166914 flags 1
[39710.420127]  extent data backref root 949 objectid 440675
offset 5242880 count 1
[39710.420128] BTRFS error (device sdc): unable to find ref byte nr
1668272218112 parent 0 root 949  owner 1032823 offset 655360
[39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232:
errno=-2 No such entry
[39710.420131] BTRFS: error (device sdc) in
btrfs_run_delayed_refs:2821: errno=-2 No such entry
[39710.431108] pending csums is 5795840

On Sat, Aug 15, 2015 at 8:51 AM, Timothy Normand Miller
theo...@gmail.com wrote:
 I didn't quite understand profile and convert, since I can't find a
 profile option.  Is this something your patch adds?

 Before I do that, however, I have to deal with this:

 compute0 ~ # btrfs device delete missing /mnt/btrfs
 ERROR: error removing the device 'missing' - Input/output error

 [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off
 623218688 csum 2756583412 expected csum 4104700738
 [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off
 623222784 csum 2568037276 expected csum 275151414
 [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off
 623226880 csum 2227564114 expected csum 3824181799
 [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off
 623230976 csum 3298529275 expected csum 1155389604
 [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off
 623235072 csum 2603391790 expected csum 1861925401
 [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off
 623239168 csum 2044148708 expected csum 3227559459
 [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off
 623243264 csum 615351306 expected csum 2720021058
 [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off
 623218688 csum 2756583412 expected csum 4104700738
 [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off
 623222784 csum 2568037276 expected csum 275151414
 [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off
 623226880 csum 2227564114 expected csum 3824181799

 Because of this, it won't delete the missing device.  How do I get
 past this?  I'm pretty sure the problem is in some files I want to
 delete anyhow.  Would deleting them solve the problem?

 On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain anand.j...@oracle.com wrote:

 BTW, when this is all over with, how do I make sure there are really
 two copies of everything?  Will a scrub verify this?  Should I run a
 balance operation?

 pls use 'btrfs bal profile and convert' to migrate single chunks (if any were
 created when there was a lesser number of RW-able devices) back to your
 desired raid1. Do this when all the devices are back online. Kindly note
 there is a bug in the btrfs VM (volume manager) such that you won't be able to
 bring a device online without unmount -> mount (I am working to fix it).
 btrfs-progs will be wrong in this case; don't depend too much on it.
 So to understand the inside of the btrfs kernel volume I generally use:
 https://patchwork.kernel.org/patch/5816011/

 In there, if bdev is null, it indicates the device is scanned but not yet part
 of the VM. Then unmount -> mount will bring the device back into the VM.

 After applying Anand's patch, I was able to mount my 4-drive RAID1
 and bring a new fourth drive online.

 However, something weird happened
 where the first delete missing only deleted one missing drive and
 only did a partial duplication.  I've posted a bug report here:

 That seems to be normal to me, unless I am missing something else / clarity.


 Thanks, Anand



 --
 Timothy Normand Miller, PhD
 Assistant 

Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction

2015-08-15 Thread Timothy Normand Miller
I didn't quite understand profile and convert, since I can't find a
profile option.  Is this something your patch adds?
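If it just means the ordinary balance convert filters, I suppose the invocation
would be something like:

compute0 ~ # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/btrfs

i.e. rewrite any single-profile chunks (created while devices were missing) back
to raid1 once everything is online again.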

Before I do that, however, I have to deal with this:

compute0 ~ # btrfs device delete missing /mnt/btrfs
ERROR: error removing the device 'missing' - Input/output error

[13058.298763] BTRFS warning (device sdc): csum failed ino 596 off
623218688 csum 2756583412 expected csum 4104700738
[13058.298775] BTRFS warning (device sdc): csum failed ino 596 off
623222784 csum 2568037276 expected csum 275151414
[13058.298782] BTRFS warning (device sdc): csum failed ino 596 off
623226880 csum 2227564114 expected csum 3824181799
[13058.298788] BTRFS warning (device sdc): csum failed ino 596 off
623230976 csum 3298529275 expected csum 1155389604
[13058.298794] BTRFS warning (device sdc): csum failed ino 596 off
623235072 csum 2603391790 expected csum 1861925401
[13058.298801] BTRFS warning (device sdc): csum failed ino 596 off
623239168 csum 2044148708 expected csum 3227559459
[13058.298807] BTRFS warning (device sdc): csum failed ino 596 off
623243264 csum 615351306 expected csum 2720021058
[13058.329747] BTRFS warning (device sdc): csum failed ino 596 off
623218688 csum 2756583412 expected csum 4104700738
[13058.329759] BTRFS warning (device sdc): csum failed ino 596 off
623222784 csum 2568037276 expected csum 275151414
[13058.329770] BTRFS warning (device sdc): csum failed ino 596 off
623226880 csum 2227564114 expected csum 3824181799

Because of this, it won't delete the missing device.  How do I get
past this?  I'm pretty sure the problem is in some files I want to
delete anyhow.  Would deleting them solve the problem?
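For what it's worth, mapping inode 596 from those warnings back to a path should
just be:

compute0 ~ # btrfs inspect-internal inode-resolve 596 /mnt/btrfs

assuming the file lives in the subvolume mounted there; that would tell me
whether it really is one of the files I want to delete anyway.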

On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain anand.j...@oracle.com wrote:

 BTW, when this is all over with, how do I make sure there are really
 two copies of everything?  Will a scrub verify this?  Should I run a
 balance operation?

 pls use 'btrfs bal profile and convert' to migrate single chunks (if any were
 created when there was a lesser number of RW-able devices) back to your
 desired raid1. Do this when all the devices are back online. Kindly note
 there is a bug in the btrfs VM (volume manager) such that you won't be able to
 bring a device online without unmount -> mount (I am working to fix it).
 btrfs-progs will be wrong in this case; don't depend too much on it.
 So to understand the inside of the btrfs kernel volume I generally use:
 https://patchwork.kernel.org/patch/5816011/

 In there, if bdev is null, it indicates the device is scanned but not yet part
 of the VM. Then unmount -> mount will bring the device back into the VM.

 After applying Anand's patch, I was able to mount my 4-drive RAID1
 and bring a new fourth drive online.

 However, something weird happened
 where the first delete missing only deleted one missing drive and
 only did a partial duplication.  I've posted a bug report here:

 That seems to be normal to me, unless I am missing something else / clarity.


 Thanks, Anand



-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

2015-08-15 Thread Theodore Ts'o
On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
  Is this if (!committed_data) { check now dead code?
  
  I also see other similar suspected dead sites in the rest of the series.
 
 You are absolutely right. I have updated the patches.

Have you sent out an updated version of these patches?  Maybe I missed
it, but I don't think I saw them.

Thanks,

- Ted


Re: delete missing with two missing devices doesn't delete both missing, only does a partial reconstruction

2015-08-15 Thread Timothy Normand Miller
Here's the associated bug report with the full dmesg:

https://bugzilla.kernel.org/show_bug.cgi?id=102941

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller
theo...@gmail.com wrote:
 So I tried deleting the files that I think are the problem, and the
 file system went suddenly read-only, and I got this in dmesg:

 A bunch of these first messages:
 [39710.420118]  item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 
 53
 [39710.420118]  extent refs 1 gen 166914 flags 1
 [39710.420119]  extent data backref root 949 objectid 440675
 offset 2621440 count 1
 [39710.420120]  item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 
 53
 [39710.420120]  extent refs 1 gen 166914 flags 1
 [39710.420121]  extent data backref root 949 objectid 440675
 offset 3145728 count 1
 [39710.420121]  item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 
 53
 [39710.420122]  extent refs 1 gen 166914 flags 1
 [39710.420122]  extent data backref root 949 objectid 440675
 offset 3670016 count 1
 [39710.420123]  item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 
 53
 [39710.420123]  extent refs 1 gen 166914 flags 1
 [39710.420124]  extent data backref root 949 objectid 440675
 offset 4194304 count 1
 [39710.420125]  item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 
 53
 [39710.420125]  extent refs 1 gen 166914 flags 1
 [39710.420126]  extent data backref root 949 objectid 440675
 offset 4718592 count 1
 [39710.420126]  item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 
 53
 [39710.420127]  extent refs 1 gen 166914 flags 1
 [39710.420127]  extent data backref root 949 objectid 440675
 offset 5242880 count 1
 [39710.420128] BTRFS error (device sdc): unable to find ref byte nr
 1668272218112 parent 0 root 949  owner 1032823 offset 655360
 [39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232:
 errno=-2 No such entry
 [39710.420131] BTRFS: error (device sdc) in
 btrfs_run_delayed_refs:2821: errno=-2 No such entry
 [39710.431108] pending csums is 5795840

 On Sat, Aug 15, 2015 at 8:51 AM, Timothy Normand Miller
 theo...@gmail.com wrote:
 I didn't quite understand profile and convert, since I can't find a
 profile option.  Is this something your patch adds?

 Before I do that, however, I have to deal with this:

 compute0 ~ # btrfs device delete missing /mnt/btrfs
 ERROR: error removing the device 'missing' - Input/output error

 [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off
 623218688 csum 2756583412 expected csum 4104700738
 [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off
 623222784 csum 2568037276 expected csum 275151414
 [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off
 623226880 csum 2227564114 expected csum 3824181799
 [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off
 623230976 csum 3298529275 expected csum 1155389604
 [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off
 623235072 csum 2603391790 expected csum 1861925401
 [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off
 623239168 csum 2044148708 expected csum 3227559459
 [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off
 623243264 csum 615351306 expected csum 2720021058
 [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off
 623218688 csum 2756583412 expected csum 4104700738
 [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off
 623222784 csum 2568037276 expected csum 275151414
 [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off
 623226880 csum 2227564114 expected csum 3824181799

 Because of this, it won't delete the missing device.  How do I get
 past this?  I'm pretty sure the problem is in some files I want to
 delete anyhow.  Would deleting them solve the problem?

 On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain anand.j...@oracle.com wrote:

 BTW, when this is all over with, how do I make sure there are really
 two copies of everything?  Will a scrub verify this?  Should I run a
 balance operation?

 pls use 'btrfs bal profile and convert' to migrate single chunks (if any were
 created when there was a lesser number of RW-able devices) back to your
 desired raid1. Do this when all the devices are back online. Kindly note
 there is a bug in the btrfs VM (volume manager) such that you won't be able to
 bring a device online without unmount -> mount (I am working to fix it).
 btrfs-progs will be wrong in this case; don't depend too much on it.
 So to understand the inside of the btrfs kernel volume I generally use:
 https://patchwork.kernel.org/patch/5816011/

 In there, if bdev is null, it indicates the device is scanned but not yet part
 of the VM. Then unmount -> mount will bring the device back into the VM.

 After applying Anand's patch, I was able to mount my 4-drive RAID1
 and bring a new fourth drive online.

 However, something weird happened
 where the first delete missing only