Re: ENOSPC errors during balance
Marc Joliet posted on Tue, 22 Jul 2014 01:30:22 +0200 as excerpted: > And now that the background deletion of the old snapshots is done, the file > system ended up at: > > # btrfs filesystem df /run/media/marcec/MARCEC_BACKUP > Data, single: total=219.00GiB, used=140.13GiB > System, DUP: total=32.00MiB, used=36.00KiB > Metadata, DUP: total=4.50GiB, used=2.40GiB > unknown, single: total=512.00MiB, used=0.00 > > I don't know how reliable du is for this, but I used it to estimate how much > used data I should expect, and I get 138 GiB. That means that the snapshots > yield about 2 GiB "overhead", which is very reasonable, I think. Obviously > I'll be starting a full balance now. FWIW, the balance should reduce the data total quite a bit, to 141-ish GiB (might be 142 or 145, but it should definitely come down from 219 GiB), because the spread between total and used is relatively high, now, and balance is what's used to bring that back down. Metadata total will probably come down a bit as well, to 3.00 GiB or so. What's going on there is this: Btrfs allocates and deallocates data and metadata in two stages. First it allocates chunks, 1 GiB in size for data, 256 MiB in size for metadata, but because metadata is dup by default it allocates two chunks so half a GiB at a time, there. Then the actual file data and metadata can be written into the pre-allocated chunks, filling them up. As they near full, more chunks will be allocated from the unallocated pool as necessary. But on file deletion, btrfs only automatically handles the file data/metadata level; it doesn't (yet) automatically deallocate the chunks, nor can it change the allocation from say a data chunk to a metadata chunk. So when a chunk is allocated, it stays allocated. That's the spread you see in btrfs filesystem df, between total and used, for each chunk type. The way to recover those allocated but unused chunks to the unallocated pool, so they can be reallocated between data and metadata as necessary, is with a balance. That balance, therefore, should reduce the spread seen in the above between total and used. Meanwhile, btrfs filesystem df shows the spread between allocated and used for each type, but what about unallocated? Simple. Btrfs filesystem show lists total filesystem size as well as allocated usage for each device. (The total line is something else, I recommend ignoring it as it's simply confusing. Only pay attention to the individual device lines.) Thus, to get a proper picture of the space usage status on a btrfs filesystem, you must have both the btrfs filesystem show and btrfs filesystem df output for that filesystem, show to tell you how much of the total space is chunk-allocated for each device, df to tell you what those allocations are, and how much of the chunk-allocated space is actually used, for each allocation type. It's wise to keep track of the show output in particular, and when the spread between used (allocated) and total for each device gets low, under a few GiB, check btrfs fi df and see what's using that space unnecessarily and then do a balance to recover it, if possible. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
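[Editor's note: as a concrete illustration of the show + df + balance workflow described above, the sequence would look something like the sketch below. The mount point is the one from this thread; the usage thresholds are only example values, not a recommendation from the original post.]

# how much of each device is chunk-allocated
btrfs filesystem show /run/media/marcec/MARCEC_BACKUP

# what those allocations are, and how much of them is actually used
btrfs filesystem df /run/media/marcec/MARCEC_BACKUP

# reclaim mostly-empty chunks; -dusage/-musage only rewrite chunks that are
# less than the given percentage used, so this is cheaper than a full balance
btrfs balance start -dusage=50 -musage=50 /run/media/marcec/MARCEC_BACKUP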
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA
>>> drives), an 8.3ms rotational latency (half a rotation), an average
>>> 64kb write and a 100MB/S streaming write speed, each write comes in
>>> at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this
>>> comes out to ~3MB/S, DISK LIMITED.
>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he
>>> says he has.
>>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the
> filesystem layer.

If we were talking about a normal raid, yes. But we're talking about
btrFS, note the FS for filesystem, so indeed it *IS* the filesystem layer.
Now this particular "filesystem" /does/ happen to have raid properties as
well, but it's definitely filesystem level...

> It does not matter at all what the average file size is.

... and the file size /does/ matter.

> Raid rebuild is really only limited by disk i/o speed when performing a
> linear read of the whole spindle using huge i/o sizes,
> or, if you have multiple spindles on the same bus, the bus saturation
> speed.

Makes sense... if you're dealing at the raid level. If we were talking
about dmraid or mdraid... and they're both much more mature and optimized,
as well, so 50 MiB/sec, per spindle in parallel, would indeed be a
reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development)
btrfs both makes far better data validity guarantees, and does a lot more
complex processing what with COW and snapshotting, etc, of course in
addition to the normal filesystem level stuff AND the raid-level stuff it
does.

> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
> when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/sec
total throughput, one spindle at a time, for btrfs.

> That is for the naive rebuild that rebuilds every single stripe. A
> smarter rebuild that knows which stripes are unused can skip the unused
> stripes and thus become even faster than that.
>
> Now, that the rebuild is off by an order of magnitude is by design but
> should be fixed at some stage, but with the current state of btrfs it is
> probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full
streaming speed across all spindles at once. But it can and will certainly
get much better than it is, once the focus moves to optimization.

*AND*, because it /does/ know which areas of the device are actually in
use, once btrfs is optimized, it's quite likely that despite the slower raw
speed, because it won't have to deal with the unused area, at least with
the typically 20-60% unused filesystems most people run, rebuild times will
match or be faster than raid-layer-only technologies that must rebuild the
entire device, because they do /not/ know which areas are unused.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
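[Editor's note: one way to sanity-check the "~5 MiB/s total, one spindle at a time" observation is to watch per-device throughput while a rebuild or balance runs. A minimal sketch; device names are placeholders and iostat comes from the sysstat package.]

# in one shell, start the operation being measured, e.g.:
btrfs balance start /mnt

# in another shell, watch per-device throughput in MB/s, refreshed every 5 seconds
iostat -dmx 5 /dev/sdb /dev/sdc /dev/sdd /dev/sde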
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On 07/21/2014 10:00 PM, TM wrote:
> Wang Shilong cn.fujitsu.com> writes:
>> Just my two cents:
>>
>> Since 'btrfs replace' supports RAID10, I suppose using the replace
>> operation is better than 'device removal and add'.
>>
>> Another Question is related to btrfs snapshot-aware balance.
>> How many snapshots did you have in your system?
>>
>> Of course, during balance/resize/device removal operations, you could
>> still snapshot, but fewer snapshots should speed things up!
>>
>> Anyway 'btrfs replace' is implemented more efficiently than
>> 'device removal and add'.
>
> Hi Wang, just one subvolume, no snapshots or anything else.
>
> device replace: to tell you the truth I have not used it in the past.
> Most of my testing was done 2 years ago. So in this 'kind of production'
> system I did not try it. But if I knew that it was faster, perhaps I
> could have used it. Does anyone have statistics for such a replace and
> the time it takes?

I don't have specific statistics about this. The conclusion comes from the
implementation differences between replace and 'device removal'.

> Also, can replace be used when one device is missing? Can't find
> documentation.
> eg. btrfs replace start missing /dev/sdXX

The latest btrfs-progs includes a man page for btrfs-replace.
Actually, you could use it something like:

btrfs replace start |

You could use 'btrfs file show' to see the missing device id, and then run
btrfs replace.

Thanks,
Wang

> TM
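[Editor's note: to make the suggestion above concrete, a replace of a missing device might look roughly like the following sketch. The device id "3" and the target device are only examples; check the 'btrfs filesystem show' output for the real missing devid.]

# find the devid reported as missing
btrfs filesystem show /mnt

# replace that devid with a new device, then watch progress
btrfs replace start 3 /dev/sdXX /mnt
btrfs replace status /mnt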
Re: ENOSPC errors during balance
Am Tue, 22 Jul 2014 00:30:57 +0200 schrieb Marc Joliet : > Am Mon, 21 Jul 2014 15:22:16 +0200 > schrieb Marc Joliet : > > > Am Sun, 20 Jul 2014 21:44:40 +0200 > > schrieb Marc Joliet : > > > > [...] > > > What I did: > > > > > > - delete the single largest file on the file system, a 12 GB VM image, > > > along > > > with all subvolumes that contained it > > > - rsync it over again > > [...] > > > > I want to point out at this point, though, that doing those two steps freed > > a > > disproportionate amount of space. The image file is only 12 GB, and it > > hadn't > > changed in any of the snapshots (I haven't used this VM since June), so that > > "subvolume delete -c " returned after a few seconds. Yet > > deleting it > > seems to have freed up twice as much. You can see this from the "filesystem > > df" > > output: before, "used" was at 229.04 GiB, and after deleting it and copying > > it > > back (and after a day's worth of backups) went down to 218 GiB. > > > > Does anyone have any idea how this happened? > > > > Actually, now I remember something that is probably related: when I first > > moved to my current backup scheme last week, I first copied the data from > > the > > last rsnapshot based backup with "cp --reflink" to the new backup location, > > but > > forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it > > had > > already copied a lot, and I was too impatient to start over; after all, the > > data hadn't changed. Then, when rsync (with --inplace) ran for the first > > time, > > all of these files with wrong permissions and different time stamps were > > copied > > over, but for some reason, the space used increased *greatly*; *much* more > > than > > I would expect from changed metadata. > > > > The total size of the file system data should be around 142 GB (+ > > snapshots), > > but, well, it's more than 1.5 times as much. > > > > Perhaps cp --reflink treats hard links differently than expected? I would > > have > > expected the data pointed to by the hard link to have been referenced, but > > maybe something else happened? > > Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it > freed up whatever was taking up so much space, so as of now the file system > uses only 169.14 GiB (from 218). Weird. And now that the background deletion of the old snapshots is done, the file system ended up at: # btrfs filesystem df /run/media/marcec/MARCEC_BACKUP Data, single: total=219.00GiB, used=140.13GiB System, DUP: total=32.00MiB, used=36.00KiB Metadata, DUP: total=4.50GiB, used=2.40GiB unknown, single: total=512.00MiB, used=0.00 I don't know how reliable du is for this, but I used it to estimate how much used data I should expect, and I get 138 GiB. That means that the snapshots yield about 2 GiB "overhead", which is very reasonable, I think. Obviously I'll be starting a full balance now. I still think this whole... thing is very odd, hopefully somebody can shed some light on it for me (maybe it's obvious, but I don't see it). -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
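[Editor's note: for reference, the full balance mentioned above can be monitored from a second shell; a minimal sketch using the same mount point as in the thread.]

btrfs balance start /run/media/marcec/MARCEC_BACKUP

# from another shell, check how far it has gotten
btrfs balance status -v /run/media/marcec/MARCEC_BACKUP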
Re: ENOSPC errors during balance
Am Mon, 21 Jul 2014 15:22:16 +0200 schrieb Marc Joliet : > Am Sun, 20 Jul 2014 21:44:40 +0200 > schrieb Marc Joliet : > > [...] > > What I did: > > > > - delete the single largest file on the file system, a 12 GB VM image, along > > with all subvolumes that contained it > > - rsync it over again > [...] > > I want to point out at this point, though, that doing those two steps freed a > disproportionate amount of space. The image file is only 12 GB, and it hadn't > changed in any of the snapshots (I haven't used this VM since June), so that > "subvolume delete -c " returned after a few seconds. Yet deleting > it > seems to have freed up twice as much. You can see this from the "filesystem > df" > output: before, "used" was at 229.04 GiB, and after deleting it and copying it > back (and after a day's worth of backups) went down to 218 GiB. > > Does anyone have any idea how this happened? > > Actually, now I remember something that is probably related: when I first > moved to my current backup scheme last week, I first copied the data from the > last rsnapshot based backup with "cp --reflink" to the new backup location, > but > forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it had > already copied a lot, and I was too impatient to start over; after all, the > data hadn't changed. Then, when rsync (with --inplace) ran for the first > time, > all of these files with wrong permissions and different time stamps were > copied > over, but for some reason, the space used increased *greatly*; *much* more > than > I would expect from changed metadata. > > The total size of the file system data should be around 142 GB (+ snapshots), > but, well, it's more than 1.5 times as much. > > Perhaps cp --reflink treats hard links differently than expected? I would > have > expected the data pointed to by the hard link to have been referenced, but > maybe something else happened? Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it freed up whatever was taking up so much space, so as of now the file system uses only 169.14 GiB (from 218). Weird. -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
Testing with flaky disk
List, btrfs developers.

I started working on a test tool for SCSI initiators and filesystem folks.
It is an iSCSI target that implements a bad/flaky disk where you can set
precise controls of how/what is broken, which you can use to test error and
recovery paths in the initiator/filesystem.

The tool is available at:
https://github.com/rsahlberg/flaky-stgt.git
and is a modified version of the TGTD iscsi target.

Right now it is just an initial prototype and it needs more work to add
more types of errors as well as making it more user-friendly. But it is
still useful enough to illustrate certain failure cases which could be
helpful to btrfs and others. Let me illustrate.

Let's start by creating a BTRFS filesystem spanning three 1G disks:

#
# Create three disks and export them through flaky iSCSI
#
truncate -s 1G /data/tmp/disk1.img
truncate -s 1G /data/tmp/disk2.img
truncate -s 1G /data/tmp/disk3.img
killall -9 tgtd
./usr/tgtd -f -d 1 &
sleep 3
./usr/tgtadm --op new --mode target --tid 1 -T iqn.ronnie.test
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 1 -b /data/tmp/disk1.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 2 -b /data/tmp/disk2.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 3 -b /data/tmp/disk3.img --blocksize=4096
./usr/tgtadm --op bind --mode target --tid 1 -I ALL

#
# connect to the three disks
#
iscsiadm --mode discoverydb --type sendtargets --portal 127.0.0.1 --discover
iscsiadm --mode node --targetname iqn.ronnie.test --portal 127.0.0.1:3260 --login

#
# check dmesg, you should now have three new 1G disks
#
# Use: iscsiadm --mode node --targetname iqn.ronnie.test \
#      --portal 127.0.0.1:3260 --logout
# to disconnect the disks when you are finished.

# create a btrfs filesystem
mkfs.btrfs -f -d raid1 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-2 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3

# mount the filesystem
mount /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /mnt

Then we can proceed to copy a bunch of data to the filesystem so that there
will be some blocks used.

Now we can see how/what happens in the case of a single bad disk. Let's say
the disk has gone bad: it is still possible to read from the disk, but all
writes fail with a medium error. Perhaps this is similar to the case of a
cheap disk that has completely run out of blocks to reallocate to?

===
# make all writes to the third disk fail with write error.
# 3      - MEDIUM ERROR
# 0x0c02 - WRITE ERROR AUTOREALLOCATION FAILED
#
./usr/tgtadm --mode error --op new --tid 1 --lun 3 --error op=WRITE10,lba=0,len=,pct=100,pause=0,repeat=0,action=CHECK_CONDITION,key=3,asc=0x0c02

# To show all current error injects:
# ./usr/tgtadm --mode error --op show
#
# To delete/clear all current error injects:
# ./usr/tgtadm --mode error --op delete
===

If you now know that this disk has gone bad, you could try to delete the
device:

btrfs device delete /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3 /mnt

but this will probably not work, since at least up to semi-recent versions
of btrfs you cannot remove a device from the filesystem UNLESS you can also
write to the device. Thus making it impossible to remove the bad device in
any way other than physically removing it.
This is suboptimal from a data integrity point of view, since if the disk
is readable it can potentially still contain valid copies of data that
might have silent errors on the other mirror.

At some stage, from a data integrity and data robustness standpoint, it
would be nice to be able to 'device delete' a device that is readable, and
contains a valid copy of the data, but is still unwriteable.

There is a bunch of other things you can test and emulate with this too.

I have only tested this with semi-recent versions of btrfs and not the
latest version. I will wait until the current versions of btrfs become more
stable/robust before I start experimenting with it.

Since I think this could be invaluable for a filesystem developer, please
have a look. I am more than happy to add additional features that would
make it even more useful for filesystem error-path-and-recovery testing.

regards
ronnie sahlberg
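[Editor's note: for contrast, when a device is truly absent (rather than present but unwritable, as in the scenario above), current btrfs does offer a path out. A rough sketch, with device paths only as examples:]

# mount the remaining devices read-write in degraded mode
mount -o degraded /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /mnt

# then drop the absent device from the filesystem
btrfs device delete missing /mnt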
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 10:46 AM, ronnie sahlberg wrote:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>>> DISK LIMITED.
>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he says
>>> he has.
>>
>> Thanks for the additional numbers supporting my point. =:^)
>>
>> I had run some of the numbers but not to the extent you just did, so I
>> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
>> range of expectation for spinning rust, given the current state of
>> optimization... or more accurately the lack thereof, due to the focus
>> still being on features.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the filesystem
> layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about
replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10,
those are further broken down into 64KB strips. So the smallest "unit" for
replication during a rebuild on Btrfs would be 64KB.

Anyway, 5MB/s seems really low to me, so I'm suspicious something else is
going on. I haven't done a rebuild in a couple months, but my recollection
is it's always been as fast as the write performance of a single device in
the btrfs volume.

I'd be looking in dmesg for any of the physical drives being reset, or
having read or write errors, and I'd do some individual drive testing to
see if the problem can be isolated. And if that's not helpful, well, it's
really tedious and produces verbose amounts of information, but something
that might reveal the issue is to capture the actual commands going to the
physical devices:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html

My expectation (i.e. I'm guessing) based on previous testing is that
whether raid1 or raid10, the actual read/write commands will each be 256KB
in size. Btrfs rebuild is basically designed to be a sequential operation.
This could maybe fall apart if there were somehow many minimally full
chunks, which is probably unlikely.

Chris Murphy
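[Editor's note: a few of the quick per-drive checks suggested above could look like this; drive names are placeholders and smartctl comes from smartmontools.]

# look for resets, link errors or I/O errors on the physical drives
dmesg | grep -iE 'ata[0-9]|reset|i/o error'

# per-drive health, error counters and self-test logs
smartctl -a /dev/sdb
smartctl -a /dev/sdc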
Re: `btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.` in `btrfsck`
Hi, I could `btrfsck --repair` the sparse file with Linux 3.15.6-utopic from http://kernel.ubuntu.com/~kernel-ppa/mainline/ and btrfsck 3.12-1 (from btrfs-tools package in Ubuntu 14.04). Thanks for your hints, Wang! All the best, Karl Am 18.07.2014 14:13, schrieb Wang Shilong: > > Hi, > > There are some patches for fsck flighting, they are integrated in David's > branches. > You can pull from David's latest branch, and see if it helps: > > https://github.com/kdave/btrfs-progs integration-20140704 > > Have a try and see if it helps anyway. > > Thanks, > Wang > >> Hi together, >> I'm experiencing the following issues when I invoke `btrfsck` on a >> sparse file image with a GPT and one (the only) btrfs partition attached >> to a loop device >> >>$ sudo btrfsck --repair --init-csum-tree --init-extent-tree -b >> /dev/loop0p1 >>Incorrect local backref count on 128510738432 root 5 owner 3849475 >> offset 0 found 1 wanted 0 back 0xbab41270 >>backpointer mismatch on [128510738432 4096] >>ref mismatch on [128510742528 12288] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair --init-csum-tree --init-extent-tree /dev/loop0p1 >>Incorrect local backref count on 128510726144 root 5 owner 3849470 >> offset 0 found 1 wanted 0 back 0xbbcb9500 >>backpointer mismatch on [128510726144 12288] >>ref mismatch on [128510738432 4096] extent item 0, found 1 >>adding new data backref on 128510738432 root 5 owner 3849475 offset >> 0 found 1 >>Backref 128510738432 root 5 owner 3849475 offset 0 num_refs 0 not >> found in extent tree >>Incorrect local backref count on 128510738432 root 5 owner 3849475 >> offset 0 found 1 wanted 0 back 0xbbcb9630 >>backpointer mismatch on [128510738432 4096] >>ref mismatch on [128510742528 12288] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair /dev/loop0p1 >>Incorrect local backref count on 130861096960 root 5 owner 22733727 >> offset 0 found 1 wanted 0 back 0xc7c7d170 >>backpointer mismatch on [130861096960 8192] >>ref mismatch on [130861105152 8192] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >>$ sudo btrfsck --repair /dev/loop0p1 >>Backref 130861096960 root 5 owner 22733727 offset 0 num_refs 0 not >> found in extent tree >>Incorrect local backref count on 130861096960 root 5 owner 22733727 >> offset 0 found 1 wanted 0 back 0xc7f31170 >>backpointer mismatch on [130861096960 8192] >>ref mismatch on [130861105152 8192] extent item 0, found 1 >>btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags >> & 1)' failed. >> >> I'm using `btrfs-progs` 24cf4d8c3ee924b474f68514e0167cc2e602a48d on >> Linux 3.16-rc5 (anything else, i.e. older versions, give me immediate >> error after start because errornous file system) >> >> I'd like to know whether this (assertion) error is related to a bug or >> missing feature in btrfs-progs and might be fixed at some point or >> whether this might indicate a completely messed up btrfs. >> >> Best regards, >> Karl-P. Richter >> > signature.asc Description: OpenPGP digital signature
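[Editor's note: for anyone wanting to try the integration branch Wang points to, a rough sketch of building and running it, assuming the usual btrfs-progs build dependencies are installed; the loop device is the one from this report.]

git clone https://github.com/kdave/btrfs-progs.git
cd btrfs-progs
git checkout integration-20140704
make

# run the freshly built fsck against the affected partition
sudo ./btrfsck --repair /dev/loop0p1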
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>
>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>> DISK LIMITED.
>
>> The 5MB/S that TM is seeing is fine, considering the small files he says
>> he has.
>
> Thanks for the additional numbers supporting my point. =:^)
>
> I had run some of the numbers but not to the extent you just did, so I
> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
> range of expectation for spinning rust, given the current state of
> optimization... or more accurately the lack thereof, due to the focus
> still being on features.

That is actually nonsense.

Raid rebuild operates on the block/stripe layer and not on the filesystem
layer. It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing a
linear read of the whole spindle using huge i/o sizes, or, if you have
multiple spindles on the same bus, the bus saturation speed.

Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
when doing a raid rebuild.

That is for the naive rebuild that rebuilds every single stripe. A smarter
rebuild that knows which stripes are unused can skip the unused stripes and
thus become even faster than that.

Now, that the rebuild is off by an order of magnitude is by design, but it
should be fixed at some stage; with the current state of btrfs it is
probably better to focus on other, more urgent areas first.
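[Editor's note: to put the "~50 MByte/second per spindle" figure in context, the raw linear-read ceiling of a single drive can be measured directly. A quick sketch; the device name is just an example.]

# rough sequential-read ceiling for one spindle
hdparm -t /dev/sdb

# or, read 2 GiB with direct I/O to bypass the page cache
dd if=/dev/sdb of=/dev/null bs=1M count=2048 iflag=direct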
Q: BTRFS_IOC_DEFRAG_RANGE and START_IO
I am working on readahead in systemd and am trying to complete a todo item
for it. One of the todos is:

readahead: use BTRFS_IOC_DEFRAG_RANGE instead of the BTRFS_IOC_DEFRAG
ioctl, with START_IO

Can someone explain what the start_io flag in BTRFS_IOC_DEFRAG_RANGE does?
Does it just force the data to be written out after defragmentation, or
does it do something else? Does this flag mean that btrfs can guarantee
data consistency after defragmentation?

Thanks for any explanation!

-- 
Best regards,
Timofey.
Re: 1 week to rebuild 4x 3TB raid10 is a long time!
Wang Shilong cn.fujitsu.com> writes:

> Just my two cents:
>
> Since 'btrfs replace' supports RAID10, I suppose using the replace
> operation is better than 'device removal and add'.
>
> Another Question is related to btrfs snapshot-aware balance.
> How many snapshots did you have in your system?
>
> Of course, during balance/resize/device removal operations, you could
> still snapshot, but fewer snapshots should speed things up!
>
> Anyway 'btrfs replace' is implemented more efficiently than
> 'device removal and add'.

Hi Wang, just one subvolume, no snapshots or anything else.

device replace: to tell you the truth I have not used it in the past. Most
of my testing was done 2 years ago. So in this 'kind of production' system
I did not try it. But if I knew that it was faster, perhaps I could have
used it. Does anyone have statistics for such a replace and the time it
takes?

Also, can replace be used when one device is missing? Can't find
documentation.
eg. btrfs replace start missing /dev/sdXX

TM
Re: ENOSPC errors during balance
Am Sun, 20 Jul 2014 21:44:40 +0200 schrieb Marc Joliet : [...] > What I did: > > - delete the single largest file on the file system, a 12 GB VM image, along > with all subvolumes that contained it > - rsync it over again [...] I want to point out at this point, though, that doing those two steps freed a disproportionate amount of space. The image file is only 12 GB, and it hadn't changed in any of the snapshots (I haven't used this VM since June), so that "subvolume delete -c " returned after a few seconds. Yet deleting it seems to have freed up twice as much. You can see this from the "filesystem df" output: before, "used" was at 229.04 GiB, and after deleting it and copying it back (and after a day's worth of backups) went down to 218 GiB. Does anyone have any idea how this happened? Actually, now I remember something that is probably related: when I first moved to my current backup scheme last week, I first copied the data from the last rsnapshot based backup with "cp --reflink" to the new backup location, but forgot to use "-a". I interrupted it and ran "cp -a -u --reflink", but it had already copied a lot, and I was too impatient to start over; after all, the data hadn't changed. Then, when rsync (with --inplace) ran for the first time, all of these files with wrong permissions and different time stamps were copied over, but for some reason, the space used increased *greatly*; *much* more than I would expect from changed metadata. The total size of the file system data should be around 142 GB (+ snapshots), but, well, it's more than 1.5 times as much. Perhaps cp --reflink treats hard links differently than expected? I would have expected the data pointed to by the hard link to have been referenced, but maybe something else happened? -- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup signature.asc Description: PGP signature
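[Editor's note: for reference, the corrected reflink copy described above would normally look like the sketch below; whether "used" really stays flat can be checked with filesystem df before and after. Paths are placeholders.]

# reflink-aware copy that also preserves ownership, permissions and timestamps
cp -a --reflink=always /backup/rsnapshot/daily.0/ /backup/current/

# compare "used" before and after; a pure reflink copy should barely change it
btrfs filesystem df /backup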
Re: ENOSPC errors during balance
On 20/07/14 14:59, Duncan wrote: Marc Joliet posted on Sun, 20 Jul 2014 12:22:33 +0200 as excerpted: On the other hand, the wiki [0] says that defragmentation (and balancing) is optional, and the only reason stated for doing either is because they "will have impact on performance". Yes. That's what threw off the other guy as well. He decided to skip it for the same reason. If I had a wiki account I'd change it, but for whatever reason I tend to be far more comfortable writing list replies, sometimes repeatedly, than writing anything on the web, which I tend to treat as read-only. So I've never gotten a wiki account and thus haven't changed it, and apparently the other guy with the problem and anyone else that knows hasn't changed it either, so the conversion page still continues to underemphasize the importance of completing the conversion steps, including the defrag, in proper order. I've inserted information specific to this in the wiki. Others with wiki accounts, feel free to review: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3#Before_first_use -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
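[Editor's note: for readers landing here from the wiki discussion, the post-conversion steps being referred to are roughly the following sketch; device and mount point are placeholders, and the wiki page itself is the authoritative ordering.]

btrfs-convert /dev/sdXN
mount /dev/sdXN /mnt

# once satisfied with the result, drop the saved ext* image,
# then defragment and balance to finish the conversion
btrfs subvolume delete /mnt/ext2_saved
btrfs filesystem defragment -r /mnt
btrfs balance start /mnt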
[PATCH] btrfs: Add show_path function for btrfs_super_ops.
show_path() function in struct super_operations is used to output subtree mount info for mountinfo. Without the implement of show_path() function, user can not found where each subvolume is mounted if using 'subvolid=' mount option. (When mounted with 'subvol=' mount option, vfs is aware of subtree mount and can to the path resolve by vfs itself) With this patch, end users will be able to use findmnt(8) or other programs reading mountinfo to find which btrfs subvolume is mounted. Though we use fs_info->subvol_sem to protect show_path() from subvolume destroying/creating, if user renames/moves the parent non-subvolume dir of a subvolume, it is still possible that concurrency may happen and cause btrfs_search_slot() fails to find the desired key. In that case, we just return -EBUSY and info user to try again since extra locking like locking the whole subvolume tree is too expensive for such usage. Reported-by: Stefan G.Weichinger Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 2 + fs/btrfs/ioctl.c | 4 +- fs/btrfs/super.c | 112 +++ 3 files changed, 116 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index be91397..63fba05 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3881,6 +3881,8 @@ void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, struct btrfs_ioctl_balance_args *bargs); +int btrfs_search_path_in_tree(struct btrfs_fs_info *info, + u64 tree_id, u64 dirid, char *name); /* file.c */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 47aceb4..c2bd6b5 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2218,8 +2218,8 @@ static noinline int btrfs_ioctl_tree_search_v2(struct file *file, * Search INODE_REFs to identify path name of 'dirid' directory * in a 'tree_id' tree. and sets path name to 'name'. 
*/ -static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info, - u64 tree_id, u64 dirid, char *name) +int btrfs_search_path_in_tree(struct btrfs_fs_info *info, + u64 tree_id, u64 dirid, char *name) { struct btrfs_root *root; struct btrfs_key key; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8e16bca..b5ece81 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1831,6 +1831,117 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root) return 0; } +static char *str_prepend(char *dest, char *src) +{ + memmove(dest + strlen(src), dest, strlen(dest) + 1); + memcpy(dest, src, strlen(src)); + return dest; +} + +static int alloc_mem_if_needed(char **dest, char *src, int *len) +{ + char *tmp; + + if (unlikely(strlen(*dest) + strlen(src) > *len)) { + *len *= 2; + tmp = krealloc(*dest, *len, GFP_NOFS); + if (!tmp) { + return -ENOMEM; + } + *dest = tmp; + } + return 0; +} + +static int btrfs_show_path(struct seq_file *m, struct dentry *mount_root) +{ + struct inode *inode = mount_root->d_inode; + struct btrfs_root *subv_root = BTRFS_I(inode)->root; + struct btrfs_fs_info *fs_info = subv_root->fs_info; + struct btrfs_root *tree_root = fs_info->tree_root; + struct btrfs_root_ref *ref; + struct btrfs_key key; + struct btrfs_key found_key; + struct btrfs_path *path = NULL; + char *name = NULL; + char *buf = NULL; + int ret = 0; + int len; + u64 dirid = 0; + u16 namelen; + + name = kmalloc(PAGE_SIZE, GFP_NOFS); + len = PAGE_SIZE; + buf = kmalloc(BTRFS_INO_LOOKUP_PATH_MAX, GFP_NOFS); + path = btrfs_alloc_path(); + if (!name || !buf || !path) { + ret = -ENOMEM; + goto out_free; + } + *name = '/'; + *(name + 1) = '\0'; + + key.objectid = subv_root->root_key.objectid; + key.type = BTRFS_ROOT_BACKREF_KEY; + key.offset = 0; + down_read(&fs_info->subvol_sem); + while (key.objectid != BTRFS_FS_TREE_OBJECTID) { + ret = btrfs_search_slot_for_read(tree_root, &key, path, 1, 1); + if (ret < 0) + goto out; + if (ret) { + ret = -ENOENT; + goto out; + } + btrfs_item_key_to_cpu(path->nodes[0], &found_key, + path->slots[0]); + if (found_key.objectid != key.objectid || + found_key.type != BTRFS_ROOT_BACKREF_KEY) { + ret = -ENOENT; + goto out; + } + /* append the subvol name first */ +
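[Editor's note: once a kernel with this show_path() implementation is running, the effect should be visible to userspace roughly as below; the subvolume id, device, and mount point are only examples.]

mount -o subvolid=257 /dev/sdb1 /mnt/data

# findmnt can now show which subvolume backs each btrfs mount
findmnt -t btrfs -o TARGET,SOURCE,FSTYPE,OPTIONS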
[PATCH] btrfs-progs: check if there is required kernel send stream version
When kernel does not have the send stream version 2 patches, the btrfs send with --stream-version 2 would fail with out giving the details what is wrong. This patch will help to identify correctly that required kernel patches are missing. Signed-off-by: Anand Jain --- cmds-send.c | 13 + send.h | 2 ++ utils.c | 17 + utils.h | 1 + 4 files changed, 33 insertions(+) diff --git a/cmds-send.c b/cmds-send.c index 9a73b32..0c20a6f 100644 --- a/cmds-send.c +++ b/cmds-send.c @@ -435,6 +435,7 @@ int cmd_send(int argc, char **argv) u64 parent_root_id = 0; int full_send = 1; int new_end_cmd_semantic = 0; + int k_sstream; memset(&send, 0, sizeof(send)); send.dump_fd = fileno(stdout); @@ -544,6 +545,18 @@ int cmd_send(int argc, char **argv) ret = 1; goto out; } + + /* check if btrfs kernel supports send stream ver 2 */ + if (g_stream_version > BTRFS_SEND_STREAM_VERSION_1) { + k_sstream = btrfs_read_sysfs(BTRFS_SEND_STREAM_VER_PATH); + if (k_sstream < g_stream_version) { + fprintf(stderr, + "ERROR: Need btrfs kernel send stream version %d or above, %d\n", + BTRFS_SEND_STREAM_VERSION_2, k_sstream); + ret = 1; + goto out; + } + } break; case 's': g_total_data_size = 1; diff --git a/send.h b/send.h index ea56965..d7a171b 100644 --- a/send.h +++ b/send.h @@ -24,6 +24,8 @@ extern "C" { #endif #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream" +#define BTRFS_SEND_STREAM_VER_PATH "/sys/fs/btrfs/send/stream_version" + #define BTRFS_SEND_STREAM_VERSION_1 1 #define BTRFS_SEND_STREAM_VERSION_2 2 /* Max supported stream version. */ diff --git a/utils.c b/utils.c index e144dfd..e3d4fa2 100644 --- a/utils.c +++ b/utils.c @@ -2681,3 +2681,20 @@ int fsid_to_mntpt(__u8 *fsid, char *mntpt, int *mnt_cnt) return ret; } + +int btrfs_read_sysfs(char path[PATH_MAX]) +{ + int fd; + char val; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + if (read(fd, &val, sizeof(char)) < sizeof(char)) { + close(fd); + return -EINVAL; + } + close(fd); + return atoi((const char *)&val); +} diff --git a/utils.h b/utils.h index ddf31cf..0c9b65f 100644 --- a/utils.h +++ b/utils.h @@ -153,5 +153,6 @@ static inline u64 btrfs_min_dev_size(u32 leafsize) return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE + btrfs_min_global_blk_rsv_size(leafsize)); } +int btrfs_read_sysfs(char path[PATH_MAX]); #endif -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
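[Editor's note: with the corresponding out-of-tree kernel patches applied, the check this patch adds corresponds to the following manual steps; the sysfs path comes from the patch itself, and the snapshot path is a placeholder.]

# kernel advertises its supported send stream version here (per these patches)
cat /sys/fs/btrfs/send/stream_version

# only then should a version-2 send be attempted
btrfs send --stream-version 2 /mnt/snapshots/snap1 > snap1.send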
[PATCH] xfstest/btrfs: check for matching kernel send stream ver 2
The test case btrfs/049 is relevant to send stream version 2, and needs kernel patches as well. So call _notrun if there isn't matching kernel support as shown below btrfs/047[not run] Missing btrfs kernel patch for send stream version 2, skipped this test Not run: btrfs/047 Signed-off-by: Anand Jain --- common/rc | 5 + 1 file changed, 5 insertions(+) diff --git a/common/rc b/common/rc index 4a6511f..1c914bb 100644 --- a/common/rc +++ b/common/rc @@ -2223,6 +2223,11 @@ _require_btrfs_send_stream_version() if [ $? -ne 0 ]; then _notrun "Missing btrfs-progs send --stream-version command line option, skipped this test" fi + + # test if btrfs kernel supports send stream version 2 + if [ ! -f /sys/fs/btrfs/send/stream_version ]; then + _notrun "Missing btrfs kernel patch for send stream version 2, skipped this test" + fi } _require_btrfs_mkfs_feature() -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
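[Editor's note: for completeness, a sketch of exercising the affected tests once xfstests is configured; setting up TEST_DEV/SCRATCH_DEV in local.config is assumed and not shown.]

cd xfstests-dev
sudo ./check btrfs/047 btrfs/049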